cs.CL

[1] Large Language Models in the Travel Domain: An Industrial Experience

Sergio Di Meglio,Aniello Somma,Luigi Libero Lucio Starace,Fabio Scippacercola,Giancarlo Sperlì,Sergio Di Martino

Main category: cs.CL

TL;DR: This study explores the use of LLMs (Mistral 7B and Mixtral 8x7B) on a hotel booking platform, finding that Mixtral 8x7B offers higher-quality output but at a much higher computational cost.

Details Motivation: Accommodation data from third-party sources is often inconsistent and incomplete, which hurts user satisfaction and market retention on property booking platforms. Method: The study evaluates two LLMs, Mistral 7B (fine-tuned with QLoRA) and Mixtral 8x7B (used with a refined system prompt), on their ability to generate accurate and consistent accommodation descriptions on the CALEIDOHOTELS platform. Result: Mixtral 8x7B outperformed Mistral 7B in completeness (99.6% vs. 93%), precision (98.8% vs. 96%), and hallucination rate (1.2% vs. 4%), while producing shorter content (249 vs. 277 words on average). However, Mixtral required significantly more resources (50GB VRAM and $1.61/hour vs. 5GB and $0.16/hour). Conclusion: Mixtral 8x7B generates more consistent and concise accommodation descriptions, but at a higher computational cost, so the choice of model depends on the balance between quality and resource efficiency. Abstract: Online property booking platforms are widely used and rely heavily on consistent, up-to-date information about accommodation facilities, often sourced from third-party providers. However, these external data sources are frequently affected by incomplete or inconsistent details, which can frustrate users and result in a loss of market. In response to these challenges, we present an industrial case study involving the integration of Large Language Models (LLMs) into CALEIDOHOTELS, a property reservation platform developed by FERVENTO. We evaluate two well-known LLMs in this context: Mistral 7B, fine-tuned with QLoRA, and Mixtral 8x7B, utilized with a refined system prompt. Both models were assessed based on their ability to generate consistent and homogeneous descriptions while minimizing hallucinations. Mixtral 8x7B outperformed Mistral 7B in terms of completeness (99.6% vs. 93%), precision (98.8% vs. 96%), and hallucination rate (1.2% vs. 4%), producing shorter, more concise content (249 vs. 277 words on average). However, this came at a significantly higher computational cost: 50GB VRAM and $1.61/hour versus 5GB and $0.16/hour for Mistral 7B. Our findings provide practical insights into the trade-offs between model quality and resource efficiency, offering guidance for deploying LLMs in production environments and demonstrating their effectiveness in enhancing the consistency and reliability of accommodation data.

[2] ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing

Jinzhi Wang,Qingke Peng,Haozhou Li,Zeyuan Zeng,Qinfeng Song,Kaixuan Yang,Jiangbo Zhang,Yaoying Wang,Ruimeng Li,Biyi Zhou

Main category: cs.CL

TL;DR: ElectriQ is introduced as the first benchmark for evaluating and improving LLMs in electric power marketing, showing that smaller, domain-enhanced models can outperform larger general models.

Details Motivation: Current electric power marketing systems struggle with response times, flexibility, and accuracy, while general LLMs lack domain expertise and empathy in this field. Method: The study introduces ElectriQ, a benchmark with a dialogue dataset and four evaluation metrics, incorporating a domain-specific knowledge base and a knowledge augmentation method. Result: Experiments on 13 LLMs show that smaller models like LLama3-8B can surpass GPT-4o in professionalism and user-friendliness when fine-tuned and augmented. Conclusion: ElectriQ provides a foundation for developing LLMs tailored to power marketing services, showing that smaller models can outperform larger ones when fine-tuned and knowledge-augmented. Abstract: Electric power marketing customer service plays a critical role in addressing inquiries, complaints, and service requests. However, current systems, such as China's 95598 hotline, often struggle with slow response times, inflexible procedures, and limited accuracy in domain-specific tasks. While large language models (LLMs) like GPT-4o and Claude 3 demonstrate strong general capabilities, they lack the domain expertise and empathy required in this field. To bridge this gap, we introduce ElectriQ, the first benchmark designed to evaluate and enhance LLMs in electric power marketing scenarios. ElectriQ consists of a dialogue dataset covering six key service categories and introduces four evaluation metrics: professionalism, popularity, readability, and user-friendliness. We further incorporate a domain-specific knowledge base and propose a knowledge augmentation method to boost model performance. Experiments on 13 LLMs reveal that smaller models such as LLama3-8B, when fine-tuned and augmented, can surpass GPT-4o in terms of professionalism and user-friendliness. ElectriQ establishes a comprehensive foundation for developing LLMs tailored to the needs of power marketing services.

[3] A Language Model-Driven Semi-Supervised Ensemble Framework for Illicit Market Detection Across Deep/Dark Web and Social Platforms

Navid Yazdanjue,Morteza Rakhshaninejad,Hossein Yazdanjouei,Mohammad Sadegh Khorshidi,Mikko S. Niemela,Fang Chen,Amir H. Gandomi

Main category: cs.CL

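TL;DR: This paper proposes a hierarchical classification framework for detecting and classifying illicit marketplace content and demonstrates its superior performance on multiple datasets.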

Details Motivation: Illicit marketplaces are increasingly moving to concealed parts of the internet, and detecting and classifying such content remains challenging. Method: The paper proposes a hierarchical classification framework that combines fine-tuned language models with a semi-supervised ensemble learning strategy to detect and classify illicit marketplace content. Result: Experiments on three datasets show that the model outperforms several baseline models. Conclusion: The results indicate that the framework outperforms existing models in accuracy, F1-score, and TMCC, demonstrating its effectiveness and robustness in illicit content detection. Abstract: Illegal marketplaces have increasingly shifted to concealed parts of the internet, including the deep and dark web, as well as platforms such as Telegram, Reddit, and Pastebin. These channels enable the anonymous trade of illicit goods including drugs, weapons, and stolen credentials. Detecting and categorizing such content remains challenging due to limited labeled data, the evolving nature of illicit language, and the structural heterogeneity of online sources. This paper presents a hierarchical classification framework that combines fine-tuned language models with a semi-supervised ensemble learning strategy to detect and classify illicit marketplace content across diverse platforms. We extract semantic representations using ModernBERT, a transformer model for long documents, fine-tuned on domain-specific data from deep and dark web pages, Telegram channels, Subreddits, and Pastebin pastes to capture specialized jargon and ambiguous linguistic patterns. In addition, we incorporate manually engineered features such as document structure, embedded patterns including Bitcoin addresses, emails, and IPs, and metadata, which complement language model embeddings. The classification pipeline operates in two stages. The first stage uses a semi-supervised ensemble of XGBoost, Random Forest, and SVM with entropy-based weighted voting to detect sales-related documents. The second stage further classifies these into drug, weapon, or credential sales. Experiments on three datasets, including our multi-source corpus, DUTA, and CoDA, show that our model outperforms several baselines, including BERT, ModernBERT, DarkBERT, ALBERT, Longformer, and BigBird. The model achieves an accuracy of 0.96489, an F1-score of 0.93467, and a TMCC of 0.95388, demonstrating strong generalization, robustness under limited supervision, and effectiveness in real-world illicit content detection.
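
The abstract above mentions entropy-based weighted voting but does not give the weighting formula. The following is a minimal sketch of one plausible reading, not the authors' method: each classifier's class distribution is weighted by one minus its normalized Shannon entropy, so confident classifiers count for more. The function names and the weighting rule are assumptions made for illustration.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_weighted_vote(distributions):
    """Combine per-classifier class distributions, favouring confident ones.

    Assumed rule: weight = 1 - H(p) / log(K), so a uniform prediction gets
    weight 0 and a one-hot prediction gets weight 1.
    """
    num_classes = len(distributions[0])
    max_entropy = math.log(num_classes)
    weights = [1.0 - entropy(p) / max_entropy for p in distributions]
    total = sum(weights)
    if total == 0:
        # Every classifier is maximally uncertain: fall back to a plain average.
        weights = [1.0] * len(distributions)
        total = float(len(distributions))
    combined = [
        sum(w * p[k] for w, p in zip(weights, distributions)) / total
        for k in range(num_classes)
    ]
    return combined.index(max(combined)), combined


# Three hypothetical classifiers voting on "sales-related" (class 0) vs. not (class 1).
label, combined = entropy_weighted_vote([[0.9, 0.1], [0.4, 0.6], [0.5, 0.5]])
```

Here the confident first classifier dominates the two uncertain ones, so the combined vote picks class 0 even though a simple majority of hard labels would be less clear-cut.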

[4] A Hybrid Framework for Subject Analysis: Integrating Embedding-Based Regression Models with Large Language Models

Jinyu Liu,Xiaoying Song,Diana Zhang,Jason Thomale,Daqing He,Lingzi Hong

Main category: cs.CL

TL;DR: A hybrid framework combining ML models and LLMs improves the accuracy and reliability of subject term prediction by guiding LLM generations and reducing hallucinations.

Details Motivation: LLMs are underexplored for subject analysis, while traditional ML models struggle with unseen cases, necessitating a more reliable approach. Method: A hybrid framework was developed that uses ML models to predict the number of LCSH labels and post-edit LLM predictions to align with actual LCSH terms. Result: The hybrid approach resulted in more controlled and vocabulary-aligned subject term predictions compared to LLMs alone. Conclusion: The hybrid framework combining ML models and LLMs improves subject analysis by controlling LLM outputs and reducing hallucinations. Abstract: Providing subject access to information resources is an essential function of any library management system. Large language models (LLMs) have been widely used in classification and summarization tasks, but their capability to perform subject analysis is underexplored. Multi-label classification with traditional machine learning (ML) models has been used for subject analysis but struggles with unseen cases. LLMs offer an alternative but often over-generate and hallucinate. Therefore, we propose a hybrid framework that integrates embedding-based ML models with LLMs. This approach uses ML models to (1) predict the optimal number of LCSH labels to guide LLM predictions and (2) post-edit the predicted terms with actual LCSH terms to mitigate hallucinations. We experimented with LLMs and the hybrid framework to predict the subject terms of books using the Library of Congress Subject Headings (LCSH). Experiment results show that providing initial predictions to guide LLM generations and imposing post-edits result in more controlled and vocabulary-aligned outputs.
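
The two steps described above (capping the number of labels and post-editing terms against the real vocabulary) can be sketched as follows. This is an illustrative stand-in rather than the paper's pipeline: it replaces the embedding-based ML models with plain string similarity from Python's `difflib`, and `post_edit_terms`, `predicted_count`, and the sample vocabulary are all hypothetical.

```python
from difflib import get_close_matches


def post_edit_terms(llm_terms, vocabulary, predicted_count, cutoff=0.6):
    """Map free-text LLM terms onto a controlled vocabulary and cap their number.

    llm_terms:       terms generated by an LLM, possibly hallucinated
    vocabulary:      the allowed subject headings
    predicted_count: label count predicted by a separate model (assumed given)
    """
    aligned = []
    for term in llm_terms:
        # Keep only the closest vocabulary entry; drop terms with no close match.
        match = get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
        if match and match[0] not in aligned:
            aligned.append(match[0])
    return aligned[:predicted_count]


vocabulary = ["Machine learning", "Natural language processing", "Library science"]
llm_terms = ["machine learning", "Natural-language processing", "Xylophones", "Library sciences"]
terms = post_edit_terms(llm_terms, vocabulary, predicted_count=2)
```

Terms with no sufficiently close vocabulary entry are dropped, which is how this sketch limits hallucinated headings; string similarity is a crude proxy and would miss synonyms that an embedding model could catch.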

[5] Full Triple Matcher: Integrating all triple elements between heterogeneous Knowledge Graphs

Victor Eiti Yamamoto,Hideaki Takeda

Main category: cs.CL

TL;DR: This paper proposes a novel method for integrating knowledge graphs by combining label and triple matching techniques, addressing the underexplored challenge of context matching.

Details Motivation: Context matching in knowledge graphs remains underexplored despite its importance in integrating diverse and complex real-world KGs that vary in source, size, and information density. Method: The method involves two main steps: label matching using string manipulation, fuzzy matching, and vector similarity; and triple matching to identify comparable information mappings. Result: The approach achieves high accuracy across diverse test cases and performs competitively in the OAEI competition compared to supervised methods. Conclusion: The proposed KG integration method effectively enhances entity-matching accuracy by incorporating label and triple matching, showing competitive performance compared to leading systems. Abstract: Knowledge graphs (KGs) are powerful tools for representing and reasoning over structured information. Their main components include schema, identity, and context. While schema and identity matching are well-established in ontology and entity matching research, context matching remains largely unexplored. This is particularly important because real-world KGs often vary significantly in source, size, and information density - factors not typically represented in the datasets on which current entity matching methods are evaluated. As a result, existing approaches may fall short in scenarios where diverse and complex contexts need to be integrated. To address this gap, we propose a novel KG integration method consisting of label matching and triple matching. We use string manipulation, fuzzy matching, and vector similarity techniques to align entity and predicate labels. Next, we identify mappings between triples that convey comparable information, using these mappings to improve entity-matching accuracy. Our approach demonstrates competitive performance compared to leading systems in the OAEI competition and against supervised methods, achieving high accuracy across diverse test cases. 
Additionally, we introduce a new dataset derived from the benchmark dataset to evaluate the triple-matching step more comprehensively.

[6] Theoretical Foundations and Mitigation of Hallucination in Large Language Models

Esmail Gumaan

Main category: cs.CL

TL;DR: This paper provides a comprehensive theoretical and practical framework for understanding and addressing hallucinations in Large Language Models, including definitions, detection strategies, mitigation techniques, and evaluation protocols.

Details Motivation: Hallucinations in Large Language Models (LLMs) pose a critical challenge by generating content that is not faithful to inputs or real-world facts. This work aims to provide a rigorous theoretical foundation and practical strategies to detect, mitigate, and evaluate hallucinations in LLMs. Method: The paper uses learning-theoretic frameworks (PAC-Bayes and Rademacher complexity) to derive bounds on hallucination risk and surveys existing detection and mitigation strategies. It also proposes a unified workflow and evaluation protocols. Result: The paper formally defines hallucination risk, distinguishes between intrinsic and extrinsic hallucinations, derives theoretical bounds on risk, surveys detection and mitigation techniques, and proposes a unified workflow and evaluation protocols for hallucination management. Conclusion: This paper concludes that hallucination in LLMs is a significant challenge that can be theoretically analyzed and practically addressed through a combination of detection and mitigation strategies. It emphasizes the importance of rigorous evaluation and proposes a unified workflow for managing hallucinations. Abstract: Hallucination in Large Language Models (LLMs) refers to the generation of content that is not faithful to the input or the real-world facts. This paper provides a rigorous treatment of hallucination in LLMs, including formal definitions and theoretical analyses. We distinguish between intrinsic and extrinsic hallucinations, and define a \textit{hallucination risk} for models. We derive bounds on this risk using learning-theoretic frameworks (PAC-Bayes and Rademacher complexity). We then survey detection strategies for hallucinations, such as token-level uncertainty estimation, confidence calibration, and attention alignment checks. 
On the mitigation side, we discuss approaches including retrieval-augmented generation, hallucination-aware fine-tuning, logit calibration, and the incorporation of fact-verification modules. We propose a unified detection and mitigation workflow, illustrated with a diagram, to integrate these strategies. Finally, we outline evaluation protocols for hallucination, recommending datasets, metrics, and experimental setups to quantify and reduce hallucinations. Our work lays a theoretical foundation and practical guidelines for addressing the crucial challenge of hallucination in LLMs.

[7] Reading Between the Timelines: RAG for Answering Diachronic Questions

Kwun Hang Lau,Ruiyuan Zhang,Weijie Shi,Xiaofang Zhou,Xiaojun Cheng

Main category: cs.CL

TL;DR: This paper proposes a new RAG framework infused with temporal logic to better handle longitudinal queries, yielding significant accuracy improvements on a new benchmark called ADQAB.

Details Motivation: RAG systems struggle with longitudinal queries requiring tracking over time due to limitations in gathering temporally coherent evidence. Method: The methodology involves disentangling user queries into core subjects and temporal windows, employing a retriever that balances semantic and temporal relevance. Result: Empirical results on the ADQAB benchmark show that the approach improves answer accuracy by 13% to 27% compared to standard RAG implementations. Conclusion: The proposed framework enhances RAG systems by infusing temporal logic, enabling nuanced evolutionary analysis for complex real-world questions. Abstract: While Retrieval-Augmented Generation (RAG) excels at injecting static, factual knowledge into Large Language Models (LLMs), it exhibits a critical deficit in handling longitudinal queries that require tracking entities and phenomena across time. This blind spot arises because conventional, semantically-driven retrieval methods are not equipped to gather evidence that is both topically relevant and temporally coherent for a specified duration. We address this challenge by proposing a new framework that fundamentally redesigns the RAG pipeline to infuse temporal logic. Our methodology begins by disentangling a user's query into its core subject and its temporal window. It then employs a specialized retriever that calibrates semantic matching against temporal relevance, ensuring the collection of a contiguous evidence set that spans the entire queried period. To enable rigorous evaluation of this capability, we also introduce the Analytical Diachronic Question Answering Benchmark (ADQAB), a challenging evaluation suite grounded in a hybrid corpus of real and synthetic financial news. Empirical results on ADQAB show that our approach yields substantial gains in answer accuracy, surpassing standard RAG implementations by 13% to 27%. 
This work provides a validated pathway toward RAG systems capable of performing the nuanced, evolutionary analysis required for complex, real-world questions. The dataset and code for this study are publicly available at https://github.com/kwunhang/TA-RAG.
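
The abstract above does not specify how semantic matching is calibrated against temporal relevance. As a rough sketch under stated assumptions, one simple option is a linear blend of a semantic score and a window-based temporal score. The `Doc` type, the `alpha` parameter, and the decay rule below are hypothetical and are not taken from the TA-RAG repository.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    text: str
    published: date
    semantic_score: float  # assumed to come from any embedding retriever, in [0, 1]


def temporal_score(published, start, end):
    """1.0 inside the queried window, decaying with distance in days outside it."""
    if start <= published <= end:
        return 1.0
    gap = (start - published).days if published < start else (published - end).days
    return 1.0 / (1.0 + gap)


def retrieve(docs, start, end, alpha=0.5, top_k=3):
    """Rank by a blend of semantic and temporal relevance, then return in time order."""
    ranked = sorted(
        docs,
        key=lambda d: alpha * d.semantic_score
        + (1 - alpha) * temporal_score(d.published, start, end),
        reverse=True,
    )
    return sorted(ranked[:top_k], key=lambda d: d.published)


docs = [
    Doc("A", date(2020, 1, 15), 0.6),
    Doc("B", date(2020, 6, 1), 0.5),
    Doc("C", date(2015, 1, 1), 0.9),  # topically strong but outside the window
    Doc("D", date(2020, 12, 1), 0.4),
]
evidence = retrieve(docs, date(2020, 1, 1), date(2020, 12, 31))
```

Note that this blend only favours in-window documents; it does not by itself guarantee the contiguous coverage of the whole queried period that the abstract describes.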

[8] Semantic Convergence: Investigating Shared Representations Across Scaled LLMs

Daniel Son,Sanjana Rathore,Andrew Rufail,Adrian Simon,Daniel Zhang,Soham Dave,Cole Blondin,Kevin Zhu,Sean O'Brien

Main category: cs.CL

TL;DR: This study finds that despite size differences, Gemma-2 language models show feature universality, particularly in middle layers, suggesting that large language models interpret the world in broadly similar ways.

Details Motivation: To understand if models with significant scale differences converge on similar internal concepts, which has implications for cross-model interpretability. Method: The study uses Sparse Autoencoder (SAE) dictionary-learning on residual-stream activations of Gemma-2-2B and Gemma-2-9B models. Feature alignment is performed via activation correlation, and the feature spaces are compared using SVCCA and RSA. Result: Middle layers of the models showed strong feature overlap, while early and late layers were less similar. Preliminary experiments also showed that semantically similar multi-token subspaces interact similarly with language models. Conclusion: Large language models, despite size differences, develop broadly similar and interpretable internal features, supporting the concept of feature universality. Abstract: We investigate feature universality in Gemma-2 language models (Gemma-2-2B and Gemma-2-9B), asking whether models with a four-fold difference in scale still converge on comparable internal concepts. Using the Sparse Autoencoder (SAE) dictionary-learning pipeline, we utilize SAEs on each model's residual-stream activations, align the resulting monosemantic features via activation correlation, and compare the matched feature spaces with SVCCA and RSA. Middle layers yield the strongest overlap, while early and late layers show far less similarity. Preliminary experiments extend the analysis from single tokens to multi-token subspaces, showing that semantically similar subspaces interact similarly with language models. These results strengthen the case that large language models carve the world into broadly similar, interpretable features despite size differences, reinforcing universality as a foundation for cross-model interpretability.

[9] A novel language model for predicting serious adverse event results in clinical trials from their prospective registrations

Qixuan Hu,Xumou Zhang,Jinman Kim,Florence Bourgeois,Adam G. Dunn

Main category: cs.CL

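TL;DR: The study develops models that predict clinical trial safety results from ClinicalTrials.gov registration information, using transfer learning with pretrained language models for feature extraction, downstream models for prediction, and a sliding window method to preserve the semantic representation of long texts. The models perform well at predicting SAE rates and proportions, with potential to improve trial design and flag discrepancies between expected and reported results.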

Details Motivation: With accurate estimates of expected safety results, clinical trials could be designed to avoid terminations and limit participants' exposure to unnecessary risks. Method: The study used 22,107 two-arm parallel interventional clinical trials from ClinicalTrials.gov with structured summary results to develop two prediction models: a classifier predicting whether the experimental arm has a higher SAE rate than the control arm, and a regression model predicting the proportion of SAEs in the control arm. A transfer learning approach with pretrained language models (e.g., ClinicalT5, BioBERT) was used for feature extraction, combined with downstream models for prediction. To preserve the semantic representation of long trial texts that exceed localised language model input limits, a sliding window method was developed for embedding extraction. Result: The best model (ClinicalT5+Transformer+MLP) achieved 77.6% AUC when predicting which trial arm has a higher proportion of patients with SAEs. When predicting the proportion of participants experiencing SAEs in the control arm, the same model achieved an RMSE of 18.6%. The sliding window approach consistently outperformed methods without it: across 12 classifiers the average absolute AUC increase was 2.00%, and across 12 regressors the average absolute RMSE reduction was 1.58%. Conclusion: The prediction models perform well on clinical trial safety results, particularly the ClinicalT5+Transformer+MLP model with the sliding window method. These findings suggest that predicting safety results from registration information has the potential to improve trial design and flag discrepancies between expected and reported safety results. Abstract: Objectives: With accurate estimates of expected safety results, clinical trials could be designed to avoid terminations and limit exposing participants to unnecessary risks. We evaluated methods for predicting serious adverse event (SAE) results in clinical trials using information only from their registrations prior to the trial. Material and Methods: We analysed 22,107 two-arm parallel interventional clinical trials from ClinicalTrials.gov with structured summary results. Two prediction models were developed: a classifier predicting whether the experimental arm will have higher SAE rates than the control arm (area under the receiver operating characteristic curve; AUC), and a regression model to predict the proportion of SAEs in control arms (root mean squared error; RMSE). A transfer learning approach using pretrained language models (e.g., ClinicalT5, BioBERT) was used for feature extraction, combined with a downstream model for prediction. To maintain semantic representation in long trial texts exceeding localised language model input limits, a sliding window method was developed for embedding extraction. Results: The best model (ClinicalT5+Transformer+MLP) had 77.6% AUC predicting which trial arm has a higher proportion of patients with SAEs. When predicting the proportion of participants experiencing SAEs in the control arm, the same model achieved an RMSE of 18.6%. The sliding window approach consistently outperformed methods without it. Across 12 classifiers, the average absolute AUC increase was 2.00%; across 12 regressors, the average absolute RMSE reduction was 1.58%. Discussion: Summary results data available at ClinicalTrials.gov remains underutilised. The potential to estimate results of trials before they start is an opportunity to improve trial design and flag discrepancies between expected and reported safety results.
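
The entry above describes a sliding window method for embedding texts longer than a model's input limit, without stating the window size, overlap, or how window embeddings are combined. The sketch below shows the general technique under assumptions: overlapping fixed-size windows and mean pooling of the window vectors. The `embed` callable stands in for any encoder and is hypothetical.

```python
def sliding_windows(tokens, window_size, stride):
    """Split a token list into overlapping windows that cover every token."""
    if window_size <= 0 or not 0 < stride <= window_size:
        raise ValueError("need window_size > 0 and 0 < stride <= window_size")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break
        start += stride
    return windows


def embed_long_text(tokens, embed, window_size=512, stride=256):
    """Embed each window with `embed` and average the window vectors (assumed pooling)."""
    vectors = [embed(w) for w in sliding_windows(tokens, window_size, stride)]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy encoder for illustration only: returns [window length, sum of token ids].
toy_embed = lambda window: [float(len(window)), float(sum(window))]
vector = embed_long_text(list(range(10)), toy_embed, window_size=4, stride=2)
```

Overlapping windows (a stride smaller than the window size) keep sentences that straddle a boundary intact in at least one window, at the cost of embedding some tokens twice.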

[10] Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey

Jindong Li,Yali Fu,Jiahong Liu,Linxiao Cao,Wei Ji,Menglin Yang,Irwin King,Ming-Hsuan Yang

Main category: cs.CL

TL;DR: This paper presents the first comprehensive survey of discrete tokenization methods for large language models, offering a taxonomy of techniques, analysis of challenges, and directions for future research.

Details Motivation: The rapid advancement of large language models has intensified the need for effective mechanisms to transform continuous multimodal data into discrete representations suitable for language-based processing, but there is a lack of comprehensive surveys examining VQ techniques in the context of LLM-based systems. Method: The paper presents a structured taxonomy and analysis of discrete tokenization methods designed for LLMs, categorizing 8 representative VQ variants and analyzing their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. Result: The paper provides a first structured taxonomy of discrete tokenization methods for LLMs, discusses existing research across different application paradigms, identifies key challenges like codebook collapse and unstable gradient estimation, and highlights emerging research directions. Conclusion: This survey bridges the gap between traditional vector quantization and modern LLM applications, serving as a foundational reference for the development of efficient and generalizable multimodal systems. Abstract: The rapid advancement of large language models (LLMs) has intensified the need for effective mechanisms to transform continuous multimodal data into discrete representations suitable for language-based processing. Discrete tokenization, with vector quantization (VQ) as a central approach, offers both computational efficiency and compatibility with LLM architectures. Despite its growing importance, there is a lack of a comprehensive survey that systematically examines VQ techniques in the context of LLM-based systems. This work fills this gap by presenting the first structured taxonomy and analysis of discrete tokenization methods designed for LLMs. We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. 
Beyond algorithm-level investigation, we discuss existing research in terms of classical applications without LLMs, LLM-based single-modality systems, and LLM-based multimodal systems, highlighting how quantization strategies influence alignment, reasoning, and generation performance. In addition, we identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints. Finally, we discuss emerging research directions such as dynamic and task-adaptive quantization, unified tokenization frameworks, and biologically inspired codebook learning. This survey bridges the gap between traditional vector quantization and modern LLM applications, serving as a foundational reference for the development of efficient and generalizable multimodal systems. A continuously updated version is available at: https://github.com/jindongli-Ai/LLM-Discrete-Tokenization-Survey.

[11] Fast and Accurate Contextual Knowledge Extraction Using Cascading Language Model Chains and Candidate Answers

Lee Harris

Main category: cs.CL

TL;DR: This paper proposes the LMC algorithm to improve the efficiency and accuracy of language models while reducing hallucinations.

Details Motivation: Language models are costly and prone to hallucinations, which can lead to wasted resources if the generated information is incorrect. Method: Proposing, implementing, and applying the Language Model Chain (LMC) algorithm to address the cost and hallucination issues of language models. Result: Using the LMC algorithm to extract patient dates of birth from medical documents increased prediction speed and accuracy while reducing hallucinations. Conclusion: The LMC algorithm contributes significantly to the knowledge extraction field and should be further explored in the future. Abstract: Language models can capture complex relationships in given text, but these are notorious for being costly and for producing information that does not exist (i.e., hallucinations). Furthermore, the resources invested into producing this information would be wasted if it were incorrect. We address these issues by proposing, implementing, and applying the Language Model Chain (LMC) algorithm. In this, a language model's response to a given prompt about given text is only correct if it exists in the collection of possible (i.e., candidate) answers, and text corresponding to incorrect responses is fed into a more predictive (but slower) language model. This process is repeated for a collection of language models, or until all predictions about the text are correct. We used the LMC algorithm to extract patient dates of birth from medical documents, and combining a collection of language models in a multi-stage cascade significantly increased prediction speed and accuracy over individual language models, while greatly reducing the number of corresponding hallucinations. We believe that the novel LMC algorithm significantly contributes to the knowledge extraction field, and that this should be explored much further in the future.
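
The cascade described above can be sketched for a single document as follows. This is a minimal illustration under assumptions, not the author's implementation: models are plain callables ordered from fastest to slowest, candidate answers are extracted with a hypothetical ISO-date regular expression, and a response is accepted only if it appears among the candidates.

```python
import re

# Assumed candidate format for illustration: ISO dates such as 1980-05-17.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def candidate_answers(text):
    """Collect every string in the text that could be a valid answer."""
    return set(DATE_PATTERN.findall(text))


def language_model_chain(text, models):
    """Try models from fastest to slowest; accept the first answer found in the candidates.

    Returns (answer, index_of_model), or (None, None) if every model fails.
    """
    candidates = candidate_answers(text)
    for index, model in enumerate(models):
        answer = model(text)
        if answer in candidates:
            return answer, index
    return None, None


# Stand-in "models": the fast one hallucinates a date that is not in the text.
fast_model = lambda text: "1980-05-71"
slow_model = lambda text: "1980-05-17"
document = "Patient born 1980-05-17. Seen on 2024-01-02."
answer, used = language_model_chain(document, [fast_model, slow_model])
```

The membership check filters out answers that do not occur in the text at all, but it cannot tell which candidate is the right one: here a model answering "2024-01-02" would also be accepted.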

[12] Predicting stock prices with ChatGPT-annotated Reddit sentiment

Mateusz Kmak,Kamil Chmurzyński,Kamil Matejuk,Paweł Kotzbach,Jan Kocoń

Main category: cs.CL

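TL;DR: This study examines whether social media sentiment can predict stock market movements. Analyzing data from Reddit's r/wallstreetbets with several sentiment analysis methods, it finds only a weak correlation between social media sentiment and stock prices, while simpler metrics such as comment volume and Google search trends show stronger predictive signals.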

Details Motivation: The surge in retail investor activity on social media triggered by the 2021 GameStop short squeeze raised the question of whether online sentiment can meaningfully predict stock market movements. Method: Using data from Reddit's r/wallstreetbets, the study analyzes social media sentiment related to GameStop and AMC with two existing text-based sentiment analysis methods and a third, a ChatGPT-annotated and fine-tuned RoBERTa model, and evaluates the models' predictive power with correlation and causality metrics. Result: The findings show only a weak correlation between social media sentiment and stock prices. Conclusion: Traditional sentiment analysis may not fully capture the nuances of market-related online discussions, while simpler metrics such as comment volume and Google search trends show stronger predictive signals. Abstract: The surge of retail investor activity on social media, exemplified by the 2021 GameStop short squeeze, raised questions about the influence of online sentiment on stock prices. This paper explores whether sentiment derived from social media discussions can meaningfully predict stock market movements. We focus on Reddit's r/wallstreetbets and analyze sentiment related to two companies: GameStop (GME) and AMC Entertainment (AMC). To assess sentiment's role, we employ two existing text-based sentiment analysis methods and introduce a third, a ChatGPT-annotated and fine-tuned RoBERTa-based model designed to better interpret the informal language and emojis prevalent in social media discussions. We use correlation and causality metrics to determine these models' predictive power. Surprisingly, our findings suggest that social media sentiment has only a weak correlation with stock prices. At the same time, simpler metrics, such as the volume of comments and Google search trends, exhibit stronger predictive signals. These results highlight the complexity of retail investor behavior and suggest that traditional sentiment analysis may not fully capture the nuances of market-moving online discussions.

[13] How and Where to Translate? The Impact of Translation Strategies in Cross-lingual LLM Prompting

Aman Gupta,Yingying Zhuang,Zhou Yu,Ziji Zhang,Anurag Beniwal

Main category: cs.CL

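TL;DR: This paper studies the positive effect of optimized prompting strategies on knowledge sharing and task performance in multilingual systems.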

Details Motivation: Despite advances in the multilingual capabilities of large language models, their performance still varies substantially across languages and tasks. In multilingual retrieval-augmented generation systems, the knowledge base language often differs from the language of the rest of the context, and the impact of prompting strategies remains unclear. Method: The paper systematically evaluates the impact of different prompt translation strategies on text classification tasks in RAG-based multilingual systems. Result: Experiments show that an optimized prompting strategy can significantly improve knowledge sharing and downstream task performance. Conclusion: An optimized prompting strategy can significantly improve knowledge sharing in multilingual systems and thereby downstream classification performance. The study advocates broader use of multilingual resource sharing and cross-lingual prompt optimization, especially for low-resource languages. Abstract: Despite advances in the multilingual capabilities of Large Language Models (LLMs), their performance varies substantially across different languages and tasks. In multilingual retrieval-augmented generation (RAG)-based systems, knowledge bases (KB) are often shared from high-resource languages (such as English) to low-resource ones, resulting in retrieved information from the KB being in a different language than the rest of the context. In such scenarios, two common practices are pre-translation to create a mono-lingual prompt and cross-lingual prompting for direct inference. However, the impact of these choices remains unclear. In this paper, we systematically evaluate the impact of different prompt translation strategies for classification tasks with RAG-enhanced LLMs in multilingual systems. Experimental results show that an optimized prompting strategy can significantly improve knowledge sharing across languages and thereby improve performance on the downstream classification task. The findings advocate for a broader utilization of multilingual resource sharing and cross-lingual prompt optimization for non-English languages, especially the low-resource ones.

[14] Using Sentiment Analysis to Investigate Peer Feedback by Native and Non-Native English Speakers

Brittney Exline,Melanie Duffin,Brittany Harbison,Chrissa da Gomez,David Joyner

Main category: cs.CL

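TL;DR: The study examines differences in peer feedback experience between native and non-native English-speaking students in U.S. graduate CS courses.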

Details Motivation: As the number of non-U.S. students in U.S. graduate CS programs grows, many students study in a language other than their first. This paper examines how native versus non-native English speaker status affects the peer feedback experience. Method: A Twitter-roBERTa-based model is used to analyze the sentiment of peer reviews for a random sample of 500 students. Result: Native English speakers rate feedback less favorably, while non-native speakers write more positive feedback but receive less positive feedback in return. Conclusion: Language background plays a modest but complex role in shaping peer feedback experiences. Abstract: Graduate-level CS programs in the U.S. increasingly enroll international students, with 60.2 percent of master's degrees in 2023 awarded to non-U.S. students. Many of these students take online courses, where peer feedback is used to engage students and improve pedagogy in a scalable manner. Since these courses are conducted in English, many students study in a language other than their first. This paper examines how native versus non-native English speaker status affects three metrics of peer feedback experience in online U.S.-based computing courses. Using the Twitter-roBERTa-based model, we analyze the sentiment of peer reviews written by and to a random sample of 500 students. We then relate sentiment scores and peer feedback ratings to students' language background. Results show that native English speakers rate feedback less favorably, while non-native speakers write more positively but receive less positive sentiment in return. When controlling for sex and age, significant interactions emerge, suggesting that language background plays a modest but complex role in shaping peer feedback experiences.

[15] Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents

Haoran Sun,Shaoning Zeng

Main category: cs.CL

TL;DR: This paper proposes a Hierarchical Memory (H-MEM) architecture for LLM Agents that improves long-term memory organization and retrieval efficiency by using a multi-level structure and index-based routing, outperforming existing methods in dialogue tasks.

Details Motivation: Current memory mechanisms in LLM Agents struggle with structured memory organization and efficient retrieval, limiting their reasoning capabilities and contextual coherence in long-term interactions. Method: The authors introduce a Hierarchical Memory (H-MEM) architecture that organizes memory in multiple levels based on semantic abstraction. Each memory vector includes a positional index encoding to point to related sub-memories in lower layers, and retrieval is conducted layer-by-layer using an index-based routing mechanism. Result: The experimental results on the LoCoMo dataset show that the H-MEM approach consistently outperforms five baseline methods in long-term dialogue scenarios. Conclusion: The proposed H-MEM architecture effectively enhances the long-term memory capabilities of LLM Agents by organizing memory in a hierarchical structure and enabling efficient retrieval through an index-based routing mechanism. Abstract: Long-term memory is one of the key factors influencing the reasoning capabilities of Large Language Model Agents (LLM Agents). Incorporating a memory mechanism that effectively integrates past interactions can significantly enhance decision-making and contextual coherence of LLM Agents. While recent works have made progress in memory storage and retrieval, such as encoding memory into dense vectors for similarity-based search or organizing knowledge in the form of graph, these approaches often fall short in structured memory organization and efficient retrieval. To address these limitations, we propose a Hierarchical Memory (H-MEM) architecture for LLM Agents that organizes and updates memory in a multi-level fashion based on the degree of semantic abstraction. Each memory vector is embedded with a positional index encoding pointing to its semantically related sub-memories in the next layer. 
During the reasoning phase, an index-based routing mechanism enables efficient, layer-by-layer retrieval without performing exhaustive similarity computations. We evaluate our method on five task settings from the LoCoMo dataset. Experimental results show that our approach consistently outperforms five baseline methods, demonstrating its effectiveness in long-term dialogue scenarios.
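The layer-by-layer, index-based retrieval described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `MemoryNode` class, the `retrieve` function, the use of cosine similarity, and the toy two-dimensional vectors are all assumptions made for illustration. The point it shows is that each layer is scored only over the children that the previous layer points to, rather than over every stored memory.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryNode:
    """Hypothetical memory entry: text, an embedding, and child indices."""
    text: str
    vector: List[float]
    # Positions of related sub-memories in the next (more specific) layer.
    children: List[int] = field(default_factory=list)


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(layers: List[List[MemoryNode]], query: List[float], top_k: int = 1):
    """Walk from the most abstract layer (layers[0]) down to the most specific.

    At each layer only the candidates reached through child indices are
    scored, so no exhaustive similarity search over a whole layer is needed
    below the top level.
    """
    if not layers:
        return []
    candidates = list(range(len(layers[0])))
    best: List[int] = []
    for depth, layer in enumerate(layers):
        ranked = sorted(candidates, key=lambda i: cosine(layer[i].vector, query), reverse=True)
        best = ranked[:top_k]
        next_candidates = [c for i in best for c in layer[i].children]
        if depth == len(layers) - 1 or not next_candidates:
            return [layer[i] for i in best]
        candidates = next_candidates
    return []


# Toy two-layer memory: topics on top, concrete episodes below.
layers = [
    [MemoryNode("travel", [1.0, 0.0], [0, 1]), MemoryNode("work", [0.0, 1.0], [2])],
    [
        MemoryNode("trip to Rome", [0.9, 0.1]),
        MemoryNode("flight booking", [0.8, 0.3]),
        MemoryNode("project deadline", [0.1, 0.9]),
    ],
]
print([node.text for node in retrieve(layers, [1.0, 0.05])])
```

With `top_k=1` the search follows a single path through the hierarchy; a larger `top_k` widens the path at the cost of more similarity computations per layer.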

[16] Multi-Relation Extraction in Entity Pairs using Global Context

Nilesh,Atul Gupta,Avinash C Panday

Main category: cs.CL

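TL;DR: This paper proposes a novel input embedding approach that captures global relationships among entities and multi-sentence reasoning across a document, improving the accuracy of document-level relation extraction.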

Details Motivation: In document-level relation extraction, an entity may appear multiple times in a document, and its relationships may change with context. Accurately predicting the relationship between two entities in a document requires building a global context spanning all relevant sentences. Previous approaches focus only on the sentences where entities are mentioned and cannot capture the full document context. Method: The paper proposes a novel input embedding approach that captures the positions and global relationships of entities in a document by representing entities as standalone segments, independent of their positions within the document. Result: Experiments show that the method accurately predicts relationships between entities in document-level relation extraction and performs well on three benchmark datasets. Conclusion: The paper proposes a novel input embedding approach that captures global relationships among entities and multi-sentence reasoning, improving the accuracy of document-level relation extraction. The method was tested on three benchmark datasets (DocRED, Re-DocRED, and REBEL), and the experimental results show good performance in document-level relation extraction. Abstract: In document-level relation extraction, entities may appear multiple times in a document, and their relationships can shift from one context to another. Accurate prediction of the relationship between two entities across an entire document requires building a global context spanning all relevant sentences. Previous approaches have focused only on the sentences where entities are mentioned, which fails to capture the complete document context necessary for accurate relation extraction. Therefore, this paper introduces a novel input embedding approach to capture the positions of mentioned entities throughout the document rather than focusing solely on the span where they appear. The proposed input encoding approach leverages global relationships and multi-sentence reasoning by representing entities as standalone segments, independent of their positions within the document. The performance of the proposed method has been tested on three benchmark relation extraction datasets, namely DocRED, Re-DocRED, and REBEL. The experimental results demonstrated that the proposed method accurately predicts relationships between entities in a document-level setting. The proposed research also has theoretical and practical implications. Theoretically, it advances global context modeling and multi-sentence reasoning in document-level relation extraction. Practically, it enhances relationship detection, enabling improved performance in real-world NLP applications requiring comprehensive entity-level insights and interpretability.

[17] PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation

Zhehao Tan,Yihan Jiao,Dan Yang,Lei Liu,Jie Feng,Duolin Sun,Yue Shen,Jian Wang,Peng Wei,Jinjie Gu

Main category: cs.CL

TL;DR: This paper introduces Placeholder-RAG-Benchmark, a fine-grained framework for evaluating how effectively LLMs use external knowledge in RAG systems, revealing weaknesses in current models and offering a path toward more reliable systems.

Details Motivation: Most existing RAG benchmarks focus on overall system performance and lack a systematic, granular evaluation of LLM-specific capabilities. This work aims to fill that gap by introducing a benchmark that provides a nuanced understanding of how LLMs utilize external knowledge. Method: The researchers designed a multi-level, fine-grained benchmark called Placeholder-RAG-Benchmark, which evaluates LLMs across dimensions such as multi-level filtering abilities, combination abilities, and reference reasoning. They used a placeholder-based approach to isolate the impact of parametric and external knowledge. Result: Experiments revealed limitations in current LLMs' RAG generation capabilities, particularly in error resilience and context faithfulness, highlighting areas for improvement in RAG system design. Conclusion: The study concludes that the introduced Placeholder-RAG-Benchmark offers a reproducible framework for evaluating and developing more reliable and efficient RAG systems by decoupling the contributions of parametric and external knowledge. Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, where the LLM's ability to generate responses based on the combination of a given query and retrieved documents is crucial. However, most benchmarks focus on overall RAG system performance, rarely assessing LLM-specific capabilities. Current benchmarks emphasize broad aspects such as noise robustness, but lack a systematic and granular evaluation framework on document utilization. To this end, we introduce \textit{Placeholder-RAG-Benchmark}, a multi-level fine-grained benchmark, emphasizing the following progressive dimensions: (1) multi-level filtering abilities, (2) combination abilities, and (3) reference reasoning. 
To provide a more nuanced understanding of LLMs' roles in RAG systems, we formulate an innovative placeholder-based approach to decouple the contributions of the LLM's parametric knowledge and the external knowledge. Experiments demonstrate the limitations of representative LLMs in the RAG system's generation capabilities, particularly in error resilience and context faithfulness. Our benchmark provides a reproducible framework for developing more reliable and efficient RAG systems. Our code is available in https://github.com/Alipay-Med/PRGB.

[18] How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

Xi Chen,Aske Plaat,Niki van Stein

Main category: cs.CL

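TL;DR: Using sparse autoencoders and activation patching, this paper presents the first causal study of the faithfulness of chain-of-thought (CoT) prompting, finding that CoT makes the internal computation of large language models more modular and interpretable.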

Details Motivation: Chain-of-thought (CoT) prompting improves the accuracy of large language models on multi-step tasks, but it remains unclear whether the generated "thoughts" reflect the model's true reasoning process. Method: The authors combine sparse autoencoders with activation patching to extract monosemantic features from Pythia models solving math problems under CoT and no-CoT prompting, and swap features between runs to observe the effect on answer probabilities. Result: In the 2.8B model, swapping CoT-reasoning features into a no-CoT run significantly raises answer log-probabilities, whereas the effect is not evident in the 70M model. In addition, CoT prompting increases activation sparsity and feature interpretability scores in the larger model. Conclusion: CoT prompting can induce more interpretable internal structures in high-capacity language models, validating its role as a structured prompting method. Abstract: Chain-of-thought (CoT) prompting boosts Large Language Models accuracy on multi-step tasks, yet whether the generated "thoughts" reflect the true internal reasoning process is unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model, but has no reliable effect in 70M, revealing a clear scale threshold. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model's confidence in generating correct answers improves from 1.2 to 4.3. We introduce patch-curves and random-feature patching baselines, showing that useful CoT information is not only present in the top-K patches but widely distributed. Overall, our results indicate that CoT can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.

[19] EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow

Xiaoyu Pan,Yang Bai,Ke Zou,Yang Zhou,Jun Zhou,Huazhu Fu,Yih-Chung Tham,Yong Liu

Main category: cs.CL

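TL;DR: EH-Benchmark is a new ophthalmology benchmark for evaluating and mitigating hallucinations in medical large language models; its multi-agent framework significantly improves accuracy, interpretability, and reliability.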

Details Motivation: Hallucinations in medical large language models (MLLMs) limit the accuracy of ophthalmic diagnosis, so a new benchmark that can effectively evaluate and mitigate hallucinations is needed. Method: EH-Benchmark comprises a categorization of hallucinations by task and error type, together with an agent-centric, three-phase framework consisting of knowledge-level retrieval, task-level case studies, and result-level validation. Result: Experiments show that the multi-agent framework of EH-Benchmark significantly reduces hallucinations and improves the accuracy, interpretability, and reliability of MLLMs. Conclusion: EH-Benchmark is a new ophthalmology benchmark for evaluating hallucinations in MLLMs; its multi-agent framework significantly mitigates hallucinations and improves accuracy, interpretability, and reliability. Abstract: Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs' hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language-based reasoning rather than visual processing, we propose an agent-centric, three-phase framework, including the Knowledge-Level Retrieval stage, the Task-Level Case Studies stage, and the Result-Level Validation stage. Experimental results show that our multi-agent framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability. Our project is available at https://github.com/ppxy1/EH-Benchmark.

[20] Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection

Shalini Jangra,Suparna De,Nishanth Sastry,Saeed Fadaei

Main category: cs.CL

TL;DR: This paper presents a methodology for creating synthetic datasets of PII-revealing text to support reproducible research while preserving user privacy, with the resulting dataset being indistinguishable from real data and publicly available for further study.

Details Motivation: The lack of open-source labeled datasets for research into the identification and retrieval of risky self-disclosures of Personal Information Identifiers (PIIs) on social platforms like Reddit motivated the development of a synthetic dataset to foster reproducible research in this area. Method: The authors developed a methodology to create synthetic PII-revealing data using a taxonomy of 19 categories and generated a dataset using three large language models (Llama2-7B, Llama3-8B, and zephyr-7b-beta) with sequential instruction prompting. They evaluated the dataset's utility using metrics such as reproducibility equivalence, unlinkability, and indistinguishability. Result: The authors successfully generated a synthetic PII-labeled multi-text span dataset that meets the criteria of reproducibility equivalence, unlinkability, and indistinguishability, and they released the dataset and code to promote further research into PII privacy risks. Conclusion: The paper concludes that their methodology for generating synthetic PII-revealing data is effective, as demonstrated by the successful creation of a labeled dataset that is both indistinguishable from real data and unlinkable to original users, thus promoting reproducible research in PII privacy risks. Abstract: Social platforms such as Reddit have a network of communities of shared interests, with a prevalence of posts and comments from which one can infer users' Personal Information Identifiers (PIIs). While such self-disclosures can lead to rewarding social interactions, they pose privacy risks and the threat of online harms. Research into the identification and retrieval of such risky self-disclosures of PIIs is hampered by the lack of open-source labeled datasets. To foster reproducible research into PII-revealing text detection, we develop a novel methodology to create synthetic equivalents of PII-revealing data that can be safely shared. 
Our contributions include creating a taxonomy of 19 PII-revealing categories for vulnerable populations and the creation and release of a synthetic PII-labeled multi-text span dataset generated from 3 text generation Large Language Models (LLMs), Llama2-7B, Llama3-8B, and zephyr-7b-beta, with sequential instruction prompting to resemble the original Reddit posts. The utility of our methodology to generate this synthetic dataset is evaluated with three metrics: First, we require reproducibility equivalence, i.e., results from training a model on the synthetic data should be comparable to those obtained by training the same models on the original posts. Second, we require that the synthetic data be unlinkable to the original users, through common mechanisms such as Google Search. Third, we wish to ensure that the synthetic data be indistinguishable from the original, i.e., trained humans should not be able to tell them apart. We release our dataset and code at https://netsys.surrey.ac.uk/datasets/synthetic-self-disclosure/ to foster reproducible research into PII privacy risks in online social media.

[21] Enhancing RAG Efficiency with Adaptive Context Compression

Shuyu Guo,Zhaochun Ren

Main category: cs.CL

TL;DR: ACC-RAG dynamically compresses RAG contexts based on input complexity, improving inference speed by up to 4x while maintaining accuracy.

Details Motivation: Existing RAG context compression methods apply fixed rates, leading to over- or under-compression. A dynamic approach is needed to optimize efficiency and accuracy. Method: ACC-RAG combines a hierarchical compressor with a context selector to dynamically adjust compression rates based on input complexity. Result: ACC-RAG outperforms fixed-rate methods, achieving up to 4x faster inference speed on Wikipedia and five QA datasets. Conclusion: ACC-RAG maintains or improves accuracy while significantly enhancing inference efficiency compared to standard RAG methods. Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but incurs significant inference costs due to lengthy retrieved contexts. While context compression mitigates this issue, existing methods apply fixed compression rates, over-compressing simple queries or under-compressing complex ones. We propose Adaptive Context Compression for RAG (ACC-RAG), a framework that dynamically adjusts compression rates based on input complexity, optimizing inference efficiency without sacrificing accuracy. ACC-RAG combines a hierarchical compressor (for multi-granular embeddings) with a context selector to retain minimal sufficient information, akin to human skimming. Evaluated on Wikipedia and five QA datasets, ACC-RAG outperforms fixed-rate methods and matches/unlocks over 4 times faster inference versus standard RAG while maintaining or improving accuracy.

[22] FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification

Baptiste Lefort,Eric Benhamou,Beatrice Guez,Jean-Jacques Ohana,Ethan Setrouk,Alban Etienne

Main category: cs.CL

TL;DR: This paper introduces a hierarchical framework combining LLMs and DRL for portfolio optimization, achieving superior financial performance through sentiment and market data integration.

Details Motivation: The motivation is to enhance portfolio optimization by combining sentiment signals from financial news with traditional market indicators, aiming for improved stability and performance. Method: The method involves a three-tier architecture using RL agents for processing hybrid data, meta-agents for aggregating decisions, and a super-agent for merging decisions based on market data and sentiment analysis. Result: The framework achieves a 26% annualized return and a Sharpe ratio of 1.2, evaluated on data from 2018 to 2024. Conclusion: The paper concludes that the proposed hierarchical framework for portfolio optimization, which integrates LLMs with DRL, outperforms traditional benchmarks in terms of annualized return and Sharpe ratio. Abstract: This paper presents a novel hierarchical framework for portfolio optimization, integrating lightweight Large Language Models (LLMs) with Deep Reinforcement Learning (DRL) to combine sentiment signals from financial news with traditional market indicators. Our three-tier architecture employs base RL agents to process hybrid data, meta-agents to aggregate their decisions, and a super-agent to merge decisions based on market data and sentiment analysis. Evaluated on data from 2018 to 2024, after training on 2000-2017, the framework achieves a 26% annualized return and a Sharpe ratio of 1.2, outperforming equal-weighted and S&P 500 benchmarks. Key contributions include scalable cross-modal integration, a hierarchical RL structure for enhanced stability, and open-source reproducibility.
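For readers unfamiliar with the Sharpe ratio quoted above, the following sketch shows the conventional annualized form: mean excess return divided by its standard deviation, scaled by the square root of the number of periods per year. The abstract does not say how the authors computed their figure, so the function name `annualized_sharpe`, the monthly frequency, the zero risk-free rate, and the sample returns are all illustrative assumptions.

```python
import math
import statistics


def annualized_sharpe(returns, risk_free_per_period=0.0, periods_per_year=12):
    """Annualized Sharpe ratio from a list of per-period simple returns.

    Assumes returns are sampled at a fixed frequency (monthly by default)
    and uses the sample standard deviation of excess returns.
    """
    if len(returns) < 2:
        raise ValueError("need at least two returns to estimate volatility")
    excess = [r - risk_free_per_period for r in returns]
    volatility = statistics.stdev(excess)
    if volatility == 0:
        raise ValueError("volatility is zero; Sharpe ratio is undefined")
    return statistics.mean(excess) / volatility * math.sqrt(periods_per_year)


# Made-up monthly returns, purely to exercise the function.
monthly_returns = [0.01, 0.02, -0.01, 0.03]
print(round(annualized_sharpe(monthly_returns), 3))
```

A ratio computed on four data points is not meaningful in practice; the sample is only there to make the arithmetic easy to follow by hand.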

[23] Augmented Vision-Language Models: A Systematic Review

Anthony C Davis,Burhan Sadiq,Tianmin Shu,Chien-Ming Huang

Main category: cs.CL

TL;DR: This paper reviews how combining visual-language models with symbolic systems can improve reasoning, interpretability, and adaptability without extensive retraining.

Details Motivation: The motivation stems from the limitations of current visual-language machine learning models, which cannot provide interpretable explanations, require retraining for new information, are resource-intensive, and struggle with logical reasoning. Method: The paper conducts a systematic literature review to categorize techniques that improve visual-language understanding through interaction with external symbolic information systems. Result: The result is a categorization of techniques that leverage pre-trained Vision-Language Models integrated with external symbolic systems to improve reasoning, memory, and interpretability. Conclusion: The paper concludes that integrating neural networks with external symbolic information systems can enhance reasoning and memory abilities and provides more interpretable explanations. This approach allows for the assimilation of new information without extensive retraining. Abstract: Recent advances in visual-language machine learning models have demonstrated exceptional ability to use natural language and understand visual scenes by training on large, unstructured datasets. However, this training paradigm cannot produce interpretable explanations for its outputs, requires retraining to integrate new information, is highly resource-intensive, and struggles with certain forms of logical reasoning. One promising solution involves integrating neural networks with external symbolic information systems, forming neural symbolic systems that can enhance reasoning and memory abilities. These neural symbolic systems provide more interpretable explanations to their outputs and the capacity to assimilate new information without extensive retraining. Utilizing powerful pre-trained Vision-Language Models (VLMs) as the core neural component, augmented by external systems, offers a pragmatic approach to realizing the benefits of neural-symbolic integration. 
This systematic literature review aims to categorize techniques through which visual-language understanding can be improved by interacting with external symbolic information systems.

[24] Deep Learning Approaches for Multimodal Intent Recognition: A Survey

Jingwei Zhao,Yuhua Wen,Qifei Li,Minchi Hu,Yingying Zhou,Jingyao Xue,Junyang Wu,Yingming Gao,Zhengqi Wen,Jianhua Tao,Ya Li

Main category: cs.CL

TL;DR: This paper surveys deep learning methods for intent recognition, highlighting the shift from unimodal to multimodal approaches and the impact of Transformer-based models, while providing insights for future research.

Details Motivation: The motivation for the paper is the growing demand for natural human-computer interaction, which has driven the evolution of intent recognition through deep learning and multimodal approaches incorporating data from multiple sources. Method: The paper presents a survey of deep learning methods in intent recognition, covering the evolution from unimodal to multimodal techniques, datasets, methodologies, and applications. Result: The result is a comprehensive overview of the advancements in intent recognition, including insights into the latest developments in multimodal intent recognition (MIR) and directions for future research. Conclusion: The paper concludes that the field of intent recognition has significantly advanced with the adoption of deep learning and multimodal approaches, particularly with the introduction of Transformer-based models. It emphasizes the importance of these advancements for future research and applications. Abstract: Intent recognition aims to identify users' underlying intentions, traditionally focusing on text in natural language processing. With growing demands for natural human-computer interaction, the field has evolved through deep learning and multimodal approaches, incorporating data from audio, vision, and physiological signals. Recently, the introduction of Transformer-based models has led to notable breakthroughs in this domain. This article surveys deep learning methods for intent recognition, covering the shift from unimodal to multimodal techniques, relevant datasets, methodologies, applications, and current challenges. It provides researchers with insights into the latest developments in multimodal intent recognition (MIR) and directions for future research.

[25] Trusted Knowledge Extraction for Operations and Maintenance Intelligence

Kathleen Mealey,Jonathan A. Karr Jr.,Priscila Saboia Moreira,Paul R. Brenner,Charles F. Vardeman II

Main category: cs.CL

TL;DR: The paper evaluates NLP and LLM tools for operational and maintenance intelligence in the aircraft industry, highlighting performance limitations and challenges in trusted environments, and offers an open-source dataset for further testing.

Details Motivation: The motivation is to derive operational intelligence from organizational data repositories despite challenges such as data confidentiality vs data integration and limitations of NLP tools in domain-specific contexts like operations and maintenance. Method: The paper evaluates sixteen NLP tools and Large Language Models (LLMs) in a controlled environment, focusing on their zero-shot performance in the operational and maintenance intelligence use case for the aircraft industry. The study uses a baseline dataset derived from a public domain dataset from the US Federal Aviation Administration. Result: The paper observes significant performance limitations in NLP and LLM tools within a controlled, confidential environment and highlights challenges related to their trustworthiness and Technical Readiness Level for use in mission-critical industries. Conclusion: The paper concludes with recommendations to enhance trust in NLP and LLM tools and provides an open-source curated dataset for further baseline testing and evaluation. Abstract: Deriving operational intelligence from organizational data repositories is a key challenge due to the dichotomy of data confidentiality vs data integration objectives, as well as the limitations of Natural Language Processing (NLP) tools relative to the specific knowledge structure of domains such as operations and maintenance. In this work, we discuss Knowledge Graph construction and break down the Knowledge Extraction process into its Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction functional components. We then evaluate sixteen NLP tools in concert with or in comparison to the rapidly advancing capabilities of Large Language Models (LLMs). We focus on the operational and maintenance intelligence use case for trusted applications in the aircraft industry. 
A baseline dataset is derived from a rich public domain US Federal Aviation Administration dataset focused on equipment failures or maintenance requirements. We assess the zero-shot performance of NLP and LLM tools that can be operated within a controlled, confidential environment (no data is sent to third parties). Based on our observation of significant performance limitations, we discuss the challenges related to trusted NLP and LLM tools as well as their Technical Readiness Level for wider use in mission-critical industries such as aviation. We conclude with recommendations to enhance trust and provide our open-source curated dataset to support further baseline testing and evaluation.

[26] Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis

Md Talha Mohsin

Main category: cs.CL

TL;DR: This study compares five leading LLMs (GPT, Claude, Perplexity, Gemini, and DeepSeek) in financial analysis tasks using 10-K filings from major tech companies. GPT performs best overall, while Gemini and DeepSeek show more variability.

Details Motivation: The study aims to fill the gap in systematic comparisons among widely used Large Language Models (LLMs) in Financial Natural Language Processing (FinNLP), especially given their growing influence in financial analysis. Method: A comparative evaluation of five LLMs (GPT, Claude, Perplexity, Gemini, and DeepSeek) was conducted using domain-specific prompts and 10-K filings from the 'Magnificent Seven' technology companies. Three evaluation methodologies were employed: human annotation, automated lexical-semantic metrics (ROUGE, Cosine Similarity, Jaccard), and model behavior diagnostics (prompt-level variance and across-model similarity). Result: GPT outperforms other models in coherence, semantic alignment, and contextual relevance, followed by Claude and Perplexity. Gemini and DeepSeek show more variability and less agreement. Output similarity and stability vary across companies and time, indicating sensitivity to prompt formulation and source material. Conclusion: GPT demonstrates the best performance in coherence, semantic alignment, and contextual relevance, followed by Claude and Perplexity, while Gemini and DeepSeek show higher variability and lower agreement in their outputs. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide variety of Financial Natural Language Processing (FinNLP) tasks. However, systematic comparisons among widely used LLMs remain underexplored. Given the rapid advancement and growing influence of LLMs in financial analysis, this study conducts a thorough comparative evaluation of five leading LLMs, GPT, Claude, Perplexity, Gemini and DeepSeek, using 10-K filings from the 'Magnificent Seven' technology companies. 
We create a set of domain-specific prompts and then use three methodologies to evaluate model performance: human annotation, automated lexical-semantic metrics (ROUGE, Cosine Similarity, Jaccard), and model behavior diagnostics (prompt-level variance and across-model similarity). The results show that GPT gives the most coherent, semantically aligned, and contextually relevant answers; followed by Claude and Perplexity. Gemini and DeepSeek, on the other hand, have more variability and less agreement. Also, the similarity and stability of outputs change from company to company and over time, showing that they are sensitive to how prompts are written and what source material is used.
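Two of the automated metrics named above, Jaccard and cosine similarity, are easy to illustrate on plain token sets and counts. The abstract does not describe the tokenization or the vector representation used in the study, so the whitespace tokenizer and the bag-of-words vectors below are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokens(text):
    """Naive lowercase whitespace tokenizer (an assumption for this sketch)."""
    return text.lower().split()


def jaccard(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_bow(a, b):
    """Cosine similarity between bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


answer_one = "revenue grew strongly"
answer_two = "revenue grew slowly"
print(jaccard(answer_one, answer_two))     # 2 shared tokens out of 4 distinct
print(cosine_bow(answer_one, answer_two))  # dot product 2 over norm product 3
```

Jaccard ignores how often a token occurs, while the count-based cosine does not; on longer outputs the two can rank model pairs differently, which is one reason to report both.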

[27] CoE-Ops: Collaboration of LLM-based Experts for AIOps Question-Answering

Jinkun Zhao,Yuanshuai Wang,Xingjian Zhang,Ruibo Chen,Xingchuang Liao,Junle Wang,Lei Huang,Kui Zhang,Wenjun Wu

Main category: cs.CL

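TL;DR: This paper proposes a new collaboration-of-experts framework (CoE-Ops) that incorporates a retrieval-augmented generation mechanism to improve performance and accuracy on AIOps tasks.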

Details Motivation: Because a single model is constrained by domain-specific knowledge and struggles with complex AIOps tasks, multiple models need to be combined to improve performance. Method: The paper proposes a collaboration-of-experts framework (CoE-Ops) and introduces a retrieval-augmented generation mechanism to improve the framework's ability to handle both high-level and low-level AIOps tasks. Result: Experiments show that CoE-Ops improves routing accuracy for high-level AIOps tasks by 72% over existing CoE methods, improves accuracy by up to 8% over single AIOps models in DevOps problem resolution, and outperforms larger-scale Mixture-of-Experts (MoE) models by up to 14% in accuracy. Conclusion: By combining multiple expert models with a retrieval-augmented generation mechanism, CoE-Ops achieves higher accuracy and efficiency in the AIOps domain, outperforming existing single-model and other CoE methods. Abstract: With the rapid evolution of artificial intelligence, AIOps has emerged as a prominent paradigm in DevOps. Lots of work has been proposed to improve the performance of different AIOps phases. However, constrained by domain-specific knowledge, a single model can only handle the operation requirement of a specific task, such as log parser, root cause analysis. Meanwhile, combining multiple models can achieve more efficient results, which have been proved in both previous ensemble learning and the recent LLM training domain. Inspired by these works, to address the similar challenges in AIOPS, this paper first proposes a collaboration-of-expert framework (CoE-Ops) incorporating a general-purpose large language model task classifier. A retrieval-augmented generation mechanism is introduced to improve the framework's capability in handling both Question-Answering tasks with high-level (Code, build, Test, etc.) and low-level (fault analysis, anomaly detection, etc.). Finally, the proposed method is implemented in the AIOps domain, and extensive experiments are conducted on the DevOps-EVAL dataset. Experimental results demonstrate that CoE-Ops achieves a 72% improvement in routing accuracy for high-level AIOps tasks compared to existing CoE methods, delivers up to 8% accuracy enhancement over single AIOps models in DevOps problem resolution, and outperforms larger-scale Mixture-of-Experts (MoE) models by up to 14% in accuracy.

[28] A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents

Sumit Soman,H. G. Ranjani,Sujoy Roychowdhury,Venkata Dharma Surya Narayana Sastry,Akshat Jain,Pranav Gangrade,Ayaaz Khan

Main category: cs.CL

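TL;DR: This paper uses visual large language models to generate graph representations of flowcharts and incorporates them into a text-based retrieval system to improve question answering over technical documents in the telecom domain.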

Details Motivation: Text-based retrieval-augmented generation systems are limited when handling images in technical documents, such as flowcharts, so a more effective method is needed to answer questions that involve images. Method: The authors present an end-to-end approach that covers processing technical documents, classifying image types, building graph representations, and combining those representations with the text embedding pipeline for efficient retrieval. Result: Experiments show that graph representations generated by a fine-tuned visual large language model have lower edit distance with respect to the ground truth, and that combining them with text embedding models gives good retrieval performance. Conclusion: The study proposes a method that uses visual large language models to generate graph representations of flowcharts and incorporates them into a text-based retrieval-augmented generation system, enabling efficient question-answering retrieval in the telecom domain. Abstract: Question-Answering (QA) from technical documents often involves questions whose answers are present in figures, such as flowcharts or flow diagrams. Text-based Retrieval Augmented Generation (RAG) systems may fail to answer such questions. We leverage graph representations of flowcharts obtained from Visual large Language Models (VLMs) and incorporate them in a text-based RAG system to show that this approach can enable image retrieval for QA in the telecom domain. We present the end-to-end approach from processing technical documents, classifying image types, building graph representations, and incorporating them with the text embedding pipeline for efficient retrieval. We benchmark the same on a QA dataset created based on proprietary telecom product information documents. Results show that the graph representations obtained using a fine-tuned VLM model have lower edit distance with respect to the ground truth, which illustrate the robustness of these representations for flowchart images. Further, the approach for QA using these representations gives good retrieval performance using text-based embedding models, including a telecom-domain adapted one. Our approach also alleviates the need for a VLM in inference, which is an important cost benefit for deployed QA systems.
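One way to picture the idea of turning a flowchart into text that a text embedding pipeline can index, and of scoring a predicted graph against a ground truth by edit distance, is sketched here. The abstract does not specify the graph serialization or which edit distance was used, so the edge-list format, the `serialize_flowchart` helper, and the choice of character-level Levenshtein distance are assumptions for illustration only.

```python
def serialize_flowchart(edges):
    """Turn (source, target, label) edges into one text line per edge.

    The arrow notation is a hypothetical format chosen for readability.
    """
    lines = []
    for source, target, label in edges:
        arrow = f"-[{label}]->" if label else "->"
        lines.append(f"{source} {arrow} {target}")
    return "\n".join(lines)


def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Made-up flowchart: the prediction mislabels one branch.
ground_truth = [("Start", "Check alarm", ""),
                ("Check alarm", "Restart unit", "yes"),
                ("Check alarm", "End", "no")]
predicted = [("Start", "Check alarm", ""),
             ("Check alarm", "Restart unit", "yes"),
             ("Check alarm", "End", "yes")]

truth_text = serialize_flowchart(ground_truth)
predicted_text = serialize_flowchart(predicted)
print(truth_text)
print(levenshtein(truth_text, predicted_text))
```

Because the serialized graph is ordinary text, it can be chunked and embedded alongside the document's prose, which is consistent with the abstract's remark that no VLM is needed at inference time.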

[29] PARROT: An Open Multilingual Radiology Reports Dataset

Bastien Le Guellec,Kokou Adambounou,Lisa C Adams,Thibault Agripnidis,Sung Soo Ahn,Radhia Ait Chalal,Tugba Akinci D Antonoli,Philippe Amouyel,Henrik Andersson,Raphael Bentegeac,Claudio Benzoni,Antonino Andrea Blandino,Felix Busch,Elif Can,Riccardo Cau,Armando Ugo Cavallo,Christelle Chavihot,Erwin Chiquete,Renato Cuocolo,Eugen Divjak,Gordana Ivanac,Barbara Dziadkowiec Macek,Armel Elogne,Salvatore Claudio Fanni,Carlos Ferrarotti,Claudia Fossataro,Federica Fossataro,Katarzyna Fulek,Michal Fulek,Pawel Gac,Martyna Gachowska,Ignacio Garcia Juarez,Marco Gatti,Natalia Gorelik,Alexia Maria Goulianou,Aghiles Hamroun,Nicolas Herinirina,Krzysztof Kraik,Dominik Krupka,Quentin Holay,Felipe Kitamura,Michail E Klontzas,Anna Kompanowska,Rafal Kompanowski,Alexandre Lefevre,Tristan Lemke,Maximilian Lindholz,Lukas Muller,Piotr Macek,Marcus Makowski,Luigi Mannacio,Aymen Meddeb,Antonio Natale,Beatrice Nguema Edzang,Adriana Ojeda,Yae Won Park,Federica Piccione,Andrea Ponsiglione,Malgorzata Poreba,Rafal Poreba,Philipp Prucker,Jean Pierre Pruvo,Rosa Alba Pugliesi,Feno Hasina Rabemanorintsoa,Vasileios Rafailidis,Katarzyna Resler,Jan Rotkegel,Luca Saba,Ezann Siebert,Arnaldo Stanzione,Ali Fuat Tekin,Liz Toapanta Yanchapaxi,Matthaios Triantafyllou,Ekaterini Tsaoulia,Evangelia Vassalou,Federica Vernuccio,Johan Wasselius,Weilang Wang,Szymon Urban,Adrian Wlodarczak,Szymon Wlodarczak,Andrzej Wysocki,Lina Xu,Tomasz Zatonski,Shuhang Zhang,Sebastian Ziegelmayer,Gregory Kuchcinski,Keno K Bressem

Main category: cs.CL

TL;DR: PARROT is a comprehensive multilingual dataset of fictional radiology reports designed to advance natural language processing applications in radiology without privacy issues.

Details Motivation: To develop and validate PARROT, a large, multicenter, open-access dataset of fictional radiology reports in multiple languages for testing natural language processing applications in radiology. Method: Radiologists contributed fictional radiology reports with metadata including anatomical region, imaging modality, clinical context, and English translations for non-English reports. A human vs. AI differentiation study was conducted. Result: The dataset comprises 2,658 radiology reports from 76 authors across 21 countries and 13 languages. Participants achieved 53.9% accuracy in distinguishing between human and AI-generated reports, with radiologists performing significantly better. Conclusion: PARROT represents the largest open multilingual radiology report dataset, enabling development and validation of natural language processing applications across linguistic, geographic, and clinical boundaries without privacy constraints. Abstract: Rationale and Objectives: To develop and validate PARROT (Polyglottal Annotated Radiology Reports for Open Testing), a large, multicentric, open-access dataset of fictional radiology reports spanning multiple languages for testing natural language processing applications in radiology. Materials and Methods: From May to September 2024, radiologists were invited to contribute fictional radiology reports following their standard reporting practices. Contributors provided at least 20 reports with associated metadata including anatomical region, imaging modality, clinical context, and for non-English reports, English translations. All reports were assigned ICD-10 codes. A human vs. AI report differentiation study was conducted with 154 participants (radiologists, healthcare professionals, and non-healthcare professionals) assessing whether reports were human-authored or AI-generated. Results: The dataset comprises 2,658 radiology reports from 76 authors across 21 countries and 13 languages. 
Reports cover multiple imaging modalities (CT: 36.1%, MRI: 22.8%, radiography: 19.0%, ultrasound: 16.8%) and anatomical regions, with chest (19.9%), abdomen (18.6%), head (17.3%), and pelvis (14.1%) being most prevalent. In the differentiation study, participants achieved 53.9% accuracy (95% CI: 50.7%-57.1%) in distinguishing between human and AI-generated reports, with radiologists performing significantly better (56.9%, 95% CI: 53.3%-60.6%, p<0.05) than other groups. Conclusion: PARROT represents the largest open multilingual radiology report dataset, enabling development and validation of natural language processing applications across linguistic, geographic, and clinical boundaries without privacy constraints.

[30] Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes

Rui Jiao,Yue Zhang,Jinku Li

Main category: cs.CL

TL;DR: This paper introduces RELIANCE, a new framework for improving the factual accuracy of reasoning in Large Language Models, which significantly enhances factual robustness while maintaining strong performance on complex tasks.

Details Motivation: The motivation stems from the vulnerability in Large Language Models where factual inaccuracies in intermediate reasoning steps can mislead users into dangerous decisions, especially in high-stakes domains like healthcare and legal analysis. Method: The study introduces RELIANCE, which combines a fact-checking classifier, Group Relative Policy Optimization (GRPO) reinforcement learning, and a mechanistic interpretability module to enhance reasoning accuracy in LLMs. Result: Evaluation across ten state-of-the-art models showed that even leading models like Claude-3.7 and GPT-o1 had reasoning factual accuracy of only 81.93% and 82.57%, respectively. RELIANCE improved factual robustness by up to 49.90% while maintaining or enhancing performance on challenging benchmarks. Conclusion: The study concludes that RELIANCE effectively improves the factual robustness of reasoning in Large Language Models (LLMs), offering insights into how future training methodologies can target factual accuracy through activation-guided optimization. Abstract: We present RELIANCE (Reasoning Evaluation with Logical Integrity and Accuracy for Confidence Enhancement), a novel framework addressing a critical vulnerability in Large Language Models (LLMs): the prevalence of factual inaccuracies within intermediate reasoning steps despite correct final answers. This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research, where erroneous yet confidently presented reasoning can mislead users into dangerous decisions. 
Our framework integrates three core components: (1) a specialized fact-checking classifier trained on counterfactually augmented data to detect subtle factual inconsistencies within reasoning chains; (2) a Group Relative Policy Optimization (GRPO) reinforcement learning approach that balances factuality, coherence, and structural correctness through multi-dimensional rewards; and (3) a mechanistic interpretability module examining how factuality improvements manifest in model activations during reasoning processes. Extensive evaluation across ten state-of-the-art models reveals concerning patterns: even leading models like Claude-3.7 and GPT-o1 demonstrate reasoning factual accuracy of only 81.93% and 82.57% respectively. RELIANCE significantly enhances factual robustness (up to 49.90% improvement) while maintaining or improving performance on challenging benchmarks including Math-500, AIME-2024, and GPQA. Furthermore, our activation-level analysis provides actionable insights into how factual enhancements reshape reasoning trajectories within model architectures, establishing foundations for future training methodologies that explicitly target factual robustness through activation-guided optimization.

[31] SigBERT: Combining Narrative Medical Reports and Rough Path Signature Theory for Survival Risk Estimation in Oncology

Paul Minchella,Loïc Verlingue,Stéphane Chrétien,Rémi Vaucher,Guillaume Metzler

Main category: cs.CL

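TL;DR: SigBERT is a temporal survival analysis framework that processes timestamped clinical reports and extracts geometric features from sentence embeddings to improve survival model performance.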

Details Motivation: Existing survival analysis methods struggle to handle the complexity of textual data effectively, particularly in its sequential form. Method: SigBERT applies signature extraction from rough path theory to the time series of sentence embedding coordinates, deriving geometric features for each patient. These features are then integrated into a LASSO-penalized Cox model to estimate patient-specific risk scores. Result: The model was trained and evaluated on a real-world oncology dataset from the Léon Bérard Center, achieving a C-index of 0.75 (sd 0.014) on the independent test cohort. Conclusion: SigBERT integrates sequential medical data to improve the accuracy of risk estimation, advancing narrative-based survival analysis. Abstract: Electronic medical reports (EHR) contain a vast amount of information that can be leveraged for machine learning applications in healthcare. However, existing survival analysis methods often struggle to effectively handle the complexity of textual data, particularly in its sequential form. Here, we propose SigBERT, an innovative temporal survival analysis framework designed to efficiently process a large number of clinical reports per patient. SigBERT processes timestamped medical reports by extracting and averaging word embeddings into sentence embeddings. To capture temporal dynamics from the time series of sentence embedding coordinates, we apply signature extraction from rough path theory to derive geometric features for each patient, which significantly enhance survival model performance by capturing complex temporal dynamics. These features are then integrated into a LASSO-penalized Cox model to estimate patient-specific risk scores. The model was trained and evaluated on a real-world oncology dataset from the Léon Bérard Center corpus, with a C-index score of 0.75 (sd 0.014) on the independent test cohort. SigBERT integrates sequential medical data to enhance risk estimation, advancing narrative-based survival analysis.
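
To make the signature step concrete, here is a minimal sketch of a depth-2 path signature for a short sequence of embedding vectors. It is an illustration under stated assumptions, not the authors' pipeline: the function name `signature_depth2`, the truncation depth, and the toy path are all hypothetical, and the abstract does not say which depth or library SigBERT uses.

```python
import numpy as np

def signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path of shape (T, d).

    Returns (level1, level2): level1 is the total increment, level2 the
    matrix of second-order iterated integrals.
    """
    path = np.asarray(path, dtype=float)
    d = path.shape[1]
    level1 = np.zeros(d)
    level2 = np.zeros((d, d))
    for dx in np.diff(path, axis=0):
        # Chen's identity for appending one linear segment with increment dx.
        level2 += np.outer(level1, dx) + 0.5 * np.outer(dx, dx)
        level1 += dx
    return level1, level2

# Toy 2-D "embedding trajectory" with three time steps (hypothetical data).
level1, level2 = signature_depth2([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
levy_area = 0.5 * (level2[0, 1] - level2[1, 0])
```

The flattened `level1` and `level2` entries are the kind of fixed-length geometric features that could be passed to a penalized Cox model, whatever the number of reports per patient.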

[32] A chart review process aided by natural language processing and multi-wave adaptive sampling to expedite validation of code-based algorithms for large database studies

Shirley V Wang,Georg Hahn,Sushama Kattinakere Sreedhara,Mufaddal Mahesri,Haritha S. Pillai,Rajendra Aldis,Joyce Lii,Sarah K. Dutcher,Rhoda Eniafe,Jamal T. Jones,Keewan Kim,Jiwei He,Hana Lee,Sengwee Toh,Rishi J Desai,Jie Yang

Main category: cs.CL

TL;DR: A faster and more efficient method for validating code-based algorithms in claims databases is proposed, using NLP and adaptive sampling to significantly reduce manual review time while maintaining precision.

Details Motivation: Validating measurement characteristics of code-based algorithms is crucial for robust analyses in large claims databases, but traditional methods requiring manual chart reviews are time-consuming and resource-intensive. Method: An expedited validation process using natural language processing (NLP) to reduce chart review time and a multi-wave adaptive sampling approach with a pre-defined stopping rule to achieve sufficient precision efficiently was described. Result: The NLP-assisted annotation process reduced chart review time by 40%, and the adaptive sampling approach with a stopping rule could have prevented reviewing 77% of charts with minimal impact on precision. Conclusion: The approach described can make validation of code-based algorithms more routine, improving the understanding of the reliability of findings from database studies. Abstract: Background: One of the ways to enhance analyses conducted with large claims databases is by validating the measurement characteristics of code-based algorithms used to identify health outcomes or other key study parameters of interest. These metrics can be used in quantitative bias analyses to assess the robustness of results for an inferential study given potential bias from outcome misclassification. However, extensive time and resource allocation are typically required to create reference-standard labels through manual chart review of free-text notes from linked electronic health records. Methods: We describe an expedited process that introduces efficiency in a validation study using two distinct mechanisms: 1) use of natural language processing (NLP) to reduce time spent by human reviewers to review each chart, and 2) a multi-wave adaptive sampling approach with pre-defined criteria to stop the validation study once performance characteristics are identified with sufficient precision. 
We illustrate this process in a case study that validates the performance of a claims-based outcome algorithm for intentional self-harm in patients with obesity. Results: We empirically demonstrate that the NLP-assisted annotation process reduced the time spent on review per chart by 40% and use of the pre-defined stopping rule with multi-wave samples would have prevented review of 77% of patient charts with limited compromise to precision in derived measurement characteristics. Conclusion: This approach could facilitate more routine validation of code-based algorithms used to define key study parameters, ultimately enhancing understanding of the reliability of findings derived from database studies.
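
The multi-wave idea with a pre-defined stopping rule can be sketched as follows. This is an assumption-laden illustration rather than the study's procedure: the Wilson interval, the wave size of 50, the 0.10 width threshold, and the names `wilson_interval` and `review_in_waves` are hypothetical choices, since the abstract does not state the actual criteria.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (centre - half, centre + half)

def review_in_waves(labels, wave_size=50, max_width=0.10):
    """Review charts wave by wave; stop once the interval is narrow enough.

    labels: 1 if manual review confirms the algorithm-flagged outcome, else 0.
    Returns (charts reviewed, charts confirmed, (low, high)).
    """
    reviewed, confirmed = 0, 0
    low, high = 0.0, 1.0
    while reviewed < len(labels):
        wave = labels[reviewed:reviewed + wave_size]
        confirmed += sum(wave)
        reviewed += len(wave)
        low, high = wilson_interval(confirmed, reviewed)
        if high - low <= max_width:  # hypothetical pre-defined stopping rule
            break
    return reviewed, confirmed, (low, high)

# Hypothetical review outcomes: 9 in 10 flagged charts are confirmed.
labels = [0 if i % 10 == 0 else 1 for i in range(1000)]
reviewed, confirmed, (low, high) = review_in_waves(labels)
```

In this toy run the loop stops well before all 1,000 charts are reviewed, which is the kind of saving the stopping rule is meant to deliver.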

[33] Opacity as Authority: Arbitrariness and the Preclusion of Contestation

Naomi Omeonga wa Kayembe

Main category: cs.CL

TL;DR: This paper redefines arbitrariness as a foundational functional mechanism in human systems, extending its principle beyond language to law and social dynamics, and posits it as a neutral operator central to control and care, providing a new pathway for analyzing explainability in AI systems.

Details Motivation: The paper aims to redefine arbitrariness not as a normative flaw or symptom of domination but as a semiotic trait that enables systems to operate effectively while withholding their internal rationale, diverging from critical traditions that conflate arbitrariness with injustice. Method: Building on Ferdinand de Saussure's concept of l'arbitraire du signe and drawing on Shannon's entropy model, the paper extends the principle of arbitrariness beyond language to demonstrate cross-domain applicability, particularly in law and social dynamics. Result: The paper introduces the "Motivation -> Constatability -> Contestability" chain, positing that motivation functions as a crucial interface rendering an act's logic vulnerable to intersubjective contestation, and formalizes arbitrariness as A = H(L|M) (conditional entropy). Conclusion: This paper concludes that arbitrariness is a foundational functional mechanism in human systems and interactions, acting as a neutral operator central to control and care, and it can also be used to analyze explainability in advanced artificial intelligence systems. Abstract: This article redefines arbitrariness not as a normative flaw or a symptom of domination, but as a foundational functional mechanism structuring human systems and interactions. Diverging from critical traditions that conflate arbitrariness with injustice, it posits arbitrariness as a semiotic trait: a property enabling systems - linguistic, legal, or social - to operate effectively while withholding their internal rationale. Building on Ferdinand de Saussure's concept of l'arbitraire du signe, the analysis extends this principle beyond language to demonstrate its cross-domain applicability, particularly in law and social dynamics. The paper introduces the "Motivation -> Constatability -> Contestability" chain, arguing that motivation functions as a crucial interface rendering an act's logic vulnerable to intersubjective contestation. 
When this chain is broken through mechanisms like "immotivization" or "Conflict Lateralization" (exemplified by "the blur of the wolf drowned in the fish"), acts produce binding effects without exposing their rationale, thus precluding justiciability. This structural opacity, while appearing illogical, is a deliberate design protecting authority from accountability. Drawing on Shannon's entropy model, the paper formalizes arbitrariness as A = H(L|M) (conditional entropy). It thereby proposes a modern theory of arbitrariness as a neutral operator central to control as well as care, an overlooked dimension of interpersonal relations. While primarily developed through human social systems, this framework also illuminates a new pathway for analyzing explainability in advanced artificial intelligence systems.
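
For reference, the conditional entropy named in the formalization has the standard Shannon form below. The abstract does not define L and M, so this sketch assumes only that they are discrete random variables with joint distribution p(l, m); any further reading of the symbols would be a guess.

```latex
A = H(L \mid M) = -\sum_{m}\sum_{l} p(l, m)\,\log p(l \mid m)
```

Under this definition, A = 0 when M fully determines L, and A = H(L) when L and M are independent.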

[34] C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

Chengqian Ma,Wei Tao,Yiwen Guo

Main category: cs.CL

TL;DR: The paper presents a benchmark dataset and evaluation method to address the challenges faced by Spoken Dialogue Models in understanding and emulating human conversations, particularly due to complexities inherent in spoken dialogue such as ambiguity and context-dependency.

Details Motivation: The motivation behind this research is the growing popularity of Spoken Dialogue Models (SDMs) and the lack of comprehensive research into their effectiveness in understanding and mimicking human conversations, particularly in comparison to text-based Large Language Models (LLMs). Method: The paper introduces a benchmark dataset with 1,079 instances in English and Chinese, along with an LLM-based evaluation method designed to align closely with human judgment to assess SDM performance. Result: The result is the creation of a benchmark dataset and evaluation method that allows for a comprehensive exploration of how well SDMs can handle the complexities of spoken dialogue. Conclusion: The paper concludes that SDMs face significant challenges in comprehending and emulating human conversations, especially when compared to text-based LLMs, and emphasizes the importance of the proposed benchmark dataset and evaluation method for future SDM development. Abstract: Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterograph, heteronyms, and stress patterns. Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. 
To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.

[35] Math Natural Language Inference: this should be easy!

Valeria de Paiva,Qiyue Gao,Hai Hu,Pavel Kovalev,Yikang Liu,Lawrence S. Moss,Zhiheng Qian

Main category: cs.CL

TL;DR: LLMs show mixed results in mathematical natural language inference: promising when used collectively but still limited in understanding math texts.

Details Motivation: To evaluate whether modern LLMs can effectively perform natural language inference on mathematical texts, an area not extensively explored. Method: Constructed a corpus of Math NLI pairs using human-labeled hypotheses and tested LLM performance and inter-group consistency. Also evaluated LLM-generated hypotheses. Result: Positive: Majority vote of LLMs can approximate human-labeled data. Negative: LLMs struggle with mathematical language and basic inferences, though less prone to hypothesis-only errors than previous models. Conclusion: LLMs show potential in Math NLI tasks with majority voting, but still struggle with mathematical language and basic inferences. The study provides a new corpus for future research. Abstract: We ask whether contemporary LLMs are able to perform natural language inference (NLI) tasks on mathematical texts. We call this the Math NLI problem. We construct a corpus of Math NLI pairs whose premises are from extant mathematical text and whose hypotheses and gold labels were provided by people with experience in both research-level mathematics and also in the NLI field. We also investigate the quality of corpora using the same premises but whose hypotheses are provided by LLMs themselves. We not only investigate the performance but also the inter-group consistency of the diverse group of LLMs. We have both positive and negative findings. Among our positive findings: in some settings, using a majority vote of LLMs is approximately equivalent to using human-labeled data in the Math NLI area. On the negative side: LLMs still struggle with mathematical language. They occasionally fail at even basic inferences. Current models are not as prone to hypothesis-only "inference" in our data the way the previous generation had been. In addition to our findings, we also provide our corpora as data to support future work on Math NLI.
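
The majority-vote setting mentioned in the positive findings can be sketched in a few lines. The label names and the tie handling here are assumptions for illustration; the abstract does not specify how ties or disagreements are resolved.

```python
from collections import Counter

def majority_label(votes):
    """Return the most common label, or None when empty or tied for first."""
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: leave the pair unlabeled (hypothetical policy)
    return ranked[0][0]

# Hypothetical NLI labels from three different LLMs for one premise-hypothesis pair.
label = majority_label(["entailment", "entailment", "neutral"])
```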

[36] Exploring In-Context Learning for Frame-Semantic Parsing

Diego Garat,Guillermo Moncecchi,Dina Wonsever

Main category: cs.CL

TL;DR: This paper explores using Large Language Models with In-Context Learning for Frame Semantic Parsing tasks, achieving high performance without model fine-tuning.

Details Motivation: The research aims to explore the potential of In-Context Learning with Large Language Models for Frame Semantic Parsing, eliminating the need for model fine-tuning which can be resource-intensive. Method: The paper proposes a method that automatically generates task-specific prompts for Frame Identification and Frame Semantic Role Labeling using the FrameNet database, without requiring model fine-tuning. Result: The experiments achieved F1 scores of 94.3% for Frame Identification and 77.4% for Frame Semantic Role Labeling, demonstrating competitive results. Conclusion: The study concludes that In-Context Learning with Large Language Models serves as a practical and effective alternative to traditional fine-tuning methods for domain-specific Frame Semantic Parsing tasks. Abstract: Frame Semantic Parsing (FSP) entails identifying predicates and labeling their arguments according to Frame Semantics. This paper investigates the use of In-Context Learning (ICL) with Large Language Models (LLMs) to perform FSP without model fine-tuning. We propose a method that automatically generates task-specific prompts for the Frame Identification (FI) and Frame Semantic Role Labeling (FSRL) subtasks, relying solely on the FrameNet database. These prompts, constructed from frame definitions and annotated examples, are used to guide six different LLMs. Experiments are conducted on a subset of frames related to violent events. The method achieves competitive results, with F1 scores of 94.3% for FI and 77.4% for FSRL. The findings suggest that ICL offers a practical and effective alternative to traditional fine-tuning for domain-specific FSP tasks.

[37] Context-aware Rotary Position Embedding

Ali Veisi,Delaram Fartoot,Hamidreza Amirzadeh

Main category: cs.CL

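TL;DR: This paper proposes CARoPE, a new positional encoding method that dynamically generates input-dependent frequency patterns, improving on the static patterns of conventional RoPE and strengthening the model's ability to capture context-sensitive relationships while preserving computational efficiency.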

Details Motivation: Conventional RoPE uses static, input-independent sinusoidal frequency patterns, which limits its ability to model context-sensitive relationships. A more flexible and efficient positional encoding method is therefore needed. Method: CARoPE computes input-dependent phase shifts using a bounded transformation of token embeddings and integrates them into the rotary mechanism, yielding head-specific, context-aware frequency patterns. Result: Experiments on the FineWeb-Edu-10B dataset show that CARoPE consistently outperforms RoPE and other common positional encoding baselines on next-token prediction, achieving significantly lower perplexity even at longer context lengths, with higher training throughput and no loss of model stability. Conclusion: CARoPE offers a scalable, expressive, and efficient upgrade to positional encoding in Transformer models. Abstract: Positional encoding is a vital component of Transformer architectures, enabling models to incorporate sequence order into self-attention mechanisms. Rotary Positional Embeddings (RoPE) have become a widely adopted solution due to their compatibility with relative position encoding and computational efficiency. However, RoPE relies on static, input-independent sinusoidal frequency patterns, limiting its ability to model context-sensitive relationships. In this work, we propose CARoPE (Context-Aware Rotary Positional Embedding), a novel generalization of RoPE that dynamically generates head-specific frequency patterns conditioned on token embeddings. This design introduces token- and context-sensitive positional representations while preserving RoPE efficiency and architectural simplicity. CARoPE computes input-dependent phase shifts using a bounded transformation of token embeddings and integrates them into the rotary mechanism across attention heads. We evaluate CARoPE on the FineWeb-Edu-10B dataset using GPT-2 variants trained on next-token prediction tasks. Experimental results show that CARoPE consistently outperforms RoPE and other common positional encoding baselines, achieving significantly lower perplexity, even at longer context lengths. Additionally, CARoPE enables faster training throughput without sacrificing model stability. These findings demonstrate that CARoPE offers a scalable, expressive, and efficient upgrade to existing positional encoding strategies in Transformer models.
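
The contrast between static and input-dependent rotation angles can be sketched as below. The first two functions follow the usual RoPE formulation; `context_aware_angles` is a hypothetical stand-in that adds a tanh-bounded, token-dependent phase shift, and it should not be read as the paper's actual formula, which the abstract does not give. The projection matrix `proj` and all shapes are assumptions.

```python
import numpy as np

def rope_angles(seq_len, head_dim, base=10000.0):
    """Static RoPE angles: position times a fixed per-pair frequency."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.arange(seq_len)[:, None] * inv_freq[None, :]

def apply_rotation(x, angles):
    """Rotate each (even, odd) feature pair of x by the given angles."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def context_aware_angles(token_emb, proj, base=10000.0):
    """Hypothetical variant: static angles plus a bounded, input-dependent shift."""
    seq_len = token_emb.shape[0]
    head_dim = 2 * proj.shape[1]
    shift = np.tanh(token_emb @ proj)  # one shift per token and feature pair, in (-1, 1)
    return rope_angles(seq_len, head_dim, base) + shift

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))      # (seq_len, head_dim) for one head
token_emb = rng.normal(size=(4, 16))   # (seq_len, model_dim)
proj = rng.normal(size=(16, 4))        # hypothetical per-head projection
rotated = apply_rotation(queries, context_aware_angles(token_emb, proj))
```

Because each feature pair is still only rotated, vector norms are preserved, and setting `proj` to zeros recovers the static RoPE angles.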

[38] SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity

Ishani Mondal,Meera Bharadwaj,Ayush Roy,Aparna Garimella,Jordan Lee Boyd-Graber

Main category: cs.CL

TL;DR: SMART-Editor improves global coherence in editing tasks using reward-guided methods, outperforming existing models in both structured and unstructured domains.

Details Motivation: Prior models perform local edits without preserving global coherence, which SMART-Editor aims to improve. Method: SMART-Editor uses Reward-Refine for inference-time refinement and RewardDPO for training-time preference optimization. Result: SMART-Editor achieves better performance than InstructPix2Pix and HIVE, with RewardDPO improving structured settings by 15% and Reward-Refine working well on natural images. Conclusion: SMART-Editor outperforms other models in structured and unstructured editing tasks, showing the effectiveness of reward-guided planning. Abstract: We present SMART-Editor, a framework for compositional layout and content editing across structured (posters, websites) and unstructured (natural images) domains. Unlike prior models that perform local edits, SMART-Editor preserves global coherence through two strategies: Reward-Refine, an inference-time reward-guided refinement method, and RewardDPO, a training-time preference optimization approach using reward-aligned layout pairs. To evaluate model performance, we introduce SMARTEdit-Bench, a benchmark covering multi-domain, cascading edit scenarios. SMART-Editor outperforms strong baselines like InstructPix2Pix and HIVE, with RewardDPO achieving up to 15% gains in structured settings and Reward-Refine showing advantages on natural images. Automatic and human evaluations confirm the value of reward-guided planning in producing semantically consistent and visually aligned edits.

[39] RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL

Jeffrey Eben,Aitzaz Ahmad,Stephen Lau

Main category: cs.CL

TL;DR: The paper proposes a scalable, fine-tuning-free method for improving text-to-SQL systems by leveraging metadata and semantic decomposition, enabling effective deployment in enterprise environments.

Details Motivation: The motivation is to overcome the limitations of existing LLM-based natural language interfaces for databases, which struggle with scalability in enterprise-level data catalogs and often rely on domain-specific fine-tuning while neglecting semantic context from metadata. Method: The authors propose a component-based retrieval architecture that decomposes database schemas and metadata into discrete semantic units, which are indexed separately for targeted retrieval. Their approach integrates column-level information while keeping the number of retrieved tables manageable. Result: Experiments show that the proposed method achieves high recall and accuracy, outperforming baselines on large databases with varied structures and metadata availability. Conclusion: The paper concludes that their component-based retrieval architecture effectively improves table identification and allows practical deployment of text-to-SQL systems in diverse enterprise settings without domain-specific fine-tuning. Abstract: Despite advances in large language model (LLM)-based natural language interfaces for databases, scaling to enterprise-level data catalogs remains an under-explored challenge. Prior works addressing this challenge rely on domain-specific fine-tuning - complicating deployment - and fail to leverage important semantic context contained within database metadata. To address these limitations, we introduce a component-based retrieval architecture that decomposes database schemas and metadata into discrete semantic units, each separately indexed for targeted retrieval. Our approach prioritizes effective table identification while leveraging column-level information, ensuring the total number of retrieved tables remains within a manageable context budget. Experiments demonstrate that our method maintains high recall and accuracy, with our system outperforming baselines over massive databases with varying structure and available metadata. 
Our solution enables practical text-to-SQL systems deployable across diverse enterprise settings without specialized fine-tuning, addressing a critical scalability gap in natural language database interfaces.

[40] Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity

Xinwei Wu,Haojie Li,Hongyu Liu,Xinyu Ji,Ruohan Li,Yule Chen,Yigeng Zhang

Main category: cs.CL

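TL;DR: This paper studies how large language models handle ambiguous Chinese text, finds significant fragility, and introduces a benchmark dataset for further research.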

Details Motivation: To study how large language models (LLMs) behave when faced with ambiguous narrative text, with a focus on Chinese textual ambiguity. Method: A benchmark dataset was created, and model behavior was analyzed experimentally. Result: LLMs show significant fragility when handling ambiguity: they cannot reliably distinguish ambiguous from unambiguous text, they overconfidently interpret ambiguous text as having a single meaning, and they overthink when trying to understand the multiple possible meanings. Conclusion: The study reveals a fundamental limitation of current LLMs in handling linguistic ambiguity, with significant implications for deployment in real-world applications. Abstract: In this work, we study a critical research problem regarding the trustworthiness of large language models (LLMs): how LLMs behave when encountering ambiguous narrative text, with a particular focus on Chinese textual ambiguity. We created a benchmark dataset by collecting and generating ambiguous sentences with context and their corresponding disambiguated pairs, representing multiple possible interpretations. These annotated examples are systematically categorized into 3 main categories and 9 subcategories. Through experiments, we discovered significant fragility in LLMs when handling ambiguity, revealing behavior that differs substantially from humans. Specifically, LLMs cannot reliably distinguish ambiguous text from unambiguous text, show overconfidence in interpreting ambiguous text as having a single meaning rather than multiple meanings, and exhibit overthinking when attempting to understand the various possible meanings. Our findings highlight a fundamental limitation in current LLMs that has significant implications for their deployment in real-world applications where linguistic ambiguity is common, calling for improved approaches to handle uncertainty in language understanding. The dataset and code are publicly available at this GitHub repository: https://github.com/ictup/LLM-Chinese-Textual-Disambiguation.

[41] ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans

Ananya Sadana,Yash Kumar Lal,Jiawei Zhou

Main category: cs.CL

TL;DR: ISO-Bench reveals that state-of-the-art multimodal models struggle to understand causal relationships between visual and textual data, with performance significantly below that of humans.

Details Motivation: Understanding causal relationships across different data modalities, such as vision and text, is essential for the development of more effective and intelligent multimodal models that can operate successfully in real-world environments. Method: The researchers introduced ISO-Bench, a benchmark designed to evaluate the ability of models to infer causal dependencies between visual observations and procedural text. The evaluation involved ten advanced vision-language models. Result: Results showed that the best zero-shot performance achieved an F1 score of only 0.57, and chain-of-thought reasoning improved this only modestly to 0.62 F1, significantly below human performance (0.98 F1). Conclusion: The study concludes that there is a significant gap in the causal understanding capabilities of current multimodal models compared to humans, and it highlights specific areas for improvement. Abstract: Understanding causal relationships across modalities is a core challenge for multimodal models operating in real-world environments. We introduce ISO-Bench, a benchmark for evaluating whether models can infer causal dependencies between visual observations and procedural text. Each example presents an image of a task step and a text snippet from a plan, with the goal of deciding whether the visual step occurs before or after the referenced text step. Evaluation results on ten frontier vision-language models show underwhelming performance: the best zero-shot F1 is only 0.57, and chain-of-thought reasoning yields only modest gains (up to 0.62 F1), largely behind humans (0.98 F1). Our analysis further highlights concrete directions for improving causal understanding in multimodal models.

[42] User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal

Yuhan Liu,Michael J. Q. Zhang,Eunsol Choi

Main category: cs.CL

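TL;DR: This study explores methods for harvesting implicit user feedback from user-LM interaction logs to improve model performance, particularly on short questions.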

Details Motivation: Language models deployed long-term need to evolve continuously based on user feedback, but asking users for feedback directly can be disruptive. The study therefore examines how to harvest implicit user feedback from interaction logs. Method: The authors study implicit user feedback in two user-LM interaction datasets (WildChat and LMSYS). They first analyze user feedback within user-LLM conversation trajectories, then study how to derive learning signals from this implicit feedback. Result: The contents of user feedback (e.g., the user wanted clarification), not just its polarity (e.g., the user was unhappy with the previous model response), can improve model performance on short human-designed questions (MTBench) but not on longer and more complex questions (WildBench). In addition, the usefulness of user feedback depends largely on the quality of the user's initial prompt. Conclusion: The study provides an in-depth analysis of implicit user feedback, showing its potential and limitations. Abstract: Once language models (LMs) are deployed, they can interact with users long-term, ideally evolving continuously based on their feedback. Asking for direct user feedback can be disruptive; thus, we study harvesting user feedback from user-LM interaction logs. We study implicit user feedback in two user-LM interaction datasets (WildChat and LMSYS). First, we analyze user feedback in the user-LLM conversation trajectory, providing insights into when and why such feedback occurs. Second, we study harvesting learning signals from such implicit user feedback. We find that the contents of user feedback (e.g., user wanted clarification), not just the polarity (e.g., users were unhappy with the previous model response), can improve model performance in short human-designed questions (MTBench) but not on longer and more complex questions (WildBench). We also find that the usefulness of user feedback is largely tied to the quality of the user's initial prompt. Together, we provide an in-depth study of implicit user feedback, showing its potential and limitations.

[43] LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration

Jizhou Guo

Main category: cs.CL

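TL;DR: LENS is an ensemble method that estimates model confidence by analyzing internal representations, which allows more nuanced weighting of model predictions, and it performs well on multiple-choice and boolean question-answering tasks.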

Details Motivation: Existing ensemble methods often rely on simple techniques that overlook the varying confidence and reliability of models in different contexts. Method: The paper proposes LENS, which estimates each model's confidence by training a lightweight linear confidence predictor. Result: Experiments show that LENS significantly outperforms traditional ensemble methods on multiple-choice and boolean question-answering tasks. Conclusion: LENS is an effective ensemble method that estimates model confidence from internal representations and thereby weights model predictions in a more nuanced way. Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, with different models excelling in distinct domains and specific abilities. Effectively combining the predictions of multiple LLMs is crucial for enhancing system robustness and performance. However, existing ensemble methods often rely on simple techniques like voting or logits ensembling, which overlook the varying confidence and reliability of models in different contexts. In this work, we propose LENS (Learning ENsemble confidence from Neural States), a novel approach that learns to estimate model confidence by analyzing internal representations. For each LLM, we train a lightweight linear confidence predictor that leverages layer-wise hidden states and normalized probabilities as inputs. This allows for more nuanced weighting of model predictions based on their context-dependent reliability. Our method does not require modifying the model parameters and requires negligible additional computation. Experimental results on multiple-choice and boolean question-answering tasks demonstrate that LENS outperforms traditional ensemble methods by a substantial margin. Our findings suggest that internal representations provide valuable signals for determining model confidence and can be effectively leveraged for ensemble learning.
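
To make the idea of confidence-weighted ensembling concrete, here is a minimal sketch in Python. It is not the authors' implementation: the feature vectors, the weights, and the aggregation rule (summing each model's confidence onto the answer it chose) are all assumptions made for illustration. The abstract only states that a lightweight linear predictor maps hidden states and normalized probabilities to a confidence.

```python
import math
from collections import defaultdict

def linear_confidence(features, weights, bias):
    """Logistic output of a linear probe: sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def weighted_vote(predictions, confidences):
    """Add each model's confidence to the answer it chose; return the top answer.

    This aggregation rule is an assumption, not taken from the paper.
    """
    scores = defaultdict(float)
    for answer, conf in zip(predictions, confidences):
        scores[answer] += conf
    return max(scores, key=scores.get)

# Toy data: three models, hypothetical 2-dimensional feature vectors
# standing in for hidden states and normalized probabilities.
features = [[2.0, 1.0], [-1.0, 0.0], [-0.5, -0.5]]
weights, bias = [1.0, 1.0], 0.0          # hypothetical probe parameters
predictions = ["A", "B", "B"]

confidences = [linear_confidence(f, weights, bias) for f in features]
ensemble_answer = weighted_vote(predictions, confidences)
print(ensemble_answer)
```

In this toy case a plain majority vote would pick "B", while the confidence-weighted vote picks "A" because the single confident model outweighs the two unconfident ones.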

[44] Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

Jianghui Wang,Vinay Joshi,Saptarshi Majumder,Xu Chao,Bin Ding,Ziqiong Liu,Pratik Prabhanjan Brahma,Dong Li,Zicheng Liu,Emad Barsoum

Main category: cs.CL

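TL;DR: This paper introduces GEAK, a framework that uses cutting-edge LLMs to generate efficient Triton code for better performance on AMD GPUs, and that refines code generation through inference-time compute scaling and a feedback mechanism.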

Details Motivation: As deep learning workloads grow in complexity and diversity, automating low-level kernel development becomes essential to meet performance and productivity demands. Method: The paper proposes GEAK, a framework that uses cutting-edge LLMs to generate efficient Triton code and refines generation with inference-time compute scaling and a Reflexion-style feedback mechanism. Result: On two evaluation benchmarks, GEAK significantly outperforms the baselines of directly prompting frontier LLMs and of Reflexion-based generation pipelines, achieving correctness of up to 63% and execution speedups of up to 2.59x. Conclusion: GEAK shows the promise of agentic code generation for accelerating adoption of diverse hardware platforms and democratizing access to expert-level kernel performance. Abstract: The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPUs, aiming to reduce manual optimization efforts while achieving near-expert performance on hardware like AMD MI300X. The Triton language, a Python-based DSL for GPU programming, has emerged as a popular target for such AI-generated kernels due to its balance of performance and ease-of-coding. In this work, we present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels), a framework that leverages cutting-edge LLMs to generate performant Triton code specifically for AMD GPUs, including the AMD MI300X and MI250. GEAK leverages inference-time compute scaling to produce Triton-based GPU kernels using a reasoning loop adapted from Reflexion-style feedback mechanisms. On two evaluation benchmarks, GEAK significantly outperformed the baselines of directly prompting frontier LLMs as well as Reflexion-based generation pipelines by achieving correctness up to 63% and execution speed up of up to 2.59X. These results highlight the promise of GEAK-like agentic code generation for accelerating the adoption of diverse hardware platforms and democratizing access to expert-level kernel performance.

[45] Failures Are the Stepping Stones to Success: Enhancing Few-Shot In-Context Learning by Leveraging Negative Samples

Yunhao Liang,Ruixuan Ying,Takuya Taniguchi,Zhe Cui

Main category: cs.CL

TL;DR: This paper proposes a method to improve few-shot in-context learning by utilizing Negative samples to better select Positive examples, resulting in enhanced performance compared to methods that only use Positive samples.

Details Motivation: While existing research focuses on leveraging Positive samples for in-context learning, the additional information within Negative samples is overlooked, which can be utilized to improve learning performance. Method: The method involves constructing Positive and Negative sample corpora based on Zero-Shot-Cot. During inference, similar examples are selected from both corpora based on semantic similarity to the query. Further retrieval of Positive examples is done based on semantic similarity to Negative examples, which are then concatenated to serve as ICL demonstrations. Result: Experimental results show that the proposed method outperforms approaches that rely solely on the most similar Positive examples, validating the usefulness of Negative samples in enhancing ICL performance. Conclusion: Negative samples can provide additional information to enhance the performance of few-shot in-context learning by improving the selection of Positive samples. Abstract: Large Language Models exhibit powerful few-shot in-context learning (ICL) capabilities, but the performance is highly sensitive to provided examples. Recent research has focused on retrieving corresponding examples for each input query, not only enhancing the efficiency and scalability of the learning process but also mitigating inherent biases in manual example selection. However, these studies have primarily emphasized leveraging Positive samples while overlooking the additional information within Negative samples for contextual learning. We propose a novel method that utilizes Negative samples to better select Positive sample examples, thereby enhancing the performance of few-shot ICL. Initially, we construct Positive and Negative sample corpora based on Zero-Shot-Cot. Then, during inference, we employ a semantic similarity-based approach to select the most similar examples from both the Positive and Negative corpora for a given query. 
Subsequently, we further retrieve Positive examples from the Positive sample corpus based on semantic similarity to the Negative examples, and then concatenate them with the previously selected Positive examples to serve as ICL demonstrations. Experimental results demonstrate that our approach surpasses methods solely relying on the most similar positive examples for context, validating that the additional information in negative samples aids in enhancing ICL performance through improved Positive sample selection.
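
The two-stage retrieval described in this abstract can be sketched as follows. This is an illustrative outline only: the toy embeddings, the function names, the choice of cosine similarity, and the ordering of the final demonstrations are assumptions, since the abstract does not specify them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, corpus, k, exclude=()):
    """Names of the k corpus entries most similar to query_vec, skipping excluded names."""
    candidates = [(name, vec) for name, vec in corpus.items() if name not in exclude]
    ranked = sorted(candidates, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

def select_demonstrations(query_vec, positive, negative, k=1):
    # Step 1: nearest Positive and Negative examples to the query.
    pos_direct = top_k(query_vec, positive, k)
    neg_direct = top_k(query_vec, negative, k)
    # Step 2: for each retrieved Negative, fetch the nearest unused Positive.
    pos_via_neg = []
    for name in neg_direct:
        used = set(pos_direct) | set(pos_via_neg)
        pos_via_neg += top_k(negative[name], positive, k, exclude=used)
    # Step 3: concatenate both Positive sets (the order is an assumption).
    return pos_direct + pos_via_neg

# Hypothetical 2-dimensional embeddings for a tiny corpus.
positive = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.6, 0.8]}
negative = {"n1": [0.1, 1.0], "n2": [-1.0, 0.0]}
demos = select_demonstrations([1.0, 0.2], positive, negative, k=1)
print(demos)
```

In practice the vectors would come from a sentence-embedding model, and the selected names would be replaced by the corresponding question and reasoning pairs in the prompt.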

[46] Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders

Carolina Zheng,Nicolas Beltran-Velez,Sweta Karlekar,Claudia Shi,Achille Nazaret,Asif Mallik,Amir Feder,David M. Blei

Main category: cs.CL

TL;DR: The paper introduces Mechanistic Topic Models (MTMs), which use sparse autoencoders to operate on interpretable features, enabling deeper conceptual theme revelation and controllable text generation.

Details Motivation: Traditional topic models struggle to capture semantically abstract features due to their reliance on bag-of-words representations, and some neural variants are limited by expressing topics as word lists. Method: Mechanistic Topic Models (MTMs) operate on interpretable features learned by sparse autoencoders (SAEs), defining topics over a semantically rich space. Result: MTMs match or exceed traditional and neural baselines on coherence metrics, are consistently preferred by the LLM-based evaluation framework 'topic judge', and enable effective steering of LLM outputs. Conclusion: MTMs enable controllable text generation using topic-based steering vectors and are preferred over traditional topic models in terms of coherence and evaluation metrics. Abstract: Traditional topic models are effective at uncovering latent themes in large text collections. However, due to their reliance on bag-of-words representations, they struggle to capture semantically abstract features. While some neural variants use richer representations, they are similarly constrained by expressing topics as word lists, which limits their ability to articulate complex topics. We introduce Mechanistic Topic Models (MTMs), a class of topic models that operate on interpretable features learned by sparse autoencoders (SAEs). By defining topics over this semantically rich space, MTMs can reveal deeper conceptual themes with expressive feature descriptions. Moreover, uniquely among topic models, MTMs enable controllable text generation using topic-based steering vectors. To properly evaluate MTM topics against word-list-based approaches, we propose 'topic judge', an LLM-based pairwise comparison evaluation framework. Across five datasets, MTMs match or exceed traditional and neural baselines on coherence metrics, are consistently preferred by topic judge, and enable effective steering of LLM outputs.

[47] Enabling Few-Shot Alzheimer's Disease Diagnosis on Tabular Biomarker Data with LLMs

Sophie Kearney,Shu Yang,Zixuan Wen,Bojian Hou,Duy Duong-Tran,Tianlong Chen,Jason Moore,Marylyn Ritchie,Li Shen

Main category: cs.CL

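TL;DR: This study proposes TAP-GPT, a new framework that uses TableGPT2 for Alzheimer's disease diagnosis. It builds few-shot tabular prompts from in-context learning examples drawn from structured biomedical data and fine-tunes TableGPT2; results show that it outperforms more advanced general-purpose LLMs and a tabular foundation model.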

Details Motivation: Alzheimer's disease (AD) is a complex neurodegenerative disorder whose early and accurate diagnosis requires analyzing heterogeneous biomarkers, typically represented in tabular form. With flexible few-shot reasoning, multimodal integration, and natural-language-based interpretability, large language models (LLMs) offer unprecedented opportunities for prediction with structured biomedical data. Method: The study proposes TAP-GPT, a framework built on TableGPT2 (a multimodal tabular-specialized LLM originally developed for business intelligence tasks). It constructs few-shot tabular prompts from in-context learning examples drawn from structured biomedical data and fine-tunes TableGPT2 with parameter-efficient qLoRA adaptation for binary classification of clinical AD versus cognitively normal (CN). Result: Drawing on TableGPT2's strong tabular understanding and the prior knowledge encoded in LLMs, TAP-GPT outperforms more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks on the AD diagnosis task. Conclusion: To the authors' knowledge, TAP-GPT is the first framework to apply LLMs to prediction with tabular biomarker data, paving the way for future LLM-driven multi-agent frameworks in biomedical informatics. Abstract: Early and accurate diagnosis of Alzheimer's disease (AD), a complex neurodegenerative disorder, requires analysis of heterogeneous biomarkers (e.g., neuroimaging, genetic risk factors, cognitive tests, and cerebrospinal fluid proteins) typically represented in a tabular format. With flexible few-shot reasoning, multimodal integration, and natural-language-based interpretability, large language models (LLMs) offer unprecedented opportunities for prediction with structured biomedical data. We propose a novel framework called TAP-GPT, Tabular Alzheimer's Prediction GPT, that adapts TableGPT2, a multimodal tabular-specialized LLM originally developed for business intelligence tasks, for AD diagnosis using structured biomarker data with small sample sizes. Our approach constructs few-shot tabular prompts using in-context learning examples from structured biomedical data and finetunes TableGPT2 using the parameter-efficient qLoRA adaptation for a clinical binary classification task of AD or cognitively normal (CN). The TAP-GPT framework harnesses the powerful tabular understanding ability of TableGPT2 and the encoded prior knowledge of LLMs to outperform more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks. To our knowledge, this is the first application of LLMs to the prediction task using tabular biomarker data, paving the way for future LLM-driven multi-agent frameworks in biomedical informatics.

[48] P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication

Sneha Oram,Pushpak Bhattacharyya

Main category: cs.CL

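TL;DR: The study evaluates the reasoning capabilities of large language models in the mental health domain, introduces a new dataset and tasks, and compares how models handle mental health stigma.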

Details Motivation: The explainability and dialogue-reasoning capabilities of large language models in the mental health domain have not yet been sufficiently explored. Method: The authors introduce the P-ReMe dataset, propose modified definitions of implicature and presupposition for the mental health domain, design corresponding tasks, and evaluate several models experimentally. Result: Mistral and Qwen perform well on the reasoning tasks, while Claude-3.5-haiku handles mental health stigma more responsibly. Conclusion: Claude-3.5-haiku deals with stigma in the mental health domain more effectively than the other LLMs evaluated. Abstract: There has been an increase in recent advancements in the explainability and development of personalized chatbots for mental health. However, the reasoning aspects for explainability and dialogue discourse have not been explored previously for mental health. Hence, we are investigating the pragmatic reasoning capability of large language models (LLMs) in this domain. We introduce the P-ReMe dataset, and propose a modified definition for the pragmatic phenomena of implicature (implied meaning) and presupposition (implicit assumption) in mental health. Following the definition, we formulate two tasks in implicature and one task in presupposition. To benchmark the dataset and the presented tasks, we consider four models - Llama3.1, Mistral, MentaLLaMa, and Qwen. The results of the experiments suggest that Mistral and Qwen show substantial reasoning capabilities in the domain. In addition, we also propose StiPRompts to study the stigma around mental health with the state-of-the-art LLMs, GPT-4o mini, Deepseek-chat, and Claude-3.5-haiku. Our evaluated findings show that Claude-3.5-haiku deals with the stigma more responsibly compared to the other two LLMs.

[49] Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis

Shimanto Bhowmik,Tawsif Tashwar Dipto,Md Sazzad Islam,Sheryl Hsu,Tahsin Reasat

Main category: cs.CL

TL;DR: This paper investigates challenges in Bengali NLP due to the lack of standardized benchmarks, evaluates recent LLMs, and identifies performance gaps and failure modes, emphasizing the need for better multilingual evaluation methods.

Details Motivation: Bengali is an underrepresented language in NLP research, and it presents unique challenges due to its linguistic structure and computational constraints. The lack of standardized evaluation benchmarks further hinders progress in this area. Method: The authors evaluated 10 recent open-source Large Language Models (LLMs) on 8 translated datasets and conducted a comprehensive error analysis to identify failure modes. They also analyzed the relationship between tokenization efficiency and model accuracy. Result: The study found consistent performance gaps for Bengali NLP compared to English, with certain model families performing worse. It also identified more robust architectures like DeepSeek. An inverse relationship between tokenization efficiency and LLM accuracy was observed. Conclusion: The paper concludes that there are significant performance gaps in NLP for the Bengali language compared to English, particularly in smaller models and certain model families like Mistral. It highlights the need for improved dataset quality and evaluation methodologies tailored to multilingual contexts. Abstract: Bengali is an underrepresented language in NLP research. However, it remains a challenge due to its unique linguistic structure and computational constraints. In this work, we systematically investigate the challenges that hinder Bengali NLP performance by focusing on the absence of standardized evaluation benchmarks. We then evaluated 10 recent open source Large Language Models (LLMs) in 8 of the translated datasets and performed a comprehensive error analysis to pinpoint their primary failure modes. Our findings reveal consistent performance gaps for Bengali compared to English, particularly for smaller models and specific model families like Mistral. We also identified promising robustness in certain architectures, such as DeepSeek, that maintain more stable performance across languages. 
Our analysis reveals an inverse relationship between tokenization efficiency and LLM accuracy where models tend to perform worse when inputs are excessively tokenized, whereas more efficient and concise tokenization results in improved performance. These findings highlight critical areas where current models fall short and underscore the need for improved dataset quality and evaluation methodologies tailored to multilingual contexts. This work will catalyze further research on NLP for underrepresented languages, helping to democratize access to advanced language technologies worldwide. The code and dataset used in this research are publicly available at https://github.com/BengaliAI/bn-llm-benchmark.

[50] Unveiling Super Experts in Mixture-of-Experts Large Language Models

Zunhai Su,Qingyuan Li,Hao Zhang,YuLei Qian,Yuchen Xie,Kehong Yuan

Main category: cs.CL

TL;DR: This study identifies and investigates Super Experts (SEs) in MoE-based large language models, revealing their critical role in model performance and attention mechanisms. Pruning SEs significantly degrades performance, especially in mathematical reasoning, while the findings enhance understanding of how SEs contribute to attention sinks and overall model functionality.

Details Motivation: The motivation stems from the inefficiency in existing MoE LLM compression methods, which rely on empirical criteria to identify critical experts without deeply understanding the heterogeneous importance of experts. This study aims to explore the underlying mechanisms and importance of specific experts, termed Super Experts (SEs), to improve the efficiency and performance understanding of MoE LLMs. Method: The researchers conducted a comprehensive analysis of SEs in open-source MoE LLMs, examining their activation patterns, impact on performance through pruning experiments, and role in attention mechanisms. They also assessed the effects of SE compression. Result: The study identified a distinct subset of experts, termed Super Experts (SEs), which are crucial for the forward inference mechanisms in MoE LLMs. Key findings include: (i) SEs exhibit rare but extreme activation outliers that significantly impact hidden states between decoder layers, and their distribution is model-specific. (ii) Pruning SEs leads to a substantial decline in performance, especially in mathematical reasoning tasks. (iii) SEs are essential for inducing attention sinks, which are vital for attention score distribution, and their pruning disrupts these mechanisms. Conclusion: The study concludes that Super Experts (SEs) are crucial components in MoE LLMs, significantly impacting model performance, particularly in mathematical reasoning. SEs induce attention sinks that are vital for attention score distribution, and their pruning disrupts this mechanism. Abstract: Sparsely activated Mixture-of-Experts (MoE) models have shown promise in enhancing the learning capacity of large language models (LLMs). Leveraging the intrinsic importance differences among experts, recent research has explored expert-level compression techniques to improve the efficiency of MoE LLMs. 
However, existing approaches often rely on empirical criteria to identify critical experts, lacking a deeper exploration and understanding of the heterogeneous importance of experts. In this study, we present the first discovery and investigation of a distinct subset of experts that play a crucial role in the underlying mechanisms during the model's forward inference. These experts are prevalent in open-source MoE LLMs, and despite their limited number, pruning them leads to a significant decline in model performance (e.g., pruning three causes Qwen3-30B-A3B to produce repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs. (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs remains model-specific and is unaffected by post-training processes. (ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model's overall performance, particularly in mathematical reasoning. (iii) We further enhance our understanding of the influence of SEs compression. Our findings confirm that MoE LLMs rely on SEs to induce attention sinks, which are crucial for the distribution of attention scores but are significantly disrupted by SE pruning. The code is available at https://github.com/ZunhaiSu/Super-Experts-Profilling.
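
As a rough illustration of what flagging experts with "rare but extreme activation outliers" might look like, the sketch below marks experts whose peak absolute activation far exceeds the median peak across experts. The criterion, the threshold ratio, the function name, and the toy numbers are all assumptions for illustration; the abstract does not state the authors' actual identification procedure, which is available in their repository.

```python
from statistics import median

def flag_outlier_experts(max_abs_activation, ratio=10.0):
    """Return experts whose peak |activation| exceeds `ratio` times the median peak.

    `max_abs_activation` maps an expert name to the largest absolute value
    observed in that expert's output. The median-ratio rule is a hypothetical
    criterion, not the paper's definition of a Super Expert.
    """
    baseline = median(max_abs_activation.values())
    return sorted(name for name, peak in max_abs_activation.items()
                  if peak > ratio * baseline)

# Toy peaks: one expert is two orders of magnitude above the rest.
peaks = {"expert_0": 1.2, "expert_1": 0.9, "expert_2": 250.0, "expert_3": 1.1}
flagged = flag_outlier_experts(peaks)
print(flagged)
```

A median baseline is used here rather than a mean so that the outlier itself does not inflate the reference value.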

[51] What's Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content

Alfio Ferrara,Sergio Picascia,Laura Pinnavaia,Vojimir Ranitovic,Elisabetta Rocchetti,Alice Tuveri

Main category: cs.CL

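TL;DR: The study finds that GPT-4o-mini implicitly moderates sensitive content even without explicit instructions, and it evaluates zero-shot classification of sensitive language by LLMs.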

Details Motivation: Prior research has mostly focused on explicitly training models to moderate content, while there has been little exploration of whether LLMs implicitly sanitize content without explicit instructions. Method: Through empirical analysis, the study examines the moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates zero-shot classification of sensitive content. Result: GPT-4o-mini systematically shifts content toward less sensitive classes when paraphrasing, substantially reducing derogatory and taboo language. The study also compares zero-shot LLM sensitivity classification against traditional methods. Conclusion: GPT-4o-mini exhibits implicit moderation of sensitive content, tending to lower the sensitivity class when paraphrasing; the study also assesses zero-shot sensitivity classification by LLMs. Abstract: Proprietary Large Language Models (LLMs) have shown tendencies toward politeness, formality, and implicit content moderation. While previous research has primarily focused on explicitly training models to moderate and detoxify sensitive content, there has been limited exploration of whether LLMs implicitly sanitize language without explicit instructions. This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates the extent of sensitivity shifts. Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language. Also, we evaluate the zero-shot capabilities of LLMs in classifying sentence sensitivity, comparing their performances against traditional methods.

[52] MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation

Daeyong Kwon,SeungHeon Doh,Juhan Nam

Main category: cs.CL

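TL;DR: This paper proposes MusT-RAG, a framework that uses retrieval-augmented generation to adapt general-purpose LLMs to music question answering, addressing their lack of music-specific knowledge.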

Details Motivation: Because music-specific knowledge makes up a relatively small proportion of their training data, LLMs are of limited effectiveness in music-related applications. Method: The paper proposes the MusT-RAG framework, comprising MusWikiDB, a music-specialized vector database, and the use of context information during both inference and fine-tuning. Result: MusT-RAG significantly outperforms traditional fine-tuning approaches in improving LLMs' music domain adaptation, and MusWikiDB is more effective than a general Wikipedia corpus. Conclusion: MusT-RAG effectively turns general-purpose LLMs into music-specific models and improves performance on music question answering tasks. Abstract: Recent advancements in Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. While they exhibit strong zero-shot performance on various tasks, LLMs' effectiveness in music-related applications remains limited due to the relatively small proportion of music-specific knowledge in their training data. To address this limitation, we propose MusT-RAG, a comprehensive framework based on Retrieval Augmented Generation (RAG) to adapt general-purpose LLMs for text-only music question answering (MQA) tasks. RAG is a technique that provides external knowledge to LLMs by retrieving relevant context information when generating answers to questions. To optimize RAG for the music domain, we (1) propose MusWikiDB, a music-specialized vector database for the retrieval stage, and (2) utilize context information during both inference and fine-tuning processes to effectively transform general-purpose LLMs into music-specific models. Our experiment demonstrates that MusT-RAG significantly outperforms traditional fine-tuning approaches in enhancing LLMs' music domain adaptation capabilities, showing consistent improvements across both in-domain and out-of-domain MQA benchmarks. Additionally, our MusWikiDB proves substantially more effective than general Wikipedia corpora, delivering superior performance and computational efficiency.

[53] Text-to-SQL Task-oriented Dialogue Ontology Construction

Renato Vukovic,Carel van Niekerk,Michael Heck,Benjamin Ruppik,Hsien-Chin Lin,Shutong Feng,Nurul Lubis,Milica Gasic

Main category: cs.CL

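TL;DR: TeQoDO uses large language models to automatically construct ontologies for task-oriented dialogue systems, improving explainability and performance.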

Details Motivation: The aim is to improve the explainability and controllability of LLMs in task-oriented dialogue systems without relying on manual labels or supervised training. Method: The approach uses an LLM's SQL programming capabilities together with dialogue theory provided in the prompt to construct a task-oriented dialogue ontology automatically. Result: TeQoDO performs competitively on a downstream dialogue state tracking task, its ontology construction scales to larger ontologies, and ablation studies confirm the importance of dialogue theory. Conclusion: TeQoDO is a method that autonomously builds task-oriented dialogue ontologies without supervision, improving the explainability of LLMs. Abstract: Large language models (LLMs) are widely used as general-purpose knowledge sources, but they rely on parametric knowledge, limiting explainability and trustworthiness. In task-oriented dialogue (TOD) systems, this separation is explicit, using an external database structured by an explicit ontology to ensure explainability and controllability. However, building such ontologies requires manual labels or supervised training. We introduce TeQoDO: a Text-to-SQL task-oriented Dialogue Ontology construction method. Here, an LLM autonomously builds a TOD ontology from scratch without supervision using its inherent SQL programming capabilities combined with dialogue theory provided in the prompt. We show that TeQoDO outperforms transfer learning approaches, and its constructed ontology is competitive on a downstream dialogue state tracking task. Ablation studies demonstrate the key role of dialogue theory. TeQoDO also scales to allow construction of much larger ontologies, which we investigate on a Wikipedia and ArXiv dataset. We view this as a step towards broader application of ontologies to increase LLM explainability.

[54] MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models

Yiyan Ji,Haoran Chen,Qiguang Chen,Chengyue Wu,Libo Qin,Wanxiang Che

Main category: cs.CL

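TL;DR: The paper proposes the Multimodal Planning with Complex Constraints (MPCC) benchmark for evaluating how well multimodal large language models (MLLMs) plan under complex constraints, and finds that current models perform poorly in multi-constraint scenarios.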

Details Motivation: Current benchmarks cannot directly assess multimodal real-world planning capabilities and lack explicit or implicit constraints across modalities, so a new benchmark is needed to address these challenges. Method: The paper introduces MPCC, a new benchmark evaluated through three real-world tasks (Flight Planning, Calendar Planning, and Meeting Planning). It adds complex constraints (e.g., budget, temporal, and spatial) and grades difficulty as easy, medium, or hard. Result: Experiments on 13 advanced MLLMs show that closed-source models produce feasible plans only 21.3% of the time, while open-source models average below 11%. MLLMs are also highly sensitive to constraint complexity, and traditional multimodal prompting strategies fail in multi-constraint scenarios. Conclusion: The paper underscores the importance of multimodal constrained planning, provides a rigorous evaluation framework, and highlights the need for constraint-aware reasoning in real-world MLLM applications. Abstract: Multimodal planning capabilities refer to the ability to predict, reason, and design steps for task execution with multimodal context, which is essential for complex reasoning and decision-making across multiple steps. However, current benchmarks face two key challenges: (1) they cannot directly assess multimodal real-world planning capabilities, and (2) they lack constraints or implicit constraints across modalities. To address these issues, we introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs' ability to handle multimodal constraints in planning. To address the first challenge, MPCC focuses on three real-world tasks: Flight Planning, Calendar Planning, and Meeting Planning. To solve the second challenge, we introduce complex constraints (e.g., budget, temporal, and spatial) in these tasks, with graded difficulty levels (EASY, MEDIUM, HARD) to separate constraint complexity from search space expansion. Experiments on 13 advanced MLLMs reveal significant challenges: closed-source models achieve only 21.3% feasible plans, while open-source models average below 11%. Additionally, we observe that MLLMs are highly sensitive to constraint complexity and that traditional multimodal prompting strategies fail in multi-constraint scenarios. Our work formalizes multimodal constraints in planning, provides a rigorous evaluation framework, and highlights the need for advancements in constraint-aware reasoning for real-world MLLM applications.

[55] Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models

Ailiang Lin,Zhuoyun Li,Kotaro Funakoshi

Main category: cs.CL

TL;DR: Causal2Vec enhances decoder-only LLMs for text embedding by using a lightweight pre-encoder and optimizing token pooling without altering the model or increasing computational costs.

Details Motivation: Existing methods either remove the causal attention mask in LLMs, undermining their semantic extraction ability, or rely on extra input text, increasing computational costs. This work aims to enhance LLMs' performance without altering their architecture or increasing overhead. Method: Causal2Vec employs a lightweight BERT-style model to pre-encode input text into a single Contextual token, which is then prepended to the LLM's input sequence. It also concatenates the last hidden states of the Contextual and EOS tokens to mitigate recency bias. Result: Causal2Vec achieves state-of-the-art performance on the MTEB benchmark among models trained on publicly available retrieval datasets, reducing sequence length by up to 85% and inference time by up to 82% compared to leading methods. Conclusion: Causal2Vec improves the performance of decoder-only LLMs by enhancing their ability to encode semantic information without modifying their architecture or adding significant computational costs. Abstract: Decoder-only large language models (LLMs) are increasingly used to build embedding models that effectively encode the semantic information of natural language texts into dense vector representations for various embedding tasks. However, many existing methods primarily focus on removing the causal attention mask in LLMs to enable bidirectional attention, potentially undermining the model's ability to extract semantic information acquired during pretraining. Additionally, leading unidirectional approaches often rely on extra input text to overcome the inherent limitations of causal attention, inevitably increasing computational costs. In this work, we propose Causal2Vec, a general-purpose embedding model tailored to enhance the performance of decoder-only LLMs without altering their original architectures or introducing significant computational overhead. 
Specifically, we first employ a lightweight BERT-style model to pre-encode the input text into a single Contextual token, which is then prepended to the LLM's input sequence, allowing each token to capture contextualized information even without attending to future tokens. Furthermore, to mitigate the recency bias introduced by last-token pooling and help LLMs better leverage the semantic information encoded in the Contextual token, we concatenate the last hidden states of Contextual and EOS tokens as the final text embedding. In practice, Causal2Vec achieves state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB) among models trained solely on publicly available retrieval datasets, while reducing the required sequence length by up to 85% and inference time by up to 82% compared to best-performing methods.
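
The input layout and pooling step described in this abstract can be illustrated with a small sketch. It uses plain Python lists in place of tensors, and the function names and toy vectors are hypothetical; a real implementation would run the sequence through the LLM and read the final-layer hidden states instead of reusing the inputs.

```python
def build_input(contextual_embedding, token_embeddings, eos_embedding):
    """Prepend the single Contextual token and end the sequence with EOS."""
    return [contextual_embedding] + list(token_embeddings) + [eos_embedding]

def contextual_eos_pooling(hidden_states):
    """Concatenate the states at the Contextual (first) and EOS (last) positions.

    `hidden_states` is a list of per-position vectors; the result has twice
    the hidden size.
    """
    contextual_state = hidden_states[0]
    eos_state = hidden_states[-1]
    return list(contextual_state) + list(eos_state)

# Toy example with hidden size 2. The input embeddings stand in for the
# LLM's last-layer hidden states, which is an assumption made only to keep
# the sketch self-contained.
sequence = build_input([9.0, 9.0], [[1.0, 2.0], [3.0, 4.0]], [0.0, 0.5])
embedding = contextual_eos_pooling(sequence)
print(embedding)
```

Because the Contextual token sits at position 0, every later token can attend to it under a causal mask, which is why the abstract describes it as prepended to the sequence.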

[56] Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators

Peter Sandrini

Main category: cs.CL

TL;DR: This paper explores the viability of using open-source, locally deployable language models as an alternative to cloud-based AI chatbots in translation, emphasizing benefits in data privacy and control while contributing to the broader accessibility of AI technology.

Details Motivation: The motivation behind this study is to explore alternative deployment models for AI chatbots in translation studies due to concerns regarding data privacy, security, and equitable access with commercial, cloud-based solutions. Method: The study evaluates three open-source models installed on CPU-based platforms and compares them against commercially available online chatbots, focusing on functional performance. Result: The findings indicate that local deployment of open-source models is feasible and offers advantages in terms of data privacy and control, despite introducing its own set of challenges. Conclusion: The study concludes that locally deployable, free language models can serve as a viable alternative to proprietary, cloud-based AI solutions, offering benefits such as enhanced data control, improved privacy, and reduced dependency on cloud services, contributing to the democratization of AI technology. Abstract: The rapid proliferation of Large Language Models presents both opportunities and challenges for the translation field. While commercial, cloud-based AI chatbots have garnered significant attention in translation studies, concerns regarding data privacy, security, and equitable access necessitate exploration of alternative deployment models. This paper investigates the feasibility and performance of locally deployable, free language models as a viable alternative to proprietary, cloud-based AI solutions. This study evaluates three open-source models installed on CPU-based platforms and compares them against commercially available online chatbots. The evaluation focuses on functional performance rather than a comparative analysis of human-machine translation quality, an area already subject to extensive research. The platforms assessed were chosen for their accessibility and ease of use across various operating systems. 
While local deployment introduces its own challenges, the benefits of enhanced data control, improved privacy, and reduced dependency on cloud services are compelling. The findings of this study contribute to a growing body of knowledge concerning the democratization of AI technology and inform future research and development efforts aimed at making LLMs more accessible and practical for a wider range of users, specifically focusing on the needs of individual translators and small businesses.

[57] MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization

Yongbing Zhang,Fang Nan,Shengxiang Gao,Yuxin Huang,Kaiwen Tan,Zhengtao Yu

Main category: cs.CL

TL;DR: This paper proposes MRGSEM-Sum, an unsupervised multi-document summarization framework based on multi-relational graphs and structural entropy minimization, which effectively models complex sentence relationships and generates concise, high-quality summaries without requiring labeled data.

Details Motivation: The motivation is to overcome the limitations of existing multi-document summarization methods that rely on single-relational graphs and require a predefined number of clusters, which restricts their ability to model complex relationships and reduce information redundancy effectively. Method: MRGSEM-Sum constructs a multi-relational graph integrating semantic and discourse relations between sentences and uses a two-dimensional structural entropy minimization algorithm for clustering. It also employs a position-aware compression mechanism to generate summaries. Result: Extensive experiments on four benchmark datasets show that MRGSEM-Sum consistently outperforms previous unsupervised methods and achieves performance comparable to supervised models and large language models. Human evaluation confirms the high quality of the generated summaries. Conclusion: The proposed MRGSEM-Sum framework demonstrates superior performance in multi-document summarization, outperforming previous unsupervised methods and achieving results comparable to supervised and large language models, while generating coherent and concise summaries. Abstract: The core challenge faced by multi-document summarization is the complexity of relationships among documents and the presence of information redundancy. Graph clustering is an effective paradigm for addressing this issue, as it models the complex relationships among documents using graph structures and reduces information redundancy through clustering, achieving significant research progress. However, existing methods often only consider single-relational graphs and require a predefined number of clusters, which hinders their ability to fully represent rich relational information and adaptively partition sentence groups to reduce redundancy. To overcome these limitations, we propose MRGSEM-Sum, an unsupervised multi-document summarization framework based on multi-relational graphs and structural entropy minimization. 
Specifically, we construct a multi-relational graph that integrates semantic and discourse relations between sentences, comprehensively modeling the intricate and dynamic connections among sentences across documents. We then apply a two-dimensional structural entropy minimization algorithm for clustering, automatically determining the optimal number of clusters and effectively organizing sentences into coherent groups. Finally, we introduce a position-aware compression mechanism to distill each cluster, generating concise and informative summaries. Extensive experiments on four benchmark datasets (Multi-News, DUC-2004, PubMed, and WikiSum) demonstrate that our approach consistently outperforms previous unsupervised methods and, in several cases, achieves performance comparable to supervised models and large language models. Human evaluation demonstrates that the summaries generated by MRGSEM-Sum exhibit high consistency and coverage, approaching human-level quality.
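
To make the clustering criterion concrete, here is a minimal sketch of the standard two-dimensional structural entropy of a weighted graph under a given partition. The abstract does not spell out the formula, so this uses the usual definition from the structural-information literature as an assumption; the function name and the toy graph (two triangles joined by one edge) are illustrative only, and the search over partitions is not shown.

```python
import math
from collections import defaultdict


def structural_entropy_2d(edges, partition):
    """Two-dimensional structural entropy of an undirected weighted graph.

    edges: list of (u, v, weight); partition: dict mapping node -> cluster id.
    """
    degree = defaultdict(float)
    cut = defaultdict(float)  # weight of edges leaving each cluster
    for u, v, w in edges:
        degree[u] += w
        degree[v] += w
        if partition[u] != partition[v]:
            cut[partition[u]] += w
            cut[partition[v]] += w
    total = sum(degree.values())  # graph volume, i.e. twice the total edge weight
    volume = defaultdict(float)
    for node, d in degree.items():
        volume[partition[node]] += d
    entropy = 0.0
    for node, d in degree.items():  # uncertainty of a node inside its cluster
        entropy -= (d / total) * math.log2(d / volume[partition[node]])
    for cluster, vol in volume.items():  # uncertainty of entering each cluster
        entropy -= (cut[cluster] / total) * math.log2(vol / total)
    return entropy


edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
         ("d", "e", 1), ("e", "f", 1), ("d", "f", 1), ("c", "d", 1)]
natural = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
shuffled = {"a": 0, "b": 0, "d": 0, "c": 1, "e": 1, "f": 1}
single = {node: 0 for node in natural}

h_natural = structural_entropy_2d(edges, natural)
h_shuffled = structural_entropy_2d(edges, shuffled)
h_single = structural_entropy_2d(edges, single)
```

On this toy graph the partition that follows the two triangles has lower entropy than either the shuffled partition or the single-cluster one, which is why minimizing this quantity can select the number of clusters without fixing it in advance.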

[58] Enhanced Arabic Text Retrieval with Attentive Relevance Scoring

Salah Eddine Bekhouche,Azeddine Benlamoudi,Yazid Bounab,Fadi Dornaika,Abdenour Hadid

Main category: cs.CL

TL;DR: This paper introduces an improved Dense Passage Retrieval framework for Arabic using Attentive Relevance Scoring to enhance retrieval accuracy.

Details Motivation: Arabic's complexity and underrepresentation in NLP research motivate the development of specialized frameworks for improved Arabic passage retrieval. Method: The method involves a novel Attentive Relevance Scoring (ARS) integrated with pre-trained Arabic language models to enhance retrieval accuracy. Result: The proposed method significantly increases ranking accuracy when answering Arabic questions. Conclusion: The paper concludes that the proposed enhanced DPR framework with ARS significantly improves Arabic passage retrieval performance. Abstract: Arabic poses a particular challenge for natural language processing (NLP) and information retrieval (IR) due to its complex morphology, optional diacritics and the coexistence of Modern Standard Arabic (MSA) and various dialects. Despite the growing global significance of Arabic, it is still underrepresented in NLP research and benchmark resources. In this paper, we present an enhanced Dense Passage Retrieval (DPR) framework developed specifically for Arabic. At the core of our approach is a novel Attentive Relevance Scoring (ARS) that replaces standard interaction mechanisms with an adaptive scoring function that more effectively models the semantic relevance between questions and passages. Our method integrates pre-trained Arabic language models and architectural refinements to improve retrieval performance and significantly increase ranking accuracy when answering Arabic questions. The code is publicly available at https://github.com/Bekhouche/APR.

[59] Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration

Ante Wang,Yujie Lin,Jingyao Liu,Suhang Wu,Hao Liu,Xinyan Xiao,Jinsong Su

Main category: cs.CL

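TL;DR: This work proposes a proactive critical thinking paradigm and, through new benchmarks and reinforcement learning, substantially improves AI models' reasoning and user collaboration under incomplete information.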

Details Motivation: When faced with incomplete or biased information, current AI systems often lack the ability to proactively seek information or ask for clarification, and stop at passively rejecting the query. Method: The paper introduces the proactive critical thinking paradigm, builds two new benchmarks, GSM-MC and GSM-MCE, to evaluate mathematical reasoning, and applies reinforcement learning (RL) to improve model performance. Result: Experiments show that although Qwen3 and Llama series models excel at traditional reasoning tasks, they perform poorly on proactive critical thinking tasks, especially the smaller models; with reinforcement learning, the accuracy of Qwen3-1.7B on GSM-MC rises from 0.15% to 73.98%. Conclusion: An enhanced reinforcement learning algorithm can substantially improve a model's proactive critical thinking, enabling it to collaborate more effectively with users in solving problems. Abstract: Critical thinking is essential for building robust AI systems, preventing them from blindly accepting flawed data or biased reasoning. However, prior work has primarily focused on passive critical thinking, where models simply reject problematic queries without taking constructive steps to address user requests. In this work, we introduce proactive critical thinking, a paradigm where models actively seek missing or clarifying information from users to resolve their queries better. To evaluate this capability, we present GSM-MC and GSM-MCE, two novel benchmarks based on GSM8K for assessing mathematical reasoning under incomplete or misleading conditions. GSM-MC contains 1,368 math problems with a key variable deliberately removed, requiring models to identify and request the missing information. GSM-MCE further increases the difficulty by introducing irrelevant details to test robustness against distractions. Experiments on Qwen3 and Llama series models show that, while these models excel in traditional reasoning tasks due to extensive post-training and inference-time scaling, they struggle with proactive critical thinking, especially smaller ones. However, we demonstrate that reinforcement learning (RL) can significantly improve this ability. Using our enhanced RL algorithm, we achieve substantial gains, boosting the Qwen3-1.7B's accuracy from 0.15% to 73.98% on GSM-MC. We hope this work advances models that collaborate more effectively with users in problem-solving through proactive critical thinking.

[60] Role-Aware Language Models for Secure and Contextualized Access Control in Organizations

Saeed Almheiri,Yerulan Kongrat,Adrian Santosh,Ruslan Tasmukhanov,Josemaria Vera,Muhammad Dehan Al Kautsar,Fajri Koto

Main category: cs.CL

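TL;DR: This paper examines how large language models (LLMs) can be fine-tuned so that their responses reflect the access privileges of different organizational roles, and presents three modeling strategies and two evaluation datasets.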

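Details Motivation: As large language models are widely deployed in enterprise settings, controlling model behavior according to user roles becomes essential. Existing safety methods typically assume uniform access and do not account for role-specific access restrictions. Method: The paper studies three approaches: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. It also constructs two complementary evaluation datasets, one adapted from existing instruction-tuning corpora through clustering and role labeling, and one synthetically generated to reflect realistic enterprise scenarios. Result: The paper evaluates model performance across different organizational structures and analyzes robustness to prompt injection, role mismatch, and jailbreak attempts. Conclusion: The paper verifies that large language models can be fine-tuned to reflect the access privileges of different organizational roles, and demonstrates the effectiveness and robustness of the different modeling strategies. Abstract: As large language models (LLMs) are increasingly deployed in enterprise settings, controlling model behavior based on user roles becomes an essential requirement. Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints. In this work, we investigate whether LLMs can be fine-tuned to generate responses that reflect the access privileges associated with different organizational roles. We explore three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. To evaluate these approaches, we construct two complementary datasets. The first is adapted from existing instruction-tuning corpora through clustering and role labeling, while the second is synthetically generated to reflect realistic, role-sensitive enterprise scenarios. We assess model performance across varying organizational structures and analyze robustness to prompt injection, role mismatch, and jailbreak attempts.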

[61] A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

Shirui Wang,Zhihui Tang,Huaxia Yang,Qiuhong Gong,Tiantian Gu,Hongyang Ma,Yongxin Wang,Wubin Sun,Zeliang Lian,Kehang Mao,Yinan Jiang,Zhicheng Huang,Lingyun Ma,Wenjie Shen,Yajie Ji,Yunhui Tan,Chunbo Wang,Yunlu Gao,Qianling Ye,Rui Lin,Mingyu Chen,Lijuan Niu,Zhihao Wang,Peng Yu,Mengran Lang,Yue Liu,Huimin Zhang,Haitao Shen,Long Chen,Qiguang Zhao,Si-Xuan Liu,Lina Zhou,Hua Gao,Dongqiang Ye,Lingmin Meng,Youtao Yu,Naixin Liang,Jianxiong Wu

Main category: cs.CL

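TL;DR: This study develops the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB) to evaluate the safety and effectiveness of large language models in clinical decision support, and finds that domain-specific models outperform general-purpose ones.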

Details Motivation: Large language models (LLMs) hold promise for clinical decision support but face major challenges in safety evaluation and effectiveness validation, so a standardized evaluation framework is needed. Method: The authors developed the multidimensional Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), built on clinical expert consensus, comprising 30 criteria that cover key areas such as critical illness recognition, guideline adherence, and medication safety; 32 specialist physicians developed and reviewed 2,069 open-ended Q&A items. Result: Benchmark testing of six LLMs showed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs performed better in both safety and effectiveness. Conclusion: CSEDB provides a standardized metric for evaluating the clinical application of medical LLMs, supporting comparative analysis, risk identification, and improvement directions, and promoting the safe and effective deployment of LLMs in healthcare settings. Abstract: Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria covering critical areas like critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). The findings of this study not only provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios, but also hold the potential to promote safer and more effective deployment of large language models in healthcare environments.

[62] Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning

Keer Lu,Zheng Liang,Youquan Li,Jiejun Tan,Da Pan,Shusen Zhang,Guosheng Dong,Huang Leng

Main category: cs.CL

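TL;DR: Med-R$^3$ is a new framework that optimizes the coordination of medical retrieval and reasoning through progressive reinforcement learning; it surpasses existing approaches across multiple models with notable performance gains.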

Details Motivation: Existing methods mostly optimize retrieval or reasoning in isolation, paying little attention to their joint optimization, and their reliance on supervised fine-tuning limits generalization. In addition, reward designs from general-domain reinforcement learning do not meet the needs of the medical domain. Method: The paper introduces the Med-R$^3$ framework, which uses reinforcement learning to progressively optimize the model's logical reasoning and retrieval capabilities and then optimizes them jointly. Result: Experiments show that, with Med-R$^3$, LLaMA3.1-8B-Instruct and Qwen2.5-14B surpass GPT-4o-mini by 3.93% and 13.53%, respectively. Conclusion: Through its progressive reinforcement learning framework, Med-R$^3$ effectively combines medical retrieval and reasoning and achieves strong performance in the medical domain. Abstract: In medical scenarios, effectively retrieving external knowledge and leveraging it for rigorous logical reasoning is of significant importance. Despite their potential, existing work has predominantly focused on enhancing either retrieval or reasoning capabilities of the models in isolation, with little attention given to their joint optimization, which leads to limited coordination between the two processes. Additionally, current methods rely heavily on supervised fine-tuning (SFT), which can cause models to memorize existing problem-solving pathways, thereby restricting their generalization ability when confronted with novel problem contexts. Furthermore, while some studies have explored improving retrieval-augmented reasoning in general domains via reinforcement learning, their reward function designs do not adequately capture the specific demands of the medical domain. To address these challenges, we introduce **Med-R$^3$**, a **Med**ical **R**etrieval-augmented **R**easoning framework driven by progressive **R**einforcement learning. In this framework, we first develop the model's ability to perform logical reasoning over medical problems. Subsequently, on the basis of this foundation, we adaptively optimize the retrieval capability to better align with the characteristics of the knowledge corpus and external information utilization throughout the reasoning process. Finally, we conduct joint optimization of the model's retrieval and reasoning coordination. 
Extensive experiments indicate that **Med-R$^3$** could achieve state-of-the-art performances, with LLaMA3.1-8B-Instruct + Med-R$^3$ surpassing closed-source GPT-4o-mini by 3.93% at a comparable parameter scale, while Qwen2.5-14B augmented with Med-R$^3$ shows a more substantial gain of 13.53%.

[63] T-Detect: Tail-Aware Statistical Normalization for Robust Detection of Adversarial Machine-Generated Text

Alva West,Luodan Zhang,Liuliu Zhang,Minjun Zhu,Yixuan Weng,Yue Zhang

Main category: cs.CL

TL;DR: T-Detect is a text detection method based on the Student's t-distribution that handles adversarial text more effectively.

Details Motivation: Existing zero-shot detectors typically rely on statistical measures that assume Gaussian distributions, a premise that breaks down on the heavy-tailed statistics of adversarial or non-native English text. Method: The standard Gaussian normalization is replaced with a heavy-tailed discrepancy score based on the Student's t-distribution. Result: Validated on the RAID and HART datasets, T-Detect improves AUROC by up to 3.9% and, when integrated into a two-dimensional detection framework (CT), reaches an AUROC of 0.926 on the Books domain of RAID. Conclusion: T-Detect is a statistically grounded text detection method that is robust and well suited to detecting adversarial text. Abstract: The proliferation of sophisticated text generation models necessitates the development of robust detection methods capable of identifying machine-generated content, particularly text designed to evade detection through adversarial perturbations. Existing zero-shot detectors often rely on statistical measures that implicitly assume Gaussian distributions, a premise that falters when confronted with the heavy-tailed statistical artifacts characteristic of adversarial or non-native English texts. This paper introduces T-Detect, a novel detection method that fundamentally redesigns the statistical core of curvature-based detectors. Our primary innovation is the replacement of standard Gaussian normalization with a heavy-tailed discrepancy score derived from the Student's t-distribution. This approach is theoretically grounded in the empirical observation that adversarial texts exhibit significant leptokurtosis, rendering traditional statistical assumptions inadequate. T-Detect computes a detection score by normalizing the log-likelihood of a passage against the expected moments of a t-distribution, providing superior resilience to statistical outliers. We validate our approach on the challenging RAID benchmark for adversarial text and the comprehensive HART dataset. Experiments show that T-Detect provides a consistent performance uplift over strong baselines, improving AUROC by up to 3.9% in targeted domains. When integrated into a two-dimensional detection framework (CT), our method achieves state-of-the-art performance, with an AUROC of 0.926 on the Books domain of RAID. 
Our contributions are a new, theoretically-justified statistical foundation for text detection, an ablation-validated method that demonstrates superior robustness, and a comprehensive analysis of its performance under adversarial conditions. Our code is released at https://github.com/ResearAI/t-detect.
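
As a rough illustration of the difference between Gaussian and heavy-tailed normalization, the sketch below contrasts a plain z-score with a score scaled by the standard deviation of a Student's t-distribution. This is one plausible reading of the abstract, not the released implementation: the function names, the choice of `nu=5.0`, and the exact form of the scaling are assumptions. The only fact relied on is that a t-distribution with scale σ and ν > 2 degrees of freedom has variance σ²·ν/(ν−2).

```python
import math


def gaussian_score(log_lik, ref_mean, ref_var):
    """Standard z-score of a passage's log-likelihood against reference moments."""
    return (log_lik - ref_mean) / math.sqrt(ref_var)


def t_score(log_lik, ref_mean, ref_var, nu=5.0):
    """Heavy-tailed variant: divide by the std of a Student's t with scale sqrt(ref_var).

    nu is a hypothetical degrees-of-freedom setting; it must exceed 2 for the
    variance to be finite.
    """
    if nu <= 2:
        raise ValueError("nu must be greater than 2")
    scale = math.sqrt(ref_var * nu / (nu - 2.0))
    return (log_lik - ref_mean) / scale


z = gaussian_score(-2.0, -3.0, 0.25)
t = t_score(-2.0, -3.0, 0.25, nu=5.0)
```

With these inputs the heavy-tailed score is the Gaussian score shrunk by a factor of sqrt((ν−2)/ν), so extreme log-likelihood deviations count for less than they would under a Gaussian assumption.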

[64] DiffLoRA: Differential Low-Rank Adapters for Large Language Models

Alexandre Misrahi,Nadezhda Chirkova,Maxime Louis,Vassilina Nikoulina

Main category: cs.CL

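TL;DR: This paper proposes DiffLoRA, a parameter-efficient fine-tuning method based on the differential attention mechanism that aims to combine the performance benefits of differential attention with the efficiency of LoRA. Although its results are unremarkable on most tasks, it clearly outperforms LoRA on HumanEval.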

Details Motivation: The differential attention mechanism can effectively cancel noise in Transformer models, but existing parameter-efficient fine-tuning methods such as LoRA do not exploit its potential; DiffLoRA is proposed to combine the performance benefits of differential attention with the efficiency of LoRA. Method: DiffLoRA adapts the differential attention mechanism in Transformer models with low-rank adapters, placing adapters on both the positive and negative attention terms, and is evaluated on a range of NLP tasks. Result: DiffLoRA falls short of other parameter-efficient fine-tuning methods on most evaluation tasks but scores 11 points higher than LoRA on HumanEval. Conclusion: Although DiffLoRA trails other parameter-efficient fine-tuning methods on most tasks, it stands out in certain domains such as HumanEval, which suggests potential in specific scenarios. Abstract: Differential Transformer has recently been proposed to improve performance in Transformer models by canceling out noise through a denoiser attention mechanism. In this work, we introduce DiffLoRA, a parameter-efficient adaptation of the differential attention mechanism, with low-rank adapters on both positive and negative attention terms. This approach retains the efficiency of LoRA while aiming to benefit from the performance gains of differential attention. We evaluate DiffLoRA across a broad range of NLP tasks, including general benchmarks, many-shot in-context learning, RAG, and long-context tests. We observe that, although DiffLoRA falls short of other parameter-efficient fine-tuning methods in most evaluation tasks, it shows interesting results in certain domains (+11 pts on LoRA for HumanEval). We analyze the attention patterns post-finetuning to identify the reasons for this behavior.
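
The idea of a differential attention head with low-rank adapters on both terms can be sketched in plain Python for a single query position. This is an illustrative assumption-laden sketch, not the paper's architecture: the abstract does not say which projections carry adapters or how the subtraction weight is parameterized, so here both terms share frozen base weights, each term gets its own hypothetical LoRA factors on the query and key projections, and `lam` is a fixed scalar.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def lora(W, A, B, scale=1.0):
    """Return W + scale * (B @ A) for W: d_out x d_in, B: d_out x r, A: r x d_in."""
    r = len(A)
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(len(W[0]))] for i in range(len(W))]


def attn_weights(Wq, Wk, x_query, xs):
    q = matvec(Wq, x_query)
    scores = [sum(a * b for a, b in zip(q, matvec(Wk, x))) / math.sqrt(len(q))
              for x in xs]
    return softmax(scores)


def diff_attention(x_query, xs, Wq, Wk, Wv, pos, neg, lam):
    """pos / neg: dicts of LoRA factors {"q": (A, B), "k": (A, B)}, one per term."""
    a_pos = attn_weights(lora(Wq, *pos["q"]), lora(Wk, *pos["k"]), x_query, xs)
    a_neg = attn_weights(lora(Wq, *neg["q"]), lora(Wk, *neg["k"]), x_query, xs)
    values = [matvec(Wv, x) for x in xs]
    # Differential attention: positive map minus lam times the negative map.
    return [sum((p - lam * n) * v[j] for p, n, v in zip(a_pos, a_neg, values))
            for j in range(len(values[0]))]


identity = [[1.0, 0.0], [0.0, 1.0]]
zero_init = ([[0.1, 0.2]], [[0.0], [0.0]])  # (A, B) with B = 0: adapter is a no-op
adapters = {"q": zero_init, "k": zero_init}
tokens = [[1.0, 0.0], [0.0, 1.0]]

plain = diff_attention(tokens[0], tokens, identity, identity, identity,
                       adapters, adapters, lam=0.0)
halved = diff_attention(tokens[0], tokens, identity, identity, identity,
                        adapters, adapters, lam=0.5)
```

With `lam=0.0` the head reduces to ordinary softmax attention. With zero-initialized `B` factors the two maps coincide, so any nonzero `lam` simply rescales the output until training moves the adapters apart.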

[65] Arabic Hate Speech Identification and Masking in Social Media using Deep Learning Models and Pre-trained Models Fine-tuning

Salam Thabet Doghmash,Motaz Saad

Main category: cs.CL

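TL;DR: This paper studies the identification and cleaning of hate speech in Arabic social media, achieving high detection accuracy and reasonably good cleaning results.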

Details Motivation: Identifying hate speech in social media has become increasingly important in recent years; this paper addresses the detection and cleaning of hate speech in Arabic text. Method: Hate speech detection experiments are run with deep learning models and transformers, and text cleaning is treated as a machine translation task. Result: The best hate speech detection model reaches a 92% Macro F1 score and 95% accuracy; in the text cleaning experiments, the best model reaches a 1-gram BLEU score of 0.3. Conclusion: The paper presents an effective approach to identifying and cleaning hate speech in Arabic social media, using deep learning models and a machine translation formulation to achieve a high F1 score and accuracy in detection and reasonably good text cleaning results. Abstract: Hate speech identification in social media has become an increasingly important issue in recent years. In this research, we address two problems: 1) to detect hate speech in Arabic text, 2) to clean a given text of hate speech. The meaning of cleaning here is replacing each bad word with stars based on the number of letters for each word. Regarding the first problem, we conduct several experiments using deep learning models and transformers to determine the best model in terms of the F1 score. Regarding the second problem, we consider it as a machine translation task, where the input is a sentence containing dirty text and the output is the same sentence with the dirty text masked. The presented methods achieve the best model in hate speech detection with a 92% Macro F1 score and 95% accuracy. Regarding the text cleaning experiment, the best result in the hate speech masking model reached 0.3 in BLEU score with 1-gram, which is a good result compared with the state-of-the-art machine translation systems.
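
The cleaning target described in the abstract (each bad word replaced by as many stars as it has letters) can be shown with a small sketch. The paper learns this mapping with a translation model; the lexicon lookup below is only a stand-in that produces the same kind of target string, and the function names, the English placeholder sentence, and the one-word lexicon are all hypothetical. The second function computes clipped unigram precision, the core of a 1-gram BLEU score, without the brevity penalty.

```python
from collections import Counter


def mask_offensive(sentence, lexicon):
    """Replace each lexicon word with '*' repeated once per character.

    len() counts code points, so combining diacritics would add extra stars.
    """
    return " ".join("*" * len(tok) if tok in lexicon else tok
                    for tok in sentence.split())


def unigram_precision(candidate, reference):
    """Clipped unigram precision between two whitespace-tokenized sentences."""
    cand_tokens = candidate.split()
    ref_counts = Counter(reference.split())
    cand_counts = Counter(cand_tokens)
    matched = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return matched / len(cand_tokens) if cand_tokens else 0.0


source = "you are an idiot"
target = mask_offensive(source, {"idiot"})
unmasked_overlap = unigram_precision(source, target)
```

Because most tokens are copied unchanged from input to output, an unmasked sentence already shares most of its unigrams with the masked reference, which is worth keeping in mind when reading 1-gram overlap scores for this task.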

[66] Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs

Nasim Shirvani-Mahdavi,Devin Wingfield,Amin Ghasemi,Chengkai Li

Main category: cs.CL

TL;DR: This paper explores the use of large language models to generate natural language explanations for logical rules in knowledge graphs, showing promising results in explanation quality while identifying areas for future improvement.

Details Motivation: Logical rules in knowledge graphs can enhance reasoning, error detection, and pattern discovery, but their complexity and domain-specific labeling make them hard to understand. The paper investigates how large language models can generate natural language explanations to address this issue. Method: The authors used the AMIE 3.5.1 rule discovery algorithm to extract logical rules from three datasets (FB15k-237, FB-CVT-REV, and FB+CVT-REV). They experimented with various prompting strategies (e.g., zero-shot, few-shot, chain-of-thought) to generate explanations and conducted human evaluations on correctness, clarity, and hallucination. They also explored using large language models as automatic evaluators. Result: The results show that large language models can generate explanations that are largely correct and clear, although some challenges remain, particularly regarding hallucination and consistency. Conclusion: The paper concludes that large language models show promise in generating understandable natural language explanations for logical rules in knowledge graphs, although some challenges remain. Abstract: Knowledge graphs (KGs) often contain sufficient information to support the inference of new facts. Identifying logical rules not only improves the completeness of a knowledge graph but also enables the detection of potential errors, reveals subtle data patterns, and enhances the overall capacity for reasoning and interpretation. However, the complexity of such rules, combined with the unique labeling conventions of each KG, can make them difficult for humans to understand. In this paper, we explore the potential of large language models to generate natural language explanations for logical rules. Specifically, we extract logical rules using the AMIE 3.5.1 rule discovery algorithm from the benchmark dataset FB15k-237 and two large-scale datasets, FB-CVT-REV and FB+CVT-REV. 
We examine various prompting strategies, including zero- and few-shot prompting, the inclusion of variable entity types, and chain-of-thought reasoning. We conduct a comprehensive human evaluation of the generated explanations based on correctness, clarity, and hallucination, and also assess the use of large language models as automatic judges. Our results demonstrate promising performance in terms of explanation correctness and clarity, although several challenges remain for future research. All scripts and data used in this study are publicly available at https://github.com/idirlab/KGRule2NL.

[67] Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities

Yunxiang Yan,Tomohiro Sawada,Kartik Goyal

Main category: cs.CL

TL;DR: A new cascaded question disclosure framework improves the evaluation of LLMs' problem-solving abilities by providing a more accurate and scalable method compared to traditional QA benchmarks.

Details Motivation: QA benchmark performance is an indirect method for evaluating LLMs' problem-solving capabilities, necessitating a more accurate and scalable approach. Method: The framework collects model responses in a stagewise manner, revealing partial information about the question to elicit generalized reasoning. Result: The approach provides better model comparisons, induces improved intermediate traces, narrows performance gaps observed in standard QA settings, and is validated through ablation studies. Conclusion: The proposed cascaded question disclosure framework offers a more accurate and holistic evaluation of LLMs' problem-solving capabilities compared to standard QA paradigms. Abstract: While question-answering (QA) benchmark performance is an automatic and scalable method to compare LLMs, it is an indirect method of evaluating their underlying problem-solving capabilities. Therefore, we propose a holistic and generalizable framework based on *cascaded question disclosure* that provides a more accurate estimate of the models' problem-solving capabilities while maintaining the scalability and automation. This approach collects model responses in a stagewise manner with each stage revealing partial information about the question designed to elicit generalized reasoning in LLMs. We find that our approach not only provides a better comparison between LLMs, but also induces better intermediate traces in models compared to the standard QA paradigm. We empirically verify this behavior on diverse reasoning and knowledge-heavy QA datasets by comparing LLMs of varying sizes and families. Our approach narrows the performance gap observed in the standard QA evaluation settings, indicating that the prevalent indirect QA paradigm of evaluation overestimates the differences in performance between models. We further validate our findings by extensive ablation studies.

cs.CV [Back]

[68] CHECK-MAT: Checking Hand-Written Mathematical Answers for the Russian Unified State Exam

Ruslan Khrulev

Main category: cs.CV

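TL;DR: This paper presents the EGE-Math benchmark for evaluating how well vision-language models understand students' handwritten mathematical solutions and grade them against fixed criteria.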

Details Motivation: Existing benchmarks focus mainly on problem solving and do not assess the ability to understand student solutions, identify mistakes, and assign grades according to fixed criteria. Method: The authors build EGE-Math, a new dataset of 122 solutions from the Russian Unified State Exam in mathematics, and evaluate seven modern vision-language models in three inference modes. Result: The experiments reveal the limitations of current vision-language models in mathematical reasoning and alignment with human grading rubrics. Conclusion: Current models remain limited in mathematical reasoning and in alignment with human grading rubrics, which opens new research directions in AI-assisted assessment. Abstract: This paper introduces a novel benchmark, EGE-Math Solutions Assessment Benchmark, for evaluating Vision-Language Models (VLMs) on their ability to assess hand-written mathematical solutions. Unlike existing benchmarks that focus on problem solving, our approach centres on understanding student solutions, identifying mistakes, and assigning grades according to fixed criteria. We compile 122 scanned solutions from the Russian Unified State Exam (EGE) together with official expert grades, and evaluate seven modern VLMs from Google, OpenAI, Arcee AI, and Alibaba Cloud in three inference modes. The results reveal current limitations in mathematical reasoning and human-rubric alignment, opening new research avenues in AI-assisted assessment. The code is available at https://github.com/Karifannaa/Auto-check-EGE-math.

[69] Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction

Zhensheng Yuan,Haozhi Huang,Zhen Xiong,Di Wang,Guanghua Yang

Main category: cs.CV

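TL;DR: This paper proposes an efficient reconstruction and rendering framework that addresses appearance variations across multi-view captures of urban-scale scenes, with improved efficiency and quality.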

Details Motivation: To achieve efficient, high-quality reconstruction and rendering of urban-scale scenes while handling appearance variations across multi-view captures. Method: The method starts with scene partitioning, uses a visibility-based image selection strategy to optimize training efficiency, and regulates Gaussian density through a controllable level-of-detail strategy. An appearance transformation module mitigates the effects of appearance inconsistencies, and enhancement modules such as depth regularization, scale regularization, and antialiasing improve reconstruction fidelity. Result: Experiments show that the method outperforms previous approaches in both efficiency and quality and effectively reconstructs urban-scale scenes. Conclusion: The paper presents an efficient framework that enables fast reconstruction and real-time rendering of urban-scale scenes while remaining robust to appearance variations across multi-view captures. Abstract: We present a framework that enables fast reconstruction and real-time rendering of urban-scale scenes while maintaining robustness against appearance variations across multi-view captures. Our approach begins with scene partitioning for parallel training, employing a visibility-based image selection strategy to optimize training efficiency. A controllable level-of-detail (LOD) strategy explicitly regulates Gaussian density under a user-defined budget, enabling efficient training and rendering while maintaining high visual fidelity. The appearance transformation module mitigates the negative effects of appearance inconsistencies across images while enabling flexible adjustments. Additionally, we utilize enhancement modules, such as depth regularization, scale regularization, and antialiasing, to improve reconstruction fidelity. Experimental results demonstrate that our method effectively reconstructs urban-scale scenes and outperforms previous approaches in both efficiency and quality. The source code is available at: https://yzslab.github.io/REUrbanGS.

[70] Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction

Giuseppe Cartella,Vittorio Cuculo,Alessandro D'Amelio,Marcella Cornia,Giuseppe Boccignone,Rita Cucchiara

Main category: cs.CV

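TL;DR: This paper proposes ScanDiff, a new scanpath prediction architecture that combines diffusion models with Vision Transformers; by explicitly modeling scanpath variability and introducing task-driven generation, it improves the diversity and accuracy of predictions.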

Details Motivation: Existing deep learning models for scanpath prediction generate averaged behaviors and fail to capture the variability of human visual exploration. Method: The paper proposes ScanDiff, a new architecture combining diffusion models with Vision Transformers, which explicitly models scanpath variability by exploiting the stochasticity of diffusion models and introduces textual conditioning for task-driven scanpath generation. Result: On benchmark datasets, ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, generating more diverse and accurate scanpaths. Conclusion: ScanDiff better captures the complexity of human visual behavior and advances gaze prediction research. Abstract: Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research. Source code and models are publicly available at https://aimagelab.github.io/ScanDiff.

[71] Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging

Krishan Agyakari Raja Babu,Om Prabhu,Annu,Mohanasankar Sivaprakasam

Main category: cs.CV

TL;DR: This study demonstrates that deep learning-based super-resolution techniques, especially SRResNet, can enhance low-quality echocardiographic images, significantly improving diagnostic model accuracy and making AI-assisted care more feasible in resource-constrained settings.

Details Motivation: The motivation for this study is to improve automated cardiac interpretation in resource-constrained settings where poor-quality echocardiographic imaging hampers the effectiveness of diagnostic models. Although SR techniques have been successful in enhancing MRI and CT scans, their application to echocardiography remains underexplored. Method: The researchers used deep learning-based super-resolution (SR) techniques, specifically SRGAN and SRResNet, to enhance low-quality 2D echocardiograms from the CAMUS dataset. They evaluated the impact of these techniques on two classification tasks: 2CH vs. 4CH view classification and ED vs. ES phase classification. Result: The application of SR techniques, particularly SRResNet, resulted in significant gains in classification accuracy for both tasks. SRResNet was also computationally efficient, making it especially suitable for practical deployment. Conclusion: The study concludes that super-resolution techniques, particularly SRResNet, can effectively recover diagnostic value in degraded echocardiographic images, making them a viable tool for AI-assisted care in resource-constrained settings. Abstract: Automated cardiac interpretation in resource-constrained settings (RCS) is often hindered by poor-quality echocardiographic imaging, limiting the effectiveness of downstream diagnostic models. While super-resolution (SR) techniques have shown promise in enhancing magnetic resonance imaging (MRI) and computed tomography (CT) scans, their application to echocardiography, a widely accessible but noise-prone modality, remains underexplored. In this work, we investigate the potential of deep learning-based SR to improve classification accuracy on low-quality 2D echocardiograms. Using the publicly available CAMUS dataset, we stratify samples by image quality and evaluate two clinically relevant tasks of varying complexity: a relatively simple Two-Chamber vs. Four-Chamber (2CH vs. 4CH) view classification and a more complex End-Diastole vs. End-Systole (ED vs. ES) phase classification. We apply two widely used SR models, Super-Resolution Generative Adversarial Network (SRGAN) and Super-Resolution Residual Network (SRResNet), to enhance poor-quality images and observe significant gains in performance metrics, particularly with SRResNet, which also offers computational efficiency. Our findings demonstrate that SR can effectively recover diagnostic value in degraded echo scans, making it a viable tool for AI-assisted care in RCS, achieving more with less.

[72] Adaptive Time-step Training for Enhancing Spike-Based Neural Radiance Fields

Ranxi Lin,Canming Yao,Jiayi Li,Weihang Liu,Xin Lou,Pingqiang Zhou

Main category: cs.CV

TL;DR: This paper proposes PATA, a spike-based Neural Radiance Fields (NeRF) framework with dynamic time step training, which significantly reduces computational demands and power consumption while maintaining high rendering quality.

Details Motivation: NeRF-based models are computationally expensive due to dense point sampling, limiting their use in resource-constrained scenarios. Spiking Neural Networks (SNNs) offer an energy-efficient alternative, prompting the development of a more efficient NeRF framework. Method: The authors proposed a spike-based NeRF framework with a dynamic time step training strategy called Pretrain-Adaptive Time-step Adjustment (PATA), anchored on the Instant-NGP architecture. Result: Experimental results show that PATA reduces inference time steps by 64% and running power by 61.55% while preserving rendering fidelity across diverse datasets. Conclusion: PATA enables scene-adaptive inference with variable time steps, significantly reducing computational resources and power consumption while maintaining rendering quality. Abstract: Neural Radiance Fields (NeRF)-based models have achieved remarkable success in 3D reconstruction and rendering tasks. However, during both training and inference, these models rely heavily on dense point sampling along rays from multiple viewpoints, resulting in a surge in floating-point operations and severely limiting their use in resource-constrained scenarios like edge computing. Spiking Neural Networks (SNNs), which communicate via binary spikes over discrete time steps, offer a promising alternative due to their energy-efficient nature. Given the inherent variability in scene scale and texture complexity in neural rendering and the prevailing practice of training separate models per scene, we propose a spike-based NeRF framework with a dynamic time step training strategy, termed Pretrain-Adaptive Time-step Adjustment (PATA). This approach automatically explores the trade-off between rendering quality and time step length during training. Consequently, it enables scene-adaptive inference with variable time steps and reduces the additional consumption of computational resources in the inference process. 
Anchoring to the established Instant-NGP architecture, we evaluate our method across diverse datasets. The experimental results show that PATA can preserve rendering fidelity while reducing inference time steps by 64% and running power by 61.55%.
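
To make "binary spikes over discrete time steps" concrete, here is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron. The LIF model, the `threshold` and `leak` values, and the function name are illustrative assumptions; the abstract does not say which neuron model or hyperparameters PATA uses.

```python
from typing import List, Sequence


def lif_spike_train(inputs: Sequence[float],
                    threshold: float = 1.0,
                    leak: float = 0.9) -> List[int]:
    """Simulate one leaky integrate-and-fire neuron (hypothetical helper).

    Each entry of `inputs` is the input current at one discrete time step.
    The output is a binary spike (0 or 1) for every time step.
    """
    membrane = 0.0
    spikes: List[int] = []
    for current in inputs:
        # Leak part of the previous potential, then integrate the new input.
        membrane = leak * membrane + current
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after firing (an assumption)
        else:
            spikes.append(0)
    return spikes


def firing_rate(spikes: Sequence[int]) -> float:
    """Fraction of time steps on which the neuron fired."""
    return sum(spikes) / len(spikes)


# The same constant input observed over 8 and over 3 time steps.
long_run = lif_spike_train([0.6] * 8)
short_run = lif_spike_train([0.6] * 3)
print(long_run, firing_rate(long_run))
print(short_run, firing_rate(short_run))
```

Fewer time steps mean fewer operations but a coarser firing-rate estimate, which is the kind of quality versus time-step trade-off the abstract says PATA explores during training.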

[73] Early Goal-Guided Multi-Scale Fusion for Real-Time Vision-Language Driving

Santosh Patapati,Trisanth Srinivasan

Main category: cs.CV

TL;DR: NovaDrive is a novel, efficient architecture for autonomous driving that improves performance and safety while reducing computational demands.

Details Motivation: Autonomous vehicles require rapid decision-making while interpreting complex environmental data, which current systems struggle to manage efficiently. Method: NovaDrive uses a single-branch vision-language architecture with cross-attention blocks and a smoothness loss to process multiple modalities without recurrent memory. Result: NovaDrive achieves an 84% success rate, improves path efficiency (SPL) to 0.66, and reduces collision frequency from 2.6% to 1.2% on the MD-NEX benchmark. Conclusion: NovaDrive provides a streamlined, efficient solution for autonomous driving, improving safety and efficiency while reducing computational overhead. Abstract: Autonomous vehicles must react in milliseconds while reasoning about road geometry and traffic intent to navigate complex situations. We introduce NovaDrive, a single-branch vision-language architecture that processes front-camera images, HD-map tiles, LiDAR depth, and textual waypoints in a single branch. A lightweight, two-stage cross-attention block first aligns waypoint tokens with the HD map, then refines attention over fine-grained image and depth patches. Coupled with a novel smoothness loss that discourages abrupt steering and speed changes, this design eliminates the need for recurrent memory. We fine-tune the top 15 layers of an 11B LLaMA-3.2 vision-language backbone, enabling real-time inference. On the nuScenes / Waymo subset of the MD-NEX Outdoor benchmark, NovaDrive raises success rate to 84% (+4%), boosts path-efficiency (SPL) to 0.66 (+0.11), and reduces collision frequency from 2.6% to 1.2% (-1.4%) relative to the previous state-of-the-art. Our ablations confirm that waypoint tokens, partial VLM fine-tuning, and the cross-attention fusion each contribute the most to these gains. Beyond safety, NovaDrive's shorter routes (resulting from the novel smoothness loss) translate to lower fuel or battery usage, pointing toward leaner, more easily updated driving stacks. 
NovaDrive can be extended to other embodied-AI domains as well.

[74] Reference-Guided Diffusion Inpainting For Multimodal Counterfactual Generation

Alexandru Buburuzan

Main category: cs.CV

TL;DR: This paper introduces MObI and AnydoorMed, two diffusion-based synthetic data generation methods for autonomous driving and medical imaging, enabling realistic and controllable object inpainting across multimodal data while maintaining semantic and structural coherence.

Details Motivation: Safety-critical applications like autonomous driving and medical image analysis require extensive testing with realistic and controllable data. Synthetic data generation is preferred due to the cost and complexity of acquiring real-world data, motivating the development of methods like MObI and AnydoorMed. Method: The paper proposes two novel methods, MObI for autonomous driving and AnydoorMed for medical image analysis. Both methods use diffusion-based models to achieve reference-guided inpainting, focusing on maintaining semantic consistency, structural integrity, and multimodal coherence. Result: MObI enables realistic and controllable object insertion in multimodal scenes for autonomous driving, while AnydoorMed achieves detailed and semantically coherent inpainting of anomalies in mammography scans. Both demonstrate the adaptability of foundation models to different perceptual modalities. Conclusion: The paper concludes that foundation models for reference-guided inpainting in natural images can be effectively adapted to various perceptual modalities, enabling the creation of highly realistic, controllable, and multimodal counterfactual scenarios. Abstract: Safety-critical applications, such as autonomous driving and medical image analysis, require extensive multimodal data for rigorous testing. Synthetic data methods are gaining prominence due to the cost and complexity of gathering real-world data, but they demand a high degree of realism and controllability to be useful. This work introduces two novel methods for synthetic data generation in autonomous driving and medical image analysis, namely MObI and AnydoorMed, respectively. MObI is a first-of-its-kind framework for Multimodal Object Inpainting that leverages a diffusion model to produce realistic and controllable object inpaintings across perceptual modalities, demonstrated simultaneously for camera and lidar. 
Given a single reference RGB image, MObI enables seamless object insertion into existing multimodal scenes at a specified 3D location, guided by a bounding box, while maintaining semantic consistency and multimodal coherence. Unlike traditional inpainting methods that rely solely on edit masks, this approach uses 3D bounding box conditioning to ensure accurate spatial positioning and realistic scaling. AnydoorMed extends this paradigm to the medical imaging domain, focusing on reference-guided inpainting for mammography scans. It leverages a diffusion-based model to inpaint anomalies with impressive detail preservation, maintaining the reference anomaly's structural integrity while semantically blending it with the surrounding tissue. Together, these methods demonstrate that foundation models for reference-guided inpainting in natural images can be readily adapted to diverse perceptual modalities, paving the way for the next generation of systems capable of constructing highly realistic, controllable and multimodal counterfactual scenarios.

[75] Vision-Language Fusion for Real-Time Autonomous Driving: Goal-Centered Cross-Attention of Camera, HD-Map, & Waypoints

Santosh Patapati,Trisanth Srinivasan,Murari Ambati

Main category: cs.CV

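TL;DR: XYZ-Drive is a new vision-language model that fuses multiple inputs for efficient autonomous driving, raising the success rate and lowering the collision rate.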

Details Motivation: Autonomous cars need geometric accuracy and semantic understanding to navigate complex environments, yet most systems handle the two separately. Method: The paper proposes XYZ-Drive, a single-branch vision-language model that combines a front-camera frame, an overhead map, and the next waypoint, and outputs steering and speed through a lightweight goal-centered cross-attention layer and a partially fine-tuned LLaMA-3.2 11B model. Result: On the MD-NEX Outdoor-Driving benchmark, XYZ-Drive reaches 95% success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15% and halving collisions, while significantly improving efficiency by using only a single branch. Conclusion: XYZ-Drive achieves efficient autonomous driving; early, token-level fusion of intent and map layout enables accurate, transparent, real-time driving. Abstract: Autonomous cars need geometric accuracy and semantic understanding to navigate complex environments, yet most stacks handle them separately. We present XYZ-Drive, a single vision-language model that reads a front-camera frame, a 25m × 25m overhead map, and the next waypoint, then outputs steering and speed. A lightweight goal-centered cross-attention layer lets waypoint tokens highlight relevant image and map patches, supporting both action and textual explanations, before the fused tokens enter a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark XYZ-Drive attains 95% success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15% and halving collisions, all while significantly improving efficiency by using only a single branch. Sixteen ablations explain the gains. Removing any modality (vision, waypoint, map) drops success by up to 11%, confirming their complementary roles and rich connections. Replacing goal-centered attention with simple concatenation cuts 3% in performance, showing query-based fusion injects map knowledge more effectively. Keeping the transformer frozen loses 5%, showing the importance of fine-tuning when applying VLMs for specific tasks such as autonomous driving. Coarsening map resolution from 10 cm to 40 cm blurs lane edges and raises crash rate. Overall, these results demonstrate that early, token-level fusion of intent and map layout enables accurate, transparent, real-time driving.
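
For readers unfamiliar with the SPL metric quoted above, the following sketch computes Success weighted by Path Length in its commonly used form: per episode, success times the ratio of shortest path length to the larger of the path actually taken and the shortest path, averaged over episodes. It is assumed here that the MD-NEX benchmark uses this standard definition; the function name and the episode data are made up for illustration.

```python
from typing import Sequence


def success_weighted_path_length(successes: Sequence[bool],
                                 shortest: Sequence[float],
                                 taken: Sequence[float]) -> float:
    """Average of S_i * l_i / max(p_i, l_i) over all episodes.

    successes[i]: whether episode i reached the goal
    shortest[i]:  shortest-path length l_i from start to goal
    taken[i]:     length p_i of the path the agent actually drove
    """
    if not (len(successes) == len(shortest) == len(taken)):
        raise ValueError("all inputs must have the same length")
    if len(successes) == 0:
        raise ValueError("need at least one episode")

    total = 0.0
    for success, l_i, p_i in zip(successes, shortest, taken):
        denominator = max(p_i, l_i)
        # Failed episodes (and degenerate zero-length ones) contribute 0.
        if success and denominator > 0:
            total += l_i / denominator
    return total / len(successes)


# Hypothetical episodes: one optimal run, two detours, one failure.
score = success_weighted_path_length(
    successes=[True, True, False, True],
    shortest=[10.0, 20.0, 15.0, 8.0],
    taken=[10.0, 25.0, 30.0, 16.0],
)
print(f"SPL = {score:.3f}")
```

Because every successful episode is discounted by how much longer the driven path was than the shortest one, SPL can never exceed the plain success rate, which is consistent with the reported 0.80 SPL alongside 95% success.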

[76] Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model

Dmitry Demidov,Zaigham Zaheer,Omkar Thawakar,Salman Khan,Fahad Shahbaz Khan

Main category: cs.CV

TL;DR: This paper introduces Enriched-FineR (E-FineR), a training-free method for fine-grained image classification that leverages large language models and vision-language models to enable open-set recognition without predefined labels, achieving state-of-the-art results with greater interpretability and applicability in real-world scenarios.

Details Motivation: The motivation behind this research is to overcome the limitations of traditional approaches to fine-grained image classification, such as their reliance on fixed vocabularies and closed-set classification paradigms. The researchers aim to develop a scalable and adaptable method that can handle novel classes in real-world settings without requiring predefined labels or extensive training. Method: The method involves a training-free approach called Enriched-FineR (E-FineR), which combines large language models (LLMs) with vision-language models (VLMs) to enable open-set recognition without predefined class labels. It focuses on enhancing the classification phase by thoroughly analyzing and refining the class names guessed by an LLM. Result: The result of the research is the development of the Enriched-FineR (E-FineR) method, which achieves state-of-the-art performance in fine-grained visual recognition. It demonstrates strong results in zero-shot and few-shot classification tasks without requiring training or human intervention, while also offering enhanced interpretability. Conclusion: The paper concludes that the proposed training-free method, Enriched-FineR (E-FineR), achieves state-of-the-art results in fine-grained visual recognition while offering greater interpretability and supporting a shift from rigid label prediction to flexible, language-driven understanding in real-world applications. Abstract: Fine-grained image classification, the task of distinguishing between visually similar subcategories within a broader category (e.g., bird species, car models, flower types), is a challenging computer vision problem. Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms, limiting their scalability and adaptability in real-world settings where novel classes frequently emerge. 
Recent research has demonstrated that combining large language models (LLMs) with vision-language models (VLMs) makes open-set recognition possible without the need for predefined class labels. However, the existing methods are often limited in harnessing the power of LLMs at the classification phase, and also rely heavily on the guessed class names provided by an LLM without thorough analysis and refinement. To address these bottlenecks, we propose our training-free method, Enriched-FineR (or E-FineR for short), which demonstrates state-of-the-art results in fine-grained visual recognition while also offering greater interpretability, highlighting its strong potential in real-world scenarios and new domains where expert annotations are difficult to obtain. Additionally, we demonstrate the application of our proposed approach to zero-shot and few-shot classification, where it demonstrated performance on par with the existing SOTA while being training-free and not requiring human interventions. Overall, our vocabulary-free framework supports the shift in image classification from rigid label prediction to flexible, language-driven understanding, enabling scalable and generalizable systems for real-world applications. Well-documented code is available on https://github.com/demidovd98/e-finer.

[77] Details Matter for Indoor Open-vocabulary 3D Instance Segmentation

Sanghun Jung,Jingjing Zheng,Ke Zhang,Nan Qiao,Albert Y. C. Chen,Lu Xia,Chi Liu,Yuyin Sun,Xiao Zeng,Hsiang-Wei Huang,Byron Boots,Min Sun,Cheng-Hao Kuo

Main category: cs.CV

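TL;DR: This paper proposes an open-vocabulary 3D instance segmentation method that combines several complementary concepts and refines them to address key challenges; it uses tracking-based proposal aggregation for proposal generation and Alpha-CLIP with the SMS score for classification, achieving state-of-the-art results on ScanNet200 and S3DIS.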

Details Motivation: Open-vocabulary 3D instance segmentation typically leverages vision-language models to generate and classify 3D instance proposals, but the individual concepts in existing methods are complementary rather than mutually exclusive. The paper aims to build a better solution by combining these concepts and refining them to address key challenges. Method: The paper follows a two-stage scheme: 3D proposal generation and instance classification. In the proposal generation stage, it uses robust 3D tracking-based proposal aggregation and removes overlapped or partial proposals through iterative merging/removal. In the classification stage, it replaces the standard CLIP model with Alpha-CLIP and introduces the standardized maximum similarity (SMS) score to improve classification precision. Result: The framework achieves state-of-the-art performance on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method. Conclusion: The paper presents a new open-vocabulary 3D instance segmentation solution that, by combining complementary concepts and refining them to address key challenges, achieves state-of-the-art performance on ScanNet200 and S3DIS, even surpassing an end-to-end closed-vocabulary method. Abstract: Unlike closed-vocabulary 3D instance segmentation that is often trained end-to-end, open-vocabulary 3D instance segmentation (OV-3DIS) often leverages vision-language models (VLMs) to generate 3D instance proposals and classify them. While various concepts have been proposed from existing research, we observe that these individual concepts are not mutually exclusive but complementary. In this paper, we propose a new state-of-the-art solution for OV-3DIS by carefully designing a recipe to combine the concepts together and refining them to address key challenges. Our solution follows the two-stage scheme: 3D proposal generation and instance classification. We employ robust 3D tracking-based proposal aggregation to generate 3D proposals and remove overlapped or partial proposals by iterative merging/removal. For the classification stage, we replace the standard CLIP model with Alpha-CLIP, which incorporates object masks as an alpha channel to reduce background noise and obtain object-centric representation. Additionally, we introduce the standardized maximum similarity (SMS) score to normalize text-to-proposal similarity, effectively filtering out false positives and boosting precision. Our framework achieves state-of-the-art performance on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method.
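
The abstract does not give the SMS formula, so the sketch below is only one plausible reading of "normalize text-to-proposal similarity": standardize each class's similarities across all proposals (a z-score), take each proposal's maximum standardized value, and drop proposals below a threshold. The function names, the z-score choice, and the threshold are all assumptions, not the authors' definition.

```python
import statistics
from typing import List, Sequence, Tuple


def standardized_max_similarity(
    similarity: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Hypothetical SMS-style score.

    similarity[p][c] is the raw similarity between proposal p and the text
    embedding of class c. Returns (best_class, standardized_score) per proposal.
    """
    num_classes = len(similarity[0])
    means, stds = [], []
    for c in range(num_classes):
        column = [row[c] for row in similarity]
        means.append(statistics.fmean(column))
        # Guard against a zero standard deviation for constant columns.
        stds.append(statistics.pstdev(column) or 1.0)

    results: List[Tuple[int, float]] = []
    for row in similarity:
        z = [(row[c] - means[c]) / stds[c] for c in range(num_classes)]
        best = max(range(num_classes), key=lambda c: z[c])
        results.append((best, z[best]))
    return results


def filter_proposals(scores: Sequence[Tuple[int, float]],
                     threshold: float) -> List[int]:
    """Indices of proposals whose standardized score exceeds the threshold."""
    return [i for i, (_, score) in enumerate(scores) if score > threshold]


# Made-up similarities for 3 proposals and 2 class prompts.
raw = [[0.30, 0.10], [0.20, 0.12], [0.10, 0.11]]
scores = standardized_max_similarity(raw)
kept = filter_proposals(scores, threshold=0.5)
print(scores, kept)
```

In this toy data the second class has uniformly low raw similarities, yet after standardization the second proposal still stands out for it, which illustrates why normalizing per class can matter before thresholding.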

[78] X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention

Xiaochen Zhao,Hongyi Xu,Guoxian Song,You Xie,Chenxu Zhang,Xiu Li,Linjie Luo,Jinli Suo,Yebin Liu

Main category: cs.CV

TL;DR: X-NeMo is a new zero-shot diffusion-based portrait animation method that solves critical issues like identity leakage and expression capture, providing highly expressive and identity-preserving animations.

Details Motivation: The motivation is to overcome key issues in prior portrait animation approaches, including identity leakage and the difficulty in capturing subtle and extreme facial expressions, by introducing a novel zero-shot diffusion-based method. Method: X-NeMo uses a fully end-to-end training framework that distills a 1D identity-agnostic latent motion descriptor from a driving image, employing cross-attention for motion control during image generation. It also utilizes a dual GAN decoder and spatial/color augmentations to enhance expressiveness and disentangle motion from identity cues. Result: X-NeMo outperforms state-of-the-art baselines in generating expressive animations while preserving the identity of the subject, as demonstrated by extensive experiments. Conclusion: X-NeMo effectively addresses key challenges in portrait animation, such as identity leakage and capturing subtle expressions, leading to expressive animations that maintain identity resemblance. Abstract: We propose X-NeMo, a novel zero-shot diffusion-based portrait animation pipeline that animates a static portrait using facial movements from a driving video of a different individual. Our work first identifies the root causes of the key issues in prior approaches, such as identity leakage and difficulty in capturing subtle and extreme expressions. To address these challenges, we introduce a fully end-to-end training framework that distills a 1D identity-agnostic latent motion descriptor from driving image, effectively controlling motion through cross-attention during image generation. Our implicit motion descriptor captures expressive facial motion in fine detail, learned end-to-end from a diverse video dataset without reliance on pretrained motion detectors. We further enhance expressiveness and disentangle motion latents from identity cues by supervising their learning with a dual GAN decoder, alongside spatial and color augmentations. 
By embedding the driving motion into a 1D latent vector and controlling motion via cross-attention rather than additive spatial guidance, our design eliminates the transmission of spatial-aligned structural clues from the driving condition to the diffusion backbone, substantially mitigating identity leakage. Extensive experiments demonstrate that X-NeMo surpasses state-of-the-art baselines, producing highly expressive animations with superior identity resemblance. Our code and models are available for research.

[79] Neural Multi-View Self-Calibrated Photometric Stereo without Photometric Stereo Cues

Xu Cao,Takafumi Taketomi

Main category: cs.CV

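TL;DR: This paper proposes a new neural inverse rendering method that jointly reconstructs geometry, reflectance, and lighting conditions from multi-view images without requiring light calibration or intermediate cues.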

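Details Motivation: The paper aims to remove the need for light calibration or intermediate cues (such as per-view normal maps) in traditional multi-view photometric stereo methods, jointly optimizing all scene parameters from raw images. Method: The paper represents geometry and reflectance as neural implicit fields and applies shadow-aware volume rendering. First, a spatial network predicts the signed distance and a reflectance latent code for each scene point; then, a reflectance network estimates reflectance values conditioned on the latent code and angularly encoded surface normal, view, and light directions. Result: The method jointly reconstructs geometry, spatially varying reflectance, and lighting conditions from multi-view images captured under varying directional lighting, and performs well on objects with complex geometry and reflectance. Conclusion: The proposed neural inverse rendering method outperforms existing normal-guided approaches in shape and lighting estimation accuracy, generalizes to view-unaligned multi-light images, and handles challenging geometry and reflectance. Abstract: We propose a neural inverse rendering approach that jointly reconstructs geometry, spatially varying reflectance, and lighting conditions from multi-view images captured under varying directional lighting. Unlike prior multi-view photometric stereo methods that require light calibration or intermediate cues such as per-view normal maps, our method jointly optimizes all scene parameters from raw images in a single stage. We represent both geometry and reflectance as neural implicit fields and apply shadow-aware volume rendering. A spatial network first predicts the signed distance and a reflectance latent code for each scene point. A reflectance network then estimates reflectance values conditioned on the latent code and angularly encoded surface normal, view, and light directions. The proposed method outperforms state-of-the-art normal-guided approaches in shape and lighting estimation accuracy, generalizes to view-unaligned multi-light images, and handles objects with challenging geometry and reflectance.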

[80] CNN-based solution for mango classification in agricultural environments

Beatriz Díaz Peón,Jorge Torres Gómez,Ariel Fajardo Márquez

Main category: cs.CV

TL;DR: This paper presents a fruit detection and classification system using CNNs and cascade detectors for automated mango classification, aimed at improving farm inventory management.

Details Motivation: The motivation is to develop an automated system for assessing fruit quality in farm inventory management, specifically for mango classification, to ensure accuracy and efficiency. Method: A method for mango fruit classification was developed using image processing techniques. The ResNet-18 architecture was used for classification, and a cascade detector was employed for detection, ensuring a balance between execution speed and computational resource consumption. Result: Detection and classification results were successfully displayed through a graphical interface developed in MATLAB App Designer, streamlining system interaction and demonstrating the effectiveness of the proposed method. Conclusion: The integration of CNNs and cascade detectors offers a reliable solution for fruit classification and detection with potential applications in agricultural quality control. Abstract: This article presents the design of a fruit detection and classification system using Convolutional Neural Networks (CNN). The goal is to develop a system that automatically assesses fruit quality for farm inventory management. Specifically, a method for mango fruit classification was developed using image processing, ensuring both accuracy and efficiency. ResNet-18 was selected as the preliminary architecture for classification, while a cascade detector was used for detection, balancing execution speed and computational resource consumption. Detection and classification results were displayed through a graphical interface developed in MATLAB App Designer, streamlining system interaction. The integration of convolutional neural networks and cascade detectors offers a reliable solution for fruit classification and detection, with potential applications in agricultural quality control.

[81] Single Image Rain Streak Removal Using Harris Corner Loss and R-CBAM Network

Jongwook Si,Sungyoung Kim

Main category: cs.CV

TL;DR: This study proposes an advanced network for removing rain streaks from images, yielding better performance than existing methods.

Details Motivation: The motivation is to improve single-image rain streak removal while preserving fine structural details and visual quality. Method: A novel image restoration network with a Corner Loss and a Residual Convolutional Block Attention Module (R-CBAM) was introduced to enhance restoration quality. Result: The method achieved a PSNR of 33.29 dB on the Rain100L dataset and 26.16 dB on the Rain100H dataset. Conclusion: The proposed method for single-image rain streak removal outperforms previous approaches, achieving higher PSNR on tested datasets. Abstract: The problem of single-image rain streak removal goes beyond simple noise suppression, requiring the simultaneous preservation of fine structural details and overall visual quality. In this study, we propose a novel image restoration network that effectively constrains the restoration process by introducing a Corner Loss, which prevents the loss of object boundaries and detailed texture information during restoration. Furthermore, we integrate a Residual Convolutional Block Attention Module (R-CBAM) block into the encoder and decoder to dynamically adjust the importance of features in both spatial and channel dimensions, enabling the network to focus more effectively on regions heavily affected by rain streaks. Quantitative evaluations conducted on the Rain100L and Rain100H datasets demonstrate that the proposed method significantly outperforms previous approaches, achieving a PSNR of 33.29 dB on Rain100L and 26.16 dB on Rain100H.
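
The PSNR figures above follow the standard definition, PSNR = 10 log10(MAX^2 / MSE). The sketch below computes it for two small grayscale images stored as nested lists. The 8-bit peak value of 255 and the toy pixel values are assumptions; the paper's exact evaluation protocol (color space, cropping) is not stated in the abstract.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[Sequence[float]],
         restored: Sequence[Sequence[float]],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    ref_pixels = [value for row in reference for value in row]
    res_pixels = [value for row in restored for value in row]
    if len(ref_pixels) != len(res_pixels) or not ref_pixels:
        raise ValueError("images must be non-empty and the same size")

    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref_pixels, res_pixels)) / len(ref_pixels)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x2 "clean" image and a slightly imperfect restoration of it.
clean = [[0, 255], [128, 64]]
derained = [[0, 250], [130, 64]]
print(f"PSNR = {psnr(clean, derained):.2f} dB")
```

Since the scale is logarithmic, the gap between 33.29 dB on Rain100L and 26.16 dB on Rain100H corresponds to roughly a fivefold difference in mean squared error.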

[82] Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space

Shiyao Yu,Zi-An Wang,Kangning Yin,Zheng Tian,Mingyuan Zhang,Weixin Si,Shihao Zou

Main category: cs.CV

TL;DR: This paper proposes a four-modal framework (text, audio, video, motion) for fine-grained motion retrieval using sequence-level contrastive learning, showing significant performance improvements over existing methods.

Details Motivation: Existing motion retrieval methods rely on contrastive learning for a unified embedding space but lack intuitive interaction modes and overlook sequential representation for better retrieval performance. The paper aims to address these limitations by introducing a more effective and user-friendly approach. Method: The paper proposes a framework that aligns four modalities—text, audio, video, and motion—in a fine-grained joint embedding space using sequence-level contrastive learning. They also create two multi-modal motion retrieval datasets by augmenting existing datasets with synthetic audio recordings. Result: Experimental results show superior performance over state-of-the-art methods, including a 10.16% improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in R@1 for video-to-motion retrieval on the HumanML3D dataset. The four-modal framework also outperforms its three-modal counterpart. Conclusion: The paper concludes that their proposed four-modal framework significantly improves motion retrieval performance over existing methods and highlights the potential of multi-modal approaches in motion acquisition. Abstract: Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation. Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from text or visual modality. However, these methods lack a more intuitive and user-friendly interaction mode and often overlook the sequential representation of most modalities for improved retrieval performance. To address these limitations, we propose a framework that aligns four modalities -- text, audio, video, and motion -- within a fine-grained joint embedding space, incorporating audio for the first time in motion retrieval to enhance user immersion and convenience. 
This fine-grained space is achieved through a sequence-level contrastive learning approach, which captures critical details across modalities for better alignment. To evaluate our framework, we augment existing text-motion datasets with synthetic but diverse audio recordings, creating two multi-modal motion retrieval datasets. Experimental results demonstrate superior performance over state-of-the-art methods across multiple sub-tasks, including a 10.16% improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in R@1 for video-to-motion retrieval on the HumanML3D dataset. Furthermore, our results show that our 4-modal framework significantly outperforms its 3-modal counterpart, underscoring the potential of multi-modal motion retrieval for advancing motion acquisition.
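
The R@1 and R@10 numbers above are recall-at-K retrieval metrics. The sketch below shows the usual way such a metric is computed from a query-by-gallery similarity matrix, assuming exactly one correct gallery item per query (query i matches gallery item i). The function name and the toy similarity values are illustrative; the paper's exact protocol may differ.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of queries whose true match ranks within the top k.

    similarity[i][j] scores query i (e.g. a text) against gallery item j
    (e.g. a motion); the ground-truth match for query i is item i.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = 0
    for i, row in enumerate(similarity):
        # Gallery indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy 3x3 similarity matrix between three queries and three motions.
sim = [
    [0.9, 0.1, 0.2],  # query 0: correct item ranked 1st
    [0.3, 0.2, 0.8],  # query 1: correct item ranked 3rd
    [0.1, 0.7, 0.6],  # query 2: correct item ranked 2nd
]
for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(sim, k):.3f}")
```

R@1 rewards only exact top-ranked matches, while R@10 tolerates near misses, so the two reported gains measure different aspects of retrieval quality.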

[83] A Novel Dataset for Flood Detection Robust to Seasonal Changes in Satellite Imagery

Youngsun Jang,Dongyoun Kim,Chulwoo Pack,Kwanghee Won

Main category: cs.CV

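TL;DR: This paper introduces a new dataset for segmenting flooded areas in satellite images and finds that existing models achieve only modest performance on the task, indicating a need for further research on multimodal and temporal learning methods.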

Details Motivation: After reviewing 77 existing benchmarks that use satellite imagery, the authors found a shortage of datasets suitable for flooded-area segmentation and propose this new dataset to fill the gap. Method: The authors build a new flood segmentation dataset from Planet Labs satellite imagery of the 2019 Midwestern USA floods, selecting ten locations from each of five states, with 10 satellite images per location covering both flooded and non-flooded areas, and preprocessing the data to a uniform resolution and size. They also test state-of-the-art models from computer vision and remote sensing and run an ablation study with varying window sizes to capture temporal characteristics. Result: Experiments show that existing state-of-the-art models achieve only modest semantic segmentation performance on the dataset, indicating a need for further research on multimodal and temporal learning methods. Conclusion: Current state-of-the-art models perform only modestly on the newly introduced flood segmentation dataset, suggesting that multimodal and temporal learning strategies are needed to improve performance. Abstract: This study introduces a novel dataset for segmenting flooded areas in satellite images. After reviewing 77 existing benchmarks utilizing satellite imagery, we identified a shortage of suitable datasets for this specific task. To fill this gap, we collected satellite imagery of the 2019 Midwestern USA floods from Planet Explorer by Planet Labs (Image © 2024 Planet Labs PBC). The dataset consists of 10 satellite images per location, each containing both flooded and non-flooded areas. We selected ten locations from each of the five states: Iowa, Kansas, Montana, Nebraska, and South Dakota. The dataset ensures uniform resolution and resizing during data processing. For evaluating semantic segmentation performance, we tested state-of-the-art models in computer vision and remote sensing on our dataset. Additionally, we conducted an ablation study varying window sizes to capture temporal characteristics. Overall, the models demonstrated modest results, suggesting a requirement for future multimodal and temporal learning strategies. The dataset will be publicly available on .

[84] Adversarial-Guided Diffusion for Multimodal LLM Attacks

Chengwei Xia,Fan Ma,Ruijie Quan,Kun Zhan,Yi Yang

Main category: cs.CV

TL;DR: This paper introduces AGD, a new method for generating adversarial images using diffusion models to effectively attack MLLMs while remaining robust to defenses.

Details Motivation: The motivation is to create an effective adversarial attack on multimodal large language models (MLLMs) using diffusion models, while maintaining the visual integrity of the original image and resisting common defenses. Method: The paper proposes an adversarial-guided diffusion (AGD) approach, which introduces adversarial-guided noise into the reverse diffusion process to generate adversarial images that can deceive MLLMs. Result: Extensive experiments show that AGD achieves superior attack performance and robustness against various defenses compared to state-of-the-art methods. Conclusion: The paper concludes that AGD is an effective and robust method for attacking MLLMs, outperforming existing methods in both attack performance and resistance to defenses. Abstract: This paper addresses the challenge of generating adversarial images using a diffusion model to deceive multimodal large language models (MLLMs) into generating the targeted responses, while avoiding significant distortion of the clean image. To address the above challenges, we propose an adversarial-guided diffusion (AGD) approach for adversarial attacks on MLLMs. We introduce adversarial-guided noise to ensure attack efficacy. A key observation in our design is that, unlike most traditional adversarial attacks which embed high-frequency perturbations directly into the clean image, AGD injects target semantics into the noise component of the reverse diffusion. Since the added noise in a diffusion model spans the entire frequency spectrum, the adversarial signal embedded within it also inherits this full-spectrum property. Importantly, during reverse diffusion, the adversarial image is formed as a linear combination of the clean image and the noise. Thus, when applying defenses such as simple low-pass filtering, which act independently on each component, the adversarial image within the noise component is less likely to be suppressed, as it is not confined to the high-frequency band. 
This makes AGD inherently robust to a variety of defenses. Extensive experiments demonstrate that our AGD outperforms state-of-the-art methods in attack performance as well as in model robustness to some defenses.
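
The robustness argument in this abstract rests on linearity: a low-pass filter applied to a weighted sum of a clean signal and noise gives the same result as filtering each component separately and then summing. The following is a minimal numerical check of that property only, not the AGD method itself. The moving-average filter, the 1-D signals, and the weights `a` and `b` are illustrative assumptions that do not come from the paper.

```python
import math
import random


def box_blur(signal, width=5):
    """Moving-average low-pass filter (an illustrative choice of filter)."""
    half = width // 2
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def max_linearity_gap(n=128, a=0.8, b=0.6, seed=0):
    """Largest difference between filter(a*clean + b*noise) and
    a*filter(clean) + b*filter(noise) over all positions."""
    rng = random.Random(seed)
    clean = [math.sin(4.0 * math.pi * i / n) for i in range(n)]  # stand-in "clean image"
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]              # stand-in "noise component"
    mixed = [a * c + b * z for c, z in zip(clean, noise)]

    filtered_mix = box_blur(mixed)
    filtered_clean, filtered_noise = box_blur(clean), box_blur(noise)
    recombined = [a * c + b * z for c, z in zip(filtered_clean, filtered_noise)]
    return max(abs(x - y) for x, y in zip(filtered_mix, recombined))
```

Calling `max_linearity_gap()` should return a value at the level of floating-point rounding, which is what "acting independently on each component" means for a linear filter.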

[85] Confidence-aware agglomeration classification and segmentation of 2D microscopic food crystal images

Xiaoyu Ji,Ali Shakouri,Fengqing Zhu

Main category: cs.CV

TL;DR: The paper proposes a method for improved classification and segmentation of food crystal agglomeration in 2D microscopic images, using a supervised baseline model, an instance classification model, and a post-processing module.

Details Motivation: Food crystal agglomeration affects food product quality, and manual annotation of agglomeration in 2D microscopic images is challenging due to the transparency of water bonding and limited perspective. This necessitates an automated solution. Method: The method involves a supervised baseline model for generating segmentation pseudo-labels, an instance classification model that performs pixel-wise segmentation, and a post-processing module to preserve crystal properties. Both models are used in the inference stage to combine their respective strengths in classification and segmentation. Result: The method improves true positive agglomeration classification accuracy and size distribution predictions compared to other existing methods. It is evaluated under two confidence levels of manual annotations and successfully classifies potential agglomerated instances. Conclusion: The proposed method successfully classifies potential agglomerated instances and improves true positive agglomeration classification accuracy and size distribution predictions compared to other existing methods. Abstract: Food crystal agglomeration is a phenomenon that occurs during crystallization, trapping water between crystals and affecting food product quality. Manual annotation of agglomeration in 2D microscopic images is particularly difficult due to the transparency of water bonding and the limited perspective focusing on a single slide of the imaged sample. To address this challenge, we first propose a supervised baseline model to generate segmentation pseudo-labels for the coarsely labeled classification dataset. Next, an instance classification model that simultaneously performs pixel-wise segmentation is trained. Both models are used in the inference stage to combine their respective strengths in classification and segmentation. To preserve crystal properties, a post-processing module is designed and included in both steps. 
Our method improves true positive agglomeration classification accuracy and size distribution predictions compared to other existing methods. Given the variability in confidence levels of manual annotations, our proposed method is evaluated under two confidence levels and successfully classifies potential agglomerated instances.

[86] YOLO-ROC: A High-Precision and Ultra-Lightweight Model for Real-Time Road Damage Detection

Zicheng Lin,Weichao Pan

Main category: cs.CV

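TL;DR: This paper proposes YOLO-ROC, a high-precision, lightweight road damage detection model that significantly improves small-target detection performance and computational efficiency through improved modules and optimization strategies.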

Details Motivation: Existing deep learning models for road damage detection suffer from inadequate multi-scale feature extraction, high miss rates for small targets, and high computational complexity, which limit their practical deployment and application. Method: The authors design a BMS-SPPF module to enhance multi-scale feature extraction and adopt a hierarchical channel compression strategy to reduce computational complexity. Result: On the RDD2022_China_Drone dataset, YOLO-ROC reaches a mAP50 of 67.6%, 2.11% higher than YOLOv8n; the mAP50 for the small-target D40 category improves by 16.8%; the model size is only 2.0 MB; and the parameter count and GFLOPs are significantly reduced. Conclusion: YOLO-ROC achieves high-precision, lightweight road damage detection, improves small-target detection through its improved modules and strategies, and shows excellent generalization ability. Abstract: Road damage detection is a critical task for ensuring traffic safety and maintaining infrastructure integrity. While deep learning-based detection methods are now widely adopted, they still face two core challenges: first, the inadequate multi-scale feature extraction capabilities of existing networks for diverse targets like cracks and potholes, leading to high miss rates for small-scale damage; and second, the substantial parameter counts and computational demands of mainstream models, which hinder their deployment for efficient, real-time detection in practical applications. To address these issues, this paper proposes a high-precision and lightweight model, YOLO - Road Orthogonal Compact (YOLO-ROC). We designed a Bidirectional Multi-scale Spatial Pyramid Pooling Fast (BMS-SPPF) module to enhance multi-scale feature extraction and implemented a hierarchical channel compression strategy to reduce computational complexity. The BMS-SPPF module leverages a bidirectional spatial-channel attention mechanism to improve the detection of small targets. Concurrently, the channel compression strategy reduces the parameter count from 3.01M to 0.89M and GFLOPs from 8.1 to 2.6. Experiments on the RDD2022_China_Drone dataset demonstrate that YOLO-ROC achieves a mAP50 of 67.6%, surpassing the baseline YOLOv8n by 2.11%. Notably, the mAP50 for the small-target D40 category improved by 16.8%, and the final model size is only 2.0 MB. Furthermore, the model exhibits excellent generalization performance on the RDD2022_China_Motorbike dataset.

[87] Toward Safe, Trustworthy and Realistic Augmented Reality User Experience

Yanming Xiu

Main category: cs.CV

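TL;DR: This paper studies the safety and trustworthiness of virtual content in augmented reality (AR), presenting systems for detecting harmful AR content along with future research directions.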

Details Motivation: As augmented reality (AR) becomes increasingly integrated into everyday life, ensuring the safety and trustworthiness of its virtual content is critical, particularly for task-detrimental AR content that obstructs critical information or subtly manipulates user perception. Method: Two systems, ViDDAR and VIM-Sense, were developed to detect attacks in AR content using vision-language models (VLMs) and multimodal reasoning modules. Result: Three future research directions are proposed: automated, perceptually aligned quality assessment of virtual content; detection of multimodal attacks; and adaptation of VLMs for efficient and user-centered deployment on AR devices. Conclusion: The research aims to establish a scalable, human-aligned framework for safeguarding AR experiences and seeks feedback on perceptual modeling, multimodal AR content implementation, and lightweight model adaptation. Abstract: As augmented reality (AR) becomes increasingly integrated into everyday life, ensuring the safety and trustworthiness of its virtual content is critical. Our research addresses the risks of task-detrimental AR content, particularly that which obstructs critical information or subtly manipulates user perception. We developed two systems, ViDDAR and VIM-Sense, to detect such attacks using vision-language models (VLMs) and multimodal reasoning modules. Building on this foundation, we propose three future directions: automated, perceptually aligned quality assessment of virtual content; detection of multimodal attacks; and adaptation of VLMs for efficient and user-centered deployment on AR devices. Overall, our work aims to establish a scalable, human-aligned framework for safeguarding AR experiences and seeks feedback on perceptual modeling, multimodal AR content implementation, and lightweight model adaptation.

[88] Ambiguity-Guided Learnable Distribution Calibration for Semi-Supervised Few-Shot Class-Incremental Learning

Fan Lyu,Linglan Zhao,Chengyan Liu,Yinying Mei,Zhang Zhang,Jian Zhang,Fuyuan Hu,Liang Wang

Main category: cs.CV

TL;DR: This paper introduces Generalized Semi-FSCIL (GSemi-FSCIL), where unlabeled data includes both base and all previously seen novel classes. It proposes an Ambiguity-guided Learnable Distribution Calibration (ALDC) method to address the resulting challenges, achieving superior performance over existing approaches.

Details Motivation: Current Semi-FSCIL methods rely on a narrow assumption about unlabeled data being from only the current session's novel classes, which does not align with real-world scenarios. This work redefines the problem as Generalized Semi-FSCIL (GSemi-FSCIL) to better reflect practical settings. Method: An Ambiguity-guided Learnable Distribution Calibration (ALDC) strategy is proposed to dynamically correct biased feature distributions for few-shot novel classes using abundant base samples. Result: Experiments on three benchmark datasets demonstrate that the proposed ALDC method outperforms existing approaches and sets new state-of-the-art results. Conclusion: The proposed ALDC strategy effectively addresses the challenges in GSemi-FSCIL by leveraging base samples to improve performance on few-shot novel classes, achieving state-of-the-art results. Abstract: Few-Shot Class-Incremental Learning (FSCIL) focuses on models learning new concepts from limited data while retaining knowledge of previous classes. Recently, many studies have started to leverage unlabeled samples to assist models in learning from few-shot samples, giving rise to the field of Semi-supervised Few-shot Class-Incremental Learning (Semi-FSCIL). However, these studies often assume that the source of unlabeled data is only confined to novel classes of the current session, which presents a narrow perspective and cannot align well with practical scenarios. To better reflect real-world scenarios, we redefine Semi-FSCIL as Generalized Semi-FSCIL (GSemi-FSCIL) by incorporating both base and all the ever-seen novel classes in the unlabeled set. This change in the composition of unlabeled samples poses a new challenge for existing methods, as they struggle to distinguish between unlabeled samples from base and novel classes. To address this issue, we propose an Ambiguity-guided Learnable Distribution Calibration (ALDC) strategy. 
ALDC dynamically uses abundant base samples to correct biased feature distributions for few-shot novel classes. Experiments on three benchmark datasets show that our method outperforms existing works, setting new state-of-the-art results.

[89] Generalized Reinforcement Learning for Retriever-Specific Query Rewriter with Unstructured Real-World Documents

Sungguk Cha,DongWook Kim,Taeseung Hahn,Mintae Kim,Youngsub Han,Byoung-Ki Jeon

Main category: cs.CV

TL;DR: RL-QR is a reinforcement learning-based method for query rewriting in RAG systems that improves retrieval performance without requiring annotated data, showing promise for real-world applications but still facing challenges in semantic retrieval contexts.

Details Motivation: Effective query formulation is crucial for RAG systems, but optimizing queries for diverse, unstructured documents is challenging, especially without annotated data. Method: RL-QR uses a reinforcement learning framework with synthesized scenario-question pairs and Generalized Reward Policy Optimization (GRPO) to train retriever-specific query rewriters without human-annotated datasets. Result: Experiments showed significant improvements in retrieval performance, with RL-QR achieving an 11% relative gain in NDCG@3 for multi-modal RAG and a 9% gain for lexical retrievers, though no improvements were seen for semantic and hybrid retrievers. Conclusion: RL-QR has the potential to revolutionize query optimization for RAG systems by providing a scalable, annotation-free solution for real-world retrieval tasks, although further refinement is needed for semantic retrieval contexts. Abstract: Retrieval-Augmented Generation (RAG) systems rely heavily on effective query formulation to unlock external knowledge, yet optimizing queries for diverse, unstructured real-world documents remains a challenge. We introduce RL-QR, a reinforcement learning framework for retriever-specific query rewriting that eliminates the need for human-annotated datasets and extends applicability to both text-only and multi-modal databases. By synthesizing scenario-question pairs and leveraging Generalized Reward Policy Optimization (GRPO), RL-QR trains query rewriters tailored to specific retrievers, enhancing retrieval performance across varied domains. Experiments on industrial in-house data demonstrate significant improvements, with RL-QR_multi-modal achieving an 11% relative gain in NDCG@3 for multi-modal RAG and RL-QR_lexical yielding a 9% gain for lexical retrievers. However, challenges persist with semantic and hybrid retrievers, where rewriters failed to improve performance, likely due to training misalignments. 
Our findings highlight RL-QR's potential to revolutionize query optimization for RAG systems, offering a scalable, annotation-free solution for real-world retrieval tasks, while identifying avenues for further refinement in semantic retrieval contexts.
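
For readers unfamiliar with the metric quoted above, here is a minimal sketch of how NDCG@k and a relative gain are commonly computed. It assumes the linear-gain form of DCG (relevance divided by log2(rank + 1)); the abstract does not say which NDCG variant the authors used, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain variant)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline` (0.11 means 11%)."""
    return (improved - baseline) / baseline
```

`relevances` lists the graded relevance of the retrieved documents in ranked order, so `ndcg_at_k(relevances, 3)` scores only the first three results. A relative gain compares two such scores as a ratio, which is different from an absolute difference in NDCG points.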

[90] Automated Mapping the Pathways of Cranial Nerve II, III, V, and VII/VIII: A Multi-Parametric Multi-Stage Diffusion Tractography Atlas

Lei Xie,Jiahao Huang,Jiawei Zhang,Jianzhong He,Yiang Pan,Guoqiang Xie,Mengjun Li,Qingrun Zeng,Mingchu Li,Yuanjing Feng

Main category: cs.CV

TL;DR: This study presents the first comprehensive diffusion tractography atlas for automated mapping of cranial nerve pathways in the human brain, enabling more efficient and accurate visualization of complex brain structures and their spatial relationships.

Details Motivation: Cranial nerves (CNs) are crucial for essential brain functions, and mapping their pathways using diffusion MRI (dMRI) offers valuable preoperative insights into their spatial relationships with surrounding tissues. However, creating a comprehensive and detailed CN atlas is challenging due to the unique anatomical structures of each CN pair and the complexity of the skull base environment. This study aims to address these challenges by developing an automated and comprehensive diffusion tractography atlas for mapping CN pathways. Method: The researchers used multi-parametric fiber tractography to generate streamlines from dMRI data of 50 subjects from the Human Connectome Project (HCP). They applied a multi-stage fiber clustering strategy to analyze approximately 1,000,000 streamlines, creating a detailed cranial nerve (CN) atlas. This atlas was validated through quantitative and visual experiments across multiple datasets, including the HCP dataset, the Multi-shell Diffusion MRI (MDM) dataset, and two clinical cases involving pituitary adenoma patients. Result: The proposed CN atlas automatically identified 8 fiber bundles associated with 5 pairs of CNs, including the optic nerve (CN II), oculomotor nerve (CN III), trigeminal nerve (CN V), and facial-vestibulocochlear nerve (CN VII/VIII). The atlas demonstrated high spatial correspondence with expert manual annotations across multiple acquisition sites, including the HCP dataset, the MDM dataset, and two clinical cases of pituitary adenoma patients. Experimental results confirmed the robustness of the CN atlas. Conclusion: The study successfully developed a comprehensive diffusion tractography atlas for automated mapping of cranial nerve pathways, identifying 8 fiber bundles associated with 5 pairs of CNs and demonstrating robustness and spatial correspondence with expert annotations across multiple datasets. 
Abstract: Cranial nerves (CNs) play a crucial role in various essential functions of the human brain, and mapping their pathways from diffusion MRI (dMRI) provides valuable preoperative insights into the spatial relationships between individual CNs and key tissues. However, mapping a comprehensive and detailed CN atlas is challenging because of the unique anatomical structures of each CN pair and the complexity of the skull base environment. In this work, we present what we believe to be the first study to develop a comprehensive diffusion tractography atlas for automated mapping of CN pathways in the human brain. The CN atlas is generated by fiber clustering of the streamlines produced by multi-parametric fiber tractography for each pair of CNs. Instead of one-shot clustering, we explore a new strategy of multi-stage fiber clustering for multiple analyses of approximately 1,000,000 streamlines generated from 50 subjects from the Human Connectome Project (HCP). Quantitative and visual experiments demonstrate that our CN atlas achieves high spatial correspondence with expert manual annotations on multiple acquisition sites, including the HCP dataset, the Multi-shell Diffusion MRI (MDM) dataset and two clinical cases of pituitary adenoma patients. The proposed CN atlas can automatically identify 8 fiber bundles associated with 5 pairs of CNs, including the optic nerve CN II, oculomotor nerve CN III, trigeminal nerve CN V and facial-vestibulocochlear nerve CN VII/VIII, and its robustness is demonstrated experimentally. This work contributes to the field of diffusion imaging by facilitating more efficient and automated mapping of the pathways of multiple pairs of CNs, thereby enhancing the analysis and understanding of complex brain structures through visualization of their spatial relationships with nearby anatomy.

[91] A Deep Dive into Generic Object Tracking: A Survey

Fereshteh Aghaee Meibodi,Shadi Alijani,Homayoun Najjaran

Main category: cs.CV

TL;DR: This paper reviews various object tracking methods in computer vision, particularly focusing on transformer-based approaches, and highlights their advancements due to robust spatio-temporal modeling capabilities.

Details Motivation: Generic object tracking is an important and challenging task in computer vision due to complex spatio-temporal dynamics and issues like occlusions and appearance variations. The paper aims to provide a comprehensive review of all tracking categories, especially focusing on the rapidly evolving transformer-based methods. Method: The paper provides a comprehensive review of object tracking paradigms, including Siamese-based trackers, discriminative trackers, and transformer-based approaches, using both qualitative and quantitative comparisons. Result: The paper presents a novel categorization of tracking methods, offers unified visual and tabular comparisons of representative approaches, and summarizes major evaluation benchmarks, particularly emphasizing advancements in transformer-based tracking. Conclusion: The paper concludes that transformer-based tracking methods have significantly advanced the field of generic object tracking, owing to their robust spatio-temporal modeling capabilities. Abstract: Generic object tracking remains an important yet challenging task in computer vision due to complex spatio-temporal dynamics, especially in the presence of occlusions, similar distractors, and appearance variations. Over the past two decades, a wide range of tracking paradigms, including Siamese-based trackers, discriminative trackers, and, more recently, prominent transformer-based approaches, have been introduced to address these challenges. While a few existing survey papers in this field have either concentrated on a single category or widely covered multiple ones to capture progress, our paper presents a comprehensive review of all three categories, with particular emphasis on the rapidly evolving transformer-based methods. We analyze the core design principles, innovations, and limitations of each approach through both qualitative and quantitative comparisons. 
Our study introduces a novel categorization and offers a unified visual and tabular comparison of representative methods. Additionally, we organize existing trackers from multiple perspectives and summarize the major evaluation benchmarks, highlighting the fast-paced advancements in transformer-based tracking driven by their robust spatio-temporal modeling capabilities.

[92] Towards Measuring and Modeling Geometric Structures in Time Series Forecasting via Image Modality

Mingyang Yu,Xiahui Guo,Peng chen,Zhenkai Li,Yang Shu

Main category: cs.CV

TL;DR: This paper proposes SATL, a new loss function for time series forecasting that improves structural modeling by combining three components: first-order difference loss, frequency domain loss, and perceptual feature loss.

Details Motivation: Traditional metrics like MSE fail to evaluate the geometric structure of time series data, which is essential for understanding temporal dynamics. Method: The paper introduces SATL, a multi-component loss function that includes a first-order difference loss, a frequency domain loss, and a perceptual feature loss to enhance time series structure modeling. Result: SATL achieves superior performance in both MSE and TGSI metrics across multiple datasets. Conclusion: Models trained with SATL outperform baseline methods in MSE and TGSI metrics without increasing inference computational cost. Abstract: Time Series forecasting is critical in diverse domains such as weather forecasting, financial investment, and traffic management. While traditional numerical metrics like mean squared error (MSE) can quantify point-wise accuracy, they fail to evaluate the geometric structure of time series data, which is essential to understand temporal dynamics. To address this issue, we propose the time series Geometric Structure Index (TGSI), a novel evaluation metric that transforms time series into images to leverage their inherent two-dimensional geometric representations. However, since the image transformation process is non-differentiable, TGSI cannot be directly integrated as a training loss. We further introduce the Shape-Aware Temporal Loss (SATL), a multi-component loss function operating in the time series modality to bridge this gap and enhance structure modeling during training. 
SATL combines three components: a first-order difference loss that measures structural consistency through the MSE between first-order differences, a frequency domain loss that captures essential periodic patterns using the Fast Fourier Transform while minimizing noise, and a perceptual feature loss that measures geometric structure difference in time-series by aligning temporal features with geometric structure features through a pre-trained temporal feature extractor and time-series image autoencoder. Experiments across multiple datasets demonstrate that models trained with SATL achieve superior performance in both MSE and the proposed TGSI metrics compared to baseline methods, without additional computational cost during inference.
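
The first two SATL components described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the frequency term here compares DFT amplitude spectra with a plain MSE (the abstract does not give the exact form or how noise is suppressed), the weights `w_diff` and `w_freq` are hypothetical, and the perceptual feature loss is omitted because it depends on pre-trained networks.

```python
import cmath
import math


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def first_order_diff(series):
    """Differences between consecutive time steps."""
    return [series[i + 1] - series[i] for i in range(len(series) - 1)]


def difference_loss(pred, target):
    """MSE between the first-order differences of prediction and target."""
    return mse(first_order_diff(pred), first_order_diff(target))


def dft_magnitudes(series):
    """Amplitude spectrum via a naive O(n^2) DFT; an FFT yields the same values."""
    n = len(series)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(series[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(coeff) / n)
    return mags


def frequency_loss(pred, target):
    """Assumed form: MSE between amplitude spectra of prediction and target."""
    return mse(dft_magnitudes(pred), dft_magnitudes(target))


def shape_aware_loss(pred, target, w_diff=1.0, w_freq=1.0):
    """Weighted sum of the two sketched components (weights are hypothetical)."""
    return w_diff * difference_loss(pred, target) + w_freq * frequency_loss(pred, target)
```

A forecast that is shifted by a constant has a large point-wise MSE but a zero difference loss, because its shape matches the target exactly. That is the kind of structural agreement the difference term is meant to reward.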

[93] Learning Semantic-Aware Threshold for Multi-Label Image Recognition with Partial Labels

Haoxian Ruan,Zhihua Xu,Zhijing Yang,Guang Ma,Jieming Xie,Changxiang Fan,Tianshui Chen

Main category: cs.CV

TL;DR: This paper proposes SATL, a new method for multi-label image recognition with partial labels that dynamically learns category-specific thresholds and improves pseudo-label accuracy, leading to better performance on large-scale datasets.

Details Motivation: Traditional MLR-PL methods use fixed thresholds to generate pseudo-labels without considering varying score distributions across categories, leading to inaccurate and incomplete pseudo-labels. This work aims to address this limitation. Method: The study introduces the Semantic-Aware Threshold Learning (SATL) algorithm, which calculates score distributions for positive and negative samples in each category to determine and dynamically update category-specific thresholds. It also employs a differential ranking loss to improve discrimination between these samples. Result: Experiments on large-scale datasets like Microsoft COCO and VG-200 show that the SATL algorithm significantly improves performance in scenarios with limited labels compared to traditional methods. Conclusion: The proposed SATL algorithm significantly improves the performance of multi-label image recognition with partial labels by dynamically determining category-specific thresholds and enhancing discrimination between positive and negative samples. Abstract: Multi-label image recognition with partial labels (MLR-PL) is designed to train models using a mix of known and unknown labels. Traditional methods rely on semantic or feature correlations to create pseudo-labels for unidentified labels using pre-set thresholds. This approach often overlooks the varying score distributions across categories, resulting in inaccurate and incomplete pseudo-labels, thereby affecting performance. In our study, we introduce the Semantic-Aware Threshold Learning (SATL) algorithm. This innovative approach calculates the score distribution for both positive and negative samples within each category and determines category-specific thresholds based on these distributions. These distributions and thresholds are dynamically updated throughout the learning process. 
Additionally, we implement a differential ranking loss to establish a significant gap between the score distributions of positive and negative samples, enhancing the discrimination of the thresholds. Comprehensive experiments and analysis on large-scale multi-label datasets, such as Microsoft COCO and VG-200, demonstrate that our method significantly improves performance in scenarios with limited labels.
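
To make the idea of category-specific, dynamically updated thresholds concrete, here is a small sketch. The update rule is an assumption made for illustration only: it tracks an exponential moving average of positive and negative scores per category and places the threshold midway between them. The abstract does not specify the actual rule, and the class name, `momentum`, and the initial means are hypothetical.

```python
class CategoryThresholds:
    """Per-category thresholds derived from running score statistics (illustrative)."""

    def __init__(self, num_categories, momentum=0.9, init_pos=0.75, init_neg=0.25):
        self.momentum = momentum
        self.pos_mean = [init_pos] * num_categories  # running mean of positive scores
        self.neg_mean = [init_neg] * num_categories  # running mean of negative scores

    def update(self, category, scores, labels):
        """labels: 1 = known positive, 0 = known negative, None = unknown label."""
        pos = [s for s, y in zip(scores, labels) if y == 1]
        neg = [s for s, y in zip(scores, labels) if y == 0]
        m = self.momentum
        if pos:
            self.pos_mean[category] = m * self.pos_mean[category] + (1 - m) * sum(pos) / len(pos)
        if neg:
            self.neg_mean[category] = m * self.neg_mean[category] + (1 - m) * sum(neg) / len(neg)

    def threshold(self, category):
        """Assumed rule: midpoint between the positive and negative running means."""
        return 0.5 * (self.pos_mean[category] + self.neg_mean[category])

    def pseudo_label(self, category, score):
        """Assign a pseudo-label to an unknown label using the category's threshold."""
        return 1 if score >= self.threshold(category) else 0
```

With per-category statistics, the same score can become a positive pseudo-label for a category whose scores run low and a negative one for a category whose scores run high, which a single fixed threshold cannot express.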

[94] PixNerd: Pixel Neural Field Diffusion

Shuai Wang,Ziteng Gao,Chenhui Zhu,Weilin Huang,Limin Wang

Main category: cs.CV

TL;DR: PixNerd offers a single-stage, end-to-end diffusion model that operates directly in pixel space using neural fields, achieving high performance without complex pipelines or VAEs.

Details Motivation: Current diffusion transformers rely on compressed latent spaces from pre-trained VAEs, which introduce accumulated errors and decoding artifacts. Alternatives that return to pixel space often result in complex cascade pipelines and increased token complexity. There is a need for a more efficient and simplified approach. Method: PixNerd models patch-wise decoding using a neural field, enabling efficient representation and direct operation in pixel space without relying on pre-trained VAEs or multi-stage training paradigms. Result: PixNerd achieved 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any cascade pipeline or VAE. In text-to-image applications, PixNerd-XXL/16 achieved 0.73 on the GenEval benchmark and 80.9 on the DPG benchmark. Conclusion: The proposed PixNerd framework addresses the limitations of current diffusion transformers by offering a single-scale, single-stage, and end-to-end solution that avoids the need for complex cascade pipelines or VAEs. It achieves strong performance on image generation tasks and extends effectively to text-to-image applications. Abstract: The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address the aforementioned problems, researchers return to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model the patch-wise decoding with neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined as pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any complex cascade pipeline or VAE. 
We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.

[95] Towards Affordable Tumor Segmentation and Visualization for 3D Breast MRI Using SAM2

Solha Kang,Eugene Kim,Joris Vankerschaver,Utku Ozbulak

Main category: cs.CV

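TL;DR: This paper explores how the zero-shot model SAM2 can be used for low-resource, minimal-input 3D tumor segmentation in breast MRI, segmenting the entire 3D volume from a bounding box annotation on a single slice, and proposes three slice-tracking strategies, of which center-outward propagation performs best. Although SAM2 was not trained specifically on medical imaging data, it performs well under minimal supervision, offering a viable alternative for resource-constrained settings.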

Details Motivation: Breast MRI provides high-resolution volumetric imaging that is critical for tumor assessment and treatment planning, but manual interpretation of 3D scans remains time-consuming, labor-intensive, and subjective. Although AI tools promise to accelerate medical image analysis, their adoption in low- and middle-income countries remains limited. Method: The study uses SAM2 with a bounding box annotation on a single slice and propagates segmentation predictions across the entire 3D volume using three slice-tracking strategies (top-to-bottom, bottom-to-top, and center-outward), which are then evaluated. Result: Center-outward propagation yields the most consistent and accurate segmentations. Although SAM2 is a zero-shot model not trained on volumetric medical data, it still segments well under minimal supervision. Conclusion: General-purpose foundation models such as SAM2 can support 3D medical image analysis with minimal supervision, offering an accessible and affordable alternative for resource-constrained settings. Abstract: Breast MRI provides high-resolution volumetric imaging critical for tumor assessment and treatment planning, yet manual interpretation of 3D scans remains labor-intensive and subjective. While AI-powered tools hold promise for accelerating medical image analysis, adoption of commercial medical AI products remains limited in low- and middle-income countries due to high license costs, proprietary software, and infrastructure demands. In this work, we investigate whether the Segment Anything Model 2 (SAM2) can be adapted for low-cost, minimal-input 3D tumor segmentation in breast MRI. Using a single bounding box annotation on one slice, we propagate segmentation predictions across the 3D volume using three different slice-wise tracking strategies: top-to-bottom, bottom-to-top, and center-outward. We evaluate these strategies across a large cohort of patients and find that center-outward propagation yields the most consistent and accurate segmentations. Despite being a zero-shot model not trained for volumetric medical data, SAM2 achieves strong segmentation performance under minimal supervision. We further analyze how segmentation performance relates to tumor size, location, and shape, identifying key failure modes. Our results suggest that general-purpose foundation models such as SAM2 can support 3D medical image analysis with minimal supervision, offering an accessible and affordable alternative for resource-constrained settings.
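
The three slice-wise tracking strategies differ only in the order in which slices are visited. The sketch below generates that order and nothing else; it does not call SAM2. It assumes that slice index 0 is the top of the volume, that the annotated slice is the first slice of each pass, and that center-outward runs as two passes from the middle slice. The function name and the `start` parameter are hypothetical.

```python
def propagation_passes(num_slices, strategy, start=None):
    """Return a list of passes; each pass is the ordered list of slice indices
    visited, beginning at the slice that carries the bounding box prompt."""
    if num_slices <= 0:
        raise ValueError("num_slices must be positive")
    if strategy == "top_to_bottom":
        return [list(range(num_slices))]
    if strategy == "bottom_to_top":
        return [list(range(num_slices - 1, -1, -1))]
    if strategy == "center_outward":
        center = num_slices // 2 if start is None else start
        toward_top = list(range(center, -1, -1))        # center, center-1, ..., 0
        toward_bottom = list(range(center, num_slices))  # center, center+1, ..., last
        return [toward_top, toward_bottom]
    raise ValueError(f"unknown strategy: {strategy}")
```

In the center-outward case the longest distance any mask has to be tracked from the prompted slice is roughly half the volume, rather than the full volume as in the two single-direction strategies. That shorter tracking distance is one plausible reading of why the paper finds it the most consistent, though the abstract itself does not give the reason.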

[96] iLRM: An Iterative Large 3D Reconstruction Model

Gyeongjin Kang,Seungtae Nam,Xiangyu Sun,Sameh Khamis,Abdelrahman Mohamed,Eunbyung Park

Main category: cs.CV

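TL;DR: Through an efficient iterative mechanism and optimization strategies, iLRM addresses the computational cost and scalability limitations of existing 3D reconstruction methods, achieving high-quality, efficient 3D reconstruction.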

Details Motivation: Existing transformer-based 3D reconstruction methods rely on full attention across image tokens from multiple views, which makes them computationally expensive and hard to scale. Method: The paper introduces an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations by decoupling the scene representation from the input-view images, decomposing fully-attentional multi-view interactions into a two-stage attention scheme, and injecting high-resolution information at every layer. Result: Experiments on datasets such as RE10K and DL3DV show that iLRM outperforms existing methods in both reconstruction quality and speed, and that it achieves higher reconstruction quality at comparable computational cost by leveraging more input views. Conclusion: Through its iterative refinement mechanism and three core principles, iLRM achieves better reconstruction quality and speed than existing methods and scales well. Abstract: Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering, as well as numerous applications. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input-view images to enable compact 3D representations; (2) decomposing fully-attentional multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed. Notably, iLRM exhibits superior scalability, delivering significantly higher reconstruction quality under comparable computational cost by efficiently leveraging a larger number of input views.

[97] UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing

Hao Tang,Chenwei Xie,Xiaoyi Bao,Tingyu Weng,Pandeng Li,Yun Zheng,Liwei Wang

Main category: cs.CV

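TL;DR: UniLIP extends the scope of CLIP through a two-stage training scheme, a self-distillation strategy, and a dual-condition architecture, so that it performs well on understanding, generation, and editing tasks.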

Details Motivation: Previous CLIP-based unified methods typically require additional diffusion decoders or quantization to support reconstruction and generation, which leads to inconsistent reconstruction or degrades the original comprehension performance. Method: The paper introduces a two-stage training scheme and a self-distillation strategy that progressively integrate reconstruction capabilities into CLIP, and proposes a dual-condition architecture to connect the MLLM and the diffusion transformer. Result: In text-to-image generation, UniLIP scores 0.87 on GenEval and 0.53 on WISE, surpassing all previous unified models of similar scale. In image editing, UniLIP scores 3.62 on the ImgEdit benchmark, surpassing recent models such as BAGEL and UniWorld-V1. Conclusion: UniLIP effectively extends the scope of CLIP, making it strong on understanding tasks and highly competitive on generation and editing tasks. Abstract: In this paper, we propose UniLIP, which extends CLIP to reconstruction, generation and editing, thereby building a unified tokenizer upon its exceptional comprehension capabilities. Previous CLIP-based unified methods often require additional diffusion decoders or quantization to support reconstruction and generation tasks, leading to inconsistent reconstruction or degradation of original comprehension performance. In contrast, we introduce a two-stage training scheme and a self-distillation strategy that progressively integrates reconstruction capabilities into CLIP, allowing it to maintain original comprehension performance while achieving effective image reconstruction. Furthermore, we propose a dual-condition architecture to connect the MLLM and diffusion transformer, using both learnable queries and the last layer multimodal hidden states as joint conditions. This method not only enables the utilization of the MLLM's strong reasoning capabilities in generation tasks, but also maximizes the exploitation of the rich information in UniLIP features during editing tasks. In text-to-image generation tasks, UniLIP obtains scores of 0.87 and 0.53 on GenEval and WISE benchmark respectively, surpassing all previous unified models of similar scale. In image editing, UniLIP also achieves a score of 3.62 on the ImgEdit Benchmark, surpassing recent state-of-the-art models such as BAGEL and UniWorld-V1. 
UniLIP effectively expands the application scope of CLIP, enabling continuous CLIP features to not only serve as the optimal choice for understanding tasks but also achieve highly competitive performance in generation and editing tasks.

[98] Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval

Dohwan Ko,Ji Soo Lee,Minhyuk Choi,Zihang Meng,Hyunwoo J. Kim

Main category: cs.CV

TL;DR: The paper proposes BLiM and CPN to address candidate prior bias in multi-modal retrieval, achieving better performance by leveraging bidirectional likelihood and score calibration.

Details Motivation: The authors aim to solve the candidate prior bias issue in applying MLLMs to Text-Video Retrieval, where models favor candidates with higher inherent priors over relevant ones. Method: The paper introduces BLiM, which uses bidirectional likelihood estimation, and CPN, a training-free calibration method to mitigate prior bias. Result: BLiM with CPN outperforms previous models by 6.4 R@1 on average across four benchmarks. Conclusion: The proposed BLiM framework with CPN effectively addresses candidate prior bias in Text-Video Retrieval, achieving superior performance over existing methods. Abstract: Text-Video Retrieval aims to find the most relevant text (or video) candidate given a video (or text) query from large-scale online databases. Recent work leverages multi-modal large language models (MLLMs) to improve retrieval, especially for long or complex query-candidate pairs. However, we observe that the naive application of MLLMs, i.e., retrieval based on candidate likelihood, introduces candidate prior bias, favoring candidates with inherently higher priors over those more relevant to the query. To this end, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM), which leverages both query and candidate likelihoods by training the model to generate text from a given video as well as video features from a given text. Furthermore, we introduce Candidate Prior Normalization (CPN), a simple yet effective training-free score calibration module designed to mitigate candidate prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks, our BLiM equipped with CPN outperforms previous state-of-the-art models by 6.4 R@1 on average, effectively alleviating candidate prior bias and emphasizing query-candidate relevance. 
Our in-depth analysis across various multi-modal tasks beyond retrieval highlights the broad applicability of CPN which enhances visual understanding by reducing reliance on textual priors. Code is available at https://github.com/mlvlab/BLiM.
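
To make the candidate prior bias concrete, here is a minimal sketch of score calibration in Python. The summary does not give the exact form of CPN, so the calibration used here (subtracting a scaled candidate log-prior from the candidate log-likelihood), the `alpha` weight, the function names, and all numbers are assumptions for illustration only, not the authors' implementation.

```python
def normalize_scores(log_likelihoods, log_priors, alpha=1.0):
    """Assumed calibration: log p(c | q) - alpha * log p(c) for each candidate c."""
    return [ll - alpha * lp for ll, lp in zip(log_likelihoods, log_priors)]


def rank(scores):
    """Return candidate indices sorted from best to worst score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical values: candidate 0 is a generic caption with a high prior,
# candidate 1 is a specific caption that fits the query but has a low prior.
cand_log_likelihood = [-2.0, -2.5]
cand_log_prior = [-1.0, -4.0]

naive_ranking = rank(cand_log_likelihood)
calibrated_ranking = rank(normalize_scores(cand_log_likelihood, cand_log_prior))
print(naive_ranking, calibrated_ranking)
```

With these made-up numbers, ranking by raw likelihood prefers the generic candidate, while the calibrated score prefers the specific one, which is the kind of effect the abstract attributes to reducing reliance on candidate priors.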

[99] LED Benchmark: Diagnosing Structural Layout Errors for Document Layout Analysis

Inbum Heo,Taewook Hwang,Jeesu Jung,Sangkeun Jung

Main category: cs.CV

TL;DR: The paper introduces LED, a new benchmark for evaluating document layout models, which addresses limitations in current metrics by detecting critical structural errors through a synthetic dataset and specialized tasks.

Details Motivation: Despite improvements in Document Layout Analysis using Large Language Models and Multimodal Models, structural errors like region merging, splitting, and missing content remain challenging. Traditional metrics like IoU and mAP are insufficient for detecting these issues. Method: The authors proposed Layout Error Detection (LED), a novel benchmark with eight standardized error types and three tasks: error existence detection, error type classification, and element-wise error type classification. They also created LED-Dataset, a synthetic dataset with realistic structural errors. Result: Experimental results show that LED successfully differentiates structural understanding capabilities across LMMs and exposes previously undetected modality biases and performance trade-offs. Conclusion: LED effectively evaluates the structural robustness of document layout predictions, revealing modality biases and performance trade-offs that traditional metrics cannot detect. Abstract: Recent advancements in Document Layout Analysis through Large Language Models and Multimodal Models have significantly improved layout detection. However, despite these improvements, challenges remain in addressing critical structural errors, such as region merging, splitting, and missing content. Conventional evaluation metrics like IoU and mAP, which focus primarily on spatial overlap, are insufficient for detecting these errors. To address this limitation, we propose Layout Error Detection (LED), a novel benchmark designed to evaluate the structural robustness of document layout predictions. LED defines eight standardized error types, and formulates three complementary tasks: error existence detection, error type classification, and element-wise error type classification. Furthermore, we construct LED-Dataset, a synthetic dataset generated by injecting realistic structural errors based on empirical distributions from DLA models. 
Experimental results across a range of LMMs reveal that LED effectively differentiates structural understanding capabilities, exposing modality biases and performance trade-offs not visible through traditional metrics.
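
The point about overlap-based metrics can be illustrated with a small Python sketch. It computes the standard intersection-over-union for axis-aligned boxes and applies it to a hypothetical merge error, where one predicted region covers two stacked ground-truth regions. The box coordinates and the 0.5 matching threshold are assumptions chosen for illustration; they are not taken from the LED benchmark.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two stacked ground-truth regions (for example, two paragraphs).
gt_top = (0, 0, 10, 10)
gt_bottom = (0, 10, 10, 20)
# A single prediction that merges both regions into one box.
merged_prediction = (0, 0, 10, 20)

print(iou(merged_prediction, gt_top), iou(merged_prediction, gt_bottom))
```

In this example the merged box has an IoU of 0.5 with each ground-truth region, so under a 0.5 matching threshold it would still count as a hit for one of them, even though the document structure is wrong.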

[100] Training-free Geometric Image Editing on Diffusion Models

Hanshen Zhu,Zhen Zhu,Kaile Zhang,Yiming Gong,Yuliang Liu,Xiang Bai

Main category: cs.CV

TL;DR: FreeFine improves geometric image editing by decoupling the editing process into distinct stages, delivering superior results on complex transformations.

Details Motivation: To address the difficulty of handling large or structurally complex transformations in geometric image editing using existing diffusion-based methods that perform all subtasks in a single step. Method: A decoupled pipeline that separates object transformation, source region inpainting, and target region refinement was proposed, implementing inpainting and refinement with a training-free diffusion approach called FreeFine. Result: FreeFine achieved better performance in image fidelity and edit precision on the GeoBench benchmark, which includes 2D and 3D editing scenarios. Conclusion: FreeFine, a training-free diffusion approach, outperforms state-of-the-art alternatives in geometric image editing tasks, especially under demanding transformations. Abstract: We tackle the task of geometric image editing, where an object within an image is repositioned, reoriented, or reshaped while preserving overall scene coherence. Previous diffusion-based editing methods often attempt to handle all relevant subtasks in a single step, proving difficult when transformations become large or structurally complex. We address this by proposing a decoupled pipeline that separates object transformation, source region inpainting, and target region refinement. Both inpainting and refinement are implemented using a training-free diffusion approach, FreeFine. In experiments on our new GeoBench benchmark, which contains both 2D and 3D editing scenarios, FreeFine outperforms state-of-the-art alternatives in image fidelity and edit precision, especially under demanding transformations. Code and benchmark are available at: https://github.com/CIawevy/FreeFine

[101] ST-SAM: SAM-Driven Self-Training Framework for Semi-Supervised Camouflaged Object Detection

Xihang Hu,Fuming Sun,Jiazhe Liu,Feilong Xu,Xiaoli Zhang

Main category: cs.CV

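TL;DR: This paper proposes ST-SAM, a semi-supervised camouflaged object detection framework that achieves high performance with only a small amount of labeled data through a self-training strategy and a hybrid prompt mechanism.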

Details Motivation: Existing semi-supervised camouflaged object detection methods based on Teacher-Student frameworks suffer from severe prediction bias and error propagation under scarce supervision, and their multi-network architectures incur high computational overhead and limited scalability. Method: ST-SAM uses a self-training strategy that dynamically filters and expands high-confidence pseudo-labels, and transforms the pseudo-labels into hybrid prompts containing domain-specific knowledge to harness the potential of the Segment Anything Model. Result: Experiments show that ST-SAM achieves state-of-the-art performance on COD benchmark datasets with only 1% labeled data, outperforming existing semi-supervised camouflaged object detection methods and even matching fully supervised methods. Conclusion: With its self-training strategy and hybrid prompt mechanism, ST-SAM achieves state-of-the-art performance with only 1% labeled data, without relying on specific models or loss functions. Abstract: Semi-supervised Camouflaged Object Detection (SSCOD) aims to reduce reliance on costly pixel-level annotations by leveraging limited annotated data and abundant unlabeled data. However, existing SSCOD methods based on Teacher-Student frameworks suffer from severe prediction bias and error propagation under scarce supervision, while their multi-network architectures incur high computational overhead and limited scalability. To overcome these limitations, we propose ST-SAM, a highly annotation-efficient yet concise framework that breaks away from conventional SSCOD constraints. Specifically, ST-SAM employs a Self-Training strategy that dynamically filters and expands high-confidence pseudo-labels to enhance a single-model architecture, thereby fundamentally circumventing inter-model prediction bias. Furthermore, by transforming pseudo-labels into hybrid prompts containing domain-specific knowledge, ST-SAM effectively harnesses the Segment Anything Model's potential for specialized tasks to mitigate error accumulation in self-training. Experiments on COD benchmark datasets demonstrate that ST-SAM achieves state-of-the-art performance with only 1% labeled data, outperforming existing SSCOD methods and even matching fully supervised methods. Remarkably, ST-SAM requires training only a single network, without relying on specific models or loss functions. This work establishes a new paradigm for annotation-efficient SSCOD. Codes will be available at https://github.com/hu-xh/ST-SAM.

[102] PriorFusion: Unified Integration of Priors for Robust Road Perception in Autonomous Driving

Xuewei Tang,Mengmeng Yang,Tuopu Wen,Peijin Jia,Le Cui,Mingshang Luo,Kehua Sheng,Bo Zhang,Diange Yang,Kun Jiang

Main category: cs.CV

TL;DR: PriorFusion is a new framework that uses semantic, geometric, and generative priors to improve road element perception in autonomous driving.

Details Motivation: Autonomous vehicles must independently interpret surrounding road elements in complex environments without high-definition map support, but existing methods do not fully exploit the structured priors in road elements, which leads to irregular and inaccurate predictions. Method: PriorFusion introduces an instance-aware attention mechanism and builds a data-driven shape template space, then uses a diffusion model to generate predictions based on prior anchors. Result: Experiments show that the method significantly improves perception accuracy on large-scale autonomous driving datasets, particularly under challenging conditions. Visualizations show that the predictions are more accurate, regular, and coherent. Conclusion: By integrating semantic, geometric, and generative priors, PriorFusion improves the accuracy and completeness of road element perception in autonomous driving. Abstract: With the growing interest in autonomous driving, there is an increasing demand for accurate and reliable road perception technologies. In complex environments without high-definition map support, autonomous vehicles must independently interpret their surroundings to ensure safe and robust decision-making. However, these scenarios pose significant challenges due to the large number, complex geometries, and frequent occlusions of road elements. A key limitation of existing approaches lies in their insufficient exploitation of the structured priors inherently present in road elements, resulting in irregular, inaccurate predictions. To address this, we propose PriorFusion, a unified framework that effectively integrates semantic, geometric, and generative priors to enhance road element perception. We introduce an instance-aware attention mechanism guided by shape-prior features, then construct a data-driven shape template space that encodes low-dimensional representations of road elements, enabling clustering to generate anchor points as reference priors. We design a diffusion-based framework that leverages these prior anchors to generate accurate and complete predictions. Experiments on large-scale autonomous driving datasets demonstrate that our method significantly improves perception accuracy, particularly under challenging conditions. Visualization results further confirm that our approach produces more accurate, regular, and coherent predictions of road elements.

[103] Forgetting of task-specific knowledge in model merging-based continual learning

Timm Hess,Gido M van de Ven,Tinne Tuytelaars

Main category: cs.CV

TL;DR: This paper explores model merging in continual learning, showing that it preserves shared knowledge and outperforms parallel training merging when done incrementally.

Details Motivation: The motivation of the paper is to understand how merging models impacts knowledge preservation and performance in continual learning scenarios. Method: Computer vision experiments using controlled visual cues were conducted to investigate the effects of merging models in continual learning. Result: The results show that merging models in continual learning largely preserves or enhances shared knowledge while unshared task-specific knowledge rapidly degrades, and merging models from an incremental training process performs better than merging models trained in parallel. Conclusion: Merging models in continual learning can preserve or enhance shared knowledge, while unshared knowledge degrades quickly, and incremental training process merging outperforms parallel training merging. Abstract: This paper investigates the linear merging of models in the context of continual learning (CL). Using controlled visual cues in computer vision experiments, we demonstrate that merging largely preserves or enhances shared knowledge, while unshared task-specific knowledge rapidly degrades. We further find that merging models from an incremental training process consistently outperforms merging models trained in parallel.
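
As a rough illustration of what linear merging means, the following Python sketch averages the parameters of several models element by element. The parameter names, the plain-list representation of weights, and the uniform default coefficients are assumptions for illustration; the summary does not specify the merging coefficients or the architectures used in the paper.

```python
def merge_linear(models, coeffs=None):
    """Linearly combine models that share the same parameter names and shapes.

    Each model is a dict mapping a parameter name to a flat list of floats.
    With no coefficients given, the models are averaged uniformly.
    """
    if coeffs is None:
        coeffs = [1.0 / len(models)] * len(models)
    merged = {}
    for name, reference in models[0].items():
        merged[name] = [
            sum(c * m[name][i] for c, m in zip(coeffs, models))
            for i in range(len(reference))
        ]
    return merged


# Hypothetical single-layer models trained on two different tasks.
model_task_a = {"layer.weight": [1.0, 2.0]}
model_task_b = {"layer.weight": [3.0, 6.0]}

merged_model = merge_linear([model_task_a, model_task_b])
print(merged_model)
```

The same function covers both settings compared in the abstract: the inputs can be checkpoints from an incremental training process or models trained in parallel; only the way the inputs were obtained differs.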

[104] The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models

Alfio Ferrara,Sergio Picascia,Elisabetta Rocchetti

Main category: cs.CV

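TL;DR: This paper studies how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks, and finds that the models develop an understanding of the content-style distinction without explicit supervision.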

Details Motivation: To study how diffusion models encode the concepts of content and style in paintings without explicit guidance. Method: Cross-attention heatmaps are used to attribute pixels in generated images to specific prompt tokens, separating the influence of content-describing tokens from that of style-describing tokens on image regions. Result: Content tokens mainly influence object-related regions, while style tokens influence background and texture areas, indicating that the models can separate content from style. Conclusion: Diffusion models show varying degrees of content-style separation when generating artworks, which suggests that their understanding of the distinction emerges without explicit supervision. Abstract: Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at https://github.com/umilISLab/artistic-prompt-interpretation.

[105] Impact of Hyperparameter Optimization on the Accuracy of Lightweight Deep Learning Models for Real-Time Image Classification

Vineet Kumar Rakesh,Soumya Mazumdar,Tapas Samanta,Sarbajit Pal,Amitabha Das

Main category: cs.CV

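TL;DR: The paper analyzes several lightweight deep learning models under different hyperparameter settings and finds that cosine learning rate decay and adjustable batch size can significantly improve accuracy and convergence speed while keeping latency low, making the models suitable for real-time image processing.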

Details Motivation: Lightweight convolutional and transformer-based models are vital for real-time image classification in resource-constrained applications such as embedded systems and edge devices, so the effect of hyperparameter adjustment on model performance needs to be studied. Method: Under consistent training settings, a comprehensive ablation study is run on seven efficient deep learning architectures to analyze the effect of key hyperparameters on accuracy and convergence behavior and to assess suitability for real-time applications. Result: Cosine learning rate decay and adjustable batch size significantly improve accuracy and convergence speed while keeping latency and memory cost low; RepVGG-A2 achieves over 80% Top-1 accuracy, striking a good balance between accuracy and deployment cost. Conclusion: Tuning hyperparameters such as cosine learning rate decay and adjustable batch size significantly improves accuracy and convergence speed while keeping latency and memory cost low, which gives practical guidance for building resource-efficient models for real-time image processing pipelines. Abstract: Lightweight convolutional and transformer-based models have become vital for real-time image classification in resource-constrained applications, such as embedded systems and edge devices. This work analyzes the influence of hyperparameter adjustment on the accuracy and convergence behavior of seven efficient deep learning architectures: EfficientNetV2-S, ConvNeXt-T, MobileViT v2 (XXS/XS/S), MobileNetV3-L, TinyViT-21M, and RepVGG-A2. All models are trained on the ImageNet-1K dataset under consistent training settings, with an emphasis on real-time practicality. A comprehensive ablation study is undertaken to separate the effect of critical hyperparameters, including learning rate schedules, batch sizes, input resolution, data augmentation, regularization approaches, and optimizer choice. To assess appropriateness for real-time applications, each model is assessed not only in terms of Top-1 and Top-5 classification accuracy, but also in terms of inference time, parameter count, model size, and frames-per-second (FPS) on a GPU-accelerated edge deployment simulation. Results demonstrate that cosine learning rate decay and adjustable batch size may greatly boost both accuracy and convergence speed, while keeping low latency and memory cost. Notably, RepVGG-A2 achieves over 80% Top-1 accuracy with efficient inference performance, offering a compelling balance between accuracy and deployment cost for VGG-style models. The results give practical guidance for constructing resource-efficient deep learning models appropriate for real-time image processing pipelines. 
All code and training logs are publicly accessible at https://github.com/VineetKumarRakesh/lcnn-opt.
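
For readers unfamiliar with cosine learning rate decay, a minimal Python sketch follows. It uses the common formulation lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T)). The default values `lr_max=0.1` and `lr_min=0.0` and the 100-step horizon are assumptions for illustration; the summary does not state the settings, warmup, or restarts used in the paper.

```python
import math


def cosine_lr(step, total_steps, lr_max=0.1, lr_min=0.0):
    """Cosine-decayed learning rate at a given step (no warmup, no restarts)."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at the start, quarter points, and end of training.
schedule = [cosine_lr(step, total_steps=100) for step in range(0, 101, 25)]
print(schedule)
```

The rate starts at `lr_max`, falls slowly at first, fastest around the midpoint, and flattens out as it approaches `lr_min`, which is the usual argument for this schedule over step-wise decay.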

[106] FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning

Jiajun Cao,Qizhe Zhang,Peidong Jia,Xuhui Zhao,Bo Lan,Xiaoan Zhang,Xiaobao Wei,Sixiang Chen,Zhuo Li,Yang Wang,Liyun Li,Xianming Liu,Ming Lu,Shanghang Zhang

Main category: cs.CV

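TL;DR: This paper proposes FastDriveVLA, which efficiently prunes the visual tokens of VLA models for autonomous driving, improving computational efficiency while maintaining performance.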

Details Motivation: Redundant visual tokens in existing VLA models lead to high computational cost, and conventional pruning methods perform poorly on autonomous driving tasks, so pruning needs to be tailored to driving scenes. Method: The paper proposes ReconPruner, a pruner trained with MAE-style pixel reconstruction and an adversarial foreground-background reconstruction strategy on the nuScenes-FG dataset. Result: On the nuScenes closed-loop planning benchmark, the method achieves state-of-the-art performance across different pruning ratios. Conclusion: FastDriveVLA prunes visual tokens for autonomous driving scenes, improves decision-making efficiency by retaining foreground information, and applies without retraining to different VLA models that share the same visual encoder. Abstract: Vision-Language-Action (VLA) models have demonstrated significant potential in complex scene understanding and action reasoning, leading to their increasing adoption in end-to-end autonomous driving systems. However, the long visual tokens of VLA models greatly increase computational costs. Current visual token pruning methods in Vision-Language Models (VLM) rely on either visual token similarity or visual-text attention, but both have shown poor performance in autonomous driving scenarios. Given that human drivers concentrate on relevant foreground areas while driving, we assert that retaining visual tokens containing this foreground information is essential for effective decision-making. Inspired by this, we propose FastDriveVLA, a novel reconstruction-based vision token pruning framework designed specifically for autonomous driving. FastDriveVLA includes a plug-and-play visual token pruner called ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Once trained, ReconPruner can be seamlessly applied to different VLA models with the same visual encoder without retraining. To train ReconPruner, we also introduce a large-scale dataset called nuScenes-FG, consisting of 241K image-mask pairs with annotated foreground regions. Our approach achieves state-of-the-art results on the nuScenes closed-loop planning benchmark across different pruning ratios.
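
The pruning step itself can be sketched in a few lines of Python: given one saliency score per visual token, keep the highest-scoring fraction and preserve the original token order. This is only a generic top-k selection under assumed inputs. In the paper the scores would come from the trained ReconPruner; here the token labels, the scores, and the `keep_ratio` parameter are hypothetical.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the top-scoring fraction of tokens, preserving their original order."""
    keep_count = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_indices = sorted(ranked[:keep_count])
    return [tokens[i] for i in kept_indices]


# Hypothetical patch labels and foreground saliency scores.
visual_tokens = ["sky", "car", "road", "pedestrian"]
foreground_scores = [0.1, 0.9, 0.4, 0.8]

kept_tokens = prune_tokens(visual_tokens, foreground_scores, keep_ratio=0.5)
print(kept_tokens)
```

Preserving the original order matters in practice because positional information is usually tied to token position, so a pruner that reorders tokens would change what the downstream model sees.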

[107] FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models

Yiming Yang,Hongbin Lin,Yueru Luo,Suzhong Fu,Chao Zheng,Xinrui Yan,Shuqi Mei,Kun Tang,Shuguang Cui,Zhen Li

Main category: cs.CV

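TL;DR: FASTopoWM is a new lane segment topology reasoning framework that improves autonomous driving perception by introducing latent world models and parallel supervision of fast and slow systems.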

Details Motivation: Existing lane topology reasoning methods struggle to make full use of temporal information, and they suffer from over-reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation. Method: FASTopoWM introduces latent query and BEV world models conditioned on the action latent, and adopts a framework with parallel supervision of the fast and slow systems. Result: On the OpenLane-V2 benchmark, FASTopoWM outperforms prior work on both lane segment detection (37.4% mAP) and centerline perception (46.3% OLS). Conclusion: FASTopoWM overcomes the limitations of existing methods and improves lane topology reasoning, providing a more accurate and robust perception module for autonomous driving systems. Abstract: Lane segment topology reasoning provides comprehensive bird's-eye view (BEV) road scene understanding, which can serve as a key perception module in planning-oriented end-to-end autonomous driving systems. Existing lane topology reasoning methods often fall short in effectively leveraging temporal information to enhance detection and reasoning performance. Recently, the stream-based temporal propagation method has demonstrated promising results by incorporating temporal cues at both the query and BEV levels. However, it remains limited by over-reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation. To overcome these limitations, we propose FASTopoWM, a novel fast-slow lane segment topology reasoning framework augmented with latent world models. To reduce the impact of pose estimation failures, this unified framework enables parallel supervision of both historical and newly initialized queries, facilitating mutual reinforcement between the fast and slow systems. Furthermore, we introduce latent query and BEV world models conditioned on the action latent to propagate the state representations from past observations to the current timestep. This design substantially improves the performance of temporal perception within the slow pipeline. Extensive experiments on the OpenLane-V2 benchmark demonstrate that FASTopoWM outperforms state-of-the-art methods in both lane segment detection (37.4% vs. 33.6% on mAP) and centerline perception (46.3% vs. 41.5% on OLS).

[108] Learning Semantic Directions for Feature Augmentation in Domain-Generalized Medical Segmentation

Yingkai Wang,Yaoyao Zhu,Xiuding Cai,Yuhao Xiao,Haotian Wu,Yu Yao

Main category: cs.CV

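TL;DR: This paper proposes a domain generalization framework for medical image segmentation that uses implicit feature perturbation and an adaptive consistency constraint to counter the performance degradation caused by domain shift.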

Details Motivation: Medical image segmentation suffers performance degradation from domain shift when applied across clinical domains, mainly because of differences in imaging conditions, scanner types, and acquisition protocols. Method: The paper introduces implicit feature perturbations guided by domain statistics, comprising a learnable semantic direction selector and a covariance-based semantic intensity sampler, and designs an adaptive consistency constraint. Result: Experiments on two public multi-center benchmarks show that the framework outperforms existing domain generalization methods across clinical domains. Conclusion: The experiments indicate that the proposed domain generalization framework offers superior robustness and generalization in medical image segmentation across diverse clinical domains. Abstract: Medical image segmentation plays a crucial role in clinical workflows, but domain shift often leads to performance degradation when models are applied to unseen clinical domains. This challenge arises due to variations in imaging conditions, scanner types, and acquisition protocols, limiting the practical deployment of segmentation models. Unlike natural images, medical images typically exhibit consistent anatomical structures across patients, with domain-specific variations mainly caused by imaging conditions. This unique characteristic makes medical image segmentation particularly challenging. To address this challenge, we propose a domain generalization framework tailored for medical image segmentation. Our approach improves robustness to domain-specific variations by introducing implicit feature perturbations guided by domain statistics. Specifically, we employ a learnable semantic direction selector and a covariance-based semantic intensity sampler to modulate domain-variant features while preserving task-relevant anatomical consistency. Furthermore, we design an adaptive consistency constraint that is selectively applied only when feature adjustment leads to degraded segmentation performance. This constraint encourages the adjusted features to align with the original predictions, thereby stabilizing feature selection and improving the reliability of the segmentation. Extensive experiments on two public multi-center benchmarks show that our framework consistently outperforms existing domain generalization approaches, achieving robust and generalizable segmentation performance across diverse clinical domains.

[109] Contrastive Learning-Driven Traffic Sign Perception: Multi-Modal Fusion of Text and Vision

Qiang Lu,Waikit Xiu,Xiying Li,Shenyu Hu,Shengbo Sun

Main category: cs.CV

TL;DR: This paper introduces a two-stage framework with NanoVerse YOLO and TSR-MCL to improve traffic sign recognition under long-tail distribution and multi-scale conditions, achieving state-of-the-art results on TT100K.

Details Motivation: Traffic sign recognition faces challenges due to long-tail data distribution and small, multi-scale targets, which degrade the performance of traditional methods. Method: The framework consists of NanoVerse YOLO with RepVL-PAN and SPD-Conv for detection, and TSR-MCL using Vision Transformer and BERT for cross-modal classification. Result: On the TT100K dataset, the method achieves 78.4% mAP for long-tail detection, 91.8% accuracy, and 88.9% recall, outperforming mainstream algorithms. Conclusion: The paper proposes a novel two-stage framework combining open-vocabulary detection and cross-modal learning to address the challenges in traffic sign recognition, demonstrating superior performance and generalization in complex scenarios. Abstract: Traffic sign recognition, as a core component of autonomous driving perception systems, directly influences vehicle environmental awareness and driving safety. Current technologies face two significant challenges: first, the traffic sign dataset exhibits a pronounced long-tail distribution, resulting in a substantial decline in recognition performance of traditional convolutional networks when processing low-frequency and out-of-distribution classes; second, traffic signs in real-world scenarios are predominantly small targets with significant scale variations, making it difficult to extract multi-scale features. To overcome these issues, we propose a novel two-stage framework combining open-vocabulary detection and cross-modal learning. For traffic sign detection, our NanoVerse YOLO model integrates a reparameterizable vision-language path aggregation network (RepVL-PAN) and an SPD-Conv module to specifically enhance feature extraction for small, multi-scale targets. For traffic sign classification, we designed a Traffic Sign Recognition Multimodal Contrastive Learning model (TSR-MCL). 
By contrasting visual features from a Vision Transformer with semantic features from a rule-based BERT, TSR-MCL learns robust, frequency-independent representations, effectively mitigating class confusion caused by data imbalance. On the TT100K dataset, our method achieves a state-of-the-art 78.4% mAP in the long-tail detection task for all-class recognition. The model also obtains 91.8% accuracy and 88.9% recall, significantly outperforming mainstream algorithms and demonstrating superior accuracy and generalization in complex, open-world scenarios.
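
The idea of contrasting visual and text features can be sketched with a small Python example. It computes a one-directional, CLIP-style contrastive (InfoNCE) loss over cosine similarities, where image `i` should match text `i`. The loss form, the temperature of 0.1, and the two-dimensional toy embeddings are assumptions for illustration; the summary does not specify the loss that TSR-MCL actually uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Mean cross-entropy of matching image i to text i among all texts."""
    total = 0.0
    for i, image in enumerate(image_embs):
        logits = [cosine(image, text) / temperature for text in text_embs]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(image_embs)


# Toy embeddings: each image is close to the text with the same index.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[1.0, 0.1], [0.1, 1.0]]

aligned_loss = contrastive_loss(image_embs, text_embs)
swapped_loss = contrastive_loss(image_embs, text_embs[::-1])
print(aligned_loss, swapped_loss)
```

The loss is low when each image embedding sits next to its own class description and high when the pairs are swapped. Because every class contributes a text anchor regardless of how many images it has, this kind of objective is one plausible route to the frequency-independent representations the abstract describes.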

[110] MagicRoad: Semantic-Aware 3D Road Surface Reconstruction via Obstacle Inpainting

Xingyue Peng,Yuandong Lyu,Lang Zhang,Jian Zhu,Songtao Wang,Jiaxin Deng,Songxin Lu,Weiliang Ma,Dangen She,Peng Jia,XianPeng Lang

Main category: cs.CV

TL;DR: The paper presents a new framework for road surface reconstruction that effectively handles occlusions and visual challenges, significantly improving reconstruction quality for autonomous driving applications.

Details Motivation: The motivation is to improve road surface reconstruction for autonomous driving, achieving centimeter-accurate lane perception and high-definition mapping in complex urban environments, particularly addressing vulnerabilities to occlusions, visual clutter, and appearance degradation. Method: The method involves a robust reconstruction framework that uses occlusion-aware 2D Gaussian surfels and semantic-guided color enhancement. It includes a planar-adapted Gaussian representation for efficient modeling, segmentation-guided video inpainting to remove foreground objects, and color coherence enhancement through semantic-aware correction in HSV space. Result: The result is that the proposed framework achieves superior performance, producing clean and consistent road surface reconstructions that are both visually coherent and geometrically accurate on urban-scale datasets. Conclusion: The paper concludes that their proposed framework for road surface reconstruction significantly outperforms previous methods in real-world conditions by producing visually coherent and geometrically accurate results. Abstract: Road surface reconstruction is essential for autonomous driving, supporting centimeter-accurate lane perception and high-definition mapping in complex urban environments. While recent methods based on mesh rendering or 3D Gaussian splatting (3DGS) achieve promising results under clean and static conditions, they remain vulnerable to occlusions from dynamic agents, visual clutter from static obstacles, and appearance degradation caused by lighting and weather changes. We present a robust reconstruction framework that integrates occlusion-aware 2D Gaussian surfels with semantic-guided color enhancement to recover clean, consistent road surfaces. 
Our method leverages a planar-adapted Gaussian representation for efficient large-scale modeling, employs segmentation-guided video inpainting to remove both dynamic and static foreground objects, and enhances color coherence via semantic-aware correction in HSV space. Extensive experiments on urban-scale datasets demonstrate that our framework produces visually coherent and geometrically faithful reconstructions, significantly outperforming prior methods under real-world conditions.

[111] The Impact of Image Resolution on Face Detection: A Comparative Analysis of MTCNN, YOLOv XI and YOLOv XII models

Ahmet Can Ömercikoğlu,Mustafa Mansur Yönügül,Pakize Erdoğmuş

Main category: cs.CV

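TL;DR: This study examines the effect of input resolution on three deep learning face detection models (YOLOv11, YOLOv12, and MTCNN) and offers guidance on model selection under different conditions.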

Details Motivation: Real-world conditions such as low-resolution imagery pose significant challenges for detection performance, which motivates a study of how input resolution affects the accuracy and robustness of deep learning-based face detectors. Method: Using the WIDER FACE dataset, the models are evaluated extensively at multiple image resolutions (160x160, 320x320, and 640x640) with metrics including precision, recall, mAP50, mAP50-95, and inference time. Result: YOLOv11 outperforms YOLOv12 and MTCNN in detection accuracy, especially at higher resolutions, while YOLOv12 has slightly higher recall. MTCNN is competitive in landmark localization but lags in real-time inference speed. Conclusion: The findings provide practical insights for selecting resolution-aware face detection models suited to different operating conditions. Abstract: Face detection is a crucial component in many AI-driven applications such as surveillance, biometric authentication, and human-computer interaction. However, real-world conditions like low-resolution imagery present significant challenges that degrade detection performance. In this study, we systematically investigate the impact of input resolution on the accuracy and robustness of three prominent deep learning-based face detectors: YOLOv11, YOLOv12, and MTCNN. Using the WIDER FACE dataset, we conduct extensive evaluations across multiple image resolutions (160x160, 320x320, and 640x640) and assess each model's performance using metrics such as precision, recall, mAP50, mAP50-95, and inference time. Results indicate that YOLOv11 outperforms YOLOv12 and MTCNN in terms of detection accuracy, especially at higher resolutions, while YOLOv12 exhibits slightly better recall. MTCNN, although competitive in landmark localization, lags in real-time inference speed. Our findings provide actionable insights for selecting resolution-aware face detection models suitable for varying operational constraints.

[112] Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads

Yingjie Zhou,Jiezhang Cao,Zicheng Zhang,Farong Wen,Yanwei Jiang,Jun Jia,Xiaohong Liu,Xiongkuo Min,Guangtao Zhai

Main category: cs.CV

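TL;DR: This paper introduces the THQA-10K dataset and a new AGTH quality assessment method based on the first frame, Y-T slice, and tone-lip consistency, which achieves state-of-the-art performance.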

Details Motivation: AI-Generated Talking Heads (AGTHs) have gradually become an emerging digital human medium thanks to the rapid development of Text-to-Image (T2I) models, but the quality of the generated content remains a challenge and comprehensive studies are still limited. This work aims to fill that gap. Method: The authors select 12 prominent T2I models and 14 advanced talkers to generate AGTHs, use volunteers' subjective ratings and their analysis to evaluate the talkers' generalizability and quality, and propose an objective quality assessment method based on the first frame, Y-T slice, and tone-lip consistency. Result: The THQA-10K dataset contains 10,457 AGTHs, and the subjective experiments expose the distortions of existing AGTHs. The proposed objective quality assessment method shows state-of-the-art performance in experiments. Conclusion: The paper introduces THQA-10K, the largest AGTH quality assessment dataset, and proposes an objective quality assessment method based on the first frame, Y-T slice, and tone-lip consistency that achieves state-of-the-art performance in AGTH quality assessment. Abstract: Speech-driven methods for portraits are figuratively known as "Talkers" because of their capability to synthesize speaking mouth shapes and facial movements. Especially with the rapid development of the Text-to-Image (T2I) models, AI-Generated Talking Heads (AGTHs) have gradually become an emerging digital human media. However, challenges persist regarding the quality of these talkers and AGTHs they generate, and comprehensive studies addressing these issues remain limited. To address this gap, this paper presents the largest AGTH quality assessment dataset THQA-10K to date, which selects 12 prominent T2I models and 14 advanced talkers to generate AGTHs for 14 prompts. After excluding instances where AGTH generation is unsuccessful, the THQA-10K dataset contains 10,457 AGTHs. Then, volunteers are recruited to subjectively rate the AGTHs and give the corresponding distortion categories. In our analysis for subjective experimental results, we evaluate the performance of talkers in terms of generalizability and quality, and also expose the distortions of existing AGTHs. Finally, an objective quality assessment method based on the first frame, Y-T slice and tone-lip consistency is proposed. Experimental results show that this method can achieve state-of-the-art (SOTA) performance in AGTH quality assessment. The work is released at https://github.com/zyj-2000/Talker.

[113] IN45023 Neural Network Design Patterns in Computer Vision Seminar Report, Summer 2025

Radu-Andrei Bourceanu,Neil De La Fuente,Jan Grimm,Andrei Jardan,Andriy Manucharyan,Cornelius Weiss,Roman Pflugfelder

Main category: cs.CV

TL;DR: This report examines the evolution of key design patterns in computer vision by analyzing six influential papers on foundational architectures, generative models, and self-supervised learning techniques.

Details Motivation: The motivation for this report is to understand the evolution of key design patterns in computer vision and how they have contributed to advancements in the field. Method: The method involves analyzing six influential papers on computer vision to identify and examine key design patterns and their evolution over time. Result: The result is an analysis of six influential papers covering foundational architectures (ResNet, ViT), generative models (GANs, LDMs), and self-supervised learning techniques (DINO, MAE) that have advanced the state-of-the-art in computer vision. Conclusion: The report concludes that significant progress has been made in computer vision through the development of key design patterns, including foundational architectures like ResNet and ViT, generative models like GANs and LDMs, and self-supervised learning techniques like DINO and MAE. Abstract: This report analyzes the evolution of key design patterns in computer vision by examining six influential papers. The analysis begins with foundational architectures for image recognition. We review ResNet, which introduced residual connections to overcome the vanishing gradient problem and enable effective training of significantly deeper convolutional networks. Subsequently, we examine the Vision Transformer (ViT), which established a new paradigm by applying the Transformer architecture to sequences of image patches, demonstrating the efficacy of attention-based models for large-scale image recognition. Building on these visual representation backbones, we investigate generative models. Generative Adversarial Networks (GANs) are analyzed for their novel adversarial training process, which challenges a generator against a discriminator to learn complex data distributions. Then, Latent Diffusion Models (LDMs) are covered, which improve upon prior generative methods by performing a sequential denoising process in a perceptually compressed latent space. 
LDMs achieve high-fidelity synthesis with greater computational efficiency, representing the current state-of-the-art for image generation. Finally, we explore self-supervised learning techniques that reduce dependency on labeled data. DINO is a self-distillation framework in which a student network learns to match the output of a momentum-updated teacher, yielding features with strong k-NN classification performance. We conclude with Masked Autoencoders (MAE), which utilize an asymmetric encoder-decoder design to reconstruct heavily masked inputs, providing a highly scalable and effective method for pre-training large-scale vision models.
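
The "momentum-updated teacher" in the DINO description is commonly implemented as an exponential moving average (EMA) of the student's weights. The sketch below illustrates that update rule only. It assumes weights stored as a dictionary of flat float lists, and the default momentum of 0.996 is an illustrative value, not one taken from the report. The name `ema_update` is hypothetical.

```python
def ema_update(teacher, student, momentum=0.996):
    """Return new teacher weights: momentum * teacher + (1 - momentum) * student.

    `teacher` and `student` map parameter names to equal-length lists of floats
    (a stand-in for real tensors; this layout is an assumption of the sketch).
    """
    return {
        name: [
            momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher[name], student[name])
        ]
        for name in teacher
    }
```

Because the teacher receives no gradients and only drifts slowly toward the student, its outputs change smoothly over training, which is what makes it usable as a target for self-distillation.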

[114] Short-LVLM: Compressing and Accelerating Large Vision-Language Models by Pruning Redundant Layers

Ji Ma,Wei Suo,Peng Wang,Yanning Zhang

Main category: cs.CV

TL;DR: This paper proposes Short-LVLM, a training-free and model-agnostic framework for compressing large vision-language models by addressing key challenges in layer pruning, resulting in improved efficiency without sacrificing performance.

Details Motivation: Large vision-language models (LVLMs) face practical limitations due to their massive parameters and high computational costs. While layer pruning has been effective in NLP models, its applicability to LVLMs is unclear due to modality divergence. Method: The authors conducted extensive experiments to evaluate the effectiveness of existing layer pruning methods in LVLMs, identified key challenges, and proposed the Short-LVLM framework to overcome these issues. Result: Direct application of NLP layer pruning techniques to LVLMs is ineffective. The proposed Short-LVLM framework achieves a better balance between performance and efficiency, while maintaining training-free and model-agnostic properties. Conclusion: Short-LVLM (SVL) effectively addresses the challenges of layer pruning in LVLMs by utilizing important vision-language tokens and mitigating inter-layer feature gaps, offering a training-free, model-agnostic, and highly compatible compression solution. Abstract: Although large vision-language models (LVLMs) have demonstrated impressive capabilities in multi-modal understanding and reasoning, their practical applications are still limited by massive model parameters and high computational costs. Recent efforts from natural language processing (NLP) have shown the effectiveness of layer pruning, offering a plausible training-free compression solution. However, due to the modality divergence between vision and language, it is unclear whether these NLP techniques are still effective in LVLMs. In this paper, we empirically prove that directly applying these layer pruning methods to LVLMs is ineffective. Through extensive experiments, we find that non-essential vision-language (VL) tokens and inter-layer feature gaps pose critical challenges to pruning layers in LVLMs. Based on these insights, we propose a novel framework Short-LVLM (SVL) that can utilize important VL tokens and mitigate the layer-wise feature gaps. 
Notably, Short-LVLM not only achieves a superior trade-off between performance and efficiency but also exhibits several potential advantages, i.e., training-free, model-agnostic, and highly compatible. The code for this work is publicly available at https://github.com/ASGO-MM/Short-LVLM.

[115] VMatcher: State-Space Semi-Dense Local Feature Matching

Ali Youssef

Main category: cs.CV

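TL;DR: This paper proposes VMatcher, a hybrid Mamba-Transformer network for efficient image feature matching that substantially lowers computational cost while maintaining strong performance.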

Details Motivation: Existing Transformer-based feature matching methods perform well but are computationally expensive because of the quadratic complexity of the attention mechanism. Mamba achieves comparable performance with a linear-complexity state-space model, so this paper aims to combine the strengths of both in a more efficient and effective feature matching method. Method: VMatcher takes a hybrid approach that combines Mamba's Selective State-Space Model (SSM) with the Transformer's attention mechanism, and proposes multiple VMatcher configurations, including hierarchical architectures, to improve efficiency and performance. Result: VMatcher shows efficient and competitive performance on multiple benchmarks while keeping accuracy and robustness comparable to Transformers, demonstrating its potential for real-time applications. Conclusion: VMatcher combines the strengths of Mamba and the Transformer to deliver efficient and competitive semi-dense feature matching between image pairs, offering a practical and robust solution for real-time applications that need fast inference. Abstract: This paper introduces VMatcher, a hybrid Mamba-Transformer network for semi-dense feature matching between image pairs. Learning-based feature matching methods, whether detector-based or detector-free, achieve state-of-the-art performance but depend heavily on the Transformer's attention mechanism, which, while effective, incurs high computational costs due to its quadratic complexity. In contrast, Mamba introduces a Selective State-Space Model (SSM) that achieves comparable or superior performance with linear complexity, offering significant efficiency gains. VMatcher leverages a hybrid approach, integrating Mamba's highly efficient long-sequence processing with the Transformer's attention mechanism. Multiple VMatcher configurations are proposed, including hierarchical architectures, demonstrating their effectiveness in setting new benchmarks efficiently while ensuring robustness and practicality for real-time applications where rapid inference is crucial. Source Code is available at: https://github.com/ayoussf/VMatcher

[116] UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries

Yijie Zhu,Lingsen Zhang,Zitong Yu,Rui Shao,Tao Tan,Liqiang Nie

Main category: cs.CV

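TL;DR: This paper proposes UniEmo, a framework that unifies emotional understanding and generation and uses a dual feedback mechanism, significantly improving model performance.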

Details Motivation: Emotional understanding and generation are inherently complementary, and modeling them jointly lets each enhance the other. Method: The authors propose the UniEmo framework, which comprises a hierarchical emotional understanding chain, an emotional correlation coefficient and emotional condition loss, and a data filtering algorithm that realizes the dual feedback mechanism. Result: UniEmo significantly outperforms state-of-the-art methods on both emotional understanding and generation tasks. Conclusion: Through mutual feedback between the understanding and generation tasks, UniEmo significantly improves performance on both. Abstract: Emotional understanding and generation are often treated as separate tasks, yet they are inherently complementary and can mutually enhance each other. In this paper, we propose the UniEmo, a unified framework that seamlessly integrates these two tasks. The key challenge lies in the abstract nature of emotions, necessitating the extraction of visual representations beneficial for both tasks. To address this, we propose a hierarchical emotional understanding chain with learnable expert queries that progressively extracts multi-scale emotional features, thereby serving as a foundational step for unification. Simultaneously, we fuse these expert queries and emotional representations to guide the diffusion model in generating emotion-evoking images. To enhance the diversity and fidelity of the generated emotional images, we further introduce the emotional correlation coefficient and emotional condition loss into the fusion process. This step facilitates fusion and alignment for emotional generation guided by the understanding. In turn, we demonstrate that joint training allows the generation component to provide implicit feedback to the understanding part. Furthermore, we propose a novel data filtering algorithm to select high-quality and diverse emotional images generated by the well-trained model, which explicitly feedback into the understanding part. Together, these generation-driven dual feedback processes enhance the model's understanding capacity. Extensive experiments show that UniEmo significantly outperforms state-of-the-art methods in both emotional understanding and generation tasks. The code for the proposed method is available at https://github.com/JiuTian-VL/UniEmo.

[117] Multi-Prompt Progressive Alignment for Multi-Source Unsupervised Domain Adaptation

Haoran Chen,Zexiao Wang,Haidong Cao,Zuxuan Wu,Yu-Gang Jiang

Main category: cs.CV

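TL;DR: This paper proposes MP^2A, a CLIP-based progressive alignment method that addresses noise and domain gaps in multi-source unsupervised domain adaptation and achieves state-of-the-art performance on several benchmarks.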

Details Motivation: Existing methods use CLIP to generate pseudo-labels and align domains in one shot, but in multi-source settings domain differences and noise degrade performance. A more robust alignment strategy is needed to reduce the effect of noise and promote the learning of domain-invariant features. Method: The study proposes MP^2A, a progressive alignment method that starts training on high-confidence samples and gradually introduces more challenging samples to refine the model. Result: MP^2A achieves state-of-the-art performance on three popular UDA benchmarks (ImageCLEF, Office-Home, and DomainNet), validating its effectiveness. Conclusion: With its progressive alignment strategy, MP^2A effectively addresses domain gaps and noise in multi-source unsupervised domain adaptation and achieves state-of-the-art performance. Abstract: Large Vision-Language Models like CLIP have become a powerful foundation for Unsupervised Domain Adaptation due to their strong zero-shot generalization. State-of-the-art methods typically leverage CLIP to generate pseudo-labels for the target domain, then fine-tune the model to learn domain-invariant features. However, these methods attempt to align source and target domains using all pseudo-labeled data simultaneously. This one-shot alignment struggles with noisy, hard-to-classify samples, leading to error propagation and suboptimal feature learning. The problem is even more amplified in the multi-source scenario, where diverse domain gaps and varying noise levels across multiple source domains further destabilize the alignment process. To address this issue, in this work, we propose a progressive alignment strategy for adapting CLIP to unlabeled downstream task. Our method begins by training the model on a high-confidence subset of target samples, allowing it to first learn a well-aligned representation from the most reliable data. As training progresses, it gradually incorporates more challenging samples, guiding the model to refine its understanding without being overwhelmed by initial label noise. This progressive approach effectively mitigates confirmation bias and promotes a more robust convergence, allowing for the learning of genuinely domain-invariant features. We name our approach MP^2A and test it on three popular UDA benchmarks, namely ImageCLEF, Office-Home, and the most challenging DomainNet. 
Experiments showcase that MP^2A achieves state-of-the-art performance when compared with most recent CLIP-based MS-UDA approaches, demonstrating the effectiveness of our approach.
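
The progressive strategy described above (train first on high-confidence pseudo-labeled samples, then admit harder ones) can be sketched as a confidence threshold that relaxes over training rounds. This is an illustration of the general idea only. The linear schedule, the 0.9 and 0.5 endpoints, and the `(sample_id, pseudo_label, confidence)` tuple layout are all assumptions, since the abstract does not specify how MP^2A schedules its subsets.

```python
def confidence_threshold(round_idx, num_rounds, start=0.9, end=0.5):
    """Linearly relax the threshold from `start` to `end` (assumed schedule)."""
    if num_rounds <= 1:
        return end
    frac = round_idx / (num_rounds - 1)
    return start + (end - start) * frac


def select_subset(samples, threshold):
    """Keep pseudo-labeled samples whose confidence meets the threshold.

    Each sample is a (sample_id, pseudo_label, confidence) tuple.
    """
    return [s for s in samples if s[2] >= threshold]


def progressive_subsets(samples, num_rounds):
    """Yield (round index, training subset) pairs, growing the subset over time."""
    for r in range(num_rounds):
        yield r, select_subset(samples, confidence_threshold(r, num_rounds))
```

In a real pipeline the confidences would typically be recomputed from the adapted model between rounds, so that later subsets reflect improved pseudo-labels rather than the initial zero-shot ones.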

[118] NeRF Is a Valuable Assistant for 3D Gaussian Splatting

Shuangkang Fang,I-Chao Shen,Takeo Igarashi,Yufeng Wang,ZeSheng Wang,Yi Yang,Wenrui Ding,Shuchang Zhou

Main category: cs.CV

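TL;DR: NeRF-GS is a new 3D scene representation framework that combines the strengths of NeRF and 3DGS, improves performance through joint optimization, and shows that the two are complementary rather than competing.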

Details Motivation: NeRF-GS is proposed to address 3DGS's sensitivity to Gaussian initialization, limited spatial awareness, and weak inter-Gaussian correlations by drawing on NeRF's continuous spatial representation. Method: NeRF-GS shares 3D spatial information, progressively aligns the spatial features of NeRF and 3DGS, and optimizes residual vectors for implicit features and Gaussian positions to enhance the personalized capabilities of 3DGS. Result: Experiments on benchmark datasets show that NeRF-GS surpasses existing methods and achieves state-of-the-art performance. Conclusion: NeRF-GS introduces a new framework for jointly optimizing NeRF and 3DGS that combines the strengths of both to improve 3D scene representation. Abstract: We introduce NeRF-GS, a novel framework that jointly optimizes Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). This framework leverages the inherent continuous spatial representation of NeRF to mitigate several limitations of 3DGS, including sensitivity to Gaussian initialization, limited spatial awareness, and weak inter-Gaussian correlations, thereby enhancing its performance. In NeRF-GS, we revisit the design of 3DGS and progressively align its spatial features with NeRF, enabling both representations to be optimized within the same scene through shared 3D spatial information. We further address the formal distinctions between the two approaches by optimizing residual vectors for both implicit features and Gaussian positions to enhance the personalized capabilities of 3DGS. Experimental results on benchmark datasets show that NeRF-GS surpasses existing methods and achieves state-of-the-art performance. This outcome confirms that NeRF and 3DGS are complementary rather than competing, offering new insights into hybrid approaches that combine 3DGS and NeRF for efficient 3D scene representation.

[119] AGA: An adaptive group alignment framework for structured medical cross-modal representation learning

Wei Li,Xun Gong,Jiao Li,Xiaobin Sun

Main category: cs.CV

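TL;DR: This paper proposes Adaptive Grouped Alignment (AGA), a new framework that captures structured semantics from paired medical images and reports, addressing existing methods' neglect of the inherent structure of clinical reports and contrastive learning's reliance on large numbers of negative samples.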

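Details Motivation: Current medical vision-language pretraining methods often reduce clinical reports to single entities or fragmented tokens, ignoring their inherent structure. In addition, contrastive learning frameworks depend on large numbers of negative samples, which is impractical for small-scale medical datasets. Method: AGA introduces a bidirectional grouping mechanism based on a sparse similarity matrix. For each image-text pair, fine-grained similarities between text tokens and image patches are computed to form visual groups and language groups. Two threshold gating modules (Language Grouped Threshold Gate and Vision Grouped Threshold Gate) learn grouping thresholds dynamically, and group representations are computed as weighted averages based on similarity scores. An Instance Aware Group Alignment loss removes the need for external negatives, and a Bidirectional Cross-modal Grouped Alignment module enhances fine-grained alignment. Result: Extensive experiments on public and private datasets show strong performance on image-text retrieval and classification tasks under both fine-tuning and zero-shot settings. Conclusion: AGA effectively addresses the neglect of text structure and the reliance on negative samples in medical vision-language pretraining, offering a new approach to the paired analysis of medical images and reports. Abstract: Learning medical visual representations from paired images and reports is a promising direction in representation learning. However, current vision-language pretraining methods in the medical domain often simplify clinical reports into single entities or fragmented tokens, ignoring their inherent structure. In addition, contrastive learning frameworks typically depend on large quantities of hard negative samples, which is impractical for small-scale medical datasets. To tackle these challenges, we propose Adaptive Grouped Alignment (AGA), a new framework that captures structured semantics from paired medical images and reports. AGA introduces a bidirectional grouping mechanism based on a sparse similarity matrix. For each image-report pair, we compute fine-grained similarities between text tokens and image patches. Each token selects its top-matching patches to form a visual group, and each patch selects its most related tokens to form a language group. To enable adaptive grouping, we design two threshold gating modules, called Language Grouped Threshold Gate and Vision Grouped Threshold Gate, which learn grouping thresholds dynamically. Group representations are computed as weighted averages based on similarity scores. To align each token with its group representation, we introduce an Instance Aware Group Alignment loss that operates within each image-text pair, removing the need for external negatives. 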
Finally, a Bidirectional Cross-modal Grouped Alignment module is applied to enhance fine-grained alignment between visual and linguistic group representations. Extensive experiments on public and private datasets show that our method achieves strong performance on image-text retrieval and classification tasks under both fine-tuning and zero-shot settings.
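
The grouping step described above (a token selects its best-matching patches, and the group representation is a similarity-weighted average) can be sketched for a single token as follows. This is a simplified illustration, not AGA's implementation. The threshold is passed in as a fixed number, whereas the paper learns it with a gating module. Normalizing the weights by their sum is also an assumption, and `visual_group` is a hypothetical name.

```python
def visual_group(similarities, patch_embeddings, threshold):
    """Form one token's visual group.

    similarities:     list of token-to-patch similarity scores, one per patch
    patch_embeddings: list of equal-length float vectors, one per patch
    threshold:        fixed cut-off (learned dynamically in the paper)

    Returns (selected patch indices, weighted-average embedding or None).
    """
    selected = [i for i, s in enumerate(similarities) if s > threshold]
    total = sum(similarities[i] for i in selected)
    if not selected or total <= 0:
        return selected, None
    dim = len(patch_embeddings[0])
    group = [
        sum(similarities[i] * patch_embeddings[i][d] for i in selected) / total
        for d in range(dim)
    ]
    return selected, group
```

The language group for a patch would be built the same way with the roles of tokens and patches swapped, which is what makes the mechanism bidirectional.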

[120] Out-of-Distribution Detection in Medical Imaging via Diffusion Trajectories

Lemar Abdi,Francisco Caetano,Amaan Valiuddin,Christiaan Viviers,Hamdi Joudeh,Fons van der Sommen

Main category: cs.CV

TL;DR: This paper proposes a fast and efficient out-of-distribution detection method using a Stein score-based diffusion model (SBDDM) for medical imaging, achieving superior performance with minimal computational cost.

Details Motivation: The motivation is to overcome the limitations of current generative OOD detection methods in medical imaging, such as computational expense, unreliability, and the need for retraining, in order to efficiently, consistently, and robustly distinguish nominal from anomalous inputs. Method: The method uses forward diffusion trajectories of a Stein score-based denoising diffusion model (SBDDM) to capture trajectory curvature via the estimated Stein score, enabling reconstruction-free OOD detection with only five diffusion steps. Result: The SBDDM method achieved state-of-the-art performance on multiple Near-OOD and Far-OOD benchmarks with a relative improvement of up to 10.43% and 18.10% respectively, while significantly reducing computational cost during inference. Conclusion: The proposed SBDDM-based OOD detection method is highly effective and efficient, making it a practical solution for real-time, reliable computer-aided diagnosis in medical imaging. Abstract: In medical imaging, unsupervised out-of-distribution (OOD) detection offers an attractive approach for identifying pathological cases with extremely low incidence rates. In contrast to supervised methods, OOD-based approaches function without labels and are inherently robust to data imbalances. Current generative approaches often rely on likelihood estimation or reconstruction error, but these methods can be computationally expensive, unreliable, and require retraining if the inlier data changes. These limitations hinder their ability to distinguish nominal from anomalous inputs efficiently, consistently, and robustly. We propose a reconstruction-free OOD detection method that leverages the forward diffusion trajectories of a Stein score-based denoising diffusion model (SBDDM). By capturing trajectory curvature via the estimated Stein score, our approach enables accurate anomaly scoring with only five diffusion steps. 
A single SBDDM pre-trained on a large, semantically aligned medical dataset generalizes effectively across multiple Near-OOD and Far-OOD benchmarks, achieving state-of-the-art performance while drastically reducing computational cost during inference. Compared to existing methods, SBDDM achieves a relative improvement of up to 10.43% and 18.10% for Near-OOD and Far-OOD detection, making it a practical building block for real-time, reliable computer-aided diagnosis.

[121] Honey Adulteration Detection using Hyperspectral Imaging and Machine Learning

Mokhtar A. Al-Awadhi,Ratnadeep R. Deshmukh

Main category: cs.CV

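TL;DR: This paper proposes a machine learning-based system that uses honey hyperspectral imaging data to detect honey adulteration automatically, reaching an overall cross-validation accuracy of 96.39%.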

Details Motivation: To develop an efficient and accurate honey adulteration detection system that can replace current chemical-based detection methods. Method: A botanical origin identification subsystem first classifies the floral source of the honey sample, then an adulteration detection subsystem identifies sugar syrup adulteration and quantifies its concentration. Both subsystems use Linear Discriminant Analysis (LDA) for feature extraction and a K-Nearest Neighbors (KNN) model for classification. Result: Tested on a public honey hyperspectral image dataset, the system detects honey adulteration with a cross-validation accuracy of 96.39%. Conclusion: The system is an effective method for detecting honey adulteration and has the potential to replace current chemical-based detection methods. Abstract: This paper aims to develop a machine learning-based system for automatically detecting honey adulteration with sugar syrup, based on honey hyperspectral imaging data. First, the floral source of a honey sample is classified by a botanical origin identification subsystem. Then, the sugar syrup adulteration is identified, and its concentration is quantified by an adulteration detection subsystem. Both subsystems consist of two steps. The first step involves extracting relevant features from the honey sample using Linear Discriminant Analysis (LDA). In the second step, we utilize the K-Nearest Neighbors (KNN) model to classify the honey botanical origin in the first subsystem and identify the adulteration level in the second subsystem. We assess the proposed system performance on a public honey hyperspectral image dataset. The result indicates that the proposed system can detect adulteration in honey with an overall cross-validation accuracy of 96.39%, making it an appropriate alternative to the current chemical-based detection methods.
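
The second step of each subsystem, KNN classification, can be sketched in a few lines. This is a generic KNN with Euclidean distance and majority voting, not the authors' code. It assumes the LDA feature extraction has already been applied upstream, so each sample is a short feature vector. The value `k=3`, the label strings, and the name `knn_predict` are illustrative assumptions.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    train_x: list of feature vectors (assumed to be LDA-projected already)
    train_y: list of string class labels, aligned with train_x
    """
    # Pair each training sample's distance to the query with its label,
    # then sort so the closest samples come first.
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]
```

In the two-stage design described above, one such classifier would predict the botanical origin and a second one, presumably trained on adulteration labels, would predict the adulteration level.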

[122] Beyond Linear Bottlenecks: Spline-Based Knowledge Distillation for Culturally Diverse Art Style Classification

Abdellah Zakaria Sellam,Salah Eddine Bekhouche,Cosimo Distante,Abdelmalik Taleb-Ahmed

Main category: cs.CV

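TL;DR: This paper proposes a dual-teacher knowledge distillation framework based on Kolmogorov-Arnold Networks (KANs) for art style classification, addressing the shortcomings of conventional approaches in modeling global compositional context and complex style-feature interactions.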

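Details Motivation: Art style classification remains a major challenge in computational aesthetics because expertly labeled datasets are scarce and stylistic elements interact in complex, often nonlinear ways. Existing dual-teacher self-supervised frameworks reduce reliance on labeled data, but their linear projection layers and localized focus struggle to model global compositional context and complex style-feature interactions. Method: The dual-teacher knowledge distillation framework is improved by replacing conventional MLP projection and prediction heads with Kolmogorov-Arnold Networks (KANs). The approach retains complementary guidance from two teacher networks, one emphasizing localized texture and brushstroke patterns and the other capturing broader stylistic hierarchies, and uses KANs' spline-based activations to model nonlinear feature correlations precisely. Result: Experiments on WikiArt and Pandora18k show that the approach outperforms the base dual-teacher architecture in Top-1 accuracy. The findings highlight the importance of KANs in disentangling complex style manifolds, with better linear probe accuracy than MLP projections. Conclusion: By introducing a KAN-based dual-teacher knowledge distillation framework, the paper improves art style classification performance, particularly in modeling global stylistic hierarchies and nonlinear feature correlations. Abstract: Art style classification remains a formidable challenge in computational aesthetics due to the scarcity of expertly labeled datasets and the intricate, often nonlinear interplay of stylistic elements. While recent dual-teacher self-supervised frameworks reduce reliance on labeled data, their linear projection layers and localized focus struggle to model global compositional context and complex style-feature interactions. We enhance the dual-teacher knowledge distillation framework to address these limitations by replacing conventional MLP projection and prediction heads with Kolmogorov-Arnold Networks (KANs). Our approach retains complementary guidance from two teacher networks, one emphasizing localized texture and brushstroke patterns, the other capturing broader stylistic hierarchies while leveraging KANs' spline-based activations to model nonlinear feature correlations with mathematical precision. Experiments on WikiArt and Pandora18k demonstrate that our approach outperforms the base dual teacher architecture in Top-1 accuracy. Our findings highlight the importance of KANs in disentangling complex style manifolds, leading to better linear probe accuracy than MLP projections.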

[123] Adjustable Spatio-Spectral Hyperspectral Image Compression Network

Martin Hermann Paul Fuchs,Behnood Rasti,Begüm Demir

Main category: cs.CV

TL;DR: This paper proposes HyCASS, a learning-based adjustable compression network for hyperspectral images, effectively balancing spectral and spatial compression.

Details Motivation: Need for efficient hyperspectral data storage and understanding the effects of spectral and spatial compression on learning-based HSI compression. Method: Proposed HyCASS with six modules, including spectral and spatial encoders and decoders, to exploit spectral, spatial, and joint spatio-spectral redundancies. Result: Experiments on two HSI datasets showed HyCASS outperforms existing models and provides guidelines for balancing spectral and spatial compression. Conclusion: HyCASS provides an effective, adjustable model for HSI compression that balances spectral and spatial compression across different CRs. Abstract: With the rapid growth of hyperspectral data archives in remote sensing (RS), the need for efficient storage has become essential, driving significant attention toward learning-based hyperspectral image (HSI) compression. However, a comprehensive investigation of the individual and joint effects of spectral and spatial compression on learning-based HSI compression has not been thoroughly examined yet. Conducting such an analysis is crucial for understanding how the exploitation of spectral, spatial, and joint spatio-spectral redundancies affects HSI compression. To address this issue, we propose Adjustable Spatio-Spectral Hyperspectral Image Compression Network (HyCASS), a learning-based model designed for adjustable HSI compression in both spectral and spatial dimensions. HyCASS consists of six main modules: 1) spectral encoder; 2) spatial encoder; 3) compression ratio (CR) adapter encoder; 4) CR adapter decoder; 5) spatial decoder; and 6) spectral decoder module. The modules employ convolutional layers and transformer blocks to capture both short-range and long-range redundancies. Experimental results on two HSI benchmark datasets demonstrate the effectiveness of our proposed adjustable model compared to existing learning-based compression models. 
Based on our results, we establish a guideline for effectively balancing spectral and spatial compression across different CRs, taking into account the spatial resolution of the HSIs. Our code and pre-trained model weights are publicly available at https://git.tu-berlin.de/rsim/hycass .

[124] Machine learning and machine learned prediction in chest X-ray images

Shereiff Garrett,Abhinav Adhikari,Sarina Gautam,DaShawn Marquis Morris,Chandra Mani Adhikari

Main category: cs.CV

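TL;DR: This paper uses deep learning models to predict disease from chest X-ray images and finds that DenseNet-121 focuses on key image regions more effectively than a baseline CNN, suggesting greater diagnostic potential.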

Details Motivation: Machine learning and artificial intelligence use data to train algorithms, recognize patterns, and make predictions, offering efficient solutions to complex problems. This study explores their potential for medical image diagnosis. Method: Using 5824 chest X-ray images, the authors implement and compare two machine learning algorithms, a baseline CNN and DenseNet-121, for disease prediction, and use gradient-weighted class activation mapping to analyze which regions drive each model's decisions. Result: Both the baseline CNN and DenseNet-121 perform very well on the binary classification problem, but DenseNet-121 focuses more accurately on the key regions of the chest X-ray images in its decision-making. Conclusion: DenseNet-121 performs very well on binary classification of chest X-ray images and attends to the key parts of the image more than the baseline CNN does. Abstract: Machine learning and artificial intelligence are fast-growing fields of research in which data is used to train algorithms, learn patterns, and make predictions. This approach helps to solve seemingly intricate problems with significant accuracy without explicit programming by recognizing complex relationships in data. Taking an example of 5824 chest X-ray images, we implement two machine learning algorithms, namely, a baseline convolutional neural network (CNN) and a DenseNet-121, and present our analysis in making machine-learned predictions in predicting patients with ailments. Both baseline CNN and DenseNet-121 perform very well in the binary classification problem presented in this work. Gradient-weighted class activation mapping shows that DenseNet-121 correctly focuses on essential parts of the input chest X-ray images in its decision-making more than the baseline CNN.

[125] Mitigating Resolution-Drift in Federated Learning: Case of Keypoint Detection

Taeheon Lim,Joohyung Lee,Kyungjae Lee,Jungchan Cho

Main category: cs.CV

TL;DR: This paper introduces RAF, a resolution-adaptive federated learning approach, to address resolution drift in non-classification tasks like human pose estimation, achieving improved performance and generalizability.

Details Motivation: The motivation is to address the underexplored issue of resolution drift in federated learning for non-classification tasks such as human pose estimation. Method: The paper proposes RAF, a resolution-adaptive federated learning method using heatmap-based knowledge distillation to handle resolution variability. Result: The experiments show that RAF improves performance, mitigates resolution drift, and is compatible with existing FL frameworks. Conclusion: The paper concludes that RAF effectively mitigates resolution drift in federated learning for high-resolution tasks like human pose estimation and can be applied to other tasks requiring spatial detail preservation. Abstract: The Federated Learning (FL) approach enables effective learning across distributed systems, while preserving user data privacy. To date, research has primarily focused on addressing statistical heterogeneity and communication efficiency, through which FL has achieved success in classification tasks. However, its application to non-classification tasks, such as human pose estimation, remains underexplored. This paper identifies and investigates a critical issue termed "resolution-drift," where performance degrades significantly due to resolution variability across clients. Unlike class-level heterogeneity, resolution drift highlights the importance of resolution as another axis of not independent or identically distributed (non-IID) data. To address this issue, we present resolution-adaptive federated learning (RAF), a method that leverages heatmap-based knowledge distillation. Through multi-resolution knowledge distillation between higher-resolution outputs (teachers) and lower-resolution outputs (students), our approach enhances resolution robustness without overfitting. 
Extensive experiments and theoretical analysis demonstrate that RAF not only effectively mitigates resolution drift and achieves significant performance improvements, but also can be integrated seamlessly into existing FL frameworks. Furthermore, although this paper focuses on human pose estimation, our t-SNE analysis reveals distinct characteristics between classification and high-resolution representation tasks, supporting the generalizability of RAF to other tasks that rely on preserving spatial detail.
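
The idea of distilling from higher-resolution heatmaps (teachers) into lower-resolution heatmaps (students) can be illustrated with a generic loss. The abstract does not state how RAF aligns the two resolutions or which loss it uses, so the 2x average pooling and mean squared error below are assumptions chosen for simplicity. Heatmaps are plain nested lists with even height and width, and all function names are hypothetical.

```python
def avg_pool2x(heatmap):
    """Downsample a 2D heatmap by averaging each non-overlapping 2x2 block.

    Assumes the heatmap has even height and width.
    """
    h, w = len(heatmap), len(heatmap[0])
    return [
        [
            (heatmap[r][c] + heatmap[r][c + 1]
             + heatmap[r + 1][c] + heatmap[r + 1][c + 1]) / 4.0
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]


def mse(a, b):
    """Mean squared error between two equally sized 2D heatmaps."""
    n = len(a) * len(a[0])
    return sum(
        (x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)
    ) / n


def distillation_loss(teacher_heatmap, student_heatmap):
    """Compare the pooled teacher heatmap with the student heatmap (assumed form)."""
    return mse(avg_pool2x(teacher_heatmap), student_heatmap)
```

A term like this would be added to the usual keypoint loss on each client, so that low-resolution predictions stay consistent with what the same model predicts at higher resolution.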

[126] CST Anti-UAV: A Thermal Infrared Benchmark for Tiny UAV Tracking in Complex Scenes

Bin Xie,Congxuan Zhang,Fagan Wang,Peng Liu,Feng Lu,Zhen Chen,Weiming Hu

Main category: cs.CV

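TL;DR: CST Anti-UAV is a new thermal infrared dataset for tracking a single tiny UAV in complex scenes, containing 220 video sequences and over 240k high-quality bounding box annotations.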

Details Motivation: Existing UAV tracking datasets mainly feature conspicuous objects and lack diversity in scene complexity and attribute representation, which limits their applicability to real-world scenarios. Method: The authors present the CST Anti-UAV dataset, designed specifically for tracking a single tiny UAV in complex scenes, and evaluate 20 existing single object tracking methods on it. Result: Experiments show that tracking tiny UAVs in complex environments remains a challenge: the state-of-the-art method achieves a state accuracy of only 35.92% on CST Anti-UAV. Conclusion: The release of CST Anti-UAV fosters the development of more robust single object tracking methods and drives research innovation in anti-UAV systems. Abstract: The widespread application of Unmanned Aerial Vehicles (UAVs) has raised serious public safety and privacy concerns, making UAV perception crucial for anti-UAV tasks. However, existing UAV tracking datasets predominantly feature conspicuous objects and lack diversity in scene complexity and attribute representation, limiting their applicability to real-world scenarios. To overcome these limitations, we present the CST Anti-UAV, a new thermal infrared dataset specifically designed for Single Object Tracking (SOT) in Complex Scenes with Tiny UAVs (CST). It contains 220 video sequences with over 240k high-quality bounding box annotations, highlighting two key properties: a significant number of tiny-sized UAV targets and the diverse and complex scenes. To the best of our knowledge, CST Anti-UAV is the first dataset to incorporate complete manual frame-level attribute annotations, enabling precise evaluations under varied challenges. To conduct an in-depth performance analysis for CST Anti-UAV, we evaluate 20 existing SOT methods on the proposed dataset. Experimental results demonstrate that tracking tiny UAVs in complex environments remains a challenge, as the state-of-the-art method achieves only 35.92% state accuracy, much lower than the 67.69% observed on the Anti-UAV410 dataset. These findings underscore the limitations of existing benchmarks and the need for further advancements in UAV tracking research. The CST Anti-UAV benchmark is about to be publicly released, which not only fosters the development of more robust SOT methods but also drives innovation in anti-UAV systems.

[127] 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding

Ting Huang,Zeyu Zhang,Hao Tang

Main category: cs.CV

TL;DR: This paper proposes 3D-R1, which substantially improves the reasoning and generalization of 3D vision-language models by building the high-quality synthetic dataset Scene-30K and introducing an RLHF-style reinforcement learning policy and a dynamic view selection method.

Details Motivation: Current 3D vision-language models are limited in reasoning and generalization, mainly because of the scarcity of high-quality spatial data and the assumption of static viewpoints. Method: The authors build a high-quality synthetic dataset, Scene-30K, train with reinforcement learning using an RLHF policy such as GRPO with three reward functions (a perception reward, a semantic similarity reward, and a format reward), and introduce a dynamic view selection strategy. Result: 3D-R1 delivers an average improvement of 10% across multiple 3D scene benchmarks, demonstrating its effectiveness in enhancing reasoning and generalization. Conclusion: 3D-R1 effectively improves the reasoning and generalization of 3D vision-language models and offers a new solution for 3D scene understanding. Abstract: Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization due to limitations in high-quality spatial data and the static nature of viewpoint assumptions. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct a high-quality synthetic dataset with CoT, named Scene-30K, leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro. It serves as cold-start initialization data for 3D-R1. Moreover, we leverage RLHF policy such as GRPO in the reinforcement learning training process to enhance reasoning capabilities and introduce three reward functions: a perception reward, a semantic similarity reward and a format reward to maintain detection accuracy and answer semantic precision. Furthermore, we introduce a dynamic view selection strategy that adaptively chooses the most informative perspectives for 3D scene understanding. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks, highlighting its effectiveness in enhancing reasoning and generalization in 3D scene understanding. Code: https://github.com/AIGeeksGroup/3D-R1. Website: https://aigeeksgroup.github.io/3D-R1.
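
As a rough illustration of how several rewards can feed a GRPO-style update, the sketch below sums three per-response rewards and normalizes them within a group of sampled responses. The equal weights, the function names, and the toy reward values are assumptions made for illustration; the abstract does not specify how 3D-R1 combines its perception, semantic similarity, and format rewards.

```python
from statistics import mean, pstdev


def total_reward(perception, semantic_similarity, format_ok, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward terms (equal weights are an assumption)."""
    w_p, w_s, w_f = weights
    return w_p * perception + w_s * semantic_similarity + w_f * format_ok


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), which is the
    group-relative baseline that GRPO uses in place of a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of four sampled responses: (perception, semantic similarity, format).
group = [(0.9, 0.8, 1.0), (0.4, 0.6, 1.0), (0.7, 0.7, 0.0), (0.2, 0.3, 1.0)]
rewards = [total_reward(*sample) for sample in group]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and the rest a negative one, so no separate value network is needed to form the baseline.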

[128] Seeing More with Less: Video Capsule Endoscopy with Multi-Task Learning

Julia Werner,Oliver Bause,Julius Oexle,Maxime Le Floch,Franz Brinkmann,Jochen Hampe,Oliver Bringmann

Main category: cs.CV

TL;DR: This paper presents a compact multi-task AI model for video capsule endoscopy that improves battery life by combining self-localization and anomaly detection, achieving state-of-the-art results with fewer parameters.

Details Motivation: The motivation is to address the limited battery life of video capsule endoscopy devices by integrating artificial intelligence for intelligent real-time decision-making, reducing energy consumption without compromising performance. Method: The authors developed a multi-task neural network that combines self-localization and anomaly detection for the gastrointestinal tract. They used the Galar dataset and integrated established multi-task methods with Viterbi decoding for time-series analysis, ensuring the model remains compact with only 1 million parameters. Result: The proposed model achieved an accuracy of 93.63% on the self-localization task and 87.48% on anomaly detection, outperforming current single-task models while maintaining a compact size. Conclusion: The paper concludes that their multi-task neural network model improves energy efficiency and performance in video capsule endoscopy, achieving high accuracy in both self-localization and anomaly detection tasks while using fewer parameters than existing models. Abstract: Video capsule endoscopy has become increasingly important for investigating the small intestine within the gastrointestinal tract. However, a persistent challenge remains the short battery lifetime of such compact sensor edge devices. Integrating artificial intelligence can help overcome this limitation by enabling intelligent real-time decision-making, thereby reducing the energy consumption and prolonging the battery life. However, this remains challenging due to data sparsity and the limited resources of the device restricting the overall model size. In this work, we introduce a multi-task neural network that combines the functionalities of precise self-localization within the gastrointestinal tract with the ability to detect anomalies in the small intestine within a single model. Throughout the development process, we consistently restricted the total number of parameters to ensure the feasibility to deploy such model in a small capsule. We report the first multi-task results using the recently published Galar dataset, integrating established multi-task methods and Viterbi decoding for subsequent time-series analysis. This outperforms current single-task models and represents a significant advance in AI-based approaches in this field. Our model achieves an accuracy of 93.63% on the localization task and an accuracy of 87.48% on the anomaly detection task. The approach requires only 1 million parameters while surpassing the current baselines.
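
To make the role of Viterbi decoding concrete, the following generic sketch smooths noisy per-frame section predictions into a sequence that only moves forward along the tract. The three-state layout, the transition probabilities, and the per-frame scores are invented for illustration; the abstract does not describe how the authors configure their decoder.

```python
import math

NEG_INF = float("-inf")


def viterbi(log_emissions, log_transitions, log_prior):
    """Return the most likely state sequence for per-frame log-scores.

    log_emissions[t][s]   : log-score of state s at frame t
    log_transitions[i][j] : log-probability of moving from state i to j
    log_prior[s]          : log-probability of starting in state s
    """
    n_states = len(log_prior)
    score = [log_prior[s] + log_emissions[0][s] for s in range(n_states)]
    backpointers = []
    for frame in log_emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: score[i] + log_transitions[i][j])
            new_score.append(score[best_i] + log_transitions[best_i][j] + frame[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical per-frame class probabilities for three consecutive tract sections.
frame_probs = [
    [0.80, 0.15, 0.05], [0.70, 0.20, 0.10], [0.30, 0.60, 0.10],
    [0.60, 0.30, 0.10], [0.20, 0.70, 0.10], [0.10, 0.80, 0.10],
    [0.10, 0.30, 0.60], [0.10, 0.60, 0.30], [0.05, 0.15, 0.80],
]
frames = [[math.log(p) for p in row] for row in frame_probs]

# Left-to-right transitions: stay or advance one section, never move backward.
transitions = [
    [math.log(0.9), math.log(0.1), NEG_INF],
    [NEG_INF, math.log(0.9), math.log(0.1)],
    [NEG_INF, NEG_INF, 0.0],
]
prior = [0.0, NEG_INF, NEG_INF]  # assume the capsule starts in the first section

raw = [row.index(max(row)) for row in frame_probs]
smoothed = viterbi(frames, transitions, prior)
```

Because backward transitions carry a log-probability of negative infinity, the decoded sequence never revisits an earlier section even though the raw per-frame argmax does.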

[129] FastPoint: Accelerating 3D Point Cloud Model Inference via Sample Point Distance Prediction

Donghyun Lee,Dawoon Jeong,Jae W. Lee,Hongil Yoon

Main category: cs.CV

TL;DR: FastPoint accelerates 3D point cloud processing by predicting distance trends during sampling, significantly improving speed without sacrificing accuracy.

Details Motivation: Handling large and irregular 3D point clouds efficiently remains challenging despite advancements in deep neural networks. Method: FastPoint leverages the predictable distance trend between sampled points during farthest point sampling to efficiently identify subsequent sample points without exhaustive pairwise distance computation. Result: Integrating FastPoint into state-of-the-art 3D point cloud models achieves a 2.55x end-to-end speedup on an NVIDIA RTX 3090 GPU without compromising accuracy. Conclusion: FastPoint preserves sampling quality and model performance while significantly accelerating farthest point sampling and neighbor search operations. Abstract: Deep neural networks have revolutionized 3D point cloud processing, yet efficiently handling large and irregular point clouds remains challenging. To tackle this problem, we introduce FastPoint, a novel software-based acceleration technique that leverages the predictable distance trend between sampled points during farthest point sampling. By predicting the distance curve, we can efficiently identify subsequent sample points without exhaustively computing all pairwise distances. Our proposal substantially accelerates farthest point sampling and neighbor search operations while preserving sampling quality and model performance. By integrating FastPoint into state-of-the-art 3D point cloud models, we achieve 2.55x end-to-end speedup on NVIDIA RTX 3090 GPU without sacrificing accuracy.
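
For context, the sketch below is the standard exhaustive farthest point sampling loop that methods like FastPoint aim to accelerate; it is not FastPoint itself. It also records the distance of each new sample to the already selected set, a sequence that never increases in exact FPS and that may be the kind of predictable trend the abstract refers to. The function name and the toy grid are assumptions.

```python
def farthest_point_sampling(points, num_samples, start_index=0):
    """Baseline FPS: repeatedly pick the point farthest from the selected set.

    Returns the selected indices and, for every sample after the first, its
    distance to the previously selected set.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    selected = [start_index]
    # Squared distance from every point to its nearest selected point.
    min_sq = [sq_dist(p, points[start_index]) for p in points]
    sample_distances = []
    while len(selected) < num_samples:
        next_index = max(range(len(points)), key=lambda i: min_sq[i])
        sample_distances.append(min_sq[next_index] ** 0.5)
        selected.append(next_index)
        # Exhaustive update: one distance computation per point per sample.
        for i, p in enumerate(points):
            d = sq_dist(p, points[next_index])
            if d < min_sq[i]:
                min_sq[i] = d
    return selected, sample_distances


# Toy point cloud: a flat 5 x 5 grid.
points = [(float(x), float(y), 0.0) for x in range(5) for y in range(5)]
selected, sample_distances = farthest_point_sampling(points, 6)
```

The inner update loop touches every point for every new sample, which is the cost an acceleration technique has to avoid.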

[130] Stable-Sim2Real: Exploring Simulation of Real-Captured 3D Data with Two-Stage Depth Diffusion

Mutian Xu,Chongjie Ye,Haolin Liu,Yushuang Wu,Jiahao Chang,Xiaoguang Han

Main category: cs.CV

TL;DR: This paper proposes Stable-Sim2Real, a new 3D data simulation method based on a two-stage depth diffusion model that aims to make simulated 3D data more similar to real-captured data, and validates the method with a new benchmark scheme.

Details Motivation: Existing 3D data simulation methods usually rely on predefined physical priors and struggle to capture the full complexity of real data. A solution is therefore needed that implicitly learns the mapping from synthetic to real data in a data-driven manner. Method: The paper proposes Stable-Sim2Real, a new data-driven 3D simulation method built on a novel two-stage depth diffusion model. The first stage fine-tunes Stable-Diffusion to generate the residual between real and synthetic paired depth, producing a stable but coarse depth map; the second stage adjusts the diffusion loss to prioritize the distinct regions identified by a 3D discriminator, further refining the depth map. Result: Experiments show that training with the 3D simulated data generated by this method significantly improves performance on real-world 3D visual tasks. The evaluation also shows that the simulated data closely resembles real-captured patterns. Conclusion: Stable-Sim2Real offers an effective 3D data simulation solution that significantly narrows the gap between simulated and real data, and provides a new benchmark scheme for future research. Abstract: 3D data simulation aims to bridge the gap between simulated and real-captured 3D data, which is a fundamental problem for real-world 3D visual tasks. Most 3D data simulation methods inject predefined physical priors but struggle to capture the full complexity of real data. An optimal approach involves learning an implicit mapping from synthetic to realistic data in a data-driven manner, but progress in this solution has met stagnation in recent studies. This work explores a new solution path of data-driven 3D simulation, called Stable-Sim2Real, based on a novel two-stage depth diffusion model. The initial stage finetunes Stable-Diffusion to generate the residual between the real and synthetic paired depth, producing a stable but coarse depth, where some local regions may deviate from realistic patterns. To enhance this, both the synthetic and initial output depth are fed into a second-stage diffusion, where diffusion loss is adjusted to prioritize these distinct areas identified by a 3D discriminator. We provide a new benchmark scheme to evaluate 3D data simulation methods. Extensive experiments show that training the network with the 3D simulated data derived from our method significantly enhances performance in real-world 3D visual tasks. Moreover, the evaluation demonstrates the high similarity between our 3D simulated data and real-captured patterns. Project page: https://mutianxu.github.io/stable-sim2real/.

[131] Online Estimation of Table-Top Grown Strawberry Mass in Field Conditions with Occlusions

Jinshan Zhen,Yuanyue Ge,Tianxiao Zhu,Hui Zhao,Ya Xiong

Main category: cs.CV

TL;DR: This study develops a method for real-time strawberry mass estimation using RGB-D sensing and deep learning that handles occlusions effectively and reaches mean errors of 8.11% (unoccluded) and 10.47% (occluded) under field conditions.

Details Motivation: Under field conditions, frequent occlusions and pose variations make traditional strawberry mass estimation difficult, so a non-destructive, real-time, online mass estimation method is needed. Method: YOLOv8-Seg performs instance segmentation, CycleGAN completes occluded regions, tilt-angle correction refines the frontal projection area calculation, and a polynomial regression model maps the geometric features to mass. Result: Mean mass estimation error was 8.11% for unoccluded strawberries and 10.47% for occluded ones. CycleGAN outperformed the LaMa model in occlusion recovery, achieving better pixel area ratios (PAR) and higher intersection over union (IoU) scores. Conclusion: The study proposes a vision-based pipeline combining RGB-D sensing and deep learning that addresses strawberry mass estimation under occlusion and pose variation in the field, offering a feasible solution for automated harvesting and yield monitoring. Abstract: Accurate mass estimation of table-top grown strawberries under field conditions remains challenging due to frequent occlusions and pose variations. This study proposes a vision-based pipeline integrating RGB-D sensing and deep learning to enable non-destructive, real-time and online mass estimation. The method employed YOLOv8-Seg for instance segmentation, Cycle-consistent generative adversarial network (CycleGAN) for occluded region completion, and tilt-angle correction to refine frontal projection area calculations. A polynomial regression model then mapped the geometric features to mass. Experiments demonstrated mean mass estimation errors of 8.11% for isolated strawberries and 10.47% for occluded cases. CycleGAN outperformed large mask inpainting (LaMa) model in occlusion recovery, achieving superior pixel area ratios (PAR) (mean: 0.978 vs. 1.112) and higher intersection over union (IoU) scores (92.3% vs. 47.7% in the [0.9-1] range). This approach addresses critical limitations of traditional methods, offering a robust solution for automated harvesting and yield monitoring with complex occlusion patterns.
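
The two occlusion-recovery metrics can be sketched on binary masks as follows. IoU is the standard intersection over union; the pixel area ratio is assumed here to be the completed mask area divided by the ground-truth area, since the abstract does not define it. The toy masks are invented for illustration.

```python
def mask_metrics(predicted, ground_truth):
    """Compute IoU and an assumed pixel area ratio (PAR) for two binary masks.

    Masks are lists of rows containing 0/1 values. PAR is taken to be
    predicted area / ground-truth area, which is an assumption.
    """
    pred = {(r, c) for r, row in enumerate(predicted) for c, v in enumerate(row) if v}
    truth = {(r, c) for r, row in enumerate(ground_truth) for c, v in enumerate(row) if v}
    if not truth:
        raise ValueError("ground-truth mask is empty")
    iou = len(pred & truth) / len(pred | truth)
    par = len(pred) / len(truth)
    return iou, par


# Ground truth: a 3 x 3 fruit region inside a 4 x 4 image.
ground_truth = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
# Completed mask: misses one true pixel and adds one false pixel.
predicted = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
iou, par = mask_metrics(predicted, ground_truth)
```

In this toy case the areas match exactly while the overlap does not, which shows why an area ratio and IoU are reported together: the first checks size, the second checks placement.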

[132] Hyperbolic Cycle Alignment for Infrared-Visible Image Fusion

Timing Li,Bing Cao,Jiahe Feng,Haifang Cao,Qinghau Hu,Pengfei Zhu

Main category: cs.CV

TL;DR: The paper introduces a novel image registration method called Hy-CycleAlign that uses hyperbolic space to effectively handle cross-modal misalignment, significantly outperforming existing approaches in both alignment and fusion quality.

Details Motivation: The motivation is to overcome the limitations of existing registration methods that fail to handle cross-modal misalignment effectively by exploring image alignment in non-Euclidean (specifically hyperbolic) space. Method: The paper proposes a Hyperbolic Cycle Alignment Network (Hy-CycleAlign), which introduces a dual-path cross-modal cyclic registration framework and a Hyperbolic Hierarchy Contrastive Alignment (H²CA) module to improve multi-modal image registration. Result: Extensive experiments demonstrate that the proposed method significantly outperforms existing approaches in aligning and fusing multi-modal images, proving the sensitivity and effectiveness of hyperbolic space for multi-modal image registration. Conclusion: The paper concludes that the proposed Hy-CycleAlign method significantly outperforms existing approaches in both image alignment and fusion, establishing its effectiveness in handling cross-modal misalignment through hyperbolic space exploration. Abstract: Image fusion synthesizes complementary information from multiple sources, mitigating the inherent limitations of unimodal imaging systems. Accurate image registration is essential for effective multi-source data fusion. However, existing registration methods, often based on image translation in Euclidean space, fail to handle cross-modal misalignment effectively, resulting in suboptimal alignment and fusion quality. To overcome this limitation, we explore image alignment in non-Euclidean space and propose a Hyperbolic Cycle Alignment Network (Hy-CycleAlign). To the best of our knowledge, Hy-CycleAlign is the first image registration method based on hyperbolic space. It introduces a dual-path cross-modal cyclic registration framework, in which a forward registration network aligns cross-modal inputs, while a backward registration network reconstructs the original image, forming a closed-loop registration structure with geometric consistency. 
Additionally, we design a Hyperbolic Hierarchy Contrastive Alignment (H$^{2}$CA) module, which maps images into hyperbolic space and imposes registration constraints, effectively reducing interference caused by modality discrepancies. We further analyze image registration in both Euclidean and hyperbolic spaces, demonstrating that hyperbolic space enables more sensitive and effective multi-modal image registration. Extensive experiments on misaligned multi-modal images demonstrate that our method significantly outperforms existing approaches in both image alignment and fusion. Our code will be publicly available.

[133] I Am Big, You Are Little; I Am Right, You Are Wrong

David A. Kelly,Akchunya Chanchal,Nathan Blake

Main category: cs.CV

TL;DR: This paper explores how different image classification models focus on specific pixels, showing that model architecture affects decision-making and that misclassifications involve larger pixel sets.

Details Motivation: The motivation is to understand how different image classification models work by examining their decision-making processes, particularly focusing on choosing the right model as the field rapidly develops. Method: The authors propose using minimal sufficient pixel sets to assess a model's 'concentration' to gain insight into the decision-making process of different vision models. Result: The study identifies that different model architectures have varying levels of concentration regarding pixel sets, and misclassified images are linked to larger pixel sets compared to correctly classified images. Conclusion: Different architectures have statistically different concentration in both size and position of pixel sets, particularly distinguishing ConvNext and EVA models, and misclassifications are associated with larger pixel sets. Abstract: Machine learning for image classification is an active and rapidly developing field. With the proliferation of classifiers of different sizes and different architectures, the problem of choosing the right model becomes more and more important. While we can assess a model's classification accuracy statistically, our understanding of the way these models work is unfortunately limited. In order to gain insight into the decision-making process of different vision models, we propose using minimal sufficient pixel sets to gauge a model's 'concentration': the pixels that capture the essence of an image through the lens of the model. By comparing position, overlap, and size of sets of pixels, we identify that different architectures have statistically different concentration, in both size and position. In particular, ConvNext and EVA models differ markedly from the others. We also identify that images which are misclassified are associated with larger pixel sets than correct classifications.

[134] ART: Adaptive Relation Tuning for Generalized Relation Prediction

Gopika Sudhakaran,Hikaru Shindo,Patrick Schramowski,Simone Schaub-Meyer,Kristian Kersting,Stefan Roth

Main category: cs.CV

TL;DR: ART improves visual relation detection by adapting vision-language models via instruction tuning, enabling better generalization and handling of novel relations.

Details Motivation: VRD models struggle with generalization, especially for novel or complex relations, and existing prompt tuning methods are limited in this regard. Method: ART converts VRD datasets into an instruction tuning format and employs adaptive sampling to focus on informative relations. Result: ART outperforms baselines and demonstrates the ability to infer unseen relations, which is not possible with mainstream VRD approaches. Conclusion: The proposed ART framework effectively adapts vision-language models for visual relation detection through instruction tuning, offering improved generalization and the ability to infer unseen relation concepts. Abstract: Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene. VRD models trained solely on relation detection data struggle to generalize beyond the relations on which they are trained. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it uses handcrafted prompts and struggles with novel or complex relations. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We thus introduce ART, an Adaptive Relation Tuning framework that adapts VLMs for VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instruction tuning format and employing an adaptive sampling algorithm, ART directs the VLM to focus on informative relations while maintaining generalizability. Specifically, we focus on the relation classification, where subject-object boxes are given and the model predicts the predicate between them. We tune on a held-in set and evaluate across multiple held-out datasets of varying complexity. Our approach strongly improves over its baselines and can infer unseen relation concepts, a capability absent in mainstream VRD methods. We demonstrate ART's practical value by using the predicted relations for segmenting complex scenes.

[135] 3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection

Yung-Hsu Yang,Luigi Piccinelli,Mattia Segu,Siyuan Li,Rui Huang,Yuqian Fu,Marc Pollefeys,Hermann Blum,Zuria Bauer

Main category: cs.CV

TL;DR: This paper proposes 3D-MOOD, an end-to-end monocular 3D object detection method for open-set settings, and achieves state-of-the-art performance on multiple datasets.

Details Motivation: Existing monocular 3D object detection methods are confined to closed-set settings, whereas real-world applications often involve new environments and novel object categories, so a detection method for open-set settings is needed. Method: The authors design an end-to-end 3D Monocular Open-set Object Detector (3D-MOOD) that lifts open-set 2D detection into 3D space through a 3D bounding box head, and design a canonical image space for more efficient cross-dataset training. Result: 3D-MOOD achieves state-of-the-art results in both closed-set (Omni3D) and open-set (Omni3D to Argoverse 2, ScanNet) evaluations. Conclusion: The paper presents a new monocular 3D object detection method suited to open-set settings that achieves state-of-the-art results on multiple datasets. Abstract: Monocular 3D object detection is valuable for various applications such as robotics and AR/VR. Existing methods are confined to closed-set settings, where the training and testing sets consist of the same scenes and/or object categories. However, real-world applications often introduce new environments and novel object categories, posing a challenge to these methods. In this paper, we address monocular 3D object detection in an open-set setting and introduce the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD). We propose to lift the open-set 2D detection into 3D space through our designed 3D bounding box head, enabling end-to-end joint training for both 2D and 3D tasks to yield better overall performance. We condition the object queries with geometry prior and overcome the generalization for 3D estimation across diverse scenes. To further improve performance, we design the canonical image space for more efficient cross-dataset training. We evaluate 3D-MOOD on both closed-set settings (Omni3D) and open-set settings (Omni3D to Argoverse 2, ScanNet), and achieve new state-of-the-art results. Code and models are available at royyang0714.github.io/3D-MOOD.

[136] Gaussian Splatting Feature Fields for Privacy-Preserving Visual Localization

Maxime Pietrantoni,Gabriela Csurka,Torsten Sattler

Main category: cs.CV

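TL;DR: This paper proposes GSFFs, a new visual localization method that combines 3D Gaussian Splatting with an implicit feature field to achieve accurate, privacy-preserving camera pose estimation.

Details Motivation: Visual localization requires estimating a camera pose in a known environment; this paper aims to combine the strengths of an explicit 3D geometry model and an implicit feature field to improve localization accuracy while preserving privacy. Method: Leveraging the dense geometric information and differentiable rasterization algorithm of 3D Gaussian Splatting (3DGS), the paper proposes Gaussian Splatting Feature Fields (GSFFs). A contrastive framework aligns a 3D scale-aware feature field with a 2D feature encoder, a 3D structure-informed clustering procedure regularizes representation learning, and pose refinement is performed by aligning feature maps or segmentations. Result: Evaluated on multiple real-world datasets, the method shows state-of-the-art performance for both the privacy-preserving and non-privacy-preserving localization pipelines. Conclusion: The paper proposes a visual localization method based on Gaussian Splatting Feature Fields (GSFFs) that combines an explicit geometry model with an implicit feature field, achieving accurate, privacy-preserving localization with state-of-the-art performance on multiple real-world datasets. Abstract: Visual localization is the task of estimating a camera pose in a known environment. In this paper, we utilize 3D Gaussian Splatting (3DGS)-based representations for accurate and privacy-preserving visual localization. We propose Gaussian Splatting Feature Fields (GSFFs), a scene representation for visual localization that combines an explicit geometry model (3DGS) with an implicit feature field. We leverage the dense geometric information and differentiable rasterization algorithm from 3DGS to learn robust feature representations grounded in 3D. In particular, we align a 3D scale-aware feature field and a 2D feature encoder in a common embedding space through a contrastive framework. Using a 3D structure-informed clustering procedure, we further regularize the representation learning and seamlessly convert the features to segmentations, which can be used for privacy-preserving visual localization. Pose refinement, which involves aligning either feature maps or segmentations from a query image with those rendered from the GSFFs scene representation, is used to achieve localization. The resulting privacy- and non-privacy-preserving localization pipelines, evaluated on multiple real-world datasets, show state-of-the-art performances.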

[137] Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation

Sobhan Asasi,Mohamed Ilyas Lakhal,Ozge Mercanoglu Sincan,Richard Bowden

Main category: cs.CV

TL;DR: This paper proposes BeyondGloss, a sign language translation framework that improves translation performance by generating fine-grained, temporally-aware descriptions of hand motion.

Details Motivation: Existing VideoLLMs struggle to model long videos in detail, and sign language translation must bridge the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements. Method: The authors propose a new alignment module, distill fine-grained features from HaMeR, and apply a contrastive loss to reduce the modality gap. Result: BeyondGloss achieves state-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks. Conclusion: BeyondGloss achieves state-of-the-art performance, demonstrating the effectiveness of the proposed framework. Abstract: Sign Language Translation (SLT) is a challenging task that requires bridging the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements. To address these challenges, we introduce BeyondGloss, a novel gloss-free SLT framework that leverages the spatio-temporal reasoning capabilities of Video Large Language Models (VideoLLMs). Since existing VideoLLMs struggle to model long videos in detail, we propose a novel approach to generate fine-grained, temporally-aware textual descriptions of hand motion. A contrastive alignment module aligns these descriptions with video features during pre-training, encouraging the model to focus on hand-centric temporal dynamics and distinguish signs more effectively. To further enrich hand-specific representations, we distill fine-grained features from HaMeR. Additionally, we apply a contrastive loss between sign video representations and target language embeddings to reduce the modality gap in pre-training. BeyondGloss achieves state-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks, demonstrating the effectiveness of the proposed framework. We will release the code upon acceptance of the paper.

[138] MamV2XCalib: V2X-based Target-less Infrastructure Camera Calibration with State Space Model

Yaoye Zhu,Zhe Wang,Yan Wang

Main category: cs.CV

TL;DR: This paper proposes MamV2XCalib, a new V2X-based infrastructure camera calibration method that uses vehicle-side LiDAR to achieve large-scale precise calibration without specific reference objects or manual intervention, and validates its effectiveness and robustness on multiple datasets.

Details Motivation: Traditional manual calibration methods are often time-consuming and labor-intensive and may require road closures, so a more efficient, automated calibration method is needed. Method: The method combines multi-scale features and a 4D correlation volume to estimate the correlation between vehicle-side point clouds and roadside images, and uses Mamba to model temporal information and estimate rotation angles. Result: Evaluation on the V2X-Seq and TUMTraf-V2X real-world datasets shows that, compared with previous LiDAR-camera calibration methods designed for a single vehicle, MamV2XCalib achieves better and more stable calibration performance in V2X scenarios with fewer parameters. Conclusion: MamV2XCalib is a V2X-based infrastructure camera calibration method that uses vehicle-side LiDAR to achieve large-scale precise calibration without specific reference objects or manual intervention. Abstract: As cooperative systems that leverage roadside cameras to assist autonomous vehicle perception become increasingly widespread, large-scale precise calibration of infrastructure cameras has become a critical issue. Traditional manual calibration methods are often time-consuming, labor-intensive, and may require road closures. This paper proposes MamV2XCalib, the first V2X-based infrastructure camera calibration method with the assistance of vehicle-side LiDAR. MamV2XCalib only requires autonomous vehicles equipped with LiDAR to drive near the cameras to be calibrated in the infrastructure, without the need for specific reference objects or manual intervention. We also introduce a new targetless LiDAR-camera calibration method, which combines multi-scale features and a 4D correlation volume to estimate the correlation between vehicle-side point clouds and roadside images. We model the temporal information and estimate the rotation angles with Mamba, effectively addressing calibration failures in V2X scenarios caused by defects in the vehicle-side data (such as occlusions) and large differences in viewpoint. We evaluate MamV2XCalib on the V2X-Seq and TUMTraf-V2X real-world datasets, demonstrating the effectiveness and robustness of our V2X-based automatic calibration approach. Compared to previous LiDAR-camera methods designed for calibration on one car, our approach achieves better and more stable calibration performance in V2X scenarios with fewer parameters. The code is available at https://github.com/zhuyaoye/MamV2XCalib.

[139] MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction

Zijian Dong,Longteng Duan,Jie Song,Michael J. Black,Andreas Geiger

Main category: cs.CV

TL;DR: MoGA combines a 2D diffusion model with a 3D generative model to reconstruct high-quality, animatable 3D Gaussian avatars from a single-view image, addressing the shortcomings of existing methods in 3D consistency and detail.

Details Motivation: Existing methods rely on sparse and inconsistent views generated by 2D diffusion models, which leads to blurred and unrealistic 3D reconstructions. A new method is therefore needed that ensures 3D consistency and realistic detail. Method: MoGA uses a model inversion approach: it projects the input image into the latent space of the generative model and enforces 3D appearance and geometric constraints to fit synthetic views from 2D diffusion models. Result: Experiments show that MoGA surpasses state-of-the-art methods in reconstruction quality and generalizes well to real-world scenarios, and that the resulting 3D Gaussian avatars are animatable. Conclusion: By combining a 3D generative model with 2D diffusion models, MoGA reconstructs high-fidelity, animatable 3D Gaussian avatars from a single-view image, overcoming the shortcomings of existing methods in geometric consistency and realistic detail. Abstract: We present MoGA, a novel method to reconstruct high-fidelity 3D Gaussian avatars from a single-view image. The main challenge lies in inferring unseen appearance and geometric details while ensuring 3D consistency and realism. Most previous methods rely on 2D diffusion models to synthesize unseen views; however, these generated views are sparse and inconsistent, resulting in unrealistic 3D artifacts and blurred appearance. To address these limitations, we leverage a generative avatar model that can generate diverse 3D avatars by sampling deformed Gaussians from a learned prior distribution. Due to the limited amount of 3D training data, such a 3D model alone cannot capture all image details of unseen identities. Consequently, we integrate it as a prior, ensuring 3D consistency by projecting input images into its latent space and enforcing additional 3D appearance and geometric constraints. Our novel approach formulates Gaussian avatar creation as a model inversion process by fitting the generative avatar to synthetic views from 2D diffusion models. The generative avatar provides a meaningful initialization for model fitting, enforces 3D regularization, and helps in refining pose estimation. Experiments show that our method surpasses state-of-the-art techniques and generalizes well to real-world scenarios. Our Gaussian avatars are also inherently animatable.

[140] DA-Occ: Efficient 3D Voxel Occupancy Prediction via Directional 2D for Geometric Structure Preservation

Yuchen Zhou,Yan Luo,Xiangang Wang,Xingjian Gu,Mingzhou Lu

Main category: cs.CV

TL;DR: The paper proposes an efficient, high-accuracy 2D approach that slices 3D voxel features and applies a directional attention mechanism, balancing real-time processing and accuracy for 3D occupancy prediction in autonomous driving.

Details Motivation: Many current methods pursue high accuracy at the expense of real-time processing, so accuracy and inference speed need to be balanced. Method: The method slices 3D voxel features to preserve complete vertical geometric information and uses a directional attention mechanism to extract geometric features efficiently from different orientations. Result: On Occ3D-nuScenes, the method achieves an mIoU of 39.3% and an inference speed of 27.7 FPS; in simulations on edge devices, inference speed reaches 14.8 FPS. Conclusion: The paper proposes an efficient 2D approach that balances high accuracy and real-time processing for 3D occupancy prediction in autonomous driving. Abstract: Efficient and high-accuracy 3D occupancy prediction is crucial for ensuring the performance of autonomous driving (AD) systems. However, many current methods focus on high accuracy at the expense of real-time processing needs. To address this challenge of balancing accuracy and inference speed, we propose a directional pure 2D approach. Our method involves slicing 3D voxel features to preserve complete vertical geometric information. This strategy compensates for the loss of height cues in Bird's-Eye View (BEV) representations, thereby maintaining the integrity of the 3D geometric structure. By employing a directional attention mechanism, we efficiently extract geometric features from different orientations, striking a balance between accuracy and computational efficiency. Experimental results highlight the significant advantages of our approach for autonomous driving. On the Occ3D-nuScenes, the proposed method achieves an mIoU of 39.3% and an inference speed of 27.7 FPS, effectively balancing accuracy and efficiency. In simulations on edge devices, the inference speed reaches 14.8 FPS, further demonstrating the method's applicability for real-time deployment in resource-constrained environments.

[141] Mamba-based Efficient Spatio-Frequency Motion Perception for Video Camouflaged Object Detection

Xin Li,Keren Fu,Qijun Zhao

Main category: cs.CV

TL;DR: To address the limited discriminability of spatial features in video camouflaged object detection (VCOD), this paper proposes Vcamba, a novel Mamba-based architecture that combines motion perception in the spatial and frequency domains for more efficient and accurate detection.

Details Motivation: Existing VCOD methods rely on spatial appearance features to perceive motion cues, but the high similarity between foreground and background limits detection performance, so frequency features and the Mamba model are combined to improve it. Method: The authors propose a novel visual camouflage Mamba (Vcamba) that combines spatial- and frequency-domain motion perception, comprising the RFVSS, AFE, SLMP, FLMP, and SFMF modules. Result: Vcamba outperforms existing methods across 6 evaluation metrics on 2 datasets at lower computational cost. Conclusion: By fusing spatial- and frequency-domain features, Vcamba achieves more efficient and accurate VCOD than existing methods at lower computational cost. Abstract: Existing video camouflaged object detection (VCOD) methods primarily rely on spatial appearance features to perceive motion cues for breaking camouflage. However, the high similarity between foreground and background in VCOD results in limited discriminability of spatial appearance features (e.g., color and texture), restricting detection accuracy and completeness. Recent studies demonstrate that frequency features can not only enhance feature representation to compensate for appearance limitations but also perceive motion through dynamic variations in frequency energy. Furthermore, the emerging state space model called Mamba enables efficient perception of motion cues in frame sequences due to its linear-time long-sequence modeling capability. Motivated by this, we propose a novel visual camouflage Mamba (Vcamba) based on spatio-frequency motion perception that integrates frequency and spatial features for efficient and accurate VCOD. Specifically, we propose a receptive field visual state space (RFVSS) module to extract multi-scale spatial features after sequence modeling. For frequency learning, we introduce an adaptive frequency component enhancement (AFE) module with a novel frequency-domain sequential scanning strategy to maintain semantic consistency. Then we propose a space-based long-range motion perception (SLMP) module and a frequency-based long-range motion perception (FLMP) module to model spatio-temporal and frequency-temporal sequences in spatial and frequency phase domains. Finally, the space and frequency motion fusion module (SFMF) integrates dual-domain features for unified motion representation. 
Experimental results show that our Vcamba outperforms state-of-the-art methods across 6 evaluation metrics on 2 datasets with lower computation cost, confirming the superiority of Vcamba. Our code is available at: https://github.com/BoydeLi/Vcamba.

[142] Medical Image De-Identification Benchmark Challenge

Linmin Pei,Granger Sutton,Michael Rutherford,Ulrike Wagner,Tracy Nolan,Kirk Smith,Phillip Farmer,Peter Gu,Ambar Rana,Kailing Chen,Thomas Ferleman,Brian Park,Ye Wu,Jordan Kojouharov,Gargi Singh,Jon Lemon,Tyler Willis,Milos Vukadinovic,Grant Duffy,Bryan He,David Ouyang,Marco Pereanez,Daniel Samber,Derek A. Smith,Christopher Cannistraci,Zahi Fayad,David S. Mendelson,Michele Bufano,Elmar Kotter,Hamideh Haghiri,Rajesh Baidya,Stefan Dvoretskii,Klaus H. Maier-Hein,Marco Nolden,Christopher Ablett,Silvia Siggillino,Sandeep Kaushik,Hongzhu Jiang,Sihan Xie,Zhiyu Wan,Alex Michie,Simon J Doran,Angeline Aurelia Waly,Felix A. Nathaniel Liang,Humam Arshad Mustagfirin,Michelle Grace Felicia,Kuo Po Chih,Rahul Krish,Ghulam Rasool,Nidhal Bouaynaya,Nikolas Koutsoubis,Kyle Naddeo,Kartik Pandit,Tony O'Sullivan,Raj Krish,Qinyan Pan,Scott Gustafson,Benjamin Kopchick,Laura Opsahl-Ong,Andrea Olvera-Morales,Jonathan Pinney,Kathryn Johnson,Theresa Do,Juergen Klenk,Maria Diaz,Arti Singh,Rong Chai,David A. Clunie,Fred Prior,Keyvan Farahani

Main category: cs.CV

TL;DR: The MIDI-B Challenge provided a standardized platform for benchmarking DICOM image deID tools, achieving high accuracy scores between 97.91% and 99.93%. Ten teams successfully completed the test phase, employing various tools and technologies to meet the requirements.

Details Motivation: The de-identification of protected health information (PHI) and personally identifiable information (PII) is a fundamental requirement for sharing medical images while ensuring compliance with patient privacy laws. Preservation of non-PHI metadata to enable downstream development of imaging artificial intelligence (AI) is also important in biomedical research. Method: The MIDI-B Challenge consisted of three phases: training, validation, and test. A large, diverse, multi-center, and multi-modality set of real de-identified radiology images with synthetic PHI/PII inserted were used. Scores were computed as the percentage of correct actions from the total number of required actions. Result: Ten teams successfully completed the test phase of the challenge. Scores ranged from 97.91% to 99.93%, indicating high accuracy in de-identifying DICOM images. Participants employed a variety of open-source and proprietary tools, customized configurations, large language models, and optical character recognition (OCR). Conclusion: The MIDI-B Challenge successfully provided a standardized platform for benchmarking DICOM image deID tools, achieving high accuracy scores between 97.91% and 99.93%. The challenge demonstrated the effectiveness of a rule-based approach to image deID, with participants employing various tools and technologies to meet the requirements. Abstract: The de-identification (deID) of protected health information (PHI) and personally identifiable information (PII) is a fundamental requirement for sharing medical images, particularly through public repositories, to ensure compliance with patient privacy laws. In addition, preservation of non-PHI metadata to inform and enable downstream development of imaging artificial intelligence (AI) is an important consideration in biomedical research. 
The goal of MIDI-B was to provide a standardized platform for benchmarking of DICOM image deID tools based on a set of rules conformant to the HIPAA Safe Harbor regulation, the DICOM Attribute Confidentiality Profiles, and best practices in preservation of research-critical metadata, as defined by The Cancer Imaging Archive (TCIA). The challenge employed a large, diverse, multi-center, and multi-modality set of real de-identified radiology images with synthetic PHI/PII inserted. The MIDI-B Challenge consisted of three phases: training, validation, and test. Eighty individuals registered for the challenge. In the training phase, we encouraged participants to tune their algorithms using their in-house or public data. The validation and test phases utilized the DICOM images containing synthetic identifiers (of 216 and 322 subjects, respectively). Ten teams successfully completed the test phase of the challenge. To measure success of a rule-based approach to image deID, scores were computed as the percentage of correct actions from the total number of required actions. The scores ranged from 97.91% to 99.93%. Participants employed a variety of open-source and proprietary tools with customized configurations, large language models, and optical character recognition (OCR). In this paper we provide a comprehensive report on the MIDI-B Challenge's design, implementation, results, and lessons learned.
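
The scoring rule described in the abstract (correct actions as a percentage of required actions) can be sketched as below. This is a minimal illustration, not the challenge's scoring code: the function name `deid_score` and the example counts are hypothetical, and how an individual action is judged correct is left out.

```python
def deid_score(correct_actions: int, required_actions: int) -> float:
    """Return the percentage of required de-identification actions done correctly.

    Both arguments are plain counts; the definition of an "action" (for example
    removing, replacing, or retaining a DICOM attribute) is not modeled here.
    """
    if required_actions <= 0:
        raise ValueError("required_actions must be positive")
    if not 0 <= correct_actions <= required_actions:
        raise ValueError("correct_actions must lie in [0, required_actions]")
    return 100.0 * correct_actions / required_actions


# Hypothetical counts, chosen only to show the arithmetic.
print(f"{deid_score(9993, 10000):.2f}%")
```

Because the denominator is the number of required actions, the score treats a missed PHI removal and a wrongly removed research-critical attribute as the same kind of error.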

[143] Consistent Point Matching

Halid Ziya Yerebakan,Gerardo Hermosillo Valadez

Main category: cs.CV

TL;DR: This study improves the robustness of point-matching algorithms for medical image navigation by incorporating a consistency heuristic, achieving state-of-the-art results without requiring machine learning or training data.

Details Motivation: To improve the robustness of matching anatomical locations across pairs of medical images and achieve high-precision navigation without requiring a machine learning model or training data. Method: The method incorporates a consistency heuristic into the point-matching algorithm and is validated on diverse longitudinal internal and public datasets spanning CT and MRI modalities. Result: The approach surpasses state-of-the-art results on the Deep Lesion Tracking dataset and effectively addresses landmark localization. Conclusion: Incorporating a consistency heuristic into the point-matching algorithm improves robustness in matching anatomical locations across pairs of medical images. Abstract: This study demonstrates that incorporating a consistency heuristic into the point-matching algorithm \cite{yerebakan2023hierarchical} improves robustness in matching anatomical locations across pairs of medical images. We validated our approach on diverse longitudinal internal and public datasets spanning CT and MRI modalities. Notably, it surpasses state-of-the-art results on the Deep Lesion Tracking dataset. Additionally, we show that the method effectively addresses landmark localization. The algorithm operates efficiently on standard CPU hardware and allows configurable trade-offs between speed and robustness. The method enables high-precision navigation between medical images without requiring a machine learning model or training data.

[144] DivControl: Knowledge Diversion for Controllable Image Generation

Yucheng Xie,Fu Feng,Ruixiao Shi,Jing Wang,Yong Rui,Xin Geng

Main category: cs.CV

TL;DR: DivControl is a unified framework for controllable image generation that efficiently adapts to new conditions using a decomposable architecture and dynamic knowledge diversion.

Details Motivation: Existing methods for controllable image generation either require separate models for each condition or suffer from entangled representations, leading to poor generalization and high adaptation costs. This work aims to address these limitations with a unified, decomposable framework. Method: DivControl factorizes ControlNet using SVD to separate condition-agnostic and condition-specific components, incorporating a dynamic gate for soft routing based on condition semantics, along with a representation alignment loss. Result: DivControl achieves state-of-the-art controllability with 36.4× less training cost, improves performance on basic conditions, and demonstrates strong zero-shot and few-shot capabilities on unseen conditions. Conclusion: DivControl provides a scalable and efficient framework for controllable image generation, outperforming existing methods in performance, generalization, and adaptation efficiency. Abstract: Diffusion models have advanced from text-to-image (T2I) to image-to-image (I2I) generation by incorporating structured inputs such as depth maps, enabling fine-grained spatial control. However, existing methods either train separate models for each condition or rely on unified architectures with entangled representations, resulting in poor generalization and high adaptation costs for novel conditions. To this end, we propose DivControl, a decomposable pretraining framework for unified controllable generation and efficient adaptation. DivControl factorizes ControlNet via SVD into basic components (pairs of singular vectors), which are disentangled into condition-agnostic learngenes and condition-specific tailors through knowledge diversion during multi-condition training. Knowledge diversion is implemented via a dynamic gate that performs soft routing over tailors based on the semantics of condition instructions, enabling zero-shot generalization and parameter-efficient adaptation to novel conditions. 
To further improve condition fidelity and training efficiency, we introduce a representation alignment loss that aligns condition embeddings with early diffusion features. Extensive experiments demonstrate that DivControl achieves state-of-the-art controllability with 36.4× less training cost, while simultaneously improving average performance on basic conditions. It also delivers strong zero-shot and few-shot performance on unseen conditions, demonstrating superior scalability, modularity, and transferability.
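
To make the idea of SVD components with gated, condition-specific parts concrete, here is a small NumPy sketch. It is not the authors' implementation: the split into the first `n_shared` components (standing in for learngenes) and the rest (standing in for tailors), the softmax gate, and the names `softmax` and `compose_weight` are all assumptions for illustration, and the abstract does not say the gate is a softmax.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def compose_weight(U, S, Vt, n_shared, gate_logits):
    """Rebuild a weight matrix from rank-1 SVD components.

    The first n_shared components are always used at full strength
    (hypothetical stand-in for condition-agnostic learngenes). The
    remaining components are scaled by a soft gate (hypothetical
    stand-in for routing over condition-specific tailors).
    """
    gate_logits = np.asarray(gate_logits, dtype=float)
    if gate_logits.shape[0] != S.shape[0] - n_shared:
        raise ValueError("need one gate logit per non-shared component")
    gates = softmax(gate_logits)
    scale = np.concatenate([np.ones(n_shared), gates])
    # Column i of U is multiplied by S[i] * scale[i] before projecting with Vt.
    return (U * (S * scale)) @ Vt


rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                    # toy weight matrix
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_cond = compose_weight(U, S, Vt, n_shared=2, gate_logits=[0.0, 1.0])
print(W_cond.shape)
```

In this toy version the gate logits would come from an embedding of the condition instruction; a new condition then only needs new gate values or a few new components rather than a full copy of the weights.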

[145] Efficient Masked Attention Transformer for Few-Shot Classification and Segmentation

Dustin Carrión-Ojeda,Stefan Roth,Simone Schaub-Meyer

Main category: cs.CV

TL;DR: This paper proposes EMAT, an efficient method for few-shot classification and segmentation that excels at handling small objects and uses fewer parameters than existing approaches.

Details Motivation: Current state-of-the-art methods struggle with small objects in few-shot classification and segmentation tasks. Method: The Efficient Masked Attention Transformer (EMAT) introduces a novel memory-efficient masked attention mechanism, a learnable downscaling strategy, and parameter-efficiency enhancements. Result: Improvements in classification and segmentation accuracy, particularly for small objects, with at least four times fewer trainable parameters. Conclusion: EMAT outperforms all FS-CS methods on the PASCAL-5^i and COCO-20^i datasets while using fewer trainable parameters. Abstract: Few-shot classification and segmentation (FS-CS) focuses on jointly performing multi-label classification and multi-class segmentation using few annotated examples. Although the current state of the art (SOTA) achieves high accuracy in both tasks, it struggles with small objects. To overcome this, we propose the Efficient Masked Attention Transformer (EMAT), which improves classification and segmentation accuracy, especially for small objects. EMAT introduces three modifications: a novel memory-efficient masked attention mechanism, a learnable downscaling strategy, and parameter-efficiency enhancements. EMAT outperforms all FS-CS methods on the PASCAL-5$^i$ and COCO-20$^i$ datasets, using at least four times fewer trainable parameters. Moreover, as the current FS-CS evaluation setting discards available annotations, despite their costly collection, we introduce two novel evaluation settings that consider these annotations to better reflect practical scenarios.

[146] FFGAF-SNN: The Forward-Forward Based Gradient Approximation Free Training Framework for Spiking Neural Networks

Changqing Xu,Ziqiang Yang,Yi Liu,Xinfang Liao,Guiqi Mo,Hao Zeng,Yintang Yang

Main category: cs.CV

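TL;DR: This paper proposes a new Forward-Forward-based training framework for spiking neural networks that requires no gradient approximation and significantly reduces computational complexity; it also introduces a mechanism that dynamically optimizes the loss function, improving accuracy and lowering power consumption.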

Details Motivation: Spiking neural networks (SNNs) are hard to train because they are non-differentiable; existing gradient approximation methods often sacrifice accuracy, and deployment on edge devices is limited by the heavy computational demands of backpropagation. Method: Proposes a Forward-Forward (FF) based gradient approximation-free training framework and introduces a class-aware complexity adaptation mechanism that dynamically optimizes the loss function. Result: Achieves test accuracies of 99.58%, 92.13%, and 75.64% on the MNIST, Fashion-MNIST, and CIFAR-10 datasets, respectively, and performs well in memory access and computational power consumption. Conclusion: Experiments show that the proposed training framework reaches test accuracies of 99.58%, 92.13%, and 75.64% on the MNIST, Fashion-MNIST, and CIFAR-10 datasets, respectively, surpassing all existing FF-based SNN methods. The method also shows significant advantages in memory access and computational power consumption. Abstract: Spiking Neural Networks (SNNs) offer a biologically plausible framework for energy-efficient neuromorphic computing. However, training SNNs efficiently is challenging because they are non-differentiable. Existing gradient approximation approaches frequently sacrifice accuracy and face deployment limitations on edge devices due to the substantial computational requirements of backpropagation. To address these challenges, we propose a Forward-Forward (FF) based gradient approximation-free training framework for Spiking Neural Networks, which treats spiking activations as black-box modules, thereby eliminating the need for gradient approximation while significantly reducing computational complexity. Furthermore, we introduce a class-aware complexity adaptation mechanism that dynamically optimizes the loss function based on inter-class difficulty metrics, enabling efficient allocation of network resources across different categories. Experimental results demonstrate that our proposed training framework achieves test accuracies of 99.58%, 92.13%, and 75.64% on the MNIST, Fashion-MNIST, and CIFAR-10 datasets, respectively, surpassing all existing FF-based SNN approaches. Additionally, our proposed method exhibits significant advantages in terms of memory access and computational power consumption.
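
For readers unfamiliar with Forward-Forward training, the sketch below shows the generic layer-local objective from the original FF formulation: "goodness" is the sum of squared activations, pushed above a threshold for positive samples and below it for negative samples. It is background only and does not reproduce this paper's spiking formulation or its class-aware loss; the threshold `theta` and the toy activations are assumptions.

```python
import numpy as np


def goodness(activations: np.ndarray) -> np.ndarray:
    """Per-sample goodness: sum of squared activations of one layer."""
    return np.sum(activations ** 2, axis=1)


def ff_layer_loss(pos_act: np.ndarray, neg_act: np.ndarray, theta: float) -> float:
    """Layer-local Forward-Forward loss.

    softplus(theta - g_pos) penalizes positive samples with low goodness;
    softplus(g_neg - theta) penalizes negative samples with high goodness.
    np.logaddexp(0, x) is a numerically stable softplus(x).
    """
    g_pos = goodness(pos_act)
    g_neg = goodness(neg_act)
    loss_pos = np.logaddexp(0.0, theta - g_pos).mean()
    loss_neg = np.logaddexp(0.0, g_neg - theta).mean()
    return float(loss_pos + loss_neg)


# Toy activations (hypothetical): one layer, two units, one sample each.
pos = np.array([[2.0, 2.0]])   # goodness 8.0, well above theta
neg = np.array([[0.1, 0.1]])   # goodness 0.02, well below theta
print(ff_layer_loss(pos, neg, theta=2.0))
```

Each layer minimizes its own loss from two forward passes, so no error signal has to be propagated backward through the layer stack, which is the property the abstract relies on to avoid backpropagation through non-differentiable spiking activations.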

[147] Adaptively Distilled ControlNet: Accelerated Training and Superior Sampling for Medical Image Synthesis

Kunpeng Qiu,Zhiying Zhou,Yongxin Guo

Main category: cs.CV

TL;DR: This paper proposes an Adaptively Distilled ControlNet to address challenges in medical image annotation, achieving superior performance on two medical datasets.

Details Motivation: Medical image annotation faces challenges due to privacy concerns and labor-intensive labeling, limiting segmentation model performance and generalization. Current mask-controllable diffusion models struggle with precise lesion-mask alignment. Method: The paper proposes an Adaptively Distilled ControlNet, which uses dual-model distillation to accelerate training and optimization. During training, a teacher model regularizes a mask-only student model via predicted noise alignment, with adaptive regularization based on lesion-background ratios. Result: Comprehensive evaluations showed state-of-the-art performance: TransUNet improved mDice/mIoU by 2.4%/4.2% on KiTS19, while SANet achieved 2.6%/3.5% gains on Polyps. Conclusion: Adaptively Distilled ControlNet is an effective and superior framework for medical image generation, as demonstrated by the state-of-the-art performance on two medical datasets. Abstract: Medical image annotation is constrained by privacy concerns and labor-intensive labeling, significantly limiting the performance and generalization of segmentation models. While mask-controllable diffusion models excel in synthesis, they struggle with precise lesion-mask alignment. We propose Adaptively Distilled ControlNet, a task-agnostic framework that accelerates training and optimization through dual-model distillation. Specifically, during training, a teacher model, conditioned on mask-image pairs, regularizes a mask-only student model via predicted noise alignment in parameter space, further enhanced by adaptive regularization based on lesion-background ratios. During sampling, only the student model is used, enabling privacy-preserving medical image generation. Comprehensive evaluations on two distinct medical datasets demonstrate state-of-the-art performance: TransUNet improves mDice/mIoU by 2.4%/4.2% on KiTS19, while SANet achieves 2.6%/3.5% gains on Polyps, highlighting its effectiveness and superiority. 
Code is available at GitHub.

[148] OmniTraj: Pre-Training on Heterogeneous Data for Adaptive and Zero-Shot Human Trajectory Prediction

Yang Gao,Po-Chien Luan,Kaouther Messaoud,Lan Feng,Alexandre Alahi

Main category: cs.CV

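TL;DR: OmniTraj is a Transformer-based trajectory prediction model that achieves zero-shot transfer by explicitly conditioning on the frame rate and attains state-of-the-art performance on multiple datasets.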

Details Motivation: Existing pre-trained models need fine-tuning when transferred to unseen datasets with different temporal dynamics, which limits their scalability and practical utility. Method: OmniTraj is Transformer-based and models temporal metadata through an explicit conditioning mechanism. Result: OmniTraj reduces prediction error by over 70% in the zero-shot transfer setting and achieves state-of-the-art results on four datasets. Conclusion: By explicitly conditioning on the frame rate, OmniTraj achieves zero-shot transfer across datasets and state-of-the-art performance. Abstract: While large-scale pre-training has advanced human trajectory prediction, a critical challenge remains: zero-shot transfer to unseen datasets with varying temporal dynamics. State-of-the-art pre-trained models often require fine-tuning to adapt to new datasets with different frame rates or observation horizons, limiting their scalability and practical utility. In this work, we systematically investigate this limitation and propose a robust solution. We first demonstrate that existing data-aware discrete models struggle when transferred to new scenarios with shifted temporal setups. We then isolate the temporal generalization from dataset shift, revealing that a simple, explicit conditioning mechanism for temporal metadata is a highly effective solution. Based on this insight, we present OmniTraj, a Transformer-based model pre-trained on a large-scale, heterogeneous dataset. Our experiments show that explicitly conditioning on the frame rate enables OmniTraj to achieve state-of-the-art zero-shot transfer performance, reducing prediction error by over 70% in challenging cross-setup scenarios. After fine-tuning, OmniTraj achieves state-of-the-art results on four datasets, including NBA, JTA, WorldPose, and ETH-UCY. The code is publicly available: https://github.com/vita-epfl/omnitraj

[149] SAMSA: Segment Anything Model Enhanced with Spectral Angles for Hyperspectral Interactive Medical Image Segmentation

Alfie Roddan,Tobias Czempiel,Chi Xu,Daniel S. Elson,Stamatia Giannarou

Main category: cs.CV

TL;DR: SAMSA is a novel interactive segmentation framework for hyperspectral medical imaging that combines RGB models with spectral analysis, achieving high accuracy with minimal user input and few training examples.

Details Motivation: The motivation is to overcome challenges in hyperspectral imaging (HSI) segmentation caused by data limitations and hardware variations, enabling accurate and efficient medical image analysis. Method: SAMSA combines an RGB foundation model with spectral analysis, utilizing user clicks to guide segmentation and spectral similarity computations through a spectral feature fusion strategy independent of spectral band count and resolution. Result: SAMSA achieved 81.0% 1-click and 93.4% 5-click DICE on a neurosurgical dataset, and 81.1% 1-click and 89.2% 5-click DICE on an intraoperative porcine HSI dataset, demonstrating strong performance in few-shot and zero-shot learning settings. Conclusion: The study concludes that SAMSA is an effective and flexible framework for hyperspectral medical image analysis, particularly in scenarios with limited training data. Abstract: Hyperspectral imaging (HSI) provides rich spectral information for medical imaging, yet encounters significant challenges due to data limitations and hardware variations. We introduce SAMSA, a novel interactive segmentation framework that combines an RGB foundation model with spectral analysis. SAMSA efficiently utilizes user clicks to guide both RGB segmentation and spectral similarity computations. The method addresses key limitations in HSI segmentation through a unique spectral feature fusion strategy that operates independently of spectral band count and resolution. Performance evaluation on publicly available datasets has shown 81.0% 1-click and 93.4% 5-click DICE on a neurosurgical and 81.1% 1-click and 89.2% 5-click DICE on an intraoperative porcine hyperspectral dataset. Experimental results demonstrate SAMSA's effectiveness in few-shot and zero-shot learning scenarios and using minimal training examples. Our approach enables seamless integration of datasets with different spectral characteristics, providing a flexible framework for hyperspectral medical image analysis.
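
The title refers to spectral angles, and the abstract says user clicks guide spectral similarity computations. A standard way to compare spectra is the spectral angle between two band vectors, which is defined for any number of bands. The sketch below computes an angle map relative to a clicked pixel; it is an illustration of that standard measure, not SAMSA's fusion strategy, and the function name, array layout, and toy cube are assumptions.

```python
import numpy as np


def spectral_angle_map(cube: np.ndarray, click_rc: tuple, eps: float = 1e-12) -> np.ndarray:
    """Angle in radians between every pixel spectrum and the clicked pixel's spectrum.

    cube:     hyperspectral image of shape (H, W, B), any band count B.
    click_rc: (row, col) of the user click.
    Small angles mean spectra similar in shape, regardless of overall brightness.
    """
    h, w, b = cube.shape
    ref = cube[click_rc[0], click_rc[1]].astype(float)
    flat = cube.reshape(-1, b).astype(float)
    cos = flat @ ref / (np.linalg.norm(flat, axis=1) * np.linalg.norm(ref) + eps)
    # Clip guards against values slightly outside [-1, 1] from rounding.
    return np.arccos(np.clip(cos, -1.0, 1.0)).reshape(h, w)


# Toy 2x2 cube with 3 bands (hypothetical values).
cube = np.array([
    [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]],    # second pixel is a brighter copy of the first
    [[1.0, 1.0, -1.0], [3.0, 2.0, 1.0]],   # first pixel here is orthogonal to the reference
])
angles = spectral_angle_map(cube, (0, 0))
print(np.round(angles, 3))
```

Because the angle depends only on the direction of the spectrum vector, the same function works unchanged on cubes with different band counts, which is one simple way to stay independent of sensor resolution.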

[150] I2V-GS: Infrastructure-to-Vehicle View Transformation with Gaussian Splatting for Autonomous Driving Data Generation

Jialei Chen,Wuhao Xu,Sipeng He,Baoru Huang,Dongchun Ren

Main category: cs.CV

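TL;DR: This paper proposes I2V-GS, a new method that transforms infrastructure views into vehicle views to generate high-quality autonomous driving datasets, addressing the high cost and low efficiency of data collection.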

Details Motivation: Autonomous driving systems need large amounts of high-quality data, but current driving data collection is expensive and inefficient, which motivates synthesizing driving data from real images. Method: Proposes I2V-GS, which builds on Gaussian Splatting, generates dense training views through adaptive depth warp, inpaints images with a cascade strategy, and uses cross-view information to optimize the diffusion model. Result: I2V-GS significantly improves synthesis quality under the vehicle view, outperforming StreetGaussian in NTA-Iou, NTL-Iou, and FID by 45.7%, 34.2%, and 14.9%, respectively. Conclusion: I2V-GS generates autonomous driving datasets by transforming infrastructure views into vehicle views, and experiments show that it outperforms existing methods in synthesis quality. Abstract: Vast and high-quality data are essential for end-to-end autonomous driving systems. However, current driving data is mainly collected by vehicles, which is expensive and inefficient. A potential solution lies in synthesizing data from real-world images. Recent advancements in 3D reconstruction demonstrate photorealistic novel view synthesis, highlighting the potential of generating driving data from images captured on the road. This paper introduces a novel method, I2V-GS, to transfer the Infrastructure view To the Vehicle view with Gaussian Splatting. Reconstruction from sparse infrastructure viewpoints and rendering under large view transformations is a challenging problem. We adopt the adaptive depth warp to generate dense training views. To further expand the range of views, we employ a cascade strategy to inpaint warped images, which also ensures inpainting content is consistent across views. To further ensure the reliability of the diffusion model, we utilize the cross-view information to perform a confidence-guided optimization. Moreover, we introduce RoadSight, a multi-modality, multi-view dataset from real scenarios in infrastructure views. To our knowledge, I2V-GS is the first framework to generate autonomous driving datasets with infrastructure-vehicle view transformation. Experimental results demonstrate that I2V-GS significantly improves synthesis quality under vehicle view, outperforming StreetGaussian in NTA-Iou, NTL-Iou, and FID by 45.7%, 34.2%, and 14.9%, respectively.

[151] UniLDiff: Unlocking the Power of Diffusion Priors for All-in-One Image Restoration

Zihan Cheng,Liangtai Zhou,Dian Chen,Ni Tang,Xiaotong Luo,Yanyun Qu

Main category: cs.CV

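TL;DR: This paper proposes a unified image restoration framework that introduces a degradation-aware module and a detail-enhancement module, achieving strong restoration results under diverse degradation conditions.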

Details Motivation: All-in-One Image Restoration (AiOIR) is a promising yet challenging research direction whose core challenge is handling diverse degradations effectively. Method: The work designs a Degradation-Aware Feature Fusion module and a Detail-Aware Expert Module to adapt to different degradation types and to enhance texture and fine-structure recovery. Result: Experiments show that the method consistently achieves state-of-the-art performance in multi-task and mixed degradation settings. Conclusion: The work proposes a novel unified image restoration framework based on latent diffusion models that handles diverse degradations effectively and demonstrates the practical potential of diffusion priors for unified image restoration. Abstract: All-in-One Image Restoration (AiOIR) has emerged as a promising yet challenging research direction. To address its core challenges, we propose a novel unified image restoration framework based on latent diffusion models (LDMs). Our approach structurally integrates low-quality visual priors into the diffusion process, unlocking the powerful generative capacity of diffusion models for diverse degradations. Specifically, we design a Degradation-Aware Feature Fusion (DAFF) module to enable adaptive handling of diverse degradation types. Furthermore, to mitigate detail loss caused by the high compression and iterative sampling of LDMs, we design a Detail-Aware Expert Module (DAEM) in the decoder to enhance texture and fine-structure recovery. Extensive experiments across multi-task and mixed degradation settings demonstrate that our method consistently achieves state-of-the-art performance, highlighting the practical potential of diffusion priors for unified image restoration. Our code will be released.

[152] Enhanced Velocity Field Modeling for Gaussian Video Reconstruction

Zhenyang Li,Xiaoyang Bai,Tongchen Zhang,Pengfei Shen,Weiwei Xu,Yifan Peng

Main category: cs.CV

TL;DR: FlowGaussian-VR enhances 3D video reconstruction for dynamic scenes using optical flow-based optimization and adaptive densification, leading to sharper visuals and better motion tracking.

Details Motivation: The motivation stems from the limitations of existing deformation networks in handling complex motion and scale variations in dynamic scenes for 3D video reconstruction. Method: The method introduces a velocity field rendering (VFR) pipeline and a flow-assisted adaptive densification (FAD) strategy to optimize Gaussian video reconstruction in dynamic scenes. Result: The results show significant visual improvements, including a PSNR gain of over 2.5 dB, reduced blurriness in dynamic textures, and regularized Gaussian trajectories. Conclusion: The proposed FlowGaussian-VR method effectively improves 3D video reconstruction by addressing challenges in dynamic scenes, resulting in better visual quality and trackable Gaussian trajectories. Abstract: High-fidelity 3D video reconstruction is essential for enabling real-time rendering of dynamic scenes with realistic motion in virtual and augmented reality (VR/AR). The deformation field paradigm of 3D Gaussian splatting has achieved near-photorealistic results in video reconstruction due to the great representation capability of deep deformation networks. However, in videos with complex motion and significant scale variations, deformation networks often overfit to irregular Gaussian trajectories, leading to suboptimal visual quality. Moreover, the gradient-based densification strategy designed for static scene reconstruction proves inadequate to address the absence of dynamic content. In light of these challenges, we propose a flow-empowered velocity field modeling scheme tailored for Gaussian video reconstruction, dubbed FlowGaussian-VR. It consists of two core components: a velocity field rendering (VFR) pipeline which enables optical flow-based optimization, and a flow-assisted adaptive densification (FAD) strategy that adjusts the number and size of Gaussians in dynamic regions. 
We validate our model's effectiveness on multi-view dynamic reconstruction and novel view synthesis with multiple real-world datasets containing challenging motion scenarios, demonstrating not only notable visual improvements (over 2.5 dB gain in PSNR) and less blurry artifacts in dynamic textures, but also regularized and trackable per-Gaussian trajectories.

[153] Explainable Image Classification with Reduced Overconfidence for Tissue Characterisation

Alfie Roddan,Chi Xu,Serine Ajlouni,Irini Kakaletri,Patra Charalampaki,Stamatia Giannarou

Main category: cs.CV

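TL;DR: This paper proposes a method that incorporates risk estimation into pixel attribution, improving the explainability of image classification models and outperforming existing methods on real-world data and ImageNet.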

Details Motivation: Existing deep learning models and their pixel attribution methods are overconfident, which makes attribution results unreliable. A method is therefore needed that provides more reliable explainability and quantifies risk. Method: The paper iteratively applies a classification model with a pixel attribution method to create a volume of PA maps, generates an enhanced PA map from the expectation values of the pixel-wise distributions, and estimates pixel-wise risk with the coefficient of variation (CV). Result: Performance evaluation on pCLE data and ImageNet shows that the improved explainability method outperforms the state of the art. Conclusion: The paper proposes a new method that incorporates risk estimation into pixel attribution to improve the explainability of image classification. The method provides an improved PA map together with a risk estimate for its output PA values, and experiments verify its superior performance on pCLE data and ImageNet. Abstract: The deployment of Machine Learning models intraoperatively for tissue characterisation can assist decision making and guide safe tumour resections. For image classification models, pixel attribution methods are popular to infer explainability. However, overconfidence in deep learning models' predictions translates to overconfidence in pixel attribution. In this paper, we propose the first approach which incorporates risk estimation into a pixel attribution method for improved image classification explainability. The proposed method iteratively applies a classification model with a pixel attribution method to create a volume of PA maps. This volume is used, for the first time, to generate a pixel-wise distribution of PA values. We introduce a method to generate an enhanced PA map by estimating the expectation values of the pixel-wise distributions. In addition, the coefficient of variation (CV) is used to estimate pixel-wise risk of this enhanced PA map. Hence, the proposed method not only provides an improved PA map but also produces an estimation of risk on the output PA values. Performance evaluation on probe-based Confocal Laser Endomicroscopy (pCLE) data and ImageNet verifies that our improved explainability method outperforms the state-of-the-art.
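
The two statistics named in the abstract, the pixel-wise expectation and the coefficient of variation, can be sketched on a stack of PA maps as follows. This is a minimal illustration under assumptions: the volume is taken as already computed (the abstract does not say how the repeated passes differ), the expectation is estimated by the sample mean, and the function name and `eps` guard are hypothetical.

```python
import numpy as np


def enhanced_pa_and_risk(pa_volume: np.ndarray, eps: float = 1e-8):
    """Summarize a volume of pixel attribution (PA) maps.

    pa_volume: array of shape (T, H, W), one PA map per repeated pass.
    Returns (enhanced_map, risk_map), both of shape (H, W):
      enhanced_map = pixel-wise mean over the T maps
      risk_map     = pixel-wise coefficient of variation, std / |mean|
    """
    mean = pa_volume.mean(axis=0)
    std = pa_volume.std(axis=0)
    cv = std / (np.abs(mean) + eps)  # eps avoids division by zero
    return mean, cv


# Toy volume: 2 passes over a 1x2 image (hypothetical values).
volume = np.array([[[1.0, 2.0]],
                   [[3.0, 2.0]]])
enhanced, risk = enhanced_pa_and_risk(volume)
print(enhanced)  # pixel-wise mean
print(risk)      # higher where the passes disagree
```

In the toy volume the first pixel varies between passes and gets a nonzero risk, while the second pixel is identical in both passes and gets a risk of zero, which is the kind of per-pixel reliability signal the method adds on top of the attribution map.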

[154] DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape Matching

Emery Pierson,Lei Li,Angela Dai,Maks Ovsjanikov

Main category: cs.CV

TL;DR: This paper introduces a data-driven approach to replace traditional methods in deep functional maps for shape correspondence, achieving better results without relying on axiomatic models.

Details Motivation: The motivation is to overcome the limitations of existing methods that rely on axiomatic modeling for training loss or functional map regularization, thereby restricting accuracy and applicability. Method: The paper introduces a generative model of functional maps in the spectral domain using score-based generative modeling. It also proposes a novel distillation strategy from diffusion models to promote structural properties of ground truth functional maps. Result: The experiments show that the learned regularization methods outperform traditional axiomatic approaches in zero-shot non-rigid shape matching tasks. Conclusion: The paper concludes that data-driven methods can replace traditional axiomatic models in deep functional maps for non-rigid shape correspondence tasks, leading to improved accuracy and broader applicability. Abstract: Deep functional maps have recently emerged as a powerful tool for solving non-rigid shape correspondence tasks. Methods that use this approach combine the power and flexibility of the functional map framework, with data-driven learning for improved accuracy and generality. However, most existing methods in this area restrict the learning aspect only to the feature functions and still rely on axiomatic modeling for formulating the training loss or for functional map regularization inside the networks. This limits both the accuracy and the applicability of the resulting approaches only to scenarios where assumptions of the axiomatic models hold. In this work, we show, for the first time, that both in-network regularization and functional map training can be replaced with data-driven methods. For this, we first train a generative model of functional maps in the spectral domain using score-based generative modeling, built from a large collection of high-quality maps. We then exploit the resulting model to promote the structural properties of ground truth functional maps on new shape collections. 
Remarkably, we demonstrate that the learned models are category-agnostic, and can fully replace commonly used strategies such as enforcing Laplacian commutativity or orthogonality of functional maps. Our key technical contribution is a novel distillation strategy from diffusion models in the spectral domain. Experiments demonstrate that our learned regularization leads to better results than axiomatic approaches for zero-shot non-rigid shape matching. Our code is available at: https://github.com/daidedou/diffumatch/

[155] RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping

Dongming Wu,Yanping Fu,Saike Huang,Yingfei Liu,Fan Jia,Nian Liu,Feng Dai,Tiancai Wang,Rao Muhammad Anwer,Fahad Shahbaz Khan,Jianbing Shen

Main category: cs.CV

TL;DR: This study presents RAGNet, a large-scale benchmark for robotic grasping, and AffordanceNet, a framework that improves open-world generalization for object affordance perception. It achieves strong results on benchmarks and real-world tasks.

Details Motivation: General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. Current studies face limitations due to a lack of reasoning-based large-scale affordance prediction data, raising concerns about open-world effectiveness. Method: The study builds a large-scale grasping-oriented affordance segmentation benchmark named RAGNet, which includes 273k images, 180 categories, and 26k reasoning instructions. It proposes a framework named AffordanceNet, consisting of a Vision-Language Model (VLM) pre-trained on RAGNet data and a grasping network that conditions an affordance map to grasp targets. Result: Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks demonstrate that the proposed model (AffordanceNet) has strong open-world generalization ability. The dataset and code are made publicly available. Conclusion: The study introduces RAGNet, a large-scale benchmark for affordance segmentation, and AffordanceNet, a framework that enhances open-world generalization for robotic grasping. The results show that the model has powerful generalization abilities on both benchmarks and real-robot tasks. Abstract: General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. However, current studies suffer from the problem of lacking reasoning-based large-scale affordance prediction data, leading to considerable concern about open-world effectiveness. To address this limitation, we build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. It contains 273k images, 180 categories, and 26k reasoning instructions. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. 
They are carefully annotated with an affordance map, while the difficulty of language instructions is largely increased by removing their category name and only providing functional descriptions. Furthermore, we propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network that conditions an affordance map to grasp the target. Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks show that our model has a powerful open-world generalization ability. Our data and code are available at https://github.com/wudongming97/AffordanceNet.

[156] Slot Attention with Re-Initialization and Self-Distillation

Rongzhen Zhao,Yi Zhao,Juho Kannala,Joni Pajarinen

Main category: cs.CV

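TL;DR: DIAS proposes a new Slot Attention mechanism that addresses the redundant-slot problem through re-initialization and self-distillation, achieving state-of-the-art performance on multiple OCL tasks.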

Details Motivation: Existing OCL methods reuse slots naively once they are initialized, so redundant slots compete with informative ones to represent objects, causing objects to be erroneously segmented into parts. In addition, mainstream methods obtain supervision only by decoding slots into a reconstruction of the input, overlooking potential supervision based on internal information. Method: Slot Attention with re-Initialization and self-Distillation (DIAS): 1) reduce redundancy in the aggregated slots and re-initialize an extra aggregation to update the remaining slots; 2) drive the attention map at the first aggregation iteration to approximate the attention map at the last iteration, enabling self-distillation. Result: Experiments show that DIAS performs best on OCL tasks such as object discovery and recognition, while also improving advanced visual prediction and reasoning. Conclusion: DIAS resolves the redundant-slot problem through re-initialization and self-distillation and achieves state-of-the-art performance on OCL tasks such as object discovery and recognition. Abstract: Unlike popular solutions based on dense feature maps, Object-Centric Learning (OCL) represents visual scenes as sub-symbolic object-level feature vectors, termed slots, which are highly versatile for tasks involving visual modalities. OCL typically aggregates object superpixels into slots by iteratively applying competitive cross attention, known as Slot Attention, with the slots as the query. However, once initialized, these slots are reused naively, causing redundant slots to compete with informative ones for representing objects. This often results in objects being erroneously segmented into parts. Additionally, mainstream methods derive supervision signals solely from decoding slots into the input's reconstruction, overlooking potential supervision based on internal information. To address these issues, we propose Slot Attention with re-Initialization and self-Distillation (DIAS): i) We reduce redundancy in the aggregated slots and re-initialize extra aggregation to update the remaining slots; ii) We drive the bad attention map at the first aggregation iteration to approximate the good at the last iteration to enable self-distillation. Experiments demonstrate that DIAS achieves state-of-the-art on OCL tasks like object discovery and recognition, while also improving advanced visual prediction and reasoning. Our code is available at https://github.com/Genera1Z/DIAS.
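
To make the two ideas in this entry concrete, here is a minimal sketch of competitive cross attention (a softmax taken over slots for each input feature) and of a self-distillation term between the first and last attention maps. It is not the DIAS implementation: the learned projections, the recurrent slot update, and the re-initialization step are all omitted, and the KL-divergence form of the distillation term is an assumption, since the abstract does not state the loss. The function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def slot_attention_step(slots, feats):
    """One simplified Slot Attention iteration.

    Assumption: no learned query/key/value projections and no GRU update;
    each slot is simply replaced by the attention-weighted mean of the features.
    """
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    # attn[n][k]: for feature n, a softmax over slots k, so slots compete.
    attn = []
    for feat in feats:
        logits = [scale * sum(s * f for s, f in zip(slot, feat)) for slot in slots]
        attn.append(softmax(logits))
    new_slots = []
    for k in range(len(slots)):
        weights = [row[k] for row in attn]
        norm = sum(weights) + 1e-8
        new_slots.append(
            [sum(w * feat[d] for w, feat in zip(weights, feats)) / norm for d in range(dim)]
        )
    return new_slots, attn


def kl_distill(student_attn, teacher_attn, eps=1e-8):
    """Mean KL(teacher || student) over features (assumed loss form)."""
    total = 0.0
    for s_row, t_row in zip(student_attn, teacher_attn):
        total += sum(t * math.log((t + eps) / (s + eps)) for s, t in zip(s_row, t_row))
    return total / len(student_attn)


def run_slot_attention(slots, feats, iters=3):
    """Iterate and return the final slots plus the first and last attention maps."""
    attns = []
    for _ in range(iters):
        slots, attn = slot_attention_step(slots, feats)
        attns.append(attn)
    return slots, attns[0], attns[-1]


feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
init_slots = [[0.6, 0.4], [0.4, 0.6]]
final_slots, first_attn, last_attn = run_slot_attention(init_slots, feats)
distill_loss = kl_distill(first_attn, last_attn)
```

In a trainable version the last-iteration map would be treated as a fixed teacher (no gradient through it), so that only the first-iteration attention is pulled toward it.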

[157] SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting

Di Li,Jie Feng,Jiahao Chen,Weisheng Dong,Guanbin Li,Yuhui Zheng,Mingtao Feng,Guangming Shi

Main category: cs.CV

TL;DR: The paper introduces a new task and benchmark for 3D affordance reasoning and proposes a novel framework, SeqSplatNet, which advances affordance reasoning to complex, sequential tasks at the scene level.

Details Motivation: Current methods based on 3D Gaussian Splatting are limited to single-object, single-step interactions, which falls short of the long-horizon, multi-object tasks required by complex real-world applications. Method: The paper proposes SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks. It employs a large language model that autoregressively generates text interleaved with special segmentation tokens, guiding a conditional decoder to produce the corresponding 3D mask. The paper also introduces a pre-training strategy and a feature injection mechanism. Result: Extensive experiments show that the method sets a new state of the art on the authors' SeqAffordSplat benchmark. Conclusion: SeqSplatNet advances affordance reasoning from single-step interactions to complex, sequential tasks at the scene level. Abstract: 3D affordance reasoning, the task of associating human instructions with the functional regions of 3D objects, is a critical capability for embodied agents. Current methods based on 3D Gaussian Splatting (3DGS) are fundamentally limited to single-object, single-step interactions, a paradigm that falls short of addressing the long-horizon, multi-object tasks required for complex real-world applications. To bridge this gap, we introduce the novel task of Sequential 3D Gaussian Affordance Reasoning and establish SeqAffordSplat, a large-scale benchmark featuring 1800+ scenes to support research on long-horizon affordance understanding in complex 3DGS environments. We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks.
SeqSplatNet employs a large language model that autoregressively generates text interleaved with special segmentation tokens, guiding a conditional decoder to produce the corresponding 3D mask. To handle complex scene geometry, we introduce a pre-training strategy, Conditional Geometric Reconstruction, where the model learns to reconstruct complete affordance region masks from known geometric observations, thereby building a robust geometric prior. Furthermore, to resolve semantic ambiguities, we design a feature injection mechanism that lifts rich semantic features from 2D Vision Foundation Models (VFM) and fuses them into the 3D decoder at multiple scales. Extensive experiments demonstrate that our method sets a new state-of-the-art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.

[158] Half-Physics: Enabling Kinematic 3D Human Model with Physical Interactions

Li Siyao,Yao Feng,Omid Tehari,Chen Change Loy,Michael J. Black

Main category: cs.CV

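TL;DR: This work proposes a "half-physics" approach that turns the SMPL-X model into an entity capable of dynamic physical interaction, addressing the interaction problems of traditional kinematic models while preserving real-time operation and high-fidelity motion.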

Details Motivation: Current general-purpose 3D human models such as SMPL-X represent human shape and pose efficiently, but because they are kinematic they cannot interact physically with the environment in a realistic way, which often leads to interpenetration and unrealistic object dynamics. Method: The authors propose a "half-physics" mechanism that transforms 3D kinematic motion into a physics simulation, so that SMPL-X keeps its original pose control while interacting physically with scenes and objects. Result: The method resolves the penetration problem of kinematic interaction models and produces realistic physical interactions, while requiring no complex training, running in real time, and generalizing across body shapes and motions. Conclusion: By embedding a "half-physics" mechanism, SMPL-X can simulate dynamic physical interaction between the human body and its environment while preserving the fidelity of the original motion and real-time operation. Abstract: While current general-purpose 3D human models (e.g., SMPL-X) efficiently represent accurate human shape and pose, they lack the ability to physically interact with the environment due to their kinematic nature. As a result, kinematic-based interaction models often suffer from issues such as interpenetration and unrealistic object dynamics. To address this limitation, we introduce a novel approach that embeds SMPL-X into a tangible entity capable of dynamic physical interactions with its surroundings. Specifically, we propose a "half-physics" mechanism that transforms 3D kinematic motion into a physics simulation. Our approach maintains kinematic control over inherent SMPL-X poses while ensuring physically plausible interactions with scenes and objects, effectively eliminating penetration and unrealistic object dynamics. Unlike reinforcement learning-based methods, which demand extensive and complex training, our half-physics method is learning-free and generalizes to any body shape and motion; meanwhile, it operates in real time. Moreover, it preserves the fidelity of the original kinematic motion while seamlessly integrating physical interactions.

[159] Phi-Ground Tech Report: Advancing Perception in GUI Grounding

Miaosen Zhang,Ziqiang Xu,Jialiang Zhu,Qi Dai,Kai Qiu,Yifan Yang,Chong Luo,Tianyi Chen,Justin Wagle,Tim Franklin,Baining Guo

Main category: cs.CV

TL;DR: The authors develop the Phi-Ground model family, which achieves state-of-the-art performance on all five grounding benchmarks in agent settings.

Details Motivation: Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks such as ScreenSpot-pro and UI-Vision, indicating that they are far from ready for deployment. Method: The authors conduct an empirical study of grounding-model training, examining details from data collection to model training, and ultimately develop the Phi-Ground model family. Result: The Phi-Ground model family achieves state-of-the-art performance on all five grounding benchmarks in agent settings. In the end-to-end model setting, it scores 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. Conclusion: The Phi-Ground model family achieves state-of-the-art performance on all five grounding benchmarks in agent settings among models with fewer than 10 billion parameters. Abstract: With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from "Iron Man", are becoming a reality. GUI grounding is a core component for CUAs to execute actual actions, similar to mechanical control in robotics, and it directly leads to the success or failure of the system. It determines actions such as clicking and typing, as well as related parameters like the coordinates for clicks. Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from being ready for deployment. In this work, we conduct an empirical study on the training of grounding models, examining details from data collection to model training. Ultimately, we developed the Phi-Ground model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: https://zhangmiaosen2000.github.io/Phi-Ground/
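
As a rough illustration of what an accuracy figure for GUI grounding can mean, the sketch below scores a predicted click as correct when it lands inside the target element's bounding box. This point-in-box rule is an assumption made for illustration; the entry above does not describe the exact protocol used by each benchmark, and the function names are hypothetical.

```python
def click_hits_box(click, box):
    """Return True if click (x, y) falls inside box (left, top, right, bottom)."""
    x, y = click
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(clicks, boxes):
    """Fraction of predicted clicks that land inside their target boxes."""
    if not clicks:
        return 0.0
    hits = sum(1 for click, box in zip(clicks, boxes) if click_hits_box(click, box))
    return hits / len(clicks)


# Two predictions against the same 20x20 target: one inside, one outside.
predicted_clicks = [(10, 10), (50, 50)]
target_boxes = [(0, 0, 20, 20), (0, 0, 20, 20)]
accuracy = grounding_accuracy(predicted_clicks, target_boxes)
```

Under this rule a prediction gets no partial credit for landing near the target, which is one reason small on-screen elements make such benchmarks hard.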

[160] MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion

Zihan Wang,Jeff Tan,Tarasha Khurana,Neehar Peri,Deva Ramanan

Main category: cs.CV

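TL;DR: This paper proposes a new method for dynamic scene reconstruction from sparse views, achieving high-quality reconstruction by aligning independent monocular reconstructions for consistency.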

Details Motivation: Existing dynamic scene reconstruction methods usually require dense multi-view captures with hundreds of calibrated cameras, which limits them in cost and flexibility, so the paper proposes a reconstruction method suited to sparse-view setups. Method: The method performs a monocular reconstruction independently for each camera and then aligns these reconstructions for time and view consistency, overcoming the limited viewpoint overlap of sparse-view setups. Result: Experiments show that the method achieves higher reconstruction quality than existing methods on the PanopticStudio and Ego-Exo4D datasets, particularly when rendering novel views. Conclusion: The paper presents a method for dynamic scene reconstruction from sparse-view videos that independently reconstructs each camera's monocular view and aligns the results for time and view consistency, yielding high-quality dynamic scene reconstructions. Abstract: We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio). Such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in-the-wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on PanopticStudio and Ego-Exo4D demonstrate that our method achieves higher quality reconstructions than prior art, particularly when rendering novel views. Code, data, and data-processing scripts are available on https://github.com/ImNotPrepared/MonoFusion.

[161] SUB: Benchmarking CBM Generalization via Synthetic Attribute Substitutions

Jessica Bader,Leander Girrbach,Stephan Alaniz,Zeynep Akata

Main category: cs.CV

TL;DR: The paper introduces SUB, a benchmark for evaluating the robustness of CBMs and similar interpretable models to concept variations, and proposes a novel image generation method called Tied Diffusion Guidance.

Details Motivation: CBMs and other concept-based interpretable models show promise for making AI applications more transparent, but their ability to identify correct concepts under distribution shifts is questionable. This motivates the need for a benchmark to evaluate CBMs' robustness to concept variations. Method: The authors introduce SUB, a fine-grained image and concept benchmark based on the CUB dataset, and a novel Tied Diffusion Guidance (TDG) method to precisely control generated images. Result: The proposed SUB benchmark contains 38,400 synthetic images generated using the TDG method, which enables precise control of generated images by ensuring both the correct bird class and attribute are generated. Conclusion: The paper concludes that CBMs struggle to reliably identify the correct concepts under distribution shifts, and the proposed benchmark SUB enables rigorous evaluation of CBMs and similar interpretable models. Abstract: Concept Bottleneck Models (CBMs) and other concept-based interpretable models show great promise for making AI applications more transparent, which is essential in fields like medicine. Despite their success, we demonstrate that CBMs struggle to reliably identify the correct concepts under distribution shifts. To assess the robustness of CBMs to concept variations, we introduce SUB: a fine-grained image and concept benchmark containing 38,400 synthetic images based on the CUB dataset. To create SUB, we select a CUB subset of 33 bird classes and 45 concepts to generate images which substitute a specific concept, such as wing color or belly pattern. We introduce a novel Tied Diffusion Guidance (TDG) method to precisely control generated images, where noise sharing for two parallel denoising processes ensures that both the correct bird class and the correct attribute are generated. This novel benchmark enables rigorous evaluation of CBMs and similar interpretable models, contributing to the development of more robust methods. 
Our code is available at https://github.com/ExplainableML/sub and the dataset at http://huggingface.co/datasets/Jessica-bader/SUB.

[162] Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

Bowen Zhang,Sicheng Xu,Chuxin Wang,Jiaolong Yang,Feng Zhao,Dong Chen,Baining Guo

Main category: cs.CV

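TL;DR: This paper presents a new video-to-4D generation method that uses a Direct 4DMesh-to-GS Variation Field VAE and a Gaussian Variation Field diffusion model to generate high-quality dynamic 3D content from a single video input.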

Details Motivation: Because direct 4D diffusion modeling faces costly data construction and a high-dimensional representation, the study aims to develop an efficient way to generate high-quality dynamic 3D content. Method: The authors introduce a Direct 4DMesh-to-GS Variation Field VAE that encodes canonical Gaussian Splats (GS) and their temporal variations directly from 3D animation data, compressing high-dimensional animations into a compact latent space, and on top of it train a Gaussian Variation Field diffusion model with a temporal-aware Diffusion Transformer. Result: Trained on the Objaverse dataset, the model shows better generation quality than existing methods and, despite being trained only on synthetic data, generalizes well to real-world video inputs. Conclusion: The paper proposes a new video-to-4D generation framework that creates high-quality dynamic 3D content from a single video input and demonstrates advantages in generation quality and generalization. Abstract: In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully-curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: https://gvfdiffusion.github.io/.

eess.IV [Back]

[163] Rethink Domain Generalization in Heterogeneous Sequence MRI Segmentation

Zheyuan Zhang,Linkai Peng,Wanying Dou,Cuiling Sun,Halil Ertugrul Aktas,Andrea M. Bejar,Elif Keles,Gorkem Durak,Ulas Bagci

Main category: eess.IV

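TL;DR: This paper introduces PancreasDG, a large-scale multi-center 3D MRI pancreas segmentation dataset for studying domain generalization in medical imaging, together with a strong semi-supervised segmentation method.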

Details Motivation: Pancreas segmentation remains a challenge in abdominal imaging, and existing domain-generalization benchmarks focus mainly on cross-center shifts while overlooking the dominant source of variability. Method: The authors present PancreasDG, a large-scale multi-center 3D MRI pancreas segmentation dataset, and a semi-supervised method that exploits anatomical invariances. Result: The proposed method reaches Dice scores of 61.63% and 87.00% on two test centers for cross-sequence segmentation, significantly outperforming state-of-the-art domain generalization techniques. Conclusion: PancreasDG sets a new benchmark for domain generalization in medical imaging, and the proposed semi-supervised method significantly improves cross-sequence segmentation performance. Abstract: Clinical magnetic-resonance (MR) protocols generate many T1 and T2 sequences whose appearance differs more than the acquisition sites that produce them. Existing domain-generalization benchmarks focus almost exclusively on cross-center shifts and overlook this dominant source of variability. Pancreas segmentation remains a major challenge in abdominal imaging: the gland is small, irregular, surrounded by organs and fat, and often suffers from low T1 contrast. State-of-the-art deep networks that already achieve >90% Dice on the liver or kidneys still miss 20-30% of the pancreas. The organ is also systematically under-represented in public cross-domain benchmarks, despite its clinical importance in early cancer detection, surgery, and diabetes research. To close this gap, we present PancreasDG, a large-scale multi-center 3D MRI pancreas segmentation dataset for investigating domain generalization in medical imaging. The dataset comprises 563 MRI scans from six institutions, spanning both venous phase and out-of-phase sequences, enabling study of both cross-center and cross-sequence variations with pixel-accurate pancreas masks created by a double-blind, two-pass protocol. Through comprehensive analysis, we reveal three insights: (i) limited sampling introduces significant variance that may be mistaken for distribution shifts, (ii) cross-center performance correlates with source domain performance for identical sequences, and (iii) cross-sequence shifts require specialized solutions.
We also propose a semi-supervised approach that leverages anatomical invariances, significantly outperforming state-of-the-art domain generalization techniques with 61.63% Dice score improvements and 87.00% on two test centers for cross-sequence segmentation. PancreasDG sets a new benchmark for domain generalization in medical imaging. Dataset, code, and models will be available at https://pancreasdg.netlify.app.
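
For readers unfamiliar with the Dice score used to report these results, the following is a minimal sketch of the standard definition, Dice = 2|A ∩ B| / (|A| + |B|), on flattened binary masks. The function name and the convention of returning 1.0 when both masks are empty are illustrative choices, not taken from the paper.

```python
def dice_score(pred_mask, true_mask):
    """Dice coefficient between two flattened binary masks of equal length."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of elements")
    overlap = sum(1 for p, t in zip(pred_mask, true_mask) if p and t)
    total = sum(1 for p in pred_mask if p) + sum(1 for t in true_mask if t)
    if total == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0
    return 2.0 * overlap / total


# The prediction marks two voxels, the reference marks one, and they share one.
example_dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

Because the denominator is the combined mask size, a small organ such as the pancreas loses Dice quickly from even a few mislabelled voxels, which is consistent with the difficulty the abstract describes.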

[164] Towards High-Resolution Alignment and Super-Resolution of Multi-Sensor Satellite Imagery

Philip Wootaek Shin,Vishal Gaur,Rahul Ramachandran,Manil Maskey,Jack Sampson,Vijaykrishnan Narayanan,Sujit Roy

Main category: eess.IV

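TL;DR: This study develops a framework that uses HLS10 as a reference to align and harmonize HLS30 imagery, improving the quality of super-resolved Landsat imagery and offering evidence of feasibility and insights for heterogeneous satellite image super-resolution.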

Details Motivation: Differences in spatial resolution across satellite sensors pose challenges for data fusion and downstream applications, and existing super-resolution methods rely on artificially downscaled images, making them ill-suited to heterogeneous satellite sensors with differing spectral and temporal characteristics. Method: Harmonized Landsat Sentinel 10m (HLS10) is used as a reference to align and harmonize Harmonized Landsat Sentinel 30m (HLS30) imagery, narrowing the resolution gap between these sensors. Result: Quantitative and qualitative evaluations demonstrate the effectiveness of the method and its potential to enhance satellite-based sensing applications. Conclusion: The study provides a preliminary framework for image super-resolution across heterogeneous satellite sensors and offers insights and considerations for future advances in the field. Abstract: High-resolution satellite imagery is essential for geospatial analysis, yet differences in spatial resolution across satellite sensors present challenges for data fusion and downstream applications. Super-resolution techniques can help bridge this gap, but existing methods rely on artificially downscaled images rather than real sensor data and are not well suited for heterogeneous satellite sensors with differing spectral and temporal characteristics. In this work, we develop a preliminary framework to align and harmonize Harmonized Landsat Sentinel 30m (HLS30) imagery using Harmonized Landsat Sentinel 10m (HLS10) as a reference from the HLS dataset. Our approach aims to bridge the resolution gap between these sensors and improve the quality of super-resolved Landsat imagery. Quantitative and qualitative evaluations demonstrate the effectiveness of our method, showing its potential for enhancing satellite-based sensing applications. This study provides insights into the feasibility of heterogeneous satellite image super-resolution and highlights key considerations for future advancements in the field.