cs.CL

[1] Large Language Models in the Travel Domain: An Industrial Experience

Sergio Di Meglio,Aniello Somma,Luigi Libero Lucio Starace,Fabio Scippacercola,Giancarlo Sperlì,Sergio Di Martino

Main category: cs.CL

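TL;DR: This paper studies the use of large language models (LLMs) in the property booking platform CALEIDOHOTELS to address inconsistencies in third-party data. Mixtral 8x7B outperforms Mistral 7B at generating more complete, more precise, and more concise accommodation descriptions, but at a significantly higher computational cost.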

Details

Motivation: Online property booking platforms rely on consistent, up-to-date accommodation information supplied by third parties, but these data sources are often incomplete or inconsistent, which degrades the user experience and causes a loss of market. The study aims to address these problems and improve data reliability and consistency. Method: An industrial case study evaluates two large language models (LLMs), Mistral 7B and Mixtral 8x7B, comparing their ability to generate consistent, hallucination-free accommodation descriptions. Mistral 7B was fine-tuned with QLoRA, while Mixtral 8x7B was used with a refined system prompt. Result: Mixtral 8x7B outperformed Mistral 7B, with higher completeness (99.6% vs. 93%), higher precision (98.8% vs. 96%), a lower hallucination rate (1.2% vs. 4%), and more concise output (249 vs. 277 words on average). However, its computational cost was significantly higher (50GB VRAM and $1.61/hour vs. 5GB and $0.16/hour). Conclusion: Although Mixtral 8x7B produces more consistent and complete descriptions than Mistral 7B, it does so at a significantly higher computational cost. The study offers practical guidance on the trade-off between model quality and resource efficiency when deploying LLMs in production. Abstract: Online property booking platforms are widely used and rely heavily on consistent, up-to-date information about accommodation facilities, often sourced from third-party providers. However, these external data sources are frequently affected by incomplete or inconsistent details, which can frustrate users and result in a loss of market. In response to these challenges, we present an industrial case study involving the integration of Large Language Models (LLMs) into CALEIDOHOTELS, a property reservation platform developed by FERVENTO. We evaluate two well-known LLMs in this context: Mistral 7B, fine-tuned with QLoRA, and Mixtral 8x7B, utilized with a refined system prompt. Both models were assessed based on their ability to generate consistent and homogeneous descriptions while minimizing hallucinations. Mixtral 8x7B outperformed Mistral 7B in terms of completeness (99.6% vs. 93%), precision (98.8% vs. 96%), and hallucination rate (1.2% vs. 4%), producing shorter, more concise content (249 vs. 277 words on average). However, this came at a significantly higher computational cost: 50GB VRAM and $1.61/hour versus 5GB and $0.16/hour for Mistral 7B. Our findings provide practical insights into the trade-offs between model quality and resource efficiency, offering guidance for deploying LLMs in production environments and demonstrating their effectiveness in enhancing the consistency and reliability of accommodation data.

[2] ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing

Jinzhi Wang,Qingke Peng,Haozhou Li,Zeyuan Zeng,Qinfeng Song,Kaixuan Yang,Jiangbo Zhang,Yaoying Wang,Ruimeng Li,Biyi Zhou

Main category: cs.CL

TL;DR: This paper introduces ElectriQ, a benchmark for evaluating and improving LLMs in electric power marketing, which addresses the lack of domain expertise and empathy in existing models.

Details

Motivation: Current systems like China's 95598 hotline face challenges such as slow response times and limited accuracy in domain-specific tasks. While LLMs like GPT-4o have strong general capabilities, they lack domain expertise and empathy in electric power marketing. Method: The study introduces ElectriQ, a benchmark comprising a dialogue dataset and four evaluation metrics (professionalism, popularity, readability, user-friendliness). It incorporates a domain-specific knowledge base and proposes a knowledge augmentation approach to enhance LLM performance. Result: Experiments on 13 LLMs show that smaller models like LLama3-8B, when fine-tuned and augmented, outperform GPT-4o in professionalism and user-friendliness. Conclusion: ElectriQ provides a comprehensive foundation for developing LLMs tailored to the needs of electric power marketing services by introducing a domain-specific benchmark and knowledge augmentation method. Abstract: Electric power marketing customer service plays a critical role in addressing inquiries, complaints, and service requests. However, current systems, such as China's 95598 hotline, often struggle with slow response times, inflexible procedures, and limited accuracy in domain-specific tasks. While large language models (LLMs) like GPT-4o and Claude 3 demonstrate strong general capabilities, they lack the domain expertise and empathy required in this field. To bridge this gap, we introduce ElectriQ, the first benchmark designed to evaluate and enhance LLMs in electric power marketing scenarios. ElectriQ consists of a dialogue dataset covering six key service categories and introduces four evaluation metrics: professionalism, popularity, readability, and user-friendliness. We further incorporate a domain-specific knowledge base and propose a knowledge augmentation method to boost model performance. 
Experiments on 13 LLMs reveal that smaller models such as LLama3-8B, when fine-tuned and augmented, can surpass GPT-4o in terms of professionalism and user-friendliness. ElectriQ establishes a comprehensive foundation for developing LLMs tailored to the needs of power marketing services.

Navid Yazdanjue,Morteza Rakhshaninejad,Hossein Yazdanjouei,Mohammad Sadegh Khorshidi,Mikko S. Niemela,Fang Chen,Amir H. Gandomi

Main category: cs.CL

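TL;DR: This paper proposes a framework for detecting illicit marketplace content that combines language models with ensemble learning; it performs well across multiple platforms, with high accuracy and robustness.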

Details

Motivation: Illegal marketplaces have increasingly moved to concealed parts of the internet, such as the deep and dark web, and to platforms such as Telegram, Reddit, and Pastebin. Detecting and classifying this content is challenging because of limited labeled data, constantly evolving illicit language, and the structural heterogeneity of online sources. Method: The framework combines fine-tuned language models with a semi-supervised ensemble learning strategy. ModernBERT extracts semantic representations, which are combined with manually engineered features (document structure, embedded patterns such as Bitcoin addresses, emails, and IPs, and metadata) for two-stage classification. The first stage uses a semi-supervised ensemble of XGBoost, Random Forest, and SVM to detect sales-related documents; the second stage classifies these as drug, weapon, or credential sales. Result: Experiments on three datasets show that the model outperforms several baselines (BERT, ModernBERT, DarkBERT, ALBERT, Longformer, and BigBird), achieving an accuracy of 0.96489, an F1-score of 0.93467, and a TMCC of 0.95388. Conclusion: The paper proposes a hierarchical classification framework that effectively detects and classifies illicit marketplace content, showing strong generalization, robustness under limited supervision, and effectiveness in real-world illicit content detection. Abstract: Illegal marketplaces have increasingly shifted to concealed parts of the internet, including the deep and dark web, as well as platforms such as Telegram, Reddit, and Pastebin. These channels enable the anonymous trade of illicit goods including drugs, weapons, and stolen credentials. Detecting and categorizing such content remains challenging due to limited labeled data, the evolving nature of illicit language, and the structural heterogeneity of online sources. This paper presents a hierarchical classification framework that combines fine-tuned language models with a semi-supervised ensemble learning strategy to detect and classify illicit marketplace content across diverse platforms. We extract semantic representations using ModernBERT, a transformer model for long documents, finetuned on domain-specific data from deep and dark web pages, Telegram channels, Subreddits, and Pastebin pastes to capture specialized jargon and ambiguous linguistic patterns. In addition, we incorporate manually engineered features such as document structure, embedded patterns including Bitcoin addresses, emails, and IPs, and metadata, which complement language model embeddings. The classification pipeline operates in two stages. The first stage uses a semi-supervised ensemble of XGBoost, Random Forest, and SVM with entropy-based weighted voting to detect sales-related documents. The second stage further classifies these into drug, weapon, or credential sales. 
Experiments on three datasets, including our multi-source corpus, DUTA, and CoDA, show that our model outperforms several baselines, including BERT, ModernBERT, DarkBERT, ALBERT, Longformer, and BigBird. The model achieves an accuracy of 0.96489, an F1-score of 0.93467, and a TMCC of 0.95388, demonstrating strong generalization, robustness under limited supervision, and effectiveness in real-world illicit content detection.
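
The entropy-based weighted voting used in the first stage can be illustrated with a small sketch. The abstract does not give the weighting formula, so this assumes a classifier's weight is one minus its normalized prediction entropy; `entropy_weight`, `weighted_vote`, and the toy probabilities are hypothetical, not the authors' implementation.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_weight(probs):
    """Assumed weighting: 1 - normalized entropy, so confident votes count more."""
    max_entropy = math.log(len(probs))
    return 1.0 - entropy(probs) / max_entropy

def weighted_vote(predictions):
    """Combine per-classifier probability vectors into one distribution."""
    n_classes = len(predictions[0])
    weights = [entropy_weight(p) for p in predictions]
    total = sum(weights)
    if total == 0:
        # Every classifier is maximally uncertain: fall back to a plain average.
        weights, total = [1.0] * len(predictions), float(len(predictions))
    return [
        sum(w * p[c] for w, p in zip(weights, predictions)) / total
        for c in range(n_classes)
    ]

# Toy probabilities for [not-sale, sale] from three hypothetical base models.
votes = [[0.10, 0.90], [0.45, 0.55], [0.70, 0.30]]
combined = weighted_vote(votes)
label = max(range(len(combined)), key=combined.__getitem__)
```

With these toy inputs the confident first model dominates the near-uniform second one, which is the intended effect of weighting votes by entropy.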

[4] A Hybrid Framework for Subject Analysis: Integrating Embedding-Based Regression Models with Large Language Models

Jinyu Liu,Xiaoying Song,Diana Zhang,Jason Thomale,Daqing He,Lingzi Hong

Main category: cs.CL

TL;DR: This paper proposes a hybrid framework combining ML models and LLMs to improve subject term prediction for library resources.

Details

Motivation: LLMs are underexplored for subject analysis, while traditional ML models struggle with unseen cases. A solution is needed to leverage the strengths of both while mitigating their weaknesses. Method: A hybrid framework integrating embedding-based ML models with LLMs was developed. ML models predicted the optimal number of LCSH labels to guide LLMs and post-edited the LLM outputs using actual LCSH terms. Result: The hybrid approach resulted in more controlled and vocabulary-aligned outputs compared to using LLMs alone. Conclusion: The hybrid framework combining ML models and LLMs improves subject term prediction by controlling output and aligning with LCSH vocabulary. Abstract: Providing subject access to information resources is an essential function of any library management system. Large language models (LLMs) have been widely used in classification and summarization tasks, but their capability to perform subject analysis is underexplored. Multi-label classification with traditional machine learning (ML) models has been used for subject analysis but struggles with unseen cases. LLMs offer an alternative but often over-generate and hallucinate. Therefore, we propose a hybrid framework that integrates embedding-based ML models with LLMs. This approach uses ML models to (1) predict the optimal number of LCSH labels to guide LLM predictions and (2) post-edit the predicted terms with actual LCSH terms to mitigate hallucinations. We experimented with LLMs and the hybrid framework to predict the subject terms of books using the Library of Congress Subject Headings (LCSH). Experiment results show that providing initial predictions to guide LLM generations and imposing post-edits result in more controlled and vocabulary-aligned outputs.
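
The two roles the ML models play, capping the number of labels and post-editing LLM terms onto real vocabulary entries, can be sketched roughly as below. Plain string similarity from `difflib` stands in for the paper's embedding-based models, and the three-term vocabulary, the `post_edit` helper, and the sample LLM output are all illustrative assumptions.

```python
from difflib import SequenceMatcher

# Toy controlled vocabulary standing in for real LCSH terms (illustrative only).
VOCABULARY = ["Machine learning", "Natural language processing", "Library science"]

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def post_edit(llm_terms, predicted_count, vocabulary=VOCABULARY):
    """Map each LLM term to its closest vocabulary term, keep at most predicted_count."""
    edited = []
    for term in llm_terms:
        best = max(vocabulary, key=lambda v: similarity(term, v))
        if best not in edited:  # drop duplicates created by the mapping
            edited.append(best)
    return edited[:predicted_count]

# Hypothetical LLM output that over-generates and drifts from the vocabulary.
raw = ["machine-learning", "NLP and natural language processing", "deep learning methods"]
print(post_edit(raw, predicted_count=2))
```

Because every returned term is drawn from the vocabulary, a hallucinated heading cannot survive the post-edit step; the label count would come from a separate regression model, which is not shown here.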

[5] Full Triple Matcher: Integrating all triple elements between heterogeneous Knowledge Graphs

Victor Eiti Yamamoto,Hideaki Takeda

Main category: cs.CL

TL;DR: This paper introduces a novel knowledge graph integration method focused on context matching through label and triple matching techniques, achieving strong performance across diverse datasets.

Details

Motivation: Context matching in knowledge graphs remains largely unexplored, despite its importance in real-world KGs that vary in source, size, and information density. Method: The method involves label matching using string manipulation, fuzzy matching, and vector similarity, and triple matching to improve entity-matching accuracy. Result: The approach achieves high accuracy across diverse test cases and performs competitively compared to leading systems in the OAEI competition and supervised methods. Conclusion: The proposed KG integration method, which includes label matching and triple matching, demonstrates competitive performance and introduces a more comprehensive dataset for evaluating triple-matching. Abstract: Knowledge graphs (KGs) are powerful tools for representing and reasoning over structured information. Their main components include schema, identity, and context. While schema and identity matching are well-established in ontology and entity matching research, context matching remains largely unexplored. This is particularly important because real-world KGs often vary significantly in source, size, and information density - factors not typically represented in the datasets on which current entity matching methods are evaluated. As a result, existing approaches may fall short in scenarios where diverse and complex contexts need to be integrated. To address this gap, we propose a novel KG integration method consisting of label matching and triple matching. We use string manipulation, fuzzy matching, and vector similarity techniques to align entity and predicate labels. Next, we identify mappings between triples that convey comparable information, using these mappings to improve entity-matching accuracy. Our approach demonstrates competitive performance compared to leading systems in the OAEI competition and against supervised methods, achieving high accuracy across diverse test cases. 
Additionally, we introduce a new dataset derived from the benchmark dataset to evaluate the triple-matching step more comprehensively.
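
A rough sketch of the label-matching step follows, covering only string manipulation and fuzzy matching (the vector-similarity component is omitted). The normalization rules, the 0.8 threshold, and the function names are assumptions for illustration, not the authors' implementation.

```python
import re
from difflib import SequenceMatcher

def normalize(label):
    """String manipulation step: split camelCase, drop punctuation, lowercase."""
    label = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", label)
    label = re.sub(r"[^A-Za-z0-9]+", " ", label)
    return label.strip().lower()

def match_labels(source_labels, target_labels, threshold=0.8):
    """Return {source: target} for the best fuzzy match at or above threshold."""
    matches = {}
    for src in source_labels:
        scored = [
            (SequenceMatcher(None, normalize(src), normalize(tgt)).ratio(), tgt)
            for tgt in target_labels
        ]
        score, best = max(scored)
        if score >= threshold:
            matches[src] = best
    return matches

pairs = match_labels(
    ["birthPlace", "date_of_birth"],
    ["birth place", "birth date", "author"],
)
```

Here `birthPlace` aligns with `birth place` after normalization, while `date_of_birth` stays unmatched because character-level similarity alone is too low; cases like that are where vector similarity or triple-level evidence would be needed.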

[6] Theoretical Foundations and Mitigation of Hallucination in Large Language Models

Esmail Gumaan

Main category: cs.CL

TL;DR: This paper provides a theoretical framework for analyzing hallucinations in LLMs and proposes practical strategies for detecting, mitigating, and evaluating them.

Details

Motivation: Hallucinations in LLMs lead to content that is not faithful to input or facts, which poses a significant challenge for reliable AI applications. Method: The paper uses learning-theoretic frameworks (PAC-Bayes and Rademacher complexity) to derive bounds on hallucination risk and surveys detection and mitigation strategies. Result: The paper formally defines hallucination risk, proposes detection strategies (e.g., uncertainty estimation, confidence calibration), mitigation approaches (e.g., retrieval-augmented generation, logit calibration), and outlines evaluation protocols for hallucinations. Conclusion: The paper concludes that hallucinations in LLMs can be theoretically analyzed and practically addressed through detection and mitigation strategies, laying a foundation for future research and applications. Abstract: Hallucination in Large Language Models (LLMs) refers to the generation of content that is not faithful to the input or the real-world facts. This paper provides a rigorous treatment of hallucination in LLMs, including formal definitions and theoretical analyses. We distinguish between intrinsic and extrinsic hallucinations, and define a hallucination risk for models. We derive bounds on this risk using learning-theoretic frameworks (PAC-Bayes and Rademacher complexity). We then survey detection strategies for hallucinations, such as token-level uncertainty estimation, confidence calibration, and attention alignment checks. On the mitigation side, we discuss approaches including retrieval-augmented generation, hallucination-aware fine-tuning, logit calibration, and the incorporation of fact-verification modules. We propose a unified detection and mitigation workflow, illustrated with a diagram, to integrate these strategies. Finally, we outline evaluation protocols for hallucination, recommending datasets, metrics, and experimental setups to quantify and reduce hallucinations. 
Our work lays a theoretical foundation and practical guidelines for addressing the crucial challenge of hallucination in LLMs.

[7] Reading Between the Timelines: RAG for Answering Diachronic Questions

Kwun Hang Lau,Ruiyuan Zhang,Weijie Shi,Xiaofang Zhou,Xiaojun Cheng

Main category: cs.CL

TL;DR: This paper presents a new framework for Retrieval-Augmented Generation (RAG) systems that infuses temporal logic to better handle longitudinal queries, leading to improved accuracy in answering complex questions that involve tracking changes over time.

Details

Motivation: The motivation behind this study is the observed deficit in conventional RAG systems in handling longitudinal queries that require tracking entities and phenomena across time due to the lack of temporal coherence in evidence gathering. Method: The paper proposes a new framework that redesigns the RAG pipeline by disentangling a user's query into its core subject and its temporal window. It then employs a specialized retriever that calibrates semantic matching against temporal relevance to ensure the collection of temporally coherent evidence. Result: Empirical results on the introduced Analytical Diachronic Question Answering Benchmark (ADQAB) show that the proposed approach yields substantial gains in answer accuracy, surpassing standard RAG implementations by 13% to 27%. Conclusion: This paper concludes that by infusing temporal logic into the RAG pipeline, it is possible to substantially improve the handling of longitudinal queries, providing a validated pathway towards performing nuanced, evolutionary analysis on complex, real-world questions. Abstract: While Retrieval-Augmented Generation (RAG) excels at injecting static, factual knowledge into Large Language Models (LLMs), it exhibits a critical deficit in handling longitudinal queries that require tracking entities and phenomena across time. This blind spot arises because conventional, semantically-driven retrieval methods are not equipped to gather evidence that is both topically relevant and temporally coherent for a specified duration. We address this challenge by proposing a new framework that fundamentally redesigns the RAG pipeline to infuse temporal logic. Our methodology begins by disentangling a user's query into its core subject and its temporal window. It then employs a specialized retriever that calibrates semantic matching against temporal relevance, ensuring the collection of a contiguous evidence set that spans the entire queried period. 
To enable rigorous evaluation of this capability, we also introduce the Analytical Diachronic Question Answering Benchmark (ADQAB), a challenging evaluation suite grounded in a hybrid corpus of real and synthetic financial news. Empirical results on ADQAB show that our approach yields substantial gains in answer accuracy, surpassing standard RAG implementations by 13% to 27%. This work provides a validated pathway toward RAG systems capable of performing the nuanced, evolutionary analysis required for complex, real-world questions. The dataset and code for this study are publicly available at https://github.com/kwunhang/TA-RAG.
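
To make the idea of calibrating semantic matching against temporal relevance concrete, here is a minimal sketch. The linear blend, the `alpha` weight, the 30-day decay scale, and the document fields are assumptions chosen for illustration; the actual scoring used in TA-RAG may differ.

```python
from datetime import date

def temporal_relevance(doc_date, start, end):
    """1.0 inside the query window, decaying with distance (in days) outside it."""
    if start <= doc_date <= end:
        return 1.0
    gap = (start - doc_date).days if doc_date < start else (doc_date - end).days
    return 1.0 / (1.0 + gap / 30.0)

def rank(docs, start, end, alpha=0.5):
    """Sort docs by a blend of semantic score and temporal relevance."""
    def score(doc):
        temporal = temporal_relevance(doc["date"], start, end)
        return alpha * doc["semantic"] + (1 - alpha) * temporal
    return sorted(docs, key=score, reverse=True)

# Toy corpus: "semantic" stands in for a precomputed query-document similarity.
docs = [
    {"id": "a", "date": date(2021, 3, 1), "semantic": 0.90},
    {"id": "b", "date": date(2023, 6, 1), "semantic": 0.70},
    {"id": "c", "date": date(2023, 9, 1), "semantic": 0.60},
]
ranked = rank(docs, start=date(2023, 1, 1), end=date(2023, 12, 31))
```

Document `a` is the closest semantic match but falls well outside the 2023 window, so it ranks last. A purely semantic retriever would have put it first.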

[8] Semantic Convergence: Investigating Shared Representations Across Scaled LLMs

Daniel Son,Sanjana Rathore,Andrew Rufail,Adrian Simon,Daniel Zhang,Soham Dave,Cole Blondin,Kevin Zhu,Sean O'Brien

Main category: cs.CL

TL;DR: The study finds evidence of feature universality in Gemma-2 language models of different sizes, particularly in middle layers, suggesting that despite scale differences, models develop similar internal representations.

Details

Motivation: To understand if models with significant size differences converge on similar internal concepts, which could underpin cross-model interpretability. Method: The study uses Sparse Autoencoder (SAE) dictionary-learning to analyze residual-stream activations in Gemma-2 models, aligning features through activation correlation and comparing them using SVCCA and RSA. Result: Middle layers of the models showed strong feature overlap, while early and late layers were less similar; initial experiments suggest semantically similar multi-token subspaces interact similarly with models. Conclusion: Large language models, despite size differences, develop similar and interpretable internal features, supporting the concept of feature universality. Abstract: We investigate feature universality in Gemma-2 language models (Gemma-2-2B and Gemma-2-9B), asking whether models with a four-fold difference in scale still converge on comparable internal concepts. Using the Sparse Autoencoder (SAE) dictionary-learning pipeline, we utilize SAEs on each model's residual-stream activations, align the resulting monosemantic features via activation correlation, and compare the matched feature spaces with SVCCA and RSA. Middle layers yield the strongest overlap, while early and late layers show far less similarity. Preliminary experiments extend the analysis from single tokens to multi-token subspaces, showing that semantically similar subspaces interact similarly with language models. These results strengthen the case that large language models carve the world into broadly similar, interpretable features despite size differences, reinforcing universality as a foundation for cross-model interpretability.
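
The step of aligning features via activation correlation can be sketched as follows, assuming each feature is represented by its activations over the same token sequence in both models. The feature names and toy activations are made up, and the SAE training and the SVCCA/RSA comparison are not shown.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def align_features(acts_a, acts_b):
    """For each feature in A, find the feature in B with the highest correlation."""
    mapping = {}
    for name_a, vec_a in acts_a.items():
        best = max(acts_b, key=lambda name_b: pearson(vec_a, acts_b[name_b]))
        mapping[name_a] = (best, pearson(vec_a, acts_b[best]))
    return mapping

# Hypothetical feature activations over the same four tokens in two models.
small = {"s0": [0, 1, 0, 2], "s1": [1, 0, 1, 0]}
large = {"l0": [1, 0, 2, 0], "l1": [0, 2, 0, 4], "l2": [1, 1, 1, 1]}
mapping = align_features(small, large)
```

Each small-model feature is paired with its most correlated large-model feature; the matched pairs are what a later similarity analysis would compare.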

[9] A novel language model for predicting serious adverse event results in clinical trials from their prospective registrations

Qixuan Hu,Xumou Zhang,Jinman Kim,Florence Bourgeois,Adam G. Dunn

Main category: cs.CL

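TL;DR: This paper explores methods for predicting serious adverse event (SAE) results from clinical trial registration information. By developing two prediction models with a transfer learning approach, it predicts which trial arm has the higher SAE rate and the proportion of SAEs in the control arm, and it highlights the untapped value of clinical trial data.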

Details

Motivation: With accurate estimates of expected safety results, clinical trials could avoid terminations and limit participants' exposure to unnecessary risks. Method: Using data from 22,107 two-arm parallel interventional clinical trials on ClinicalTrials.gov, two prediction models were developed, with a transfer learning approach that uses pretrained language models for feature extraction. Result: The best model (ClinicalT5+Transformer+MLP) achieved 77.6% AUC when predicting which trial arm would have the higher SAE rate, and an RMSE of 18.6% when predicting the proportion of participants experiencing SAEs in the control arm. The sliding window approach consistently outperformed methods without it. Conclusion: Predicting serious adverse event results could help optimize clinical trial design, avoiding early terminations and unnecessary risks to participants. Abstract: Objectives: With accurate estimates of expected safety results, clinical trials could be designed to avoid terminations and limit exposing participants to unnecessary risks. We evaluated methods for predicting serious adverse event (SAE) results in clinical trials using information only from their registrations prior to the trial. Material and Methods: We analysed 22,107 two-arm parallel interventional clinical trials from ClinicalTrials.gov with structured summary results. Two prediction models were developed: a classifier predicting whether the experimental arm will have higher SAE rates than the control arm (area under the receiver operating characteristic curve; AUC), and a regression model predicting the proportion of SAEs in control arms (root mean squared error; RMSE). A transfer learning approach using pretrained language models (e.g., ClinicalT5, BioBERT) was used for feature extraction, combined with a downstream model for prediction. To maintain semantic representation in long trial texts exceeding localised language model input limits, a sliding window method was developed for embedding extraction. Results: The best model (ClinicalT5+Transformer+MLP) had 77.6% AUC predicting which trial arm has a higher proportion of patients with SAEs. When predicting the proportion of participants experiencing SAEs in the control arm, the same model achieved an RMSE of 18.6%. The sliding window approach consistently outperformed methods without it. Across 12 classifiers, the average absolute AUC increase was 2.00%; across 12 regressors, the average absolute RMSE reduction was 1.58%. Discussion: Summary results data available at ClinicalTrials.gov remains underutilised. 
The potential to estimate results of trials before they start is an opportunity to improve trial design and flag discrepancies between expected and reported safety results.
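
A sliding window for embedding long registration texts might look like the sketch below. The window and stride sizes, the mean pooling across windows, and the `toy_embed` stand-in encoder are assumptions; the abstract does not say how the window embeddings are combined.

```python
def sliding_windows(tokens, window=512, stride=256):
    """Return overlapping token windows that together cover the whole sequence."""
    if len(tokens) <= window:
        return [tokens]
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows

def embed_long_text(tokens, embed, window=512, stride=256):
    """Mean-pool the per-window embeddings into a single vector (assumed pooling)."""
    vectors = [embed(w) for w in sliding_windows(tokens, window, stride)]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def toy_embed(window_tokens):
    """Stand-in encoder: a real pipeline would call a pretrained model here."""
    return [float(len(window_tokens)), float(sum(window_tokens))]

tokens = list(range(10))
chunks = sliding_windows(tokens, window=4, stride=2)
vector = embed_long_text(tokens, toy_embed, window=4, stride=2)
```

Overlapping windows mean that text near a window boundary is still seen with context on both sides, and no part of a long document is truncated.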

[10] Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey

Jindong Li,Yali Fu,Jiahong Liu,Linxiao Cao,Wei Ji,Menglin Yang,Irwin King,Ming-Hsuan Yang

Main category: cs.CL

TL;DR: The paper provides a comprehensive survey of discrete tokenization methods for large language models (LLMs), presenting a taxonomy of vector quantization techniques and analyzing their application in LLM-based systems, while identifying key challenges and future research directions.

Details

Motivation: The rapid advancement of large language models (LLMs) has intensified the need for effective mechanisms to transform continuous multimodal data into discrete representations suitable for language-based processing. Despite its growing importance, there is a lack of a comprehensive survey that systematically examines VQ techniques in the context of LLM-based systems. Method: The study presents a structured taxonomy and analysis of discrete tokenization methods for LLMs, categorizing eight representative VQ variants and analyzing their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. Result: The study identifies key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints, and discusses emerging research directions such as dynamic and task-adaptive quantization, unified tokenization frameworks, and biologically inspired codebook learning. It also highlights how quantization strategies influence alignment, reasoning, and generation performance across different types of LLM-based systems. Conclusion: This survey bridges the gap between traditional vector quantization and modern LLM applications, serving as a foundational reference for the development of efficient and generalizable multimodal systems. Abstract: The rapid advancement of large language models (LLMs) has intensified the need for effective mechanisms to transform continuous multimodal data into discrete representations suitable for language-based processing. Discrete tokenization, with vector quantization (VQ) as a central approach, offers both computational efficiency and compatibility with LLM architectures. Despite its growing importance, there is a lack of a comprehensive survey that systematically examines VQ techniques in the context of LLM-based systems. This work fills this gap by presenting the first structured taxonomy and analysis of discrete tokenization methods designed for LLMs. 
We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. Beyond algorithm-level investigation, we discuss existing research in terms of classical applications without LLMs, LLM-based single-modality systems, and LLM-based multimodal systems, highlighting how quantization strategies influence alignment, reasoning, and generation performance. In addition, we identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints. Finally, we discuss emerging research directions such as dynamic and task-adaptive quantization, unified tokenization frameworks, and biologically inspired codebook learning. This survey bridges the gap between traditional vector quantization and modern LLM applications, serving as a foundational reference for the development of efficient and generalizable multimodal systems. A continuously updated version is available at: https://github.com/jindongli-Ai/LLM-Discrete-Tokenization-Survey.
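
The core operation behind vector quantization, replacing a continuous feature vector with the index of its nearest codebook entry, can be sketched in a few lines. The codebook and feature values are toy numbers, and codebook learning (where issues such as codebook collapse arise) is not shown.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to vector (squared L2)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, codebook[index]

def tokenize(vectors, codebook):
    """Turn a sequence of continuous vectors into discrete token ids."""
    return [quantize(v, codebook)[0] for v in vectors]

# Toy 2-D codebook with three entries, and three continuous feature vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
features = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.9]]
tokens = tokenize(features, codebook)
```

The resulting integer ids can be treated like text tokens by a language model, which is what makes discrete tokenization compatible with LLM architectures.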

[11] Fast and Accurate Contextual Knowledge Extraction Using Cascading Language Model Chains and Candidate Answers

Lee Harris

Main category: cs.CL

TL;DR: The paper introduces the Language Model Chain (LMC) algorithm, which reduces the cost and hallucinations of language models by using a multi-stage cascade approach, significantly improving accuracy and speed in extracting patient dates of birth from medical documents.

Details

Motivation: Language models are costly and prone to hallucinations, which can waste resources if the generated information is incorrect. This research aims to address these issues. Method: The Language Model Chain (LMC) algorithm was proposed, implemented, and applied. This algorithm ensures that a language model's response to a given prompt is only considered correct if it exists in a collection of possible answers. If incorrect, the response is fed into a more predictive (but slower) language model. The process repeats until all predictions are correct. Result: When applied to extract patient dates of birth from medical documents, the LMC algorithm, using a multi-stage cascade of language models, significantly improved prediction speed and accuracy while reducing hallucinations. Conclusion: The LMC algorithm contributes significantly to the field of knowledge extraction and should be further explored in the future. Abstract: Language models can capture complex relationships in given text, but these are notorious for being costly and for producing information that does not exist (i.e., hallucinations). Furthermore, the resources invested into producing this information would be wasted if it were incorrect. We address these issues by proposing, implementing, and applying the Language Model Chain (LMC) algorithm. In this, a language model's response to a given prompt about given text is only correct if it exists in the collection of possible (i.e., candidate) answers, and text corresponding to incorrect responses is fed into a more predictive (but slower) language model. This process is repeated for a collection of language models, or until all predictions about the text are correct. 
We used the LMC algorithm to extract patient dates of birth from medical documents, and combining a collection of language models in a multi-stage cascade significantly increased prediction speed and accuracy over individual language models, while greatly reducing the number of corresponding hallucinations. We believe that the novel LMC algorithm significantly contributes to the knowledge extraction field, and that this should be explored much further in the future.
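
The cascade described above can be sketched as a short loop. The regex-based candidate extraction and the two stand-in "models" are hypothetical placeholders for real language models, chosen only so the control flow can be run end to end.

```python
import re

DATE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")

def candidate_dates(doc):
    """Candidate answers: every date string that actually appears in the text."""
    return set(DATE.findall(doc))

def language_model_chain(documents, models, candidates_for):
    """Run models fastest-first; escalate documents whose answer is not a candidate."""
    answers, pending = {}, list(documents)
    for model in models:
        still_pending = []
        for doc in pending:
            answer = model(doc)
            if answer in candidates_for(doc):
                answers[doc] = answer  # accepted: the answer exists in the text
            else:
                still_pending.append(doc)  # treated as a hallucination, escalate
        pending = still_pending
        if not pending:
            break
    return answers, pending

def fast_model(doc):
    """Stand-in for a small, fast model that only handles the 'DOB:' pattern."""
    m = re.search(r"DOB: (\d{2}/\d{2}/\d{4})", doc)
    return m.group(1) if m else "01/01/1900"

def slow_model(doc):
    """Stand-in for a larger, slower model that also handles 'born on'."""
    m = re.search(r"(?:DOB:|born on) (\d{2}/\d{2}/\d{4})", doc)
    return m.group(1) if m else "01/01/1900"

docs = ["DOB: 04/07/1985, seen 12/01/2024", "Patient born on 23/11/1990", "No date recorded"]
answers, unresolved = language_model_chain(docs, [fast_model, slow_model], candidate_dates)
```

Only documents the fast model gets wrong reach the slow model, which is where the speed gain comes from. Any document still unresolved after the last model is returned separately instead of being given a fabricated answer.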

[12] Predicting stock prices with ChatGPT-annotated Reddit sentiment

Mateusz Kmak,Kamil Chmurzyński,Kamil Matejuk,Paweł Kotzbach,Jan Kocoń

Main category: cs.CL

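TL;DR: This paper examines whether social media sentiment can predict stock market movements and introduces a new RoBERTa-based method for interpreting the informal language and emojis of social media. It finds only a weak correlation between social media sentiment and stock prices; simpler metrics such as comment volume and Google search trends show stronger predictive signals.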

Details

Motivation: The surge of retail investor activity on social media, exemplified by the 2021 GameStop short squeeze, prompted debate about the influence of online sentiment on stock prices. This paper examines whether sentiment derived from social media discussions can meaningfully predict stock market movements. Method: Two existing text-based sentiment analysis methods are used alongside a third, a ChatGPT-annotated and fine-tuned RoBERTa-based model designed to better interpret the informal language and emojis in social media discussions. Correlation and causality metrics are used to determine the models' predictive power. Result: Social media sentiment shows only a weak correlation with stock prices. Conclusion: Traditional sentiment analysis may not fully capture the nuances of market-moving online discussions, while simpler metrics such as comment volume and Google search trends show stronger predictive signals, which highlights the complexity of retail investor behavior. Abstract: The surge of retail investor activity on social media, exemplified by the 2021 GameStop short squeeze, raised questions about the influence of online sentiment on stock prices. This paper explores whether sentiment derived from social media discussions can meaningfully predict stock market movements. We focus on Reddit's r/wallstreetbets and analyze sentiment related to two companies: GameStop (GME) and AMC Entertainment (AMC). To assess sentiment's role, we employ two existing text-based sentiment analysis methods and introduce a third, a ChatGPT-annotated and fine-tuned RoBERTa-based model designed to better interpret the informal language and emojis prevalent in social media discussions. We use correlation and causality metrics to determine these models' predictive power. Surprisingly, our findings suggest that social media sentiment has only a weak correlation with stock prices. At the same time, simpler metrics, such as the volume of comments and Google search trends, exhibit stronger predictive signals. These results highlight the complexity of retail investor behavior and suggest that traditional sentiment analysis may not fully capture the nuances of market-moving online discussions.

[13] How and Where to Translate? The Impact of Translation Strategies in Cross-lingual LLM Prompting

Aman Gupta,Yingying Zhuang,Zhou Yu,Ziji Zhang,Anurag Beniwal

Main category: cs.CL

TL;DR: This paper investigates how optimizing prompt translation strategies can enhance knowledge sharing across languages in multilingual systems using RAG-enhanced LLMs, showing significant improvements in classification tasks, especially for low-resource languages.

Details

Motivation: Despite progress in multilingual LLMs, their performance varies across languages and tasks. Knowledge bases are often shared from high-resource to low-resource languages, creating challenges in multilingual RAG-based systems. This paper investigates how prompt strategies can address these challenges. Method: The paper systematically evaluates different prompt translation strategies for classification tasks using retrieval-augmented generation (RAG)-enhanced large language models (LLMs) in multilingual settings. Result: Experimental results show that an optimized prompting strategy can significantly improve knowledge sharing across languages and boost performance on downstream classification tasks. Conclusion: The study concludes that optimizing prompt translation strategies enhances knowledge sharing across languages, improving the performance of multilingual systems, especially for low-resource languages. Abstract: Despite advances in the multilingual capabilities of Large Language Models (LLMs), their performance varies substantially across different languages and tasks. In multilingual retrieval-augmented generation (RAG)-based systems, knowledge bases (KB) are often shared from high-resource languages (such as English) to low-resource ones, resulting in retrieved information from the KB being in a different language than the rest of the context. In such scenarios, two common practices are pre-translation to create a mono-lingual prompt and cross-lingual prompting for direct inference. However, the impact of these choices remains unclear. In this paper, we systematically evaluate the impact of different prompt translation strategies for classification tasks with RAG-enhanced LLMs in multilingual systems. Experimental results show that an optimized prompting strategy can significantly improve knowledge sharing across languages, thereby improving performance on the downstream classification task. 
The findings advocate for a broader utilization of multilingual resource sharing and cross-lingual prompt optimization for non-English languages, especially the low-resource ones.

[14] Using Sentiment Analysis to Investigate Peer Feedback by Native and Non-Native English Speakers

Brittney Exline,Melanie Duffin,Brittany Harbison,Chrissa da Gomez,David Joyner

Main category: cs.CL

TL;DR: This study examines how students' language background affects their peer feedback experience in online U.S.-based computing courses.

Details

Motivation: Many international students take U.S.-based online courses taught in English, which motivates a study of how native versus non-native English speaker status affects the peer feedback experience. Method: The sentiment of peer reviews for 500 students was analyzed with a Twitter-roBERTa-based model, and sentiment scores and peer feedback ratings were related to students' language background. Result: Native English speakers rate feedback less favorably, while non-native speakers write more positive feedback but receive less positive sentiment in return. When controlling for sex and age, significant interactions emerge. Conclusion: Language background plays a modest but complex role in shaping peer feedback experiences. Abstract: Graduate-level CS programs in the U.S. increasingly enroll international students, with 60.2 percent of master's degrees in 2023 awarded to non-U.S. students. Many of these students take online courses, where peer feedback is used to engage students and improve pedagogy in a scalable manner. Since these courses are conducted in English, many students study in a language other than their first. This paper examines how native versus non-native English speaker status affects three metrics of peer feedback experience in online U.S.-based computing courses. Using the Twitter-roBERTa-based model, we analyze the sentiment of peer reviews written by and to a random sample of 500 students. We then relate sentiment scores and peer feedback ratings to students' language background. Results show that native English speakers rate feedback less favorably, while non-native speakers write more positively but receive less positive sentiment in return. When controlling for sex and age, significant interactions emerge, suggesting that language background plays a modest but complex role in shaping peer feedback experiences.

[15] Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents

Haoran Sun,Shaoning Zeng

Main category: cs.CL

TL;DR: This paper introduces H-MEM, a hierarchical memory architecture for LLM Agents that improves memory organization and retrieval efficiency, demonstrating superior performance in long-term dialogue tasks.

Details

Motivation: The study aimed to overcome limitations in existing memory mechanisms, which often fall short in structured memory organization and efficient retrieval for LLM Agents. Method: A Hierarchical Memory (H-MEM) architecture was developed, organizing memory in a multi-level structure with positional index encoding and an index-based routing mechanism for efficient retrieval. Result: The experimental evaluation on the LoCoMo dataset showed that the H-MEM approach consistently outperformed five baseline methods, particularly in long-term dialogue scenarios. Conclusion: The proposed H-MEM architecture effectively enhances the long-term memory capabilities of LLM Agents by enabling structured memory organization and efficient retrieval. Abstract: Long-term memory is one of the key factors influencing the reasoning capabilities of Large Language Model Agents (LLM Agents). Incorporating a memory mechanism that effectively integrates past interactions can significantly enhance decision-making and contextual coherence of LLM Agents. While recent works have made progress in memory storage and retrieval, such as encoding memory into dense vectors for similarity-based search or organizing knowledge in the form of graph, these approaches often fall short in structured memory organization and efficient retrieval. To address these limitations, we propose a Hierarchical Memory (H-MEM) architecture for LLM Agents that organizes and updates memory in a multi-level fashion based on the degree of semantic abstraction. Each memory vector is embedded with a positional index encoding pointing to its semantically related sub-memories in the next layer. During the reasoning phase, an index-based routing mechanism enables efficient, layer-by-layer retrieval without performing exhaustive similarity computations. We evaluate our method on five task settings from the LoCoMo dataset. 
Experimental results show that our approach consistently outperforms five baseline methods, demonstrating its effectiveness in long-term dialogue scenarios.
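
Illustrative sketch (not the authors' implementation): the layer-by-layer, index-based routing described in the H-MEM abstract can be pictured with a toy two-layer memory. The names MemoryNode, child_ids, and route are hypothetical, and a plain dot product stands in for whatever similarity the paper actually uses. The point is only that each step scores the children of the nodes kept so far instead of scanning the whole layer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryNode:
    text: str
    vector: List[float]
    # Hypothetical "positional index": positions of related sub-memories in the next layer.
    child_ids: List[int] = field(default_factory=list)


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def route(query: List[float], layers: List[List[MemoryNode]], top_k: int = 1) -> List[str]:
    """Walk the layers top-down, scoring only children of the nodes kept so far."""
    if not layers:
        return []
    candidates = list(range(len(layers[0])))
    for depth, layer in enumerate(layers):
        ranked = sorted(candidates, key=lambda i: dot(query, layer[i].vector), reverse=True)
        kept = ranked[:top_k]
        if depth == len(layers) - 1:
            return [layer[i].text for i in kept]
        # Follow the stored indices instead of searching the entire next layer.
        candidates = [c for i in kept for c in layer[i].child_ids]
        if not candidates:
            return []
    return []


# Toy memory: an abstract topic layer on top of a concrete episode layer.
layers = [
    [
        MemoryNode("travel", [1.0, 0.0], child_ids=[0, 1]),
        MemoryNode("cooking", [0.0, 1.0], child_ids=[2]),
    ],
    [
        MemoryNode("trip to Rome in May", [0.9, 0.1]),
        MemoryNode("lost luggage in Paris", [0.8, 0.3]),
        MemoryNode("pasta recipe", [0.1, 0.9]),
    ],
]

travel_hit = route([1.0, 0.2], layers)
cooking_hit = route([0.0, 1.0], layers)
```

In this toy setup a travel-like query only ever scores the two episodes under "travel", which is the saving the abstract attributes to index-based routing.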

[16] Multi-Relation Extraction in Entity Pairs using Global Context

Nilesh,Atul Gupta,Avinash C Panday

Main category: cs.CL

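TL;DR: This paper proposes a new input embedding approach that builds a global context by capturing the positions of entities throughout the document, with the aim of improving the accuracy of document-level relation extraction.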

Details

Motivation: In document-level relation extraction, entities may appear multiple times in a document, and their relationships can change with context. Previous approaches fail to capture the complete document context needed for accurate relation extraction. Method: A new input encoding approach is proposed that represents entities as standalone segments, independent of their positions within the document, and leverages global relationships and multi-sentence reasoning. Result: Experiments on three benchmark relation extraction datasets (DocRED, Re-DocRED, and REBEL) show that the proposed method accurately predicts relationships between entities in a document-level setting. Conclusion: The paper proposes a new input embedding approach for document-level relation extraction that builds a global context by capturing entity positions throughout the document, and thereby accurately predicts relationships between entities. Abstract: In document-level relation extraction, entities may appear multiple times in a document, and their relationships can shift from one context to another. Accurate prediction of the relationship between two entities across an entire document requires building a global context spanning all relevant sentences. Previous approaches have focused only on the sentences where entities are mentioned, which fails to capture the complete document context necessary for accurate relation extraction. Therefore, this paper introduces a novel input embedding approach to capture the positions of mentioned entities throughout the document rather than focusing solely on the span where they appear. The proposed input encoding approach leverages global relationships and multi-sentence reasoning by representing entities as standalone segments, independent of their positions within the document. The performance of the proposed method has been tested on three benchmark relation extraction datasets, namely DocRED, Re-DocRED, and REBEL. The experimental results demonstrated that the proposed method accurately predicts relationships between entities in a document-level setting. The proposed research also has theoretical and practical implications. Theoretically, it advances global context modeling and multi-sentence reasoning in document-level relation extraction. Practically, it enhances relationship detection, enabling improved performance in real-world NLP applications requiring comprehensive entity-level insights and interpretability.

[17] PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation

Zhehao Tan,Yihan Jiao,Dan Yang,Lei Liu,Jie Feng,Duolin Sun,Yue Shen,Jian Wang,Peng Wei,Jinjie Gu

Main category: cs.CL

TL;DR: This paper introduces a fine-grained benchmark, Placeholder-RAG-Benchmark, to evaluate the specific capabilities of large language models (LLMs) within Retrieval-Augmented Generation (RAG) systems, focusing on knowledge integration, filtering, and reasoning.

Details

Motivation: The motivation stems from the lack of systematic and granular evaluation frameworks for assessing LLM-specific capabilities in Retrieval-Augmented Generation (RAG) systems, particularly regarding document utilization and the integration of external knowledge. Method: The authors introduce a multi-level, fine-grained benchmark called Placeholder-RAG-Benchmark, which evaluates LLMs across three progressive dimensions: multi-level filtering abilities, combination abilities, and reference reasoning. They use a placeholder-based approach to decouple the contributions of parametric knowledge and external knowledge. Result: The experiments reveal limitations in representative LLMs concerning error resilience and context faithfulness within RAG systems. The benchmark provides insights into the role of LLMs and offers a framework for improving RAG system development. Conclusion: The paper concludes that the proposed Placeholder-RAG-Benchmark offers a reproducible framework for evaluating and developing more reliable and efficient RAG systems, highlighting the limitations of current LLMs in error resilience and context faithfulness. Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge, where the LLM's ability to generate responses based on the combination of a given query and retrieved documents is crucial. However, most benchmarks focus on overall RAG system performance, rarely assessing LLM-specific capabilities. Current benchmarks emphasize broad aspects such as noise robustness, but lack a systematic and granular evaluation framework on document utilization. To this end, we introduce Placeholder-RAG-Benchmark, a multi-level fine-grained benchmark, emphasizing the following progressive dimensions: (1) multi-level filtering abilities, (2) combination abilities, and (3) reference reasoning. To provide a more nuanced understanding of LLMs' roles in RAG systems, we formulate an innovative placeholder-based approach to decouple the contributions of the LLM's parametric knowledge and the external knowledge. Experiments demonstrate the limitations of representative LLMs in the RAG system's generation capabilities, particularly in error resilience and context faithfulness. Our benchmark provides a reproducible framework for developing more reliable and efficient RAG systems. Our code is available at https://github.com/Alipay-Med/PRGB.
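
Illustrative sketch (an assumption about the general idea, not the benchmark's actual construction): one way to picture a placeholder-based setup is to swap a real value in the retrieved documents for an invented token, so that a correct answer can only come from the context and not from the model's parametric memory. The helper names substitute and is_context_faithful, and the token "Zorvia", are hypothetical.

```python
from typing import Dict, List


def substitute(documents: List[str], mapping: Dict[str, str]) -> List[str]:
    """Replace each real value with its placeholder in every document."""
    rewritten = []
    for doc in documents:
        for real, placeholder in mapping.items():
            doc = doc.replace(real, placeholder)
        rewritten.append(doc)
    return rewritten


def is_context_faithful(answer: str, mapping: Dict[str, str]) -> bool:
    """True when the answer uses a placeholder and none of the original values."""
    uses_placeholder = any(p in answer for p in mapping.values())
    leaks_real_value = any(r in answer for r in mapping)
    return uses_placeholder and not leaks_real_value


# Hypothetical example: "Zorvia" is an invented token the model cannot have memorized.
mapping = {"Paris": "Zorvia"}
docs = substitute(["The Eiffel Tower is located in Paris."], mapping)

faithful = is_context_faithful("It is located in Zorvia.", mapping)
from_memory = is_context_faithful("It is located in Paris.", mapping)
```

An answer that repeats the original value signals reliance on memorized knowledge, while an answer that repeats the placeholder signals reliance on the retrieved text.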

[18] How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

Xi Chen,Aske Plaat,Niki van Stein

Main category: cs.CL

TL;DR: This study shows that Chain-of-thought prompting enhances interpretability and modularity in larger language models, indicating it effectively structures the model's internal reasoning.

Details

Motivation: The study aims to determine whether the 'thoughts' generated by Chain-of-thought prompting reflect true internal reasoning processes and how model scale affects this phenomenon. Method: Combining sparse autoencoders with activation patching, the study examines the causal effects of CoT-reasoning features in Pythia-70M and Pythia-2.8B models while solving GSM8K math problems under CoT and plain prompting. Result: Swapping CoT-reasoning features into a noCoT run significantly raises answer log-probabilities in the 2.8B model but not in the 70M model, revealing a scale threshold. CoT also increases activation sparsity and feature interpretability in larger models, with confidence in correct answers improving from 1.2 to 4.3. Conclusion: Chain-of-thought prompting can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method. Abstract: Chain-of-thought (CoT) prompting boosts Large Language Models' accuracy on multi-step tasks, yet whether the generated "thoughts" reflect the true internal reasoning process is unresolved. We present the first feature-level causal study of CoT faithfulness. Combining sparse autoencoders with activation patching, we extract monosemantic features from Pythia-70M and Pythia-2.8B while they tackle GSM8K math problems under CoT and plain (noCoT) prompting. Swapping a small set of CoT-reasoning features into a noCoT run raises answer log-probabilities significantly in the 2.8B model, but has no reliable effect in 70M, revealing a clear scale threshold. CoT also leads to significantly higher activation sparsity and feature interpretability scores in the larger model, signalling more modular internal computation. For example, the model's confidence in generating correct answers improves from 1.2 to 4.3. We introduce patch-curves and random-feature patching baselines, showing that useful CoT information is not only present in the top-K patches but widely distributed. Overall, our results indicate that CoT can induce more interpretable internal structures in high-capacity LLMs, validating its role as a structured prompting method.

[19] EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow

Xiaoyu Pan,Yang Bai,Ke Zou,Yang Zhou,Jun Zhou,Huazhu Fu,Yih-Chung Tham,Yong Liu

Main category: cs.CL

TL;DR: EH-Benchmark uses an agent-based three-phase framework to evaluate and mitigate hallucinations of MLLMs in ophthalmic diagnosis.

Details

Motivation: Existing medical benchmarks cannot effectively evaluate or address hallucinations in MLLMs, which limits their accuracy in ophthalmic diagnosis. Method: An agent-based three-phase framework is proposed, comprising a Knowledge-Level Retrieval stage, a Task-Level Case Studies stage, and a Result-Level Validation stage. Result: Experimental results show that the multi-agent framework significantly mitigates both types of hallucinations. Conclusion: EH-Benchmark effectively evaluates and mitigates hallucinations in MLLMs, improving diagnostic accuracy, interpretability, and reliability. Abstract: Medical Large Language Models (MLLMs) play a crucial role in ophthalmic diagnosis, holding significant potential to address vision-threatening diseases. However, their accuracy is constrained by hallucinations stemming from limited ophthalmic knowledge, insufficient visual localization and reasoning capabilities, and a scarcity of multimodal ophthalmic data, which collectively impede precise lesion detection and disease diagnosis. Furthermore, existing medical benchmarks fail to effectively evaluate various types of hallucinations or provide actionable solutions to mitigate them. To address the above challenges, we introduce EH-Benchmark, a novel ophthalmology benchmark designed to evaluate hallucinations in MLLMs. We categorize MLLMs' hallucinations based on specific tasks and error types into two primary classes: Visual Understanding and Logical Composition, each comprising multiple subclasses. Given that MLLMs predominantly rely on language-based reasoning rather than visual processing, we propose an agent-centric, three-phase framework, including the Knowledge-Level Retrieval stage, the Task-Level Case Studies stage, and the Result-Level Validation stage. Experimental results show that our multi-agent framework significantly mitigates both types of hallucinations, enhancing accuracy, interpretability, and reliability. Our project is available at https://github.com/ppxy1/EH-Benchmark.

[20] Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection

Shalini Jangra,Suparna De,Nishanth Sastry,Saeed Fadaei

Main category: cs.CL

TL;DR: This paper introduces a synthetic dataset for studying privacy risks related to PII disclosures on social media, generated using large language models and evaluated for quality and privacy preservation.

Details

Motivation: The lack of open-source labeled datasets for PII-revealing text hampers research on privacy risks in social media. The authors aim to address this gap by creating a shareable synthetic dataset. Method: The authors developed a methodology to generate synthetic PII-revealing text data using three large language models (Llama2-7B, Llama3-8B, and zephyr-7b-beta) and created a taxonomy of 19 PII categories. They evaluated the synthetic data's quality using reproducibility equivalence, unlinkability, and indistinguishability metrics. Result: A synthetic PII-labeled dataset was successfully generated, meeting the criteria of reproducibility equivalence, unlinkability, and indistinguishability. The dataset and code were made publicly available to support further research. Conclusion: The paper concludes that their synthetic dataset effectively mirrors real PII-revealing text data while ensuring privacy, thereby promoting reproducible research on privacy risks in social media. Abstract: Social platforms such as Reddit have a network of communities of shared interests, with a prevalence of posts and comments from which one can infer users' Personal Information Identifiers (PIIs). While such self-disclosures can lead to rewarding social interactions, they pose privacy risks and the threat of online harms. Research into the identification and retrieval of such risky self-disclosures of PIIs is hampered by the lack of open-source labeled datasets. To foster reproducible research into PII-revealing text detection, we develop a novel methodology to create synthetic equivalents of PII-revealing data that can be safely shared. 
Our contributions include creating a taxonomy of 19 PII-revealing categories for vulnerable populations and the creation and release of a synthetic PII-labeled multi-text span dataset generated from 3 text generation Large Language Models (LLMs), Llama2-7B, Llama3-8B, and zephyr-7b-beta, with sequential instruction prompting to resemble the original Reddit posts. The utility of our methodology to generate this synthetic dataset is evaluated with three metrics: First, we require reproducibility equivalence, i.e., results from training a model on the synthetic data should be comparable to those obtained by training the same models on the original posts. Second, we require that the synthetic data be unlinkable to the original users, through common mechanisms such as Google Search. Third, we wish to ensure that the synthetic data be indistinguishable from the original, i.e., trained humans should not be able to tell them apart. We release our dataset and code at https://netsys.surrey.ac.uk/datasets/synthetic-self-disclosure/ to foster reproducible research into PII privacy risks in online social media.

[21] Enhancing RAG Efficiency with Adaptive Context Compression

Shuyu Guo,Zhaochun Ren

Main category: cs.CL

TL;DR: ACC-RAG dynamically compresses retrieved contexts for RAG, optimizing efficiency and accuracy by adapting to input complexity.

Details

Motivation: Existing context compression methods apply fixed rates, leading to over- or under-compression. ACC-RAG addresses this by adapting to input complexity. Method: ACC-RAG uses a hierarchical compressor and context selector to dynamically adjust compression rates based on input complexity. Result: ACC-RAG achieves up to 4 times faster inference speed while maintaining or improving accuracy across Wikipedia and five QA datasets. Conclusion: ACC-RAG maintains or improves accuracy while significantly enhancing inference efficiency, showing superiority over fixed-rate compression methods. Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but incurs significant inference costs due to lengthy retrieved contexts. While context compression mitigates this issue, existing methods apply fixed compression rates, over-compressing simple queries or under-compressing complex ones. We propose Adaptive Context Compression for RAG (ACC-RAG), a framework that dynamically adjusts compression rates based on input complexity, optimizing inference efficiency without sacrificing accuracy. ACC-RAG combines a hierarchical compressor (for multi-granular embeddings) with a context selector to retain minimal sufficient information, akin to human skimming. Evaluated on Wikipedia and five QA datasets, ACC-RAG outperforms fixed-rate methods and matches/unlocks over 4 times faster inference versus standard RAG while maintaining or improving accuracy.
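
Illustrative sketch (a toy under stated assumptions, not ACC-RAG itself): the adaptive idea in the abstract, keeping only as much context as a query needs, can be pictured as walking from the most compressed view of the retrieved context to the least compressed one and stopping at the first view judged sufficient. The function select_context, the keyword-based sufficiency check, and the sample hotel texts are all hypothetical. The paper's selector operates on multi-granular embeddings, which this sketch does not model.

```python
from typing import Callable, List, Sequence


def select_context(
    levels: Sequence[List[str]],
    is_sufficient: Callable[[List[str]], bool],
) -> List[str]:
    """Return the most compressed level judged sufficient, else the least compressed one."""
    if not levels:
        raise ValueError("at least one compression level is required")
    for level in levels:  # ordered from most to least compressed
        if is_sufficient(level):
            return level
    return levels[-1]


def keyword_check(keywords: List[str]) -> Callable[[List[str]], bool]:
    """Hypothetical stand-in for a learned selector: all keywords must appear in the level."""
    def check(level: List[str]) -> bool:
        text = " ".join(level).lower()
        return all(k.lower() in text for k in keywords)
    return check


# Three views of the same retrieved context, from most to least compressed.
levels = [
    ["Hotel Aurora: seaside, 40 rooms."],
    ["Hotel Aurora: seaside, 40 rooms.", "Breakfast is served from 7 to 10."],
    [
        "Hotel Aurora: seaside, 40 rooms.",
        "Breakfast is served from 7 to 10.",
        "Pets are allowed on request.",
    ],
]

easy = select_context(levels, keyword_check(["rooms"]))
hard = select_context(levels, keyword_check(["pets"]))
```

A simple query stops at the shortest view while a harder one falls through to a longer view, which is the contrast with fixed-rate compression that the abstract draws.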

[22] FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification

Baptiste Lefort,Eric Benhamou,Beatrice Guez,Jean-Jacques Ohana,Ethan Setrouk,Alban Etienne

Main category: cs.CL

Details

Abstract: This paper presents a novel hierarchical framework for portfolio optimization, integrating lightweight Large Language Models (LLMs) with Deep Reinforcement Learning (DRL) to combine sentiment signals from financial news with traditional market indicators. Our three-tier architecture employs base RL agents to process hybrid data, meta-agents to aggregate their decisions, and a super-agent to merge decisions based on market data and sentiment analysis. Evaluated on data from 2018 to 2024, after training on 2000-2017, the framework achieves a 26% annualized return and a Sharpe ratio of 1.2, outperforming equal-weighted and S&P 500 benchmarks. Key contributions include scalable cross-modal integration, a hierarchical RL structure for enhanced stability, and open-source reproducibility.

[23] Augmented Vision-Language Models: A Systematic Review

Anthony C Davis,Burhan Sadiq,Tianmin Shu,Chien-Ming Huang

Main category: cs.CL

TL;DR: This paper reviews how combining visual-language machine learning models with external symbolic systems can enhance interpretability, reasoning, and adaptability, offering a promising solution to the limitations of current models.

Details

Motivation: Recent advances in visual-language machine learning models have shown great ability in understanding natural language and visual scenes, but they struggle with interpretability, logical reasoning, and require retraining for new information. This paper explores neural-symbolic systems as a promising solution. Method: The paper employs a systematic literature review to categorize techniques that improve visual-language understanding through interaction with external symbolic information systems. Result: The paper identifies and categorizes techniques that effectively integrate Vision-Language Models with external symbolic systems to improve visual-language understanding. Conclusion: The paper concludes that integrating neural networks with external symbolic information systems can enhance reasoning and memory abilities, making outputs more interpretable and allowing the assimilation of new information without extensive retraining. Abstract: Recent advances in visual-language machine learning models have demonstrated exceptional ability to use natural language and understand visual scenes by training on large, unstructured datasets. However, this training paradigm cannot produce interpretable explanations for its outputs, requires retraining to integrate new information, is highly resource-intensive, and struggles with certain forms of logical reasoning. One promising solution involves integrating neural networks with external symbolic information systems, forming neural symbolic systems that can enhance reasoning and memory abilities. These neural symbolic systems provide more interpretable explanations to their outputs and the capacity to assimilate new information without extensive retraining. Utilizing powerful pre-trained Vision-Language Models (VLMs) as the core neural component, augmented by external systems, offers a pragmatic approach to realizing the benefits of neural-symbolic integration. 
This systematic literature review aims to categorize techniques through which visual-language understanding can be improved by interacting with external symbolic information systems.

[24] Deep Learning Approaches for Multimodal Intent Recognition: A Survey

Jingwei Zhao,Yuhua Wen,Qifei Li,Minchi Hu,Yingying Zhou,Jingyao Xue,Junyang Wu,Yingming Gao,Zhengqi Wen,Jianhua Tao,Ya Li

Main category: cs.CL

TL;DR: This article reviews deep learning methods for intent recognition, highlighting the shift from unimodal to multimodal approaches and the impact of Transformer-based models. It provides insights into datasets, methodologies, applications, and future research directions in multimodal intent recognition.

Details

Motivation: The motivation for the study is the growing demand for natural human-computer interaction, which has driven the evolution of intent recognition methods through deep learning and multimodal approaches incorporating data from multiple sources like audio, vision, and physiological signals. Method: The article surveys deep learning methods for intent recognition, covering the shift from unimodal to multimodal techniques, datasets, methodologies, and applications. Result: The result is a comprehensive overview of the latest developments in multimodal intent recognition (MIR), including methodologies, datasets, and current challenges. The article provides insights into future research directions in the field. Conclusion: The article concludes that the evolution of intent recognition has transitioned from unimodal to multimodal deep learning approaches, with Transformer-based models achieving significant breakthroughs. It highlights the importance of MIR for natural human-computer interaction and outlines future research directions. Abstract: Intent recognition aims to identify users' underlying intentions, traditionally focusing on text in natural language processing. With growing demands for natural human-computer interaction, the field has evolved through deep learning and multimodal approaches, incorporating data from audio, vision, and physiological signals. Recently, the introduction of Transformer-based models has led to notable breakthroughs in this domain. This article surveys deep learning methods for intent recognition, covering the shift from unimodal to multimodal techniques, relevant datasets, methodologies, applications, and current challenges. It provides researchers with insights into the latest developments in multimodal intent recognition (MIR) and directions for future research.

[25] Trusted Knowledge Extraction for Operations and Maintenance Intelligence

Kathleen Mealey,Jonathan A. Karr Jr.,Priscila Saboia Moreira,Paul R. Brenner,Charles F. Vardeman II

Main category: cs.CL

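TL;DR: This paper examines the trust and technical challenges of using NLP and LLM tools for operations and maintenance intelligence in aviation, evaluates the zero-shot performance of multiple tools, and offers trust-enhancing recommendations along with an open-source dataset.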

Details

Motivation: Deriving operational intelligence from organizational data repositories is a key challenge, both because of the dichotomy between data confidentiality and data integration objectives and because of the limitations of Natural Language Processing (NLP) tools relative to the knowledge structure of specific domains such as operations and maintenance. Method: The paper breaks the knowledge extraction process down into Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction functional components, and evaluates sixteen NLP tools alongside Large Language Models (LLMs). Result: The authors find significant limitations in the zero-shot performance of NLP and LLM tools and discuss the challenges of running these tools in a controlled, confidential environment. Conclusion: The paper discusses the challenges that NLP and LLM tools face regarding trust and technical readiness in mission-critical industries such as aviation, offers recommendations to enhance trust, and provides an open-source curated dataset to support further baseline testing and evaluation. Abstract: Deriving operational intelligence from organizational data repositories is a key challenge due to the dichotomy of data confidentiality vs data integration objectives, as well as the limitations of Natural Language Processing (NLP) tools relative to the specific knowledge structure of domains such as operations and maintenance. In this work, we discuss Knowledge Graph construction and break down the Knowledge Extraction process into its Named Entity Recognition, Coreference Resolution, Named Entity Linking, and Relation Extraction functional components. We then evaluate sixteen NLP tools in concert with or in comparison to the rapidly advancing capabilities of Large Language Models (LLMs). We focus on the operational and maintenance intelligence use case for trusted applications in the aircraft industry. A baseline dataset is derived from a rich public domain US Federal Aviation Administration dataset focused on equipment failures or maintenance requirements. We assess the zero-shot performance of NLP and LLM tools that can be operated within a controlled, confidential environment (no data is sent to third parties). Based on our observation of significant performance limitations, we discuss the challenges related to trusted NLP and LLM tools as well as their Technical Readiness Level for wider use in mission-critical industries such as aviation. We conclude with recommendations to enhance trust and provide our open-source curated dataset to support further baseline testing and evaluation.

[26] Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis

Md Talha Mohsin

Main category: cs.CL

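TL;DR: This study systematically evaluates five leading LLMs on financial analysis and finds that GPT performs best, followed by Claude and Perplexity, while Gemini and DeepSeek are less stable; output quality is strongly affected by prompts and source material.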

Details

Motivation: Although Large Language Models (LLMs) perform well on financial natural language processing tasks, existing research lacks systematic comparisons among different LLMs. This paper therefore conducts an in-depth comparative evaluation of leading LLMs on financial analysis. Method: Five leading LLMs (GPT, Claude, Perplexity, Gemini, and DeepSeek) are compared using domain-specific prompts and three evaluation methods: human annotation, automated lexical-semantic metrics (ROUGE, cosine similarity, Jaccard), and model behavior diagnostics (prompt-level variance and across-model similarity). Result: GPT performs best in coherence, semantic alignment, and contextual relevance, with Claude and Perplexity close behind. Gemini and DeepSeek show more variability and less agreement. In addition, the similarity and stability of outputs vary by company and over time, indicating that the models are sensitive to prompt wording and source material. Conclusion: GPT performs best on financial analysis, followed by Claude and Perplexity, while Gemini and DeepSeek are more variable and less consistent. Output similarity and stability change from company to company and over time, which indicates sensitivity to prompt wording and source material. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide variety of Financial Natural Language Processing (FinNLP) tasks. However, systematic comparisons among widely used LLMs remain underexplored. Given the rapid advancement and growing influence of LLMs in financial analysis, this study conducts a thorough comparative evaluation of five leading LLMs, GPT, Claude, Perplexity, Gemini and DeepSeek, using 10-K filings from the 'Magnificent Seven' technology companies. We create a set of domain-specific prompts and then use three methodologies to evaluate model performance: human annotation, automated lexical-semantic metrics (ROUGE, Cosine Similarity, Jaccard), and model behavior diagnostics (prompt-level variance and across-model similarity). The results show that GPT gives the most coherent, semantically aligned, and contextually relevant answers; followed by Claude and Perplexity. Gemini and DeepSeek, on the other hand, have more variability and less agreement. Also, the similarity and stability of outputs change from company to company and over time, showing that they are sensitive to how prompts are written and what source material is used.
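
Illustrative sketch (assumptions flagged): two of the automated metrics named above, Jaccard and cosine similarity, can be computed on whitespace tokens as follows. The study's exact tokenization and vectorization are not stated in the abstract, so the bag-of-words counts used here are an assumption, and the helper names and sample sentences are hypothetical.

```python
import math
from collections import Counter
from typing import List


def tokens(text: str) -> List[str]:
    # Assumption: lowercase whitespace tokenization; the study may tokenize differently.
    return text.lower().split()


def jaccard(a: str, b: str) -> float:
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(a: str, b: str) -> float:
    """Cosine of the angle between the two term-count vectors."""
    count_a, count_b = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(count_a[t] * count_b[t] for t in count_a)
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical model outputs for the same prompt.
answer_1 = "revenue grew strongly"
answer_2 = "revenue grew slowly"

jaccard_score = jaccard(answer_1, answer_2)  # 2 shared words out of 4 distinct words
cosine_score = cosine(answer_1, answer_2)    # 2 / (sqrt(3) * sqrt(3))
```

Jaccard ignores word frequency while cosine on counts does not, so the two scores can diverge on longer, more repetitive outputs.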

[27] CoE-Ops: Collaboration of LLM-based Experts for AIOps Question-Answering

Jinkun Zhao,Yuanshuai Wang,Xingjian Zhang,Ruibo Chen,Xingchuang Liao,Junle Wang,Lei Huang,Kui Zhang,Wenjun Wu

Main category: cs.CL

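TL;DR: This paper proposes CoE-Ops, a new AIOps framework that combines multiple expert models with a retrieval-augmented generation mechanism and achieves significant gains in task routing and problem resolution.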

Details

Motivation: Because a single model is constrained by domain-specific knowledge, it struggles to meet the needs of diverse AIOps tasks, so a more efficient and general solution is required. Method: The paper proposes a collaboration-of-expert framework, CoE-Ops, and introduces a retrieval-augmented generation mechanism to improve its ability to handle both high-level and low-level AIOps tasks. Result: CoE-Ops improves routing accuracy for high-level AIOps tasks by 72% over existing CoE methods, improves accuracy by up to 8% over single AIOps models in DevOps problem resolution, and outperforms larger-scale MoE models by up to 14%. Conclusion: By combining multiple expert models with a retrieval-augmented generation mechanism, CoE-Ops significantly improves task routing accuracy and problem resolution in the AIOps domain. Abstract: With the rapid evolution of artificial intelligence, AIOps has emerged as a prominent paradigm in DevOps. Much work has been proposed to improve the performance of different AIOps phases. However, constrained by domain-specific knowledge, a single model can only handle the operation requirement of a specific task, such as log parsing or root cause analysis. Meanwhile, combining multiple models can achieve more efficient results, as has been shown in both previous ensemble learning and the recent LLM training domain. Inspired by these works, to address the similar challenges in AIOps, this paper first proposes a collaboration-of-expert framework (CoE-Ops) incorporating a general-purpose large language model task classifier. A retrieval-augmented generation mechanism is introduced to improve the framework's capability in handling Question-Answering tasks at both the high level (code, build, test, etc.) and the low level (fault analysis, anomaly detection, etc.). Finally, the proposed method is implemented in the AIOps domain, and extensive experiments are conducted on the DevOps-EVAL dataset. Experimental results demonstrate that CoE-Ops achieves a 72% improvement in routing accuracy for high-level AIOps tasks compared to existing CoE methods, delivers up to 8% accuracy enhancement over single AIOps models in DevOps problem resolution, and outperforms larger-scale Mixture-of-Experts (MoE) models by up to 14% in accuracy.

Sumit Soman,H. G. Ranjani,Sujoy Roychowdhury,Venkata Dharma Surya Narayana Sastry,Akshat Jain,Pranav Gangrade,Ayaaz Khan

Main category: cs.CL

TL;DR: This paper proposes an end-to-end method to improve Question-Answering in the telecom domain by incorporating graph representations of flowcharts generated by Visual Large Language Models into text-based RAG systems, resulting in better retrieval performance and reduced inference costs.

Details

Motivation: Traditional text-based Retrieval Augmented Generation systems struggle to answer questions whose answers are contained in figures like flowcharts. This study aims to enhance these systems by incorporating image-derived graph representations to improve QA accuracy in technical domains like telecommunications. Method: The study uses a fine-tuned Visual Large Language Model to generate graph representations of flowchart images from technical telecom documents. These graph representations are then integrated with a text embedding pipeline to enhance retrieval performance in a QA system. Result: Graph representations derived from a fine-tuned Visual Large Language Model showed lower edit distance from ground truth, indicating robustness. The enhanced QA system demonstrated strong retrieval performance using text-based embedding models, including domain-specific ones, and eliminated the need for Visual Large Language Models during inference. Conclusion: Incorporating graph representations of flowcharts from Visual Large Language Models with text-based RAG systems improves QA performance in the telecom domain, while also reducing the need for VLMs during inference, offering cost benefits. Abstract: Question-Answering (QA) from technical documents often involves questions whose answers are present in figures, such as flowcharts or flow diagrams. Text-based Retrieval Augmented Generation (RAG) systems may fail to answer such questions. We leverage graph representations of flowcharts obtained from Visual large Language Models (VLMs) and incorporate them in a text-based RAG system to show that this approach can enable image retrieval for QA in the telecom domain. We present the end-to-end approach from processing technical documents, classifying image types, building graph representations, and incorporating them with the text embedding pipeline for efficient retrieval. 
We benchmark the approach on a QA dataset created from proprietary telecom product information documents. Results show that the graph representations obtained using a fine-tuned VLM have lower edit distance with respect to the ground truth, which illustrates the robustness of these representations for flowchart images. Further, the approach for QA using these representations gives good retrieval performance using text-based embedding models, including a telecom-domain adapted one. Our approach also alleviates the need for a VLM in inference, which is an important cost benefit for deployed QA systems.

[29] PARROT: An Open Multilingual Radiology Reports Dataset

Bastien Le Guellec,Kokou Adambounou,Lisa C Adams,Thibault Agripnidis,Sung Soo Ahn,Radhia Ait Chalal,Tugba Akinci D Antonoli,Philippe Amouyel,Henrik Andersson,Raphael Bentegeac,Claudio Benzoni,Antonino Andrea Blandino,Felix Busch,Elif Can,Riccardo Cau,Armando Ugo Cavallo,Christelle Chavihot,Erwin Chiquete,Renato Cuocolo,Eugen Divjak,Gordana Ivanac,Barbara Dziadkowiec Macek,Armel Elogne,Salvatore Claudio Fanni,Carlos Ferrarotti,Claudia Fossataro,Federica Fossataro,Katarzyna Fulek,Michal Fulek,Pawel Gac,Martyna Gachowska,Ignacio Garcia Juarez,Marco Gatti,Natalia Gorelik,Alexia Maria Goulianou,Aghiles Hamroun,Nicolas Herinirina,Krzysztof Kraik,Dominik Krupka,Quentin Holay,Felipe Kitamura,Michail E Klontzas,Anna Kompanowska,Rafal Kompanowski,Alexandre Lefevre,Tristan Lemke,Maximilian Lindholz,Lukas Muller,Piotr Macek,Marcus Makowski,Luigi Mannacio,Aymen Meddeb,Antonio Natale,Beatrice Nguema Edzang,Adriana Ojeda,Yae Won Park,Federica Piccione,Andrea Ponsiglione,Malgorzata Poreba,Rafal Poreba,Philipp Prucker,Jean Pierre Pruvo,Rosa Alba Pugliesi,Feno Hasina Rabemanorintsoa,Vasileios Rafailidis,Katarzyna Resler,Jan Rotkegel,Luca Saba,Ezann Siebert,Arnaldo Stanzione,Ali Fuat Tekin,Liz Toapanta Yanchapaxi,Matthaios Triantafyllou,Ekaterini Tsaoulia,Evangelia Vassalou,Federica Vernuccio,Johan Wasselius,Weilang Wang,Szymon Urban,Adrian Wlodarczak,Szymon Wlodarczak,Andrzej Wysocki,Lina Xu,Tomasz Zatonski,Shuhang Zhang,Sebastian Ziegelmayer,Gregory Kuchcinski,Keno K Bressem

Main category: cs.CL

TL;DR: PARROT is the largest open multilingual radiology report dataset designed to facilitate the development and validation of natural language processing applications in radiology across linguistic, geographic, and clinical boundaries.

Details

Motivation: To develop and validate PARROT, a large, multicenter, open-access dataset of fictional radiology reports spanning multiple languages for testing natural language processing applications in radiology. Method: From May to September 2024, radiologists were invited to contribute fictional radiology reports following their standard reporting practices. Contributors provided at least 20 reports with associated metadata including anatomical region, imaging modality, clinical context, and for non-English reports, English translations. All reports were assigned ICD-10 codes. A human vs. AI report differentiation study was conducted with 154 participants assessing whether reports were human-authored or AI-generated. Result: The dataset comprises 2,658 radiology reports from 76 authors across 21 countries and 13 languages. Reports cover multiple imaging modalities and anatomical regions, with chest, abdomen, head, and pelvis being most prevalent. In the differentiation study, participants achieved 53.9% accuracy in distinguishing between human and AI-generated reports, with radiologists performing significantly better than other groups. Conclusion: PARROT represents the largest open multilingual radiology report dataset, enabling development and validation of natural language processing applications across linguistic, geographic, and clinical boundaries without privacy constraints. Abstract: Rationale and Objectives: To develop and validate PARROT (Polyglottal Annotated Radiology Reports for Open Testing), a large, multicentric, open-access dataset of fictional radiology reports spanning multiple languages for testing natural language processing applications in radiology. Materials and Methods: From May to September 2024, radiologists were invited to contribute fictional radiology reports following their standard reporting practices. 
Contributors provided at least 20 reports with associated metadata including anatomical region, imaging modality, clinical context, and for non-English reports, English translations. All reports were assigned ICD-10 codes. A human vs. AI report differentiation study was conducted with 154 participants (radiologists, healthcare professionals, and non-healthcare professionals) assessing whether reports were human-authored or AI-generated. Results: The dataset comprises 2,658 radiology reports from 76 authors across 21 countries and 13 languages. Reports cover multiple imaging modalities (CT: 36.1%, MRI: 22.8%, radiography: 19.0%, ultrasound: 16.8%) and anatomical regions, with chest (19.9%), abdomen (18.6%), head (17.3%), and pelvis (14.1%) being most prevalent. In the differentiation study, participants achieved 53.9% accuracy (95% CI: 50.7%-57.1%) in distinguishing between human and AI-generated reports, with radiologists performing significantly better (56.9%, 95% CI: 53.3%-60.6%, p<0.05) than other groups. Conclusion: PARROT represents the largest open multilingual radiology report dataset, enabling development and validation of natural language processing applications across linguistic, geographic, and clinical boundaries without privacy constraints.

[30] Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes

Rui Jiao,Yue Zhang,Jinku Li

Main category: cs.CL

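TL;DR: RELIANCE combines fact verification, reinforcement learning, and interpretability analysis to substantially improve the factual accuracy of LLM reasoning, addressing the risks that reasoning errors pose in high-stakes domains.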

Details

Motivation: Large Language Models (LLMs) produce factual inaccuracies in intermediate reasoning steps even when the final answer is correct, and such errors can have serious consequences in high-stakes domains such as healthcare, law, and scientific research. Method: The RELIANCE framework has three core components: a specialized fact-checking classifier trained on counterfactually augmented data; a reinforcement learning approach (GRPO) with multi-dimensional rewards that balance factuality, coherence, and structural correctness; and an interpretability module that analyzes how factuality improvements manifest in model activations. Result: Extensive evaluation across ten state-of-the-art models shows that even leading models such as Claude-3.7 and GPT-o1 reach reasoning factual accuracy of only 81.93% and 82.57%, respectively. RELIANCE significantly improves factual robustness (up to 49.90% improvement) while maintaining or improving performance on challenging benchmarks including Math-500, AIME-2024, and GPQA. Conclusion: RELIANCE offers a comprehensive framework that combines a fact-checking classifier, a reinforcement learning strategy, and an interpretability module to improve the factual robustness of LLM reasoning, laying a foundation for future targeted training methods. Abstract: We present RELIANCE (Reasoning Evaluation with Logical Integrity and Accuracy for Confidence Enhancement), a novel framework addressing a critical vulnerability in Large Language Models (LLMs): the prevalence of factual inaccuracies within intermediate reasoning steps despite correct final answers. This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research, where erroneous yet confidently presented reasoning can mislead users into dangerous decisions. Our framework integrates three core components: (1) a specialized fact-checking classifier trained on counterfactually augmented data to detect subtle factual inconsistencies within reasoning chains; (2) a Group Relative Policy Optimization (GRPO) reinforcement learning approach that balances factuality, coherence, and structural correctness through multi-dimensional rewards; and (3) a mechanistic interpretability module examining how factuality improvements manifest in model activations during reasoning processes. Extensive evaluation across ten state-of-the-art models reveals concerning patterns: even leading models like Claude-3.7 and GPT-o1 demonstrate reasoning factual accuracy of only 81.93% and 82.57% respectively. RELIANCE significantly enhances factual robustness (up to 49.90% improvement) while maintaining or improving performance on challenging benchmarks including Math-500, AIME-2024, and GPQA. 
Furthermore, our activation-level analysis provides actionable insights into how factual enhancements reshape reasoning trajectories within model architectures, establishing foundations for future training methodologies that explicitly target factual robustness through activation-guided optimization.
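
As a rough illustration of the GRPO step described in the abstract, the sketch below combines several reward dimensions into one scalar per sampled response and normalizes it against the other responses in the same group. The function name, the weights, and the reward values are all hypothetical; the abstract does not state how RELIANCE weights factuality, coherence, and structural correctness.

```python
import numpy as np

def group_relative_advantages(rewards, weights, eps=1e-8):
    """Combine per-dimension rewards and normalize them within one group.

    rewards: (group_size, n_dims) array, one row per sampled response.
    weights: (n_dims,) array of hypothetical mixing weights.
    """
    rewards = np.asarray(rewards, dtype=float)
    weights = np.asarray(weights, dtype=float)
    scalar = rewards @ weights  # one combined reward per response
    # Group-relative advantage: centre and scale by the group's statistics.
    return (scalar - scalar.mean()) / (scalar.std() + eps)

# Hypothetical rewards: columns are factuality, coherence, structure.
example_rewards = [
    [0.9, 0.8, 1.0],
    [0.4, 0.7, 1.0],
    [0.6, 0.9, 0.0],
]
example_weights = [0.5, 0.3, 0.2]  # assumed, not taken from the paper
advantages = group_relative_advantages(example_rewards, example_weights)
```

Responses that score above the group mean receive a positive advantage and are reinforced; those below it are penalized, without requiring a separate value model.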

[31] SigBERT: Combining Narrative Medical Reports and Rough Path Signature Theory for Survival Risk Estimation in Oncology

Paul Minchella,Loïc Verlingue,Stéphane Chrétien,Rémi Vaucher,Guillaume Metzler

Main category: cs.CL

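TL;DR: SigBERT is an innovative temporal survival analysis framework that efficiently processes large numbers of clinical reports and significantly improves survival model performance by extracting geometric features from time series of sentence embeddings.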

Details

Motivation: Existing survival analysis methods struggle to handle the complexity of textual data, particularly in its sequential form. Method: SigBERT builds sentence embeddings by extracting and averaging word embeddings from timestamped medical reports, applies signature extraction from rough path theory to capture the temporal dynamics of the resulting time series, and integrates these features into a LASSO-penalized Cox model to estimate patient-specific risk scores. Result: The model was trained and evaluated on a real-world oncology dataset from the Léon Bérard Center and achieved a C-index of 0.75 (sd 0.014) on the independent test cohort. Conclusion: SigBERT effectively integrates sequential medical data, improves the accuracy of risk estimation, and advances narrative-based survival analysis. Abstract: Electronic medical reports (EHR) contain a vast amount of information that can be leveraged for machine learning applications in healthcare. However, existing survival analysis methods often struggle to effectively handle the complexity of textual data, particularly in its sequential form. Here, we propose SigBERT, an innovative temporal survival analysis framework designed to efficiently process a large number of clinical reports per patient. SigBERT processes timestamped medical reports by extracting and averaging word embeddings into sentence embeddings. To capture temporal dynamics from the time series of sentence embedding coordinates, we apply signature extraction from rough path theory to derive geometric features for each patient, which significantly enhance survival model performance by capturing complex temporal dynamics. These features are then integrated into a LASSO-penalized Cox model to estimate patient-specific risk scores. The model was trained and evaluated on a real-world oncology dataset from the Léon Bérard Center corpus, with a C-index score of 0.75 (sd 0.014) on the independent test cohort. SigBERT integrates sequential medical data to enhance risk estimation, advancing narrative-based survival analysis.
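
To make the signature step more concrete, here is a minimal sketch that computes the first two signature levels of a piecewise-linear path with NumPy. It assumes each row of the input is one report's sentence embedding in time order; the truncation depth, the random example data, and the function name are illustrative assumptions, and the abstract does not say which depth or library SigBERT uses.

```python
import numpy as np

def signature_depth2(path):
    """Levels 1 and 2 of the signature of a piecewise-linear path.

    path: (T, d) array, one row per time step (e.g. one embedding per report).
    Returns (level1, level2) with shapes (d,) and (d, d).
    """
    path = np.asarray(path, dtype=float)
    increments = np.diff(path, axis=0)    # segment increments, shape (T-1, d)
    start_offsets = path[:-1] - path[0]   # displacement at the start of each segment
    level1 = increments.sum(axis=0)       # total displacement of the path
    # Iterated integrals: cross terms between segments plus the within-segment term.
    level2 = start_offsets.T @ increments + 0.5 * (increments.T @ increments)
    return level1, level2

# Hypothetical patient: 5 timestamped reports with 3-dimensional embeddings.
embeddings = np.random.default_rng(0).normal(size=(5, 3))
level1, level2 = signature_depth2(embeddings)
features = np.concatenate([level1, level2.ravel()])  # candidate inputs to a Cox model
```

The antisymmetric part of `level2` encodes the order in which coordinates move (signed areas), which is the kind of temporal information a plain average of embeddings would lose.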

[32] A chart review process aided by natural language processing and multi-wave adaptive sampling to expedite validation of code-based algorithms for large database studies

Shirley V Wang,Georg Hahn,Sushama Kattinakere Sreedhara,Mufaddal Mahesri,Haritha S. Pillai,Rajendra Aldis,Joyce Lii,Sarah K. Dutcher,Rhoda Eniafe,Jamal T. Jones,Keewan Kim,Jiwei He,Hana Lee,Sengwee Toh,Rishi J Desai,Jie Yang

Main category: cs.CL

TL;DR: The paper proposes an efficient validation method using NLP and adaptive sampling to assess code-based algorithms in health studies, significantly reducing time and effort while maintaining accuracy.

Details

Motivation: Validating the measurement characteristics of code-based algorithms is essential for assessing the robustness of study results against potential biases, but traditional methods are time-consuming and resource-intensive. Method: The study introduced an expedited validation process using natural language processing (NLP) to reduce human review time and a multi-wave adaptive sampling approach with a pre-defined stopping rule to achieve sufficient precision efficiently. Result: The NLP-assisted annotation reduced chart review time by 40%, while the multi-wave sampling approach with a stopping rule could have avoided reviewing 77% of the charts with minimal impact on precision. Conclusion: The approach described can enable more routine validation of code-based algorithms used to define key study parameters, which enhances the understanding of the reliability of findings derived from database studies. Abstract: Background: One of the ways to enhance analyses conducted with large claims databases is by validating the measurement characteristics of code-based algorithms used to identify health outcomes or other key study parameters of interest. These metrics can be used in quantitative bias analyses to assess the robustness of results for an inferential study given potential bias from outcome misclassification. However, extensive time and resource allocation are typically required to create reference-standard labels through manual chart review of free-text notes from linked electronic health records. Methods: We describe an expedited process that introduces efficiency in a validation study using two distinct mechanisms: 1) use of natural language processing (NLP) to reduce time spent by human reviewers to review each chart, and 2) a multi-wave adaptive sampling approach with pre-defined criteria to stop the validation study once performance characteristics are identified with sufficient precision. 
We illustrate this process in a case study that validates the performance of a claims-based outcome algorithm for intentional self-harm in patients with obesity. Results: We empirically demonstrate that the NLP-assisted annotation process reduced the time spent on review per chart by 40% and use of the pre-defined stopping rule with multi-wave samples would have prevented review of 77% of patient charts with limited compromise to precision in derived measurement characteristics. Conclusion: This approach could facilitate more routine validation of code-based algorithms used to define key study parameters, ultimately enhancing understanding of the reliability of findings derived from database studies.
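
One way to picture a multi-wave design with a pre-defined stopping rule is sketched below: charts are reviewed wave by wave, and review stops once the confidence interval around the algorithm's positive predictive value is narrow enough. The Wilson interval, the 0.2 width threshold, and the simulated labels are assumptions chosen for illustration; the abstract does not specify the paper's actual stopping criteria.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

def waves_until_precise(wave_labels, max_width=0.2):
    """Review waves in order; stop once the interval is narrow enough.

    wave_labels: list of waves, each a list of 0/1 chart-review outcomes
                 (1 = reviewer confirmed the algorithm-flagged outcome).
    Returns (waves_used, charts_reviewed, (low, high)).
    """
    confirmed = reviewed = waves_used = 0
    low, high = 0.0, 1.0
    for labels in wave_labels:
        waves_used += 1
        confirmed += sum(labels)
        reviewed += len(labels)
        low, high = wilson_interval(confirmed, reviewed)
        if high - low <= max_width:
            break  # hypothetical stopping rule: precision target reached
    return waves_used, reviewed, (low, high)

# Simulated data: 5 waves of 20 charts, 18 confirmed per wave.
simulated_waves = [[1] * 18 + [0] * 2 for _ in range(5)]
waves_used, charts_reviewed, interval = waves_until_precise(simulated_waves)
```

In this simulated run the rule stops before all planned waves are reviewed, which is the mechanism by which a stopping rule can avoid reviewing a large share of charts.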

[33] Opacity as Authority: Arbitrariness and the Preclusion of Contestation

Naomi Omeonga wa Kayembe

Main category: cs.CL

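TL;DR: This paper redefines arbitrariness as a foundational functional mechanism and, drawing on the theories of Saussure and Shannon, presents it as a neutral operator central to control and interpersonal relations.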

Details

Motivation: The paper sets out to redefine arbitrariness and to examine its role in linguistic, legal, and social systems. Method: Building on Saussure's concept of the arbitrariness of the sign and on Shannon's entropy model, it proposes a theoretical framework for arbitrariness. Result: The paper introduces the "Motivation -> Constatability -> Contestability" chain and formalizes arbitrariness as A = H(L|M) (conditional entropy). Conclusion: The paper concludes that arbitrariness is neither a normative flaw nor a symptom of domination, but a foundational functional mechanism structuring human systems and interactions. Abstract: This article redefines arbitrariness not as a normative flaw or a symptom of domination, but as a foundational functional mechanism structuring human systems and interactions. Diverging from critical traditions that conflate arbitrariness with injustice, it posits arbitrariness as a semiotic trait: a property enabling systems - linguistic, legal, or social - to operate effectively while withholding their internal rationale. Building on Ferdinand de Saussure's concept of l'arbitraire du signe, the analysis extends this principle beyond language to demonstrate its cross-domain applicability, particularly in law and social dynamics. The paper introduces the "Motivation -> Constatability -> Contestability" chain, arguing that motivation functions as a crucial interface rendering an act's logic vulnerable to intersubjective contestation. When this chain is broken through mechanisms like "immotivization" or "Conflict Lateralization" (exemplified by "the blur of the wolf drowned in the fish"), acts produce binding effects without exposing their rationale, thus precluding justiciability. This structural opacity, while appearing illogical, is a deliberate design protecting authority from accountability. Drawing on Shannon's entropy model, the paper formalizes arbitrariness as A = H(L|M) (conditional entropy). It thereby proposes a modern theory of arbitrariness as a neutral operator central to control as well as care, an overlooked dimension of interpersonal relations. While primarily developed through human social systems, this framework also illuminates a new pathway for analyzing explainability in advanced artificial intelligence systems.
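
For readers unfamiliar with the quantity behind A = H(L|M), the sketch below computes a conditional entropy from a joint probability table. It treats L and M simply as two discrete random variables; the abstract does not define them here, so any reading of what they stand for, along with the example tables, is an assumption.

```python
import numpy as np

def conditional_entropy(joint):
    """H(L|M) in bits from a joint table with rows indexed by L, columns by M."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()               # normalize to a probability table
    p_m = joint.sum(axis=0, keepdims=True)    # marginal distribution of M
    with np.errstate(divide="ignore", invalid="ignore"):
        # p(l|m); cells with zero joint mass contribute nothing to the sum.
        cond = np.where(joint > 0, joint / p_m, 1.0)
    return float(-(joint * np.log2(cond)).sum())

# M fully determines L: no residual uncertainty.
determined = conditional_entropy([[0.5, 0.0], [0.0, 0.5]])
# M carries no information about L: one full bit remains.
independent = conditional_entropy([[0.25, 0.25], [0.25, 0.25]])
```

Under this reading, H(L|M) is zero when knowing M removes all uncertainty about L and grows as M explains less of L.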

[34] C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

Chengqian Ma,Wei Tao,Yiwen Guo

Main category: cs.CL

TL;DR: This paper highlights the research gap in understanding the practical effectiveness of Spoken Dialogue Models (SDMs) in comparison to text-based Large Language Models (LLMs) and presents a benchmark dataset along with an LLM-based evaluation method to evaluate SDMs' performance in handling spoken dialogue complexities.

Details

Motivation: The motivation for this research is the increasing popularity of Spoken Dialogue Models (SDMs) and the gap in research focused on understanding their practical effectiveness in comprehending and emulating human conversations, especially when compared to text-based Large Language Models (LLMs). Method: The authors present a benchmark dataset comprising 1,079 instances in English and Chinese, along with an LLM-based evaluation method that aligns closely with human judgment. Result: The result of this paper is the presentation of a benchmark dataset and an LLM-based evaluation method that facilitate a comprehensive exploration of the performance of SDMs in tackling practical challenges in spoken dialogue. Conclusion: This paper concludes that there is a research gap in understanding the practical effectiveness of Spoken Dialogue Models (SDMs) in comprehending and emulating human conversations, especially in comparison to text-based Large Language Models (LLMs). To address this, the authors present a benchmark dataset and an LLM-based evaluation method to explore the performance of SDMs. Abstract: Spoken Dialogue Models (SDMs) have recently attracted significant attention for their ability to generate voice responses directly to users' spoken queries. Despite their increasing popularity, there exists a gap in research focused on comprehensively understanding their practical effectiveness in comprehending and emulating human conversations. This is especially true compared to text-based Large Language Models (LLMs), which benefit from extensive benchmarking. Human voice interactions are inherently more complex than text due to characteristics unique to spoken dialogue. Ambiguity poses one challenge, stemming from semantic factors like polysemy, as well as phonological aspects such as heterograph, heteronyms, and stress patterns. 
Additionally, context-dependency, like omission, coreference, and multi-turn interaction, adds further complexity to human conversational dynamics. To illuminate the current state of SDM development and to address these challenges, we present a benchmark dataset in this paper, which comprises 1,079 instances in English and Chinese. Accompanied by an LLM-based evaluation method that closely aligns with human judgment, this dataset facilitates a comprehensive exploration of the performance of SDMs in tackling these practical challenges.

[35] Math Natural Language Inference: this should be easy!

Valeria de Paiva,Qiyue Gao,Hai Hu,Pavel Kovalev,Yikang Liu,Lawrence S. Moss,Zhiheng Qian

Main category: cs.CL

TL;DR: The study explores the ability of LLMs to perform natural language inference on mathematical texts, showing both potential and current limitations.

Details

Motivation: To determine if contemporary LLMs can effectively perform NLI tasks on mathematical texts and assess the feasibility of LLM-generated hypotheses. Method: Constructed a Math NLI corpus with premises from mathematical texts and human-labeled hypotheses; evaluated LLM performance and inter-group consistency. Result: LLMs demonstrated improved performance with majority voting but still struggled with basic mathematical inference tasks. Conclusion: LLMs show both potential and limitations in performing NLI tasks on mathematical texts, with majority voting showing promise while basic inference challenges persist. Abstract: We ask whether contemporary LLMs are able to perform natural language inference (NLI) tasks on mathematical texts. We call this the Math NLI problem. We construct a corpus of Math NLI pairs whose premises are from extant mathematical text and whose hypotheses and gold labels were provided by people with experience in both research-level mathematics and also in the NLI field. We also investigate the quality of corpora using the same premises but whose hypotheses are provided by LLMs themselves. We not only investigate the performance but also the inter-group consistency of the diverse group of LLMs. We have both positive and negative findings. Among our positive findings: in some settings, using a majority vote of LLMs is approximately equivalent to using human-labeled data in the Math NLI area. On the negative side: LLMs still struggle with mathematical language. They occasionally fail at even basic inferences. Current models are not as prone to hypothesis-only "inference" in our data the way the previous generation had been. In addition to our findings, we also provide our corpora as data to support future work on Math NLI.
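
The majority-vote finding can be pictured with a small sketch like the one below, which aggregates NLI labels from several models for one premise-hypothesis pair. The label names follow the common three-way NLI convention and the tie handling is an assumption; the abstract does not describe the authors' exact voting procedure.

```python
from collections import Counter

def majority_label(votes):
    """Return the most common label, or None on an empty list or a tie."""
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumed policy: leave ties unresolved
    return ranked[0][0]

# Hypothetical labels from three models for one Math NLI pair.
clear_vote = majority_label(["entailment", "entailment", "neutral"])
tied_vote = majority_label(["entailment", "contradiction"])
```

Returning `None` on ties keeps ambiguous items visible so they can be routed to a human annotator instead of being silently resolved.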

[36] Exploring In-Context Learning for Frame-Semantic Parsing

Diego Garat,Guillermo Moncecchi,Dina Wonsever

Main category: cs.CL

TL;DR: This paper proposes an In-Context Learning approach using Large Language Models to perform Frame Semantic Parsing without fine-tuning, achieving high F1 scores for Frame Identification and Semantic Role Labeling tasks related to violent events.

Details

Motivation: The study aims to explore the use of In-Context Learning (ICL) with Large Language Models (LLMs) for Frame Semantic Parsing (FSP), focusing on predicate identification and argument labeling according to Frame Semantics, without the need for model fine-tuning. Method: The paper proposes a method that automatically generates task-specific prompts for Frame Identification (FI) and Frame Semantic Role Labeling (FSRL) using the FrameNet database. These prompts, based on frame definitions and annotated examples, are applied to guide six different Large Language Models (LLMs) without model fine-tuning. Result: Experiments conducted on a subset of frames related to violent events achieved competitive results, with F1 scores of 94.3% for Frame Identification (FI) and 77.4% for Frame Semantic Role Labeling (FSRL). Conclusion: In-Context Learning (ICL) offers a practical and effective alternative to traditional fine-tuning for domain-specific Frame Semantic Parsing (FSP) tasks, particularly for Frame Identification (FI) and Frame Semantic Role Labeling (FSRL). Abstract: Frame Semantic Parsing (FSP) entails identifying predicates and labeling their arguments according to Frame Semantics. This paper investigates the use of In-Context Learning (ICL) with Large Language Models (LLMs) to perform FSP without model fine-tuning. We propose a method that automatically generates task-specific prompts for the Frame Identification (FI) and Frame Semantic Role Labeling (FSRL) subtasks, relying solely on the FrameNet database. These prompts, constructed from frame definitions and annotated examples, are used to guide six different LLMs. Experiments are conducted on a subset of frames related to violent events. The method achieves competitive results, with F1 scores of 94.3% for FI and 77.4% for FSRL. The findings suggest that ICL offers a practical and effective alternative to traditional fine-tuning for domain-specific FSP tasks.

[37] Context-aware Rotary Position Embedding

Ali Veisi,Delaram Fartoot,Hamidreza Amirzadeh

Main category: cs.CL

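TL;DR: CARoPE is an improved positional encoding method for Transformers that dynamically generates context- and token-dependent frequency patterns, increasing model expressiveness and performance.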

Details

Motivation: RoPE relies on static, input-independent sinusoidal frequency patterns, which limits its ability to model context-sensitive relationships, so a more flexible and expressive alternative is needed. Method: CARoPE computes input-dependent phase shifts using a bounded transformation of token embeddings and integrates them into the rotary mechanism, yielding token- and context-dependent representations. Result: On the FineWeb-Edu-10B dataset with GPT-2 variants, CARoPE consistently outperforms RoPE and other common positional encoding baselines on next-token prediction, achieving significantly lower perplexity even at longer context lengths, with faster training throughput and no loss of model stability. Conclusion: By dynamically generating context-dependent frequency patterns, CARoPE improves on standard RoPE and models context-sensitive relationships better while preserving efficiency and architectural simplicity. Abstract: Positional encoding is a vital component of Transformer architectures, enabling models to incorporate sequence order into self-attention mechanisms. Rotary Positional Embeddings (RoPE) have become a widely adopted solution due to their compatibility with relative position encoding and computational efficiency. However, RoPE relies on static, input-independent sinusoidal frequency patterns, limiting its ability to model context-sensitive relationships. In this work, we propose CARoPE (Context-Aware Rotary Positional Embedding), a novel generalization of RoPE that dynamically generates head-specific frequency patterns conditioned on token embeddings. This design introduces token- and context-sensitive positional representations while preserving RoPE efficiency and architectural simplicity. CARoPE computes input-dependent phase shifts using a bounded transformation of token embeddings and integrates them into the rotary mechanism across attention heads. We evaluate CARoPE on the FineWeb-Edu-10B dataset using GPT-2 variants trained on next-token prediction tasks. Experimental results show that CARoPE consistently outperforms RoPE and other common positional encoding baselines, achieving significantly lower perplexity, even at longer context lengths. Additionally, CARoPE enables faster training throughput without sacrificing model stability. These findings demonstrate that CARoPE offers a scalable, expressive, and efficient upgrade to existing positional encoding strategies in Transformer models.
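
The sketch below shows standard RoPE for a single attention head and one hypothetical way an input-dependent, bounded phase shift could be added to the rotation angles. Only the RoPE part follows the well-known formulation; the `context_phase_shift` function, its `tanh` bound, and the projection matrix are illustrative assumptions and should not be read as the CARoPE formula, which the abstract does not give.

```python
import numpy as np

def rotary_angles(seq_len, head_dim, base=10000.0):
    """Standard RoPE angles: position times a fixed per-pair frequency."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(np.arange(seq_len), inv_freq)   # (seq_len, head_dim // 2)

def apply_rotation(x, angles):
    """Rotate consecutive (even, odd) feature pairs of x by the given angles."""
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def context_phase_shift(token_emb, proj, max_shift=np.pi):
    """Hypothetical bounded shift: tanh keeps it within (-max_shift, max_shift)."""
    return max_shift * np.tanh(token_emb @ proj)    # (seq_len, head_dim // 2)

rng = np.random.default_rng(0)
seq_len, model_dim, head_dim = 6, 16, 8
token_emb = rng.normal(size=(seq_len, model_dim))   # stand-in token embeddings
queries = rng.normal(size=(seq_len, head_dim))      # stand-in per-head queries
proj = rng.normal(size=(model_dim, head_dim // 2)) * 0.1  # assumed learned weights

static_angles = rotary_angles(seq_len, head_dim)
shifted_angles = static_angles + context_phase_shift(token_emb, proj)
rope_queries = apply_rotation(queries, static_angles)
context_queries = apply_rotation(queries, shifted_angles)
```

Because the change only alters the rotation angles, vector norms are preserved exactly as in RoPE, which is consistent with the abstract's claim that the rotary mechanism itself is kept.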

[38] SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity

Ishani Mondal,Meera Bharadwaj,Ayush Roy,Aparna Garimella,Jordan Lee Boyd-Graber

Main category: cs.CL

TL;DR: SMART-Editor is a new framework for global layout and content editing that uses reward-guided planning to achieve superior results in structured and unstructured domains.

Details

Motivation: Prior models tend to perform local edits without maintaining global coherence, which SMART-Editor addresses. Method: The framework uses Reward-Refine for inference-time reward-guided refinement and RewardDPO for training-time preference optimization using reward-aligned layout pairs. Result: SMART-Editor outperforms baselines like InstructPix2Pix and HIVE, with RewardDPO achieving up to 15% gains in structured settings and Reward-Refine showing advantages in natural image editing. Conclusion: SMART-Editor is an effective framework for compositional layout and content editing, demonstrating superior performance in both structured and unstructured domains. Abstract: We present SMART-Editor, a framework for compositional layout and content editing across structured (posters, websites) and unstructured (natural images) domains. Unlike prior models that perform local edits, SMART-Editor preserves global coherence through two strategies: Reward-Refine, an inference-time reward-guided refinement method, and RewardDPO, a training-time preference optimization approach using reward-aligned layout pairs. To evaluate model performance, we introduce SMARTEdit-Bench, a benchmark covering multi-domain, cascading edit scenarios. SMART-Editor outperforms strong baselines like InstructPix2Pix and HIVE, with RewardDPO achieving up to 15% gains in structured settings and Reward-Refine showing advantages on natural images. Automatic and human evaluations confirm the value of reward-guided planning in producing semantically consistent and visually aligned edits.

[39] RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL

Jeffrey Eben,Aitzaz Ahmad,Stephen Lau

Main category: cs.CL

TL;DR: This paper proposes a scalable, component-based retrieval method for natural language interfaces to databases, eliminating the need for domain-specific fine-tuning and improving performance on enterprise-level data catalogs.

Details

Motivation: The motivation is to overcome the challenge of scaling large language model-based natural language interfaces to enterprise-level databases, which prior approaches fail to address due to reliance on domain-specific fine-tuning and lack of metadata utilization. Method: The authors introduce a component-based retrieval architecture that breaks down database schemas and metadata into discrete semantic units for targeted retrieval, focusing on table identification while utilizing column-level details. Result: The experiments show that the proposed method maintains high recall and accuracy, outperforming baseline systems on large databases with varying structures and metadata availability. Conclusion: The paper concludes that their component-based retrieval architecture effectively scales natural language interfaces for databases to enterprise-level data catalogs without domain-specific fine-tuning. Abstract: Despite advances in large language model (LLM)-based natural language interfaces for databases, scaling to enterprise-level data catalogs remains an under-explored challenge. Prior works addressing this challenge rely on domain-specific fine-tuning - complicating deployment - and fail to leverage important semantic context contained within database metadata. To address these limitations, we introduce a component-based retrieval architecture that decomposes database schemas and metadata into discrete semantic units, each separately indexed for targeted retrieval. Our approach prioritizes effective table identification while leveraging column-level information, ensuring the total number of retrieved tables remains within a manageable context budget. Experiments demonstrate that our method maintains high recall and accuracy, with our system outperforming baselines over massive databases with varying structure and available metadata. 
Our solution enables practical text-to-SQL systems deployable across diverse enterprise settings without specialized fine-tuning, addressing a critical scalability gap in natural language database interfaces.

[40] Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity

Xinwei Wu,Haojie Li,Hongyu Liu,Xinyu Ji,Ruohan Li,Yule Chen,Yigeng Zhang

Main category: cs.CL

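TL;DR: This study analyzes the limitations of large language models (LLMs) when processing ambiguous Chinese text, finds that they are fragile and overconfident when interpreting ambiguity, and argues for better handling of uncertainty in language understanding.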

Details

Motivation: How large language models (LLMs) behave when they encounter ambiguous narrative text, particularly Chinese textual ambiguity, matters for real-world applications. Method: The authors built a benchmark dataset of ambiguous sentences and their disambiguated pairs and ran experiments analyzing how LLMs handle ambiguity. Result: LLMs show significant fragility when handling ambiguity: they cannot reliably distinguish ambiguous from unambiguous text, are overconfident in their interpretations of ambiguous text, and overthink when trying to understand multiple possible meanings. Conclusion: The study reveals a fundamental limitation of current LLMs in handling linguistic ambiguity and points to the need for improved methods for dealing with uncertainty in language understanding. Abstract: In this work, we study a critical research problem regarding the trustworthiness of large language models (LLMs): how LLMs behave when encountering ambiguous narrative text, with a particular focus on Chinese textual ambiguity. We created a benchmark dataset by collecting and generating ambiguous sentences with context and their corresponding disambiguated pairs, representing multiple possible interpretations. These annotated examples are systematically categorized into 3 main categories and 9 subcategories. Through experiments, we discovered significant fragility in LLMs when handling ambiguity, revealing behavior that differs substantially from humans. Specifically, LLMs cannot reliably distinguish ambiguous text from unambiguous text, show overconfidence in interpreting ambiguous text as having a single meaning rather than multiple meanings, and exhibit overthinking when attempting to understand the various possible meanings. Our findings highlight a fundamental limitation in current LLMs that has significant implications for their deployment in real-world applications where linguistic ambiguity is common, calling for improved approaches to handle uncertainty in language understanding. The dataset and code are publicly available at this GitHub repository: https://github.com/ictup/LLM-Chinese-Textual-Disambiguation.

[41] ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans

Ananya Sadana,Yash Kumar Lal,Jiawei Zhou

Main category: cs.CL

TL;DR: ISO-Bench highlights the limitations of current multimodal AI models in understanding causal relationships between images and text, showing significant room for improvement compared to human performance.

Details

Motivation: Understanding causal relationships across different data modalities is essential for multimodal models working in real-world environments, yet it remains a challenging task. Method: The authors introduced ISO-Bench, a benchmark designed to evaluate the ability of multimodal models to infer causal dependencies between visual observations and procedural text. The evaluation involved ten advanced vision-language models. Result: The evaluation results showed that the best zero-shot performance achieved an F1 score of 0.57, while using chain-of-thought reasoning improved the score to a maximum of 0.62 F1, significantly lower than human performance (0.98 F1). Conclusion: The study concludes that there is a significant gap in the causal understanding capabilities of current multimodal models compared to humans, indicating the need for further research and development in this area. Abstract: Understanding causal relationships across modalities is a core challenge for multimodal models operating in real-world environments. We introduce ISO-Bench, a benchmark for evaluating whether models can infer causal dependencies between visual observations and procedural text. Each example presents an image of a task step and a text snippet from a plan, with the goal of deciding whether the visual step occurs before or after the referenced text step. Evaluation results on ten frontier vision-language models show underwhelming performance: the best zero-shot F1 is only 0.57, and chain-of-thought reasoning yields only modest gains (up to 0.62 F1), largely behind humans (0.98 F1). Our analysis further highlights concrete directions for improving causal understanding in multimodal models.

[42] User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal

Yuhan Liu,Michael J. Q. Zhang,Eunsol Choi

Main category: cs.CL

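TL;DR: This paper studies how to extract implicit user feedback from user-LM interaction logs. It finds that the content of the feedback can help improve model performance, but that its usefulness is limited by the quality of the user's initial prompt.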

Details

Motivation: Once a language model is deployed, it can be improved continuously from user feedback, but asking for feedback directly can be disruptive, so it is worth studying how to extract implicit feedback from user-model interaction logs. Method: The authors analyze implicit user feedback in two user-LM interaction datasets, WildChat and LMSYS, examining what the feedback contains and how it affects model performance. Result: The content of user feedback (e.g., the user wanted clarification) improves model performance on short questions (MTBench) but not on longer, more complex questions (WildBench). The usefulness of the feedback also depends heavily on the quality of the user's initial prompt. Conclusion: The content of user feedback can help improve model performance, but its effect is closely tied to the quality of the user's initial prompt. Implicit user feedback shows potential but also has limitations. Abstract: Once language models (LMs) are deployed, they can interact with users long-term, ideally evolving continuously based on their feedback. Asking for direct user feedback can be disruptive; thus, we study harvesting user feedback from user-LM interaction logs. We study implicit user feedback in two user-LM interaction datasets (WildChat and LMSYS). First, we analyze user feedback in the user-LLM conversation trajectory, providing insights into when and why such feedback occurs. Second, we study harvesting learning signals from such implicit user feedback. We find that the contents of user feedback (e.g., user wanted clarification), not just the polarity (e.g., users were unhappy with the previous model response), can improve model performance in short human-designed questions (MTBench) but not on longer and more complex questions (WildBench). We also find that the usefulness of user feedback is largely tied to the quality of the user's initial prompt. Together, we provide an in-depth study of implicit user feedback, showing its potential and limitations.

[43] LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration

Jizhou Guo

Main category: cs.CL

TL;DR: LENS learns to estimate model confidence from neural states, providing a novel and effective way to ensemble multiple models.

Details

Motivation: Existing ensemble methods often rely on simple techniques that overlook the varying confidence and reliability of models in different contexts. A more effective approach is needed to improve system robustness and performance. Method: LENS trains a lightweight linear confidence predictor that takes layer-wise hidden states and normalized probabilities as inputs to estimate model confidence. Result: Experiments on multiple-choice and boolean question-answering tasks show that LENS outperforms traditional ensemble methods. Conclusion: LENS is an effective ensemble method that estimates model confidence by analyzing internal representations, allowing more nuanced weighting of model predictions. Abstract: Large Language Models (LLMs) have demonstrated impressive performance across various tasks, with different models excelling in distinct domains and specific abilities. Effectively combining the predictions of multiple LLMs is crucial for enhancing system robustness and performance. However, existing ensemble methods often rely on simple techniques like voting or logits ensembling, which overlook the varying confidence and reliability of models in different contexts. In this work, we propose LENS (Learning ENsemble confidence from Neural States), a novel approach that learns to estimate model confidence by analyzing internal representations. For each LLM, we train a lightweight linear confidence predictor that leverages layer-wise hidden states and normalized probabilities as inputs. This allows for more nuanced weighting of model predictions based on their context-dependent reliability. Our method does not require modifying the model parameters and requires negligible additional computation. Experimental results on multiple-choice and boolean question-answering tasks demonstrate that LENS outperforms traditional ensemble methods by a substantial margin. Our findings suggest that internal representations provide valuable signals for determining model confidence and can be effectively leveraged for ensemble learning.
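
The confidence-weighted ensembling idea in the LENS entry can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `linear_confidence` and `weighted_vote`, the sigmoid squashing, and the rule of summing confidences per answer are all assumptions made for illustration. The abstract only states that a lightweight linear predictor over hidden states and normalized probabilities is used to weight model predictions.

```python
import math
from collections import defaultdict


def linear_confidence(features, weights, bias):
    """Hypothetical linear confidence predictor: sigmoid(w . x + b).

    `features` stands in for a model's hidden-state and probability
    features; the sigmoid is an assumption, not taken from the paper.
    """
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def weighted_vote(predictions, confidences):
    """Pick the answer whose supporting models have the highest total confidence."""
    if len(predictions) != len(confidences):
        raise ValueError("need one confidence per prediction")
    scores = defaultdict(float)
    for answer, conf in zip(predictions, confidences):
        scores[answer] += conf
    # Ties go to the answer that was seen first.
    return max(scores, key=scores.get)


# Two low-confidence models agree on "B", one high-confidence model says "A".
# Plain majority voting would return "B"; confidence weighting returns "A".
choice = weighted_vote(["A", "B", "B"], [0.9, 0.3, 0.4])
```

The point of the example is the contrast with simple voting: a single confident model can outweigh several unsure ones, which is the kind of context-dependent weighting the abstract describes.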

[44] Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

Jianghui Wang,Vinay Joshi,Saptarshi Majumder,Xu Chao,Bin Ding,Ziqiong Liu,Pratik Prabhanjan Brahma,Dong Li,Zicheng Liu,Emad Barsoum

Main category: cs.CL

TL;DR: GEAK improves GPU kernel generation using LLMs, outperforming existing methods in both correctness and speed.

Details

Motivation: Growing complexity and diversity of deep learning workloads require automation of low-level kernel development for performance and productivity. Method: GEAK uses inference-time compute scaling and a Reflexion-style reasoning loop to generate optimized Triton code for GPU kernels. Result: GEAK achieves up to 63% correctness and 2.59X speedup over baseline methods on AMD GPUs like MI300X and MI250. Conclusion: GEAK is a promising framework for generating efficient GPU kernels for AMD GPUs using LLMs, outperforming other methods in correctness and execution speed. Abstract: The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPUs, aiming to reduce manual optimization efforts while achieving near-expert performance on hardware like AMD MI300X. The Triton language, a Python-based DSL for GPU programming, has emerged as a popular target for such AI-generated kernels due to its balance of performance and ease-of-coding. In this work, we present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels)-a framework that leverages cutting-edge LLMs to generate performant Triton code specifically for AMD GPUs, including the AMD MI300X and MI250. GEAK leverages inference-time compute scaling to produce Triton-based GPU kernels using a reasoning loop adapted from Reflexion-style feedback mechanisms. 
On two evaluation benchmarks, GEAK significantly outperformed the baselines of directly prompting frontier LLMs as well as Reflexion-based generation pipelines by achieving correctness up to 63% and execution speed up of up to 2.59X. These results highlight the promise of GEAK-like agentic code generation for accelerating the adoption of diverse hardware platforms and democratizing access to expert-level kernel performance.

[45] Failures Are the Stepping Stones to Success: Enhancing Few-Shot In-Context Learning by Leveraging Negative Samples

Yunhao Liang,Ruixuan Ying,Takuya Taniguchi,Zhe Cui

Main category: cs.CL

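TL;DR: This paper proposes a new method that uses negative samples to improve the selection of positive samples, thereby improving the few-shot in-context learning (ICL) performance of large language models.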

Details

Motivation: Existing work relies mainly on positive samples to improve ICL performance and overlooks the potentially valuable information in negative samples. Using that information makes it possible to select positive examples more effectively and so improve ICL. Method: Positive and negative sample corpora are first constructed based on Zero-Shot-Cot. At inference time, a semantic-similarity-based method selects the most similar examples from both corpora. Further positive samples are then retrieved from the positive corpus by their semantic similarity to the selected negative samples, and these are concatenated with the previously selected positives to serve as ICL demonstrations. Result: Experiments show that the method outperforms approaches that build the context only from the most similar positive samples, confirming that negative-sample information helps improve ICL performance. Conclusion: Using negative samples to refine positive sample selection effectively improves few-shot in-context learning, indicating that the information in negative samples benefits ICL. Abstract: Large Language Models exhibit powerful few-shot in-context learning (ICL) capabilities, but the performance is highly sensitive to provided examples. Recent research has focused on retrieving corresponding examples for each input query, not only enhancing the efficiency and scalability of the learning process but also mitigating inherent biases in manual example selection. However, these studies have primarily emphasized leveraging Positive samples while overlooking the additional information within Negative samples for contextual learning. We propose a novel method that utilizes Negative samples to better select Positive sample examples, thereby enhancing the performance of few-shot ICL. Initially, we construct Positive and Negative sample corpora based on Zero-Shot-Cot. Then, during inference, we employ a semantic similarity-based approach to select the most similar examples from both the Positive and Negative corpora for a given query. Subsequently, we further retrieve Positive examples from the Positive sample corpus based on semantic similarity to the Negative examples, then concatenating them with the previously selected Positive examples to serve as ICL demonstrations. Experimental results demonstrate that our approach surpasses methods solely relying on the most similar positive examples for context, validating that the additional information in negative samples aids in enhancing ICL performance through improved Positive sample selection.
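
The two-stage retrieval described in this entry can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's code: embeddings are plain Python lists, similarity is cosine similarity, and the names `cosine`, `top_k`, and `select_demonstrations`, the de-duplication step, and the ordering of the final demonstrations are all hypothetical choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, corpus, k):
    """Return the k (text, vector) pairs in `corpus` most similar to `query_vec`."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return ranked[:k]


def select_demonstrations(query_vec, positives, negatives, k=1):
    """Stage 1: nearest positives and negatives to the query.
    Stage 2: extra positives that are nearest to the retrieved negatives.
    """
    pos_hits = top_k(query_vec, positives, k)
    neg_hits = top_k(query_vec, negatives, k)
    extra = []
    for _, neg_vec in neg_hits:
        for item in top_k(neg_vec, positives, k):
            # Skip positives that were already selected (an assumption).
            if item not in pos_hits and item not in extra:
                extra.append(item)
    return [text for text, _ in pos_hits + extra]


positives = [("p1", [1.0, 0.0]), ("p2", [0.0, 1.0])]
negatives = [("n1", [0.1, 1.0])]
demos = select_demonstrations([1.0, 0.2], positives, negatives, k=1)
```

Here the query is closest to `p1`, its nearest negative is `n1`, and `n1` in turn pulls in `p2`, so the demonstrations contain a positive example that plain nearest-neighbor retrieval would have missed.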

[46] Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders

Carolina Zheng,Nicolas Beltran-Velez,Sweta Karlekar,Claudia Shi,Achille Nazaret,Asif Mallik,Amir Feder,David M. Blei

Main category: cs.CL

TL;DR: The paper introduces Mechanistic Topic Models (MTMs), which use sparse autoencoders to operate on interpretable features and enable controllable text generation. They propose an LLM-based evaluation framework called 'topic judge' and demonstrate that MTMs match or exceed traditional and neural baselines in coherence metrics while being consistently preferred in evaluations.

Details

Motivation: Traditional topic models struggle to capture semantically abstract features due to their reliance on bag-of-words representations, and some neural variants are constrained by expressing topics as word lists, which limits their ability to articulate complex topics. Method: The researchers introduced Mechanistic Topic Models (MTMs), which operate on interpretable features learned by sparse autoencoders (SAEs), and proposed a new evaluation framework called 'topic judge', which is an LLM-based pairwise comparison method. Result: Across five datasets, MTMs match or exceed traditional and neural baselines on coherence metrics and are consistently preferred by the topic judge evaluation framework. Conclusion: MTMs enable effective steering of LLM outputs and are preferred by the topic judge evaluation framework when compared to traditional and neural baselines. Abstract: Traditional topic models are effective at uncovering latent themes in large text collections. However, due to their reliance on bag-of-words representations, they struggle to capture semantically abstract features. While some neural variants use richer representations, they are similarly constrained by expressing topics as word lists, which limits their ability to articulate complex topics. We introduce Mechanistic Topic Models (MTMs), a class of topic models that operate on interpretable features learned by sparse autoencoders (SAEs). By defining topics over this semantically rich space, MTMs can reveal deeper conceptual themes with expressive feature descriptions. Moreover, uniquely among topic models, MTMs enable controllable text generation using topic-based steering vectors. To properly evaluate MTM topics against word-list-based approaches, we propose topic judge, an LLM-based pairwise comparison evaluation framework. 
Across five datasets, MTMs match or exceed traditional and neural baselines on coherence metrics, are consistently preferred by topic judge, and enable effective steering of LLM outputs.

[47] Enabling Few-Shot Alzheimer's Disease Diagnosis on Tabular Biomarker Data with LLMs

Sophie Kearney,Shu Yang,Zixuan Wen,Bojian Hou,Duy Duong-Tran,Tianlong Chen,Jason Moore,Marylyn Ritchie,Li Shen

Main category: cs.CL

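TL;DR: This paper proposes TAP-GPT, a new framework for early and accurate diagnosis of Alzheimer's disease from tabular biomarker data. It builds on a large language model (LLM), uses few-shot learning and parameter-efficient fine-tuning, and performs well on a clinical binary classification task.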

Details

Motivation: Alzheimer's disease (AD) is a complex neurodegenerative disorder whose early and accurate diagnosis requires analyzing heterogeneous biomarkers represented in tabular form. Large language models (LLMs), with flexible few-shot reasoning, multimodal integration, and natural-language-based interpretability, offer unprecedented opportunities for prediction with structured biomedical data. Method: The approach adapts TableGPT2, a multimodal tabular-specialized LLM originally developed for business intelligence tasks, and fine-tunes it with parameter-efficient qLoRA adaptation for a clinical binary classification task (AD or cognitively normal). Result: By drawing on TableGPT2's strong tabular understanding and the prior knowledge encoded in LLMs, TAP-GPT outperforms more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks on AD diagnosis. Conclusion: To the authors' knowledge, TAP-GPT is the first framework to apply LLMs to this prediction task using tabular biomarker data, paving the way for future LLM-driven multi-agent frameworks in biomedical informatics. Abstract: Early and accurate diagnosis of Alzheimer's disease (AD), a complex neurodegenerative disorder, requires analysis of heterogeneous biomarkers (e.g., neuroimaging, genetic risk factors, cognitive tests, and cerebrospinal fluid proteins) typically represented in a tabular format. With flexible few-shot reasoning, multimodal integration, and natural-language-based interpretability, large language models (LLMs) offer unprecedented opportunities for prediction with structured biomedical data. We propose a novel framework called TAP-GPT, Tabular Alzheimer's Prediction GPT, that adapts TableGPT2, a multimodal tabular-specialized LLM originally developed for business intelligence tasks, for AD diagnosis using structured biomarker data with small sample sizes. Our approach constructs few-shot tabular prompts using in-context learning examples from structured biomedical data and finetunes TableGPT2 using the parameter-efficient qLoRA adaption for a clinical binary classification task of AD or cognitively normal (CN). The TAP-GPT framework harnesses the powerful tabular understanding ability of TableGPT2 and the encoded prior knowledge of LLMs to outperform more advanced general-purpose LLMs and a tabular foundation model (TFM) developed for prediction tasks. To our knowledge, this is the first application of LLMs to the prediction task using tabular biomarker data, paving the way for future LLM-driven multi-agent frameworks in biomedical informatics.

Sneha Oram,Pushpak Bhattacharyya

Main category: cs.CL

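TL;DR: This paper introduces the P-ReMe dataset and tasks for analyzing implicature and presupposition in mental health dialogue, evaluates the reasoning ability of several large language models in this domain, and proposes StiPRompts to study stigma around mental health.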

Details

Motivation: Explainability and dialogue reasoning in the mental health domain remain underexplored, so the reasoning ability of large language models in this domain needs to be studied. Method: The authors introduce the P-ReMe dataset and a modified definition of the pragmatic phenomena, design the associated tasks, and benchmark several models. They also introduce StiPRompts to evaluate how models handle mental health stigma. Result: Experiments show that Mistral and Qwen perform well on the reasoning tasks, while Claude-3.5-haiku handles mental health stigma more responsibly than the other models tested. Conclusion: This work advances research on model reasoning and ethical issues in mental health dialogue systems and provides a foundation and direction for future development. Abstract: There has been an increase in recent advancements in the explainability and development of personalized chatbots for mental health. However, the reasoning aspects for explainability and dialogue discourse have not been explored previously for mental health. Hence, we are investigating the pragmatic reasoning capability of large language models (LLMs) in this domain. We introduce P-ReMe dataset, and propose a modified definition for the pragmatic phenomena of implicature (implied meaning) and presupposition (implicit assumption) in mental health. Following the definition, we formulate two tasks in implicature and one task in presupposition. To benchmark the dataset and the presented tasks, we consider four models - Llama3.1, Mistral, MentaLLaMa, and Qwen. The results of the experiments suggest that Mistral and Qwen show substantial reasoning capabilities in the domain. In addition, we also propose StiPRompts to study the stigma around mental health with the state-of-the-art LLMs, GPT-4o mini, Deepseek-chat, and Claude-3.5-haiku. Our evaluated findings show that Claude-3.5-haiku deals with the stigma more responsibly compared to the other two LLMs.

[49] Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis

Shimanto Bhowmik,Tawsif Tashwar Dipto,Md Sazzad Islam,Sheryl Hsu,Tahsin Reasat

Main category: cs.CL

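TL;DR: This study systematically investigates the challenges that hinder Bengali NLP performance and evaluates 10 open-source large language models, finding cross-lingual performance gaps and greater robustness in certain architectures.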

Details

Motivation: Bengali is underrepresented in NLP research and remains challenging because of its unique linguistic structure and computational constraints. Method: Focusing on the absence of standardized evaluation benchmarks, the researchers systematically investigate the challenges that hinder Bengali NLP performance. They evaluate 10 recent open-source LLMs on 8 translated datasets and carry out a comprehensive error analysis to identify the primary failure modes. Result: The results reveal consistent performance gaps for Bengali compared to English, particularly for smaller models and specific model families such as Mistral. Certain architectures, such as DeepSeek, maintain more stable performance across languages. The analysis also reveals an inverse relationship between tokenization efficiency and LLM accuracy: models perform worse when inputs are excessively tokenized, while more efficient and concise tokenization improves performance. Conclusion: The study highlights where current models fall short on underrepresented languages such as Bengali and points to the need for better dataset quality and evaluation methodologies tailored to multilingual contexts. The work is intended to encourage further NLP research on underrepresented languages and help democratize access to advanced language technologies worldwide. Abstract: Bengali is an underrepresented language in NLP research. However, it remains a challenge due to its unique linguistic structure and computational constraints. In this work, we systematically investigate the challenges that hinder Bengali NLP performance by focusing on the absence of standardized evaluation benchmarks. We then evaluated 10 recent open source Large Language Models (LLMs) in 8 of the translated datasets and performed a comprehensive error analysis to pinpoint their primary failure modes. Our findings reveal consistent performance gaps for Bengali compared to English, particularly for smaller models and specific model families like Mistral. We also identified promising robustness in certain architectures, such as DeepSeek, that maintain more stable performance across languages. Our analysis reveals an inverse relationship between tokenization efficiency and LLM accuracy where models tend to perform worse when inputs are excessively tokenized, whereas more efficient and concise tokenization results in improved performance. These findings highlight critical areas where current models fall short and underscore the need for improved dataset quality and evaluation methodologies tailored to multilingual contexts. This work will catalyze further research on NLP for underrepresented languages, helping to democratize access to advanced language technologies worldwide. The code and dataset used in this research are publicly available at https://github.com/BengaliAI/bn-llm-benchmark.
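
One common way to quantify how heavily a tokenizer fragments text is the average number of tokens per whitespace-separated word. The sketch below assumes that measure; the abstract does not say how the authors define tokenization efficiency, and the function name `tokens_per_word` and the toy tokenizers are hypothetical.

```python
def tokens_per_word(text, tokenize):
    """Average number of tokens produced per whitespace-separated word.

    `tokenize` is any callable mapping a string to a list of tokens.
    Higher values mean the text is split into more pieces per word.
    """
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


def char_tokenizer(text):
    """Toy tokenizer that emits one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


sample = "ab cd"
word_level = tokens_per_word(sample, str.split)       # one token per word
char_level = tokens_per_word(sample, char_tokenizer)  # two tokens per word
```

Under this measure, a tokenizer that over-segments a language yields a higher ratio, which is the condition the abstract associates with lower accuracy.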

[50] Unveiling Super Experts in Mixture-of-Experts Large Language Models

Zunhai Su,Qingyuan Li,Hao Zhang,YuLei Qian,Yuchen Xie,Kehong Yuan

Main category: cs.CL

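TL;DR: This study is the first to identify and analyze Super Experts (SEs), a small set of critical experts in MoE large language models. They play an irreplaceable role during inference, and pruning or compressing them significantly degrades model performance, especially mathematical reasoning, and disrupts the attention mechanism.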

Details

Motivation: Existing expert-compression methods for MoE models rely on empirical criteria and lack a deeper understanding of the heterogeneous importance of experts. This study therefore explores the role of critical experts in MoE models and the effect of compressing them. Method: Through SE pruning experiments, the authors analyze the impact on model performance, study the role of SEs across different tasks, and examine how compressing SEs affects the distribution of attention scores. Result: SEs are prevalent in MoE models, and pruning them causes significant performance degradation, particularly on mathematical reasoning tasks. SEs produce extreme activations in the hidden states, and their distribution is unaffected by post-training. Conclusion: The study identifies a class of experts, Super Experts (SEs), that play a critical role in MoE LLMs and strongly affect model performance, especially mathematical reasoning. Compressing SEs also significantly disrupts the attention mechanism. Abstract: Sparsely activated Mixture-of-Experts (MoE) models have shown promise in enhancing the learning capacity of large language models (LLMs). Leveraging the intrinsic importance differences among experts, recent research has explored expert-level compression techniques to improve the efficiency of MoE LLMs. However, existing approaches often rely on empirical criteria to identify critical experts, lacking a deeper exploration and understanding of the heterogeneous importance of experts. In this study, we present the first discovery and investigation of a distinct subset of experts that play a crucial role in the underlying mechanisms during the model's forward inference. These experts are prevalent in open-source MoE LLMs, and despite their limited number, pruning them leads to a significant decline in model performance (e.g., pruning three causes Qwen3-30B-A3B to produce repetitive and uninformative outputs). We refer to these experts as Super Experts (SEs). Our comprehensive analysis provides progressively deeper insights into SEs. (i) SEs are characterized by rare but extreme activation outliers in the output of the down_proj, which give rise to massive activations in the hidden states between decoder layers. Moreover, the distribution of SEs remains model-specific and is unaffected by post-training processes. (ii) By pruning SEs, we assess their significance across a variety of tasks, revealing their considerable impact on the model's overall performance, particularly in mathematical reasoning. (iii) We further enhance our understanding of the influence of SEs compression. 
Our findings confirm that MoE LLMs rely on SEs to induce attention sinks, which are crucial for the distribution of attention scores but are significantly disrupted by SE pruning. The code is available at https://github.com/ZunhaiSu/Super-Experts-Profilling.

[51] What's Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content

Alfio Ferrara,Sergio Picascia,Laura Pinnavaia,Vojimir Ranitovic,Elisabetta Rocchetti,Alice Tuveri

Main category: cs.CL

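TL;DR: GPT-4o-mini implicitly moderates sensitive content, reducing offensive language even without explicit instructions, which shows its potential for content sanitization.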

Details

Motivation: To investigate whether LLMs implicitly moderate sensitive content without explicit instructions, rather than only as a result of explicit training. Method: Experiments analyze the behavior of GPT-4o-mini when paraphrasing sensitive content and evaluate the zero-shot ability of LLMs to classify sentence sensitivity. Result: GPT-4o-mini substantially reduces sensitive content when paraphrasing, and the zero-shot sensitivity classification of LLMs is compared against traditional methods. Conclusion: GPT-4o-mini shows implicit moderation of sensitive content, reducing derogatory and taboo language without explicit instructions. Abstract: Proprietary Large Language Models (LLMs) have shown tendencies toward politeness, formality, and implicit content moderation. While previous research has primarily focused on explicitly training models to moderate and detoxify sensitive content, there has been limited exploration of whether LLMs implicitly sanitize language without explicit instructions. This study empirically analyzes the implicit moderation behavior of GPT-4o-mini when paraphrasing sensitive content and evaluates the extent of sensitivity shifts. Our experiments indicate that GPT-4o-mini systematically moderates content toward less sensitive classes, with substantial reductions in derogatory and taboo language. Also, we evaluate the zero-shot capabilities of LLMs in classifying sentence sensitivity, comparing their performances against traditional methods.

[52] MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation

Daeyong Kwon,SeungHeon Doh,Juhan Nam

Main category: cs.CL

TL;DR: This paper proposes the MusT-RAG framework, which effectively improves the adaptation of large language models to the music domain.

Details

Motivation: Because music-specific knowledge makes up a relatively small proportion of their training data, large language models are of limited effectiveness in music-related applications. Method: The authors propose the MusT-RAG framework, pair it with MusWikiDB, a music-specialized vector database, and use context information during both inference and fine-tuning. Result: MusT-RAG significantly outperforms traditional fine-tuning on music question-answering tasks, and MusWikiDB is more effective than general Wikipedia corpora. Conclusion: The MusT-RAG framework significantly outperforms traditional fine-tuning approaches in improving the music-domain adaptation of large language models. Abstract: Recent advancements in Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains. While they exhibit strong zero-shot performance on various tasks, LLMs' effectiveness in music-related applications remains limited due to the relatively small proportion of music-specific knowledge in their training data. To address this limitation, we propose MusT-RAG, a comprehensive framework based on Retrieval Augmented Generation (RAG) to adapt general-purpose LLMs for text-only music question answering (MQA) tasks. RAG is a technique that provides external knowledge to LLMs by retrieving relevant context information when generating answers to questions. To optimize RAG for the music domain, we (1) propose MusWikiDB, a music-specialized vector database for the retrieval stage, and (2) utilize context information during both inference and fine-tuning processes to effectively transform general-purpose LLMs into music-specific models. Our experiment demonstrates that MusT-RAG significantly outperforms traditional fine-tuning approaches in enhancing LLMs' music domain adaptation capabilities, showing consistent improvements across both in-domain and out-of-domain MQA benchmarks. Additionally, our MusWikiDB proves substantially more effective than general Wikipedia corpora, delivering superior performance and computational efficiency.

[53] Text-to-SQL Task-oriented Dialogue Ontology Construction

Renato Vukovic,Carel van Niekerk,Michael Heck,Benjamin Ruppik,Hsien-Chin Lin,Shutong Feng,Nurul Lubis,Milica Gasic

Main category: cs.CL

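TL;DR: TeQoDO is an unsupervised method for constructing dialogue ontologies that combines the SQL programming ability of large language models with dialogue theory, improving explainability and task performance.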

Details

Motivation: Large language models rely on parametric knowledge, which limits explainability and trustworthiness. Task-oriented dialogue systems need an explicit external database and ontology, but traditional approaches require manual labels or supervised training. Method: TeQoDO uses the inherent SQL programming ability of large language models, combined with dialogue theory supplied in the prompt, to build a TOD ontology autonomously and without supervision. Result: TeQoDO outperforms transfer learning approaches, its constructed ontology is competitive on a downstream dialogue state tracking task, and it scales to the construction of larger ontologies. Ablation studies show that dialogue theory plays a key role in ontology construction. Conclusion: TeQoDO is a method for constructing task-oriented dialogue ontologies that can increase the explainability of large language models and performs well on dialogue state tracking. Abstract: Large language models (LLMs) are widely used as general-purpose knowledge sources, but they rely on parametric knowledge, limiting explainability and trustworthiness. In task-oriented dialogue (TOD) systems, this separation is explicit, using an external database structured by an explicit ontology to ensure explainability and controllability. However, building such ontologies requires manual labels or supervised training. We introduce TeQoDO: a Text-to-SQL task-oriented Dialogue Ontology construction method. Here, an LLM autonomously builds a TOD ontology from scratch without supervision using its inherent SQL programming capabilities combined with dialogue theory provided in the prompt. We show that TeQoDO outperforms transfer learning approaches, and its constructed ontology is competitive on a downstream dialogue state tracking task. Ablation studies demonstrate the key role of dialogue theory. TeQoDO also scales to allow construction of much larger ontologies, which we investigate on a Wikipedia and ArXiv dataset. We view this as a step towards broader application of ontologies to increase LLM explainability.

[54] MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models

Yiyan Ji,Haoran Chen,Qiguang Chen,Chengyue Wu,Libo Qin,Wanxiang Che

Main category: cs.CL

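TL;DR: This study introduces MPCC, the first benchmark for evaluating planning under multimodal constraints, and finds that existing MLLMs perform poorly in complex-constraint scenarios, underscoring the importance of constraint-aware reasoning.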

Details

Motivation: Current benchmarks cannot directly assess multimodal real-world planning capabilities, and they lack constraints or implicit constraints across modalities. Method: The authors introduce the Multimodal Planning with Complex Constraints (MPCC) benchmark, which focuses on three real-world tasks (Flight Planning, Calendar Planning, and Meeting Planning) and adds complex constraints with graded difficulty levels. Result: Experiments show that closed-source models produce feasible plans only 21.3% of the time, while open-source models average below 11%. MLLMs are also sensitive to constraint complexity, and traditional multimodal prompting strategies fail in multi-constraint scenarios. Conclusion: MPCC highlights the importance of multimodal constraint-aware reasoning for real-world MLLM applications and provides an evaluation framework and benchmark for progress in this area. Abstract: Multimodal planning capabilities refer to the ability to predict, reason, and design steps for task execution with multimodal context, which is essential for complex reasoning and decision-making across multiple steps. However, current benchmarks face two key challenges: (1) they cannot directly assess multimodal real-world planning capabilities, and (2) they lack constraints or implicit constraints across modalities. To address these issues, we introduce Multimodal Planning with Complex Constraints (MPCC), the first benchmark to systematically evaluate MLLMs' ability to handle multimodal constraints in planning. To address the first challenge, MPCC focuses on three real-world tasks: Flight Planning, Calendar Planning, and Meeting Planning. To solve the second challenge, we introduce complex constraints (e.g. budget, temporal, and spatial) in these tasks, with graded difficulty levels (EASY, MEDIUM, HARD) to separate constraint complexity from search space expansion. Experiments on 13 advanced MLLMs reveal significant challenges: closed-source models achieve only 21.3% feasible plans, while open-source models average below 11%. Additionally, we observe that MLLMs are highly sensitive to constraint complexity and that traditional multimodal prompting strategies fail in multi-constraint scenarios. Our work formalizes multimodal constraints in planning, provides a rigorous evaluation framework, and highlights the need for advancements in constraint-aware reasoning for real-world MLLM applications.

[55] Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models

Ailiang Lin,Zhuoyun Li,Kotaro Funakoshi

Main category: cs.CL

TL;DR: Causal2Vec enhances decoder-only LLMs for embedding tasks without architectural changes, using a pre-encoder and modified pooling strategy to achieve superior performance with reduced computational cost.

Details

Motivation: Existing methods either remove the causal attention mask, undermining the model's pretraining advantages, or use extra input text, increasing computational costs. Causal2Vec aims to enhance decoder-only LLMs without changing their architecture or adding overhead. Method: Causal2Vec employs a lightweight BERT-style model to generate a Contextual token, which is prepended to the LLM input. The final embedding is formed by concatenating the last hidden states of the Contextual and EOS tokens. Result: Causal2Vec achieves state-of-the-art performance on MTEB among models trained only on publicly available retrieval datasets, reduces sequence length by up to 85%, and inference time by up to 82% compared to best-performing methods. Conclusion: Causal2Vec improves the performance of decoder-only LLMs as embedding models by introducing a lightweight BERT-style pre-encoder and a modified pooling strategy, achieving state-of-the-art results on MTEB while reducing sequence length and inference time. Abstract: Decoder-only large language models (LLMs) are increasingly used to build embedding models that effectively encode the semantic information of natural language texts into dense vector representations for various embedding tasks. However, many existing methods primarily focus on removing the causal attention mask in LLMs to enable bidirectional attention, potentially undermining the model's ability to extract semantic information acquired during pretraining. Additionally, leading unidirectional approaches often rely on extra input text to overcome the inherent limitations of causal attention, inevitably increasing computational costs. In this work, we propose Causal2Vec, a general-purpose embedding model tailored to enhance the performance of decoder-only LLMs without altering their original architectures or introducing significant computational overhead. 
Specifically, we first employ a lightweight BERT-style model to pre-encode the input text into a single Contextual token, which is then prepended to the LLM's input sequence, allowing each token to capture contextualized information even without attending to future tokens. Furthermore, to mitigate the recency bias introduced by last-token pooling and help LLMs better leverage the semantic information encoded in the Contextual token, we concatenate the last hidden states of Contextual and EOS tokens as the final text embedding. In practice, Causal2Vec achieves state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB) among models trained solely on publicly available retrieval datasets, while reducing the required sequence length by up to 85% and inference time by up to 82% compared to best-performing methods.
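
The pooling step described above, concatenating the last hidden states of the Contextual and EOS tokens, can be sketched in a few lines. This is an illustration only: hidden states are represented as plain Python lists, the function name `concat_pool` is hypothetical, and the default positions (Contextual token at index 0, EOS at the final position) are assumptions, since a real input may place other tokens such as an instruction prefix before the Contextual token.

```python
def concat_pool(hidden_states, contextual_index=0, eos_index=-1):
    """Concatenate the final-layer hidden states of two tokens.

    `hidden_states` is a list with one vector per sequence position.
    The result has twice the hidden size of the model.
    """
    if not hidden_states:
        raise ValueError("hidden_states must contain at least one position")
    contextual = hidden_states[contextual_index]
    eos = hidden_states[eos_index]
    return list(contextual) + list(eos)


# Toy sequence: [Contextual, one ordinary token, EOS], hidden size 2.
states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
embedding = concat_pool(states)
```

Compared with last-token pooling, which would keep only the EOS vector, the concatenated embedding also carries the Contextual token's state, which is how the abstract motivates the reduction in recency bias.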

[56] Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators

Peter Sandrini

Main category: cs.CL

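TL;DR: This study examines the feasibility of locally deployed open-source language models as an alternative to cloud-based AI translation solutions, finding advantages in privacy and data control that make them suitable for individuals and small businesses.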

Details

Motivation: The rapid development of large language models brings both opportunities and challenges for the translation field, and existing cloud-based AI chatbots raise concerns about data privacy, security, and equitable access, so alternative deployment models need to be explored. Method: Three open-source models that can be installed on CPU-based platforms are evaluated and compared against existing commercial online chatbots, focusing on functional performance rather than a comparison of human and machine translation quality. Result: Locally deployed open-source language models are feasible in terms of functional performance and offer individual translators and small businesses a more accessible and practical AI option. Conclusion: Locally deployed open-source language models are a viable alternative to commercial cloud-based AI solutions. Despite some challenges, they offer clear benefits in data control, privacy, and reduced dependency on cloud services. Abstract: The rapid proliferation of Large Language Models presents both opportunities and challenges for the translation field. While commercial, cloud-based AI chatbots have garnered significant attention in translation studies, concerns regarding data privacy, security, and equitable access necessitate exploration of alternative deployment models. This paper investigates the feasibility and performance of locally deployable, free language models as a viable alternative to proprietary, cloud-based AI solutions. This study evaluates three open-source models installed on CPU-based platforms and compared against commercially available online chat-bots. The evaluation focuses on functional performance rather than a comparative analysis of human-machine translation quality, an area already subject to extensive research. The platforms assessed were chosen for their accessibility and ease of use across various operating systems. While local deployment introduces its own challenges, the benefits of enhanced data control, improved privacy, and reduced dependency on cloud services are compelling. The findings of this study contribute to a growing body of knowledge concerning the democratization of AI technology and inform future research and development efforts aimed at making LLMs more accessible and practical for a wider range of users, specifically focusing on the needs of individual translators and small businesses.

[57] MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization

Yongbing Zhang,Fang Nan,Shengxiang Gao,Yuxin Huang,Kaiwen Tan,Zhengtao Yu

Main category: cs.CL

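TL;DR: This paper proposes MRGSEM-Sum, an unsupervised multi-document summarization method based on multi-relational graphs and structural entropy minimization. It handles complex relationships between documents, reduces information redundancy, and generates high-quality summaries.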

Details

Motivation: Existing single-relational graph methods cannot fully represent rich relational information, and they require a predefined number of clusters, which makes it hard to partition sentence groups adaptively to reduce redundancy. Method: The framework builds a multi-relational graph that integrates semantic and discourse relations between sentences, clusters it with a two-dimensional structural entropy minimization algorithm, and applies a position-aware compression mechanism to distill each cluster. Result: The method outperforms previous unsupervised methods on all four benchmark datasets and in several cases matches supervised models and large language models. Human evaluation shows summary quality approaching human level. Conclusion: MRGSEM-Sum is an unsupervised multi-document summarization framework based on multi-relational graphs and structural entropy minimization that determines the optimal number of clusters adaptively and organizes sentences effectively to generate coherent summaries. Abstract: The core challenge faced by multi-document summarization is the complexity of relationships among documents and the presence of information redundancy. Graph clustering is an effective paradigm for addressing this issue, as it models the complex relationships among documents using graph structures and reduces information redundancy through clustering, achieving significant research progress. However, existing methods often only consider single-relational graphs and require a predefined number of clusters, which hinders their ability to fully represent rich relational information and adaptively partition sentence groups to reduce redundancy. To overcome these limitations, we propose MRGSEM-Sum, an unsupervised multi-document summarization framework based on multi-relational graphs and structural entropy minimization. Specifically, we construct a multi-relational graph that integrates semantic and discourse relations between sentences, comprehensively modeling the intricate and dynamic connections among sentences across documents. We then apply a two-dimensional structural entropy minimization algorithm for clustering, automatically determining the optimal number of clusters and effectively organizing sentences into coherent groups. Finally, we introduce a position-aware compression mechanism to distill each cluster, generating concise and informative summaries. Extensive experiments on four benchmark datasets (Multi-News, DUC-2004, PubMed, and WikiSum) demonstrate that our approach consistently outperforms previous unsupervised methods and, in several cases, achieves performance comparable to supervised models and large language models. 
Human evaluation demonstrates that the summaries generated by MRGSEM-Sum exhibit high consistency and coverage, approaching human-level quality.
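To make the clustering objective concrete, here is a minimal sketch of the two-dimensional structural entropy of a graph partition, the quantity that structural entropy minimization tries to reduce. It uses the standard definition for a two-level encoding tree on an undirected, unweighted graph. The function name `structural_entropy_2d`, the toy graph, and the two candidate partitions are illustrative assumptions and are not taken from MRGSEM-Sum, which operates on a multi-relational sentence graph.

```python
import math
from collections import defaultdict


def structural_entropy_2d(edges, partition):
    """Two-dimensional structural entropy of an undirected, unweighted graph.

    edges: list of (u, v) pairs; partition: list of disjoint sets of nodes.
    Assumes every node appears in at least one edge (no zero degrees).
    """
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    total_volume = sum(degree.values())  # equals 2 * number of edges

    module_of = {node: idx for idx, module in enumerate(partition) for node in module}
    volume = [sum(degree[n] for n in module) for module in partition]

    # Count, per module, the edges with exactly one endpoint inside it.
    cut = [0] * len(partition)
    for u, v in edges:
        if module_of[u] != module_of[v]:
            cut[module_of[u]] += 1
            cut[module_of[v]] += 1

    entropy = 0.0
    for idx, module in enumerate(partition):
        # Cost of locating a node inside its module.
        for node in module:
            entropy -= degree[node] / total_volume * math.log2(degree[node] / volume[idx])
        # Cost of entering the module from the rest of the graph.
        entropy -= cut[idx] / total_volume * math.log2(volume[idx] / total_volume)
    return entropy


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
h_good = structural_entropy_2d(edges, [{0, 1, 2}, {3, 4, 5}])
h_bad = structural_entropy_2d(edges, [{0, 1, 3}, {2, 4, 5}])
print(h_good < h_bad)
```

On this toy graph the partition that follows the two triangles gets the lower entropy. A minimizer would search over partitions (for example by greedily merging modules while the entropy decreases), which is why the number of clusters does not have to be supplied in advance.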

[58] Enhanced Arabic Text Retrieval with Attentive Relevance Scoring

Salah Eddine Bekhouche,Azeddine Benlamoudi,Yazid Bounab,Fadi Dornaika,Abdenour Hadid

Main category: cs.CL

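TL;DR: This paper proposes an enhanced dense passage retrieval framework for Arabic that improves retrieval performance through a novel Attentive Relevance Scoring method.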

Details

Motivation: Arabic is challenging for natural language processing and information retrieval because of its complex morphology, optional diacritics, and the coexistence of Modern Standard Arabic with various dialects, yet it remains underrepresented in NLP research and benchmark resources. Method: The authors develop an enhanced dense passage retrieval framework designed specifically for Arabic. Its core is a novel Attentive Relevance Scoring method that replaces standard interaction mechanisms with an adaptive scoring function, combined with pre-trained Arabic language models and architectural refinements. Result: The method significantly increases ranking accuracy when answering Arabic questions. Conclusion: The proposed enhanced dense passage retrieval framework effectively improves Arabic information retrieval. Abstract: Arabic poses a particular challenge for natural language processing (NLP) and information retrieval (IR) due to its complex morphology, optional diacritics and the coexistence of Modern Standard Arabic (MSA) and various dialects. Despite the growing global significance of Arabic, it is still underrepresented in NLP research and benchmark resources. In this paper, we present an enhanced Dense Passage Retrieval (DPR) framework developed specifically for Arabic. At the core of our approach is a novel Attentive Relevance Scoring (ARS) that replaces standard interaction mechanisms with an adaptive scoring function that more effectively models the semantic relevance between questions and passages. Our method integrates pre-trained Arabic language models and architectural refinements to improve retrieval performance and significantly increase ranking accuracy when answering Arabic questions. The code is publicly available at https://github.com/Bekhouche/APR.

[59] Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration

Ante Wang,Yujie Lin,Jingyao Liu,Suhang Wu,Hao Liu,Xinyan Xiao,Jinsong Su

Main category: cs.CL

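TL;DR: This paper proposes a proactive critical thinking paradigm, builds the GSM-MC and GSM-MCE benchmarks, and uses reinforcement learning to substantially improve model performance, advancing effective collaboration between AI systems and users in problem solving.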

Details

Motivation: Prior work on critical thinking has focused on passively rejecting problematic queries and lacks a mechanism for actively obtaining missing information, so a more proactive paradigm is needed to improve the robustness of AI systems and their ability to collaborate with users. Method: The paper introduces the concept of proactive critical thinking, designs two new benchmarks, GSM-MC and GSM-MCE, and validates the approach experimentally with an enhanced reinforcement learning algorithm. Result: Experiments show that although Qwen3 and Llama series models excel at traditional reasoning tasks, they perform poorly on proactive critical thinking; with the enhanced RL algorithm, Qwen3-1.7B's accuracy on GSM-MC rises from 0.15% to 73.98%. Conclusion: The paper proposes a new paradigm of proactive critical thinking that aims to improve how AI systems collaborate with users when information is incomplete or misleading, and shows that an improved reinforcement learning algorithm significantly raises accuracy on the GSM-MC benchmark. Abstract: Critical thinking is essential for building robust AI systems, preventing them from blindly accepting flawed data or biased reasoning. However, prior work has primarily focused on passive critical thinking, where models simply reject problematic queries without taking constructive steps to address user requests. In this work, we introduce proactive critical thinking, a paradigm where models actively seek missing or clarifying information from users to resolve their queries better. To evaluate this capability, we present GSM-MC and GSM-MCE, two novel benchmarks based on GSM8K for assessing mathematical reasoning under incomplete or misleading conditions. GSM-MC contains 1,368 math problems with a key variable deliberately removed, requiring models to identify and request the missing information. GSM-MCE further increases the difficulty by introducing irrelevant details to test robustness against distractions. Experiments on Qwen3 and Llama series models show that, while these models excel in traditional reasoning tasks due to extensive post-training and inference-time scaling, they struggle with proactive critical thinking, especially smaller ones. However, we demonstrate that reinforcement learning (RL) can significantly improve this ability. Using our enhanced RL algorithm, we achieve substantial gains, boosting Qwen3-1.7B's accuracy from 0.15% to 73.98% on GSM-MC. We hope this work advances models that collaborate more effectively with users in problem-solving through proactive critical thinking.

[60] Role-Aware Language Models for Secure and Contextualized Access Control in Organizations

Saeed Almheiri,Yerulan Kongrat,Adrian Santosh,Ruslan Tasmukhanov,Josemaria Vera,Muhammad Dehan Al Kautsar,Fajri Koto

Main category: cs.CL

TL;DR: This paper explores how large language models can be fine-tuned to generate responses based on the access privileges of different organizational roles, using both adapted and synthetic datasets to evaluate performance.

Details

Motivation: As LLMs are increasingly used in enterprise settings, there is a growing need to control model behavior based on user roles to ensure appropriate access to information. Method: The authors explored three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. They evaluated these approaches using two datasets—one adapted from existing instruction-tuning corpora and another synthetically generated to reflect enterprise scenarios. Result: The study demonstrated that role-specific responses can be generated by fine-tuning LLMs, with some approaches proving more robust to challenges like prompt injection and role mismatch. Conclusion: The study concludes that LLMs can be fine-tuned to generate responses that reflect role-specific access privileges, with varying degrees of success across different modeling approaches and organizational structures. Abstract: As large language models (LLMs) are increasingly deployed in enterprise settings, controlling model behavior based on user roles becomes an essential requirement. Existing safety methods typically assume uniform access and focus on preventing harmful or toxic outputs, without addressing role-specific access constraints. In this work, we investigate whether LLMs can be fine-tuned to generate responses that reflect the access privileges associated with different organizational roles. We explore three modeling strategies: a BERT-based classifier, an LLM-based classifier, and role-conditioned generation. To evaluate these approaches, we construct two complementary datasets. The first is adapted from existing instruction-tuning corpora through clustering and role labeling, while the second is synthetically generated to reflect realistic, role-sensitive enterprise scenarios. We assess model performance across varying organizational structures and analyze robustness to prompt injection, role mismatch, and jailbreak attempts.

[61] A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

Shirui Wang,Zhihui Tang,Huaxia Yang,Qiuhong Gong,Tiantian Gu,Hongyang Ma,Yongxin Wang,Wubin Sun,Zeliang Lian,Kehang Mao,Yinan Jiang,Zhicheng Huang,Lingyun Ma,Wenjie Shen,Yajie Ji,Yunhui Tan,Chunbo Wang,Yunlu Gao,Qianling Ye,Rui Lin,Mingyu Chen,Lijuan Niu,Zhihao Wang,Peng Yu,Mengran Lang,Yue Liu,Huimin Zhang,Haitao Shen,Long Chen,Qiguang Zhao,Si-Xuan Liu,Lina Zhou,Hua Gao,Dongqiang Ye,Lingmin Meng,Youtao Yu,Naixin Liang,Jianxiong Wu

Main category: cs.CL

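TL;DR: This study develops CSEDB, a benchmark framework for evaluating the clinical safety and effectiveness of medical large language models, and finds that domain-specific models outperform general-purpose ones.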

Details

Motivation: Large language models (LLMs) hold promise for clinical decision support but face major challenges in safety evaluation and effectiveness validation. Method: The authors develop CSEDB, a multidimensional framework built on clinical expert consensus, and test models on 2,069 open-ended Q&A items developed and reviewed by 32 specialist physicians. Result: Benchmarking six LLMs shows moderate overall performance, with an average total score of 57.2%, a safety score of 54.7%, and an effectiveness score of 62.3%. Conclusion: CSEDB provides a standardized metric for evaluating the clinical application of medical LLMs and can help promote their safe and effective deployment in healthcare settings. Abstract: Large language models (LLMs) hold promise in clinical decision support but face major challenges in safety evaluation and effectiveness validation. We developed the Clinical Safety-Effectiveness Dual-Track Benchmark (CSEDB), a multidimensional framework built on clinical expert consensus, encompassing 30 criteria covering critical areas like critical illness recognition, guideline adherence, and medication safety, with weighted consequence measures. Thirty-two specialist physicians developed and reviewed 2,069 open-ended Q&A items aligned with these criteria, spanning 26 clinical departments to simulate real-world scenarios. Benchmark testing of six LLMs revealed moderate overall performance (average total score 57.2%, safety 54.7%, effectiveness 62.3%), with a significant 13.3% performance drop in high-risk scenarios (p < 0.0001). Domain-specific medical LLMs showed consistent performance advantages over general-purpose models, with relatively higher top scores in safety (0.912) and effectiveness (0.861). The findings of this study not only provide a standardized metric for evaluating the clinical application of medical LLMs, facilitating comparative analyses, risk exposure identification, and improvement directions across different scenarios, but also hold the potential to promote safer and more effective deployment of large language models in healthcare environments.

[62] Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning

Keer Lu,Zheng Liang,Youquan Li,Jiejun Tan,Da Pan,Shusen Zhang,Guosheng Dong,Huang Leng

Main category: cs.CL

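TL;DR: This paper proposes Med-R$^3$, a reinforcement-learning-based medical retrieval-augmented reasoning framework that jointly optimizes retrieval and reasoning and achieves significant performance gains in the medical domain.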

Details

Motivation: Existing methods mostly optimize retrieval or reasoning in isolation and pay little attention to optimizing them jointly, and their reliance on supervised fine-tuning limits generalization. In addition, the reward functions of existing reinforcement learning methods do not adequately meet the needs of the medical domain. Method: Med-R$^3$ introduces reinforcement learning progressively: it first strengthens the model's logical reasoning, then adaptively optimizes its retrieval capability, and finally optimizes retrieval and reasoning jointly. Result: Med-R$^3$ achieves state-of-the-art performance: LLaMA3.1-8B-Instruct + Med-R$^3$ surpasses GPT-4o-mini by 3.93% at a comparable parameter scale, and Qwen2.5-14B with Med-R$^3$ shows a larger gain of 13.53%. Conclusion: By jointly optimizing retrieval and reasoning, the Med-R$^3$ framework achieves state-of-the-art performance in the medical domain and surpasses existing methods. Abstract: In medical scenarios, effectively retrieving external knowledge and leveraging it for rigorous logical reasoning is of significant importance. Despite their potential, existing work has predominantly focused on enhancing either retrieval or reasoning capabilities of the models in isolation, with little attention given to their joint optimization, which leads to limited coordination between the two processes. Additionally, current methods rely heavily on supervised fine-tuning (SFT), which can cause models to memorize existing problem-solving pathways, thereby restricting their generalization ability when confronted with novel problem contexts. Furthermore, while some studies have explored improving retrieval-augmented reasoning in general domains via reinforcement learning, their reward function designs do not adequately capture the specific demands of the medical domain. To address these challenges, we introduce **Med-R$^3$**, a **Med**ical **R**etrieval-augmented **R**easoning framework driven by progressive **R**einforcement learning. In this framework, we first develop the model's ability to perform logical reasoning over medical problems. Subsequently, on the basis of this foundation, we adaptively optimize the retrieval capability to better align with the characteristics of knowledge corpus and external information utilization throughout the reasoning process. Finally, we conduct joint optimization of the model's retrieval and reasoning coordination. Extensive experiments indicate that **Med-R$^3$** could achieve state-of-the-art performances, with LLaMA3.1-8B-Instruct + Med-R$^3$ surpassing closed-source GPT-4o-mini by 3.93% at a comparable parameter scale, while Qwen2.5-14B augmented with Med-R$^3$ shows a more substantial gain of 13.53%.

[63] T-Detect: Tail-Aware Statistical Normalization for Robust Detection of Adversarial Machine-Generated Text

Alva West,Luodan Zhang,Liuliu Zhang,Minjun Zhu,Yixuan Weng,Yue Zhang

Main category: cs.CL

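TL;DR: T-Detect is a new machine-generated text detection method that improves robustness to adversarial text by using a heavy-tailed discrepancy score derived from the Student's t-distribution.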

Details

Motivation: Existing zero-shot detectors usually rely on statistical measures that assume Gaussian distributions, an assumption that breaks down on the heavy-tailed statistics of adversarial or non-native English text. Method: T-Detect replaces standard Gaussian normalization with a heavy-tailed discrepancy score derived from the Student's t-distribution and computes its detection score by normalizing the log-likelihood of a passage. Result: On both the RAID benchmark and the HART dataset, T-Detect improves over strong baselines, raising AUROC by up to 3.9%. Conclusion: T-Detect provides a new, theoretically justified statistical foundation for text detection and demonstrates superior performance under adversarial conditions. Abstract: The proliferation of sophisticated text generation models necessitates the development of robust detection methods capable of identifying machine-generated content, particularly text designed to evade detection through adversarial perturbations. Existing zero-shot detectors often rely on statistical measures that implicitly assume Gaussian distributions, a premise that falters when confronted with the heavy-tailed statistical artifacts characteristic of adversarial or non-native English texts. This paper introduces T-Detect, a novel detection method that fundamentally redesigns the statistical core of curvature-based detectors. Our primary innovation is the replacement of standard Gaussian normalization with a heavy-tailed discrepancy score derived from the Student's t-distribution. This approach is theoretically grounded in the empirical observation that adversarial texts exhibit significant leptokurtosis, rendering traditional statistical assumptions inadequate. T-Detect computes a detection score by normalizing the log-likelihood of a passage against the expected moments of a t-distribution, providing superior resilience to statistical outliers. We validate our approach on the challenging RAID benchmark for adversarial text and the comprehensive HART dataset. Experiments show that T-Detect provides a consistent performance uplift over strong baselines, improving AUROC by up to 3.9% in targeted domains. When integrated into a two-dimensional detection framework (CT), our method achieves state-of-the-art performance, with an AUROC of 0.926 on the Books domain of RAID. Our contributions are a new, theoretically-justified statistical foundation for text detection, an ablation-validated method that demonstrates superior robustness, and a comprehensive analysis of its performance under adversarial conditions. Our code is released at https://github.com/ResearAI/t-detect.
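The difference between Gaussian and heavy-tailed normalization can be illustrated with a small sketch. It is not T-Detect's actual scoring function, which the abstract does not spell out. It only assumes that the observed log-likelihood is centred on an expected value and divided by a spread term, and that the heavy-tailed variant uses the standard deviation of a Student's t-distribution. The function names, the example numbers, and the choice of `nu=5.0` degrees of freedom are all hypothetical.

```python
import math


def gaussian_score(observed, mean, std):
    """Standard z-score normalization, the Gaussian baseline."""
    return (observed - mean) / std


def t_score(observed, mean, scale, nu=5.0):
    """Normalize against the standard deviation of a Student's t-distribution.

    A t-distribution with nu > 2 degrees of freedom and the given scale has
    standard deviation scale * sqrt(nu / (nu - 2)), which is larger than
    `scale`, so extreme values are discounted relative to the z-score.
    """
    if nu <= 2:
        raise ValueError("the variance is only finite for nu > 2")
    return (observed - mean) / (scale * math.sqrt(nu / (nu - 2.0)))


# Hypothetical numbers: observed log-likelihood, expected value, and spread.
z = gaussian_score(-2.0, -3.0, 0.5)
t = t_score(-2.0, -3.0, 0.5, nu=5.0)
print(z, t)
```

As `nu` grows, the t-based score approaches the Gaussian one, so the degrees of freedom act as a knob for how heavy a tail the detector tolerates.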

[64] DiffLoRA: Differential Low-Rank Adapters for Large Language Models

Alexandre Misrahi,Nadezhda Chirkova,Maxime Louis,Vassilina Nikoulina

Main category: cs.CL

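TL;DR: DiffLoRA applies low-rank adapters to the differential attention mechanism; it falls short of other parameter-efficient fine-tuning methods on most tasks but beats LoRA by 11 points on HumanEval.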

Details

Motivation: The goal is to improve Transformer performance by cancelling out noise through a denoiser attention mechanism while remaining parameter-efficient. Method: The authors propose DiffLoRA, a parameter-efficient adaptation of the differential attention mechanism that places low-rank adapters on both the positive and negative attention terms. Result: DiffLoRA performs unremarkably on most tasks but beats LoRA by 11 points on HumanEval. Conclusion: DiffLoRA performs well in specific domains but its overall performance is limited; the attention mechanism could be optimized further in future work. Abstract: Differential Transformer has recently been proposed to improve performance in Transformer models by canceling out noise through a denoiser attention mechanism. In this work, we introduce DiffLoRA, a parameter-efficient adaptation of the differential attention mechanism, with low-rank adapters on both positive and negative attention terms. This approach retains the efficiency of LoRA while aiming to benefit from the performance gains of differential attention. We evaluate DiffLoRA across a broad range of NLP tasks, including general benchmarks, many-shot in-context learning, RAG, and long-context tests. We observe that, although DiffLoRA falls short of other parameter-efficient fine-tuning methods in most evaluation tasks, it shows interesting results in certain domains (+11 pts on LoRA for HumanEval). We analyze the attention patterns post-finetuning to identify the reasons for this behavior.
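A rough sketch of the two ingredients named in the abstract may help. Differential attention subtracts a second softmax attention map, scaled by a factor λ, from the first. A low-rank adapter adds a rank-r update B·A to a frozen weight. How DiffLoRA wires the adapters into the positive and negative terms is not described in the abstract, so the helper names, the pre-computed score matrices, and the fixed `lam` below are assumptions made purely for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(base, lora_b, lora_a, scale=1.0):
    """Effective weight W + scale * (B @ A) for a low-rank update."""
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base, delta)]


def differential_attention(scores_pos, scores_neg, values, lam):
    """Compute (softmax(scores_pos) - lam * softmax(scores_neg)) @ values.

    The scores are assumed to be already scaled query-key products.
    """
    weights = [[p - lam * n for p, n in zip(softmax(rp), softmax(rn))]
               for rp, rn in zip(scores_pos, scores_neg)]
    return weights, matmul(weights, values)


# Hypothetical rank-1 adapter on a 2 x 2 frozen weight.
adapted = lora_weight([[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]], [[0.0, 0.5]])

# Hypothetical 2-token example with a uniform "noise" map on the negative term.
weights, output = differential_attention(
    scores_pos=[[2.0, 0.0], [0.0, 2.0]],
    scores_neg=[[0.0, 0.0], [0.0, 0.0]],
    values=[[1.0, 0.0], [0.0, 1.0]],
    lam=0.5,
)
print(adapted, weights)
```

Because each softmax row sums to 1, every row of the differential map sums to 1 - λ. That is a quick sanity check when experimenting with this kind of attention.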

Salam Thabet Doghmash,Motaz Saad

Main category: cs.CL

TL;DR: This research identifies the best model for detecting hate speech in Arabic text and creatively treats text cleaning as a machine translation task to mask offensive words.

Details

Motivation: Hate speech identification in social media is increasingly important. The research focuses on addressing this issue specifically for Arabic text, both by detecting hate speech and by cleaning text of such speech. Method: The researchers used deep learning models and transformers to detect hate speech. For text cleaning, they used machine translation techniques to replace offensive words with stars. Result: The best model for hate speech detection achieved a 92% Macro F1 score and 95% accuracy. The text cleaning experiment achieved a BLEU score of 0.3 with 1-gram, which the authors consider promising compared to current machine translation systems. Conclusion: The research identified the best deep learning model for hate speech detection in Arabic text and treated text cleaning as a machine translation task to mask offensive words. Abstract: Hate speech identification in social media has become an increasingly important issue in recent years. In this research, we address two problems: 1) to detect hate speech in Arabic text, 2) to clean a given text from hate speech. The meaning of cleaning here is replacing each bad word with stars based on the number of letters for each word. Regarding the first problem, we conduct several experiments using deep learning models and transformers to determine the best model in terms of the F1 score. Regarding the second problem, we consider it as a machine translation task, where the input is a sentence containing dirty text and the output is the same sentence with the dirty text masked. The presented methods achieve the best model in hate speech detection with a 92% Macro F1 score and 95% accuracy. Regarding the text cleaning experiment, the best result in the hate speech masking model reached 0.3 in BLEU score with 1-gram, which is a good result compared with state-of-the-art machine translation systems.
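The cleaning target described in the abstract, each bad word replaced by as many stars as it has letters, can be written as a simple rule. The sketch below is a lexicon-based baseline for producing such targets, not the paper's translation model. The function name, the whitespace tokenization, and the placeholder lexicon are assumptions.

```python
def mask_offensive_words(sentence, lexicon):
    """Replace every whitespace-separated token found in `lexicon` with stars.

    Each masked token becomes one star per character. Tokens with attached
    punctuation will not match the lexicon in this simplified version, and
    runs of whitespace are collapsed to single spaces.
    """
    return " ".join("*" * len(tok) if tok in lexicon else tok
                    for tok in sentence.split())


# Hypothetical placeholder lexicon and sentence.
masked = mask_offensive_words("you are a badword person", {"badword"})
print(masked)  # you are a ******* person
```

Pairs of original and masked sentences produced this way are the kind of source and target data a sequence-to-sequence model would need if masking is framed as translation.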

[66] Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs

Nasim Shirvani-Mahdavi,Devin Wingfield,Amin Ghasemi,Chengkai Li

Main category: cs.CL

TL;DR: This paper explores using large language models to generate understandable explanations for complex logical rules in knowledge graphs, achieving promising results.

Details

Motivation: Logical rules in knowledge graphs are hard for humans to understand due to their complexity and unique labeling. Generating natural language explanations can improve understanding, reasoning, and error detection. Method: Logical rules were extracted using the AMIE 3.5.1 algorithm from datasets FB15k-237, FB-CVT-REV, and FB+CVT-REV. Various prompting strategies were tested, and both human and automated evaluations were conducted. Result: Promising results were achieved in explanation correctness and clarity, though some challenges persist. Conclusion: The study shows that large language models can effectively generate natural language explanations for logical rules, though challenges remain. Abstract: Knowledge graphs (KGs) often contain sufficient information to support the inference of new facts. Identifying logical rules not only improves the completeness of a knowledge graph but also enables the detection of potential errors, reveals subtle data patterns, and enhances the overall capacity for reasoning and interpretation. However, the complexity of such rules, combined with the unique labeling conventions of each KG, can make them difficult for humans to understand. In this paper, we explore the potential of large language models to generate natural language explanations for logical rules. Specifically, we extract logical rules using the AMIE 3.5.1 rule discovery algorithm from the benchmark dataset FB15k-237 and two large-scale datasets, FB-CVT-REV and FB+CVT-REV. We examine various prompting strategies, including zero- and few-shot prompting, the inclusion of variable entity types, and chain-of-thought reasoning. We conduct a comprehensive human evaluation of the generated explanations based on correctness, clarity, and hallucination, and also assess the use of large language models as automatic judges. Our results demonstrate promising performance in terms of explanation correctness and clarity, although several challenges remain for future research. All scripts and data used in this study are publicly available at https://github.com/idirlab/KGRule2NL.

[67] Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities

Yunxiang Yan,Tomohiro Sawada,Kartik Goyal

Main category: cs.CL

TL;DR: This paper introduces a cascaded question disclosure framework for evaluating LLMs, offering a more accurate and scalable method to assess problem-solving capabilities compared to standard QA benchmarks.

Details

Motivation: Standard QA benchmarks are indirect and may overestimate performance differences between LLMs. A more accurate and scalable evaluation method is needed to assess underlying problem-solving capabilities. Method: The proposed framework utilizes cascaded question disclosure, collecting model responses in a stagewise manner with partial information revealed at each stage to elicit generalized reasoning. Result: The approach enables better model comparisons, induces improved intermediate reasoning traces, and reduces performance gaps observed in traditional QA settings across diverse reasoning and knowledge-heavy datasets. Conclusion: The cascaded question disclosure framework provides a more accurate and holistic evaluation of LLMs' problem-solving capabilities compared to standard QA paradigms, offering better model comparisons and narrowing the observed performance gap. Abstract: While question-answering (QA) benchmark performance is an automatic and scalable method to compare LLMs, it is an indirect method of evaluating their underlying problem-solving capabilities. Therefore, we propose a holistic and generalizable framework based on cascaded question disclosure that provides a more accurate estimate of the models' problem-solving capabilities while maintaining the scalability and automation. This approach collects model responses in a stagewise manner with each stage revealing partial information about the question designed to elicit generalized reasoning in LLMs. We find that our approach not only provides a better comparison between LLMs, but also induces better intermediate traces in models compared to the standard QA paradigm. We empirically verify this behavior on diverse reasoning and knowledge-heavy QA datasets by comparing LLMs of varying sizes and families. Our approach narrows the performance gap observed in the standard QA evaluation settings, indicating that the prevalent indirect QA paradigm of evaluation overestimates the differences in performance between models. We further validate our findings by extensive ablation studies.
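The stagewise collection loop can be sketched generically. The abstract does not say what each stage reveals or whether earlier responses are fed back to the model, so both choices below are assumptions. `cascaded_query`, `echo_model`, and the example stages are hypothetical stand-ins, with `echo_model` replacing a real LLM call.

```python
def cascaded_query(model, stages):
    """Reveal `stages` one at a time and collect the response at each stage.

    model: callable mapping a prompt string to a response string.
    stages: ordered pieces of the question, each disclosing more information.
    Earlier responses are appended to the context (an assumption of this sketch).
    """
    transcript = []
    context = ""
    for piece in stages:
        context += piece + "\n"
        response = model(context)
        transcript.append(response)
        context += response + "\n"
    return transcript


def echo_model(prompt):
    """Stand-in for an LLM call: reports how many lines of context it saw."""
    return "lines=" + str(len(prompt.splitlines()))


stages = ["A train travels 60 km in 1.5 hours.", "What is its average speed?"]
transcript = cascaded_query(echo_model, stages)
print(transcript)  # ['lines=1', 'lines=3']
```

Scoring the final-stage response, and optionally the intermediate ones, then gives one evaluation record per question.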

cs.CV

[68] CHECK-MAT: Checking Hand-Written Mathematical Answers for the Russian Unified State Exam

Ruslan Khrulev

Main category: cs.CV

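TL;DR: This paper proposes a new benchmark, the EGE-Math Solutions Assessment Benchmark, for evaluating how well vision-language models can assess hand-written mathematical solutions.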

Details

Motivation: Existing benchmarks focus mainly on problem solving, whereas this work centres on understanding student solutions, identifying mistakes, and assigning grades according to fixed criteria. Method: The authors compile 122 scanned solutions from the Russian Unified State Exam (EGE) together with official expert grades and evaluate seven modern VLMs from Google, OpenAI, Arcee AI, and Alibaba Cloud in three inference modes. Result: The results reveal current limitations in mathematical reasoning and alignment with human grading rubrics, opening new research directions in AI-assisted assessment. Conclusion: The proposed benchmark offers a new way to evaluate how well vision-language models assess hand-written mathematical solutions. Abstract: This paper introduces a novel benchmark, EGE-Math Solutions Assessment Benchmark, for evaluating Vision-Language Models (VLMs) on their ability to assess hand-written mathematical solutions. Unlike existing benchmarks that focus on problem solving, our approach centres on understanding student solutions, identifying mistakes, and assigning grades according to fixed criteria. We compile 122 scanned solutions from the Russian Unified State Exam (EGE) together with official expert grades, and evaluate seven modern VLMs from Google, OpenAI, Arcee AI, and Alibaba Cloud in three inference modes. The results reveal current limitations in mathematical reasoning and human-rubric alignment, opening new research avenues in AI-assisted assessment. The code is available at https://github.com/Karifannaa/Auto-check-EGE-math

[69] Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction

Zhensheng Yuan,Haozhi Huang,Zhen Xiong,Di Wang,Guanghua Yang

Main category: cs.CV

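TL;DR: This paper introduces a new framework for urban-scale scene reconstruction and real-time rendering that improves efficiency and quality and handles appearance variations.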

Details

Motivation: The work targets the challenges of appearance variation and computational efficiency in large-scale scene reconstruction. Method: The method comprises a visibility-based image selection strategy, a controllable level-of-detail strategy, an appearance transformation module, and enhancement modules such as depth regularization, scale regularization, and antialiasing. Result: Experiments show that the method outperforms previous approaches in both efficiency and quality and effectively reconstructs urban-scale scenes. Conclusion: The paper proposes an efficient framework that enables fast reconstruction and real-time rendering of urban-scale scenes while remaining robust to appearance variations. Abstract: We present a framework that enables fast reconstruction and real-time rendering of urban-scale scenes while maintaining robustness against appearance variations across multi-view captures. Our approach begins with scene partitioning for parallel training, employing a visibility-based image selection strategy to optimize training efficiency. A controllable level-of-detail (LOD) strategy explicitly regulates Gaussian density under a user-defined budget, enabling efficient training and rendering while maintaining high visual fidelity. The appearance transformation module mitigates the negative effects of appearance inconsistencies across images while enabling flexible adjustments. Additionally, we utilize enhancement modules, such as depth regularization, scale regularization, and antialiasing, to improve reconstruction fidelity. Experimental results demonstrate that our method effectively reconstructs urban-scale scenes and outperforms previous approaches in both efficiency and quality. The source code is available at: https://yzslab.github.io/REUrbanGS.

[70] Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction

Giuseppe Cartella,Vittorio Cuculo,Alessandro D'Amelio,Marcella Cornia,Giuseppe Boccignone,Rita Cucchiara

Main category: cs.CV

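TL;DR: This work introduces ScanDiff, a new architecture combining diffusion models with Vision Transformers that generates diverse and realistic scanpaths and advances the state of gaze prediction.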

Details

Motivation: Predicting human gaze scanpaths is crucial for understanding visual attention, but existing methods fail to capture the variability of human visual exploration. Method: The paper proposes ScanDiff, a new architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Result: Experiments show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. Conclusion: ScanDiff better captures the complexity of human visual behavior and pushes gaze prediction research forward. Abstract: Predicting human gaze scanpaths is crucial for understanding visual attention, with applications in human-computer interaction, autonomous systems, and cognitive robotics. While deep learning models have advanced scanpath prediction, most existing approaches generate averaged behaviors, failing to capture the variability of human visual exploration. In this work, we present ScanDiff, a novel architecture that combines diffusion models with Vision Transformers to generate diverse and realistic scanpaths. Our method explicitly models scanpath variability by leveraging the stochastic nature of diffusion models, producing a wide range of plausible gaze trajectories. Additionally, we introduce textual conditioning to enable task-driven scanpath generation, allowing the model to adapt to different visual search objectives. Experiments on benchmark datasets show that ScanDiff surpasses state-of-the-art methods in both free-viewing and task-driven scenarios, producing more diverse and accurate scanpaths. These results highlight its ability to better capture the complexity of human visual behavior, pushing forward gaze prediction research. Source code and models are publicly available at https://aimagelab.github.io/ScanDiff.

[71] Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging

Krishan Agyakari Raja Babu,Om Prabhu,Annu,Mohanasankar Sivaprakasam

Main category: cs.CV

TL;DR: This study shows that deep learning-based super-resolution, especially using SRResNet, can enhance poor-quality echocardiograms, significantly improving diagnostic accuracy and making AI-assisted cardiac care feasible in resource-limited settings.

Details

Motivation: Automated cardiac interpretation in resource-constrained settings is hindered by poor-quality echocardiographic imaging. While super-resolution techniques have enhanced MRI and CT scans, their application to echocardiography, a widely accessible but noise-prone modality, remains underexplored. Method: Using the CAMUS dataset, the researchers stratified samples by image quality and applied two deep learning-based super-resolution models (SRGAN and SRResNet) to enhance low-quality 2D echocardiograms. They evaluated performance on two classification tasks: 2CH vs. 4CH view classification and ED vs. ES phase classification. Result: Significant improvements in classification accuracy were observed for low-quality images after enhancement, particularly with SRResNet, which also provided computational efficiency. Conclusion: The study concludes that super-resolution techniques, particularly SRResNet, can effectively enhance poor-quality echocardiographic images, improving diagnostic classification accuracy and serving as a viable tool for AI-assisted care in resource-constrained settings. Abstract: Automated cardiac interpretation in resource-constrained settings (RCS) is often hindered by poor-quality echocardiographic imaging, limiting the effectiveness of downstream diagnostic models. While super-resolution (SR) techniques have shown promise in enhancing magnetic resonance imaging (MRI) and computed tomography (CT) scans, their application to echocardiography, a widely accessible but noise-prone modality, remains underexplored. In this work, we investigate the potential of deep learning-based SR to improve classification accuracy on low-quality 2D echocardiograms. Using the publicly available CAMUS dataset, we stratify samples by image quality and evaluate two clinically relevant tasks of varying complexity: a relatively simple Two-Chamber vs. Four-Chamber (2CH vs. 4CH) view classification and a more complex End-Diastole vs. End-Systole (ED vs. ES) phase classification. We apply two widely used SR models, Super-Resolution Generative Adversarial Network (SRGAN) and Super-Resolution Residual Network (SRResNet), to enhance poor-quality images and observe significant gains in performance metrics, particularly with SRResNet, which also offers computational efficiency. Our findings demonstrate that SR can effectively recover diagnostic value in degraded echo scans, making it a viable tool for AI-assisted care in RCS, achieving more with less.

[72] Adaptive Time-step Training for Enhancing Spike-Based Neural Radiance Fields

Ranxi Lin,Canming Yao,Jiayi Li,Weihang Liu,Xin Lou,Pingqiang Zhou

Main category: cs.CV

TL;DR: This paper proposes PATA, a spike-based NeRF framework with dynamic time step adjustment, significantly reducing computational costs during inference while maintaining rendering quality.

Details

Motivation: NeRF-based models are computationally expensive due to dense point sampling, limiting their use in edge computing scenarios. Spiking Neural Networks (SNNs) offer energy-efficient alternatives, prompting the need for a more efficient NeRF framework. Method: The authors propose a spike-based NeRF framework with a dynamic time step training strategy called Pretrain-Adaptive Time-step Adjustment (PATA), which automatically balances rendering quality and time step length during training. Their approach is based on the Instant-NGP architecture. Result: Experimental results show that PATA reduces inference time steps by 64% and running power by 61.55% while preserving rendering quality. Conclusion: The proposed PATA framework successfully reduces the computational resources required during inference while maintaining rendering fidelity, making it suitable for resource-constrained scenarios. Abstract: Neural Radiance Fields (NeRF)-based models have achieved remarkable success in 3D reconstruction and rendering tasks. However, during both training and inference, these models rely heavily on dense point sampling along rays from multiple viewpoints, resulting in a surge in floating-point operations and severely limiting their use in resource-constrained scenarios like edge computing. Spiking Neural Networks (SNNs), which communicate via binary spikes over discrete time steps, offer a promising alternative due to their energy-efficient nature. Given the inherent variability in scene scale and texture complexity in neural rendering and the prevailing practice of training separate models per scene, we propose a spike-based NeRF framework with a dynamic time step training strategy, termed Pretrain-Adaptive Time-step Adjustment (PATA). This approach automatically explores the trade-off between rendering quality and time step length during training. Consequently, it enables scene-adaptive inference with variable time steps and reduces the additional consumption of computational resources in the inference process. Anchoring to the established Instant-NGP architecture, we evaluate our method across diverse datasets. The experimental results show that PATA can preserve rendering fidelity while reducing inference time steps by 64% and running power by 61.55%.

[73] Early Goal-Guided Multi-Scale Fusion for Real-Time Vision-Language Driving

Santosh Patapati,Trisanth Srinivasan

Main category: cs.CV

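TL;DR: NovaDrive is a single-branch vision-language architecture for autonomous driving that uses an efficient attention mechanism and a new smoothness loss to achieve higher success rate, path efficiency, and safety in complex driving scenarios.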

Details

Motivation: Autonomous vehicles must react within milliseconds while reasoning about road geometry and traffic intent to handle complex situations, which calls for an efficient architecture that can process multiple inputs and navigate safely and efficiently. Method: NovaDrive uses a single-branch vision-language architecture in which a lightweight, two-stage cross-attention block first aligns waypoint tokens with the HD map and then refines attention over fine-grained image and depth patches. A novel smoothness loss discourages abrupt steering and speed changes. Result: On the nuScenes / Waymo subset of the MD-NEX Outdoor benchmark, NovaDrive raises success rate to 84% (+4%), raises path efficiency (SPL) to 0.66 (+0.11), and reduces collision frequency from 2.6% to 1.2% (-1.4%). Conclusion: NovaDrive's design eliminates the need for recurrent memory, and its lightweight cross-attention block and novel smoothness loss yield clear gains in success rate and efficiency on autonomous driving tasks. Abstract: Autonomous vehicles must react in milliseconds while reasoning about road geometry and traffic intent to navigate complex situations. We introduce NovaDrive, a single-branch vision-language architecture that jointly processes front-camera images, HD-map tiles, LiDAR depth, and textual waypoints. A lightweight, two-stage cross-attention block first aligns waypoint tokens with the HD map, then refines attention over fine-grained image and depth patches. Coupled with a novel smoothness loss that discourages abrupt steering and speed changes, this design eliminates the need for recurrent memory. We fine-tune the top 15 layers of an 11B LLaMA-3.2 vision-language backbone, enabling real-time inference. On the nuScenes / Waymo subset of the MD-NEX Outdoor benchmark, NovaDrive raises success rate to 84% (+4%), boosts path-efficiency (SPL) to 0.66 (+0.11), and reduces collision frequency from 2.6% to 1.2% (-1.4%) relative to the previous state-of-the-art. Our ablations confirm that waypoint tokens, partial VLM fine-tuning, and the cross-attention fusion contribute the most to these gains. Beyond safety, NovaDrive's shorter routes (resulting from the novel smoothness loss) translate to lower fuel or battery usage, pointing toward leaner, more easily updated driving stacks. NovaDrive can be extended to other embodied-AI domains as well.
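
The abstract does not give the form of the smoothness loss. A minimal sketch, assuming it penalizes squared changes between consecutive steering and speed commands; the function name, the weights, and the averaging are illustrative assumptions, not NovaDrive's implementation:

```python
def smoothness_loss(steering, speed, w_steer=1.0, w_speed=1.0):
    """Weighted mean squared change between consecutive control commands.

    steering, speed: equal-length sequences of per-step commands.
    w_steer, w_speed: hypothetical weights for the two terms.
    """
    if len(steering) != len(speed):
        raise ValueError("steering and speed must have the same length")
    steps = len(steering) - 1
    if steps < 1:
        return 0.0  # a single command has no change to penalize
    steer_term = sum((steering[t + 1] - steering[t]) ** 2 for t in range(steps)) / steps
    speed_term = sum((speed[t + 1] - speed[t]) ** 2 for t in range(steps)) / steps
    return w_steer * steer_term + w_speed * speed_term


# A gentle steering ramp is penalized less than an abrupt swerve at the same speed.
gentle = smoothness_loss([0.0, 0.1, 0.2], [5.0, 5.0, 5.0])
abrupt = smoothness_loss([0.0, 0.5, -0.5], [5.0, 5.0, 5.0])
print(gentle < abrupt)
```

A penalty of this kind would be added to the main driving objective, so that trajectories with similar task performance are ranked by how smoothly they are driven.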

[74] Reference-Guided Diffusion Inpainting For Multimodal Counterfactual Generation

Alexandru Buburuzan

Main category: cs.CV

TL;DR: The paper proposes MObI for autonomous driving and AnydoorMed for medical imaging, two diffusion-based synthetic data generation methods that enable realistic, controllable, and multimodal data creation for safety-critical testing scenarios.

Details

Motivation: Safety-critical applications like autonomous driving and medical image analysis require extensive multimodal testing data. Synthetic data generation is preferred due to the cost and complexity of acquiring real-world data, but it must offer high realism and controllability to be effective. This work aims to address these challenges by introducing methods that enable the creation of realistic and controllable synthetic data. Method: The paper introduces two novel methods, MObI for autonomous driving and AnydoorMed for medical image analysis. Both methods utilize diffusion-based models for reference-guided inpainting, with MObI focusing on 3D bounding box conditioning for object insertion in multimodal scenes, and AnydoorMed concentrating on anomaly inpainting in mammography scans while preserving structural integrity. Result: MObI successfully enables object insertion into multimodal scenes (camera and lidar) with semantic consistency and spatial accuracy. AnydoorMed achieves detailed anomaly inpainting in mammography scans while maintaining structural integrity and blending with surrounding tissue. Both methods demonstrate adaptability of foundation models to different perceptual modalities. Conclusion: The paper concludes that foundation models for reference-guided inpainting in natural images can be adapted to various perceptual modalities, enabling the creation of highly realistic, controllable, and multimodal counterfactual scenarios for safety-critical applications. Abstract: Safety-critical applications, such as autonomous driving and medical image analysis, require extensive multimodal data for rigorous testing. Synthetic data methods are gaining prominence due to the cost and complexity of gathering real-world data, but they demand a high degree of realism and controllability to be useful. 
This work introduces two novel methods for synthetic data generation in autonomous driving and medical image analysis, namely MObI and AnydoorMed, respectively. MObI is a first-of-its-kind framework for Multimodal Object Inpainting that leverages a diffusion model to produce realistic and controllable object inpaintings across perceptual modalities, demonstrated simultaneously for camera and lidar. Given a single reference RGB image, MObI enables seamless object insertion into existing multimodal scenes at a specified 3D location, guided by a bounding box, while maintaining semantic consistency and multimodal coherence. Unlike traditional inpainting methods that rely solely on edit masks, this approach uses 3D bounding box conditioning to ensure accurate spatial positioning and realistic scaling. AnydoorMed extends this paradigm to the medical imaging domain, focusing on reference-guided inpainting for mammography scans. It leverages a diffusion-based model to inpaint anomalies with impressive detail preservation, maintaining the reference anomaly's structural integrity while semantically blending it with the surrounding tissue. Together, these methods demonstrate that foundation models for reference-guided inpainting in natural images can be readily adapted to diverse perceptual modalities, paving the way for the next generation of systems capable of constructing highly realistic, controllable and multimodal counterfactual scenarios.

[75] Vision-Language Fusion for Real-Time Autonomous Driving: Goal-Centered Cross-Attention of Camera, HD-Map, & Waypoints

Santosh Patapati,Trisanth Srinivasan,Murari Ambati

Main category: cs.CV

TL;DR: XYZ-Drive is a new autonomous driving model that fuses waypoint, image, and map information to improve navigation accuracy and efficiency.

Details

Motivation: Autonomous cars need both geometric accuracy and semantic understanding to navigate complex environments, yet most systems handle the two separately. Method: The authors propose XYZ-Drive, a single-stream model in which a lightweight goal-centered cross-attention layer fuses waypoint, image, and map information before the fused tokens enter a partially fine-tuned LLaMA-3.2 11B model. Result: On the MD-NEX Outdoor-Driving benchmark, XYZ-Drive attains 95% success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15% and halving collisions. Conclusion: Early, token-level fusion of intent and map layout enables accurate, transparent, real-time driving. Abstract: Autonomous cars need geometric accuracy and semantic understanding to navigate complex environments, yet most stacks handle them separately. We present XYZ-Drive, a single vision-language model that reads a front-camera frame, a 25m $\times$ 25m overhead map, and the next waypoint, then outputs steering and speed. A lightweight goal-centered cross-attention layer lets waypoint tokens highlight relevant image and map patches, supporting both action and textual explanations, before the fused tokens enter a partially fine-tuned LLaMA-3.2 11B model. On the MD-NEX Outdoor-Driving benchmark XYZ-Drive attains 95% success and 0.80 Success weighted by Path Length (SPL), surpassing PhysNav-DG by 15% and halving collisions, all while significantly improving efficiency by using only a single branch. Sixteen ablations explain the gains. Removing any modality (vision, waypoint, map) drops success by up to 11%, confirming their complementary roles and rich connections. Replacing goal-centered attention with simple concatenation cuts 3% in performance, showing query-based fusion injects map knowledge more effectively. Keeping the transformer frozen loses 5%, showing the importance of fine-tuning when applying VLMs for specific tasks such as autonomous driving. Coarsening map resolution from 10 cm to 40 cm blurs lane edges and raises crash rate. Overall, these results demonstrate that early, token-level fusion of intent and map layout enables accurate, transparent, real-time driving.
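
Success weighted by Path Length (SPL) is commonly defined in the embodied-navigation literature as the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is the success flag, l_i the shortest-path length, and p_i the length of the path actually driven. The sketch below follows that common definition; whether the MD-NEX benchmark computes SPL in exactly this way is not stated in the abstract, and the episode tuple layout is an assumption.

```python
def spl(episodes):
    """Success weighted by Path Length over a list of episodes.

    Each episode is a tuple (success, shortest_len, actual_len):
      success      -- True if the goal was reached
      shortest_len -- length of the shortest feasible path (must be > 0)
      actual_len   -- length of the path the agent actually took
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest_len, actual_len in episodes:
        if shortest_len <= 0:
            raise ValueError("shortest_len must be positive")
        if success:
            # A detour lowers the score; an optimal path scores 1.0.
            total += shortest_len / max(actual_len, shortest_len)
    return total / len(episodes)


episodes = [
    (True, 10.0, 10.0),   # success on the optimal path
    (True, 10.0, 20.0),   # success with a path twice as long
    (False, 10.0, 5.0),   # failure contributes zero
]
print(spl(episodes))  # 0.5
```

Because failures score zero and detours are discounted, SPL can never exceed the plain success rate, which is consistent with the reported 0.80 SPL at 95% success.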

[76] Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model

Dmitry Demidov,Zaigham Zaheer,Omkar Thawakar,Salman Khan,Fahad Shahbaz Khan

Main category: cs.CV

TL;DR: E-FineR is a training-free method that leverages LLMs and VLMs for open-set, fine-grained image classification, achieving high performance and interpretability without predefined labels or human input.

Details

Motivation: Traditional approaches to fine-grained image classification are limited by fixed vocabularies and closed-set paradigms, making them less scalable in real-world settings with emerging classes. Recent methods underutilize LLMs and depend on unrefined guessed class names. Method: E-FineR combines large language models (LLMs) and vision-language models (VLMs) to enable open-set recognition without predefined class labels or training, focusing on thorough analysis and refinement of class names generated by LLMs. Result: E-FineR achieves state-of-the-art performance in fine-grained visual recognition while supporting zero-shot and few-shot classification without training or human intervention, shifting image classification towards flexible, language-driven understanding. Conclusion: Enriched-FineR (E-FineR) presents a training-free method for fine-grained visual recognition, offering state-of-the-art performance, interpretability, and adaptability in real-world scenarios with limited expert annotations. Abstract: Fine-grained image classification, the task of distinguishing between visually similar subcategories within a broader category (e.g., bird species, car models, flower types), is a challenging computer vision problem. Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms, limiting their scalability and adaptability in real-world settings where novel classes frequently emerge. Recent research has demonstrated that combining large language models (LLMs) with vision-language models (VLMs) makes open-set recognition possible without the need for predefined class labels. However, the existing methods are often limited in harnessing the power of LLMs at the classification phase, and also rely heavily on the guessed class names provided by an LLM without thorough analysis and refinement. 
To address these bottlenecks, we propose our training-free method, Enriched-FineR (or E-FineR for short), which demonstrates state-of-the-art results in fine-grained visual recognition while also offering greater interpretability, highlighting its strong potential in real-world scenarios and new domains where expert annotations are difficult to obtain. Additionally, we demonstrate the application of our proposed approach to zero-shot and few-shot classification, where it demonstrated performance on par with the existing SOTA while being training-free and not requiring human interventions. Overall, our vocabulary-free framework supports the shift in image classification from rigid label prediction to flexible, language-driven understanding, enabling scalable and generalizable systems for real-world applications. Well-documented code is available on https://github.com/demidovd98/e-finer.

[77] Details Matter for Indoor Open-vocabulary 3D Instance Segmentation

Sanghun Jung,Jingjing Zheng,Ke Zhang,Nan Qiao,Albert Y. C. Chen,Lu Xia,Chi Liu,Yuyin Sun,Xiao Zeng,Hsiang-Wei Huang,Byron Boots,Min Sun,Cheng-Hao Kuo

Main category: cs.CV

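TL;DR: This paper proposes a new open-vocabulary 3D instance segmentation method that combines several complementary concepts in a two-stage scheme (proposal generation and classification) and achieves state-of-the-art performance.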

Details

Motivation: Open-vocabulary 3D instance segmentation (OV-3DIS) typically relies on vision-language models (VLMs) to generate and classify 3D instance proposals, and the various concepts proposed in prior work are complementary rather than mutually exclusive. The paper therefore aims for a stronger solution by combining these concepts and refining them to address key challenges. Method: The solution follows a two-stage scheme: 3D proposal generation and instance classification. For proposal generation, robust 3D tracking-based proposal aggregation generates 3D proposals, and overlapped or partial proposals are removed by iterative merging/removal. For classification, the standard CLIP model is replaced with Alpha-CLIP, which takes object masks as an alpha channel to reduce background noise, and a standardized maximum similarity (SMS) score normalizes text-to-proposal similarity to filter out false positives and boost precision. Result: The framework achieves state-of-the-art performance on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method. Conclusion: Carefully combining complementary concepts and refining them to address key challenges yields a new state-of-the-art OV-3DIS solution, with AP and AR results on ScanNet200 and S3DIS that even surpass an end-to-end closed-vocabulary method. Abstract: Unlike closed-vocabulary 3D instance segmentation that is often trained end-to-end, open-vocabulary 3D instance segmentation (OV-3DIS) often leverages vision-language models (VLMs) to generate 3D instance proposals and classify them. While various concepts have been proposed in existing research, we observe that these individual concepts are not mutually exclusive but complementary. In this paper, we propose a new state-of-the-art solution for OV-3DIS by carefully designing a recipe to combine the concepts together and refining them to address key challenges. Our solution follows the two-stage scheme: 3D proposal generation and instance classification. We employ robust 3D tracking-based proposal aggregation to generate 3D proposals and remove overlapped or partial proposals by iterative merging/removal. For the classification stage, we replace the standard CLIP model with Alpha-CLIP, which incorporates object masks as an alpha channel to reduce background noise and obtain object-centric representation. Additionally, we introduce the standardized maximum similarity (SMS) score to normalize text-to-proposal similarity, effectively filtering out false positives and boosting precision. Our framework achieves state-of-the-art performance on ScanNet200 and S3DIS across all AP and AR metrics, even surpassing an end-to-end closed-vocabulary method.
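
The abstract names the SMS score but does not define it. One plausible reading, shown here purely as an assumption, is to z-score each class's text-to-proposal similarities across all proposals and then keep each proposal's maximum standardized value; the function names and the threshold are hypothetical, and the paper's actual formulation may differ.

```python
import statistics


def sms_scores(sim):
    """Return (best_class, standardized_score) for every proposal.

    sim[p][c] is the similarity between proposal p and class text c.
    Assumed reading: standardize per class across proposals, then take
    the maximum over classes for each proposal.
    """
    n_classes = len(sim[0])
    columns = [[row[c] for row in sim] for c in range(n_classes)]
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    results = []
    for row in sim:
        z = [
            (row[c] - means[c]) / stds[c] if stds[c] > 0 else 0.0
            for c in range(n_classes)
        ]
        best = max(range(n_classes), key=lambda c: z[c])
        results.append((best, z[best]))
    return results


def filter_proposals(sim, threshold):
    """Indices of proposals whose standardized score reaches the threshold."""
    return [i for i, (_, score) in enumerate(sms_scores(sim)) if score >= threshold]


# Three proposals, two class prompts (illustrative numbers only).
sim = [[0.9, 0.1], [0.3, 0.2], [0.3, 0.3]]
print(filter_proposals(sim, threshold=1.0))  # [0, 2]
```

Standardizing before thresholding means a single cutoff can be applied across classes whose raw similarity ranges differ, which is one way a normalized score could help reject false positives.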

[78] X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention

Xiaochen Zhao,Hongyi Xu,Guoxian Song,You Xie,Chenxu Zhang,Xiu Li,Linjie Luo,Jinli Suo,Yebin Liu

Main category: cs.CV

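TL;DR: X-NeMo is a novel zero-shot diffusion-based portrait animation pipeline that mitigates identity leakage through a 1D identity-agnostic latent motion descriptor and cross-attention, producing high-quality portrait animation.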

Details

Motivation: Prior approaches suffer from identity leakage and have difficulty capturing subtle and extreme expressions. Method: X-NeMo introduces a fully end-to-end training framework that distills a 1D identity-agnostic latent motion descriptor from the driving image and controls motion through cross-attention. A dual GAN decoder, together with spatial and color augmentations, further enhances expressiveness and disentangles motion latents from identity cues. Result: In extensive experiments, X-NeMo surpasses state-of-the-art baselines, producing highly expressive animations with superior identity resemblance. Conclusion: With its fully end-to-end training framework, X-NeMo substantially mitigates identity leakage and achieves high-quality portrait animation through a 1D latent motion descriptor and cross-attention. Abstract: We propose X-NeMo, a novel zero-shot diffusion-based portrait animation pipeline that animates a static portrait using facial movements from a driving video of a different individual. Our work first identifies the root causes of the key issues in prior approaches, such as identity leakage and difficulty in capturing subtle and extreme expressions. To address these challenges, we introduce a fully end-to-end training framework that distills a 1D identity-agnostic latent motion descriptor from the driving image, effectively controlling motion through cross-attention during image generation. Our implicit motion descriptor captures expressive facial motion in fine detail, learned end-to-end from a diverse video dataset without reliance on pretrained motion detectors. We further enhance expressiveness and disentangle motion latents from identity cues by supervising their learning with a dual GAN decoder, alongside spatial and color augmentations. By embedding the driving motion into a 1D latent vector and controlling motion via cross-attention rather than additive spatial guidance, our design eliminates the transmission of spatial-aligned structural clues from the driving condition to the diffusion backbone, substantially mitigating identity leakage. Extensive experiments demonstrate that X-NeMo surpasses state-of-the-art baselines, producing highly expressive animations with superior identity resemblance. Our code and models are available for research.

[79] Neural Multi-View Self-Calibrated Photometric Stereo without Photometric Stereo Cues

Xu Cao,Takafumi Taketomi

Main category: cs.CV

TL;DR: This paper proposes a new method for reconstructing geometry and reflectance from multi-view images under varying lighting conditions without requiring light calibration or intermediate cues, achieving better accuracy than existing methods.

Details

Motivation: The motivation is to overcome the limitations of prior multi-view photometric stereo methods that require light calibration or intermediate cues such as per-view normal maps, by jointly optimizing all scene parameters from raw images in a single stage. Method: The paper proposes a neural inverse rendering approach that jointly reconstructs geometry, spatially varying reflectance, and lighting conditions from multi-view images captured under varying directional lighting. They represent both geometry and reflectance as neural implicit fields and apply shadow-aware volume rendering. Result: The result is a method that generalizes to view-unaligned multi-light images and achieves better accuracy in shape and lighting estimation compared to state-of-the-art normal-guided approaches. Conclusion: The paper concludes that their proposed method outperforms state-of-the-art approaches in shape and lighting estimation accuracy and can handle objects with challenging geometry and reflectance. Abstract: We propose a neural inverse rendering approach that jointly reconstructs geometry, spatially varying reflectance, and lighting conditions from multi-view images captured under varying directional lighting. Unlike prior multi-view photometric stereo methods that require light calibration or intermediate cues such as per-view normal maps, our method jointly optimizes all scene parameters from raw images in a single stage. We represent both geometry and reflectance as neural implicit fields and apply shadow-aware volume rendering. A spatial network first predicts the signed distance and a reflectance latent code for each scene point. A reflectance network then estimates reflectance values conditioned on the latent code and angularly encoded surface normal, view, and light directions. 
The proposed method outperforms state-of-the-art normal-guided approaches in shape and lighting estimation accuracy, generalizes to view-unaligned multi-light images, and handles objects with challenging geometry and reflectance.

[80] CNN-based solution for mango classification in agricultural environments

Beatriz Díaz Peón,Jorge Torres Gómez,Ariel Fajardo Márquez

Main category: cs.CV

TL;DR: This paper presents a system for fruit detection and classification using CNNs, particularly focusing on mangoes, for improved farm inventory management.

Details

Motivation: The goal is to develop a system that automatically assesses fruit quality for farm inventory management. Method: A method for mango fruit classification was developed using image processing, with Resnet-18 as the preliminary architecture for classification and a cascade detector for detection. Result: Detection and classification results were displayed through a graphical interface developed in MatLab App Designer, streamlining system interaction. Conclusion: The integration of convolutional neural networks and cascade detectors provides a reliable solution for fruit classification and detection, with potential applications in agricultural quality control. Abstract: This article exemplifies the design of a fruit detection and classification system using Convolutional Neural Networks (CNN). The goal is to develop a system that automatically assesses fruit quality for farm inventory management. Specifically, a method for mango fruit classification was developed using image processing, ensuring both accuracy and efficiency. Resnet-18 was selected as the preliminary architecture for classification, while a cascade detector was used for detection, balancing execution speed and computational resource consumption. Detection and classification results were displayed through a graphical interface developed in MatLab App Designer, streamlining system interaction. The integration of convolutional neural networks and cascade detectors proffers a reliable solution for fruit classification and detection, with potential applications in agricultural quality control.

[81] Single Image Rain Streak Removal Using Harris Corner Loss and R-CBAM Network

Jongwook Si,Sungyoung Kim

Main category: cs.CV

TL;DR: This paper proposes a novel image restoration network using Corner Loss and R-CBAM Block to effectively remove rain streaks while preserving image details, achieving superior performance on benchmark datasets.

Details

Motivation: The challenge of single-image rain streak removal requires more than noise suppression, necessitating the preservation of fine structural details and overall visual quality. Method: A novel image restoration network incorporating Corner Loss to preserve object boundaries and detailed texture information, along with an R-CBAM Block to dynamically adjust feature importance in spatial and channel dimensions. Result: Quantitative evaluations on Rain100L and Rain100H datasets show significant performance improvements, with PSNR scores of 33.29 dB on Rain100L and 26.16 dB on Rain100H. Conclusion: The proposed image restoration network with Corner Loss and R-CBAM Block significantly improves the removal of rain streaks while preserving structural details and texture information. Abstract: The problem of single-image rain streak removal goes beyond simple noise suppression, requiring the simultaneous preservation of fine structural details and overall visual quality. In this study, we propose a novel image restoration network that effectively constrains the restoration process by introducing a Corner Loss, which prevents the loss of object boundaries and detailed texture information during restoration. Furthermore, we introduce a Residual Convolutional Block Attention Module (R-CBAM) Block into the encoder and decoder to dynamically adjust the importance of features in both spatial and channel dimensions, enabling the network to focus more effectively on regions heavily affected by rain streaks. Quantitative evaluations conducted on the Rain100L and Rain100H datasets demonstrate that the proposed method significantly outperforms previous approaches, achieving a PSNR of 33.29 dB on Rain100L and 26.16 dB on Rain100H.
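
The results above are reported as PSNR in dB. For reference, a minimal sketch of the standard definition, PSNR = 10 * log10(MAX^2 / MSE), on flat lists of pixel values; the 8-bit peak value of 255 is an assumption, and the paper's exact evaluation protocol (color space, cropping) is not stated in the abstract.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


clean = [52, 55, 61, 59]
derained = [53, 54, 62, 58]  # every pixel off by one, so MSE = 1
print(round(psnr(clean, derained), 2))  # 48.13
```

Since the scale is logarithmic, each 10 dB of PSNR corresponds to a tenfold reduction in mean squared error, so the gap between the Rain100L and Rain100H figures reflects a substantially larger residual error on the heavier-rain set.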

Shiyao Yu,Zi-An Wang,Kangning Yin,Zheng Tian,Mingyuan Zhang,Weixin Si,Shihao Zou

Main category: cs.CV

TL;DR: A new framework aligns four modalities (text, audio, video, and motion) for improved motion retrieval using sequence-level contrastive learning, showing significant performance gains over existing methods.

Details

Motivation: Existing motion retrieval methods rely on contrastive learning for a unified embedding space but lack intuitive interaction modes and overlook sequential representations for better performance. Audio has not been previously utilized in this context. Method: A sequence-level contrastive learning approach is used to align text, audio, video, and motion in a fine-grained joint embedding space. New multi-modal datasets are created by augmenting existing datasets with synthetic audio recordings. Result: Experimental results show a 10.16% improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in R@1 for video-to-motion retrieval on the HumanML3D dataset. The 4-modal framework outperforms its 3-modal version. Conclusion: The proposed four-modal framework significantly improves motion retrieval performance compared to existing methods and demonstrates the potential of multi-modal approaches in motion acquisition. Abstract: Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation. Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from text or visual modality. However, these methods lack a more intuitive and user-friendly interaction mode and often overlook the sequential representation of most modalities for improved retrieval performance. To address these limitations, we propose a framework that aligns four modalities -- text, audio, video, and motion -- within a fine-grained joint embedding space, incorporating audio for the first time in motion retrieval to enhance user immersion and convenience. This fine-grained space is achieved through a sequence-level contrastive learning approach, which captures critical details across modalities for better alignment. 
To evaluate our framework, we augment existing text-motion datasets with synthetic but diverse audio recordings, creating two multi-modal motion retrieval datasets. Experimental results demonstrate superior performance over state-of-the-art methods across multiple sub-tasks, including a 10.16% improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in R@1 for video-to-motion retrieval on the HumanML3D dataset. Furthermore, our results show that our 4-modal framework significantly outperforms its 3-modal counterpart, underscoring the potential of multi-modal motion retrieval for advancing motion acquisition.
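
R@1 and R@10 above are recall-at-K retrieval metrics. A minimal sketch of how R@K is typically computed from a query-by-gallery similarity matrix, assuming query i's ground-truth motion is gallery item i; the pairing convention and the toy numbers are assumptions for illustration, not the paper's evaluation code.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item ranks in the top k.

    similarity[i][j] is the score between query i and gallery item j;
    the ground-truth match for query i is assumed to be item i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Three text queries scored against three motions (illustrative numbers).
similarity = [
    [0.9, 0.1, 0.0],  # correct motion ranked first
    [0.8, 0.5, 0.1],  # correct motion ranked second
    [0.2, 0.9, 0.3],  # correct motion ranked second
]
print(recall_at_k(similarity, 1))
print(recall_at_k(similarity, 2))  # 1.0
```

In a joint embedding space the similarity matrix would come from dot products or cosine similarities between the query-modality and motion embeddings.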

[83] A Novel Dataset for Flood Detection Robust to Seasonal Changes in Satellite Imagery

Youngsun Jang,Dongyoun Kim,Chulwoo Pack,Kwanghee Won

Main category: cs.CV

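TL;DR: This paper introduces a new dataset for flooded-area segmentation and evaluates existing models on the task; the results indicate a need for more advanced multimodal and temporal learning methods.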

Details

Motivation: Existing satellite imagery benchmarks lack datasets suited to flooded-area segmentation, so a dedicated dataset is needed to fill the gap. Method: The authors collected satellite imagery of the 2019 Midwestern USA floods and built a dataset covering ten locations in each of five states, with 10 satellite images per location, then evaluated the semantic segmentation performance of state-of-the-art computer vision and remote sensing models on it. Result: The models achieved only modest results, indicating that current methods still struggle with flooded-area segmentation and need further improvement. Conclusion: The paper presents a new dataset for segmenting flooded areas in satellite images and points to the need for future multimodal and temporal learning strategies to improve model performance. Abstract: This study introduces a novel dataset for segmenting flooded areas in satellite images. After reviewing 77 existing benchmarks utilizing satellite imagery, we identified a shortage of suitable datasets for this specific task. To fill this gap, we collected satellite imagery of the 2019 Midwestern USA floods from Planet Explorer by Planet Labs (Image © 2024 Planet Labs PBC). The dataset consists of 10 satellite images per location, each containing both flooded and non-flooded areas. We selected ten locations from each of the five states: Iowa, Kansas, Montana, Nebraska, and South Dakota. The dataset ensures uniform resolution and resizing during data processing. For evaluating semantic segmentation performance, we tested state-of-the-art models in computer vision and remote sensing on our dataset. Additionally, we conducted an ablation study varying window sizes to capture temporal characteristics. Overall, the models demonstrated modest results, suggesting a requirement for future multimodal and temporal learning strategies. The dataset will be publicly available on .

[84] Adversarial-Guided Diffusion for Multimodal LLM Attacks

Chengwei Xia,Fan Ma,Ruijie Quan,Kun Zhan,Yi Yang

Main category: cs.CV

TL;DR: This paper proposes an adversarial-guided diffusion (AGD) method that effectively attacks multimodal large language models by embedding adversarial signals in the noise of a diffusion model, resulting in robust and less detectable attacks.

Details

Motivation: The study aims to address the challenge of generating adversarial images that can deceive multimodal large language models (MLLMs) into producing targeted responses while avoiding noticeable distortion in the clean image. Method: The authors proposed an adversarial-guided diffusion (AGD) approach that injects adversarial signals into the noise component of a diffusion model during the reverse diffusion process, leveraging the full-spectrum nature of the noise to enhance attack effectiveness and robustness. Result: Extensive experiments show that AGD surpasses state-of-the-art methods in terms of attack performance and robustness against defenses such as low-pass filtering. Conclusion: The proposed AGD method effectively generates adversarial images that are robust against various defenses, outperforming existing state-of-the-art methods in both attack performance and model robustness. Abstract: This paper addresses the challenge of generating adversarial images using a diffusion model to deceive multimodal large language models (MLLMs) into generating the targeted responses, while avoiding significant distortion of the clean image. To address the above challenges, we propose an adversarial-guided diffusion (AGD) approach for adversarial attacks on MLLMs. We introduce adversarial-guided noise to ensure attack efficacy. A key observation in our design is that, unlike most traditional adversarial attacks which embed high-frequency perturbations directly into the clean image, AGD injects target semantics into the noise component of the reverse diffusion. Since the added noise in a diffusion model spans the entire frequency spectrum, the adversarial signal embedded within it also inherits this full-spectrum property. Importantly, during reverse diffusion, the adversarial image is formed as a linear combination of the clean image and the noise. 
Thus, when applying defenses such as simple low-pass filtering, which act independently on each component, the adversarial image within the noise component is less likely to be suppressed, as it is not confined to the high-frequency band. This makes AGD inherently robust to a variety of defenses. Extensive experiments demonstrate that our AGD outperforms state-of-the-art methods in attack performance as well as in model robustness to some defenses.

[85] Confidence-aware agglomeration classification and segmentation of 2D microscopic food crystal images

Xiaoyu Ji,Ali Shakouri,Fengqing Zhu

Main category: cs.CV

TL;DR: This paper proposes a novel method for improving agglomeration classification accuracy in food crystal analysis by combining supervised modeling and post-processing techniques.

Details

Motivation: Manual annotation of food crystal agglomeration in 2D microscopic images is challenging due to transparency of water bonding and limited perspective. Method: A supervised baseline model for segmentation pseudo-labels and an instance classification model for pixel-wise segmentation, along with a post-processing module, are proposed. Result: The method successfully classifies potential agglomerated instances and improves true positive classification accuracy compared to existing methods. Conclusion: The proposed method improves the accuracy of agglomeration classification and size distribution predictions while considering the variability in confidence levels of manual annotations. Abstract: Food crystal agglomeration is a phenomenon that occurs during crystallization, trapping water between crystals and affecting food product quality. Manual annotation of agglomeration in 2D microscopic images is particularly difficult due to the transparency of water bonding and the limited perspective focusing on a single slide of the imaged sample. To address this challenge, we first propose a supervised baseline model to generate segmentation pseudo-labels for the coarsely labeled classification dataset. Next, an instance classification model that simultaneously performs pixel-wise segmentation is trained. Both models are used in the inference stage to combine their respective strengths in classification and segmentation. To preserve crystal properties, a post-processing module is designed and included in both steps. Our method improves true positive agglomeration classification accuracy and size distribution predictions compared to other existing methods. Given the variability in confidence levels of manual annotations, our proposed method is evaluated under two confidence levels and successfully classifies potential agglomerated instances.

[86] YOLO-ROC: A High-Precision and Ultra-Lightweight Model for Real-Time Road Damage Detection

Zicheng Lin,Weichao Pan

Main category: cs.CV

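TL;DR: This paper proposes YOLO-ROC, an efficient and lightweight road damage detection model that improves multi-scale feature extraction and reduces computational complexity, substantially improving small-target detection accuracy and model efficiency.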

Details

Motivation: Existing road damage detection models fall short in multi-scale feature extraction and computational efficiency, which makes high-precision small-target detection and real-time deployment difficult. Method: The authors design a Bidirectional Multi-scale Spatial Pyramid Pooling Fast (BMS-SPPF) module to enhance multi-scale feature extraction and adopt a hierarchical channel compression strategy to reduce computational complexity. Result: Experiments on the RDD2022_China_Drone dataset show that YOLO-ROC achieves a mAP50 of 67.6%, surpassing the baseline YOLOv8n by 2.11%; the mAP50 for the small-target D40 category improves by 16.8%, and the model size is only 2.0 MB. The model also generalizes well on the RDD2022_China_Motorbike dataset. Conclusion: YOLO-ROC is an efficient, lightweight road damage detection model whose improved multi-scale feature extraction and reduced computational complexity address the shortcomings of existing models in small-target detection and real-time deployment. Abstract: Road damage detection is a critical task for ensuring traffic safety and maintaining infrastructure integrity. While deep learning-based detection methods are now widely adopted, they still face two core challenges: first, the inadequate multi-scale feature extraction capabilities of existing networks for diverse targets like cracks and potholes, leading to high miss rates for small-scale damage; and second, the substantial parameter counts and computational demands of mainstream models, which hinder their deployment for efficient, real-time detection in practical applications. To address these issues, this paper proposes a high-precision and lightweight model, YOLO - Road Orthogonal Compact (YOLO-ROC). We designed a Bidirectional Multi-scale Spatial Pyramid Pooling Fast (BMS-SPPF) module to enhance multi-scale feature extraction and implemented a hierarchical channel compression strategy to reduce computational complexity. The BMS-SPPF module leverages a bidirectional spatial-channel attention mechanism to improve the detection of small targets. Concurrently, the channel compression strategy reduces the parameter count from 3.01M to 0.89M and GFLOPs from 8.1 to 2.6. Experiments on the RDD2022_China_Drone dataset demonstrate that YOLO-ROC achieves a mAP50 of 67.6%, surpassing the baseline YOLOv8n by 2.11%. Notably, the mAP50 for the small-target D40 category improved by 16.8%, and the final model size is only 2.0 MB. Furthermore, the model exhibits excellent generalization performance on the RDD2022_China_Motorbike dataset.
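
The compression figures in the abstract can be restated as relative reductions. A small worked calculation using only the numbers quoted above; the helper name is illustrative:

```python
def relative_reduction(before, after):
    """Percentage reduction going from `before` to `after`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


params = relative_reduction(3.01, 0.89)  # parameters, in millions
gflops = relative_reduction(8.1, 2.6)    # GFLOPs
print(f"parameters: {params:.1f}% fewer")  # parameters: 70.4% fewer
print(f"GFLOPs: {gflops:.1f}% fewer")      # GFLOPs: 67.9% fewer
```

In other words, the channel compression strategy removes roughly 70% of the parameters and about 68% of the compute relative to the quoted starting point.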

[87] Toward Safe, Trustworthy and Realistic Augmented Reality User Experience

Yanming Xiu

Main category: cs.CV

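TL;DR: This research develops two systems for detecting harmful AR content and proposes future directions, aiming to build a scalable framework for safeguarding AR experiences.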

Details

Motivation: As AR technology becomes increasingly integrated into everyday life, ensuring the safety and trustworthiness of its virtual content is critical, especially for AR content that may obstruct critical information or manipulate user perception. Method: The research develops two systems, ViDDAR and VIM-Sense, which use vision-language models (VLMs) and multimodal reasoning modules to detect harmful AR content. Result: The work delivers two systems for detecting harmful AR content and proposes three future directions: automated quality assessment of virtual content, detection of multimodal attacks, and efficient, user-centered adaptation of VLMs. Conclusion: The research aims to establish a scalable, human-aligned framework for safeguarding AR experiences and seeks feedback on perceptual modeling, multimodal AR content implementation, and lightweight model adaptation. Abstract: As augmented reality (AR) becomes increasingly integrated into everyday life, ensuring the safety and trustworthiness of its virtual content is critical. Our research addresses the risks of task-detrimental AR content, particularly that which obstructs critical information or subtly manipulates user perception. We developed two systems, ViDDAR and VIM-Sense, to detect such attacks using vision-language models (VLMs) and multimodal reasoning modules. Building on this foundation, we propose three future directions: automated, perceptually aligned quality assessment of virtual content; detection of multimodal attacks; and adaptation of VLMs for efficient and user-centered deployment on AR devices. Overall, our work aims to establish a scalable, human-aligned framework for safeguarding AR experiences and seeks feedback on perceptual modeling, multimodal AR content implementation, and lightweight model adaptation.

[88] Ambiguity-Guided Learnable Distribution Calibration for Semi-Supervised Few-Shot Class-Incremental Learning

Fan Lyu,Linglan Zhao,Chengyan Liu,Yinying Mei,Zhang Zhang,Jian Zhang,Fuyuan Hu,Liang Wang

Main category: cs.CV

TL;DR: This paper introduces a new problem setting, Generalized Semi-supervised Few-Shot Class-Incremental Learning (GSemi-FSCIL), that better aligns with real-world scenarios. The authors propose an Ambiguity-guided Learnable Distribution Calibration (ALDC) strategy to handle the challenge of distinguishing unlabeled samples from different classes, achieving superior performance on benchmark datasets.

Details

Motivation: The paper aims to address the limitations of existing Semi-FSCIL methods, which assume that unlabeled data comes only from novel classes, thereby not reflecting real-world scenarios. The authors redefine the problem as Generalized Semi-FSCIL (GSemi-FSCIL) to better reflect practical settings. Method: The authors propose an Ambiguity-guided Learnable Distribution Calibration (ALDC) strategy, which dynamically uses abundant base samples to correct biased feature distributions for few-shot novel classes. They evaluate their method on three benchmark datasets. Result: The experiments demonstrate that the proposed ALDC method outperforms existing approaches on three benchmark datasets, setting new state-of-the-art results in the GSemi-FSCIL setting. Conclusion: The proposed ALDC strategy effectively addresses the challenge of distinguishing unlabeled samples from base and novel classes in the GSemi-FSCIL setting, leading to improved performance over existing methods. Abstract: Few-Shot Class-Incremental Learning (FSCIL) focuses on models learning new concepts from limited data while retaining knowledge of previous classes. Recently, many studies have started to leverage unlabeled samples to assist models in learning from few-shot samples, giving rise to the field of Semi-supervised Few-shot Class-Incremental Learning (Semi-FSCIL). However, these studies often assume that the source of unlabeled data is confined to novel classes of the current session, which presents a narrow perspective and cannot align well with practical scenarios. To better reflect real-world scenarios, we redefine Semi-FSCIL as Generalized Semi-FSCIL (GSemi-FSCIL) by incorporating both base and all the ever-seen novel classes in the unlabeled set. This change in the composition of unlabeled samples poses a new challenge for existing methods, as they struggle to distinguish between unlabeled samples from base and novel classes.
To address this issue, we propose an Ambiguity-guided Learnable Distribution Calibration (ALDC) strategy. ALDC dynamically uses abundant base samples to correct biased feature distributions for few-shot novel classes. Experiments on three benchmark datasets show that our method outperforms existing works, setting new state-of-the-art results.

[89] Generalized Reinforcement Learning for Retriever-Specific Query Rewriter with Unstructured Real-World Documents

Sungguk Cha,DongWook Kim,Taeseung Hahn,Mintae Kim,Youngsub Han,Byoung-Ki Jeon

Main category: cs.CV

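TL;DR: This paper proposes RL-QR, a reinforcement learning framework for retriever-specific query rewriting that aims to improve retrieval-augmented generation (RAG) systems on diverse, unstructured real-world documents.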

Details

Motivation: Retrieval-augmented generation (RAG) systems rely heavily on effective query formulation to unlock external knowledge, yet optimizing queries for diverse, unstructured real-world documents remains a challenge. Method: RL-QR is a reinforcement learning framework for retriever-specific query rewriting; it trains query rewriters by synthesizing scenario-question pairs and applying Generalized Reward Policy Optimization (GRPO). Result: Experiments on industrial in-house data show significant improvements: RL-QR_multi-modal achieves an 11% relative gain in NDCG@3 for multi-modal RAG, and RL-QR_lexical yields a 9% gain for lexical retrievers. Challenges remain for semantic and hybrid retrievers, where the rewriters failed to improve performance. Conclusion: RL-QR has the potential to revolutionize query optimization for RAG systems, offering a scalable, annotation-free solution for real-world retrieval tasks, while pointing to directions for further improvement in semantic retrieval. Abstract: Retrieval-Augmented Generation (RAG) systems rely heavily on effective query formulation to unlock external knowledge, yet optimizing queries for diverse, unstructured real-world documents remains a challenge. We introduce RL-QR, a reinforcement learning framework for retriever-specific query rewriting that eliminates the need for human-annotated datasets and extends applicability to both text-only and multi-modal databases. By synthesizing scenario-question pairs and leveraging Generalized Reward Policy Optimization (GRPO), RL-QR trains query rewriters tailored to specific retrievers, enhancing retrieval performance across varied domains. Experiments on industrial in-house data demonstrate significant improvements, with RL-QR_multi-modal achieving an 11% relative gain in NDCG@3 for multi-modal RAG and RL-QR_lexical yielding a 9% gain for lexical retrievers. However, challenges persist with semantic and hybrid retrievers, where rewriters failed to improve performance, likely due to training misalignments. Our findings highlight RL-QR's potential to revolutionize query optimization for RAG systems, offering a scalable, annotation-free solution for real-world retrieval tasks, while identifying avenues for further refinement in semantic retrieval contexts.
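
The RL-QR results are reported as relative gains in NDCG@3. As a reminder of what that metric measures, here is a minimal sketch of NDCG@k with linear gain and a log2 rank discount, plus the relative-gain calculation. The relevance lists are made-up illustrations, not data from the paper, and the paper may use a different gain convention.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline` (0.11 means 11%)."""
    return (improved - baseline) / baseline


# Hypothetical relevance labels of the top results, in ranked order.
original_query = [0, 0, 1]   # the only relevant document is ranked third
rewritten_query = [1, 0, 0]  # after rewriting, it is ranked first

before = ndcg_at_k(original_query, 3)
after = ndcg_at_k(rewritten_query, 3)
print(before, after, relative_gain(before, after))
```

Because the discount grows with rank, moving a relevant document from third to first place raises NDCG@3 even though the set of retrieved documents is unchanged.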

[90] Automated Mapping the Pathways of Cranial Nerve II, III, V, and VII/VIII: A Multi-Parametric Multi-Stage Diffusion Tractography Atlas

Lei Xie,Jiahao Huang,Jiawei Zhang,Jianzhong He,Yiang Pan,Guoqiang Xie,Mengjun Li,Qingrun Zeng,Mingchu Li,Yuanjing Feng

Main category: cs.CV

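TL;DR: This study presents a comprehensive diffusion tractography atlas for automated mapping of cranial nerve pathways in the human brain, significantly improving the efficiency and accuracy of brain structure analysis.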

Details

Motivation: Cranial nerves play a crucial role in many essential functions of the human brain, and mapping their pathways from diffusion MRI (dMRI) provides valuable preoperative information. However, building a comprehensive and detailed cranial nerve atlas is challenging because of the unique anatomy of each cranial nerve pair and the complexity of the skull base environment. Method: Streamlines for each cranial nerve pair are generated with multi-parametric fiber tractography, and a multi-stage fiber clustering strategy is applied to approximately 1,000,000 streamlines to build the cranial nerve atlas. Result: The atlas automatically identifies 8 fiber bundles associated with 5 pairs of cranial nerves, including the optic nerve CN II, oculomotor nerve CN III, trigeminal nerve CN V, and facial-vestibulocochlear nerve CN VII/VIII. Experiments confirm its robustness, and it shows high spatial correspondence with expert manual annotations across multiple datasets and clinical cases. Conclusion: The study develops what the authors believe to be the first comprehensive diffusion tractography atlas for automated mapping of cranial nerve (CN) pathways in the human brain, improving the visualization, analysis, and understanding of complex brain structures and their spatial relationships with nearby anatomy. Abstract: Cranial nerves (CNs) play a crucial role in various essential functions of the human brain, and mapping their pathways from diffusion MRI (dMRI) provides valuable preoperative insights into the spatial relationships between individual CNs and key tissues. However, mapping a comprehensive and detailed CN atlas is challenging because of the unique anatomical structures of each CN pair and the complexity of the skull base environment. In this work, we present what we believe to be the first study to develop a comprehensive diffusion tractography atlas for automated mapping of CN pathways in the human brain. The CN atlas is generated by fiber clustering by using the streamlines generated by multi-parametric fiber tractography for each pair of CNs. Instead of disposable clustering, we explore a new strategy of multi-stage fiber clustering for multiple analysis of approximately 1,000,000 streamlines generated from the 50 subjects from the Human Connectome Project (HCP). Quantitative and visual experiments demonstrate that our CN atlas achieves high spatial correspondence with expert manual annotations on multiple acquisition sites, including the HCP dataset, the Multi-shell Diffusion MRI (MDM) dataset and two clinical cases of pituitary adenoma patients. The proposed CN atlas can automatically identify 8 fiber bundles associated with 5 pairs of CNs, including the optic nerve CN II, oculomotor nerve CN III, trigeminal nerve CN V and facial-vestibulocochlear nerve CN VII/VIII, and its robustness is demonstrated experimentally.
This work contributes to the field of diffusion imaging by facilitating more efficient and automated mapping of the pathways of multiple pairs of CNs, thereby enhancing the analysis and understanding of complex brain structures through visualization of their spatial relationships with nearby anatomy.

[91] A Deep Dive into Generic Object Tracking: A Survey

Fereshteh Aghaee Meibodi,Shadi Alijani,Homayoun Najjaran

Main category: cs.CV

TL;DR: The paper provides a comprehensive review of object tracking methods in computer vision, with a focus on transformer-based approaches, highlighting their advancements in handling complex spatio-temporal dynamics and occlusions.

Details

Motivation: The motivation behind the paper is to provide a comprehensive review of all major tracking paradigms, particularly focusing on the rapidly evolving transformer-based methods, due to the challenges posed by complex spatio-temporal dynamics in generic object tracking. Method: The paper employs both qualitative and quantitative comparisons to analyze core design principles, innovations, and limitations of various tracking approaches, introducing a novel categorization and providing a unified visual and tabular comparison of representative methods. Result: The result is a detailed analysis and categorization of different object tracking approaches, including Siamese-based, discriminative, and transformer-based methods, along with a unified comparison and summary of major evaluation benchmarks. Conclusion: The paper concludes that transformer-based tracking methods have significantly advanced the field of generic object tracking by leveraging robust spatio-temporal modeling capabilities, and it emphasizes the importance of a comprehensive review covering various tracking paradigms. Abstract: Generic object tracking remains an important yet challenging task in computer vision due to complex spatio-temporal dynamics, especially in the presence of occlusions, similar distractors, and appearance variations. Over the past two decades, a wide range of tracking paradigms, including Siamese-based trackers, discriminative trackers, and, more recently, prominent transformer-based approaches, have been introduced to address these challenges. While a few existing survey papers in this field have either concentrated on a single category or widely covered multiple ones to capture progress, our paper presents a comprehensive review of all three categories, with particular emphasis on the rapidly evolving transformer-based methods. We analyze the core design principles, innovations, and limitations of each approach through both qualitative and quantitative comparisons. 
Our study introduces a novel categorization and offers a unified visual and tabular comparison of representative methods. Additionally, we organize existing trackers from multiple perspectives and summarize the major evaluation benchmarks, highlighting the fast-paced advancements in transformer-based tracking driven by their robust spatio-temporal modeling capabilities.

[92] Towards Measuring and Modeling Geometric Structures in Time Series Forecasting via Image Modality

Mingyang Yu,Xiahui Guo,Peng chen,Zhenkai Li,Yang Shu

Main category: cs.CV

TL;DR: This paper introduces a new time series evaluation metric (TGSI) and a training loss function (SATL) that improves model performance in capturing geometric structures of time series data.

Details

Motivation: Traditional metrics like MSE do not evaluate the geometric structure of time series data, which is crucial for understanding temporal dynamics. Method: Developed a new evaluation metric called time series Geometric Structure Index (TGSI) and a new loss function, Shape-Aware Temporal Loss (SATL), which combines first-order difference loss, frequency domain loss, and perceptual feature loss. Result: Models trained with SATL outperformed baseline methods on multiple datasets in both MSE and TGSI metrics. Conclusion: The proposed SATL loss function effectively enhances time series forecasting models' performance in capturing geometric structures without increasing inference cost. Abstract: Time Series forecasting is critical in diverse domains such as weather forecasting, financial investment, and traffic management. While traditional numerical metrics like mean squared error (MSE) can quantify point-wise accuracy, they fail to evaluate the geometric structure of time series data, which is essential to understand temporal dynamics. To address this issue, we propose the time series Geometric Structure Index (TGSI), a novel evaluation metric that transforms time series into images to leverage their inherent two-dimensional geometric representations. However, since the image transformation process is non-differentiable, TGSI cannot be directly integrated as a training loss. We further introduce the Shape-Aware Temporal Loss (SATL), a multi-component loss function operating in the time series modality to bridge this gap and enhance structure modeling during training. 
SATL combines three components: a first-order difference loss that measures structural consistency through the MSE between first-order differences, a frequency domain loss that captures essential periodic patterns using the Fast Fourier Transform while minimizing noise, and a perceptual feature loss that measures geometric structure difference in time-series by aligning temporal features with geometric structure features through a pre-trained temporal feature extractor and time-series image autoencoder. Experiments across multiple datasets demonstrate that models trained with SATL achieve superior performance in both MSE and the proposed TGSI metrics compared to baseline methods, without additional computational cost during inference.
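
The first SATL component, the MSE between first-order differences, can be illustrated with a small pure-Python sketch. It covers only that one term; the frequency-domain and perceptual terms, and any weighting between terms, are not reproduced here. The toy series and function names are illustrative assumptions, not taken from the paper.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def first_difference(series):
    """Step-to-step changes: series[t+1] - series[t]."""
    return [nxt - cur for cur, nxt in zip(series, series[1:])]


def first_difference_loss(pred, target):
    """MSE between the first-order differences of prediction and target."""
    return mse(first_difference(pred), first_difference(target))


# Toy target with a clear oscillation, and two hypothetical forecasts.
target = [0, 1, 0, -1, 0, 1, 0, -1]
shifted = [x + 2 for x in target]  # right shape, constant offset
flat = [0] * len(target)           # close in value, no shape at all

for name, pred in [("shifted", shifted), ("flat", flat)]:
    print(name, mse(pred, target), first_difference_loss(pred, target))
```

The flat forecast has the lower point-wise MSE, yet the shifted forecast has zero first-difference loss because it reproduces every rise and fall. This is the kind of structural agreement that plain MSE does not reward.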

[93] Learning Semantic-Aware Threshold for Multi-Label Image Recognition with Partial Labels

Haoxian Ruan,Zhihua Xu,Zhijing Yang,Guang Ma,Jieming Xie,Changxiang Fan,Tianshui Chen

Main category: cs.CV

TL;DR: The paper proposes SATL, a novel method for multi-label image recognition with partial labels, which dynamically learns category-specific thresholds and improves label prediction accuracy.

Details

Motivation: Traditional MLR-PL methods create inaccurate and incomplete pseudo-labels due to fixed thresholds that fail to account for varying score distributions across categories. Method: The study introduces the Semantic-Aware Threshold Learning (SATL) algorithm, which calculates score distributions for positive and negative samples in each category, dynamically updates category-specific thresholds, and employs a differential ranking loss to enhance threshold discrimination. Result: Experiments on large-scale datasets like Microsoft COCO and VG-200 show that the SATL method significantly outperforms existing approaches in scenarios with limited labels. Conclusion: The proposed SATL algorithm significantly improves the performance of multi-label image recognition with partial labels by dynamically learning category-specific thresholds and implementing a differential ranking loss. Abstract: Multi-label image recognition with partial labels (MLR-PL) is designed to train models using a mix of known and unknown labels. Traditional methods rely on semantic or feature correlations to create pseudo-labels for unidentified labels using pre-set thresholds. This approach often overlooks the varying score distributions across categories, resulting in inaccurate and incomplete pseudo-labels, thereby affecting performance. In our study, we introduce the Semantic-Aware Threshold Learning (SATL) algorithm. This innovative approach calculates the score distribution for both positive and negative samples within each category and determines category-specific thresholds based on these distributions. These distributions and thresholds are dynamically updated throughout the learning process. Additionally, we implement a differential ranking loss to establish a significant gap between the score distributions of positive and negative samples, enhancing the discrimination of the thresholds. 
Comprehensive experiments and analysis on large-scale multi-label datasets, such as Microsoft COCO and VG-200, demonstrate that our method significantly improves performance in scenarios with limited labels.

[94] PixNerd: Pixel Neural Field Diffusion

Shuai Wang,Ziteng Gao,Chenhui Zhu,Weilin Huang,Limin Wang

Main category: cs.CV

TL;DR: PixNerd is a single-stage, end-to-end framework for image generation that avoids the need for a VAE or complex pipelines, achieving strong performance on ImageNet image generation and text-to-image tasks.

Details

Motivation: The motivation is to overcome the limitations of the two-stage training paradigm in diffusion transformers, which relies on a pre-trained VAE and introduces accumulated errors and decoding artifacts. Method: The paper introduces PixNerd, which models patch-wise decoding using a neural field, operating directly in pixel space in a single-scale, single-stage manner. It also extends this framework to text-to-image applications. Result: PixNerd achieved 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without using a VAE or cascade pipeline. Additionally, the PixNerd-XXL/16 variant achieved a 0.73 overall score on the GenEval benchmark and 80.9 on the DPG benchmark in text-to-image applications. Conclusion: The paper concludes that PixNerd provides an efficient, end-to-end solution for image generation without the need for a VAE or complex cascade pipelines, achieving competitive results in both ImageNet image generation and text-to-image applications. Abstract: The current success of diffusion transformers heavily depends on the compressed latent space shaped by the pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address the aforementioned problems, researchers return to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to their efforts, we propose to model the patch-wise decoding with neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined as pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet 256×256 and 2.84 FID on ImageNet 512×512 without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications.
Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.

[95] Towards Affordable Tumor Segmentation and Visualization for 3D Breast MRI Using SAM2

Solha Kang,Eugene Kim,Joris Vankerschaver,Utku Ozbulak

Main category: cs.CV

TL;DR: This study explores the use of SAM2 for low-cost, minimal-input 3D tumor segmentation in breast MRI, achieving promising results with minimal supervision and identifying center-outward propagation as the most effective strategy.

Details

Motivation: Manual interpretation of 3D breast MRI scans is labor-intensive and subjective, and commercial AI tools for medical image analysis are often too costly and infrastructure-heavy for adoption in low- and middle-income countries. Method: Using a single bounding box annotation on one slice, segmentation predictions were propagated across the 3D volume using three slice-wise tracking strategies: top-to-bottom, bottom-to-top, and center-outward. These strategies were evaluated across a large patient cohort. Result: The center-outward propagation strategy yielded the most consistent and accurate segmentations. The study found that SAM2 achieved strong segmentation performance despite not being trained on volumetric medical data, and analyzed how performance varied with tumor size, location, and shape. Conclusion: The study concludes that the Segment Anything Model 2 (SAM2), despite being a zero-shot model not trained for volumetric medical data, can effectively support 3D tumor segmentation in breast MRI with minimal supervision, offering a cost-effective and accessible solution for resource-limited settings. Abstract: Breast MRI provides high-resolution volumetric imaging critical for tumor assessment and treatment planning, yet manual interpretation of 3D scans remains labor-intensive and subjective. While AI-powered tools hold promise for accelerating medical image analysis, adoption of commercial medical AI products remains limited in low- and middle-income countries due to high license costs, proprietary software, and infrastructure demands. In this work, we investigate whether the Segment Anything Model 2 (SAM2) can be adapted for low-cost, minimal-input 3D tumor segmentation in breast MRI. Using a single bounding box annotation on one slice, we propagate segmentation predictions across the 3D volume using three different slice-wise tracking strategies: top-to-bottom, bottom-to-top, and center-outward. 
We evaluate these strategies across a large cohort of patients and find that center-outward propagation yields the most consistent and accurate segmentations. Despite being a zero-shot model not trained for volumetric medical data, SAM2 achieves strong segmentation performance under minimal supervision. We further analyze how segmentation performance relates to tumor size, location, and shape, identifying key failure modes. Our results suggest that general-purpose foundation models such as SAM2 can support 3D medical image analysis with minimal supervision, offering an accessible and affordable alternative for resource-constrained settings.
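
The three slice-wise tracking strategies differ only in the order in which slices are visited. The sketch below generates those orders as lists of slice indices. It assumes that index 0 is the top slice and that center-outward propagation runs as two passes starting from the annotated slice (defaulting to the middle of the volume). The function name and these conventions are illustrative, not taken from the paper or from the SAM2 API.

```python
def propagation_passes(num_slices, strategy, start=None):
    """Return the slice-index sequences visited by a propagation strategy.

    Each inner list is one pass; its first element is the slice where the
    prompt (for example a bounding box) is assumed to be placed.
    """
    if strategy == "top-to-bottom":
        return [list(range(num_slices))]
    if strategy == "bottom-to-top":
        return [list(range(num_slices - 1, -1, -1))]
    if strategy == "center-outward":
        if start is None:
            start = num_slices // 2  # assumption: begin at the middle slice
        towards_top = list(range(start, -1, -1))
        towards_bottom = list(range(start, num_slices))
        return [towards_top, towards_bottom]
    raise ValueError(f"unknown strategy: {strategy}")


for name in ("top-to-bottom", "bottom-to-top", "center-outward"):
    print(name, propagation_passes(5, name))
```

One plausible reading of the reported result is that starting near the middle halves the longest distance a mask has to be tracked, which limits accumulated drift. The abstract itself does not state the cause.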

[96] iLRM: An Iterative Large 3D Reconstruction Model

Gyeongjin Kang,Seungtae Nam,Xiangyu Sun,Sameh Khamis,Abdelrahman Mohamed,Eunbyung Park

Main category: cs.CV

TL;DR: iLRM is an efficient and scalable method for high-quality 3D reconstruction using iterative refinement and optimized attention mechanisms.

Details

Motivation: Current state-of-the-art methods based on transformers face scalability issues due to full attention across image tokens, leading to high computational costs with increasing views or resolution. Method: iLRM uses an iterative refinement mechanism with three principles: decoupling scene representation from input views, decomposing multi-view interactions into two-stage attention, and injecting high-resolution information at each layer. Result: iLRM achieves superior reconstruction quality and speed compared to existing methods on datasets like RE10K and DL3DV, with notable improvements in scalability. Conclusion: iLRM provides a scalable and efficient solution for 3D reconstruction, outperforming existing methods in both quality and speed while efficiently leveraging a larger number of input views. Abstract: Feed-forward 3D modeling has emerged as a promising approach for rapid and high-quality 3D reconstruction. In particular, directly generating explicit 3D representations, such as 3D Gaussian splatting, has attracted significant attention due to its fast and high-quality rendering, as well as numerous applications. However, many state-of-the-art methods, primarily based on transformer architectures, suffer from severe scalability issues because they rely on full attention across image tokens from multiple input views, resulting in prohibitive computational costs as the number of views or image resolution increases. 
Toward a scalable and efficient feed-forward 3D reconstruction, we introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D Gaussian representations through an iterative refinement mechanism, guided by three core principles: (1) decoupling the scene representation from input-view images to enable compact 3D representations; (2) decomposing fully-attentional multi-view interactions into a two-stage attention scheme to reduce computational costs; and (3) injecting high-resolution information at every layer to achieve high-fidelity reconstruction. Experimental results on widely used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms existing methods in both reconstruction quality and speed. Notably, iLRM exhibits superior scalability, delivering significantly higher reconstruction quality under comparable computational cost by efficiently leveraging a larger number of input views.

[97] UniLiP: Adapting CLIP for Unified Multimodal Understanding, Generation and Editing

Hao Tang,Chenwei Xie,Xiaoyi Bao,Tingyu Weng,Pandeng Li,Yun Zheng,Liwei Wang

Main category: cs.CV

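TL;DR: UniLIP uses a two-stage training scheme, a self-distillation strategy, and a dual-condition architecture to extend CLIP, performing strongly on generation and editing tasks while retaining strong understanding capabilities.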

Details

Motivation: Previous CLIP-based unified methods require additional diffusion decoders or quantization to support reconstruction and generation tasks, leading to inconsistent reconstruction or degraded original understanding performance. Method: The paper introduces a two-stage training scheme and a self-distillation strategy, combined with a dual-condition architecture that connects the MLLM and the diffusion transformer. Result: In text-to-image generation, UniLIP scores 0.87 on GenEval and 0.53 on WISE; in image editing, it scores 3.62 on the ImgEdit Benchmark, surpassing state-of-the-art models such as BAGEL and UniWorld-V1. Conclusion: UniLIP broadens the application scope of CLIP, making it suitable not only for understanding tasks but also strong on generation and editing tasks. Abstract: In this paper, we propose UniLIP, which extends CLIP to reconstruction, generation and editing, thereby building a unified tokenizer upon its exceptional comprehension capabilities. Previous CLIP-based unified methods often require additional diffusion decoders or quantization to support reconstruction and generation tasks, leading to inconsistent reconstruction or degradation of original comprehension performance. In contrast, we introduce a two-stage training scheme and a self-distillation strategy that progressively integrates reconstruction capabilities into CLIP, allowing it to maintain original comprehension performance while achieving effective image reconstruction. Furthermore, we propose a dual-condition architecture to connect the MLLM and diffusion transformer, using both learnable queries and the last layer multimodal hidden states as joint conditions. This method not only enables the utilization of the MLLM's strong reasoning capabilities in generation tasks, but also maximizes the exploitation of the rich information in UniLIP features during editing tasks. In text-to-image generation tasks, UniLIP obtains scores of 0.87 and 0.53 on the GenEval and WISE benchmarks, respectively, surpassing all previous unified models of similar scale. In image editing, UniLIP also achieves a score of 3.62 on the ImgEdit Benchmark, surpassing recent state-of-the-art models such as BAGEL and UniWorld-V1. UniLIP effectively expands the application scope of CLIP, enabling continuous CLIP features to not only serve as the optimal choice for understanding tasks but also achieve highly competitive performance in generation and editing tasks.

Dohwan Ko,Ji Soo Lee,Minhyuk Choi,Zihang Meng,Hyunwoo J. Kim

Main category: cs.CV

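TL;DR: This paper proposes the BLiM framework and the CPN module, which mitigate candidate prior bias in text-video retrieval and achieve strong results on multiple benchmarks.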

Details

Motivation: The naive application of MLLMs is observed to introduce candidate prior bias during retrieval, so a new method is needed to improve relevance in text-video retrieval. Method: The paper proposes BLiM, a retrieval framework that leverages both query and candidate likelihoods by training the model to generate text from a given video as well as video features from a given text. It also introduces Candidate Prior Normalization (CPN), an effective training-free score calibration module. Result: BLiM with CPN performs strongly on text-video retrieval benchmarks, and the analysis shows that CPN is broadly applicable across multi-modal tasks. Conclusion: BLiM combined with CPN outperforms previous state-of-the-art models by 6.4 R@1 on average across four text-video retrieval benchmarks, alleviating candidate prior bias and emphasizing query-candidate relevance. Abstract: Text-Video Retrieval aims to find the most relevant text (or video) candidate given a video (or text) query from large-scale online databases. Recent work leverages multi-modal large language models (MLLMs) to improve retrieval, especially for long or complex query-candidate pairs. However, we observe that the naive application of MLLMs, i.e., retrieval based on candidate likelihood, introduces candidate prior bias, favoring candidates with inherently higher priors over those more relevant to the query. To this end, we propose a novel retrieval framework, Bidirectional Likelihood Estimation with MLLM (BLiM), which leverages both query and candidate likelihoods by training the model to generate text from a given video as well as video features from a given text. Furthermore, we introduce Candidate Prior Normalization (CPN), a simple yet effective training-free score calibration module designed to mitigate candidate prior bias in candidate likelihood. On four Text-Video Retrieval benchmarks, our BLiM equipped with CPN outperforms previous state-of-the-art models by 6.4 R@1 on average, effectively alleviating candidate prior bias and emphasizing query-candidate relevance. Our in-depth analysis across various multi-modal tasks beyond retrieval highlights the broad applicability of CPN which enhances visual understanding by reducing reliance on textual priors. Code is available at https://github.com/mlvlab/BLiM.
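
The abstract does not spell out how Candidate Prior Normalization is computed. One common way to discount a candidate's prior is to subtract a scaled log-prior from the conditional log-likelihood, and the sketch below uses that form purely as an assumption to illustrate the bias being described. The scoring formula, the `alpha` parameter, and all log-probabilities are hypothetical, not the authors' implementation.

```python
def rank_candidates(cond_logp, prior_logp, alpha=1.0):
    """Rank candidates by log P(candidate | query) - alpha * log P(candidate).

    alpha = 0 recovers plain candidate-likelihood ranking; larger values
    discount candidates that are likely regardless of the query.
    """
    scores = {name: cond_logp[name] - alpha * prior_logp[name] for name in cond_logp}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical log-likelihoods for two captions given one video query.
cond_logp = {"generic": -2.0, "specific": -2.5}   # log P(caption | video)
prior_logp = {"generic": -1.0, "specific": -4.0}  # log P(caption) without the video

print(rank_candidates(cond_logp, prior_logp, alpha=0.0))  # raw likelihood ranking
print(rank_candidates(cond_logp, prior_logp, alpha=1.0))  # prior-normalized ranking
```

In this toy case the generic caption wins on raw likelihood only because it is probable under any query; once the prior is discounted, the caption that actually depends on the video ranks first.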

[99] LED Benchmark: Diagnosing Structural Layout Errors for Document Layout Analysis

Inbum Heo,Taewook Hwang,Jeesu Jung,Sangkeun Jung

Main category: cs.CV

TL;DR: This paper proposes LED, a novel benchmark for evaluating structural robustness in document layout predictions, which reveals hidden biases and trade-offs not captured by traditional metrics.

Details

Motivation: Challenges remain in addressing structural errors in Document Layout Analysis, and conventional metrics like IoU and mAP are insufficient for detecting these errors. Method: LED defines eight standardized error types and formulates three tasks: error existence detection, error type classification, and element-wise error type classification. LED-Dataset is a synthetic dataset with realistic structural errors. Result: Experimental results show that LED differentiates structural understanding capabilities across LMMs. Conclusion: LED effectively evaluates the structural robustness of document layout predictions and reveals modality biases and performance trade-offs. Abstract: Recent advancements in Document Layout Analysis through Large Language Models and Multimodal Models have significantly improved layout detection. However, despite these improvements, challenges remain in addressing critical structural errors, such as region merging, splitting, and missing content. Conventional evaluation metrics like IoU and mAP, which focus primarily on spatial overlap, are insufficient for detecting these errors. To address this limitation, we propose Layout Error Detection (LED), a novel benchmark designed to evaluate the structural robustness of document layout predictions. LED defines eight standardized error types, and formulates three complementary tasks: error existence detection, error type classification, and element-wise error type classification. Furthermore, we construct LED-Dataset, a synthetic dataset generated by injecting realistic structural errors based on empirical distributions from DLA models. Experimental results across a range of LMMs reveal that LED effectively differentiates structural understanding capabilities, exposing modality biases and performance trade-offs not visible through traditional metrics.

[100] Training-free Geometric Image Editing on Diffusion Models

Hanshen Zhu,Zhen Zhu,Kaile Zhang,Yiming Gong,Yuliang Liu,Xiang Bai

Main category: cs.CV

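TL;DR: This paper proposes FreeFine, a geometric image editing method that achieves high-quality edits through a decoupled pipeline and is especially suited to complex transformations.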

Details

Motivation: Existing diffusion-based image editing methods struggle with complex geometric transformations, and a more effective solution is needed. Method: The paper proposes a decoupled geometric image editing pipeline that separates object transformation, source region inpainting, and target region refinement, with inpainting and refinement handled by FreeFine, a training-free diffusion approach. Result: On the GeoBench benchmark, FreeFine outperforms existing methods in image fidelity and edit precision, especially under complex transformations. Conclusion: FreeFine achieves higher-quality geometric image editing, particularly for complex transformations. Abstract: We tackle the task of geometric image editing, where an object within an image is repositioned, reoriented, or reshaped while preserving overall scene coherence. Previous diffusion-based editing methods often attempt to handle all relevant subtasks in a single step, proving difficult when transformations become large or structurally complex. We address this by proposing a decoupled pipeline that separates object transformation, source region inpainting, and target region refinement. Both inpainting and refinement are implemented using a training-free diffusion approach, FreeFine. In experiments on our new GeoBench benchmark, which contains both 2D and 3D editing scenarios, FreeFine outperforms state-of-the-art alternatives in image fidelity and edit precision, especially under demanding transformations. Code and benchmark are available at: https://github.com/CIawevy/FreeFine

[101] ST-SAM: SAM-Driven Self-Training Framework for Semi-Supervised Camouflaged Object Detection

Xihang Hu,Fuming Sun,Jiazhe Liu,Feilong Xu,Xiaoli Zhang

Main category: cs.CV

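TL;DR: ST-SAM is an efficient semi-supervised camouflaged object detection framework that achieves state-of-the-art performance with only 1% labeled data through a self-training strategy and a hybrid prompt mechanism.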

Details

Motivation: Existing SSCOD methods based on teacher-student frameworks suffer from severe prediction bias and error propagation under limited supervision, and their multi-network architectures incur high computational overhead and limited scalability. Method: ST-SAM uses a self-training strategy that dynamically filters and expands high-confidence pseudo-labels, and converts the pseudo-labels into hybrid prompts containing domain-specific knowledge so that the Segment Anything Model can be leveraged for the specialized task. Result: Experiments show that on COD benchmark datasets ST-SAM surpasses existing SSCOD methods using only 1% labeled data and even rivals fully supervised methods. Conclusion: ST-SAM is an efficient semi-supervised camouflaged object detection framework that achieves state-of-the-art performance with only 1% labeled data and does not rely on specific models or loss functions. Abstract: Semi-supervised Camouflaged Object Detection (SSCOD) aims to reduce reliance on costly pixel-level annotations by leveraging limited annotated data and abundant unlabeled data. However, existing SSCOD methods based on Teacher-Student frameworks suffer from severe prediction bias and error propagation under scarce supervision, while their multi-network architectures incur high computational overhead and limited scalability. To overcome these limitations, we propose ST-SAM, a highly annotation-efficient yet concise framework that breaks away from conventional SSCOD constraints. Specifically, ST-SAM employs a Self-Training strategy that dynamically filters and expands high-confidence pseudo-labels to enhance a single-model architecture, thereby fundamentally circumventing inter-model prediction bias. Furthermore, by transforming pseudo-labels into hybrid prompts containing domain-specific knowledge, ST-SAM effectively harnesses the Segment Anything Model's potential for specialized tasks to mitigate error accumulation in self-training. Experiments on COD benchmark datasets demonstrate that ST-SAM achieves state-of-the-art performance with only 1% labeled data, outperforming existing SSCOD methods and even matching fully supervised methods. Remarkably, ST-SAM requires training only a single network, without relying on specific models or loss functions. This work establishes a new paradigm for annotation-efficient SSCOD. Codes will be available at https://github.com/hu-xh/ST-SAM.

[102] PriorFusion: Unified Integration of Priors for Robust Road Perception in Autonomous Driving

Xuewei Tang,Mengmeng Yang,Tuopu Wen,Peijin Jia,Le Cui,Mingshang Luo,Kehua Sheng,Bo Zhang,Diange Yang,Kun Jiang

Main category: cs.CV

TL;DR: PriorFusion enhances road element perception for autonomous driving by integrating semantic, geometric, and generative priors, leading to more accurate and coherent predictions.

Details

Motivation: Existing road perception approaches struggle in complex environments due to limited use of structured priors, leading to irregular and inaccurate predictions. Method: PriorFusion uses an instance-aware attention mechanism, a data-driven shape template space, and a diffusion-based framework to incorporate prior knowledge for road element perception. Result: Experiments showed that PriorFusion significantly improves perception accuracy, especially under challenging conditions, with more regular and coherent predictions compared to existing methods. Conclusion: PriorFusion is an effective framework for road element perception that generates accurate and coherent predictions by integrating semantic, geometric, and generative priors. Abstract: With the growing interest in autonomous driving, there is an increasing demand for accurate and reliable road perception technologies. In complex environments without high-definition map support, autonomous vehicles must independently interpret their surroundings to ensure safe and robust decision-making. However, these scenarios pose significant challenges due to the large number, complex geometries, and frequent occlusions of road elements. A key limitation of existing approaches lies in their insufficient exploitation of the structured priors inherently present in road elements, resulting in irregular, inaccurate predictions. To address this, we propose PriorFusion, a unified framework that effectively integrates semantic, geometric, and generative priors to enhance road element perception. We introduce an instance-aware attention mechanism guided by shape-prior features, then construct a data-driven shape template space that encodes low-dimensional representations of road elements, enabling clustering to generate anchor points as reference priors. We design a diffusion-based framework that leverages these prior anchors to generate accurate and complete predictions. 
Experiments on large-scale autonomous driving datasets demonstrate that our method significantly improves perception accuracy, particularly under challenging conditions. Visualization results further confirm that our approach produces more accurate, regular, and coherent predictions of road elements.

[103] Forgetting of task-specific knowledge in model merging-based continual learning

Timm Hess,Gido M van de Ven,Tinne Tuytelaars

Main category: cs.CV

TL;DR: This paper shows that merging models in continual learning preserves shared knowledge and that incremental training is better than parallel training for effective model merging.

Details

Motivation: The motivation behind this study is to understand how model merging in continual learning affects knowledge preservation and to determine the better approach for merging models—incremental training or parallel training. Method: The authors used controlled visual cues in computer vision experiments to investigate the effects of linear merging of models in continual learning. Result: The results show that merging models in continual learning largely preserves or enhances shared knowledge and that incremental training-based merging outperforms parallel training-based merging. Conclusion: The paper concludes that linear merging of models in continual learning helps preserve or enhance shared knowledge while causing rapid degradation of unshared, task-specific knowledge. Merging models from incremental training outperforms merging models trained in parallel. Abstract: This paper investigates the linear merging of models in the context of continual learning (CL). Using controlled visual cues in computer vision experiments, we demonstrate that merging largely preserves or enhances shared knowledge, while unshared task-specific knowledge rapidly degrades. We further find that merging models from an incremental training process consistently outperforms merging models trained in parallel.
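To make "linear merging" concrete, here is a minimal sketch that takes a weighted average of matching parameters across models. Representing a model as a dict of parameter name to a flat list of floats, and defaulting to uniform weights, are assumptions made for illustration; the paper's exact merging procedure is not specified in this summary.

```python
def linear_merge(state_dicts, weights=None):
    """Weighted average of matching parameters across several models.

    Each model is a dict mapping a parameter name to a flat list of floats
    (a stand-in for real tensors). Weights default to a uniform average.
    """
    n = len(state_dicts)
    if weights is None:
        weights = [1.0 / n] * n
    merged = {}
    for name, reference in state_dicts[0].items():
        merged[name] = [
            sum(w * sd[name][i] for w, sd in zip(weights, state_dicts))
            for i in range(len(reference))
        ]
    return merged


# Two hypothetical task-specific models sharing one parameter vector "w".
model_a = {"w": [1.0, 2.0]}
model_b = {"w": [3.0, 6.0]}
merged_model = linear_merge([model_a, model_b])
print(merged_model)
```

Under this view, the difference between the two settings compared in the paper lies in where the input models come from (successive checkpoints of one incremental run versus independently trained models), not in the averaging step itself.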

[104] The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models

Alfio Ferrara,Sergio Picascia,Elisabetta Rocchetti

Main category: cs.CV

TL;DR: This paper studies how diffusion-based text-to-image models represent the concepts of content and style when generating artworks.

Details

Motivation: Text-to-image diffusion models excel at generating artistic content, but how they internally represent concepts such as content and style remains unclear. Method: Cross-attention heatmaps are used to attribute pixels in generated images to specific prompt tokens, isolating the image regions influenced by content-describing and style-describing tokens. Result: The study finds that in many cases content tokens mainly influence object-related regions, while style tokens affect background and texture regions. Conclusion: Diffusion models show varying degrees of content-style separation when generating artworks, suggesting some understanding of the content-style distinction. Abstract: Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at https://github.com/umilISLab/artistic-prompt-interpretation.

[105] Impact of Hyperparameter Optimization on the Accuracy of Lightweight Deep Learning Models for Real-Time Image Classification

Vineet Kumar Rakesh,Soumya Mazumdar,Tapas Samanta,Sarbajit Pal,Amitabha Das

Main category: cs.CV

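TL;DR: This paper studies how hyperparameter tuning affects the accuracy and convergence behavior of lightweight deep learning models, and shows that RepVGG-A2 delivers efficient performance in real-time image classification.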

Details

Motivation: Lightweight convolutional and Transformer-based models are essential for real-time image classification in resource-constrained environments, which motivates studying how hyperparameter tuning affects model performance. Method: Seven efficient deep learning architectures are trained on the ImageNet-1K dataset, and a systematic ablation study evaluates the effect of different hyperparameters on performance. Result: Cosine learning rate decay and adjustable batch size can substantially improve accuracy and convergence speed while keeping latency and memory cost low; RepVGG-A2 achieves over 80% Top-1 accuracy. Conclusion: RepVGG-A2 offers a good balance between accuracy and deployment cost and is suitable for real-time image processing pipelines. Abstract: Lightweight convolutional and transformer-based models have become vital for real-time image classification in resource-constrained applications, such as embedded systems and edge devices. This work analyzes the influence of hyperparameter adjustment on the accuracy and convergence behavior of seven efficient deep learning architectures: EfficientNetV2-S, ConvNeXt-T, MobileViT v2 (XXS/XS/S), MobileNetV3-L, TinyViT-21M, and RepVGG-A2. All models are trained on the ImageNet-1K dataset under consistent training settings, with an emphasis on real-time practicality. A comprehensive ablation study is undertaken to separate the effect of critical hyperparameters, including learning rate schedules, batch sizes, input resolution, data augmentation, regularization approaches, and optimizer choice. To assess suitability for real-time applications, each model is evaluated not only in terms of Top-1 and Top-5 classification accuracy, but also in terms of inference time, parameter count, model size, and frames-per-second (FPS) on a GPU-accelerated edge deployment simulation. Results demonstrate that cosine learning rate decay and adjustable batch size may greatly boost both accuracy and convergence speed, while keeping latency and memory cost low. Notably, RepVGG-A2 achieves over 80% Top-1 accuracy with efficient inference performance, offering a compelling balance between accuracy and deployment cost for VGG-style models. The results give practical guidance for constructing resource-efficient deep learning models appropriate for real-time image processing pipelines. All code and training logs are publicly accessible at https://github.com/VineetKumarRakesh/lcnn-opt.
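For readers unfamiliar with cosine learning rate decay, the following is a minimal sketch of the standard schedule. The function name, the absence of warmup, and the example values (`base_lr=0.1`, 100 steps) are assumptions for illustration; the summary does not state the exact schedule or settings used in the study.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate at `step` under cosine decay from base_lr to min_lr."""
    # Fraction of training completed, clamped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical schedule: start, midpoint, and end of a 100-step run.
schedule = [cosine_lr(s, 100, base_lr=0.1) for s in (0, 50, 100)]
print(schedule)
```

The rate starts at `base_lr`, passes through roughly half of it at the midpoint, and reaches `min_lr` at the end, with the slowest change at the two ends of training.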

[106] FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning

Jiajun Cao,Qizhe Zhang,Peidong Jia,Xuhui Zhao,Bo Lan,Xiaoan Zhang,Xiaobao Wei,Sixiang Chen,Zhuo Li,Yang Wang,Liyun Li,Xianming Liu,Ming Lu,Shanghang Zhang

Main category: cs.CV

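TL;DR: FastDriveVLA is a reconstruction-based visual token pruning framework designed for autonomous driving that prioritizes foreground information, achieving efficient computation and state-of-the-art performance.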

Details

Motivation: Current visual token pruning methods perform poorly in autonomous driving scenarios, whereas human drivers focus on relevant foreground regions while driving, so retaining visual tokens that contain foreground information is essential for effective decision-making. Method: FastDriveVLA includes a plug-and-play visual token pruner, ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction. ReconPruner is trained with an adversarial foreground-background reconstruction strategy. Result: The method achieves state-of-the-art results on the nuScenes closed-loop planning benchmark across different pruning ratios. Conclusion: FastDriveVLA delivers excellent visual token pruning in autonomous driving scenarios and can be applied to different VLA models without retraining. Abstract: Vision-Language-Action (VLA) models have demonstrated significant potential in complex scene understanding and action reasoning, leading to their increasing adoption in end-to-end autonomous driving systems. However, the long visual tokens of VLA models greatly increase computational costs. Current visual token pruning methods in Vision-Language Models (VLM) rely on either visual token similarity or visual-text attention, but both have shown poor performance in autonomous driving scenarios. Given that human drivers concentrate on relevant foreground areas while driving, we assert that retaining visual tokens containing this foreground information is essential for effective decision-making. Inspired by this, we propose FastDriveVLA, a novel reconstruction-based vision token pruning framework designed specifically for autonomous driving. FastDriveVLA includes a plug-and-play visual token pruner called ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Once trained, ReconPruner can be seamlessly applied to different VLA models with the same visual encoder without retraining. To train ReconPruner, we also introduce a large-scale dataset called nuScenes-FG, consisting of 241K image-mask pairs with annotated foreground regions. Our approach achieves state-of-the-art results on the nuScenes closed-loop planning benchmark across different pruning ratios.
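The pruning step itself can be sketched generically: given a saliency score per visual token, keep the highest-scoring fraction and preserve the original token order. This is only an illustration of score-based token pruning; in FastDriveVLA the scores would come from the trained ReconPruner, whereas the scores, token labels, and `keep_ratio` below are hypothetical.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving input order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    # Number of tokens to retain (at least one).
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k best-scoring tokens, then restored to positional order.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]


# Hypothetical tokens labelled by what they cover, with made-up foreground scores.
tokens = ["sky", "car", "road", "tree"]
scores = [0.1, 0.9, 0.7, 0.2]
kept = prune_tokens(tokens, scores, keep_ratio=0.5)
print(kept)
```

Restoring positional order after selection matters because downstream layers typically rely on the spatial ordering of the surviving tokens.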

[107] FASTopoWM: Fast-Slow Lane Segment Topology Reasoning with Latent World Models

Yiming Yang,Hongbin Lin,Yueru Luo,Suzhong Fu,Chao Zheng,Xinrui Yan,Shuqi Mei,Kun Tang,Shuguang Cui,Zhen Li

Main category: cs.CV

TL;DR: FASTopoWM improves lane segment topology reasoning by combining a fast-slow framework with world models, surpassing existing methods in performance on OpenLane-V2.

Details

Motivation: Existing lane topology reasoning methods struggle to use temporal information effectively. The proposed method aims to overcome limitations such as over-reliance on historical queries and vulnerability to pose estimation failures. Method: FASTopoWM uses a fast-slow framework with latent query and BEV world models conditioned on the action latent to propagate state representations, addressing pose estimation issues and enhancing temporal perception. Result: On the OpenLane-V2 benchmark, FASTopoWM outperformed state-of-the-art methods with 37.4% mAP for lane segment detection and 46.3% OLS for centerline perception. Conclusion: FASTopoWM is a new framework for lane segment topology reasoning that improves detection and reasoning performance by incorporating latent world models and parallel supervision. Abstract: Lane segment topology reasoning provides comprehensive bird's-eye view (BEV) road scene understanding, which can serve as a key perception module in planning-oriented end-to-end autonomous driving systems. Existing lane topology reasoning methods often fall short in effectively leveraging temporal information to enhance detection and reasoning performance. Recently, the stream-based temporal propagation method has demonstrated promising results by incorporating temporal cues at both the query and BEV levels. However, it remains limited by over-reliance on historical queries, vulnerability to pose estimation failures, and insufficient temporal propagation. To overcome these limitations, we propose FASTopoWM, a novel fast-slow lane segment topology reasoning framework augmented with latent world models. To reduce the impact of pose estimation failures, this unified framework enables parallel supervision of both historical and newly initialized queries, facilitating mutual reinforcement between the fast and slow systems.
Furthermore, we introduce latent query and BEV world models conditioned on the action latent to propagate the state representations from past observations to the current timestep. This design substantially improves the performance of temporal perception within the slow pipeline. Extensive experiments on the OpenLane-V2 benchmark demonstrate that FASTopoWM outperforms state-of-the-art methods in both lane segment detection (37.4% vs. 33.6% on mAP) and centerline perception (46.3% vs. 41.5% on OLS).

[108] Learning Semantic Directions for Feature Augmentation in Domain-Generalized Medical Segmentation

Yingkai Wang,Yaoyao Zhu,Xiuding Cai,Yuhao Xiao,Haotian Wu,Yu Yao

Main category: cs.CV

TL;DR: This paper proposes a domain generalization framework for medical image segmentation that improves robustness to domain-specific variations by modulating features while preserving anatomical consistency, outperforming existing methods on multi-center benchmarks.

Details

Motivation: Domain shifts caused by variations in imaging conditions, scanner types, and acquisition protocols degrade the performance of segmentation models, limiting their practical deployment in clinical settings. Method: The method introduces implicit feature perturbations guided by domain statistics using a learnable semantic direction selector and a covariance-based semantic intensity sampler. Additionally, an adaptive consistency constraint is applied to stabilize feature selection. Result: Extensive experiments on two public multi-center benchmarks show that the proposed framework consistently outperforms existing domain generalization approaches in segmentation performance. Conclusion: The proposed domain generalization framework enhances the robustness and reliability of medical image segmentation across diverse clinical domains by modulating domain-variant features while preserving anatomical consistency. Abstract: Medical image segmentation plays a crucial role in clinical workflows, but domain shift often leads to performance degradation when models are applied to unseen clinical domains. This challenge arises due to variations in imaging conditions, scanner types, and acquisition protocols, limiting the practical deployment of segmentation models. Unlike natural images, medical images typically exhibit consistent anatomical structures across patients, with domain-specific variations mainly caused by imaging conditions. This unique characteristic makes medical image segmentation particularly challenging. To address this challenge, we propose a domain generalization framework tailored for medical image segmentation. Our approach improves robustness to domain-specific variations by introducing implicit feature perturbations guided by domain statistics. Specifically, we employ a learnable semantic direction selector and a covariance-based semantic intensity sampler to modulate domain-variant features while preserving task-relevant anatomical consistency. 
Furthermore, we design an adaptive consistency constraint that is selectively applied only when feature adjustment leads to degraded segmentation performance. This constraint encourages the adjusted features to align with the original predictions, thereby stabilizing feature selection and improving the reliability of the segmentation. Extensive experiments on two public multi-center benchmarks show that our framework consistently outperforms existing domain generalization approaches, achieving robust and generalizable segmentation performance across diverse clinical domains.

Qiang Lu,Waikit Xiu,Xiying Li,Shenyu Hu,Shengbo Sun

Main category: cs.CV

TL;DR: This paper proposes a novel framework for traffic sign recognition that combines open-vocabulary detection and cross-modal learning, effectively addressing challenges such as long-tail data distribution and multi-scale target recognition, achieving superior performance on benchmark datasets.

Details

Motivation: The motivation stems from the challenges faced by current traffic sign recognition technologies, including the long-tail distribution of datasets and the difficulty in recognizing small and multi-scale targets in real-world scenarios. Method: The method involves a two-stage framework: first, the NanoVerse YOLO model with RepVL-PAN and SPD-Conv modules for improved detection of small and multi-scale traffic signs; second, the TSR-MCL model that uses cross-modal contrastive learning between visual and semantic features to enhance classification robustness. Result: On the TT100K dataset, the proposed method achieved a state-of-the-art 78.4% mAP for long-tail detection, 91.8% accuracy, and 88.9% recall, outperforming mainstream algorithms in accuracy and generalization. Conclusion: The paper concludes that the proposed two-stage framework combining open-vocabulary detection and cross-modal learning significantly improves traffic sign recognition performance in challenging conditions, such as long-tail distributions and multi-scale targets. Abstract: Traffic sign recognition, as a core component of autonomous driving perception systems, directly influences vehicle environmental awareness and driving safety. Current technologies face two significant challenges: first, the traffic sign dataset exhibits a pronounced long-tail distribution, resulting in a substantial decline in recognition performance of traditional convolutional networks when processing low-frequency and out-of-distribution classes; second, traffic signs in real-world scenarios are predominantly small targets with significant scale variations, making it difficult to extract multi-scale features. To overcome these issues, we propose a novel two-stage framework combining open-vocabulary detection and cross-modal learning.
For traffic sign detection, our NanoVerse YOLO model integrates a reparameterizable vision-language path aggregation network (RepVL-PAN) and an SPD-Conv module to specifically enhance feature extraction for small, multi-scale targets. For traffic sign classification, we designed a Traffic Sign Recognition Multimodal Contrastive Learning model (TSR-MCL). By contrasting visual features from a Vision Transformer with semantic features from a rule-based BERT, TSR-MCL learns robust, frequency-independent representations, effectively mitigating class confusion caused by data imbalance. On the TT100K dataset, our method achieves a state-of-the-art 78.4% mAP in the long-tail detection task for all-class recognition. The model also obtains 91.8% accuracy and 88.9% recall, significantly outperforming mainstream algorithms and demonstrating superior accuracy and generalization in complex, open-world scenarios.

[110] MagicRoad: Semantic-Aware 3D Road Surface Reconstruction via Obstacle Inpainting

Xingyue Peng,Yuandong Lyu,Lang Zhang,Jian Zhu,Songtao Wang,Jiaxin Deng,Songxin Lu,Weiliang Ma,Dangen She,Peng Jia,XianPeng Lang

Main category: cs.CV

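TL;DR: This paper proposes a new road surface reconstruction framework that integrates occlusion-aware and semantic-guided techniques to handle dynamic occlusions and environmental changes, improving reconstruction quality.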

Details

Motivation: Existing methods based on mesh rendering or 3D Gaussian splatting achieve promising results under clean, static conditions, but remain fragile under occlusions from dynamic agents, visual clutter from static obstacles, and appearance degradation caused by lighting and weather changes. Method: The framework integrates occlusion-aware 2D Gaussian surfels with semantic-guided color enhancement, uses a planar-adapted Gaussian representation for efficient large-scale modeling, applies segmentation-guided video inpainting to remove dynamic and static foreground objects, and improves color consistency through semantic-aware correction in HSV space. Result: Extensive experiments on urban-scale datasets show that the framework produces visually coherent and geometrically faithful reconstructions. Conclusion: The paper proposes a robust road surface reconstruction framework that significantly outperforms prior methods under real-world conditions. Abstract: Road surface reconstruction is essential for autonomous driving, supporting centimeter-accurate lane perception and high-definition mapping in complex urban environments. While recent methods based on mesh rendering or 3D Gaussian splatting (3DGS) achieve promising results under clean and static conditions, they remain vulnerable to occlusions from dynamic agents, visual clutter from static obstacles, and appearance degradation caused by lighting and weather changes. We present a robust reconstruction framework that integrates occlusion-aware 2D Gaussian surfels with semantic-guided color enhancement to recover clean, consistent road surfaces. Our method leverages a planar-adapted Gaussian representation for efficient large-scale modeling, employs segmentation-guided video inpainting to remove both dynamic and static foreground objects, and enhances color coherence via semantic-aware correction in HSV space. Extensive experiments on urban-scale datasets demonstrate that our framework produces visually coherent and geometrically faithful reconstructions, significantly outperforming prior methods under real-world conditions.

[111] The Impact of Image Resolution on Face Detection: A Comparative Analysis of MTCNN, YOLOv XI and YOLOv XII models

Ahmet Can Ömercikoğlu,Mustafa Mansur Yönügül,Pakize Erdoğmuş

Main category: cs.CV

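TL;DR: This study systematically analyzes how input resolution affects the performance of the YOLOv11, YOLOv12, and MTCNN face detectors, finding that YOLOv11 is most accurate at high resolution while MTCNN falls short on real-time performance.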

Details

Motivation: Real-world conditions such as low-resolution imagery significantly degrade face detection performance, so the impact of input resolution on deep learning face detectors needs systematic study. Method: Using the WIDER FACE dataset, the three face detectors YOLOv11, YOLOv12, and MTCNN are extensively evaluated at multiple image resolutions (160x160, 320x320, and 640x640). Result: YOLOv11 performs best in accuracy and robustness, YOLOv12 is slightly better in recall, and MTCNN is competitive in landmark localization but slower at inference. Conclusion: YOLOv11 outperforms YOLOv12 and MTCNN in detection accuracy, especially at higher resolutions; YOLOv12 has slightly better recall; MTCNN lags in real-time inference speed. Abstract: Face detection is a crucial component in many AI-driven applications such as surveillance, biometric authentication, and human-computer interaction. However, real-world conditions like low-resolution imagery present significant challenges that degrade detection performance. In this study, we systematically investigate the impact of input resolution on the accuracy and robustness of three prominent deep learning-based face detectors: YOLOv11, YOLOv12, and MTCNN. Using the WIDER FACE dataset, we conduct extensive evaluations across multiple image resolutions (160x160, 320x320, and 640x640) and assess each model's performance using metrics such as precision, recall, mAP50, mAP50-95, and inference time. Results indicate that YOLOv11 outperforms YOLOv12 and MTCNN in terms of detection accuracy, especially at higher resolutions, while YOLOv12 exhibits slightly better recall. MTCNN, although competitive in landmark localization, lags in real-time inference speed. Our findings provide actionable insights for selecting resolution-aware face detection models suitable for varying operational constraints.

[112] Who is a Better Talker: Subjective and Objective Quality Assessment for AI-Generated Talking Heads

Yingjie Zhou,Jiezhang Cao,Zicheng Zhang,Farong Wen,Yanwei Jiang,Jun Jia,Xiaohong Liu,Xiongkuo Min,Guangtao Zhai

Main category: cs.CV

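TL;DR: This paper builds THQA-10K, the largest quality assessment dataset to date for AI-Generated Talking Heads (AGTHs), and proposes an objective quality assessment method based on the first frame, Y-T slice, and tone-lip consistency that performs strongly in AGTH quality assessment. Subjective experiments are also conducted to evaluate talker performance and the distortions of existing AGTHs.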

Details

Motivation: Although AI-Generated Talking Heads (AGTHs) are becoming an emerging form of digital human media, quality issues with these outputs persist and comprehensive studies remain limited. Method: The paper builds THQA-10K, the largest AGTH quality assessment dataset to date, and recruits volunteers to subjectively rate the AGTHs and label their distortion categories. Result: Experimental results show that the proposed objective quality assessment method achieves state-of-the-art performance in AGTH quality assessment. Conclusion: The paper proposes an objective quality assessment method based on the first frame, Y-T slice, and tone-lip consistency that achieves state-of-the-art performance in AGTH quality assessment. Abstract: Speech-driven methods for portraits are figuratively known as "Talkers" because of their capability to synthesize speaking mouth shapes and facial movements. Especially with the rapid development of the Text-to-Image (T2I) models, AI-Generated Talking Heads (AGTHs) have gradually become an emerging digital human media. However, challenges persist regarding the quality of these talkers and the AGTHs they generate, and comprehensive studies addressing these issues remain limited. To address this gap, this paper presents the largest AGTH quality assessment dataset to date, THQA-10K, which selects 12 prominent T2I models and 14 advanced talkers to generate AGTHs for 14 prompts. After excluding instances where AGTH generation is unsuccessful, the THQA-10K dataset contains 10,457 AGTHs. Then, volunteers are recruited to subjectively rate the AGTHs and give the corresponding distortion categories. In our analysis of the subjective experimental results, we evaluate the performance of talkers in terms of generalizability and quality, and also expose the distortions of existing AGTHs. Finally, an objective quality assessment method based on the first frame, Y-T slice and tone-lip consistency is proposed. Experimental results show that this method can achieve state-of-the-art (SOTA) performance in AGTH quality assessment. The work is released at https://github.com/zyj-2000/Talker.

[113] IN45023 Neural Network Design Patterns in Computer Vision Seminar Report, Summer 2025

Radu-Andrei Bourceanu,Neil De La Fuente,Jan Grimm,Andrei Jardan,Andriy Manucharyan,Cornelius Weiss,Roman Pflugfelder

Main category: cs.CV

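TL;DR: By analyzing six representative papers, this report summarizes the evolution of key design patterns in computer vision, covering the development and impact of ResNet, ViT, GANs, LDMs, DINO, and MAE.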

Details

Motivation: To understand how key techniques in computer vision have developed and evolved, the report analyzes representative design patterns and the papers that introduced them. Method: The report analyzes six representative papers, covering ResNet, Vision Transformer (ViT), Generative Adversarial Networks (GANs), Latent Diffusion Models (LDMs), DINO, and Masked Autoencoders (MAE), to trace the evolution of design patterns in computer vision. Result: The report traces the evolution of several key architectures and models in computer vision, spanning image recognition, generative models, and self-supervised learning, and shows how these patterns have driven the field forward. Conclusion: The report summarizes the evolution of key design patterns in computer vision and, through six influential papers, shows how these patterns developed and where they stand today. Abstract: This report analyzes the evolution of key design patterns in computer vision by examining six influential papers. The analysis begins with foundational architectures for image recognition. We review ResNet, which introduced residual connections to overcome the vanishing gradient problem and enable effective training of significantly deeper convolutional networks. Subsequently, we examine the Vision Transformer (ViT), which established a new paradigm by applying the Transformer architecture to sequences of image patches, demonstrating the efficacy of attention-based models for large-scale image recognition. Building on these visual representation backbones, we investigate generative models. Generative Adversarial Networks (GANs) are analyzed for their novel adversarial training process, which challenges a generator against a discriminator to learn complex data distributions. Then, Latent Diffusion Models (LDMs) are covered, which improve upon prior generative methods by performing a sequential denoising process in a perceptually compressed latent space. LDMs achieve high-fidelity synthesis with greater computational efficiency, representing the current state-of-the-art for image generation. Finally, we explore self-supervised learning techniques that reduce dependency on labeled data. DINO is a self-distillation framework in which a student network learns to match the output of a momentum-updated teacher, yielding features with strong k-NN classification performance.
We conclude with Masked Autoencoders (MAE), which utilize an asymmetric encoder-decoder design to reconstruct heavily masked inputs, providing a highly scalable and effective method for pre-training large-scale vision models.
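The "heavily masked inputs" idea can be illustrated with a minimal sketch of random patch masking: only the visible patch indices would be passed to the encoder, and the decoder would reconstruct the masked ones. The 0.75 mask ratio is the commonly cited MAE default and, like the function name and the 16-patch example, is an assumption for illustration and not taken from the report.

```python
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into (visible, masked) sets by random sampling."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# A hypothetical 4x4 grid of patches: the encoder would see only a quarter.
visible, masked = random_patch_mask(16)
print(len(visible), len(masked))
```

Because the encoder processes only the visible subset, its cost scales with the unmasked fraction of the image, which is the source of the asymmetry between encoder and decoder mentioned above.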

[114] Short-LVLM: Compressing and Accelerating Large Vision-Language Models by Pruning Redundant Layers

Ji Ma,Wei Suo,Peng Wang,Yanning Zhang

Main category: cs.CV

TL;DR: This paper introduces Short-LVLM (SVL), a training-free and model-agnostic framework that effectively compresses large vision-language models by focusing on important vision-language tokens and reducing inter-layer feature discrepancies.

Details

Motivation: Large vision-language models (LVLMs) are limited in practical applications due to their massive parameters and high computational cost. While layer pruning has been effective in NLP models, its applicability to LVLMs is unclear due to modality divergence between vision and language. Method: Through empirical analysis, the authors identify the limitations of applying NLP-based layer pruning directly to LVLMs. They propose Short-LVLM (SVL), which focuses on key vision-language tokens and reduces feature gaps between layers to improve compression effectiveness. Result: The proposed Short-LVLM achieves a superior balance between performance and efficiency in compressing LVLMs. It is also training-free, model-agnostic, and highly compatible with existing architectures. Conclusion: Short-LVLM (SVL) is a training-free, model-agnostic, and highly compatible framework that effectively compresses large vision-language models (LVLMs) by utilizing important vision-language tokens and mitigating inter-layer feature gaps. Abstract: Although large vision-language models (LVLMs) have demonstrated impressive capabilities in multi-modal understanding and reasoning, their practical applications are still limited by massive model parameters and high computational costs. Recent efforts from natural language processing (NLP) have shown the effectiveness of layer pruning, offering a plausible training-free compression solution. However, due to the modality divergence between vision and language, it is unclear whether these NLP techniques are still effective in LVLMs. In this paper, we empirically prove that directly applying these layer pruning methods to LVLMs is ineffective. Through extensive experiments, we find that non-essential vision-language (VL) tokens and inter-layer feature gaps pose critical challenges to pruning layers in LVLMs. 
Based on these insights, we propose a novel framework Short-LVLM (SVL) that can utilize important VL tokens and mitigate the layer-wise feature gaps. Notably, Short-LVLM not only achieves a superior trade-off between performance and efficiency but also exhibits several potential advantages, i.e., training-free, model-agnostic, and highly compatible. The code for this work is publicly available at https://github.com/ASGO-MM/Short-LVLM.

[115] VMatcher: State-Space Semi-Dense Local Feature Matching

Ali Youssef

Main category: cs.CV

TL;DR: VMatcher is an efficient feature matching network that combines Mamba and Transformer components and is suited to real-time applications.

Details

Motivation: Existing Transformer-based feature matching methods have high computational complexity, whereas Mamba offers linear complexity with comparable or better performance. This motivates an efficient and effective hybrid approach. Method: VMatcher uses a hybrid architecture that combines Mamba's selective state-space model with the Transformer's attention mechanism. Result: Across multiple configurations, VMatcher sets new benchmark results while remaining efficient, robust, and practical for real-time applications. Conclusion: VMatcher is a semi-dense feature matching method combining Mamba and Transformer that significantly improves computational efficiency while maintaining performance. Abstract: This paper introduces VMatcher, a hybrid Mamba-Transformer network for semi-dense feature matching between image pairs. Learning-based feature matching methods, whether detector-based or detector-free, achieve state-of-the-art performance but depend heavily on the Transformer's attention mechanism, which, while effective, incurs high computational costs due to its quadratic complexity. In contrast, Mamba introduces a Selective State-Space Model (SSM) that achieves comparable or superior performance with linear complexity, offering significant efficiency gains. VMatcher leverages a hybrid approach, integrating Mamba's highly efficient long-sequence processing with the Transformer's attention mechanism. Multiple VMatcher configurations are proposed, including hierarchical architectures, demonstrating their effectiveness in setting new benchmarks efficiently while ensuring robustness and practicality for real-time applications where rapid inference is crucial. Source Code is available at: https://github.com/ayoussf/VMatcher

[116] UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries

Yijie Zhu,Lingsen Zhang,Zitong Yu,Rui Shao,Tao Tan,Liqiang Nie

Main category: cs.CV

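TL;DR: This paper proposes UniEmo, a unified framework for emotional understanding and generation that combines a hierarchical understanding chain, a diffusion model, and a dual feedback mechanism, and performs strongly across multiple tasks.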

Details

Motivation: Emotional understanding and generation are usually complementary, yet conventional methods handle them separately, so a method that improves both at once is needed. Method: The paper proposes the unified UniEmo framework, which comprises a hierarchical emotional understanding chain, an emotional diffusion model, an emotional correlation coefficient, and an emotional condition loss, and uses a data filtering algorithm to improve training. Result: Experiments show that UniEmo significantly outperforms state-of-the-art methods on both emotional understanding and generation, with feedback between the two tasks strengthening each. Conclusion: By combining emotional understanding and generation, UniEmo significantly improves both, and its generation-driven dual feedback mechanism enhances the model's understanding capacity. Abstract: Emotional understanding and generation are often treated as separate tasks, yet they are inherently complementary and can mutually enhance each other. In this paper, we propose UniEmo, a unified framework that seamlessly integrates these two tasks. The key challenge lies in the abstract nature of emotions, necessitating the extraction of visual representations beneficial for both tasks. To address this, we propose a hierarchical emotional understanding chain with learnable expert queries that progressively extracts multi-scale emotional features, thereby serving as a foundational step for unification. Simultaneously, we fuse these expert queries and emotional representations to guide the diffusion model in generating emotion-evoking images. To enhance the diversity and fidelity of the generated emotional images, we further introduce the emotional correlation coefficient and emotional condition loss into the fusion process. This step facilitates fusion and alignment for emotional generation guided by the understanding. In turn, we demonstrate that joint training allows the generation component to provide implicit feedback to the understanding part. Furthermore, we propose a novel data filtering algorithm to select high-quality and diverse emotional images generated by the well-trained model, which are explicitly fed back into the understanding part. Together, these generation-driven dual feedback processes enhance the model's understanding capacity. Extensive experiments show that UniEmo significantly outperforms state-of-the-art methods in both emotional understanding and generation tasks. The code for the proposed method is available at https://github.com/JiuTian-VL/UniEmo.

[117] Multi-Prompt Progressive Alignment for Multi-Source Unsupervised Domain Adaptation

Haoran Chen,Zexiao Wang,Haidong Cao,Zuxuan Wu,Yu-Gang Jiang

Main category: cs.CV

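TL;DR: MP^2A proposes a progressive alignment strategy that addresses the noise and error propagation encountered when using CLIP for unsupervised domain adaptation.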

Details

Motivation: Existing methods try to align domains using all pseudo-labeled data at once, which leads to error propagation and suboptimal feature learning when samples are noisy or hard to classify. In the multi-source setting, differing domain gaps and noise levels across source domains destabilize alignment further. Method: MP^2A starts training on a high-confidence subset of target samples and gradually introduces more challenging samples to refine the model's understanding. Result: MP^2A achieves state-of-the-art performance on three popular UDA benchmarks: ImageCLEF, Office-Home, and the most challenging DomainNet. Conclusion: Through its progressive alignment strategy, MP^2A effectively mitigates confirmation bias and promotes more robust convergence, learning genuinely domain-invariant features. Abstract: Large Vision-Language Models like CLIP have become a powerful foundation for Unsupervised Domain Adaptation due to their strong zero-shot generalization. State-of-the-art methods typically leverage CLIP to generate pseudo-labels for the target domain, then fine-tune the model to learn domain-invariant features. However, these methods attempt to align source and target domains using all pseudo-labeled data simultaneously. This one-shot alignment struggles with noisy, hard-to-classify samples, leading to error propagation and suboptimal feature learning. The problem is even more amplified in the multi-source scenario, where diverse domain gaps and varying noise levels across multiple source domains further destabilize the alignment process. To address this issue, in this work, we propose a progressive alignment strategy for adapting CLIP to an unlabeled downstream task. Our method begins by training the model on a high-confidence subset of target samples, allowing it to first learn a well-aligned representation from the most reliable data. As training progresses, it gradually incorporates more challenging samples, guiding the model to refine its understanding without being overwhelmed by initial label noise. This progressive approach effectively mitigates confirmation bias and promotes a more robust convergence, allowing for the learning of genuinely domain-invariant features. We name our approach MP^2A and test it on three popular UDA benchmarks, namely ImageCLEF, Office-Home, and the most challenging DomainNet. 
Experiments showcase that MP^2A achieves state-of-the-art performance when compared with most recent CLIP-based MS-UDA approaches, demonstrating the effectiveness of our approach.
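
The progressive strategy described in the abstract (train first on high-confidence target samples, then admit harder ones) can be illustrated with a small selection schedule. This is a minimal sketch, not the authors' implementation: the function name `progressive_subsets`, the linear growth schedule, and the `start_fraction` parameter are all assumptions made for illustration.

```python
def progressive_subsets(confidences, num_stages=3, start_fraction=0.3):
    """Return, per training stage, the indices of the pseudo-labeled samples to use.

    Samples are ranked by pseudo-label confidence; each stage keeps a larger
    top-ranked fraction. The linear schedule is a hypothetical choice.
    """
    # Rank sample indices from most to least confident.
    order = sorted(range(len(confidences)), key=lambda i: confidences[i], reverse=True)
    stages = []
    for stage in range(num_stages):
        if num_stages == 1:
            fraction = 1.0
        else:
            fraction = start_fraction + (1.0 - start_fraction) * stage / (num_stages - 1)
        keep = min(len(order), max(1, round(fraction * len(order))))
        stages.append(order[:keep])
    return stages


# Hypothetical pseudo-label confidences for ten target samples.
confidences = [0.95, 0.40, 0.80, 0.99, 0.55, 0.70, 0.30, 0.90, 0.60, 0.85]
stages = progressive_subsets(confidences)
for number, indices in enumerate(stages, start=1):
    print(f"stage {number}: {len(indices)} samples")
```

Each stage is a superset of the previous one, so the model keeps seeing the reliable samples while noisier ones are added later.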

[118] NeRF Is a Valuable Assistant for 3D Gaussian Splatting

Shuangkang Fang,I-Chao Shen,Takeo Igarashi,Yufeng Wang,ZeSheng Wang,Yi Yang,Wenrui Ding,Shuchang Zhou

Main category: cs.CV

TL;DR: This paper proposes NeRF-GS, which combines the strengths of NeRF and 3DGS to improve 3D scene representation and shows that the two are complementary.

Details

Motivation: 3DGS is sensitive to Gaussian initialization, has limited spatial awareness, and has weak inter-Gaussian correlations, whereas NeRF offers a continuous spatial representation. NeRF-GS is proposed to compensate for these weaknesses of 3DGS. Method: The design of 3DGS is revisited so that its spatial features are progressively aligned with NeRF through shared 3D spatial information, and residual vectors for both implicit features and Gaussian positions are optimized to enhance the personalized capabilities of 3DGS. Result: Experiments on benchmark datasets show that NeRF-GS surpasses existing methods and achieves state-of-the-art performance. Conclusion: By jointly optimizing NeRF and 3DGS, NeRF-GS shows that the two are complementary rather than competing, and offers a new perspective on hybrid approaches for efficient 3D scene representation. Abstract: We introduce NeRF-GS, a novel framework that jointly optimizes Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). This framework leverages the inherent continuous spatial representation of NeRF to mitigate several limitations of 3DGS, including sensitivity to Gaussian initialization, limited spatial awareness, and weak inter-Gaussian correlations, thereby enhancing its performance. In NeRF-GS, we revisit the design of 3DGS and progressively align its spatial features with NeRF, enabling both representations to be optimized within the same scene through shared 3D spatial information. We further address the formal distinctions between the two approaches by optimizing residual vectors for both implicit features and Gaussian positions to enhance the personalized capabilities of 3DGS. Experimental results on benchmark datasets show that NeRF-GS surpasses existing methods and achieves state-of-the-art performance. This outcome confirms that NeRF and 3DGS are complementary rather than competing, offering new insights into hybrid approaches that combine 3DGS and NeRF for efficient 3D scene representation.

Wei Li,Xun Gong,Jiao Li,Xiaobin Sun

Main category: cs.CV

TL;DR: This paper proposes Adaptive Grouped Alignment (AGA), a novel framework for learning structured visual-language representations from medical images and reports, achieving strong results without relying on external negative samples.

Details

Motivation: Current vision-language pretraining methods in the medical domain oversimplify clinical reports and rely on large numbers of hard negative samples, which are impractical for small-scale datasets. The authors aim to address these limitations by capturing structured semantics from paired images and reports without requiring external negatives. Method: AGA introduces a bidirectional grouping mechanism based on a sparse similarity matrix, which dynamically forms visual and language groups using threshold gating modules. It computes group representations as weighted averages of similarity scores and employs an Instance Aware Group Alignment loss to align tokens with their group representations. A Bidirectional Cross-modal Grouped Alignment module further enhances fine-grained alignment. Result: Extensive experiments on public and private datasets demonstrate that AGA achieves strong performance on image-text retrieval and classification tasks under both fine-tuning and zero-shot settings. Conclusion: The proposed Adaptive Grouped Alignment (AGA) framework effectively captures structured semantics from paired medical images and reports, achieving strong performance on image-text retrieval and classification tasks in both fine-tuning and zero-shot settings. Abstract: Learning medical visual representations from paired images and reports is a promising direction in representation learning. However, current vision-language pretraining methods in the medical domain often simplify clinical reports into single entities or fragmented tokens, ignoring their inherent structure. In addition, contrastive learning frameworks typically depend on large quantities of hard negative samples, which is impractical for small-scale medical datasets. To tackle these challenges, we propose Adaptive Grouped Alignment (AGA), a new framework that captures structured semantics from paired medical images and reports. 
AGA introduces a bidirectional grouping mechanism based on a sparse similarity matrix. For each image-report pair, we compute fine-grained similarities between text tokens and image patches. Each token selects its top-matching patches to form a visual group, and each patch selects its most related tokens to form a language group. To enable adaptive grouping, we design two threshold gating modules, called Language Grouped Threshold Gate and Vision Grouped Threshold Gate, which learn grouping thresholds dynamically. Group representations are computed as weighted averages based on similarity scores. To align each token with its group representation, we introduce an Instance Aware Group Alignment loss that operates within each image-text pair, removing the need for external negatives. Finally, a Bidirectional Cross-modal Grouped Alignment module is applied to enhance fine-grained alignment between visual and linguistic group representations. Extensive experiments on public and private datasets show that our method achieves strong performance on image-text retrieval and classification tasks under both fine-tuning and zero-shot settings.
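
The bidirectional grouping step described above can be sketched in a few lines. This is an illustrative sketch only: in the paper the grouping thresholds are learned by gating modules, whereas here they are fixed constants, and the names `group_representation` and `bidirectional_groups` as well as the toy features are assumptions.

```python
def group_representation(similarities, features, threshold):
    """Similarity-weighted average of the features whose similarity passes the gate.

    Assumes positive similarities and a positive threshold. Returns None when
    nothing passes the threshold.
    """
    selected = [(s, f) for s, f in zip(similarities, features) if s >= threshold]
    if not selected:
        return None
    total = sum(s for s, _ in selected)
    dim = len(selected[0][1])
    return [sum(s * f[d] for s, f in selected) / total for d in range(dim)]


def bidirectional_groups(sim, token_feats, patch_feats, vision_threshold, language_threshold):
    """sim[t][p] is the similarity between text token t and image patch p."""
    # Each token gathers its best-matching patches into a visual group.
    visual = [group_representation(sim[t], patch_feats, vision_threshold)
              for t in range(len(token_feats))]
    # Each patch gathers its most related tokens into a language group.
    language = [group_representation([row[p] for row in sim], token_feats, language_threshold)
                for p in range(len(patch_feats))]
    return visual, language


# Toy example: 2 tokens, 3 patches, 2-dimensional features (all values made up).
sim = [[0.9, 0.1, 0.6],
       [0.2, 0.8, 0.1]]
token_feats = [[2.0, 0.0], [0.0, 2.0]]
patch_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual, language = bidirectional_groups(sim, token_feats, patch_feats, 0.5, 0.5)
print(visual)
print(language)
```

Because both directions operate on the similarity matrix of a single image-report pair, no samples from other pairs are needed, which matches the abstract's point about avoiding external negatives.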

[120] Out-of-Distribution Detection in Medical Imaging via Diffusion Trajectories

Lemar Abdi,Francisco Caetano,Amaan Valiuddin,Christiaan Viviers,Hamdi Joudeh,Fons van der Sommen

Main category: cs.CV

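TL;DR: The paper proposes a reconstruction-free anomaly detection method for medical imaging that uses the forward diffusion trajectories of a Stein score-based denoising diffusion model to achieve efficient and accurate anomaly scoring.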

Details

Motivation: In medical imaging, unsupervised out-of-distribution (OOD) detection is an attractive way to identify pathological cases with extremely low incidence rates. Existing generative approaches usually rely on likelihood estimation or reconstruction error, but these are computationally expensive, unreliable, and require retraining if the inlier data changes. Method: The paper proposes a reconstruction-free OOD detection method that uses the forward diffusion trajectories of a Stein score-based denoising diffusion model (SBDDM). Trajectory curvature is captured through the estimated Stein score, and accurate anomaly scoring needs only five diffusion steps. Result: The method performs well on multiple Near-OOD and Far-OOD benchmarks and substantially reduces computational cost during inference. Compared with existing methods, SBDDM achieves relative improvements of up to 10.43% and 18.10% for Near-OOD and Far-OOD detection, respectively. Conclusion: The Stein score-based denoising diffusion model (SBDDM) performs strongly on OOD detection in medical imaging and provides a practical building block for real-time, reliable computer-aided diagnosis. Abstract: In medical imaging, unsupervised out-of-distribution (OOD) detection offers an attractive approach for identifying pathological cases with extremely low incidence rates. In contrast to supervised methods, OOD-based approaches function without labels and are inherently robust to data imbalances. Current generative approaches often rely on likelihood estimation or reconstruction error, but these methods can be computationally expensive, unreliable, and require retraining if the inlier data changes. These limitations hinder their ability to distinguish nominal from anomalous inputs efficiently, consistently, and robustly. We propose a reconstruction-free OOD detection method that leverages the forward diffusion trajectories of a Stein score-based denoising diffusion model (SBDDM). By capturing trajectory curvature via the estimated Stein score, our approach enables accurate anomaly scoring with only five diffusion steps. A single SBDDM pre-trained on a large, semantically aligned medical dataset generalizes effectively across multiple Near-OOD and Far-OOD benchmarks, achieving state-of-the-art performance while drastically reducing computational cost during inference. Compared to existing methods, SBDDM achieves a relative improvement of up to 10.43% and 18.10% for Near-OOD and Far-OOD detection, respectively, making it a practical building block for real-time, reliable computer-aided diagnosis.

[121] Honey Adulteration Detection using Hyperspectral Imaging and Machine Learning

Mokhtar A. Al-Awadhi,Ratnadeep R. Deshmukh

Main category: cs.CV

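TL;DR: The paper develops a machine learning-based honey adulteration detection system that uses hyperspectral imaging data and achieves high-accuracy detection with LDA and KNN models.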

Details

Motivation: The goal is to develop a machine learning-based method for automatically detecting honey adulteration that can replace conventional chemical tests and improve detection efficiency and accuracy. Method: The paper uses a two-step approach based on Linear Discriminant Analysis (LDA) and K-Nearest Neighbors (KNN): it first classifies the botanical origin of the honey sample, then identifies and quantifies the level of sugar syrup adulteration. Result: The system performs well on a public honey hyperspectral image dataset, with accuracy reaching 96.39%. Conclusion: The proposed machine learning system detects honey adulteration effectively, with an overall cross-validation accuracy of 96.39%, and can serve as an alternative to current chemical-based detection methods. Abstract: This paper aims to develop a machine learning-based system for automatically detecting honey adulteration with sugar syrup, based on honey hyperspectral imaging data. First, the floral source of a honey sample is classified by a botanical origin identification subsystem. Then, the sugar syrup adulteration is identified, and its concentration is quantified by an adulteration detection subsystem. Both subsystems consist of two steps. The first step involves extracting relevant features from the honey sample using Linear Discriminant Analysis (LDA). In the second step, we utilize the K-Nearest Neighbors (KNN) model to classify the honey botanical origin in the first subsystem and identify the adulteration level in the second subsystem. We assess the proposed system performance on a public honey hyperspectral image dataset. The result indicates that the proposed system can detect adulteration in honey with an overall cross-validation accuracy of 96.39%, making it an appropriate alternative to the current chemical-based detection methods.

[122] Beyond Linear Bottlenecks: Spline-Based Knowledge Distillation for Culturally Diverse Art Style Classification

Abdellah Zakaria Sellam,Salah Eddine Bekhouche,Cosimo Distante,Abdelmalik Taleb-Ahmed

Main category: cs.CV

TL;DR: This paper proposes an improved dual-teacher self-supervised framework using Kolmogorov-Arnold Networks (KANs) to better capture complex stylistic features in art classification, achieving higher accuracy on benchmark datasets.

Details

Motivation: The motivation stems from the challenge of art style classification due to limited labeled datasets and the inability of existing models to effectively capture complex style features and global context. Method: The researchers modified a dual-teacher self-supervised learning framework by incorporating Kolmogorov-Arnold Networks (KANs) to better capture global stylistic hierarchies and nonlinear feature interactions. Result: Experiments on the WikiArt and Pandora18k datasets showed that the proposed method outperforms the baseline dual-teacher architecture in Top-1 accuracy, particularly due to KANs' ability to model nonlinear correlations. Conclusion: The study concludes that using KANs within a dual-teacher framework enhances the modeling of nonlinear feature correlations, leading to improved performance in art style classification. Abstract: Art style classification remains a formidable challenge in computational aesthetics due to the scarcity of expertly labeled datasets and the intricate, often nonlinear interplay of stylistic elements. While recent dual-teacher self-supervised frameworks reduce reliance on labeled data, their linear projection layers and localized focus struggle to model global compositional context and complex style-feature interactions. We enhance the dual-teacher knowledge distillation framework to address these limitations by replacing conventional MLP projection and prediction heads with Kolmogorov-Arnold Networks (KANs). Our approach retains complementary guidance from two teacher networks, one emphasizing localized texture and brushstroke patterns, the other capturing broader stylistic hierarchies while leveraging KANs' spline-based activations to model nonlinear feature correlations with mathematical precision. Experiments on WikiArt and Pandora18k demonstrate that our approach outperforms the base dual teacher architecture in Top-1 accuracy. 
Our findings highlight the importance of KANs in disentangling complex style manifolds, leading to better linear probe accuracy than MLP projections.

[123] Adjustable Spatio-Spectral Hyperspectral Image Compression Network

Martin Hermann Paul Fuchs,Behnood Rasti,Begüm Demir

Main category: cs.CV

TL;DR: This paper proposes HyCASS, a learning-based model for adjustable hyperspectral image compression in both spectral and spatial dimensions. The model captures both short-range and long-range redundancies and provides guidelines for balancing spectral and spatial compression at different compression ratios.

Details

Motivation: The rapid growth of hyperspectral data archives necessitates efficient storage, and there is a lack of comprehensive studies on the effects of spectral and spatial compression on HSI compression performance. Method: The paper proposes HyCASS, a learning-based model for HSI compression in both spectral and spatial dimensions. It consists of six modules, including spectral encoder, spatial encoder, CR adapter encoder, CR adapter decoder, spatial decoder, and spectral decoder, which use convolutional layers and transformer blocks to capture redundancies. Result: Experimental results on two HSI benchmark datasets demonstrate the effectiveness of the proposed model compared to existing methods. Conclusion: HyCASS can effectively balance spectral and spatial compression across different CRs, and the proposed model is efficient compared to existing learning-based compression models. Abstract: With the rapid growth of hyperspectral data archives in remote sensing (RS), the need for efficient storage has become essential, driving significant attention toward learning-based hyperspectral image (HSI) compression. However, a comprehensive investigation of the individual and joint effects of spectral and spatial compression on learning-based HSI compression has not been thoroughly examined yet. Conducting such an analysis is crucial for understanding how the exploitation of spectral, spatial, and joint spatio-spectral redundancies affects HSI compression. To address this issue, we propose Adjustable Spatio-Spectral Hyperspectral Image Compression Network (HyCASS), a learning-based model designed for adjustable HSI compression in both spectral and spatial dimensions. HyCASS consists of six main modules: 1) spectral encoder; 2) spatial encoder; 3) compression ratio (CR) adapter encoder; 4) CR adapter decoder; 5) spatial decoder; and 6) spectral decoder module. 
The modules employ convolutional layers and transformer blocks to capture both short-range and long-range redundancies. Experimental results on two HSI benchmark datasets demonstrate the effectiveness of our proposed adjustable model compared to existing learning-based compression models. Based on our results, we establish a guideline for effectively balancing spectral and spatial compression across different CRs, taking into account the spatial resolution of the HSIs. Our code and pre-trained model weights are publicly available at https://git.tu-berlin.de/rsim/hycass .

[124] Machine learning and machine learned prediction in chest X-ray images

Shereiff Garrett,Abhinav Adhikari,Sarina Gautam,DaShawn Marquis Morris,Chandra Mani Adhikari

Main category: cs.CV

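TL;DR: This study analyzes two machine learning algorithms applied to 5824 chest X-ray images and shows that DenseNet-121 focuses on the key image regions in its decision-making more accurately than a baseline CNN.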

Details

Motivation: Machine learning and artificial intelligence are fast-growing research fields in which data is used to train algorithms, learn patterns, and make predictions in order to solve complex real-world problems. Method: Two machine learning algorithms, a baseline convolutional neural network (CNN) and DenseNet-121, are implemented and analyzed on 5824 chest X-ray images. Result: Both the baseline CNN and DenseNet-121 perform very well on the binary classification problem presented in the paper. Conclusion: DenseNet-121 focuses on the essential parts of the input chest X-ray images in its decision-making more correctly than the baseline CNN. Abstract: Machine learning and artificial intelligence are fast-growing fields of research in which data is used to train algorithms, learn patterns, and make predictions. This approach helps to solve seemingly intricate problems with significant accuracy without explicit programming by recognizing complex relationships in data. Taking an example of 5824 chest X-ray images, we implement two machine learning algorithms, namely, a baseline convolutional neural network (CNN) and a DenseNet-121, and present our analysis of machine-learned predictions for identifying patients with ailments. Both baseline CNN and DenseNet-121 perform very well in the binary classification problem presented in this work. Gradient-weighted class activation mapping shows that DenseNet-121 correctly focuses on essential parts of the input chest X-ray images in its decision-making more than the baseline CNN.

[125] Mitigating Resolution-Drift in Federated Learning: Case of Keypoint Detection

Taeheon Lim,Joohyung Lee,Kyungjae Lee,Jungchan Cho

Main category: cs.CV

TL;DR: This paper proposes RAF, a resolution-adaptive federated learning method to address resolution drift in non-classification tasks like human pose estimation, demonstrating improved performance and compatibility with existing frameworks.

Details

Motivation: The paper aims to address the underexplored issue of resolution drift in federated learning, which significantly degrades performance in non-classification tasks like human pose estimation. Method: The paper introduces RAF, a resolution-adaptive federated learning method leveraging heatmap-based knowledge distillation between higher-resolution (teachers) and lower-resolution (students) outputs. Result: Extensive experiments and theoretical analysis show that RAF effectively mitigates resolution drift, improves performance, and is generalizable to other tasks requiring spatial detail preservation. Conclusion: The paper concludes that RAF effectively mitigates resolution drift in federated learning for high-resolution representation tasks like human pose estimation and can be integrated into existing FL frameworks. Abstract: The Federated Learning (FL) approach enables effective learning across distributed systems, while preserving user data privacy. To date, research has primarily focused on addressing statistical heterogeneity and communication efficiency, through which FL has achieved success in classification tasks. However, its application to non-classification tasks, such as human pose estimation, remains underexplored. This paper identifies and investigates a critical issue termed "resolution-drift," where performance degrades significantly due to resolution variability across clients. Unlike class-level heterogeneity, resolution drift highlights the importance of resolution as another axis of non-independent and identically distributed (non-IID) data. To address this issue, we present resolution-adaptive federated learning (RAF), a method that leverages heatmap-based knowledge distillation. Through multi-resolution knowledge distillation between higher-resolution outputs (teachers) and lower-resolution outputs (students), our approach enhances resolution robustness without overfitting. 
Extensive experiments and theoretical analysis demonstrate that RAF not only effectively mitigates resolution drift and achieves significant performance improvements, but also can be integrated seamlessly into existing FL frameworks. Furthermore, although this paper focuses on human pose estimation, our t-SNE analysis reveals distinct characteristics between classification and high-resolution representation tasks, supporting the generalizability of RAF to other tasks that rely on preserving spatial detail.

[126] CST Anti-UAV: A Thermal Infrared Benchmark for Tiny UAV Tracking in Complex Scenes

Bin Xie,Congxuan Zhang,Fagan Wang,Peng Liu,Feng Lu,Zhen Chen,Weiming Hu

Main category: cs.CV

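TL;DR: This paper presents the CST Anti-UAV dataset, designed for single object tracking of tiny UAVs in complex scenes, to advance anti-UAV systems and tracking algorithms.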

Details

Motivation: Existing UAV tracking datasets feature conspicuous targets and insufficient scene complexity, which limits their applicability to real-world scenarios. Method: The paper presents CST Anti-UAV, a thermal infrared dataset for single object tracking research that contains 220 video sequences and over 240k manually annotated bounding boxes. Result: Experiments show that the state-of-the-art tracking method reaches only 35.92% accuracy on this dataset, far below the 67.69% observed on the Anti-UAV410 dataset. Conclusion: CST Anti-UAV fills a gap in data for tracking tiny UAVs in complex scenes and provides an important benchmark for future research. Abstract: The widespread application of Unmanned Aerial Vehicles (UAVs) has raised serious public safety and privacy concerns, making UAV perception crucial for anti-UAV tasks. However, existing UAV tracking datasets predominantly feature conspicuous objects and lack diversity in scene complexity and attribute representation, limiting their applicability to real-world scenarios. To overcome these limitations, we present the CST Anti-UAV, a new thermal infrared dataset specifically designed for Single Object Tracking (SOT) in Complex Scenes with Tiny UAVs (CST). It contains 220 video sequences with over 240k high-quality bounding box annotations, highlighting two key properties: a significant number of tiny-sized UAV targets and the diverse and complex scenes. To the best of our knowledge, CST Anti-UAV is the first dataset to incorporate complete manual frame-level attribute annotations, enabling precise evaluations under varied challenges. To conduct an in-depth performance analysis for CST Anti-UAV, we evaluate 20 existing SOT methods on the proposed dataset. Experimental results demonstrate that tracking tiny UAVs in complex environments remains a challenge, as the state-of-the-art method achieves only 35.92% state accuracy, much lower than the 67.69% observed on the Anti-UAV410 dataset. These findings underscore the limitations of existing benchmarks and the need for further advancements in UAV tracking research. The CST Anti-UAV benchmark is about to be publicly released, which not only fosters the development of more robust SOT methods but also drives innovation in anti-UAV systems.

[127] 3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding

Ting Huang,Zeyu Zhang,Hao Tang

Main category: cs.CV

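TL;DR: This paper proposes 3D-R1, a 3D vision-language model trained with the high-quality synthetic dataset Scene-30K and an RLHF strategy, which achieves an average improvement of 10% across multiple 3D scene understanding tasks.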

Details

Motivation: Current 3D VLMs reason and generalize poorly because of limited high-quality spatial data and static viewpoint assumptions, so a new method is needed to address these problems. Method: The authors construct a high-quality synthetic dataset, Scene-30K, train with an RLHF policy such as GRPO, and design three reward functions (a perception reward, a semantic similarity reward, and a format reward) to maintain detection accuracy and answer semantic precision. They also introduce a dynamic view selection strategy. Result: 3D-R1 improves by 10% on average across multiple 3D scene understanding tasks, demonstrating its effectiveness in enhancing reasoning and generalization. Conclusion: 3D-R1 is an effective way to strengthen the reasoning and generalization of 3D VLMs, combining the high-quality synthetic dataset Scene-30K with RLHF training; experiments show an average improvement of 10% across multiple 3D scene understanding tasks. Abstract: Large vision-language models (VLMs) have made significant strides in 2D visual understanding tasks, sparking interest in extending these capabilities to 3D scene understanding. However, current 3D VLMs often struggle with robust reasoning and generalization due to limitations in high-quality spatial data and the static nature of viewpoint assumptions. To address these challenges, we propose 3D-R1, a foundation model that enhances the reasoning capabilities of 3D VLMs. Specifically, we first construct a high-quality synthetic dataset with CoT, named Scene-30K, leveraging existing 3D-VL datasets and a data engine based on Gemini 2.5 Pro. It serves as cold-start initialization data for 3D-R1. Moreover, we leverage RLHF policy such as GRPO in the reinforcement learning training process to enhance reasoning capabilities and introduce three reward functions: a perception reward, a semantic similarity reward and a format reward to maintain detection accuracy and answer semantic precision. Furthermore, we introduce a dynamic view selection strategy that adaptively chooses the most informative perspectives for 3D scene understanding. Extensive experiments demonstrate that 3D-R1 delivers an average improvement of 10% across various 3D scene benchmarks, highlighting its effectiveness in enhancing reasoning and generalization in 3D scene understanding. Code: https://github.com/AIGeeksGroup/3D-R1. Website: https://aigeeksgroup.github.io/3D-R1.
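
The abstract names GRPO and three reward terms but does not say how the terms are combined. The sketch below shows the group-relative advantage normalization that GRPO is generally known for (each sampled answer's reward minus the group mean, divided by the group standard deviation), applied to a summed reward. The equal weights, the reward values, the use of the population standard deviation, and the function names are assumptions for illustration, not details from the paper.

```python
from statistics import mean, pstdev


def total_reward(perception, semantic, fmt, weights=(1.0, 1.0, 1.0)):
    """Combine the three reward terms; the equal weighting is a hypothetical choice."""
    return weights[0] * perception + weights[1] * semantic + weights[2] * fmt


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of answers sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical answers to one prompt: (perception, semantic similarity, format).
rewards = [
    total_reward(0.5, 0.8, 1.0),
    total_reward(0.2, 0.4, 1.0),
    total_reward(0.9, 0.9, 0.0),  # good content, but the output format is wrong
]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because advantages are computed relative to the group, no separate value network is needed to provide a baseline, which is the usual motivation given for GRPO.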

[128] Seeing More with Less: Video Capsule Endoscopy with Multi-Task Learning

Julia Werner,Oliver Bause,Julius Oexle,Maxime Le Floch,Franz Brinkmann,Jochen Hampe,Oliver Bringmann

Main category: cs.CV

TL;DR: The paper proposes a multi-task AI model for video capsule endoscopy, improving battery life and performance in self-localization and anomaly detection within the gastrointestinal tract.

Details

Motivation: The motivation is to address the challenge of short battery life in video capsule endoscopy devices by integrating artificial intelligence for intelligent real-time decision-making. Method: The paper introduces a multi-task neural network combining self-localization and anomaly detection within a single model, using the Galar dataset and Viterbi decoding for time-series analysis. Result: The multi-task neural network achieves an accuracy of 93.63% on localization and 87.48% on anomaly detection, using only 1 million parameters while surpassing current baselines. Conclusion: The paper concludes that integrating AI into video capsule endoscopy helps overcome battery life limitations by enabling intelligent real-time decision-making, reducing energy consumption, and prolonging battery life. Abstract: Video capsule endoscopy has become increasingly important for investigating the small intestine within the gastrointestinal tract. However, a persistent challenge remains the short battery lifetime of such compact sensor edge devices. Integrating artificial intelligence can help overcome this limitation by enabling intelligent real-time decision-making, thereby reducing the energy consumption and prolonging the battery life. However, this remains challenging due to data sparsity and the limited resources of the device restricting the overall model size. In this work, we introduce a multi-task neural network that combines the functionalities of precise self-localization within the gastrointestinal tract with the ability to detect anomalies in the small intestine within a single model. Throughout the development process, we consistently restricted the total number of parameters to ensure the feasibility of deploying such a model in a small capsule. We report the first multi-task results using the recently published Galar dataset, integrating established multi-task methods and Viterbi decoding for subsequent time-series analysis. This outperforms current single-task models and represents a significant advance in AI-based approaches in this field. Our model achieves an accuracy of 93.63% on the localization task and an accuracy of 87.48% on the anomaly detection task. The approach requires only 1 million parameters while surpassing the current baselines.
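
The abstract mentions Viterbi decoding for time-series analysis of the model's outputs. As an illustration of why this helps, here is a minimal sketch of standard Viterbi decoding that smooths noisy per-frame predictions into a consistent sequence. The two states, the transition matrix, and the per-frame probabilities are all made up for the example; the paper's actual state set and parameters are not given in the abstract.

```python
import math


def _log(x):
    return math.log(x) if x > 0 else float("-inf")


def viterbi(frame_probs, transition, initial):
    """Most likely state sequence given per-frame class probabilities.

    frame_probs[t][s]: classifier probability of state s at frame t.
    transition[p][s]:  probability of moving from state p to state s.
    initial[s]:        prior probability of starting in state s.
    """
    n_states = len(initial)
    score = [_log(initial[s]) + _log(frame_probs[0][s]) for s in range(n_states)]
    backpointers = []
    for probs in frame_probs[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: score[p] + _log(transition[p][s]))
            new_score.append(score[best_prev] + _log(transition[best_prev][s]) + _log(probs[s]))
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical states: 0 = stomach, 1 = small intestine. Transitions are "sticky".
transition = [[0.9, 0.1], [0.1, 0.9]]
initial = [0.5, 0.5]
frame_probs = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1],
               [0.2, 0.8], [0.1, 0.9], [0.1, 0.9]]
per_frame = [max(range(2), key=lambda s: p[s]) for p in frame_probs]
decoded = viterbi(frame_probs, transition, initial)
print(per_frame)  # frame-wise argmax, flips briefly at the third frame
print(decoded)    # smoothed sequence with a single transition
```

In this toy input the frame-wise argmax briefly jumps to the second state at the third frame, while the decoded path keeps a single transition, which is the kind of temporal consistency a capsule moving one way through the tract would be expected to show.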

[129] FastPoint: Accelerating 3D Point Cloud Model Inference via Sample Point Distance Prediction

Donghyun Lee,Dawoon Jeong,Jae W. Lee,Hongil Yoon

Main category: cs.CV

TL;DR: FastPoint is a novel software-based acceleration technique that uses distance-trend prediction to process 3D point clouds efficiently.

Details

Motivation: Although deep neural networks have revolutionized 3D point cloud processing, efficiently handling large and irregular point clouds remains challenging. Method: FastPoint exploits the predictability of the distance trend between sampled points, predicting the distance curve to identify subsequent sample points efficiently without exhaustively computing all pairwise distances. Result: Integrating FastPoint into state-of-the-art 3D point cloud models yields a 2.55x end-to-end speedup on an NVIDIA RTX 3090 GPU. Conclusion: FastPoint accelerates end-to-end processing of state-of-the-art 3D point cloud models without sacrificing accuracy. Abstract: Deep neural networks have revolutionized 3D point cloud processing, yet efficiently handling large and irregular point clouds remains challenging. To tackle this problem, we introduce FastPoint, a novel software-based acceleration technique that leverages the predictable distance trend between sampled points during farthest point sampling. By predicting the distance curve, we can efficiently identify subsequent sample points without exhaustively computing all pairwise distances. Our proposal substantially accelerates farthest point sampling and neighbor search operations while preserving sampling quality and model performance. By integrating FastPoint into state-of-the-art 3D point cloud models, we achieve 2.55x end-to-end speedup on NVIDIA RTX 3090 GPU without sacrificing accuracy.
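
For context, here is a sketch of the standard farthest point sampling (FPS) baseline that FastPoint is described as accelerating. It is not FastPoint itself: the prediction step is not specified in the abstract and is not reproduced here. The random point cloud and the function name are illustrative assumptions. The sketch also records the distance at which each sample is picked. That sequence never increases, which is one way to see why the curve is smooth enough to be predicted.

```python
import numpy as np

def farthest_point_sampling(points, num_samples, start_index=0):
    """Baseline FPS: repeatedly pick the point farthest from the chosen set.

    Returns the selected indices and the distance of each newly picked
    point to the previously selected set (the "distance curve").
    """
    points = np.asarray(points, dtype=float)
    selected = [start_index]
    # min_dist[i]: distance from point i to its nearest selected point
    min_dist = np.linalg.norm(points - points[start_index], axis=1)
    pick_distances = []
    for _ in range(num_samples - 1):
        next_index = int(np.argmax(min_dist))
        pick_distances.append(float(min_dist[next_index]))
        selected.append(next_index)
        # This full pass over all points is the cost that grows with cloud size
        new_dist = np.linalg.norm(points - points[next_index], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected, pick_distances

# Hypothetical random point cloud, used only for illustration
rng = np.random.default_rng(0)
cloud = rng.random((500, 3))
indices, curve = farthest_point_sampling(cloud, 32)
print("first picks:", indices[:5])
print("distance curve starts at", round(curve[0], 3), "and ends at", round(curve[-1], 3))
```

Each iteration scans every point to update `min_dist`, so the baseline costs on the order of (number of points) x (number of samples) distance evaluations.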

[130] Stable-Sim2Real: Exploring Simulation of Real-Captured 3D Data with Two-Stage Depth Diffusion

Mutian Xu,Chongjie Ye,Haolin Liu,Yushuang Wu,Jiahao Chang,Xiaoguang Han

Main category: cs.CV

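TL;DR: Stable-Sim2Real is a data-driven 3D data simulation method built on a two-stage depth diffusion model; training on its simulated data improves performance in real-world 3D visual tasks.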

Details

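Motivation: 3D data simulation aims to bridge the gap between simulated and real-captured 3D data. Most existing methods inject predefined physical priors and struggle to capture the full complexity of real data, while progress on learning a data-driven mapping from synthetic to realistic data has stagnated. Method: Stable-Sim2Real uses a two-stage depth diffusion model. The first stage finetunes Stable-Diffusion to generate the residual between paired real and synthetic depth, producing a stable but coarse depth; the second stage takes both the synthetic and first-stage depth as input, with the diffusion loss adjusted to prioritize regions identified by a 3D discriminator. The authors also provide a new benchmark scheme for evaluating 3D data simulation methods. Result: Training with 3D simulated data derived from the method significantly enhances performance in real-world 3D visual tasks, and the evaluation shows high similarity between the simulated data and real-captured patterns. Conclusion: A two-stage depth diffusion model is a viable path for data-driven 3D data simulation that benefits real-world 3D visual tasks. Abstract: 3D data simulation aims to bridge the gap between simulated and real-captured 3D data, which is a fundamental problem for real-world 3D visual tasks. Most 3D data simulation methods inject predefined physical priors but struggle to capture the full complexity of real data. An optimal approach involves learning an implicit mapping from synthetic to realistic data in a data-driven manner, but progress in this solution has met stagnation in recent studies. This work explores a new solution path of data-driven 3D simulation, called Stable-Sim2Real, based on a novel two-stage depth diffusion model. The initial stage finetunes Stable-Diffusion to generate the residual between the real and synthetic paired depth, producing a stable but coarse depth, where some local regions may deviate from realistic patterns. To enhance this, both the synthetic and initial output depth are fed into a second-stage diffusion, where diffusion loss is adjusted to prioritize these distinct areas identified by a 3D discriminator. We provide a new benchmark scheme to evaluate 3D data simulation methods. Extensive experiments show that training the network with the 3D simulated data derived from our method significantly enhances performance in real-world 3D visual tasks. Moreover, the evaluation demonstrates the high similarity between our 3D simulated data and real-captured patterns. Project page: https://mutianxu.github.io/stable-sim2real/.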

[131] Online Estimation of Table-Top Grown Strawberry Mass in Field Conditions with Occlusions

Jinshan Zhen,Yuanyue Ge,Tianxiao Zhu,Hui Zhao,Ya Xiong

Main category: cs.CV

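TL;DR: This study proposes a vision-based pipeline that combines RGB-D sensing and deep learning to handle occlusions and pose variations when estimating strawberry mass under field conditions, achieving accurate, non-destructive mass estimation.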

Details

Motivation: Accurate estimation of strawberry mass under field conditions remains challenging because of frequent occlusions and pose variations, so a non-destructive, real-time, online mass estimation method is needed. Method: The study uses YOLOv8-Seg for instance segmentation, CycleGAN for occluded region completion, and tilt-angle correction to refine the frontal projection area calculation; a polynomial regression model then maps the geometric features to a mass estimate. Result: Experiments show a mean mass estimation error of 8.11% for isolated strawberries and 10.47% for occluded ones. CycleGAN outperformed the LaMa model in occlusion recovery, achieving superior pixel area ratios (PAR) and higher intersection over union (IoU) scores. Conclusion: The study presents a vision pipeline combining RGB-D sensing and deep learning that effectively handles strawberry mass estimation under occlusion and pose variation, offering a robust solution for automated harvesting and yield monitoring. Abstract: Accurate mass estimation of table-top grown strawberries under field conditions remains challenging due to frequent occlusions and pose variations. This study proposes a vision-based pipeline integrating RGB-D sensing and deep learning to enable non-destructive, real-time and online mass estimation. The method employed YOLOv8-Seg for instance segmentation, Cycle-consistent generative adversarial network (CycleGAN) for occluded region completion, and tilt-angle correction to refine frontal projection area calculations. A polynomial regression model then mapped the geometric features to mass. Experiments demonstrated mean mass estimation errors of 8.11% for isolated strawberries and 10.47% for occluded cases. CycleGAN outperformed large mask inpainting (LaMa) model in occlusion recovery, achieving superior pixel area ratios (PAR) (mean: 0.978 vs. 1.112) and higher intersection over union (IoU) scores (92.3% vs. 47.7% in the [0.9-1] range). This approach addresses critical limitations of traditional methods, offering a robust solution for automated harvesting and yield monitoring with complex occlusion patterns.
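
The two occlusion-recovery metrics named in the abstract can be sketched for binary masks as follows. IoU is the standard intersection-over-union. The abstract does not define PAR, so the sketch assumes it is the recovered mask's pixel area divided by the ground-truth mask's pixel area. Under that assumption a value near 1 is best, which matches 0.978 being reported as superior to 1.112. The toy masks are hypothetical.

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, target).sum() / union)

def pixel_area_ratio(pred, target):
    """Assumed PAR definition: recovered pixel area / ground-truth pixel area."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    target_area = target.sum()
    if target_area == 0:
        raise ValueError("ground-truth mask is empty")
    return float(pred.sum() / target_area)

# Hypothetical 4x4 masks: ground truth covers 4 pixels, the recovery covers 6
truth = np.zeros((4, 4), dtype=bool)
truth[1:3, 1:3] = True
recovered = np.zeros((4, 4), dtype=bool)
recovered[1:3, 1:4] = True

iou = mask_iou(recovered, truth)
par = pixel_area_ratio(recovered, truth)
print(f"IoU = {iou:.3f}, PAR = {par:.2f}")
```

The two metrics are complementary: PAR only compares areas, so a mask of the right size in the wrong place can still score near 1, whereas IoU also penalizes misplacement.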

[132] Hyperbolic Cycle Alignment for Infrared-Visible Image Fusion

Timing Li,Bing Cao,Jiahe Feng,Haifang Cao,Qinghau Hu,Pengfei Zhu

Main category: cs.CV

TL;DR: This paper proposes a novel image registration framework, Hy-CycleAlign, that utilizes hyperbolic space to effectively align cross-modal images, leading to superior fusion performance over traditional Euclidean-based methods.

Details

Motivation: Existing image registration methods based on Euclidean space translation struggle with cross-modal misalignment, leading to suboptimal fusion quality. This work aims to improve registration accuracy by exploring alignment in non-Euclidean (specifically hyperbolic) space. Method: The method introduces a dual-path cross-modal cyclic registration framework named Hy-CycleAlign, which includes forward and backward registration networks, and a Hyperbolic Hierarchy Contrastive Alignment (H²CA) module that maps images into hyperbolic space to reduce modality discrepancies. Result: Extensive experiments on misaligned multi-modal images show that the proposed method significantly improves image alignment and fusion quality compared to existing techniques. Conclusion: The proposed Hy-CycleAlign method significantly outperforms existing approaches in both image alignment and fusion by leveraging hyperbolic space for cross-modal registration. Abstract: Image fusion synthesizes complementary information from multiple sources, mitigating the inherent limitations of unimodal imaging systems. Accurate image registration is essential for effective multi-source data fusion. However, existing registration methods, often based on image translation in Euclidean space, fail to handle cross-modal misalignment effectively, resulting in suboptimal alignment and fusion quality. To overcome this limitation, we explore image alignment in non-Euclidean space and propose a Hyperbolic Cycle Alignment Network (Hy-CycleAlign). To the best of our knowledge, Hy-CycleAlign is the first image registration method based on hyperbolic space. It introduces a dual-path cross-modal cyclic registration framework, in which a forward registration network aligns cross-modal inputs, while a backward registration network reconstructs the original image, forming a closed-loop registration structure with geometric consistency. 
Additionally, we design a Hyperbolic Hierarchy Contrastive Alignment (H²CA) module, which maps images into hyperbolic space and imposes registration constraints, effectively reducing interference caused by modality discrepancies. We further analyze image registration in both Euclidean and hyperbolic spaces, demonstrating that hyperbolic space enables more sensitive and effective multi-modal image registration. Extensive experiments on misaligned multi-modal images demonstrate that our method significantly outperforms existing approaches in both image alignment and fusion. Our code will be publicly available.
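
To make the contrast between Euclidean and hyperbolic distances concrete, the sketch below computes the geodesic distance in the Poincaré ball model. The abstract does not state which model of hyperbolic space Hy-CycleAlign uses, so the choice of the Poincaré ball is an assumption made purely for illustration, and the sample points are hypothetical.

```python
import numpy as np

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    sq_u = float(np.dot(u, u))
    sq_v = float(np.dot(v, v))
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = float(np.dot(u - v, u - v))
    return float(np.arccosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))))

# The same Euclidean gap (0.05) measured near the origin and near the boundary
near_origin = poincare_distance([0.0, 0.0], [0.05, 0.0])
near_boundary = poincare_distance([0.90, 0.0], [0.95, 0.0])
print(f"near origin:   {near_origin:.4f}")
print(f"near boundary: {near_boundary:.4f}")
```

Distances grow rapidly toward the boundary of the ball, so a fixed Euclidean offset corresponds to a much larger hyperbolic distance there. This property is commonly cited as making hyperbolic embeddings well suited to hierarchical structure.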

[133] I Am Big, You Are Little; I Am Right, You Are Wrong

David A. Kelly,Akchunya Chanchal,Nathan Blake

Main category: cs.CV

TL;DR: This paper uses minimal sufficient pixel sets to study the decision-making process in image classification models, revealing differences in 'concentration' across architectures and a link between larger pixel sets and misclassification.

Details

Motivation: As machine learning models for image classification become more varied and complex, understanding how these models make decisions is increasingly important. This study aims to provide insight into model decision-making through the concept of 'concentration' of essential pixels. Method: The researchers used minimal sufficient pixel sets to analyze the decision-making process of different vision models, comparing pixel set characteristics such as position, overlap, and size. Result: The analysis identified statistically significant differences in concentration among different model architectures, with ConvNext and EVA showing distinct behavior. Additionally, it was found that misclassified images generally require larger sets of pixels for model decision-making. Conclusion: The study concludes that different vision model architectures exhibit statistically different concentration in terms of pixel set size and position, with ConvNext and EVA differing markedly from others. Misclassified images are associated with larger pixel sets. Abstract: Machine learning for image classification is an active and rapidly developing field. With the proliferation of classifiers of different sizes and different architectures, the problem of choosing the right model becomes more and more important. While we can assess a model's classification accuracy statistically, our understanding of the way these models work is unfortunately limited. In order to gain insight into the decision-making process of different vision models, we propose using minimal sufficient pixel sets to gauge a model's 'concentration': the pixels that capture the essence of an image through the lens of the model. By comparing position, overlap, and size of sets of pixels, we identify that different architectures have statistically different concentration, in both size and position. In particular, ConvNext and EVA models differ markedly from the others. 
We also identify that misclassified images are associated with larger pixel sets than correctly classified images.

[134] ART: Adaptive Relation Tuning for Generalized Relation Prediction

Gopika Sudhakaran,Hikaru Shindo,Patrick Schramowski,Simone Schaub-Meyer,Kristian Kersting,Stefan Roth

Main category: cs.CV

TL;DR: This paper proposes ART, an instruction tuning framework with adaptive sampling for visual relation detection, enabling better generalization and inference on unseen relations.

Details

Motivation: VRD models trained only on detection data lack generalization, and prompt tuning struggles with novel relations. Instruction tuning is proposed as a more effective adaptation method for vision-language models in VRD tasks. Method: The authors introduce ART, which converts VRD datasets into an instruction tuning format and employs an adaptive sampling algorithm to focus on informative relations. They perform relation classification using subject-object boxes as input. Result: ART outperforms baselines, successfully infers unseen relation concepts, and demonstrates practical value in segmenting complex scenes. Conclusion: The proposed ART framework effectively adapts vision-language models for visual relation detection through instruction tuning and adaptive sampling, outperforming existing methods and enabling inference on unseen relation concepts. Abstract: Visual relation detection (VRD) is the task of identifying the relationships between objects in a scene. VRD models trained solely on relation detection data struggle to generalize beyond the relations on which they are trained. While prompt tuning has been used to adapt vision-language models (VLMs) for VRD, it uses handcrafted prompts and struggles with novel or complex relations. We argue that instruction tuning offers a more effective solution by fine-tuning VLMs on diverse instructional data. We thus introduce ART, an Adaptive Relation Tuning framework that adapts VLMs for VRD through instruction tuning and strategic instance selection. By converting VRD datasets into an instruction tuning format and employing an adaptive sampling algorithm, ART directs the VLM to focus on informative relations while maintaining generalizability. Specifically, we focus on the relation classification, where subject-object boxes are given and the model predicts the predicate between them. We tune on a held-in set and evaluate across multiple held-out datasets of varying complexity. 
Our approach strongly improves over its baselines and can infer unseen relation concepts, a capability absent in mainstream VRD methods. We demonstrate ART's practical value by using the predicted relations for segmenting complex scenes.

[135] 3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection

Yung-Hsu Yang,Luigi Piccinelli,Mattia Segu,Siyuan Li,Rui Huang,Yuqian Fu,Marc Pollefeys,Hermann Blum,Zuria Bauer

Main category: cs.CV

TL;DR: The paper introduces 3D-MOOD, an end-to-end 3D Monocular Open-set Object Detector, which achieves state-of-the-art results in both closed-set and open-set settings.

Details

Motivation: The motivation is to address the challenge of applying existing monocular 3D object detection methods to real-world applications that involve new environments and novel object categories. Method: The paper proposes 3D-MOOD, an end-to-end 3D Monocular Open-set Object Detector, which lifts open-set 2D detection into 3D space and designs a canonical image space for efficient cross-dataset training. Result: The results show that 3D-MOOD achieves new state-of-the-art results on both closed-set (Omni3D) and open-set (Omni3D to Argoverse 2, ScanNet) settings. Conclusion: The paper concludes that 3D-MOOD achieves state-of-the-art results in both closed-set and open-set settings for monocular 3D object detection. Abstract: Monocular 3D object detection is valuable for various applications such as robotics and AR/VR. Existing methods are confined to closed-set settings, where the training and testing sets consist of the same scenes and/or object categories. However, real-world applications often introduce new environments and novel object categories, posing a challenge to these methods. In this paper, we address monocular 3D object detection in an open-set setting and introduce the first end-to-end 3D Monocular Open-set Object Detector (3D-MOOD). We propose to lift the open-set 2D detection into 3D space through our designed 3D bounding box head, enabling end-to-end joint training for both 2D and 3D tasks to yield better overall performance. We condition the object queries with geometry prior and overcome the generalization for 3D estimation across diverse scenes. To further improve performance, we design the canonical image space for more efficient cross-dataset training. We evaluate 3D-MOOD on both closed-set settings (Omni3D) and open-set settings (Omni3D to Argoverse 2, ScanNet), and achieve new state-of-the-art results. Code and models are available at royyang0714.github.io/3D-MOOD.

[136] Gaussian Splatting Feature Fields for Privacy-Preserving Visual Localization

Maxime Pietrantoni,Gabriela Csurka,Torsten Sattler

Main category: cs.CV

TL;DR: The paper proposes a new method for accurate and privacy-preserving visual localization using Gaussian Splatting Feature Fields. The method combines an explicit geometry model with an implicit feature field and uses a contrastive framework and 3D structure-informed clustering to learn robust feature representations. The resulting pipelines achieve state-of-the-art performance on real-world datasets.

Details

Motivation: The motivation for this paper is to develop a method for accurate and privacy-preserving visual localization, which involves estimating a camera pose in a known environment. Method: The authors propose Gaussian Splatting Feature Fields (GSFFs), which combines an explicit geometry model (3DGS) with an implicit feature field. They use a contrastive framework to align a 3D scale-aware feature field and a 2D feature encoder in a common embedding space. They also use a 3D structure-informed clustering procedure to regularize the representation learning and convert the features to segmentations for privacy-preserving localization. Result: The authors evaluated their method on multiple real-world datasets and found that their privacy- and non-privacy-preserving localization pipelines achieved state-of-the-art performances. Conclusion: The paper concludes that their proposed method, Gaussian Splatting Feature Fields, achieves state-of-the-art performance in visual localization tasks while also allowing for privacy-preserving localization. Abstract: Visual localization is the task of estimating a camera pose in a known environment. In this paper, we utilize 3D Gaussian Splatting (3DGS)-based representations for accurate and privacy-preserving visual localization. We propose Gaussian Splatting Feature Fields (GSFFs), a scene representation for visual localization that combines an explicit geometry model (3DGS) with an implicit feature field. We leverage the dense geometric information and differentiable rasterization algorithm from 3DGS to learn robust feature representations grounded in 3D. In particular, we align a 3D scale-aware feature field and a 2D feature encoder in a common embedding space through a contrastive framework. Using a 3D structure-informed clustering procedure, we further regularize the representation learning and seamlessly convert the features to segmentations, which can be used for privacy-preserving visual localization. 
Pose refinement, which involves aligning either feature maps or segmentations from a query image with those rendered from the GSFFs scene representation, is used to achieve localization. The resulting privacy- and non-privacy-preserving localization pipelines, evaluated on multiple real-world datasets, show state-of-the-art performances.

[137] Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation

Sobhan Asasi,Mohamed Ilyas Lakhal,Ozge Mercanoglu Sincan,Richard Bowden

Main category: cs.CV

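TL;DR: This paper proposes BeyondGloss, a new gloss-free sign language translation framework that significantly improves translation performance by generating fine-grained descriptions of hand motion and applying contrastive learning.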

Details

Motivation: Sign language translation (SLT) is a challenging task that requires bridging the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements. Method: The paper proposes a new approach that generates fine-grained, temporally aware textual descriptions of hand motion, and uses a contrastive alignment module and a contrastive loss to reduce the modality gap. Fine-grained features are also distilled from HaMeR to enrich hand-specific representations. Result: BeyondGloss achieves state-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks. Conclusion: BeyondGloss is a new SLT framework that does not rely on gloss annotations and achieves state-of-the-art performance by leveraging the spatio-temporal reasoning capabilities of VideoLLMs. Abstract: Sign Language Translation (SLT) is a challenging task that requires bridging the modality gap between visual and linguistic information while capturing subtle variations in hand shapes and movements. To address these challenges, we introduce BeyondGloss, a novel gloss-free SLT framework that leverages the spatio-temporal reasoning capabilities of Video Large Language Models (VideoLLMs). Since existing VideoLLMs struggle to model long videos in detail, we propose a novel approach to generate fine-grained, temporally-aware textual descriptions of hand motion. A contrastive alignment module aligns these descriptions with video features during pre-training, encouraging the model to focus on hand-centric temporal dynamics and distinguish signs more effectively. To further enrich hand-specific representations, we distill fine-grained features from HaMeR. Additionally, we apply a contrastive loss between sign video representations and target language embeddings to reduce the modality gap in pre-training. BeyondGloss achieves state-of-the-art performance on the Phoenix14T and CSL-Daily benchmarks, demonstrating the effectiveness of the proposed framework. We will release the code upon acceptance of the paper.

[138] MamV2XCalib: V2X-based Target-less Infrastructure Camera Calibration with State Space Model

Yaoye Zhu,Zhe Wang,Yan Wang

Main category: cs.CV

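TL;DR: This paper proposes MamV2XCalib, a V2X-based infrastructure camera calibration method that uses vehicle-side LiDAR to achieve large-scale, precise calibration without specific reference objects or manual intervention.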

Details

Motivation: As cooperative systems that use roadside cameras to assist autonomous driving perception become more widespread, large-scale precise calibration of infrastructure cameras has become a critical issue. Traditional manual calibration is often time-consuming and labor-intensive, and may require road closures. Method: The paper introduces a new targetless LiDAR-camera calibration method that combines multi-scale features with a 4D correlation volume to estimate the correlation between vehicle-side point clouds and roadside images, and uses Mamba to model temporal information and estimate the rotation angles. Result: MamV2XCalib is evaluated on the V2X-Seq and TUMTraf-V2X real-world datasets, demonstrating the effectiveness and robustness of this V2X-based automatic calibration approach. Compared with previous LiDAR-camera calibration methods designed for a single vehicle, it achieves better and more stable calibration performance in V2X scenarios with fewer parameters. Conclusion: MamV2XCalib is an effective V2X-based infrastructure camera calibration method that addresses calibration failures caused by defects in vehicle-side data (such as occlusions) and large viewpoint differences. Abstract: As cooperative systems that leverage roadside cameras to assist autonomous vehicle perception become increasingly widespread, large-scale precise calibration of infrastructure cameras has become a critical issue. Traditional manual calibration methods are often time-consuming, labor-intensive, and may require road closures. This paper proposes MamV2XCalib, the first V2X-based infrastructure camera calibration method with the assistance of vehicle-side LiDAR. MamV2XCalib only requires autonomous vehicles equipped with LiDAR to drive near the cameras to be calibrated in the infrastructure, without the need for specific reference objects or manual intervention. We also introduce a new targetless LiDAR-camera calibration method, which combines multi-scale features and a 4D correlation volume to estimate the correlation between vehicle-side point clouds and roadside images. We model the temporal information and estimate the rotation angles with Mamba, effectively addressing calibration failures in V2X scenarios caused by defects in the vehicle-side data (such as occlusions) and large differences in viewpoint. We evaluate MamV2XCalib on the V2X-Seq and TUMTraf-V2X real-world datasets, demonstrating the effectiveness and robustness of our V2X-based automatic calibration approach. Compared to previous LiDAR-camera methods designed for calibration on one car, our approach achieves better and more stable calibration performance in V2X scenarios with fewer parameters. The code is available at https://github.com/zhuyaoye/MamV2XCalib.

[139] MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction

Zijian Dong,Longteng Duan,Jie Song,Michael J. Black,Andreas Geiger

Main category: cs.CV

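TL;DR: MoGA uses a generative avatar model together with 2D diffusion models to reconstruct high-quality 3D Gaussian avatars from a single-view image, addressing 3D consistency and achieving strong performance.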

Details

Motivation: To reconstruct high-fidelity 3D Gaussian avatars from a single-view image, in particular inferring unseen appearance and geometric details while ensuring 3D consistency and realism. Method: A generative avatar model is used as a prior; 3D consistency is ensured by projecting the input image into its latent space and enforcing additional 3D appearance and geometric constraints. Result: Experiments show that the method surpasses state-of-the-art techniques and that the resulting 3D avatars are animatable. Conclusion: By combining a generative avatar model with 2D diffusion models, MoGA reconstructs high-quality 3D Gaussian avatars and generalizes well to real-world scenarios. Abstract: We present MoGA, a novel method to reconstruct high-fidelity 3D Gaussian avatars from a single-view image. The main challenge lies in inferring unseen appearance and geometric details while ensuring 3D consistency and realism. Most previous methods rely on 2D diffusion models to synthesize unseen views; however, these generated views are sparse and inconsistent, resulting in unrealistic 3D artifacts and blurred appearance. To address these limitations, we leverage a generative avatar model that can generate diverse 3D avatars by sampling deformed Gaussians from a learned prior distribution. Due to the limited amount of 3D training data, such a 3D model alone cannot capture all image details of unseen identities. Consequently, we integrate it as a prior, ensuring 3D consistency by projecting input images into its latent space and enforcing additional 3D appearance and geometric constraints. Our novel approach formulates Gaussian avatar creation as a model inversion process by fitting the generative avatar to synthetic views from 2D diffusion models. The generative avatar provides a meaningful initialization for model fitting, enforces 3D regularization, and helps in refining pose estimation. Experiments show that our method surpasses state-of-the-art techniques and generalizes well to real-world scenarios. Our Gaussian avatars are also inherently animatable.

[140] DA-Occ: Efficient 3D Voxel Occupancy Prediction via Directional 2D for Geometric Structure Preservation

Yuchen Zhou,Yan Luo,Xiangang Wang,Xingjian Gu,Mingzhou Lu

Main category: cs.CV

TL;DR: This paper proposes a directional pure 2D approach for 3D occupancy prediction in autonomous driving, effectively balancing accuracy and efficiency by preserving vertical geometric information and using directional attention mechanisms.

Details

Motivation: The motivation is to address the challenge of balancing accuracy and inference speed in current 3D occupancy prediction methods for autonomous driving. Method: The method involves slicing 3D voxel features to preserve vertical geometric information and employs a directional attention mechanism to extract features from different orientations. Result: The method achieved an mIoU of 39.3% and an inference speed of 27.7 FPS on the Occ3D-nuScenes dataset, with 14.8 FPS on edge devices, demonstrating its real-time applicability. Conclusion: The proposed directional pure 2D approach effectively balances accuracy and computational efficiency for 3D occupancy prediction in autonomous driving. Abstract: Efficient and high-accuracy 3D occupancy prediction is crucial for ensuring the performance of autonomous driving (AD) systems. However, many current methods focus on high accuracy at the expense of real-time processing needs. To address this challenge of balancing accuracy and inference speed, we propose a directional pure 2D approach. Our method involves slicing 3D voxel features to preserve complete vertical geometric information. This strategy compensates for the loss of height cues in Bird's-Eye View (BEV) representations, thereby maintaining the integrity of the 3D geometric structure. By employing a directional attention mechanism, we efficiently extract geometric features from different orientations, striking a balance between accuracy and computational efficiency. Experimental results highlight the significant advantages of our approach for autonomous driving. On the Occ3D-nuScenes, the proposed method achieves an mIoU of 39.3% and an inference speed of 27.7 FPS, effectively balancing accuracy and efficiency. In simulations on edge devices, the inference speed reaches 14.8 FPS, further demonstrating the method's applicability for real-time deployment in resource-constrained environments.

[141] Mamba-based Efficient Spatio-Frequency Motion Perception for Video Camouflaged Object Detection

Xin Li,Keren Fu,Qijun Zhao

Main category: cs.CV

TL;DR: Vcamba improves video camouflaged object detection by combining spatial and frequency features using a Mamba-based architecture, achieving superior performance with lower computational cost.

Details

Motivation: Existing VCOD methods struggle due to limited discriminability of spatial appearance features. Frequency features and the Mamba state space model offer enhanced motion perception and feature representation, motivating their integration in a novel framework. Method: The method introduces a visual state space model (Mamba) with modules for multi-scale spatial feature extraction (RFVSS), frequency component enhancement (AFE), motion perception in spatial (SLMP) and frequency (FLMP) domains, and dual-domain feature fusion (SFMF). Result: Experimental results show that Vcamba outperforms state-of-the-art methods across 6 evaluation metrics on 2 datasets with reduced computational cost. Conclusion: The proposed Vcamba model effectively integrates spatial and frequency features for improved video camouflaged object detection (VCOD), outperforming existing methods with lower computational cost. Abstract: Existing video camouflaged object detection (VCOD) methods primarily rely on spatial appearance features to perceive motion cues for breaking camouflage. However, the high similarity between foreground and background in VCOD results in limited discriminability of spatial appearance features (e.g., color and texture), restricting detection accuracy and completeness. Recent studies demonstrate that frequency features can not only enhance feature representation to compensate for appearance limitations but also perceive motion through dynamic variations in frequency energy. Furthermore, the emerging state space model called Mamba, enables efficient perception of motion cues in frame sequences due to its linear-time long-sequence modeling capability. Motivated by this, we propose a novel visual camouflage Mamba (Vcamba) based on spatio-frequency motion perception that integrates frequency and spatial features for efficient and accurate VCOD. 
Specifically, we propose a receptive field visual state space (RFVSS) module to extract multi-scale spatial features after sequence modeling. For frequency learning, we introduce an adaptive frequency component enhancement (AFE) module with a novel frequency-domain sequential scanning strategy to maintain semantic consistency. Then we propose a space-based long-range motion perception (SLMP) module and a frequency-based long-range motion perception (FLMP) module to model spatio-temporal and frequency-temporal sequences in spatial and frequency phase domains. Finally, the space and frequency motion fusion module (SFMF) integrates dual-domain features for unified motion representation. Experimental results show that our Vcamba outperforms state-of-the-art methods across 6 evaluation metrics on 2 datasets with lower computation cost, confirming the superiority of Vcamba. Our code is available at: https://github.com/BoydeLi/Vcamba.

[142] Medical Image De-Identification Benchmark Challenge

Linmin Pei,Granger Sutton,Michael Rutherford,Ulrike Wagner,Tracy Nolan,Kirk Smith,Phillip Farmer,Peter Gu,Ambar Rana,Kailing Chen,Thomas Ferleman,Brian Park,Ye Wu,Jordan Kojouharov,Gargi Singh,Jon Lemon,Tyler Willis,Milos Vukadinovic,Grant Duffy,Bryan He,David Ouyang,Marco Pereanez,Daniel Samber,Derek A. Smith,Christopher Cannistraci,Zahi Fayad,David S. Mendelson,Michele Bufano,Elmar Kotter,Hamideh Haghiri,Rajesh Baidya,Stefan Dvoretskii,Klaus H. Maier-Hein,Marco Nolden,Christopher Ablett,Silvia Siggillino,Sandeep Kaushik,Hongzhu Jiang,Sihan Xie,Zhiyu Wan,Alex Michie,Simon J Doran,Angeline Aurelia Waly,Felix A. Nathaniel Liang,Humam Arshad Mustagfirin,Michelle Grace Felicia,Kuo Po Chih,Rahul Krish,Ghulam Rasool,Nidhal Bouaynaya,Nikolas Koutsoubis,Kyle Naddeo,Kartik Pandit,Tony O'Sullivan,Raj Krish,Qinyan Pan,Scott Gustafson,Benjamin Kopchick,Laura Opsahl-Ong,Andrea Olvera-Morales,Jonathan Pinney,Kathryn Johnson,Theresa Do,Juergen Klenk,Maria Diaz,Arti Singh,Rong Chai,David A. Clunie,Fred Prior,Keyvan Farahani

Main category: cs.CV

TL;DR: The MIDI-B Challenge benchmarked DICOM image de-identification tools with high success rates while preserving essential metadata for AI research.

Details

Motivation: To ensure compliance with privacy laws while sharing medical images and preserving metadata critical for AI development in biomedical research. Method: The MIDI-B Challenge consisted of three phases (training, validation, and test) using real de-identified radiology images with synthetic PHI/PII. Scores were calculated based on the percentage of correct actions taken by participants' tools. Result: Ten teams successfully completed the challenge, achieving high accuracy scores using various tools including open-source, proprietary, large language models, and OCR. Conclusion: MIDI-B successfully provided a standardized platform for benchmarking DICOM image deID tools while maintaining research-critical metadata, achieving high accuracy scores between 97.91% and 99.93%. Abstract: The de-identification (deID) of protected health information (PHI) and personally identifiable information (PII) is a fundamental requirement for sharing medical images, particularly through public repositories, to ensure compliance with patient privacy laws. In addition, preservation of non-PHI metadata to inform and enable downstream development of imaging artificial intelligence (AI) is an important consideration in biomedical research. The goal of MIDI-B was to provide a standardized platform for benchmarking of DICOM image deID tools based on a set of rules conformant to the HIPAA Safe Harbor regulation, the DICOM Attribute Confidentiality Profiles, and best practices in preservation of research-critical metadata, as defined by The Cancer Imaging Archive (TCIA). The challenge employed a large, diverse, multi-center, and multi-modality set of real de-identified radiology images with synthetic PHI/PII inserted. The MIDI-B Challenge consisted of three phases: training, validation, and test. Eighty individuals registered for the challenge. In the training phase, we encouraged participants to tune their algorithms using their in-house or public data. 
The validation and test phases utilized the DICOM images containing synthetic identifiers (of 216 and 322 subjects, respectively). Ten teams successfully completed the test phase of the challenge. To measure success of a rule-based approach to image deID, scores were computed as the percentage of correct actions from the total number of required actions. The scores ranged from 97.91% to 99.93%. Participants employed a variety of open-source and proprietary tools with customized configurations, large language models, and optical character recognition (OCR). In this paper we provide a comprehensive report on the MIDI-B Challenge's design, implementation, results, and lessons learned.
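
The scoring rule described above (correct actions as a percentage of required actions) can be sketched as follows. This is a minimal illustration, not the challenge's actual scoring code; the function name `deid_score` and the example counts are hypothetical.

```python
def deid_score(correct_actions: int, required_actions: int) -> float:
    """Return the percentage of required de-identification actions done correctly."""
    if required_actions <= 0:
        raise ValueError("required_actions must be positive")
    if not 0 <= correct_actions <= required_actions:
        raise ValueError("correct_actions must lie in [0, required_actions]")
    return 100.0 * correct_actions / required_actions


# Hypothetical counts, chosen only to illustrate the formula.
print(f"{deid_score(9_993, 10_000):.2f}%")  # prints 99.93%
```

Under this rule, a tool is penalized both for leaving PHI/PII in place and for removing metadata that should have been preserved, since either counts as an incorrect action.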

[143] Consistent Point Matching

Halid Ziya Yerebakan,Gerardo Hermosillo Valadez

Main category: cs.CV

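TL;DR: The study proposes a new point-matching algorithm for medical images that incorporates a consistency heuristic, improving the robustness and precision of matching anatomical locations and landmarks without requiring a machine learning model or training data.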

Details

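Motivation: The work aims to match anatomical locations and localize landmarks in medical images while improving the algorithm's robustness and efficiency, without relying on machine learning models or training data. Method: A consistency heuristic is integrated into the point-matching algorithm, and the approach is validated on diverse longitudinal internal and public datasets spanning CT and MRI modalities. Result: The method surpasses state-of-the-art results on the Deep Lesion Tracking dataset, effectively addresses landmark localization, runs efficiently on standard CPU hardware, and allows configurable trade-offs between speed and robustness. Conclusion: Incorporating a consistency heuristic into the point-matching algorithm improves robustness in matching anatomical locations across pairs of medical images and enables high-precision navigation between medical images without a machine learning model or training data. Abstract: This study demonstrates that incorporating a consistency heuristic into the point-matching algorithm \cite{yerebakan2023hierarchical} improves robustness in matching anatomical locations across pairs of medical images. We validated our approach on diverse longitudinal internal and public datasets spanning CT and MRI modalities. Notably, it surpasses state-of-the-art results on the Deep Lesion Tracking dataset. Additionally, we show that the method effectively addresses landmark localization. The algorithm operates efficiently on standard CPU hardware and allows configurable trade-offs between speed and robustness. The method enables high-precision navigation between medical images without requiring a machine learning model or training data.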

[144] DivControl: Knowledge Diversion for Controllable Image Generation

Yucheng Xie,Fu Feng,Ruixiao Shi,Jing Wang,Yong Rui,Xin Geng

Main category: cs.CV

TL;DR: DivControl is a unified and efficient framework for controllable image generation that enables zero-shot adaptation and parameter-efficient learning through SVD factorization and dynamic knowledge diversion.

Details

Motivation: Existing methods for controllable image generation either require separate models per condition or suffer from entangled representations, leading to poor generalization and high adaptation costs. This work aims to enable efficient and unified controllable generation. Method: DivControl factorizes ControlNet using SVD into condition-agnostic learngenes and condition-specific tailors, with a dynamic gate that performs soft routing over tailors based on condition semantics. A representation alignment loss is also introduced to enhance fidelity and efficiency. Result: DivControl achieves state-of-the-art controllability with 36.4× less training cost, improved performance on basic conditions, and strong zero-shot/few-shot results on unseen conditions. Conclusion: DivControl provides a scalable and efficient framework for controllable image generation, allowing zero-shot and few-shot adaptation to new conditions with reduced training costs. Abstract: Diffusion models have advanced from text-to-image (T2I) to image-to-image (I2I) generation by incorporating structured inputs such as depth maps, enabling fine-grained spatial control. However, existing methods either train separate models for each condition or rely on unified architectures with entangled representations, resulting in poor generalization and high adaptation costs for novel conditions. To this end, we propose DivControl, a decomposable pretraining framework for unified controllable generation and efficient adaptation. DivControl factorizes ControlNet via SVD into basic components (pairs of singular vectors), which are disentangled into condition-agnostic learngenes and condition-specific tailors through knowledge diversion during multi-condition training. Knowledge diversion is implemented via a dynamic gate that performs soft routing over tailors based on the semantics of condition instructions, enabling zero-shot generalization and parameter-efficient adaptation to novel conditions.
To further improve condition fidelity and training efficiency, we introduce a representation alignment loss that aligns condition embeddings with early diffusion features. Extensive experiments demonstrate that DivControl achieves state-of-the-art controllability with 36.4× less training cost, while simultaneously improving average performance on basic conditions. It also delivers strong zero-shot and few-shot performance on unseen conditions, demonstrating superior scalability, modularity, and transferability.
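
The decomposition described in the abstract (a weight matrix expressed as a sum of singular-vector pairs, with shared components always active and condition-specific components softly routed by a gate) can be sketched as below. This is a rough illustration under stated assumptions, not the authors' implementation: the singular triples are assumed to be precomputed by some SVD routine, the gate scores are assumed to come from an embedding of the condition instruction, and all names (`compose_weight`, `gate_scores`) are hypothetical.

```python
import math


def outer(u, v, scale):
    """Rank-one matrix scale * u v^T as a list of rows."""
    return [[scale * ui * vj for vj in v] for ui in u]


def add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compose_weight(learngenes, tailors, gate_scores):
    """Combine shared and gated rank-one components into one weight matrix.

    learngenes, tailors: lists of (sigma, u, v) singular triples (assumed given).
    gate_scores: one hypothetical routing score per tailor.
    """
    gates = softmax(gate_scores)
    rows, cols = len(learngenes[0][1]), len(learngenes[0][2])
    weight = [[0.0] * cols for _ in range(rows)]
    for sigma, u, v in learngenes:            # condition-agnostic part, always on
        weight = add(weight, outer(u, v, sigma))
    for g, (sigma, u, v) in zip(gates, tailors):  # condition-specific part, soft-routed
        weight = add(weight, outer(u, v, g * sigma))
    return weight
```

In this sketch, adapting to a new condition only changes the gate scores (or adds a tailor), while the shared components stay fixed, which is one way to read the parameter-efficiency claim.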

[145] Efficient Masked Attention Transformer for Few-Shot Classification and Segmentation

Dustin Carrión-Ojeda,Stefan Roth,Simone Schaub-Meyer

Main category: cs.CV

TL;DR: The paper proposes EMAT, an improved method for few-shot classification and segmentation that excels at detecting and segmenting small objects with fewer parameters. Additionally, the authors introduce new evaluation settings that better reflect real-world conditions.

Details

Motivation: Current FS-CS methods struggle with small objects despite achieving high overall accuracy, and existing evaluation settings do not utilize all available annotations. Method: EMAT incorporates a memory-efficient masked attention mechanism, learnable downscaling strategy, and parameter-efficiency enhancements. Result: EMAT outperforms all FS-CS methods on PASCAL-5$^i$ and COCO-20$^i$ datasets with fewer trainable parameters, and two new evaluation settings are proposed. Conclusion: EMAT is a more effective and parameter-efficient method for FS-CS tasks, particularly in handling small objects. Abstract: Few-shot classification and segmentation (FS-CS) focuses on jointly performing multi-label classification and multi-class segmentation using few annotated examples. Although the current state of the art (SOTA) achieves high accuracy in both tasks, it struggles with small objects. To overcome this, we propose the Efficient Masked Attention Transformer (EMAT), which improves classification and segmentation accuracy, especially for small objects. EMAT introduces three modifications: a novel memory-efficient masked attention mechanism, a learnable downscaling strategy, and parameter-efficiency enhancements. EMAT outperforms all FS-CS methods on the PASCAL-5$^i$ and COCO-20$^i$ datasets, using at least four times fewer trainable parameters. Moreover, as the current FS-CS evaluation setting discards available annotations, despite their costly collection, we introduce two novel evaluation settings that consider these annotations to better reflect practical scenarios.

[146] FFGAF-SNN: The Forward-Forward Based Gradient Approximation Free Training Framework for Spiking Neural Networks

Changqing Xu,Ziqiang Yang,Yi Liu,Xinfang Liao,Guiqi Mo,Hao Zeng,Yintang Yang

Main category: cs.CV

TL;DR: The paper introduces a gradient approximation-free training framework for Spiking Neural Networks based on the Forward-Forward approach, achieving high accuracy across multiple datasets while reducing computational complexity.

Details

Motivation: The motivation stems from the challenge of training Spiking Neural Networks (SNNs) due to their non-differentiability and the computational inefficiency of gradient approximation methods, especially for edge devices. The authors aim to develop an energy-efficient and accurate training framework. Method: The paper proposes a gradient approximation-free training framework using a Forward-Forward (FF) approach for SNNs, treating spiking activations as black-box modules. It also introduces a class-aware complexity adaptation mechanism that dynamically optimizes the loss function based on inter-class difficulty metrics. Result: The proposed training framework achieved test accuracies of 99.58%, 92.13%, and 75.64% on MNIST, Fashion-MNIST, and CIFAR-10 datasets, respectively. It also demonstrated significant improvements in memory access and computational power consumption. Conclusion: The paper concludes that the proposed Forward-Forward (FF)-based training framework for Spiking Neural Networks (SNNs) significantly improves efficiency and accuracy, surpassing existing FF-based SNN approaches while offering advantages in memory access and computational power consumption. Abstract: Spiking Neural Networks (SNNs) offer a biologically plausible framework for energy-efficient neuromorphic computing. However, training SNNs efficiently is challenging because of their non-differentiability. Existing gradient approximation approaches frequently sacrifice accuracy and face deployment limitations on edge devices due to the substantial computational requirements of backpropagation. To address these challenges, we propose a Forward-Forward (FF) based gradient approximation-free training framework for Spiking Neural Networks, which treats spiking activations as black-box modules, thereby eliminating the need for gradient approximation while significantly reducing computational complexity.
Furthermore, we introduce a class-aware complexity adaptation mechanism that dynamically optimizes the loss function based on inter-class difficulty metrics, enabling efficient allocation of network resources across different categories. Experimental results demonstrate that our proposed training framework achieves test accuracies of 99.58%, 92.13%, and 75.64% on the MNIST, Fashion-MNIST, and CIFAR-10 datasets, respectively, surpassing all existing FF-based SNN approaches. Additionally, our proposed method exhibits significant advantages in terms of memory access and computational power consumption.
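
For readers unfamiliar with Forward-Forward training, the generic layer-local objective can be sketched as follows. This shows the standard FF idea (a layer's "goodness", the sum of squared activations, should exceed a threshold for positive samples and fall below it for negative ones), not the paper's class-aware loss; the threshold `theta` and the function names are assumptions for illustration.

```python
import math


def goodness(activations):
    """Layer goodness: sum of squared activations."""
    return sum(a * a for a in activations)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ff_layer_loss(pos_activations, neg_activations, theta):
    """Generic Forward-Forward loss for one layer.

    Low when positive-sample goodness is above theta and
    negative-sample goodness is below theta.
    """
    pos_term = softplus(theta - goodness(pos_activations))
    neg_term = softplus(goodness(neg_activations) - theta)
    return pos_term + neg_term
```

Because each layer is trained against its own objective, no error signal has to be propagated backward through the whole network, which is the property the abstract relies on for reduced computation.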

[147] Adaptively Distilled ControlNet: Accelerated Training and Superior Sampling for Medical Image Synthesis

Kunpeng Qiu,Zhiying Zhou,Yongxin Guo

Main category: cs.CV

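TL;DR: This paper proposes Adaptively Distilled ControlNet, a framework that accelerates training and optimization through dual-model distillation. It targets the privacy concerns and labor-intensive labeling that constrain medical image annotation and achieves state-of-the-art performance on two medical datasets.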

Details

Motivation: Medical image annotation is constrained by privacy concerns and labor-intensive labeling. Existing mask-controllable diffusion models excel at synthesis but struggle with precise lesion-mask alignment. Method: The framework performs dual-model distillation between a teacher and a student: the teacher is trained on mask-image pairs, the student is trained on masks only, and the student is optimized through predicted noise alignment and adaptive regularization. Result: Comprehensive evaluation on two distinct medical datasets shows improved performance: TransUNet improves mDice/mIoU by 2.4%/4.2% on KiTS19, and SANet improves mDice/mIoU by 2.6%/3.5% on Polyps. Conclusion: Adaptively Distilled ControlNet is a task-agnostic framework that accelerates training and optimization through dual-model distillation, achieving state-of-the-art performance in medical image generation while preserving privacy. Abstract: Medical image annotation is constrained by privacy concerns and labor-intensive labeling, significantly limiting the performance and generalization of segmentation models. While mask-controllable diffusion models excel in synthesis, they struggle with precise lesion-mask alignment. We propose Adaptively Distilled ControlNet, a task-agnostic framework that accelerates training and optimization through dual-model distillation. Specifically, during training, a teacher model, conditioned on mask-image pairs, regularizes a mask-only student model via predicted noise alignment in parameter space, further enhanced by adaptive regularization based on lesion-background ratios. During sampling, only the student model is used, enabling privacy-preserving medical image generation. Comprehensive evaluations on two distinct medical datasets demonstrate state-of-the-art performance: TransUNet improves mDice/mIoU by 2.4%/4.2% on KiTS19, while SANet achieves 2.6%/3.5% gains on Polyps, highlighting its effectiveness and superiority. Code is available at GitHub.
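
The mDice and mIoU figures reported above are averages of the Dice coefficient and intersection-over-union. A minimal sketch of both metrics for a single binary mask follows; the averaging convention (over classes or over images) depends on the benchmark and is not specified here, and the function name `dice_and_iou` is hypothetical.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two flat binary masks (sequences of 0/1) of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks empty: treated as perfect agreement (a common convention).
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou
```

For the same pair of masks, Dice is never smaller than IoU, so gains reported in the two metrics are not directly comparable in magnitude.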

[148] OmniTraj: Pre-Training on Heterogeneous Data for Adaptive and Zero-Shot Human Trajectory Prediction

Yang Gao,Po-Chien Luan,Kaouther Messaoud,Lan Feng,Alexandre Alahi

Main category: cs.CV

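TL;DR: This paper presents OmniTraj, a Transformer-based trajectory prediction model that explicitly conditions on the frame rate to achieve effective zero-shot transfer across datasets, substantially improving prediction performance.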

Details

Motivation: Existing pre-trained trajectory prediction models perform poorly when transferred zero-shot to unseen datasets with different temporal dynamics. Method: OmniTraj is a Transformer-based model pre-trained on a large-scale heterogeneous dataset that strengthens temporal generalization by explicitly conditioning on temporal metadata. Result: OmniTraj reduces prediction error by over 70% in cross-setup scenarios and reaches state-of-the-art zero-shot transfer performance; after fine-tuning, it achieves the best results on four datasets: NBA, JTA, WorldPose, and ETH-UCY. Conclusion: By explicitly conditioning on the frame rate, OmniTraj achieves zero-shot transfer across datasets, markedly reduces prediction error, and attains state-of-the-art performance on multiple datasets. Abstract: While large-scale pre-training has advanced human trajectory prediction, a critical challenge remains: zero-shot transfer to unseen datasets with varying temporal dynamics. State-of-the-art pre-trained models often require fine-tuning to adapt to new datasets with different frame rates or observation horizons, limiting their scalability and practical utility. In this work, we systematically investigate this limitation and propose a robust solution. We first demonstrate that existing data-aware discrete models struggle when transferred to new scenarios with shifted temporal setups. We then isolate the temporal generalization from dataset shift, revealing that a simple, explicit conditioning mechanism for temporal metadata is a highly effective solution. Based on this insight, we present OmniTraj, a Transformer-based model pre-trained on a large-scale, heterogeneous dataset. Our experiments show that explicitly conditioning on the frame rate enables OmniTraj to achieve state-of-the-art zero-shot transfer performance, reducing prediction error by over 70% in challenging cross-setup scenarios. After fine-tuning, OmniTraj achieves state-of-the-art results on four datasets, including NBA, JTA, WorldPose, and ETH-UCY. The code is publicly available: https://github.com/vita-epfl/omnitraj

[149] SAMSA: Segment Anything Model Enhanced with Spectral Angles for Hyperspectral Interactive Medical Image Segmentation

Alfie Roddan,Tobias Czempiel,Chi Xu,Daniel S. Elson,Stamatia Giannarou

Main category: cs.CV

TL;DR: SAMSA is a novel interactive segmentation framework for hyperspectral medical images that integrates RGB and spectral analysis, effectively handling data limitations and hardware variations while achieving high segmentation accuracy with minimal user input.

Details

Motivation: The motivation is to address the challenges of data limitations and hardware variations in hyperspectral imaging (HSI) to improve medical imaging analysis. Method: SAMSA combines an RGB foundation model with spectral analysis and uses user clicks to guide segmentation and spectral similarity computations, employing a spectral feature fusion strategy that is independent of spectral band count and resolution. Result: Performance evaluation on publicly available datasets showed 81.0% 1-click and 93.4% 5-click DICE on a neurosurgical dataset, and 81.1% 1-click and 89.2% 5-click DICE on an intraoperative porcine dataset, demonstrating effectiveness in few-shot and zero-shot learning scenarios. Conclusion: The paper concludes that SAMSA is an effective framework for hyperspectral medical image analysis, particularly in scenarios with limited data and varying spectral characteristics. Abstract: Hyperspectral imaging (HSI) provides rich spectral information for medical imaging, yet encounters significant challenges due to data limitations and hardware variations. We introduce SAMSA, a novel interactive segmentation framework that combines an RGB foundation model with spectral analysis. SAMSA efficiently utilizes user clicks to guide both RGB segmentation and spectral similarity computations. The method addresses key limitations in HSI segmentation through a unique spectral feature fusion strategy that operates independently of spectral band count and resolution. Performance evaluation on publicly available datasets has shown 81.0% 1-click and 93.4% 5-click DICE on a neurosurgical dataset and 81.1% 1-click and 89.2% 5-click DICE on an intraoperative porcine hyperspectral dataset. Experimental results demonstrate SAMSA's effectiveness in few-shot and zero-shot learning scenarios using minimal training examples.
Our approach enables seamless integration of datasets with different spectral characteristics, providing a flexible framework for hyperspectral medical image analysis.
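
The spectral angle named in the title is a standard similarity measure between two spectra: the angle between them when each is treated as a vector. The sketch below shows the generic formula only; how SAMSA fuses this signal with the RGB model is not described here, and the function name and example spectra are hypothetical.

```python
import math
from typing import Sequence


def spectral_angle(a: Sequence[float], b: Sequence[float]) -> float:
    """Angle in radians between two spectra with the same number of bands."""
    if len(a) != len(b):
        raise ValueError("spectra must have the same number of bands")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("spectra must be non-zero")
    # Clamp to guard against rounding slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)


# Hypothetical spectra: a clicked reference pixel and a brighter copy of it.
clicked = [0.2, 0.5, 0.9, 0.4]
brighter = [2.0 * v for v in clicked]
angle = spectral_angle(clicked, brighter)  # close to zero: same spectral shape
```

The angle ignores overall intensity and is defined for any number of bands, which fits the abstract's claim that the fusion strategy is independent of spectral band count.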

[150] I2V-GS: Infrastructure-to-Vehicle View Transformation with Gaussian Splatting for Autonomous Driving Data Generation

Jialei Chen,Wuhao Xu,Sipeng He,Baoru Huang,Dongchun Ren

Main category: cs.CV

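TL;DR: This paper proposes I2V-GS, a new method for generating autonomous driving data that synthesizes vehicle-view data from infrastructure-view images and effectively improves synthesis quality.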

Details

Motivation: Collecting autonomous driving data with vehicles is expensive and inefficient; the work explores generating driving data from real-world road images. Method: I2V-GS uses Gaussian Splatting to transform the infrastructure view into the vehicle view, and improves generation with an adaptive depth warp, a cascade strategy, and optimization based on cross-view information. Result: Experiments show that I2V-GS outperforms StreetGaussian in NTA-IoU, NTL-IoU, and FID by 45.7%, 34.2%, and 14.9%, respectively. Conclusion: I2V-GS significantly improves data synthesis quality under the vehicle view and is, to the authors' knowledge, the first framework to generate autonomous driving datasets with infrastructure-to-vehicle view transformation. Abstract: Vast and high-quality data are essential for end-to-end autonomous driving systems. However, current driving data is mainly collected by vehicles, which is expensive and inefficient. A potential solution lies in synthesizing data from real-world images. Recent advancements in 3D reconstruction demonstrate photorealistic novel view synthesis, highlighting the potential of generating driving data from images captured on the road. This paper introduces a novel method, I2V-GS, to transfer the Infrastructure view To the Vehicle view with Gaussian Splatting. Reconstruction from sparse infrastructure viewpoints and rendering under large view transformations is a challenging problem. We adopt the adaptive depth warp to generate dense training views. To further expand the range of views, we employ a cascade strategy to inpaint warped images, which also ensures inpainting content is consistent across views. To further ensure the reliability of the diffusion model, we utilize the cross-view information to perform a confidence-guided optimization. Moreover, we introduce RoadSight, a multi-modality, multi-view dataset from real scenarios in infrastructure views. To our knowledge, I2V-GS is the first framework to generate autonomous driving datasets with infrastructure-vehicle view transformation. Experimental results demonstrate that I2V-GS significantly improves synthesis quality under vehicle view, outperforming StreetGaussian in NTA-IoU, NTL-IoU, and FID by 45.7%, 34.2%, and 14.9%, respectively.

[151] UniLDiff: Unlocking the Power of Diffusion Priors for All-in-One Image Restoration

Zihan Cheng,Liangtai Zhou,Dian Chen,Ni Tang,Xiaotong Luo,Yanyun Qu

Main category: cs.CV

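TL;DR: The study proposes a unified image restoration framework that combines latent diffusion models with two purpose-built modules (DAFF and DAEM) to handle diverse degradation types and recover image detail, reaching state-of-the-art performance.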

Details

Motivation: To address the core challenges of all-in-one image restoration by harnessing the strong generative capacity of diffusion models for diverse degradations. Method: The authors design a Degradation-Aware Feature Fusion (DAFF) module to adapt to different degradation types and a Detail-Aware Expert Module (DAEM) to enhance texture and fine-structure recovery. Result: Extensive experiments across multi-task and mixed degradation settings show that the method consistently achieves state-of-the-art performance. Conclusion: The study proposes a novel unified image restoration framework based on latent diffusion models that adapts to diverse degradation types and recovers image detail, demonstrating the practical potential of diffusion priors for unified image restoration. Abstract: All-in-One Image Restoration (AiOIR) has emerged as a promising yet challenging research direction. To address its core challenges, we propose a novel unified image restoration framework based on latent diffusion models (LDMs). Our approach structurally integrates low-quality visual priors into the diffusion process, unlocking the powerful generative capacity of diffusion models for diverse degradations. Specifically, we design a Degradation-Aware Feature Fusion (DAFF) module to enable adaptive handling of diverse degradation types. Furthermore, to mitigate detail loss caused by the high compression and iterative sampling of LDMs, we design a Detail-Aware Expert Module (DAEM) in the decoder to enhance texture and fine-structure recovery. Extensive experiments across multi-task and mixed degradation settings demonstrate that our method consistently achieves state-of-the-art performance, highlighting the practical potential of diffusion priors for unified image restoration. Our code will be released.

[152] Enhanced Velocity Field Modeling for Gaussian Video Reconstruction

Zhenyang Li,Xiaoyang Bai,Tongchen Zhang,Pengfei Shen,Weiwei Xu,Yifan Peng

Main category: cs.CV

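TL;DR: This paper proposes FlowGaussian-VR, a flow-based velocity field modeling scheme for high-fidelity 3D video reconstruction that addresses overfitting caused by irregular Gaussian trajectories under complex motion and scale variations.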

Details

Motivation: Existing 3D Gaussian splatting methods tend to overfit to irregular Gaussian trajectories on videos with complex motion and significant scale variations, which degrades visual quality; in addition, densification strategies designed for static scene reconstruction cannot compensate for missing dynamic content. Method: FlowGaussian-VR has two core components: a velocity field rendering (VFR) pipeline that optimizes Gaussian trajectories using optical flow, and a flow-assisted adaptive densification (FAD) strategy that adjusts the number and size of Gaussians in dynamic regions. Result: On multi-view dynamic reconstruction and novel view synthesis, FlowGaussian-VR is validated on multiple real-world datasets, gaining over 2.5 dB in PSNR, reducing blurry artifacts in dynamic textures, and yielding more regular and trackable Gaussian trajectories. Conclusion: By introducing optical-flow-driven velocity field modeling and an adaptive densification strategy, FlowGaussian-VR significantly improves 3D video reconstruction quality in complex dynamic scenes. Abstract: High-fidelity 3D video reconstruction is essential for enabling real-time rendering of dynamic scenes with realistic motion in virtual and augmented reality (VR/AR). The deformation field paradigm of 3D Gaussian splatting has achieved near-photorealistic results in video reconstruction due to the great representation capability of deep deformation networks. However, in videos with complex motion and significant scale variations, deformation networks often overfit to irregular Gaussian trajectories, leading to suboptimal visual quality. Moreover, the gradient-based densification strategy designed for static scene reconstruction proves inadequate to address the absence of dynamic content. In light of these challenges, we propose a flow-empowered velocity field modeling scheme tailored for Gaussian video reconstruction, dubbed FlowGaussian-VR. It consists of two core components: a velocity field rendering (VFR) pipeline which enables optical flow-based optimization, and a flow-assisted adaptive densification (FAD) strategy that adjusts the number and size of Gaussians in dynamic regions. We validate our model's effectiveness on multi-view dynamic reconstruction and novel view synthesis with multiple real-world datasets containing challenging motion scenarios, demonstrating not only notable visual improvements (over 2.5 dB gain in PSNR) and less blurry artifacts in dynamic textures, but also regularized and trackable per-Gaussian trajectories.

[153] Explainable Image Classification with Reduced Overconfidence for Tissue Characterisation

Alfie Roddan,Chi Xu,Serine Ajlouni,Irini Kakaletri,Patra Charalampaki,Stamatia Giannarou

Main category: cs.CV

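TL;DR: This paper proposes a new explainability method for image classification that iteratively generates pixel attribution (PA) maps and estimates pixel-wise risk, improving model explainability and outperforming existing techniques.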

Details

Motivation: Deep learning models can be overconfident in their predictions, and this overconfidence carries over to pixel attribution, so a pixel attribution method that incorporates risk estimation is needed to improve explainability. Method: The method iteratively applies a classification model together with a pixel attribution method to create a volume of PA maps, uses the resulting pixel-wise distributions for the first time to generate an enhanced PA map, and estimates pixel-wise risk with the coefficient of variation (CV). Result: The method provides an improved PA map along with a risk estimate for the output PA values; experiments on pCLE data and ImageNet show that it outperforms the current state of the art. Conclusion: The paper proposes a new pixel attribution method that incorporates risk estimation to improve image classification explainability and shows that it outperforms existing techniques. Abstract: The deployment of Machine Learning models intraoperatively for tissue characterisation can assist decision making and guide safe tumour resections. For image classification models, pixel attribution methods are popular to infer explainability. However, overconfidence in deep learning models' predictions translates to overconfidence in pixel attribution. In this paper, we propose the first approach which incorporates risk estimation into a pixel attribution method for improved image classification explainability. The proposed method iteratively applies a classification model with a pixel attribution method to create a volume of PA maps. This volume is used for the first time, to generate a pixel-wise distribution of PA values. We introduce a method to generate an enhanced PA map by estimating the expectation values of the pixel-wise distributions. In addition, the coefficient of variation (CV) is used to estimate pixel-wise risk of this enhanced PA map. Hence, the proposed method not only provides an improved PA map but also produces an estimation of risk on the output PA values. Performance evaluation on probe-based Confocal Laser Endomicroscopy (pCLE) data and ImageNet verifies that our improved explainability method outperforms the state-of-the-art.
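
The aggregation step described above (a per-pixel expectation over a volume of PA maps, plus a coefficient of variation as a risk estimate) can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the PA maps are taken as given nested lists, the population standard deviation is used, and the small `eps` guarding against division by zero is an added safeguard.

```python
from statistics import fmean, pstdev


def enhance_pa(volume, eps=1e-12):
    """Aggregate a volume of PA maps into a mean map and a CV (risk) map.

    volume: list of PA maps, each a list of rows of floats with identical shape.
    Returns (mean_map, cv_map) with the same shape as a single PA map.
    """
    n_rows, n_cols = len(volume[0]), len(volume[0][0])
    mean_map = [[0.0] * n_cols for _ in range(n_rows)]
    cv_map = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            samples = [pa_map[r][c] for pa_map in volume]
            mu = fmean(samples)        # expectation of the pixel-wise distribution
            sigma = pstdev(samples)    # spread across the repeated attributions
            mean_map[r][c] = mu
            cv_map[r][c] = sigma / (abs(mu) + eps)  # coefficient of variation
    return mean_map, cv_map
```

A pixel whose attribution is stable across iterations gets a CV near zero, while a pixel whose attribution fluctuates gets a high CV and can be flagged as less trustworthy.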

[154] DiffuMatch: Category-Agnostic Spectral Diffusion Priors for Robust Non-rigid Shape Matching

Emery Pierson,Lei Li,Angela Dai,Maks Ovsjanikov

Main category: cs.CV

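TL;DR: This paper proposes a new data-driven method that trains a generative model of functional maps and uses a diffusion-based distillation strategy to learn functional map regularization for non-rigid shape matching, improving matching accuracy and broadening applicability.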

Details

Motivation: Existing deep functional map methods still rely on axiomatic modeling for in-network regularization and for the training loss, which limits their accuracy and applicability. Method: A generative model of functional maps is first trained with score-based generative modeling; the model is then used to promote the structural properties of ground-truth functional maps on new shape collections. Result: Experiments show that the proposed method outperforms traditional axiomatic approaches on zero-shot non-rigid shape matching. Conclusion: The paper proposes a new data-driven method for non-rigid shape matching that learns functional map regularization through a distillation strategy based on diffusion models, and that can fully replace commonly used strategies such as enforcing Laplacian commutativity or orthogonality. Abstract: Deep functional maps have recently emerged as a powerful tool for solving non-rigid shape correspondence tasks. Methods that use this approach combine the power and flexibility of the functional map framework, with data-driven learning for improved accuracy and generality. However, most existing methods in this area restrict the learning aspect only to the feature functions and still rely on axiomatic modeling for formulating the training loss or for functional map regularization inside the networks. This limits both the accuracy and the applicability of the resulting approaches only to scenarios where assumptions of the axiomatic models hold. In this work, we show, for the first time, that both in-network regularization and functional map training can be replaced with data-driven methods. For this, we first train a generative model of functional maps in the spectral domain using score-based generative modeling, built from a large collection of high-quality maps. We then exploit the resulting model to promote the structural properties of ground truth functional maps on new shape collections. Remarkably, we demonstrate that the learned models are category-agnostic, and can fully replace commonly used strategies such as enforcing Laplacian commutativity or orthogonality of functional maps. Our key technical contribution is a novel distillation strategy from diffusion models in the spectral domain. Experiments demonstrate that our learned regularization leads to better results than axiomatic approaches for zero-shot non-rigid shape matching. Our code is available at: https://github.com/daidedou/diffumatch/

[155] RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping

Dongming Wu,Yanping Fu,Saike Huang,Yingfei Liu,Fan Jia,Nian Liu,Feng Dai,Tiancai Wang,Rao Muhammad Anwer,Fahad Shahbaz Khan,Jianbing Shen

Main category: cs.CV

TL;DR: This paper introduces RAGNet and AffordanceNet, which improve robotic grasping in open-world scenarios by enhancing affordance perception with large-scale data and reasoning.

Details

Motivation: Current robotic grasping studies lack reasoning-based large-scale data for affordance prediction, limiting their effectiveness in open-world scenarios. Method: The authors built a large-scale benchmark called RAGNet with diverse data and reasoning instructions. They also proposed AffordanceNet, a framework combining a VLM pre-trained on affordance data and a grasping network. Result: Experiments showed that the model achieved strong performance on affordance segmentation and real-robot manipulation tasks, demonstrating its open-world generalization ability. Conclusion: The study concludes that the proposed AffordanceNet framework, together with the RAGNet benchmark, significantly enhances open-world generalization for robotic grasping tasks. Abstract: General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. However, current studies suffer from the problem of lacking reasoning-based large-scale affordance prediction data, leading to considerable concern about open-world effectiveness. To address this limitation, we build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. It contains 273k images, 180 categories, and 26k reasoning instructions. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. They are carefully annotated with an affordance map, while the difficulty of language instructions is largely increased by removing their category name and only providing functional descriptions. Furthermore, we propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network that conditions an affordance map to grasp the target. 
Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks show that our model has a powerful open-world generalization ability. Our data and code are available at https://github.com/wudongming97/AffordanceNet.

[156] Slot Attention with Re-Initialization and Self-Distillation

Rongzhen Zhao,Yi Zhao,Juho Kannala,Joni Pajarinen

Main category: cs.CV

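TL;DR: DIAS improves performance on multiple vision tasks in object-centric learning by reducing slot redundancy in Slot Attention and introducing a self-distillation mechanism.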

Details

Motivation: Existing Object-Centric Learning (OCL) methods suffer from redundant slots competing with informative ones, from objects being erroneously segmented into parts, and from supervision signals derived solely from input reconstruction, which overlooks potential supervision based on internal information. Method: Slot Attention with re-Initialization and self-Distillation (DIAS): 1) reduces redundancy in the aggregated slots and re-initializes an extra aggregation to update the remaining slots; 2) drives the attention map at the first aggregation iteration to approximate the attention map at the last iteration, enabling self-distillation. Result: DIAS achieves state-of-the-art results on OCL tasks such as object discovery and recognition, while also improving advanced visual prediction and reasoning. Conclusion: By reducing redundancy and introducing self-distillation, DIAS achieves state-of-the-art performance on object-centric learning tasks while also improving advanced visual prediction and reasoning. Abstract: Unlike popular solutions based on dense feature maps, Object-Centric Learning (OCL) represents visual scenes as sub-symbolic object-level feature vectors, termed slots, which are highly versatile for tasks involving visual modalities. OCL typically aggregates object superpixels into slots by iteratively applying competitive cross attention, known as Slot Attention, with the slots as the query. However, once initialized, these slots are reused naively, causing redundant slots to compete with informative ones for representing objects. This often results in objects being erroneously segmented into parts. Additionally, mainstream methods derive supervision signals solely from decoding slots into the input's reconstruction, overlooking potential supervision based on internal information. To address these issues, we propose Slot Attention with re-Initialization and self-Distillation (DIAS): i) We reduce redundancy in the aggregated slots and re-initialize extra aggregation to update the remaining slots; ii) We drive the bad attention map at the first aggregation iteration to approximate the good at the last iteration to enable self-distillation. Experiments demonstrate that DIAS achieves state-of-the-art results on OCL tasks like object discovery and recognition, while also improving advanced visual prediction and reasoning. Our code is available on https://github.com/Genera1Z/DIAS.

[157] SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting

Di Li,Jie Feng,Jiahao Chen,Weisheng Dong,Guanbin Li,Yuhui Zheng,Mingtao Feng,Guangming Shi

Main category: cs.CV

TL;DR: This paper introduces Sequential 3D Gaussian Affordance Reasoning and the SeqAffordSplat benchmark, proposing the SeqSplatNet framework that advances 3D affordance reasoning to handle complex, sequential tasks at the scene level.

Details

Motivation: Current methods of 3D affordance reasoning are limited to single-object, single-step interactions, which do not suffice for the long-horizon, multi-object tasks needed in real-world applications. Method: SeqSplatNet is introduced, which uses a large language model to autoregressively generate text with segmentation tokens, guiding a conditional decoder for 3D mask production. A pre-training strategy called Conditional Geometric Reconstruction and a feature injection mechanism are also employed. Result: Extensive experiments show that the proposed method sets a new state-of-the-art on the SeqAffordSplat benchmark, advancing affordance reasoning to complex, sequential tasks. Conclusion: The proposed SeqSplatNet effectively advances affordance reasoning to handle complex, sequential tasks at the scene level, setting a new state-of-the-art on the introduced SeqAffordSplat benchmark. Abstract: 3D affordance reasoning, the task of associating human instructions with the functional regions of 3D objects, is a critical capability for embodied agents. Current methods based on 3D Gaussian Splatting (3DGS) are fundamentally limited to single-object, single-step interactions, a paradigm that falls short of addressing the long-horizon, multi-object tasks required for complex real-world applications. To bridge this gap, we introduce the novel task of Sequential 3D Gaussian Affordance Reasoning and establish SeqAffordSplat, a large-scale benchmark featuring 1800+ scenes to support research on long-horizon affordance understanding in complex 3DGS environments. We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks. SeqSplatNet employs a large language model that autoregressively generates text interleaved with special segmentation tokens, guiding a conditional decoder to produce the corresponding 3D mask. 
To handle complex scene geometry, we introduce a pre-training strategy, Conditional Geometric Reconstruction, where the model learns to reconstruct complete affordance region masks from known geometric observations, thereby building a robust geometric prior. Furthermore, to resolve semantic ambiguities, we design a feature injection mechanism that lifts rich semantic features from 2D Vision Foundation Models (VFM) and fuses them into the 3D decoder at multiple scales. Extensive experiments demonstrate that our method sets a new state-of-the-art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.

[158] Half-Physics: Enabling Kinematic 3D Human Model with Physical Interactions

Li Siyao,Yao Feng,Omid Tehari,Chen Change Loy,Michael J. Black

Main category: cs.CV

TL;DR: This paper introduces a 'half-physics' approach to enable SMPL-X, a 3D human model, to interact physically with environments, solving issues like interpenetration without compromising motion accuracy or requiring complex training.

Details

Motivation: The motivation is to address the lack of physical interaction capability in current 3D human models, which leads to issues like interpenetration and unrealistic object dynamics during interactions. Method: The method involves embedding the SMPL-X model into a tangible entity using a 'half-physics' mechanism that transforms 3D kinematic motion into physics simulations while maintaining kinematic control. Result: The result is a learning-free, real-time approach that generalizes to any body shape and motion, ensures physically plausible interactions, and eliminates interpenetration and unrealistic dynamics. Conclusion: The paper concludes that the proposed half-physics method effectively overcomes the limitations of current 3D human models by enabling dynamic physical interactions without compromising motion fidelity or requiring complex training. Abstract: While current general-purpose 3D human models (e.g., SMPL-X) efficiently represent accurate human shape and pose, they lack the ability to physically interact with the environment due to their kinematic nature. As a result, kinematic-based interaction models often suffer from issues such as interpenetration and unrealistic object dynamics. To address this limitation, we introduce a novel approach that embeds SMPL-X into a tangible entity capable of dynamic physical interactions with its surroundings. Specifically, we propose a "half-physics" mechanism that transforms 3D kinematic motion into a physics simulation. Our approach maintains kinematic control over inherent SMPL-X poses while ensuring physically plausible interactions with scenes and objects, effectively eliminating penetration and unrealistic object dynamics. Unlike reinforcement learning-based methods, which demand extensive and complex training, our half-physics method is learning-free and generalizes to any body shape and motion; meanwhile, it operates in real time. 
Moreover, it preserves the fidelity of the original kinematic motion while seamlessly integrating physical interactions.

[159] Phi-Ground Tech Report: Advancing Perception in GUI Grounding

Miaosen Zhang,Ziqiang Xu,Jialiang Zhu,Qi Dai,Kai Qiu,Yifan Yang,Chong Luo,Tianyi Chen,Justin Wagle,Tim Franklin,Baining Guo

Main category: cs.CV

TL;DR: This paper introduces Phi-Ground, a new model family that achieves state-of-the-art performance on all five grounding benchmarks in agent settings.

Details

Motivation: Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks such as ScreenSpot-pro and UI-Vision, indicating that they are far from ready for deployment. Method: An empirical study of grounding-model training, examining details from data collection to model training. Result: The resulting Phi-Ground model family achieves state-of-the-art performance on all five grounding benchmarks in agent settings. In the end-to-end model setting, it scores 43.2 on ScreenSpot-pro and 27.2 on UI-Vision, which is still state of the art. Conclusion: The Phi-Ground model family achieves state-of-the-art performance on all five grounding benchmarks in agent settings. Abstract: With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from "Iron Man", are becoming a reality. GUI grounding is a core component for CUAs to execute actual actions, similar to mechanical control in robotics, and it directly leads to the success or failure of the system. It determines actions such as clicking and typing, as well as related parameters like the coordinates for clicks. Current end-to-end grounding models still achieve less than 65% accuracy on challenging benchmarks like ScreenSpot-pro and UI-Vision, indicating they are far from being ready for deployment. In this work, we conduct an empirical study on the training of grounding models, examining details from data collection to model training. Ultimately, we developed the Phi-Ground model family, which achieves state-of-the-art performance across all five grounding benchmarks for models under 10B parameters in agent settings. In the end-to-end model setting, our model still achieves SOTA results with scores of 43.2 on ScreenSpot-pro and 27.2 on UI-Vision. We believe that the various details discussed in this paper, along with our successes and failures, not only clarify the construction of grounding models but also benefit other perception tasks. Project homepage: https://zhangmiaosen2000.github.io/Phi-Ground/

[160] MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion

Zihan Wang,Jeff Tan,Tarasha Khurana,Neehar Peri,Deva Ramanan

Main category: cs.CV

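TL;DR: This paper proposes a new method for reconstructing dynamic scenes from videos captured by a small number of static cameras, achieving high-quality results by optimizing the alignment of monocular reconstructions.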

Details

Motivation: Existing dense multi-view reconstruction methods require captures from hundreds of calibrated cameras, which are expensive and hard to use in the wild. This paper aims to address this by reconstructing dynamic scenes, especially dynamic human behaviors, from sparse views. Method: The method aligns the independent monocular reconstructions of each camera to cope with the limited overlap between viewpoints in the sparse-view setting. Result: Experiments on PanopticStudio and Ego-Exo4D show that the method achieves higher-quality reconstructions than prior art, particularly when rendering novel views. Conclusion: The paper proposes a method for reconstructing dynamic scenes from sparse-view videos; by carefully aligning the independent monocular reconstructions of each camera, it achieves time- and view-consistent, high-quality reconstructions that outperform prior art. Abstract: We address the problem of dynamic scene reconstruction from sparse-view videos. Prior work often requires dense multi-view captures with hundreds of calibrated cameras (e.g. Panoptic Studio). Such multi-view setups are prohibitively expensive to build and cannot capture diverse scenes in-the-wild. In contrast, we aim to reconstruct dynamic human behaviors, such as repairing a bike or dancing, from a small set of sparse-view cameras with complete scene coverage (e.g. four equidistant inward-facing static cameras). We find that dense multi-view reconstruction methods struggle to adapt to this sparse-view setup due to limited overlap between viewpoints. To address these limitations, we carefully align independent monocular reconstructions of each camera to produce time- and view-consistent dynamic scene reconstructions. Extensive experiments on PanopticStudio and Ego-Exo4D demonstrate that our method achieves higher quality reconstructions than prior art, particularly when rendering novel views. Code, data, and data-processing scripts are available on https://github.com/ImNotPrepared/MonoFusion.

[161] SUB: Benchmarking CBM Generalization via Synthetic Attribute Substitutions

Jessica Bader,Leander Girrbach,Stephan Alaniz,Zeynep Akata

Main category: cs.CV

TL;DR: This paper introduces a new benchmark called SUB for evaluating Concept Bottleneck Models' robustness to concept variations using synthetic images generated with a novel Tied Diffusion Guidance method.

Details

Motivation: The motivation is to assess the robustness of Concept Bottleneck Models (CBMs) and other concept-based interpretable models to concept variations, especially in fields like medicine where transparency in AI applications is essential. Method: The authors create a fine-grained image and concept benchmark called SUB with 38,400 synthetic images based on the CUB dataset using a novel Tied Diffusion Guidance (TDG) method. Result: The result is the creation of the SUB dataset with synthetic images that substitute specific concepts, enabling a detailed evaluation of CBMs' performance under concept variations. Conclusion: The paper concludes that CBMs struggle to reliably identify the correct concepts under distribution shifts, and the introduced SUB benchmark enables rigorous evaluation of CBMs and similar interpretable models. Abstract: Concept Bottleneck Models (CBMs) and other concept-based interpretable models show great promise for making AI applications more transparent, which is essential in fields like medicine. Despite their success, we demonstrate that CBMs struggle to reliably identify the correct concepts under distribution shifts. To assess the robustness of CBMs to concept variations, we introduce SUB: a fine-grained image and concept benchmark containing 38,400 synthetic images based on the CUB dataset. To create SUB, we select a CUB subset of 33 bird classes and 45 concepts to generate images which substitute a specific concept, such as wing color or belly pattern. We introduce a novel Tied Diffusion Guidance (TDG) method to precisely control generated images, where noise sharing for two parallel denoising processes ensures that both the correct bird class and the correct attribute are generated. This novel benchmark enables rigorous evaluation of CBMs and similar interpretable models, contributing to the development of more robust methods. 
Our code is available at https://github.com/ExplainableML/sub and the dataset at http://huggingface.co/datasets/Jessica-bader/SUB.

[162] Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

Bowen Zhang,Sicheng Xu,Chuxin Wang,Jiaolong Yang,Feng Zhao,Dong Chen,Baining Guo

Main category: cs.CV

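TL;DR: This paper introduces a new video-to-4D generation method that efficiently produces high-quality dynamic 3D content and performs well on both synthetic and real-world data.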

Details

Motivation: Direct 4D diffusion modeling is hampered by costly data construction and by the high-dimensional nature of jointly representing dynamic 3D shape, appearance, and motion, so a more efficient way to generate high-quality dynamic 3D content is needed. Method: A Direct 4DMesh-to-GS Variation Field VAE encodes canonical Gaussian Splats (GS) and their temporal variations directly from 3D animation data, and a temporal-aware Diffusion Transformer is trained on this representation. Result: Trained on the Objaverse dataset, the model outperforms existing methods and generalizes remarkably well to unseen real-world video inputs. Conclusion: The paper proposes a novel video-to-4D generation framework that produces high-quality dynamic 3D content from a single video input and demonstrates its advantages in generation quality and generalization. Abstract: In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully-curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: https://gvfdiffusion.github.io/.

eess.IV [Back]

[163] Rethink Domain Generalization in Heterogeneous Sequence MRI Segmentation

Zheyuan Zhang,Linkai Peng,Wanying Dou,Cuiling Sun,Halil Ertugrul Aktas,Andrea M. Bejar,Elif Keles,Gorkem Durak,Ulas Bagci

Main category: eess.IV

TL;DR: PancreasDG is a large-scale multi-center 3D MRI pancreas segmentation dataset for studying domain generalization in medical imaging.

Details

Motivation: Existing domain-generalization benchmarks focus mainly on cross-center shifts and overlook the dominant source of variability. Pancreas segmentation remains a major challenge in abdominal imaging, and despite its clinical importance in early cancer detection, surgery, and diabetes research, the organ is systematically under-represented in public cross-domain benchmarks. Method: The PancreasDG dataset comprises 563 MRI scans from six institutions, spanning venous phase and out-of-phase sequences, with pixel-accurate pancreas masks created by a double-blind, two-pass protocol. A semi-supervised approach is also proposed to address cross-sequence segmentation. Result: The proposed semi-supervised approach significantly outperforms existing techniques, with a 61.63% Dice score improvement and 87.00% on two test centers for cross-sequence segmentation. Conclusion: PancreasDG sets a new benchmark for domain generalization in medical imaging, and the proposed semi-supervised approach, which leverages anatomical invariances, significantly outperforms existing domain generalization techniques. Abstract: Clinical magnetic-resonance (MR) protocols generate many T1 and T2 sequences whose appearance differs more than the acquisition sites that produce them. Existing domain-generalization benchmarks focus almost exclusively on cross-center shifts and overlook this dominant source of variability. Pancreas segmentation remains a major challenge in abdominal imaging: the gland is small, irregular, surrounded by organs and fat, and often suffers from low T1 contrast. State-of-the-art deep networks that already achieve >90% Dice on the liver or kidneys still miss 20-30% of the pancreas. The organ is also systematically under-represented in public cross-domain benchmarks, despite its clinical importance in early cancer detection, surgery, and diabetes research. To close this gap, we present PancreasDG, a large-scale multi-center 3D MRI pancreas segmentation dataset for investigating domain generalization in medical imaging. The dataset comprises 563 MRI scans from six institutions, spanning both venous phase and out-of-phase sequences, enabling study of both cross-center and cross-sequence variations with pixel-accurate pancreas masks created by a double-blind, two-pass protocol. Through comprehensive analysis, we reveal three insights: (i) limited sampling introduces significant variance that may be mistaken for distribution shifts, (ii) cross-center performance correlates with source domain performance for identical sequences, and (iii) cross-sequence shifts require specialized solutions. 
We also propose a semi-supervised approach that leverages anatomical invariances, significantly outperforming state-of-the-art domain generalization techniques with 61.63% Dice score improvements and 87.00% on two test centers for cross-sequence segmentation. PancreasDG sets a new benchmark for domain generalization in medical imaging. Dataset, code, and models will be available at https://pancreasdg.netlify.app.
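
Since the results above are reported as Dice scores, a minimal sketch of how that overlap metric is commonly computed for binary segmentation masks may be useful. The function name `dice_score`, the flat 0/1 list representation of a mask, and the convention for two empty masks are assumptions made for illustration; they are not taken from the PancreasDG code.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two binary masks.

    Masks are flat sequences of 0/1 values (a 3D volume would be flattened first).
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    # Voxels labeled foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy example: 3 predicted voxels, 3 ground-truth voxels, 2 in common.
pred = [0, 1, 1, 1, 0, 0]
target = [0, 0, 1, 1, 1, 0]
print(dice_score(pred, target))
```

A score of 1.0 means the predicted mask matches the reference exactly, and 0.0 means the two masks share no foreground voxels.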

[164] Towards High-Resolution Alignment and Super-Resolution of Multi-Sensor Satellite Imagery

Philip Wootaek Shin,Vishal Gaur,Rahul Ramachandran,Manil Maskey,Jack Sampson,Vijaykrishnan Narayanan,Sujit Roy

Main category: eess.IV

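TL;DR: This study proposes a preliminary framework that uses HLS10 imagery as a reference to enhance the resolution of HLS30 imagery, effectively addressing the resolution gap between heterogeneous satellite sensors.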

Details

Motivation: Differences in spatial resolution across satellite sensors pose challenges for data fusion and downstream applications, and this gap needs to be addressed. Method: A preliminary framework is developed that uses HLS10 as a reference to align and enhance HLS 30m-resolution imagery. Result: Quantitative and qualitative evaluations demonstrate the effectiveness of the method and show its potential for enhancing satellite-based sensing applications. Conclusion: The study provides insights into the feasibility of heterogeneous satellite image super-resolution and highlights key considerations for future research. Abstract: High-resolution satellite imagery is essential for geospatial analysis, yet differences in spatial resolution across satellite sensors present challenges for data fusion and downstream applications. Super-resolution techniques can help bridge this gap, but existing methods rely on artificially downscaled images rather than real sensor data and are not well suited for heterogeneous satellite sensors with differing spectral and temporal characteristics. In this work, we develop a preliminary framework to align and super-resolve Harmonized Landsat Sentinel 30m (HLS30) imagery using Harmonized Landsat Sentinel 10m (HLS10) as a reference from the HLS dataset. Our approach aims to bridge the resolution gap between these sensors and improve the quality of super-resolved Landsat imagery. Quantitative and qualitative evaluations demonstrate the effectiveness of our method, showing its potential for enhancing satellite-based sensing applications. This study provides insights into the feasibility of heterogeneous satellite image super-resolution and highlights key considerations for future advancements in the field.

Table of Contents

cs.CL [Back]

[1] Large Language Models in the Travel Domain: An Industrial Experience

[2] ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing

[3] A Language Model-Driven Semi-Supervised Ensemble Framework for Illicit Market Detection Across Deep/Dark Web and Social Platforms

[4] A Hybrid Framework for Subject Analysis: Integrating Embedding-Based Regression Models with Large Language Models

[5] Full Triple Matcher: Integrating all triple elements between heterogeneous Knowledge Graphs

[6] Theoretical Foundations and Mitigation of Hallucination in Large Language Models

[7] Reading Between the Timelines: RAG for Answering Diachronic Questions

[8] Semantic Convergence: Investigating Shared Representations Across Scaled LLMs

[9] A novel language model for predicting serious adverse event results in clinical trials from their prospective registrations

[10] Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey

[11] Fast and Accurate Contextual Knowledge Extraction Using Cascading Language Model Chains and Candidate Answers

[12] Predicting stock prices with ChatGPT-annotated Reddit sentiment

[13] How and Where to Translate? The Impact of Translation Strategies in Cross-lingual LLM Prompting

[14] Using Sentiment Analysis to Investigate Peer Feedback by Native and Non-Native English Speakers

[15] Hierarchical Memory for High-Efficiency Long-Term Reasoning in LLM Agents

[16] Multi-Relation Extraction in Entity Pairs using Global Context

[17] PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation

[18] How does Chain of Thought Think? Mechanistic Interpretability of Chain-of-Thought Reasoning with Sparse Autoencoding

[19] EH-Benchmark Ophthalmic Hallucination Benchmark and Agent-Driven Top-Down Traceable Reasoning Workflow

[20] Protecting Vulnerable Voices: Synthetic Dataset Generation for Self-Disclosure Detection

[21] Enhancing RAG Efficiency with Adaptive Context Compression

[22] FinMarBa: A Market-Informed Dataset for Financial Sentiment Classification

[23] Augmented Vision-Language Models: A Systematic Review

[24] Deep Learning Approaches for Multimodal Intent Recognition: A Survey

[25] Trusted Knowledge Extraction for Operations and Maintenance Intelligence

[26] Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis

[27] CoE-Ops: Collaboration of LLM-based Experts for AIOps Question-Answering

[28] A Graph-based Approach for Multi-Modal Question Answering from Flowcharts in Telecom Documents

[29] PARROT: An Open Multilingual Radiology Reports Dataset

[30] Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes

[31] SigBERT: Combining Narrative Medical Reports and Rough Path Signature Theory for Survival Risk Estimation in Oncology

[32] A chart review process aided by natural language processing and multi-wave adaptive sampling to expedite validation of code-based algorithms for large database studies

[33] Opacity as Authority: Arbitrariness and the Preclusion of Contestation

[34] C3: A Bilingual Benchmark for Spoken Dialogue Models Exploring Challenges in Complex Conversations

[35] Math Natural Language Inference: this should be easy!

[36] Exploring In-Context Learning for Frame-Semantic Parsing

[37] Context-aware Rotary Position Embedding

[38] SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity

[39] RASL: Retrieval Augmented Schema Linking for Massive Database Text-to-SQL

[40] Uncovering the Fragility of Trustworthy LLMs through Chinese Textual Ambiguity

[41] ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual-Language Models through Procedural Plans

[42] User Feedback in Human-LLM Dialogues: A Lens to Understand Users But Noisy as a Learning Signal

[43] LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration

[44] Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

[45] Failures Are the Stepping Stones to Success: Enhancing Few-Shot In-Context Learning by Leveraging Negative Samples

[46] Model Directions, Not Words: Mechanistic Topic Models Using Sparse Autoencoders

[47] Enabling Few-Shot Alzheimer's Disease Diagnosis on Tabular Biomarker Data with LLMs

[48] P-ReMIS: Pragmatic Reasoning in Mental Health and a Social Implication

[49] Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis

[50] Unveiling Super Experts in Mixture-of-Experts Large Language Models

[51] What's Taboo for You? - An Empirical Evaluation of LLMs Behavior Toward Sensitive Content

[52] MUST-RAG: MUSical Text Question Answering with Retrieval Augmented Generation

[53] Text-to-SQL Task-oriented Dialogue Ontology Construction

[54] MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models

[55] Causal2Vec: Improving Decoder-only LLMs as Versatile Embedding Models

[56] Beyond the Cloud: Assessing the Benefits and Drawbacks of Local LLM Deployment for Translators

[57] MRGSEM-Sum: An Unsupervised Multi-document Summarization Framework based on Multi-Relational Graphs and Structural Entropy Minimization

[58] Enhanced Arabic Text Retrieval with Attentive Relevance Scoring

[59] Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration

[60] Role-Aware Language Models for Secure and Contextualized Access Control in Organizations

[61] A Novel Evaluation Benchmark for Medical LLMs: Illuminating Safety and Effectiveness in Clinical Domains

[62] Med-R$^3$: Enhancing Medical Retrieval-Augmented Reasoning of LLMs via Progressive Reinforcement Learning

[63] T-Detect: Tail-Aware Statistical Normalization for Robust Detection of Adversarial Machine-Generated Text

[64] DiffLoRA: Differential Low-Rank Adapters for Large Language Models

[65] Arabic Hate Speech Identification and Masking in Social Media using Deep Learning Models and Pre-trained Models Fine-tuning

[66] Rule2Text: Natural Language Explanation of Logical Rules in Knowledge Graphs

[67] Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities

cs.CV [Back]

[68] CHECK-MAT: Checking Hand-Written Mathematical Answers for the Russian Unified State Exam

[69] Robust and Efficient 3D Gaussian Splatting for Urban Scene Reconstruction

[70] Modeling Human Gaze Behavior with Diffusion Models for Unified Scanpath Prediction

[71] Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging

[72] Adaptive Time-step Training for Enhancing Spike-Based Neural Radiance Fields

[73] Early Goal-Guided Multi-Scale Fusion for Real-Time Vision-Language Driving

[74] Reference-Guided Diffusion Inpainting For Multimodal Counterfactual Generation

[75] Vision-Language Fusion for Real-Time Autonomous Driving: Goal-Centered Cross-Attention of Camera, HD-Map, & Waypoints

[76] Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model

[77] Details Matter for Indoor Open-vocabulary 3D Instance Segmentation