cs.CL

[1] MedBench-IT: A Comprehensive Benchmark for Evaluating Large Language Models on Italian Medical Entrance Examinations

Ruggero Marino Lazzaroni,Alessandro Angioi,Michelangelo Puliga,Davide Sanna,Roberto Marras

Main category: cs.CL

TL;DR: MedBench-IT is a comprehensive benchmark for evaluating LLMs on Italian medical university entrance examinations.

Details

Motivation: Large language models (LLMs) show increasing potential in education, yet benchmarks for non-English languages in specialized domains remain scarce. Method: MedBench-IT comprises 17,410 expert-written multiple-choice questions from Edizioni Simone, covering six subjects and three difficulty levels, and is used to evaluate multiple models. Result: The study finds a statistically significant but small inverse relationship between question readability and model performance, and reports rigorous reproducibility tests and a reasoning prompt evaluation. Conclusion: MedBench-IT is an important resource for the Italian NLP community, EdTech developers, and practitioners, offering insights into current capabilities and a standardized evaluation methodology. Abstract: Large language models (LLMs) show increasing potential in education, yet benchmarks for non-English languages in specialized domains remain scarce. We introduce MedBench-IT, the first comprehensive benchmark for evaluating LLMs on Italian medical university entrance examinations. Sourced from Edizioni Simone, a leading preparatory materials publisher, MedBench-IT comprises 17,410 expert-written multiple-choice questions across six subjects (Biology, Chemistry, Logic, General Culture, Mathematics, Physics) and three difficulty levels. We evaluated diverse models including proprietary LLMs (GPT-4o, Claude series) and resource-efficient open-source alternatives (<30B parameters) focusing on practical deployability. Beyond accuracy, we conducted rigorous reproducibility tests (88.86% response consistency, varying by subject), ordering bias analysis (minimal impact), and reasoning prompt evaluation. We also examined correlations between question readability and model performance, finding a statistically significant but small inverse relationship. MedBench-IT provides a crucial resource for the Italian NLP community, EdTech developers, and practitioners, offering insights into current capabilities and a standardized evaluation methodology for this critical domain.

[2] The ML-SUPERB 2.0 Challenge: Towards Inclusive ASR Benchmarking for All Language Varieties

William Chen,Chutong Meng,Jiatong Shi,Martijn Bartelds,Shih-Heng Wang,Hsiu-Hsuan Wang,Rafael Mosquera,Sara Hincapie,Dan Jurafsky,Antonis Anastasopoulos,Hung-yi Lee,Karen Livescu,Shinji Watanabe

Main category: cs.CL

TL;DR: The ML-SUPERB 2.0 Challenge introduces a comprehensive evaluation framework for multilingual ASR, achieving significant performance improvements and highlighting the importance of inclusivity in speech technology development.

Details

Motivation: Recent improvements in multilingual ASR have not been equally distributed across languages and language varieties, prompting the need for a more comprehensive evaluation framework. Method: A new test suite with data from over 200 languages, accents, and dialects was constructed, along with an online evaluation server based on DynaBench to assess and compare state-of-the-art multilingual ASR models. Result: Five submissions from three teams outperformed the baselines, with the best submission achieving a 23% absolute improvement in LID accuracy, 18% reduction in CER on general multilingual data, and further improvements on accented and dialectal data. Conclusion: The ML-SUPERB 2.0 Challenge significantly advanced multilingual ASR by achieving notable improvements in LID accuracy and CER across multiple languages, dialects, and accents, highlighting the importance of community-driven challenges for inclusive speech technologies. Abstract: Recent improvements in multilingual ASR have not been equally distributed across languages and language varieties. To advance state-of-the-art (SOTA) ASR models, we present the Interspeech 2025 ML-SUPERB 2.0 Challenge. We construct a new test suite that consists of data from 200+ languages, accents, and dialects to evaluate SOTA multilingual speech models. The challenge also introduces an online evaluation server based on DynaBench, allowing for flexibility in model design and architecture for participants. The challenge received 5 submissions from 3 teams, all of which outperformed our baselines. The best-performing submission achieved an absolute improvement in LID accuracy of 23% and a reduction in CER of 18% when compared to the best baseline on a general multilingual test set. On accented and dialectal data, the best submission obtained 30.2% lower CER and 15.7% higher LID accuracy, showing the importance of community challenges in making speech technologies more inclusive.
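
The results above are reported in terms of character error rate (CER). As a rough illustration of that metric only, the sketch below uses the common definition of CER as character-level edit distance divided by reference length. The function names are hypothetical, and the challenge's own scoring pipeline (text normalization, aggregation across languages) is not described in the abstract and is not reproduced here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `cer("hello", "hallo")` counts one substitution over five reference characters.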

[3] Toward Purpose-oriented Topic Model Evaluation enabled by Large Language Models

Zhiyin Tan,Jennifer D'Souza

Main category: cs.CL

TL;DR: This paper proposes an LLM-based framework for evaluating topic models, offering more interpretable and robust assessments than traditional metrics by focusing on semantic and structural soundness across multiple dimensions.

Details

Motivation: Traditional automated metrics for evaluating topic models, such as coherence and diversity, often capture only narrow statistical patterns and fail to explain semantic failures. This necessitates a more robust and semantically meaningful evaluation approach. Method: The study introduces a purpose-oriented evaluation framework using nine LLM-based metrics across four dimensions of topic quality: lexical validity, intra-topic semantic soundness, inter-topic structural soundness, and document-topic alignment soundness. The framework is validated through adversarial and sampling-based protocols and applied across diverse datasets and topic modeling methods. Result: LLM-based metrics provide interpretable, robust, and task-relevant assessments, uncovering critical weaknesses in topic models such as redundancy and semantic drift. The framework proves effective across datasets including news articles, scholarly publications, and social media posts. Conclusion: The study concludes that LLM-based metrics offer a scalable and interpretable solution for evaluating topic models, uncovering weaknesses missed by traditional metrics and ensuring topic relevance in dynamic datasets. Abstract: This study presents a framework for automated evaluation of dynamically evolving topic models using Large Language Models (LLMs). Topic modeling is essential for organizing and retrieving scholarly content in digital library systems, helping users navigate complex and evolving knowledge domains. However, widely used automated metrics, such as coherence and diversity, often capture only narrow statistical patterns and fail to explain semantic failures in practice. We introduce a purpose-oriented evaluation framework that employs nine LLM-based metrics spanning four key dimensions of topic quality: lexical validity, intra-topic semantic soundness, inter-topic structural soundness, and document-topic alignment soundness. 
The framework is validated through adversarial and sampling-based protocols, and is applied across datasets spanning news articles, scholarly publications, and social media posts, as well as multiple topic modeling methods and open-source LLMs. Our analysis shows that LLM-based metrics provide interpretable, robust, and task-relevant assessments, uncovering critical weaknesses in topic models such as redundancy and semantic drift, which are often missed by traditional metrics. These results support the development of scalable, fine-grained evaluation tools for maintaining topic relevance in dynamic datasets. All code and data supporting this work are accessible at https://github.com/zhiyintan/topic-model-LLMjudgment.

[4] Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector

Amal Chebbi,Babajide Kolade

Main category: cs.CL

TL;DR: EnergyGPT is an energy-domain language model developed by supervised fine-tuning of LLaMA 3.1-8B, showing notable performance gains on energy-related tasks.

Details

Motivation: Large language models perform well in general domains, but their effectiveness is limited in specialized fields such as energy, which require deep technical expertise and precise domain knowledge. The paper therefore aims to develop a language model specialized for the energy sector. Method: EnergyGPT is developed by supervised fine-tuning of the LLaMA 3.1-8B model on a high-quality corpus of energy-related texts. Result: Evaluation on domain-specific question-answering benchmarks shows that EnergyGPT outperforms the base model on most energy-related tasks. Conclusion: EnergyGPT is a domain-specialized language model tailored for the energy sector, built by supervised fine-tuning of LLaMA 3.1-8B, and it shows superior performance on energy-related tasks. Abstract: Large Language Models have demonstrated impressive capabilities across various domains. However, their general-purpose nature often limits their effectiveness in specialized fields such as energy, where deep technical expertise and precise domain knowledge are essential. In this paper, we introduce EnergyGPT, a domain-specialized language model tailored for the energy sector, developed by fine-tuning the LLaMA 3.1-8B model using Supervised Fine-Tuning on a high-quality, curated corpus of energy-related texts. We present a complete development pipeline, including data collection and curation, model fine-tuning, benchmark design and LLM-judge choice, evaluation and deployment. Through this work, we demonstrate that our training strategy enables improvements in domain relevance and performance without the need for large-scale infrastructure. By evaluating the performance of the model using domain-specific question-answering benchmarks, our results demonstrate that EnergyGPT outperforms the base model in most of the energy-related language understanding and generation tasks.

[5] DischargeSim: A Simulation Benchmark for Educational Doctor-Patient Communication at Discharge

Zonghai Yao,Michael Sun,Won Seok Jang,Sunjae Kwon,Soie Kwon,Hong Yu

Main category: cs.CL

TL;DR: The paper introduces DischargeSim, a benchmark for evaluating large language models (LLMs) as discharge educators, revealing significant performance gaps and variability across patient profiles, with model size not always correlating with better outcomes.

Details

Motivation: Discharge communication is a critical yet underexplored component of patient care, where the goal shifts from diagnosis to education. Recent LLM benchmarks emphasize diagnostic reasoning but fail to evaluate models' ability to support patients post-visit. Method: The study introduces DischargeSim, a novel benchmark that evaluates LLMs on their ability to act as personalized discharge educators through simulated post-visit, multi-turn conversations between DoctorAgents and PatientAgents with diverse psychosocial profiles. Result: Experiments across 18 LLMs reveal significant gaps in discharge education capability, with performance varying widely across patient profiles. Model size does not always yield better education outcomes, highlighting trade-offs in strategy use and content prioritization. Conclusion: DischargeSim offers a first step toward benchmarking LLMs in post-visit clinical education and promoting equitable, personalized patient support. Abstract: Discharge communication is a critical yet underexplored component of patient care, where the goal shifts from diagnosis to education. While recent large language model (LLM) benchmarks emphasize in-visit diagnostic reasoning, they fail to evaluate models' ability to support patients after the visit. We introduce DischargeSim, a novel benchmark that evaluates LLMs on their ability to act as personalized discharge educators. DischargeSim simulates post-visit, multi-turn conversations between LLM-driven DoctorAgents and PatientAgents with diverse psychosocial profiles (e.g., health literacy, education, emotion). Interactions are structured across six clinically grounded discharge topics and assessed along three axes: (1) dialogue quality via automatic and LLM-as-judge evaluation, (2) personalized document generation including free-text summaries and structured AHRQ checklists, and (3) patient comprehension through a downstream multiple-choice exam. 
Experiments across 18 LLMs reveal significant gaps in discharge education capability, with performance varying widely across patient profiles. Notably, model size does not always yield better education outcomes, highlighting trade-offs in strategy use and content prioritization. DischargeSim offers a first step toward benchmarking LLMs in post-visit clinical education and promoting equitable, personalized patient support.

[6] Rule-Based Moral Principles for Explaining Uncertainty in Natural Language Generation

Zahra Atf,Peter R Lewis

Main category: cs.CL

TL;DR: We propose a transparent framework based on moral principles to handle uncertainty in LLM-generated text, offering a lightweight alternative to probabilistic models for socially responsible natural language generation.

Details

Motivation: Large language models (LLMs) are increasingly used in high-stakes settings, where explaining uncertainty is both technical and ethical. Probabilistic methods are often opaque and misaligned with expectations of transparency. Method: We propose a framework based on rule-based moral principles for handling uncertainty in LLM-generated text. These rules are encoded in a lightweight Prolog engine, where uncertainty levels (low, medium, high) trigger aligned system actions with plain-language rationales. Result: Scenario-based simulations benchmark rule coverage, fairness, and trust calibration. Use cases in clinical and legal domains illustrate how moral reasoning can improve trust and interpretability. Conclusion: Our approach offers a transparent, lightweight alternative to probabilistic models for socially responsible natural language generation. Abstract: Large language models (LLMs) are increasingly used in high-stakes settings, where explaining uncertainty is both technical and ethical. Probabilistic methods are often opaque and misaligned with expectations of transparency. We propose a framework based on rule-based moral principles for handling uncertainty in LLM-generated text. Using insights from moral psychology and virtue ethics, we define rules such as precaution, deference, and responsibility to guide responses under epistemic or aleatoric uncertainty. These rules are encoded in a lightweight Prolog engine, where uncertainty levels (low, medium, high) trigger aligned system actions with plain-language rationales. Scenario-based simulations benchmark rule coverage, fairness, and trust calibration. Use cases in clinical and legal domains illustrate how moral reasoning can improve trust and interpretability. Our approach offers a transparent, lightweight alternative to probabilistic models for socially responsible natural language generation.
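
The idea of uncertainty levels triggering rule-governed actions with plain-language rationales can be sketched as a small lookup table. This is an illustrative Python stand-in, not the authors' Prolog engine. The rule names (precaution, deference, responsibility) and the levels (low, medium, high) come from the abstract; the pairing of each level with a rule, the action labels, and the rationale wording are all assumptions made for the example.

```python
from typing import NamedTuple


class Decision(NamedTuple):
    rule: str       # moral principle invoked (names taken from the abstract)
    action: str     # hypothetical system action label
    rationale: str  # plain-language explanation shown to the user


# ASSUMPTION: which level triggers which rule is invented for illustration.
RULES = {
    "low": Decision("responsibility", "answer",
                    "Confidence is high, so the system answers and stands by it."),
    "medium": Decision("precaution", "answer_with_caveat",
                       "Confidence is moderate, so the answer is flagged as tentative."),
    "high": Decision("deference", "defer_to_human",
                     "Confidence is low, so the question is referred to a human expert."),
}


def decide(uncertainty_level: str) -> Decision:
    """Map an uncertainty level to a rule, an action, and a rationale."""
    key = uncertainty_level.strip().lower()
    if key not in RULES:
        raise ValueError(f"unknown uncertainty level: {uncertainty_level!r}")
    return RULES[key]
```

Keeping the rules in a declarative table mirrors the transparency argument of the abstract: every action can be traced to a named rule and a readable rationale.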

[7] LLM Analysis of 150+ years of German Parliamentary Debates on Migration Reveals Shift from Post-War Solidarity to Anti-Solidarity in the Last Decade

Aida Kostikova,Ole Pütz,Steffen Eger,Olga Sabelfeld,Benjamin Paassen

Main category: cs.CL

TL;DR: LLMs effectively analyze political debates on migration, showing a historical shift from solidarity to increasing anti-solidarity in Germany.

Details

Motivation: Automating complex annotation tasks with LLMs can enable broader analysis of political speech on migration, a historically significant and current topic in Germany. Method: Evaluated multiple LLMs against human annotations for (anti-)solidarity subtypes in German parliamentary debates; assessed model size, prompting, fine-tuning, and data type. Result: LLMs achieved significant performance in annotating (anti-)solidarity subtypes, revealing high solidarity post-WWII and a shift toward anti-solidarity since 2015. Conclusion: LLMs show promise in political text analysis, and migration debates in Germany highlight trends of increasing anti-solidarity since 2015. Abstract: Migration has been a core topic in German political debate, from millions of expellees after World War II, through labor migration, to refugee movements in the recent past. Studying political speech regarding such wide-ranging phenomena in depth traditionally required extensive manual annotations, limiting the scope of analysis to small subsets of the data. Large language models (LLMs) have the potential to partially automate even complex annotation tasks. We provide an extensive evaluation of multiple LLMs in annotating (anti-)solidarity subtypes in German parliamentary debates compared to a large set of thousands of human reference annotations (gathered over a year). We evaluate the influence of model size, prompting differences, fine-tuning, and historical versus contemporary data, and we investigate systematic errors. Beyond methodological evaluation, we also interpret the resulting annotations through a social science lens, gaining deeper insight into (anti-)solidarity trends towards migrants in the German post-World War II period and recent past. Our data reveals a high degree of migrant-directed solidarity in the postwar period, as well as a strong trend towards anti-solidarity in the German parliament since 2015, motivating further research. 
These findings highlight the promise of LLMs for political text analysis and the importance of migration debates in Germany, where demographic decline and labor shortages coexist with rising polarization.

[8] Causal Attention with Lookahead Keys

Zhuoqing Song,Peng Sun,Huizhuo Yuan,Quanquan Gu

Main category: cs.CL

TL;DR: The paper proposes CASTLE, an attention mechanism that dynamically updates keys to integrate future context while preserving autoregression, resulting in improved performance on language modeling tasks.

Details

Motivation: Standard causal attention mechanisms have static query, key, and value (QKV) components that encode only preceding context. The authors aim to enhance this mechanism by integrating information from future tokens without violating the autoregressive property. Method: The paper introduces CASTLE, which updates keys dynamically as the context unfolds, terming them lookahead keys. It uses a mathematical equivalence to avoid explicitly materializing these lookahead keys at each position, allowing efficient parallel training. Result: CASTLE consistently outperformed standard causal attention across various model scales in language modeling benchmarks, reducing validation perplexity and improving performance on downstream tasks. Conclusion: CASTLE, as a new attention mechanism, effectively improves the performance of language modeling tasks while maintaining the autoregressive property. Abstract: In standard causal attention, each token's query, key, and value (QKV) are static and encode only preceding context. We introduce CAuSal aTtention with Lookahead kEys (CASTLE), an attention mechanism that continually updates each token's keys as the context unfolds. We term these updated keys lookahead keys because they belong to earlier positions yet integrate information from tokens that appear later relative to those positions, while strictly preserving the autoregressive property. Although the mechanism appears sequential, we derive a mathematical equivalence that avoids explicitly materializing lookahead keys at each position and enables efficient parallel training. On language modeling benchmarks, CASTLE consistently outperforms standard causal attention across model scales, reducing validation perplexity and improving performance on a range of downstream tasks.
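
For orientation, the sketch below implements only the baseline that the abstract contrasts against: standard single-head causal attention, where position i attends to positions 0..i and each token's key is fixed once computed. It does not implement CASTLE's lookahead keys, whose update rule is not given in the abstract. The function name and the plain-list representation are choices made for this example.

```python
import math


def causal_attention(q, k, v):
    """Standard causal attention over lists of vectors (one per position).

    Position i only sees keys and values at positions j <= i, and each key
    k[j] is static: it never changes as later tokens arrive.
    """
    d = len(q[0])
    out = []
    for i in range(len(q)):
        # Scaled dot-product scores against the visible prefix only.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(weights[j] / z * v[j][c] for j in range(i + 1))
                    for c in range(len(v[0]))])
    return out
```

Because the first position can only attend to itself, its output equals its own value vector regardless of what follows, which is the autoregressive property the abstract says CASTLE preserves.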

[9] Basis Vector Metric: A Method for Robust Open-Ended State Change Detection

David Oprea,Sam Powers

Main category: cs.CL

TL;DR: The BVM method excels in classifying noun states compared to other metrics but does not significantly outperform logistic regression in differentiating adjectives. Potential improvements to BVM could enhance its performance further.

Details

Motivation: The motivation was to evaluate the effectiveness of a new method called BVM in accurately judging state changes in images, particularly in classifying noun states and differentiating adjectives. Method: The study employed the BVM (Basis Vectors Method) using language embeddings to assess state changes in images. Two experiments were conducted: one to compare BVM with other metrics (cosine similarity, dot product, product quantization, binary index, Naive Bayes, and a custom neural network) in classifying noun states, and another to evaluate BVM's ability to differentiate adjectives compared to logistic regression. Result: In the first experiment, BVM outperformed all other metrics in classifying noun states. In the second experiment, BVM did not show a significant advantage over logistic regression in differentiating adjectives, but potential enhancements to BVM were identified. Conclusion: The BVM method showed mixed results; it outperformed other metrics in classifying noun states but did not conclusively outperform logistic regression in differentiating adjectives. However, potential improvements to the BVM method were identified. Abstract: We test a new method, which we will abbreviate using the acronym BVM (Basis Vectors Method), in its ability to judge the state changes in images through using language embeddings. We used the MIT-States dataset, containing about 53,000 images, to gather all of our data, which has 225 nouns and 115 adjectives, with each noun having about 9 different adjectives, forming approximately 1000 noun-adjective pairs. For our first experiment, we test our method's ability to determine the state of each noun class separately against other metrics for comparison. These metrics are cosine similarity, dot product, product quantization, binary index, Naive Bayes, and a custom neural network. Among these metrics, we found that our proposed BVM performs the best in classifying the states for each noun. 
We then perform a second experiment where we try using BVM to determine if it can differentiate adjectives from one another for each adjective separately. We compared the abilities of BVM to differentiate adjectives against the proposed method the MIT-States paper suggests: using a logistic regression model. In the end, we did not find conclusive evidence that our BVM metric could perform better than the logistic regression model at discerning adjectives. Yet, we were able to find evidence for possible improvements to our method; this leads to the chance of increasing our method's accuracy through certain changes in our methodologies.

[10] Instance-level Performance Prediction for Long-form Generation Tasks

Chi-Yang Hsu,Alexander Braylan,Yiheng Su,Omar Alonso,Matthew Lease

Main category: cs.CL

TL;DR: This paper introduces a new benchmark for predicting the performance of long-form generation tasks, effectively predicting scores with few examples and quantifying uncertainty.

Details

Motivation: The motivation is to develop a benchmark for instance-level performance prediction in long-form generation tasks that have complex, fine-grained quality metrics. Method: The paper presents a benchmark that predicts continuous evaluation metric scores using black-box model inputs and outputs. It also infers prediction intervals to quantify uncertainty. Result: The result shows that the benchmark can predict scores effectively across various long-form generation tasks with as few as 16 training examples, and it provides a way to quantify uncertainty through prediction intervals. Conclusion: The paper concludes that their benchmark can effectively predict evaluation metric scores for long-form generation tasks with a small number of training examples, and it introduces a novel task, a valuable benchmark, and practical baselines. Abstract: We motivate and share a new benchmark for instance-level performance prediction of long-form generation tasks having multi-faceted, fine-grained quality metrics. Our task-, model- and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify uncertainty around point estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs, baselines, and metrics per task. We show that scores can be effectively predicted across long-form generation tasks using as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.

[11] Does This Look Familiar to You? Knowledge Analysis via Model Internal Representations

Sihyun Park

Main category: cs.CL

TL;DR: This paper introduces KAMIR, a novel data selection approach for training large language models that leverages model internal representations to enhance generalization performance across diverse tasks.

Details

Motivation: There is no established methodology for effective training data selection in supervised fine-tuning (SFT) of large language models (LLMs), and existing approaches are limited in scope or incur additional costs due to prompt engineering. Method: The study proposes KAMIR, which computes similarities between hidden states of each layer and the final hidden states for a given input to evaluate data, selecting data based on the model's familiarity with the input. Result: Experiments showed that training with less familiar data using KAMIR leads to improved generalization performance across a variety of tasks, even with small datasets and simple classifier architectures. Conclusion: KAMIR, a new method for training data selection in LLMs, enhances generalization performance across diverse tasks like machine reading comprehension and summarization by analyzing the model's internal representations. Abstract: Recent advances in large language models (LLMs) have been driven by pretraining, supervised fine-tuning (SFT), and alignment tuning. Among these, SFT plays a crucial role in transforming a model's general knowledge into structured responses tailored to specific tasks. However, there is no clearly established methodology for effective training data selection. Simply increasing the volume of data does not guarantee performance improvements, while preprocessing, sampling, and validation require substantial time and cost. To address this issue, a variety of data selection methods have been proposed. Among them, knowledge-based selection approaches identify suitable training data by analyzing the model's responses. Nevertheless, these methods typically rely on prompt engineering, making them sensitive to variations and incurring additional costs for prompt design. 
In this study, we propose Knowledge Analysis via Model Internal Representations (KAMIR), a novel approach that overcomes these limitations by analyzing data based on the model's internal representations. KAMIR computes similarities between the hidden states of each layer (block) and the final hidden states for a given input to assess the data. Unlike prior methods that were largely limited to multiple-choice tasks, KAMIR can be applied to a wide range of tasks such as machine reading comprehension and summarization. Moreover, it selects data useful for training based on the model's familiarity with the input, even with a small dataset and a simple classifier architecture. Experiments across diverse task datasets demonstrate that training with less familiar data leads to better generalization performance.
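
The core computation described above, comparing each layer's hidden state with the final hidden state for one input, might look roughly like the following. This is a minimal sketch under several assumptions not stated in the abstract: cosine similarity is used as the similarity measure, each layer is represented by a single vector (for instance a pooled or last-token state), and the function names are hypothetical. How KAMIR turns these similarities into a familiarity decision is not covered here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def layer_to_final_similarities(hidden_states):
    """Similarity of every intermediate layer's state to the final layer's state.

    hidden_states: list of vectors ordered from the first layer to the last,
    all taken from one forward pass on a single input (an assumption here).
    """
    if len(hidden_states) < 2:
        raise ValueError("need at least one intermediate layer and a final layer")
    final = hidden_states[-1]
    return [cosine(h, final) for h in hidden_states[:-1]]
```

The resulting list has one feature per intermediate layer, which is the kind of compact input a simple classifier could consume.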

Nakyung Lee,Yeongoon Kim,Minhae Oh,Suhwan Kim,Jin Woo Koo,Hyewon Jo,Jungwoo Lee

Main category: cs.CL

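TL;DR: This paper proposes SAOBP, a method that refines Transformer self-attention by injecting multi-hop relationships, and introduces the GTD metric; it effectively addresses attention localization and improves performance in small models.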

Details

Motivation: Transformer self-attention often suffers from localization, where attention concentrates on a limited subset of tokens and fails to capture long-range dependencies. Method: The paper proposes Self-Attention One-step Belief Propagation (SAOBP), which injects multi-hop relationships through a belief propagation process, and introduces Global Token Dependency (GTD) to interpret and quantify these interactions. Result: Experiments show that SAOBP helps prevent entropy collapse, adaptively maintains GTD at task-appropriate levels, and yields notable performance gains in small-scale models. Conclusion: The SAOBP framework helps prevent entropy collapse in deeper layers and maintains GTD at task-appropriate levels, thereby improving model performance, particularly in small-scale models. Abstract: The Transformer-based self-attention mechanism serves as the core of modern language models, yet it often suffers from localization, where attention collapses onto a limited subset of tokens and fails to capture long-range dependencies. To address this issue, we propose Self-Attention One-step Belief Propagation (SAOBP), a refinement framework that injects multi-hop relationships through a belief propagation process. To interpret and quantify these interactions, we introduce Global Token Dependency (GTD) that captures the relative contribution of multi-hop connections within the attention graph. Empirical results indicate that SAOBP helps prevent entropy collapse in deeper layers and adaptively maintains GTD at task-appropriate levels, thereby supporting improvements in model performance. Importantly, we observe competitive gains in small-scale models, highlighting its potential for improving inference quality in resource-constrained scenarios.

[13] PersonaFuse: A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions

Yixuan Tang,Yi Yang,Ahmed Abbasi

Main category: cs.CL

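TL;DR: PersonaFuse is a novel LLM post-training framework that combines persona adapters with a dynamic routing network, enabling LLMs to express different personalities in different contexts and thereby enhancing their social and emotional intelligence.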

Details

Motivation: LLMs show limitations in emotional perception and social competence in real-world conversations and need to adapt their communication style and emotional expression to different social and task contexts. Method: The paper introduces PersonaFuse, a novel LLM post-training framework grounded in Trait Activation Theory and the Big Five personality model, which uses a Mixture-of-Experts architecture combined with a dynamic routing network. Result: PersonaFuse substantially outperforms baseline models across multiple dimensions of social-emotional intelligence and delivers consistent improvements in applications such as mental health counseling and review-based customer service. Conclusion: PersonaFuse offers a theoretically grounded and practical approach to developing socially and emotionally enhanced LLMs, marking an important step toward more human-centric AI systems. Abstract: Recent advancements in Large Language Models (LLMs) demonstrate remarkable capabilities across various fields. These developments have led to more direct communication between humans and LLMs in various situations, such as social companionship and psychological support. However, LLMs often exhibit limitations in emotional perception and social competence during real-world conversations. These limitations partly originate from their inability to adapt their communication style and emotional expression to different social and task contexts. In this work, we introduce PersonaFuse, a novel LLM post-training framework that enables LLMs to adapt and express different personalities for varying situations. Inspired by Trait Activation Theory and the Big Five personality model, PersonaFuse employs a Mixture-of-Experts architecture that combines persona adapters with a dynamic routing network, enabling contextual trait expression. Experimental results show that PersonaFuse substantially outperforms baseline models across multiple dimensions of social-emotional intelligence. Importantly, these gains are achieved without sacrificing general reasoning ability or model safety, which remain common limitations of direct prompting and supervised fine-tuning approaches. PersonaFuse also delivers consistent improvements in downstream human-centered applications, such as mental health counseling and review-based customer service. Finally, human preference evaluations against leading LLMs, including GPT-4o and DeepSeek, demonstrate that PersonaFuse achieves competitive response quality despite its comparatively smaller model size. 
These findings demonstrate that PersonaFuse offers a theoretically grounded and practical approach for developing social-emotional enhanced LLMs, marking a significant advancement toward more human-centric AI systems.

[14] Talking with Oompa Loompas: A novel framework for evaluating linguistic acquisition of LLM agents

Sankalp Tattwadarshi Swain,Anshika Krishnatray,Dhruv Kumar,Jagat Sesh Challa

Main category: cs.CL

TL;DR: This study evaluates whether large language models can learn a new language through interactive feedback, finding that while they fail to converse within 100 responses, they employ human-like learning strategies, pointing to new evaluation methods and model improvements.

Details

Motivation: The motivation stems from the lack of existing studies assessing whether LLM agents can learn language through pattern recognition and interactive feedback, which is a key feature of human language acquisition. Method: The researchers developed a novel experimental framework where an LLM agent was tasked with acquiring and using a new language (Tinkatongue) through interaction with a bot that only understood Tinkatongue. Result: LLM agents failed to establish a conversation within 100 responses but used distinct strategies resembling human approaches to language learning. Conclusion: The study concludes that while LLM agents cannot establish a conversation in a newly constructed language within 100 responses, they employ strategies similar to human language learning, suggesting new directions for evaluation benchmarks and model designs. Abstract: Existing evaluation studies on linguistic competence of large language models (LLM agents) have focused primarily on vocabulary learning, morphological rule induction, syntactic generalization, pragmatic inference, and cross-linguistic transfer. However, none assess whether LLM agents can acquire a language through pattern recognition and interactive feedback, a central feature of human language acquisition. We propose a novel experimental framework in which an LLM agent is evaluated on its ability to acquire and use a newly constructed language (Tinkatongue) in conversation with a bot that understands only Tinkatongue. Our findings show that LLM agents fail to establish a conversation within 100 responses, yet they adopt distinct strategies that mirror human approaches to language learning. The results suggest a new direction for evaluation benchmarks and open pathways to model designs that learn more effectively from interactive feedback.

[15] The Role of Exploration Modules in Small Language Models for Knowledge Graph Question Answering

Yi-Jie Cheng,Oscar Chew,Yun-Nung Chen

Main category: cs.CL

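TL;DR: This paper proposes an approach for small language models that adds lightweight exploration modules to improve their performance on knowledge graph question answering, thereby mitigating hallucination.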

Details

Motivation: Existing methods for integrating knowledge graphs into large language models typically rely on proprietary or extremely large models, which limits accessibility and scalability. This paper explores methods suited to small language models. Method: The authors propose using simple, efficient exploration modules to handle knowledge graph traversal in place of the language model itself. Result: Experiments show that these lightweight modules effectively improve the performance of small language models on knowledge graph question answering tasks. Conclusion: Lightweight exploration modules improve small language models on knowledge-graph-based question answering and effectively mitigate hallucination. Abstract: Integrating knowledge graphs (KGs) into the reasoning processes of large language models (LLMs) has emerged as a promising approach to mitigate hallucination. However, existing work in this area often relies on proprietary or extremely large models, limiting accessibility and scalability. In this study, we investigate the capabilities of existing integration methods for small language models (SLMs) in KG-based question answering and observe that their performance is often constrained by their limited ability to traverse and reason over knowledge graphs. To address this limitation, we propose leveraging simple and efficient exploration modules to handle knowledge graph traversal in place of the language model itself. Experiment results demonstrate that these lightweight modules effectively improve the performance of small language models on knowledge graph question answering tasks. Source code: https://github.com/yijie-cheng/SLM-ToG/.

[16] LongEmotion: Measuring Emotional Intelligence of Large Language Models in Long-Context Interaction

Weichu Liu,Jing Xiong,Yuxuan Hu,Zixuan Li,Minghuan Tan,Ningning Mao,Chenyang Zhao,Zhongwei Wan,Chaofan Tao,Wendong Xu,Hui Shen,Chengming Li,Lingpeng Kong,Ngai Wong

Main category: cs.CL

TL;DR: This study presents LongEmotion, a benchmark for long-context emotional intelligence tasks, employing RAG and CoEM methods to improve performance, thereby advancing large language models towards more realistic and practical applications of emotional intelligence.

Details

Motivation: The motivation behind the study is to address the lack of focus on emotional intelligence in long-context scenarios in existing benchmarks, especially under realistic, practical settings where interactions are lengthy, diverse, and often noisy. Method: The researchers designed and introduced a benchmark called LongEmotion for long-context EI tasks. They employed Retrieval-Augmented Generation (RAG) and Collaborative Emotional Modeling (CoEM) methods to enhance performance under realistic constraints and compared these with standard prompt-based methods. Result: Experimental results showed that both RAG and CoEM consistently enhanced emotional intelligence-related performance across most long-context tasks. A comparative case study on the GPT series also highlighted differences among various models concerning emotional intelligence. Conclusion: The study concludes that the incorporation of RAG and CoEM methods enhances emotional intelligence-related performance across most long-context tasks, advancing large language models toward more practical and real-world applications in emotional intelligence. Abstract: Large language models (LLMs) make significant progress in Emotional Intelligence (EI) and long-context understanding. However, existing benchmarks tend to overlook certain aspects of EI in long-context scenarios, especially under realistic, practical settings where interactions are lengthy, diverse, and often noisy. To move towards such realistic settings, we present LongEmotion, a benchmark specifically designed for long-context EI tasks. It covers a diverse set of tasks, including Emotion Classification, Emotion Detection, Emotion QA, Emotion Conversation, Emotion Summary, and Emotion Expression. On average, the input length for these tasks reaches 8,777 tokens, with long-form generation required for Emotion Expression. 
To enhance performance under realistic constraints, we incorporate Retrieval-Augmented Generation (RAG) and Collaborative Emotional Modeling (CoEM), and compare them with standard prompt-based methods. Unlike conventional approaches, our RAG method leverages both the conversation context and the large language model itself as retrieval sources, avoiding reliance on external knowledge bases. The CoEM method further improves performance by decomposing the task into five stages, integrating both retrieval augmentation and limited knowledge injection. Experimental results show that both RAG and CoEM consistently enhance EI-related performance across most long-context tasks, advancing LLMs toward more practical and real-world EI applications. Furthermore, we conducted a comparative case study experiment on the GPT series to demonstrate the differences among various models in terms of EI. Code is available on GitHub at https://github.com/LongEmotion/LongEmotion, and the project page can be found at https://longemotion.github.io/.

[17] AIxcellent Vibes at GermEval 2025 Shared Task on Candy Speech Detection: Improving Model Performance by Span-Level Training

Christian Rene Thelen,Patrick Gustav Blaneck,Tobias Bornheim,Niklas Grieger,Stephan Bialonski

Main category: cs.CL

TL;DR: The study demonstrates that multilingual models like XLM-RoBERTa-Large can effectively detect positive online communication (candy speech) with high accuracy, outperforming monolingual models.

Details

Motivation: Automated detection of positive, supportive online communication (candy speech) is underexplored, and its systematic analysis is limited without reliable detection methods. Method: A 46k-comment German YouTube corpus was analyzed using monolingual and multilingual language models, including GBERT, Qwen3 Embedding, and XLM-RoBERTa, with a focus on span-level candy speech detection. Result: The multilingual XLM-RoBERTa-Large model outperformed other models, achieving the highest scores in binary positive F1 (0.8906) and categorized span-based detection (strict F1: 0.6307). Conclusion: Multilingual models, especially XLM-RoBERTa-Large, are effective in detecting candy speech with high performance in both binary and span-based detection tasks. Abstract: Positive, supportive online communication in social media (candy speech) has the potential to foster civility, yet automated detection of such language remains underexplored, limiting systematic analysis of its impact. We investigate how candy speech can be reliably detected in a 46k-comment German YouTube corpus by monolingual and multilingual language models, including GBERT, Qwen3 Embedding, and XLM-RoBERTa. We find that a multilingual XLM-RoBERTa-Large model trained to detect candy speech at the span level outperforms other approaches, ranking first in both binary (positive F1: 0.8906) and categorized span-based detection (strict F1: 0.6307) subtasks at the GermEval 2025 Shared Task on Candy Speech Detection. We speculate that span-based training, multilingual capabilities, and emoji-aware tokenizers improved detection performance. Our results demonstrate the effectiveness of multilingual models in identifying positive, supportive language.
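
To make the "strict F1" figure above more concrete, here is a minimal sketch of one common way to score span detection strictly: a predicted span counts only if its start, end, and category all match a gold span exactly. The function name, the tuple layout, and the category labels are assumptions made for illustration; the shared task's official scorer may differ in its details.

```python
def strict_span_f1(gold, pred):
    """Strict span F1 over collections of (start, end, label) tuples.

    A prediction is a true positive only on an exact match of all three fields.
    """
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans: the second prediction is off by one character,
# so under strict matching it earns no credit.
gold = {(0, 5, "compliment"), (10, 18, "encouragement")}
pred = {(0, 5, "compliment"), (10, 17, "encouragement")}
print(strict_span_f1(gold, pred))
```

Because near misses score zero, strict span F1 is typically much lower than a comment-level binary F1, which is consistent with the gap between the two numbers reported in the entry.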

[18] Understanding Stigmatizing Language Lexicons: A Comparative Analysis in Clinical Contexts

Yiliang Zhou,Di Hu,Tianchu Lyu,Jasmine Dhillon,Alexandra L. Beck,Gelareh Sadigh,Kai Zheng

Main category: cs.CL

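TL;DR: This paper examines the current state of stigmatizing language in healthcare, notes the lack of a unified lexicon, and, by comparing existing lexicons and analyzing their sentiment, emphasizes the need for a standardized lexicon.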

Details

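Motivation: Stigmatizing language contributes to healthcare inequities, yet there is no universally accepted or standardized lexicon defining which terms constitute stigmatizing language in healthcare. Method: The authors systematically searched the literature to identify existing stigmatizing language lexicons and compared them, examining similarities and discrepancies between lexicons and classifying term sentiment using an established sentiment dataset. Result: The search identified four lexicons. The analysis showed moderate semantic similarity among them, and most stigmatizing terms relate to judgmental expressions clinicians use to describe perceived negative patient behaviors. Sentiment analysis showed that most terms are classified as negative, though proportions vary across lexicons. Conclusion: The study underscores the need for a standardized stigmatizing language lexicon and highlights the challenges of defining stigmatizing language in clinical texts. Abstract: Stigmatizing language results in healthcare inequities, yet there is no universally accepted or standardized lexicon defining which words, terms, or phrases constitute stigmatizing language in healthcare. We conducted a systematic search of the literature to identify existing stigmatizing language lexicons and then analyzed them comparatively to examine: 1) similarities and discrepancies between these lexicons, and 2) the distribution of positive, negative, or neutral terms based on an established sentiment dataset. Our search identified four lexicons. The analysis results revealed moderate semantic similarity among them, and that most stigmatizing terms are related to judgmental expressions by clinicians to describe perceived negative behaviors. Sentiment analysis showed a predominant proportion of negatively classified terms, though variations exist across lexicons. Our findings underscore the need for a standardized lexicon and highlight challenges in defining stigmatizing language in clinical texts.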

[19] From Scarcity to Efficiency: Investigating the Effects of Data Augmentation on African Machine Translation

Mardiyyah Oduwole,Oluwatosin Olajide,Jamiu Suleiman,Faith Hunja,Busayo Awobade,Fatimo Adebanjo,Comfort Akanni,Chinonyelum Igwe,Peace Ododo,Promise Omoigui,Steven Kolawole,Abraham Owodunni

Main category: cs.CL

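TL;DR: This study investigates how data augmentation techniques improve machine translation performance for low-resource African languages and demonstrates their potential.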

Details

Motivation: The linguistic diversity of the African continent presents both challenges and opportunities for machine translation; low-resource languages in particular call for effective data augmentation techniques to improve translation performance. Method: The study applies two data augmentation techniques, sentence concatenation with back translation and switch-out, across six African languages. Result: Experiments show a minimum BLEU score increase of 25% across all six languages, indicating that these techniques can significantly improve machine translation performance. Conclusion: These data augmentation techniques perform well on low-resource languages and offer a path toward more robust translation systems. Abstract: The linguistic diversity across the African continent presents different challenges and opportunities for machine translation. This study explores the effects of data augmentation techniques in improving translation systems in low-resource African languages. We focus on two data augmentation techniques: sentence concatenation with back translation and switch-out, applying them across six African languages. Our experiments show significant improvements in machine translation performance, with a minimum increase of 25% in BLEU score across all six languages. We provide a comprehensive analysis and highlight the potential of these techniques to improve machine translation systems for low-resource languages, contributing to the development of more robust translation systems for under-resourced languages.
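
As a rough illustration of the switch-out idea named above, the sketch below replaces each token with a random vocabulary word with some fixed probability. This is a simplified variant chosen for brevity, not the authors' implementation: the function name, the `rate` parameter, and the toy vocabulary are all assumptions, and the published SwitchOut formulation samples the number of replacements from a distribution rather than flipping an independent coin per token.

```python
import random


def switch_out(tokens, vocab, rate=0.1, rng=None):
    """Return a copy of `tokens` where each token is swapped for a random
    vocabulary word with probability `rate` (simplified switch-out)."""
    if rng is None:
        rng = random.Random()
    return [rng.choice(vocab) if rng.random() < rate else tok for tok in tokens]


# Hypothetical toy data; in practice the vocabulary would come from the corpus.
vocab = ["river", "market", "child", "rain", "road"]
sentence = "the farmer walked to the village".split()
augmented = switch_out(sentence, vocab, rate=0.3, rng=random.Random(0))
print(" ".join(augmented))
```

The augmented sentence keeps its length, so it can be paired with the original target side (or with a similarly perturbed target) to produce extra training examples.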

[20] HALT-RAG: A Task-Adaptable Framework for Hallucination Detection with Calibrated NLI Ensembles and Abstention

Saumya Goswami,Siddharth Kurra

Main category: cs.CL

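TL;DR: HALT-RAG is a post-hoc verification system for detecting hallucinations in retrieval-augmented generation outputs, with strong performance and practical utility.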

Details

Motivation: Detecting content that contradicts or is unsupported by the source text is critical for the safe deployment of generative language models. Method: HALT-RAG trains a simple, calibrated meta-classifier on a universal feature set derived from Natural Language Inference models and lexical signals, and is evaluated with a rigorous 5-fold out-of-fold training protocol. Result: On the HaluEval benchmark, HALT-RAG achieves F1-scores of 0.7756, 0.9786, and 0.7391 on the summarization, QA, and dialogue tasks, respectively; its well-calibrated probabilities enable an effective abstention mechanism. Conclusion: With its universal feature set and lightweight classifier, HALT-RAG delivers strong verification performance and provides a reliable mechanism for balancing model performance with safety requirements. Abstract: Detecting content that contradicts or is unsupported by a given source text is a critical challenge for the safe deployment of generative language models. We introduce HALT-RAG, a post-hoc verification system designed to identify hallucinations in the outputs of Retrieval-Augmented Generation (RAG) pipelines. Our flexible and task-adaptable framework uses a universal feature set derived from an ensemble of two frozen, off-the-shelf Natural Language Inference (NLI) models and lightweight lexical signals. These features are used to train a simple, calibrated, and task-adapted meta-classifier. Using a rigorous 5-fold out-of-fold (OOF) training protocol to prevent data leakage and produce unbiased estimates, we evaluate our system on the HaluEval benchmark. By pairing our universal feature set with a lightweight, task-adapted classifier and a precision-constrained decision policy, HALT-RAG achieves strong OOF F1-scores of 0.7756, 0.9786, and 0.7391 on the summarization, QA, and dialogue tasks, respectively. The system's well-calibrated probabilities enable a practical abstention mechanism, providing a reliable tool for balancing model performance with safety requirements.
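
The abstract mentions a precision-constrained decision policy and an abstention mechanism without giving their details. The sketch below shows one plausible reading under stated assumptions: pick the lowest probability threshold whose precision on held-out predictions still meets a target, then abstain on inputs that fall just below it. The names `pick_threshold` and `decide`, the `margin` parameter, and the toy numbers are hypothetical and not taken from the paper.

```python
def pick_threshold(probs, labels, min_precision=0.9):
    """Return the smallest threshold whose precision meets `min_precision`,
    or None if no threshold does. `labels` use 1 for hallucinated."""
    best = None
    for t in sorted(set(probs), reverse=True):
        flagged = [y for p, y in zip(probs, labels) if p >= t]
        precision = sum(flagged) / len(flagged)
        if precision >= min_precision:
            best = t  # a lower threshold flags more items, so keep the lowest valid one
    return best


def decide(prob, threshold, margin=0.1):
    """Three-way decision: flag, abstain in a band just below the threshold,
    or accept. The abstention band is an assumption for illustration."""
    if prob >= threshold:
        return "hallucinated"
    if prob >= threshold - margin:
        return "abstain"
    return "supported"


# Hypothetical out-of-fold probabilities and gold labels.
probs = [0.95, 0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 1, 1, 0, 1, 0]
threshold = pick_threshold(probs, labels, min_precision=0.9)
print(threshold, decide(0.85, threshold), decide(0.75, threshold), decide(0.3, threshold))
```

A policy like this only makes sense when the probabilities are well calibrated, which is why the entry stresses calibration alongside the abstention mechanism.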

[21] ALLabel: Three-stage Active Learning for LLM-based Entity Recognition using Demonstration Retrieval

Zihan Chen,Lei Shi,Weize Wu,Qiji Zhou,Yue Zhang

Main category: cs.CL

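TL;DR: This paper proposes ALLabel, an efficient three-stage framework for entity recognition under a limited annotation budget, which significantly improves the performance-cost trade-off.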

Details

Motivation: Large-scale, high-performance entity recognition is important in the natural sciences, but existing LLM-based entity recognition models rely on fine-tuning, which often incurs significant cost. Method: The authors propose ALLabel, a three-stage framework that selects the most informative and representative samples to build a ground-truth retrieval corpus for LLM in-context learning. Result: Experiments show that selectively annotating only 5%-10% of the dataset with ALLabel achieves performance comparable to annotating the entire dataset, and that ALLabel outperforms all baselines across three specialized domain datasets. Conclusion: ALLabel is an effective framework for high-performance entity recognition under a limited annotation budget. Abstract: Many contemporary data-driven research efforts in the natural sciences, such as chemistry and materials science, require large-scale, high-performance entity recognition from scientific datasets. Large language models (LLMs) have increasingly been adopted to solve the entity recognition task, with the same trend being observed on all-spectrum NLP tasks. The prevailing entity recognition LLMs rely on fine-tuned technology, yet the fine-tuning process often incurs significant cost. To achieve the best performance-cost trade-off, we propose ALLabel, a three-stage framework designed to select the most informative and representative samples in preparing the demonstrations for LLM modeling. The annotated examples are used to construct a ground-truth retrieval corpus for LLM in-context learning. By sequentially employing three distinct active learning strategies, ALLabel consistently outperforms all baselines under the same annotation budget across three specialized domain datasets. Experimental results also demonstrate that selectively annotating only 5%-10% of the dataset with ALLabel can achieve performance comparable to the method annotating the entire dataset. Further analyses and ablation studies verify the effectiveness and generalizability of our proposal.

[22] VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents

Zheng Wu,Heyuan Huang,Xingyu Lou,Xiangmou Qu,Pengzhou Cheng,Zongru Wu,Weiwen Liu,Weinan Zhang,Jun Wang,Zhaoxiang Wang,Zhuosheng Zhang

Main category: cs.CL

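TL;DR: This paper proposes VeriOS-Agent, which improves task-completion reliability in untrustworthy environments by querying humans, with significant gains.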

Details

Motivation: Most existing OS agents are designed for idealized settings, whereas real-world environments often present untrustworthy conditions, so task completion needs to be made more reliable. Method: The authors propose a query-driven human-agent-GUI interaction framework and, building on it, introduce VeriOS-Agent, trained with a two-stage learning paradigm. Result: VeriOS-Agent improves the average step-wise success rate in untrustworthy scenarios by 20.64% over the state of the art, without compromising normal performance. Conclusion: VeriOS-Agent achieves more reliable task completion in untrustworthy scenarios while maintaining performance under normal conditions, and demonstrates rationality, generalizability, and scalability. Abstract: With the rapid progress of multimodal large language models, operating system (OS) agents become increasingly capable of automating tasks through on-device graphical user interfaces (GUIs). However, most existing OS agents are designed for idealized settings, whereas real-world environments often present untrustworthy conditions. To mitigate risks of over-execution in such scenarios, we propose a query-driven human-agent-GUI interaction framework that enables OS agents to decide when to query humans for more reliable task completion. Built upon this framework, we introduce VeriOS-Agent, a trustworthy OS agent trained with a two-stage learning paradigm that facilitates the decoupling and utilization of meta-knowledge. Concretely, VeriOS-Agent autonomously executes actions in normal conditions while proactively querying humans in untrustworthy scenarios. Experiments show that VeriOS-Agent improves the average step-wise success rate by 20.64% in untrustworthy scenarios over the state-of-the-art, without compromising normal performance. Analysis highlights VeriOS-Agent's rationality, generalizability, and scalability. The codes, datasets and models are available at https://github.com/Wuzheng02/VeriOS.

[23] Avoiding Knowledge Edit Skipping in Multi-hop Question Answering with Guided Decomposition

Yi Liu,Xiangrong Zhu,Xiangyu Liu,Wei Wei,Wei Hu

Main category: cs.CL

TL;DR: This paper proposes IRAKE, a new method for knowledge editing in large language models, which effectively addresses 'edit skipping' in multi-hop question answering by using iterative retrieval and guided decomposition.

Details

Motivation: Existing RAG-based KE methods struggle with multi-hop question answering due to 'edit skipping', where edited facts are overlooked during inference. Method: IRAKE uses iterative retrieval-augmented knowledge editing with guided decomposition, leveraging both single edited facts and entire edited cases for better accuracy. Result: Experimental results show that IRAKE successfully reduces editing failures caused by edit skipping and performs better than state-of-the-art KE methods in multi-hop QA tasks. Conclusion: IRAKE effectively addresses the problem of edit skipping in multi-hop question answering, outperforming existing KE methods. Abstract: In a rapidly evolving world where information updates swiftly, knowledge in large language models (LLMs) becomes outdated quickly. Retraining LLMs is not a cost-effective option, making knowledge editing (KE) without modifying parameters particularly necessary. We find that although existing retrieval-augmented generation (RAG)-based KE methods excel at editing simple knowledge, they struggle with KE in multi-hop question answering due to the issue of "edit skipping", which refers to skipping the relevant edited fact in inference. In addition to the diversity of natural language expressions of knowledge, edit skipping also arises from the mismatch between the granularity of LLMs in problem-solving and the facts in the edited memory. To address this issue, we propose a novel Iterative Retrieval-Augmented Knowledge Editing method with guided decomposition (IRAKE) through the guidance from single edited facts and entire edited cases. Experimental results demonstrate that IRAKE mitigates the failure of editing caused by edit skipping and outperforms state-of-the-art methods for KE in multi-hop question answering.

[24] BALI: Enhancing Biomedical Language Representations through Knowledge Graph and Language Model Alignment

Andrey Sakhovskiy,Elena Tutubalina

Main category: cs.CL

TL;DR: This paper proposes BALI, a method that jointly pre-trains a language model (LM) with a knowledge graph (KG) to improve biomedical text understanding.

Details

Motivation: Existing biomedical language models have limited understanding of complex, domain-specific concept structures and of the factual information in knowledge graphs, so a method is needed to augment language models with external knowledge. Method: BALI augments a language model by simultaneously training a dedicated KG encoder and aligning its representations with those of the language model. For a given text, biomedical concept mentions are linked to the Unified Medical Language System (UMLS) KG, and KG subgraphs serve as cross-modal positive samples. Result: Applying BALI to leading biomedical language models such as PubMedBERT and BioLinkBERT significantly improves both language understanding tasks and entity representation quality, even with a small pre-training dataset. Conclusion: BALI effectively improves biomedical language models, indicating that joint pre-training with knowledge graphs is an effective way to strengthen model understanding. Abstract: In recent years, there has been substantial progress in using pretrained Language Models (LMs) on a range of tasks aimed at improving the understanding of biomedical texts. Nonetheless, existing biomedical LLMs show limited comprehension of complex, domain-specific concept structures and the factual information encoded in biomedical Knowledge Graphs (KGs). In this work, we propose BALI (Biomedical Knowledge Graph and Language Model Alignment), a novel joint LM and KG pre-training method that augments an LM with external knowledge by the simultaneous learning of a dedicated KG encoder and aligning the representations of both the LM and the graph. For a given textual sequence, we link biomedical concept mentions to the Unified Medical Language System (UMLS) KG and utilize local KG subgraphs as cross-modal positive samples for these mentions. Our empirical findings indicate that implementing our method on several leading biomedical LMs, such as PubMedBERT and BioLinkBERT, improves their performance on a range of language understanding tasks and the quality of entity representations, even with minimal pre-training on a small alignment dataset sourced from PubMed scientific abstracts.

[25] MaLei at MultiClinSUM: Summarisation of Clinical Documents using Perspective-Aware Iterative Self-Prompting with LLMs

Libo Ren,Yee Man Ng,Lifeng Han

Main category: cs.CL

TL;DR: This paper introduces a methodology using Iterative Self-Prompting on large language models to summarize clinical case documents, aiming to improve communication between patients and clinicians.

Details

Motivation: Efficient communication between patients and clinicians is crucial for shared decision-making, but clinical reports are often lengthy and filled with jargon, making it difficult for domain experts to identify important aspects efficiently. Method: The paper uses an Iterative Self-Prompting (ISP) technique on large language models (LLMs), refining prompts via example-based few-shot learning. Model fine-tuning is guided using lexical and embedding space metrics, specifically ROUGE and BERT-score, with epochs. Result: The submission using perspective-aware ISP on GPT-4 and GPT-4o achieved ROUGE scores (46.53, 24.68, 30.77) and BERTscores (87.84, 83.25, 85.46) for (P, R, F1) from the official evaluation on 3,396 clinical case reports. The high BERTscore indicates semantically equivalent output summaries compared to references. Conclusion: This paper concludes that perspective-aware ISP (PA-ISP) can be effectively deployed for clinical report summarization, enhancing communication between patients and clinicians. Abstract: Efficient communication between patients and clinicians plays an important role in shared decision-making. However, clinical reports are often lengthy and filled with clinical jargon, making it difficult for domain experts to identify important aspects in the document efficiently. This paper presents the methodology we applied in the MultiClinSUM shared task for summarising clinical case documents. We used an Iterative Self-Prompting technique on large language models (LLMs) by asking LLMs to generate task-specific prompts and refine them via example-based few-shot learning. Furthermore, we used lexical and embedding space metrics, ROUGE and BERT-score, to guide the model fine-tuning with epochs. 
Our submission using perspective-aware ISP on GPT-4 and GPT-4o achieved ROUGE scores (46.53, 24.68, 30.77) and BERTscores (87.84, 83.25, 85.46) for (P, R, F1) from the official evaluation on 3,396 clinical case reports from various specialties extracted from open journals. The high BERTscore indicates that the model produced semantically equivalent output summaries compared to the references, even though the overlap at the exact lexicon level is lower, as reflected in the lower ROUGE scores. This work sheds some light on how perspective-aware ISP (PA-ISP) can be deployed for clinical report summarisation and support better communication between patients and clinicians.

Xixi Wu,Yanchao Tan,Nan Hou,Ruiyang Zhang,Hong Cheng

Main category: cs.CL

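TL;DR: MoLoRAG is a retrieval framework that combines semantic and logical relevance to improve the accuracy of multi-modal, multi-page document understanding.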

Details

Motivation: Traditional methods discard critical multi-modal information when processing documents, while existing retrieval-augmented generation (RAG) methods rely solely on semantic relevance and ignore logical connections between pages and the query. Method: MoLoRAG constructs a page graph that captures contextual relationships between pages and uses a lightweight vision-language model (VLM) to traverse the graph and retrieve relevant pages. Result: Experiments on four DocQA datasets show average improvements of 9.68% in accuracy over LVLM direct inference and 7.44% in retrieval precision over baselines. Conclusion: By combining semantic and logical relevance, MoLoRAG improves multi-modal, multi-page document understanding and shows significant gains over existing methods on four DocQA datasets. Abstract: Document Understanding is a foundational AI capability with broad applications, and Document Question Answering (DocQA) is a key evaluation task. Traditional methods convert the document into text for processing by Large Language Models (LLMs), but this process strips away critical multi-modal information like figures. While Large Vision-Language Models (LVLMs) address this limitation, their constrained input size makes multi-page document comprehension infeasible. Retrieval-augmented generation (RAG) methods mitigate this by selecting relevant pages, but they rely solely on semantic relevance, ignoring logical connections between pages and the query, which is essential for reasoning. To this end, we propose MoLoRAG, a logic-aware retrieval framework for multi-modal, multi-page document understanding. By constructing a page graph that captures contextual relationships between pages, a lightweight VLM performs graph traversal to retrieve relevant pages, including those with logical connections often overlooked. This approach combines semantic and logical relevance to deliver more accurate retrieval. After retrieval, the top-$K$ pages are fed into arbitrary LVLMs for question answering. To enhance flexibility, MoLoRAG offers two variants: a training-free solution for easy deployment and a fine-tuned version to improve logical relevance checking. Experiments on four DocQA datasets demonstrate average improvements of 9.68% in accuracy over LVLM direct inference and 7.44% in retrieval precision over baselines. Codes and datasets are released at https://github.com/WxxShirley/MoLoRAG.
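
The retrieval idea described above (seed with semantically similar pages, walk a page graph, then rank by a mix of semantic and logical relevance) can be sketched roughly as follows. This is an illustrative toy, not the released MoLoRAG code: the function `retrieve_pages`, the equal-weight score combination, the `seeds` and `hops` parameters, and the lookup table standing in for the VLM's logical-relevance judgment are all assumptions.

```python
def retrieve_pages(graph, semantic, logical_score, k=3, seeds=2, hops=1):
    """Seed with the top semantic pages, expand along graph edges, and
    return the k pages with the highest combined score."""
    frontier = sorted(semantic, key=semantic.get, reverse=True)[:seeds]
    visited = set(frontier)
    for _ in range(hops):
        next_frontier = []
        for page in frontier:
            for neighbor in graph.get(page, ()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    # Assumed combination rule: a plain average of the two relevance scores.
    combined = {p: 0.5 * semantic[p] + 0.5 * logical_score(p) for p in visited}
    return sorted(combined, key=combined.get, reverse=True)[:k]


# Hypothetical 5-page document: page 2 looks semantically weak
# but is logically tied to page 1 through the graph.
graph = {1: [2], 2: [1, 5], 3: [4], 4: [3], 5: [2]}
semantic = {1: 0.9, 2: 0.2, 3: 0.6, 4: 0.1, 5: 0.3}
logical = {1: 0.8, 2: 0.9, 3: 0.2, 4: 0.1, 5: 0.4}  # stand-in for a VLM judgment
print(retrieve_pages(graph, semantic, logical.get))
```

In this toy setup a purely semantic top-3 would return pages 1, 3, and 5, whereas the graph walk surfaces page 2, which is the kind of logically connected page the entry says semantic-only retrieval tends to miss.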

[27] M-BRe: Discovering Training Samples for Relation Extraction from Unlabeled Texts with Large Language Models

Zexuan Li,Hongliang Dai,Piji Li

Main category: cs.CL

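TL;DR: This paper proposes M-BRe, a framework that automatically extracts training samples for relation extraction (RE) from unlabeled texts, addressing the insufficient semantic capture and the excessive computational overhead that arise when large language models (LLMs) are used to classify predefined relations.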

Details

Motivation: Manually annotating training data for relation extraction is expensive because sentences containing the target relations are rare and hard to find. An efficient method is needed to extract training instances automatically from unlabeled texts. Method: The authors propose the M-BRe framework, which combines the advantages of multi-class and binary classification through three modules: Relation Grouping, Relation Extraction, and Label Decision. Result: Experiments show that M-BRe is highly capable of discovering high-quality training samples from unlabeled texts. Conclusion: M-BRe effectively addresses both the semantic-capture problem of LLMs in multi-class relation extraction and the computational inefficiency of binary classification, offering a new approach to automatically extracting RE training samples. Abstract: For Relation Extraction (RE), the manual annotation of training data may be prohibitively expensive, since the sentences that contain the target relations in texts can be very scarce and difficult to find. It is therefore beneficial to develop an efficient method that can automatically extract training instances from unlabeled texts for training RE models. Recently, large language models (LLMs) have been adopted in various natural language processing tasks, with RE also benefiting from their advances. However, when leveraging LLMs for RE with predefined relation categories, two key challenges arise. First, in a multi-class classification setting, LLMs often struggle to comprehensively capture the semantics of every relation, leading to suboptimal results. Second, although employing binary classification for each relation individually can mitigate this issue, it introduces significant computational overhead, resulting in impractical time complexity for real-world applications. Therefore, this paper proposes a framework called M-BRe to extract training instances from unlabeled texts for RE. It utilizes three modules to combine the advantages of both of the above classification approaches: Relation Grouping, Relation Extraction, and Label Decision. Extensive experiments confirm its superior capability in discovering high-quality training samples from unlabeled texts for RE.

[28] Factuality Beyond Coherence: Evaluating LLM Watermarking Methods for Medical Texts

Rochana Prih Hastuti,Rian Adam Rajagede,Mansour Al Ghanim,Mengxin Zheng,Qian Lou

Main category: cs.CL

TL;DR: This paper evaluates the impact of watermarking on medical language models, showing that current methods can harm factual accuracy and proposing a new evaluation framework to guide future watermarking approaches in medical domains.

Details

Motivation: As large language models are increasingly used in sensitive domains like medicine, safety risks regarding provenance and accountability arise. Watermarking is a strategy to mitigate these risks, but its reliability in medical contexts has not been tested, particularly concerning factual accuracy in low-entropy settings. Method: The authors proposed a medical-focused evaluation workflow that combines factual accuracy and coherence assessment. They introduced the Factuality-Weighted Score (FWS), using GPT-Judger and human validation, to evaluate watermarking techniques in medical contexts. Result: The evaluation revealed that current watermarking methods significantly compromise medical factuality, with entropy shifts negatively impacting the representation of medical entities. Conclusion: The study concludes that current watermarking techniques significantly affect medical factuality and highlights the necessity for domain-aware watermarking approaches that maintain the integrity of medical content. Abstract: As large language models (LLMs) are adapted to sensitive domains such as medicine, their fluency raises safety risks, particularly regarding provenance and accountability. Watermarking embeds detectable patterns to mitigate these risks, yet its reliability in medical contexts remains untested. Existing benchmarks focus on detection-quality tradeoffs, overlooking factual risks under low-entropy settings often exploited by watermarking's reweighting strategy. We propose a medical-focused evaluation workflow that jointly assesses factual accuracy and coherence. Using GPT-Judger and further human validation, we introduce the Factuality-Weighted Score (FWS), a composite metric prioritizing factual accuracy beyond coherence to guide watermarking deployment in medical domains. Our evaluation shows current watermarking methods substantially compromise medical factuality, with entropy shifts degrading medical entity representation.
These findings underscore the need for domain-aware watermarking approaches that preserve the integrity of medical content.

[29] Are LLMs Enough for Hyperpartisan, Fake, Polarized and Harmful Content Detection? Evaluating In-Context Learning vs. Fine-Tuning

Michele Joshua Maggini,Dhia Merzougui,Rabiraj Bandyopadhyay,Gaël Dias,Fabrice Maurel,Pablo Gamallo

Main category: cs.CL

TL;DR: This study finds that Fine-Tuning Large Language Models performs better than In-Context Learning for detecting fake news, harmful tweets, and political bias across multiple languages and datasets.

Details

Motivation: The motivation for this study is the growing concern around the spread of fake news, polarizing content, and harmful information on online platforms, and the lack of comprehensive benchmarking of Large Language Models in different settings and languages for content detection. Method: The researchers conducted experiments across 10 datasets and 5 languages (English, Spanish, Portuguese, Arabic, and Bulgarian), utilizing various strategies such as parameter-efficient Fine-Tuning and In-Context Learning (zero-shot prompts, codebooks, few-shot examples with random and diverse selection, and Chain-of-Thought). Result: The experiments showed that In-Context Learning strategies often underperform compared to Fine-Tuning, emphasizing the importance of adapting models to specific tasks through Fine-Tuning even when larger models are used in In-Context Learning setups. Conclusion: The study concludes that Fine-Tuning models generally outperform In-Context Learning strategies in detecting fake news, political bias, and harmful content, even when using smaller models compared to larger ones evaluated in In-Context Learning setups. Abstract: The spread of fake news, polarizing, politically biased, and harmful content on online platforms has been a serious concern. Although large language models have become a promising approach, no study has properly benchmarked their performance across different models, usage methods, and languages. This study presents a comprehensive overview of different Large Language Models adaptation paradigms for the detection of hyperpartisan and fake news, harmful tweets, and political bias. Our experiments spanned 10 datasets and 5 different languages (English, Spanish, Portuguese, Arabic and Bulgarian), covering both binary and multiclass classification scenarios. We tested different strategies ranging from parameter-efficient Fine-Tuning of language models to a variety of different In-Context Learning strategies and prompts.
These included zero-shot prompts, codebooks, few-shot (with both randomly-selected and diversely-selected examples using Determinantal Point Processes), and Chain-of-Thought. We discovered that In-Context Learning often underperforms when compared to Fine-Tuning a model. This main finding highlights the importance of Fine-Tuning even smaller models on task-specific settings even when compared to the largest models evaluated in an In-Context Learning setup - in our case LlaMA3.1-8b-Instruct, Mistral-Nemo-Instruct-2407 and Qwen2.5-7B-Instruct.

[30] SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP

Decheng Duan,Yingyi Zhang,Jitong Peng,Chengzhi Zhang

Main category: cs.CL

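TL;DR: This paper introduces SciNLP, a specialized benchmark dataset for full-text entity and relation extraction in the NLP domain, and reports performance gains on certain baseline models.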

Details

Motivation: Most existing datasets focus on specific publication sections because of domain complexity and the high cost of annotating scientific texts; this paper aims to address that limitation. Method: The authors build SciNLP, a specialized benchmark of 60 manually annotated full-text NLP publications, and run comparative experiments against existing datasets. Result: Existing models show varying extraction capabilities across academic texts of different lengths, and models trained on SciNLP were used to build a fine-grained knowledge graph for the NLP domain. Conclusion: SciNLP is the first dataset to provide full-text entity and relation annotations in the NLP domain, and it yields significant performance improvements on certain baseline models. Abstract: Structured information extraction from scientific literature is crucial for capturing core concepts and emerging trends in specialized fields. While existing datasets aid model development, most focus on specific publication sections due to domain complexity and the high cost of annotating scientific texts. To address this limitation, we introduce SciNLP - a specialized benchmark for full-text entity and relation extraction in the Natural Language Processing (NLP) domain. The dataset comprises 60 manually annotated full-text NLP publications, covering 7,072 entities and 1,826 relations. Compared to existing research, SciNLP is the first dataset providing full-text annotations of entities and their relationships in the NLP domain. To validate the effectiveness of SciNLP, we conducted comparative experiments with similar datasets and evaluated the performance of state-of-the-art supervised models on this dataset. Results reveal varying extraction capabilities of existing models across academic texts of different lengths. Cross-comparisons with existing datasets show that SciNLP achieves significant performance improvements on certain baseline models. Using models trained on SciNLP, we implemented automatic construction of a fine-grained knowledge graph for the NLP domain. Our KG has an average node degree of 3.2 per entity, indicating rich semantic topological information that enhances downstream applications. The dataset is publicly available at https://github.com/AKADDC/SciNLP.
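
The "average node degree" statistic quoted in the abstract can be illustrated with a minimal sketch. It assumes the knowledge graph is stored as (head, relation, tail) triples and that degree counts every edge endpoint; the abstract does not say how the authors define it (directed or undirected, with or without duplicate edges). The triples below are hypothetical examples, not data from SciNLP.

```python
from collections import defaultdict


def average_node_degree(triples):
    """Mean number of edge endpoints per entity in an undirected view of the graph."""
    degree = defaultdict(int)
    for head, _relation, tail in triples:
        # Each triple contributes one edge endpoint to both of its entities.
        degree[head] += 1
        degree[tail] += 1
    return sum(degree.values()) / len(degree) if degree else 0.0


# Hypothetical triples for illustration only.
toy_triples = [
    ("BERT", "used-for", "NER"),
    ("BERT", "evaluated-on", "CoNLL-2003"),
    ("CRF", "used-for", "NER"),
]
print(average_node_degree(toy_triples))
```

In this toy graph there are three edges over four entities, so the mean degree is 6 / 4.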

[31] Dual Knowledge-Enhanced Two-Stage Reasoner for Multimodal Dialog Systems

Xiaolin Chen,Xuemeng Song,Haokun Wen,Weili Guan,Xiangyu Zhao,Liqiang Nie

Main category: cs.CL

TL;DR: This paper proposes DK2R, a method that enhances textual response generation in multimodal dialog systems by leveraging both structured and unstructured knowledge with large language models (LLMs).

Details

Motivation: The motivation stems from the limitations in existing systems, including the neglect of unstructured review knowledge and underutilization of large language models (LLMs), which inspired the integration of dual knowledge types with LLMs. Method: The study proposes DK2R, a dual knowledge-enhanced two-stage reasoner that leverages large language models (LLMs) to evaluate and utilize structured attribute and unstructured review knowledge for improved textual response generation. Result: Extensive experiments on a public dataset demonstrate the superiority of DK2R in enhancing textual response generation, confirming the benefits of incorporating dual knowledge types and LLMs in multimodal task-oriented dialog systems. Conclusion: The study concludes that DK2R effectively utilizes dual knowledge types through a two-stage reasoning process, significantly enhancing response generation in multimodal task-oriented dialog systems. Abstract: Textual response generation is pivotal for multimodal task-oriented dialog systems, which aims to generate proper textual responses based on the multimodal context. While existing efforts have demonstrated remarkable progress, there still exist the following limitations: 1) neglect of unstructured review knowledge and 2) underutilization of large language models (LLMs). Inspired by this, we aim to fully utilize dual knowledge (i.e., structured attribute and unstructured review knowledge) with LLMs to promote textual response generation in multimodal task-oriented dialog systems. However, this task is non-trivial due to two key challenges: 1) dynamic knowledge type selection and 2) intention-response decoupling. To address these challenges, we propose a novel dual knowledge-enhanced two-stage reasoner by adapting LLMs for multimodal dialog systems (named DK2R).
To be specific, DK2R first extracts both structured attribute and unstructured review knowledge from external knowledge base given the dialog context. Thereafter, DK2R uses an LLM to evaluate each knowledge type's utility by analyzing LLM-generated provisional probe responses. Moreover, DK2R separately summarizes the intention-oriented key clues via dedicated reasoning, which are further used as auxiliary signals to enhance LLM-based textual response generation. Extensive experiments conducted on a public dataset verify the superiority of DK2R. We have released the codes and parameters.

[32] Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost

Mihai Nadas,Laura Diosan,Andreea Tomescu,Andrei Piscoran

Main category: cs.CL

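TL;DR: We present TF2, a unified framework for creating, fine-tuning, and evaluating datasets and models for English-Romanian literary translation. By generating high-quality Romanian references and fine-tuning a model in two stages, it reaches translation quality comparable to large proprietary models while remaining open and cost-effective.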

Details

Motivation: Literary translation has recently emerged as a distinct and complex task in machine translation research, but translation by small open models remains an open problem. Method: First, 15k high-quality Romanian references are generated from the TF1 pool; then a 12B-parameter open-weight model is fine-tuned in two stages: (i) instruction tuning to capture genre-specific narrative style, and (ii) adapter compression for efficient deployment. Result: The fine-tuned model achieves fluency and adequacy comparable to top-performing large proprietary models while being open, accessible, and significantly more cost-effective. Conclusion: TF2 provides an end-to-end, reproducible pipeline for research on cost-efficient translation, cross-lingual narrative generation, and the broad adoption of open models for culturally significant literary content in low-resource settings. Abstract: Literary translation has recently gained attention as a distinct and complex task in machine translation research. However, the translation by small open models remains an open problem. We contribute to this ongoing research by introducing TINYFABULIST TRANSLATION FRAMEWORK (TF2), a unified framework for dataset creation, fine tuning, and evaluation in English-Romanian literary translations, centred on the creation and open release of both a compact, fine tuned language model (TF2-12B) and large scale synthetic parallel datasets (DS-TF2-EN-RO-3M and DS-TF2-EN-RO-15K). Building on DS-TF1-EN-3M (TF1), the largest collection of synthetic English fables to date, we address the need for rich, high quality literary datasets in low resource languages such as Romanian. Our pipeline first generates 15k high quality Romanian references from the TF1 pool using a high performing LLM. We then apply a two stage fine tuning process to a 12B parameter open weight model: (i) instruction tuning to capture genre specific narrative style, and (ii) adapter compression for efficient deployment. Evaluation combines corpus level BLEU and a five dimension LLM based rubric (accuracy, fluency, coherence, style, cultural adaptation) to provide a nuanced assessment of translation quality. Results show that our fine tuned model achieves fluency and adequacy competitive with top performing large proprietary models, while being open, accessible, and significantly more cost effective. Alongside the fine tuned model and both datasets, we publicly release all scripts and evaluation prompts.
TF2 thus provides an end-to-end, reproducible pipeline for research on cost efficient translation, cross lingual narrative generation, and the broad adoption of open models for culturally significant literary content in low resource settings.

[33] Are Humans as Brittle as Large Language Models?

Jiahui Li,Sean Papay,Roman Klinger

Main category: cs.CL

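TL;DR: This paper examines the instability of large language model (LLM) outputs and compares it with human annotators' sensitivity to instruction changes. Both show heightened sensitivity to specific prompt modifications, but the distribution of human judgments is less affected by typographical errors and reversed label order.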

Details

Motivation: To investigate the causes of unstable LLM output, in particular whether prompt brittleness is unique to LLMs, and how it relates to variance in human annotation. Method: A systematic comparison of how prompt modifications affect LLMs and human annotators, focusing on sensitivity under different prompt perturbations. Result: Both LLMs and humans show increased sensitivity to certain prompt modifications, especially substitutions of the label set or label format, but humans are more stable than LLMs under typographical errors and reversed label order. Conclusion: Prompt brittleness is not unique to LLMs and may correctly reflect human annotation variance, but the output distribution of LLMs is more susceptible to certain perturbations. Abstract: The output of large language models (LLM) is unstable, due to both non-determinism of the decoding process as well as to prompt brittleness. While the intrinsic non-determinism of LLM generation may mimic existing uncertainty in human annotations through distributional shifts in outputs, it is largely assumed, yet unexplored, that the prompt brittleness effect is unique to LLMs. This raises the question: do human annotators show similar sensitivity to instruction changes? If so, should prompt brittleness in LLMs be considered problematic? One may alternatively hypothesize that prompt brittleness correctly reflects human annotation variances. To fill this research gap, we systematically compare the effects of prompt modifications on LLMs and identical instruction modifications for human annotators, focusing on the question of whether humans are similarly sensitive to prompt perturbations. To study this, we prompt both humans and LLMs for a set of text classification tasks conditioned on prompt variations. Our findings indicate that both humans and LLMs exhibit increased brittleness in response to specific types of prompt modifications, particularly those involving the substitution of alternative label sets or label formats. However, the distribution of human judgments is less affected by typographical errors and reversed label order than that of LLMs.

[34] From Detection to Mitigation: Addressing Gender Bias in Chinese Texts via Efficient Tuning and Voting-Based Rebalancing

Chengyan Wu,Yiqiang Cai,Yufei Cheng,Yun Xue

Main category: cs.CL

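TL;DR: This paper proposes a method for detecting and mitigating gender bias in Chinese text based on large language models and Low-Rank Adaptation (LoRA), combined with a multi-expert model ensemble and a multi-temperature sampling mechanism; it performed strongly in Shared Task 7 of NLPCC-2025.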

Details

Motivation: To promote fairness and controllability in natural language generation by automatically detecting, classifying, and mitigating sentence-level gender bias in Chinese. Method: The approach fine-tunes large language models with Low-Rank Adaptation (LoRA) for gender bias detection and mitigation. It builds a more balanced training set and introduces heterogeneous samples from multiple sources to improve generalization. For the detection and classification sub-tasks, a majority voting strategy integrates the outputs of multiple expert models; a multi-temperature sampling mechanism is designed to capture potential variations in bias expression styles. Result: Experiments show the method is effective for detecting, classifying, and mitigating gender bias; it achieved an average score of 47.90% in the shared task, ranking fourth. Conclusion: The paper proposes an LLM fine-tuning approach that adapts efficiently to gender bias detection via LoRA and improves performance through a multi-expert ensemble and multi-temperature sampling. The method achieved an average score of 47.90% in Shared Task 7 of NLPCC-2025, ranking fourth. Abstract: This paper presents our team's solution to Shared Task 7 of NLPCC-2025, which focuses on sentence-level gender bias detection and mitigation in Chinese. The task aims to promote fairness and controllability in natural language generation by automatically detecting, classifying, and mitigating gender bias. To address this challenge, we adopt a fine-tuning approach based on large language models (LLMs), efficiently adapting to the bias detection task via Low-Rank Adaptation (LoRA). In terms of data processing, we construct a more balanced training set to alleviate class imbalance and introduce heterogeneous samples from multiple sources to enhance model generalization. For the detection and classification sub-tasks, we employ a majority voting strategy that integrates outputs from multiple expert models to boost performance. Additionally, to improve bias generation detection and mitigation, we design a multi-temperature sampling mechanism to capture potential variations in bias expression styles. Experimental results demonstrate the effectiveness of our approach in bias detection, classification, and mitigation. Our method ultimately achieves an average score of 47.90%, ranking fourth in the shared task.
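
The majority voting step described in the abstract can be sketched as follows. This is a minimal illustration, not the team's implementation: the label names, the number of expert models, and the tie-breaking rule (the first-listed model wins a tie) are all assumptions.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label; ties go to the label that appears first."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


# Hypothetical outputs from three expert models for one sentence.
expert_outputs = ["biased", "unbiased", "biased"]
print(majority_vote(expert_outputs))
```

With an odd number of experts and two labels a tie cannot occur; the tie-breaking rule only matters for multi-class classification or an even-sized ensemble.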

[35] Biased Tales: Cultural and Topic Bias in Generating Children's Stories

Donya Rooein,Vilém Zouhar,Debora Nozza,Dirk Hovy

Main category: cs.CL

TL;DR: This study reveals gender and cultural biases in AI-generated children's stories, showing that female protagonists are more likely to be described based on appearance, and non-Western characters are more often associated with tradition and family themes.

Details

Motivation: With the increasing use of large language models (LLMs) by parents to create bedtime stories, there are growing concerns about the presence of cultural and gender stereotypes in these stories, especially their impact on children's beliefs and values. Method: The researchers created a dataset called Biased Tales to analyze how biases affect protagonists' attributes and story elements in AI-generated stories. They compared narratives with male and female protagonists and those with Western and non-Western cultural contexts. Result: Significant disparities were found in the portrayal of protagonists based on gender and culture. When the protagonist was described as a girl, appearance-related attributes increased by 55.26%. Non-Western characters were more likely to be associated with cultural heritage, tradition, and family themes compared to Western characters. Conclusion: The study highlights the existence of sociocultural biases in LLM-generated stories for children and emphasizes the need for more equitable and diverse AI-generated narratives. Abstract: Stories play a pivotal role in human communication, shaping beliefs and morals, particularly in children. As parents increasingly rely on large language models (LLMs) to craft bedtime stories, the presence of cultural and gender stereotypes in these narratives raises significant concerns. To address this issue, we present Biased Tales, a comprehensive dataset designed to analyze how biases influence protagonists' attributes and story elements in LLM-generated stories. Our analysis uncovers striking disparities. When the protagonist is described as a girl (as compared to a boy), appearance-related attributes increase by 55.26%. Stories featuring non-Western children disproportionately emphasize cultural heritage, tradition, and family themes far more than those for Western children. Our findings highlight the role of sociocultural bias in making creative AI use more equitable and diverse.

[36] GENUINE: Graph Enhanced Multi-level Uncertainty Estimation for Large Language Models

Tuo Wang,Adithya Kulkarni,Tyler Cody,Peter A. Beling,Yujun Yan,Dawei Zhou

Main category: cs.CL

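TL;DR: This paper proposes GENUINE, a structure-aware uncertainty estimation framework that substantially improves uncertainty estimation for large language models.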

Details

Motivation: Existing methods often overlook semantic dependencies and rely only on token-level probability measures, which fail to capture structural relationships in the generated text. Method: Proposes GENUINE, a structure-aware framework that uses supervised learning to model semantic and structural relationships. Result: Across multiple NLP tasks, GENUINE achieves up to 29% higher AUROC than semantic entropy-based approaches and reduces calibration error by over 15%. Conclusion: By combining dependency parse trees with hierarchical graph pooling, GENUINE improves uncertainty estimation for large language models and outperforms existing methods. Abstract: Uncertainty estimation is essential for enhancing the reliability of Large Language Models (LLMs), particularly in high-stakes applications. Existing methods often overlook semantic dependencies, relying on token-level probability measures that fail to capture structural relationships within the generated text. We propose GENUINE: Graph ENhanced mUlti-level uncertaINty Estimation for Large Language Models, a structure-aware framework that leverages dependency parse trees and hierarchical graph pooling to refine uncertainty quantification. By incorporating supervised learning, GENUINE effectively models semantic and structural relationships, improving confidence assessments. Extensive experiments across NLP tasks show that GENUINE achieves up to 29% higher AUROC than semantic entropy-based approaches and reduces calibration errors by over 15%, demonstrating the effectiveness of graph-based uncertainty modeling. The code is available at https://github.com/ODYSSEYWT/GUQ.

[37] SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge

Lukas Haas,Gal Yona,Giovanni D'Antonio,Sasha Goldshtein,Dipanjan Das

Main category: cs.CL

TL;DR: SimpleQA Verified is a refined benchmark for assessing LLM factuality, revealing Gemini 2.5 Pro's leading performance and offering tools for tracking progress in reducing hallucinations.

Details

Motivation: To address limitations in OpenAI's SimpleQA benchmark, such as noisy labels, topical biases, and redundancy, for more accurate evaluation of LLM factuality. Method: Development of SimpleQA Verified through de-duplication, topic balancing, and source reconciliation; improvement of the autorater prompt. Result: Gemini 2.5 Pro achieved a state-of-the-art F1-score of 55.6 on SimpleQA Verified, outperforming other models including GPT-5. Conclusion: SimpleQA Verified offers a more reliable benchmark for evaluating LLM factuality, highlighting Gemini 2.5 Pro's superior performance and providing tools for tracking progress in reducing hallucinations. Abstract: We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI's SimpleQA. It addresses critical limitations in OpenAI's benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: https://www.kaggle.com/benchmarks/deepmind/simpleqa-verified.
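
The abstract reports an F1-score without defining it. The sketch below assumes the metric follows the original SimpleQA convention, where responses are graded as correct, incorrect, or not attempted, and the score is the harmonic mean of overall accuracy and accuracy on attempted questions. That assumption, the function name, and the counts in the example are illustrative and not taken from the paper.

```python
def simpleqa_style_f1(correct, incorrect, not_attempted):
    """Harmonic mean of overall accuracy and accuracy on attempted questions."""
    total = correct + incorrect + not_attempted
    attempted = correct + incorrect
    if total == 0 or attempted == 0 or correct == 0:
        return 0.0
    overall_correct = correct / total
    correct_given_attempted = correct / attempted
    return (2 * overall_correct * correct_given_attempted
            / (overall_correct + correct_given_attempted))


# Hypothetical grading counts for a 1,000-prompt run.
print(round(100 * simpleqa_style_f1(500, 250, 250), 1))
```

Under this definition a model that declines to answer when unsure loses overall accuracy but gains accuracy on attempted questions, so the score rewards calibrated abstention more than plain accuracy does.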

[38] Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

Tong Zheng,Hongming Zhang,Wenhao Yu,Xiaoyang Wang,Xinyu Yang,Runpeng Dai,Rui Liu,Huiwen Bao,Chengsong Huang,Heng Huang,Dong Yu

Main category: cs.CL

TL;DR: This paper introduces Parallel-R1, a new reinforcement learning framework that trains large language models to use parallel thinking, significantly improving their performance on complex reasoning tasks like math problems.

Details

Motivation: The motivation is to enhance the reasoning capabilities of large language models by enabling parallel thinking, which current methods fail to achieve due to their reliance on supervised fine-tuning that promotes imitation over exploration. Method: The study introduces Parallel-R1, a reinforcement learning framework that uses a progressive curriculum to train large language models for parallel thinking. It starts with supervised fine-tuning on easier tasks and transitions to reinforcement learning on more complex tasks. Result: Experiments showed an 8.4% improvement in accuracy over sequential thinking models, with a 42.9% improvement on the AIME25 benchmark after the mid-training exploration phase. Conclusion: Parallel-R1 successfully instills parallel thinking in large language models using a combination of supervised fine-tuning and reinforcement learning, leading to significant accuracy improvements on complex reasoning tasks. Abstract: Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. Different from them, we propose Parallel-R1, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems.
Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model's thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a mid-training exploration scaffold, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-source at https://github.com/zhengkid/Parallel-R1.

cs.CV [Back]

[39] CellPainTR: Generalizable Representation Learning for Cross-Dataset Cell Painting Analysis

Cedric Caruzzo,Jong Chul Ye

Main category: cs.CV

TL;DR: CellPainTR, a Transformer-based architecture, addresses batch effects and improves out-of-distribution generalization for large-scale biological data analysis, enabling more reliable cross-study biological analysis.

Details

Motivation: The need to integrate large, heterogeneous biological datasets is hindered by technical batch effects and a lack of generalizable models. Current methods require retraining on new data, which limits their scalability and effectiveness. Method: CellPainTR employs a Transformer-based architecture with source-specific context tokens, enabling robustness to batch effects and eliminating the need for fine-tuning when applied to unseen datasets. It is validated on the JUMP dataset and tested for out-of-distribution generalization on the Bray et al. dataset. Result: CellPainTR outperforms established methods like ComBat and Harmony in batch integration and biological signal preservation. It demonstrates robust performance in a challenging OOD task on the unseen Bray et al. dataset despite domain and feature shifts. Conclusion: CellPainTR represents a significant advancement toward foundational models for image-based biological profiling, enabling scalable and reliable cross-study analysis without the need for retraining. Abstract: Large-scale biological discovery requires integrating massive, heterogeneous datasets like those from the JUMP Cell Painting consortium, but technical batch effects and a lack of generalizable models remain critical roadblocks. To address this, we introduce CellPainTR, a Transformer-based architecture designed to learn foundational representations of cellular morphology that are robust to batch effects. Unlike traditional methods that require retraining on new data, CellPainTR's design, featuring source-specific context tokens, allows for effective out-of-distribution (OOD) generalization to entirely unseen datasets without fine-tuning. We validate CellPainTR on the large-scale JUMP dataset, where it outperforms established methods like ComBat and Harmony in both batch integration and biological signal preservation. Critically, we demonstrate its robustness through a challenging OOD task on the unseen Bray et al. dataset, where it maintains high performance despite significant domain and feature shifts. Our work represents a significant step towards creating truly foundational models for image-based profiling, enabling more reliable and scalable cross-study biological analysis.

[40] FusWay: Multimodal hybrid fusion approach. Application to Railway Defect Detection

Alexey Zhukov,Jenny Benois-Pineau,Amira Youssef,Akka Zemmari,Mohamed Mosbah,Virginie Taillandier

Main category: cs.CV

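TL;DR: Proposes a multimodal fusion architecture based on domain rules that combines YOLO and a Vision Transformer for railway defect detection, improving detection precision by fusing audio and image information.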

Details

Motivation: Traditional single-modality methods over-detect when identifying rail structure elements or defects, so a more effective multimodal fusion approach is needed. Method: The method fuses YOLOv8n and a Vision Transformer (ViT), combining feature maps from multiple layers with synthesised audio representations, and is evaluated experimentally on a real-world railway dataset. Result: Compared with the vision-only approach, multimodal fusion improves precision and overall accuracy by 0.2 points, and a Student's t-test confirms that the difference is statistically significant. Conclusion: The proposed multimodal fusion architecture improves the precision and accuracy of railway defect detection. Abstract: Multimodal fusion is a multimedia technique that has become popular in a wide range of tasks where image information is accompanied by a signal/audio. The latter may not convey highly semantic information, such as speech or music, but rather measurements such as audio signals recorded by microphones in order to detect rail structure elements or defects. While classical detection approaches such as the You Only Look Once (YOLO) family of detectors can be efficiently deployed for defect detection on the image modality, single-modality approaches remain limited. They yield overdetection in cases where the appearance is similar to that of normal structural elements. The paper proposes a new multimodal fusion architecture built on the basis of domain rules with YOLO and Vision Transformer backbones. It integrates YOLOv8n for rapid object detection with a Vision Transformer (ViT) to combine feature maps extracted from multiple layers (7, 16, and 19) and synthesised audio representations for two defect classes: rail Rupture and Surface defect. Fusion is performed between audio and image. Experimental evaluation on a real-world railway dataset demonstrates that our multimodal fusion improves precision and overall accuracy by 0.2 points compared to the vision-only approach. Student's unpaired t-test also confirms statistical significance of differences in the mean accuracy.

[41] Frustratingly Easy Feature Reconstruction for Out-of-Distribution Detection

Yingsheng Wang,Shuo Lu,Jian Liang,Aihua Zheng,Ran He

Main category: cs.CV

TL;DR: This paper introduces ClaFR, a post-hoc OOD detection method that does not require training data, achieving strong performance by leveraging subspace projection and reconstruction error.

Details

Motivation: Existing feature-based post-hoc OOD detection methods often require access to training data, which can be unsuitable for privacy-sensitive scenarios. Method: ClaFR performs orthogonal decomposition of the classifier's weights to extract a class-known subspace and maps features into this subspace for OOD score calculation based on reconstruction error. Result: ClaFR outperforms existing OOD detection methods in terms of performance while not requiring training data access. Conclusion: The proposed ClaFR method effectively performs OOD detection without requiring training data access, achieving leading performance on multiple benchmarks. Abstract: Out-of-distribution (OOD) detection helps models identify data outside the training categories, crucial for security applications. While feature-based post-hoc methods address this by evaluating data differences in the feature space without changing network parameters, they often require access to training data, which may not be suitable in scenarios where data privacy protection is a concern. In this paper, we propose a simple yet effective post-hoc method, termed Classifier-based Feature Reconstruction (ClaFR), from the perspective of subspace projection. It first performs an orthogonal decomposition of the classifier's weights to extract the class-known subspace, then maps the original data features into this subspace to obtain new data representations. Subsequently, the OOD score is determined by calculating the feature reconstruction error of the data within the subspace. Compared to existing OOD detection algorithms, our method does not require access to training data while achieving leading performance on multiple OOD benchmarks. Our code is released at https://github.com/Aie0923/ClaFR.
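
The subspace-projection idea in the abstract can be sketched with the standard library alone. This is an illustration under assumptions, not the released ClaFR code: it uses Gram-Schmidt as the orthogonal decomposition, treats the row space of the classifier weight matrix as the class-known subspace, and scores a feature by the norm of its projection residual. The paper's exact decomposition and scoring rule may differ, and the weights and features below are hypothetical.

```python
import math


def orthonormal_basis(weight_rows, tol=1e-10):
    """Modified Gram-Schmidt over the classifier weight rows (one row per class)."""
    basis = []
    for row in weight_rows:
        v = list(row)
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > tol:  # skip rows that are linearly dependent on earlier ones
            basis.append([x / norm for x in v])
    return basis


def reconstruction_error(feature, basis):
    """Distance between a feature and its projection onto span(basis)."""
    recon = [0.0] * len(feature)
    for b in basis:
        coeff = sum(x * y for x, y in zip(feature, b))
        recon = [r + coeff * y for r, y in zip(recon, b)]
    return math.sqrt(sum((x - r) ** 2 for x, r in zip(feature, recon)))


# Hypothetical 2-class classifier over 3-dimensional features.
W = [[1.0, 1.0, 0.0], [1.0, -1.0, 0.0]]
basis = orthonormal_basis(W)
print(reconstruction_error([3.0, 4.0, 0.0], basis))  # lies inside the subspace
print(reconstruction_error([1.0, 2.0, 3.0], basis))  # has a component outside it
```

Only the classifier weights are needed to build the basis, which matches the abstract's point that no training data is required. Whether a larger residual is read directly as the OOD score is an assumption of this sketch.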

[42] DIET-CP: Lightweight and Data Efficient Self Supervised Continued Pretraining

Bryan Rodas,Natalie Montesino,Jakob Ambsdorf,David Klindt,Randall Balestriero

Main category: cs.CV

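TL;DR: DIET-CP is a continued pretraining method that introduces no more hyperparameters than supervised finetuning and adapts foundation models effectively with limited data, improving performance.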

Details

Motivation: In specialized domains, datasets are often small, which limits the applicability of existing self-supervised learning methods. In addition, pretrained models are usually released as backbone weights only, lacking information needed to continue pretraining. Method: DIET-CP uses a simple label-free objective that introduces no more hyperparameters than supervised finetuning, and can steer a strong foundation model toward a new data distribution. Result: DIET-CP is stable across data modalities and backbone choices, and gives a significant performance boost to state-of-the-art models such as DINOv3 using only 1000 images. Conclusion: DIET-CP is a simple and effective continued pretraining strategy for adapting foundation models to specific domains, especially when data is limited. Abstract: Continued pretraining offers a promising solution for adapting foundation models to a new target domain. However, in specialized domains, available datasets are often very small, limiting the applicability of SSL methods developed for large-scale pretraining and making hyperparameter search infeasible. In addition, pretrained models are usually released as backbone-weights only, lacking important information to continue pretraining. We propose to bridge this gap with DIET-CP, a simple continued pretraining strategy, where any strong foundation model can be steered towards the new data distribution of interest. DIET-CP relies on a very simple objective, requires no labels, and introduces no more hyperparameters than supervised finetuning. It is stable across data modalities and backbone choices, while providing a significant performance boost for state-of-the-art models such as DINOv3 using only 1000 images.

[43] FedAPT: Federated Adversarial Prompt Tuning for Vision-Language Models

Kun Zhai,Siheng Chen,Xingjun Ma,Yu-Gang Jiang

Main category: cs.CV

TL;DR: FedAPT enhances robustness against adversarial attacks in federated learning for Vision-Language Models by addressing class information gaps and improving prompt alignment.

Details

Motivation: Federated Prompt Tuning (FPT) methods are vulnerable to adversarial attacks, and the class information gap in non-IID settings affects robustness. Method: Proposed a Class-Aware Prompt Generator and Cross-Layer Generator Sharing strategy to address the class information gap in non-IID settings. Result: Experiments showed FedAPT outperforms existing methods in adversarial robustness and generalizes well in real-world applications. Conclusion: FedAPT improves adversarial robustness and generalization in cross-domain and cross-dataset scenarios compared to existing methods. Abstract: Federated Prompt Tuning (FPT) is an efficient method for cross-client collaborative fine-tuning of large Vision-Language Models (VLMs). However, models tuned using FPT are vulnerable to adversarial attacks, leading to misclassification in downstream tasks. In this work, we introduce Federated Adversarial Prompt Tuning (FedAPT), a novel method designed to enhance the adversarial robustness of FPT. We identify a key issue in FedAPT under non-independent and identically distributed (non-IID) settings: a class information gap between clients and the global model. Clients rely solely on limited local label information to generate adversarial samples for training, while the global model must defend against adversarial attacks from global labels. To address this issue, we propose a class-aware prompt generator that generates visual prompts from text prompts. This generator is guided by a Global Label Embedding (serving as a "beacon") which encodes cross-client label information to create more globally-aligned visual prompts. Additionally, we propose a cross-layer generator sharing strategy to enhance prompt coupling across different layers of the model, further boosting adversarial robustness.
Extensive experiments on multiple image classification datasets demonstrate the superiority of FedAPT in improving adversarial robustness, outperforming existing methods by a large margin. FedAPT also exhibits exceptional generalization in cross-domain and cross-dataset scenarios, indicating its effectiveness in real-world applications.

[44] Geospatial Foundational Embedder: Top-1 Winning Solution on EarthVision Embed2Scale Challenge (CVPR 2025)

Zirui Xu,Raphael Tang,Mike Bianco,Qi Zhang,Rishi Madhok,Nikolaos Karianakis,Fuxun Yu

Main category: cs.CV

TL;DR: This paper presents the winning solution for the EarthVision Embed2Scale Challenge, focusing on embedding hyperspectral geospatial data for various tasks.

Details

Motivation: The motivation is to develop foundational geospatial models that facilitate various downstream tasks such as classification and regression. Method: The report details the proposed method for embedding SSL4EO-S12 hyperspectral geospatial data cubes into vectors. Result: The method achieved the Top-1 position in the EarthVision Embed2Scale Challenge at CVPR 2025. Conclusion: The technical report presents the Top-1 winning solution for the EarthVision Embed2Scale challenge, which focuses on embedding hyperspectral geospatial data for downstream tasks. Abstract: The EarthVision Embed2Scale challenge (CVPR 2025) aims to develop foundational geospatial models to embed SSL4EO-S12 hyperspectral geospatial data cubes into embedding vectors that facilitate various downstream tasks, e.g., classification and regression. In this technical report, we introduce our proposed method for the Top-1 winning solution on the Embed2Scale Challenge.

[45] VLMs-in-the-Wild: Bridging the Gap Between Academic Benchmarks and Enterprise Reality

Srihari Bandraupalli,Anupam Purwar

Main category: cs.CV

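TL;DR: This paper proposes the ViLD framework for evaluating vision-language models more accurately in enterprise applications, and introduces a new evaluation algorithm and dataset.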

Details

Motivation: Current benchmarks rely too heavily on multiple-choice questions and synthetic data, and fail to reflect the complexity of enterprise applications such as social media content analysis. Method: The paper defines ten business-critical tasks, introduces the BlockWeaver algorithm for comparing OCR outputs from VLMs, and constructs a new benchmark dataset of 7,500 real-world samples. Result: Using the ViLD framework, the paper demonstrates effective evaluation on business-critical tasks and benchmarks leading models such as Qwen, MIMO, and InternVL in an industry setting. Conclusion: ViLD provides a practical framework for evaluating open-source vision-language models for enterprise deployment, bridging the gap between academic evaluation and real business requirements. Abstract: Open-source Vision-Language Models show immense promise for enterprise applications, yet a critical disconnect exists between academic evaluation and enterprise deployment requirements. Current benchmarks rely heavily on multiple-choice questions and synthetic data, failing to capture the complexity of real-world business applications like social media content analysis. This paper introduces VLM-in-the-Wild (ViLD), a comprehensive framework to bridge this gap by evaluating VLMs on operational enterprise requirements. We define ten business-critical tasks: logo detection, OCR, object detection, human presence and demographic analysis, human activity and appearance analysis, scene detection, camera perspective and media quality assessment, dominant colors, comprehensive description, and NSFW detection. To this framework, we bring an innovative BlockWeaver Algorithm that solves the challenging problem of comparing unordered, variably-grouped OCR outputs from VLMs without relying on embeddings or LLMs, achieving remarkable speed and reliability. To demonstrate the efficacy of ViLD, we constructed a new benchmark dataset of 7,500 diverse samples, carefully stratified from a corpus of one million real-world images and videos. ViLD provides actionable insights by combining semantic matching (both embedding-based and LLM-as-a-judge approaches), traditional metrics, and novel methods to measure the completeness and faithfulness of descriptive outputs. 
By benchmarking leading open-source VLMs (Qwen, MIMO, and InternVL) against a powerful proprietary baseline under the ViLD framework, we provide one of the first industry-grounded, task-driven assessments of VLM capabilities, offering actionable insights for their deployment in enterprise environments.

[46] The Protocol Genome A Self Supervised Learning Framework from DICOM Headers

Jimmy Joseph

Main category: cs.CV

TL;DR: The Protocol Genome is a self-supervised learning method that uses DICOM headers to improve calibration and robustness across clinical imaging tasks, achieving higher AUROC and reducing false positives.

Details

Motivation: Latent confounders such as scanner make/model, sequence, and other parameters impede the generalization of image-only networks across clinical sites, motivating the need for a more robust and protocol-aware approach. Method: Protocol Genome uses structured DICOM headers as labels and employs tokenized embeddings of de-identified header fields alongside image features using protocol-image contrastive learning, masked protocol prediction, and protocol-protocol translation. Result: Protocol Genome achieves an AUROC of 0.901 (vs 0.847 baseline) and ECE of 0.036 (vs 0.058), with 25-37% calibration improvements and higher external AUROC across multiple clinical tasks. Conclusion: Protocol Genome is an effective self-supervised learning system that enhances external AUROC, calibration, and robustness across modalities and vendors in clinical imaging tasks. Abstract: In this paper, we introduce the Protocol Genome, a self-supervised learning system that learns correlations from DICOM headers and achieves AUROC 0.901 (vs 0.847 baseline) and ECE 0.036 (vs 0.058) on fully held-out external validation. Our method also improves calibration and robustness across modalities (CT, MRI, CXR) and vendors. Clinical imaging is funneled through PACS/DICOM, where procedure choices (scanner make/model, sequence, kernel, kVp, TR/TE, and slice thickness) have consequences for contrast, noise, and artifact. These latent confounders impede the generalization of image-only networks across sites. We consider structured DICOM headers as a label and learn protocol-aware but clinically robust image representations. Protocol Genome obtains tokenized embeddings of de-identified header fields and models them along with image features using: (1) protocol-image contrastive learning, (2) masked protocol prediction, and (3) protocol-protocol translation. 
With 1.26M studies (7 health systems, 31 scanners, 3 vendors; CT, MR, CR/DR), we experiment on: (A) chest CT triage for PE, (B) brain MRI glioma grading, and (C) chest radiograph cardiomegaly detection. Relative to strong SSL baselines (SimCLR, MAE) as well as ImageNet transfer, Protocol Genome (+0.046: PE, +0.058: glioma, +0.041: cardiomegaly) is associated with higher external AUROC; 25-37% calibration improvements are obtained (p < 0.01, DeLong tests). While the gains may be task-dependent, they are preserved with 10-20% of labeled data. From a clinical point of view, the technique reduces false positives at protocol borders and is applicable in a PACS (DICOM C-FIND/C-MOVE, DICOMweb QIDO/WADO). We publish a model card and deployment guide, complete with both de-identification and bias audits.
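The abstract reports ECE (expected calibration error) but does not say how it is binned. As a rough illustration of the metric only, the following Python sketch computes the common equal-width-bin ECE from predicted confidences and 0/1 correctness flags. The function name, the 10-bin default, and the toy inputs are assumptions made for illustration and are not taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sample-weighted mean of |accuracy - confidence| per bin.

    confidences: predicted probabilities in [0, 1] for the chosen class.
    correct:     1 if the prediction was right, else 0 (same length).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Toy example (hypothetical data): four predictions at 0.9 confidence, three correct.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

In the toy example the model claims 90% confidence but is right 75% of the time, so the gap in that single bin is 0.15. A lower ECE, such as the reported 0.036 versus 0.058, means confidence tracks accuracy more closely.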

Jie Zhang,Ting Xu,Gelei Deng,Runyi Hu,Han Qiu,Tianwei Zhang,Qing Guo,Ivor Tsang

Main category: cs.CV

TL;DR: This paper finds that vision language models (VLMs) struggle with recognizing fragmented or altered text, unlike humans, revealing a lack of compositional priors needed for robust literacy, and proposes benchmarks and methods to improve future model development.

Details

Motivation: The motivation behind this paper is to investigate whether advanced vision language models (VLMs) share the human ability to recognize text even when characters are fragmented, fused, or partially occluded, thereby identifying structural limitations in these models that affect their robustness in literacy tasks. Method: The paper constructs two psychophysics-inspired benchmarks across Chinese logographs and English alphabetic words by splicing, recombining, and overlaying glyphs to create 'visible but unreadable' stimuli for models while remaining legible to humans. This method tests the models' resilience and performance under visual perturbations. Result: Despite strong performance on clean text, contemporary VLMs show a significant drop in accuracy under text perturbations, often producing unrelated or incoherent outputs. This indicates a reliance on generic visual invariances rather than compositional priors necessary for robust literacy. Conclusion: This paper concludes that advanced vision language models (VLMs) lack the compositional priors needed for robust literacy, as they show a severe drop in performance when dealing with fragmented, fused, or partially occluded text, unlike humans who show resilience in recognizing such text. Abstract: Writing is a universal cultural technology that reuses vision for symbolic communication. Humans display striking resilience: we readily recognize words even when characters are fragmented, fused, or partially occluded. This paper investigates whether advanced vision language models (VLMs) share this resilience. We construct two psychophysics inspired benchmarks across distinct writing systems, Chinese logographs and English alphabetic words, by splicing, recombining, and overlaying glyphs to yield ''visible but unreadable'' stimuli for models while remaining legible to humans. 
Despite strong performance on clean text, contemporary VLMs show a severe drop under these perturbations, frequently producing unrelated or incoherent outputs. The pattern suggests a structural limitation: models heavily leverage generic visual invariances but under-rely on compositional priors needed for robust literacy. We release stimuli generation code, prompts, and evaluation protocols to facilitate transparent replication and follow-up work. Our findings motivate architectures and training strategies that encode symbol segmentation, composition, and binding across scripts, and they delineate concrete challenges for deploying multimodal systems in education, accessibility, cultural heritage, and security.

[48] K-Syn: K-space Data Synthesis in Ultra Low-data Regimes

Guan Yu,Zhang Jianhua,Liang Dong,Liu Qiegen

Main category: cs.CV

TL;DR: This paper proposes a novel method for synthesizing k-space data for dynamic cardiac MRI reconstruction by leveraging feature-level modeling in the frequency domain and temporal fusion strategies, showing promising results in low-data scenarios.

Details

Motivation: The scarcity of high-quality and diverse k-space data due to the dynamic and complex nature of cardiac MRI hampers robust reconstruction, necessitating a novel approach for data synthesis. Method: The method involves feature-level learning in the frequency domain using the global representation capacity of the Fourier transform and employs a temporal-fusion strategy to optimize generative trajectories by integrating k-space data across time frames. Result: The proposed method enables stable and rich k-space data generation even in ultra low-data regimes and demonstrates strong generative ability in experiments. Conclusion: The proposed method demonstrates strong generative ability in low-data regimes, offering practical potential to address data scarcity in dynamic MRI reconstruction through feature-level modeling and temporal fusion in the frequency domain. Abstract: Owing to the inherently dynamic and complex characteristics of cardiac magnetic resonance (CMR) imaging, high-quality and diverse k-space data are rarely available in practice, which in turn hampers robust reconstruction of dynamic cardiac MRI. To address this challenge, we perform feature-level learning directly in the frequency domain and employ a temporal-fusion strategy as the generative guidance to synthesize k-space data. Specifically, leveraging the global representation capacity of the Fourier transform, the frequency domain can be considered a natural global feature space. Therefore, unlike traditional methods that use pixel-level convolution for feature learning and modeling in the image domain, this letter focuses on feature-level modeling in the frequency domain, enabling stable and rich generation even with ultra low-data regimes. Moreover, leveraging the advantages of feature-level modeling in the frequency domain, we integrate k-space data across time frames with multiple fusion strategies to steer and further optimize the generative trajectory. 
Experimental results demonstrate that the proposed method possesses strong generative ability in low-data regimes, indicating practical potential to alleviate data scarcity in dynamic MRI reconstruction.

[49] Not All Splits Are Equal: Rethinking Attribute Generalization Across Unrelated Categories

Liviu Nicolae Fircă,Antonio Bărbălau,Dan Oneata,Elena Burceanu

Main category: cs.CV

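TL;DR: This work studies whether models can generalize attribute knowledge to conceptually different categories. It finds that performance drops sharply as the correlation between training and test categories decreases, with the clustering-based split giving the most effective trade-off.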

Details

Motivation: Existing attribute prediction research focuses mainly on narrow taxonomic or visually similar domains, and it remains unclear whether models can abstract attributes and apply them across different categories. Method: Introduces train-test split strategies that progressively reduce the correlation between training and test sets, including LLM-driven semantic grouping, embedding similarity thresholding, embedding-based clustering, and supercategory-based partitioning using ground-truth labels. Result: Model performance drops sharply as the correlation between training and test categories decreases, indicating strong sensitivity to split design. Clustering yields the most effective trade-off, reducing hidden correlations while preserving learnability. Conclusion: Current models are limited in abstracting attributes and applying them to conceptually distant categories, and split design has a significant effect on performance. Abstract: Can models generalize attribute knowledge across semantically and perceptually dissimilar categories? While prior work has addressed attribute prediction within narrow taxonomic or visually similar domains, it remains unclear whether current models can abstract attributes and apply them to conceptually distant categories. This work presents the first explicit evaluation for the robustness of the attribute prediction task under such conditions, testing whether models can correctly infer shared attributes between unrelated object types: e.g., identifying that the attribute "has four legs" is common to both "dogs" and "chairs". To enable this evaluation, we introduce train-test split strategies that progressively reduce correlation between training and test sets, based on: LLM-driven semantic grouping, embedding similarity thresholding, embedding-based clustering, and supercategory-based partitioning using ground-truth labels. Results show a sharp drop in performance as the correlation between training and test categories decreases, indicating strong sensitivity to split design. Among the evaluated methods, clustering yields the most effective trade-off, reducing hidden correlations while preserving learnability. These findings offer new insights into the limitations of current representations and inform future benchmark construction for attribute reasoning.
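One of the split strategies named in the abstract is embedding similarity thresholding. The abstract gives no procedure, so the Python sketch below shows only one plausible reading: drop any candidate training category whose cosine similarity to some test category exceeds a threshold. The function names, the 2-D toy embeddings, and the 0.5 threshold are hypothetical and chosen for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def threshold_split(embeddings, test_categories, max_similarity):
    """Return the training categories that are dissimilar to every test category.

    embeddings:      dict mapping category name -> embedding vector.
    test_categories: names held out for testing.
    max_similarity:  a training category is kept only if its closest test
                     category has cosine similarity <= this value.
    """
    test = set(test_categories)
    train = []
    for name, vector in embeddings.items():
        if name in test:
            continue
        closest = max(cosine(vector, embeddings[t]) for t in test)
        if closest <= max_similarity:
            train.append(name)
    return sorted(train)


# Toy 2-D embeddings (hypothetical): animals lie near one axis, furniture near the other.
toy_embeddings = {
    "dog": (1.0, 0.0),
    "wolf": (0.9, 0.1),
    "chair": (0.0, 1.0),
    "table": (0.1, 0.9),
}
train_categories = threshold_split(toy_embeddings, ["dog"], max_similarity=0.5)
```

Lowering `max_similarity` removes more near-duplicates of the test categories from training, which is the sense in which a split "reduces correlation" between the two sets.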

[50] Human-in-the-Loop: Quantitative Evaluation of 3D Models Generation by Large Language Models

Ahmed R. Sadik,Mariusz Bujny

Main category: cs.CV

TL;DR: This paper introduces a quantitative evaluation framework for assessing LLM-generated 3D models, demonstrating that semantically rich inputs (e.g., code prompts) significantly improve accuracy and enable faster convergence toward ground truth CAD models.

Details

Motivation: The motivation is to address the lack of robust evaluation methods for assessing the geometric and structural fidelity of 3D models generated by large language models (LLMs), particularly for applications like CAD design and rapid prototyping. Method: The paper proposes a human-in-the-loop framework with similarity and complexity metrics (e.g., volumetric accuracy, surface alignment) to quantitatively evaluate LLM-generated 3D models. It compares LLM performance across four input modalities using an L-bracket case study. Result: The results show that generation fidelity improves with semantically richer inputs, with code-level prompts achieving perfect reconstruction. The proposed framework enables faster convergence toward ground truth compared to traditional qualitative methods. Conclusion: The paper concludes that the proposed quantitative evaluation framework significantly improves the convergence toward ground truth CAD models and provides a scalable methodology for validating generative models in CAD applications. Abstract: Large Language Models are increasingly capable of interpreting multimodal inputs to generate complex 3D shapes, yet robust methods to evaluate geometric and structural fidelity remain underdeveloped. This paper introduces a human-in-the-loop framework for the quantitative evaluation of LLM-generated 3D models, supporting applications such as democratization of CAD design, reverse engineering of legacy designs, and rapid prototyping. We propose a comprehensive suite of similarity and complexity metrics, including volumetric accuracy, surface alignment, dimensional fidelity, and topological intricacy, to benchmark generated models against ground truth CAD references. Using an L-bracket component as a case study, we systematically compare LLM performance across four input modalities: 2D orthographic views, isometric sketches, geometric structure trees, and code-based correction prompts. 
Our findings demonstrate improved generation fidelity with increased semantic richness, with code-level prompts achieving perfect reconstruction across all metrics. A key contribution of this work is demonstrating that our proposed quantitative evaluation approach enables significantly faster convergence toward the ground truth, especially compared to traditional qualitative methods based solely on visual inspection and human intuition. This work not only advances the understanding of AI-assisted shape synthesis but also provides a scalable methodology to validate and refine generative models for diverse CAD applications.

[51] MEGS$^{2}$: Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning

Jiarui Chen,Yikeng Chen,Yingshuang Zou,Ye Huang,Peng Wang,Yuan Liu,Yujing Sun,Wenping Wang

Main category: cs.CV

TL;DR: MEGS$^{2}$ is a memory-efficient 3D Gaussian Splatting framework that reduces VRAM usage by optimizing both primitive and lobe numbers while preserving rendering quality.

Details

Motivation: 3D Gaussian Splatting (3DGS) suffers from high memory consumption, limiting its use on edge devices. Existing compression methods mainly focus on storage and do not address the critical issue of rendering memory bottleneck. Method: MEGS$^{2}$ replaces spherical harmonics with lightweight arbitrarily-oriented spherical Gaussian lobes and proposes a unified soft pruning framework that jointly optimizes primitive-number and lobe-number pruning as a constrained optimization problem. Result: MEGS$^{2}$ achieves a 50% static VRAM reduction and a 40% rendering VRAM reduction compared to existing methods while maintaining comparable rendering quality. Conclusion: MEGS$^{2}$ is an effective memory-efficient framework for 3D Gaussian Splatting that significantly reduces both static and rendering VRAM without compromising rendering quality. Abstract: 3D Gaussian Splatting (3DGS) has emerged as a dominant novel-view synthesis technique, but its high memory consumption severely limits its applicability on edge devices. A growing number of 3DGS compression methods have been proposed to make 3DGS more efficient, yet most only focus on storage compression and fail to address the critical bottleneck of rendering memory. To address this problem, we introduce MEGS$^{2}$, a novel memory-efficient framework that tackles this challenge by jointly optimizing two key factors: the total primitive number and the parameters per primitive, achieving unprecedented memory compression. Specifically, we replace the memory-intensive spherical harmonics with lightweight arbitrarily-oriented spherical Gaussian lobes as our color representations. More importantly, we propose a unified soft pruning framework that models primitive-number and lobe-number pruning as a single constrained optimization problem. Experiments show that MEGS$^{2}$ achieves a 50% static VRAM reduction and a 40% rendering VRAM reduction compared to existing methods, while maintaining comparable rendering quality.
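To make the color representation concrete, the Python sketch below evaluates a view-dependent color as a sum of spherical Gaussian lobes, using the standard form amplitude * exp(sharpness * (axis · view − 1)). The abstract does not give the paper's exact parameterization, so the lobe layout (axis, sharpness, RGB amplitude), the function names, and the toy values are assumptions.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def sg_color(view_dir, lobes, base_color=(0.0, 0.0, 0.0)):
    """View-dependent RGB color from spherical Gaussian lobes.

    Each lobe is (axis, sharpness, amplitude):
      axis      - direction in which the lobe peaks (any length, normalized here)
      sharpness - larger values give a narrower highlight
      amplitude - RGB contribution at the peak
    """
    d = normalize(view_dir)
    color = list(base_color)
    for axis, sharpness, amplitude in lobes:
        mu = normalize(axis)
        cos_angle = sum(a * b for a, b in zip(d, mu))
        weight = math.exp(sharpness * (cos_angle - 1.0))  # 1 at the axis, decays away from it
        for channel in range(3):
            color[channel] += amplitude[channel] * weight
    return tuple(color)


# One hypothetical lobe pointing along +z.
toy_lobes = [((0.0, 0.0, 1.0), 4.0, (1.0, 0.5, 0.0))]
front = sg_color((0.0, 0.0, 2.0), toy_lobes)   # viewed along the lobe axis
back = sg_color((0.0, 0.0, -1.0), toy_lobes)   # viewed from the opposite side
```

In this sketch each lobe costs 7 floats (3 for the axis, 1 for sharpness, 3 for amplitude), whereas the degree-3 spherical harmonics commonly used in 3DGS store 16 coefficients per color channel, or 48 floats per primitive. That gap is why trading harmonics for a few lobes, and pruning the lobe count per primitive, can reduce memory.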

[52] Moment- and Power-Spectrum-Based Gaussianity Regularization for Text-to-Image Models

Jisung Hwang,Jaihoon Kim,Minhyuk Sung

Main category: cs.CV

TL;DR: This paper introduces a novel regularization loss that enforces standard Gaussianity in the latent space of text-to-image models, combining spatial and spectral domain regularizations. It improves generative modeling performance, enhances aesthetics and text alignment, prevents reward hacking, and accelerates convergence.

Details

Motivation: The motivation is to develop a more effective regularization method for latent space optimization in text-to-image models, addressing limitations in existing Gaussianity-based regularizations like reward hacking and high computational complexity. Method: A composite regularization loss combining moment-based regularization in the spatial domain with power spectrum-based regularization in the spectral domain was developed. This loss encourages latent samples to conform to a standard Gaussian distribution, ensuring permutation invariance by applying losses to randomly permuted inputs. Result: The proposed regularization loss outperforms existing Gaussianity-based regularizations in downstream tasks involving optimization in the latent space of text-to-image models, showing improved performance in aesthetics, text alignment, and convergence speed. Conclusion: The proposed regularization loss effectively improves generative modeling tasks, particularly in enhancing aesthetics and text alignment while preventing reward hacking and accelerating convergence compared to previous methods. Abstract: We propose a novel regularization loss that enforces standard Gaussianity, encouraging samples to align with a standard Gaussian distribution. This facilitates a range of downstream tasks involving optimization in the latent space of text-to-image models. We treat elements of a high-dimensional sample as one-dimensional standard Gaussian variables and define a composite loss that combines moment-based regularization in the spatial domain with power spectrum-based regularization in the spectral domain. Since the expected values of moments and power spectrum distributions are analytically known, the loss promotes conformity to these properties. To ensure permutation invariance, the losses are applied to randomly permuted inputs. 
Notably, existing Gaussianity-based regularizations fall within our unified framework: some correspond to moment losses of specific orders, while the previous covariance-matching loss is equivalent to our spectral loss but incurs higher time complexity due to its spatial-domain computation. We showcase the application of our regularization in generative modeling for test-time reward alignment with a text-to-image model, specifically to enhance aesthetics and text alignment. Our regularization outperforms previous Gaussianity regularization, effectively prevents reward hacking and accelerates convergence.
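The two ingredients of the composite loss can be sketched in a few lines of Python. For a standard Gaussian, the k-th moment is 0 for odd k and (k−1)!! for even k, and the expected power of white unit-variance noise is 1 at every frequency. The sketch penalizes deviations from both after a random permutation. It is a simplified stand-in and not the paper's loss: the choice of moment orders, the band-averaged spectral term, the naive O(n²) DFT, and all function names are assumptions.

```python
import cmath
import math
import random


def gaussian_moment(k):
    """E[x^k] for x ~ N(0, 1): 0 for odd k, (k-1)!! for even k."""
    if k % 2 == 1:
        return 0.0
    result = 1.0
    for j in range(k - 1, 0, -2):
        result *= j
    return result


def moment_loss(x, orders=(1, 2, 3, 4)):
    """Squared gap between empirical and standard Gaussian moments (spatial domain)."""
    n = len(x)
    return sum((sum(v ** k for v in x) / n - gaussian_moment(k)) ** 2 for k in orders)


def power_spectrum(x):
    """|DFT|^2 / n via a naive O(n^2) transform; its expectation is 1 for white N(0, 1) noise."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))) ** 2 / n
        for f in range(n)
    ]


def spectral_loss(x, n_bands=4):
    """Mean squared gap between band-averaged power and 1 (requires len(x) >= n_bands).

    If len(x) is not divisible by n_bands, the trailing frequencies are ignored.
    """
    power = power_spectrum(x)
    band = len(power) // n_bands
    total = 0.0
    for b in range(n_bands):
        chunk = power[b * band:(b + 1) * band]
        total += (sum(chunk) / len(chunk) - 1.0) ** 2
    return total / n_bands


def gaussianity_loss(x, rng):
    """Apply both terms to a random permutation of x, as the abstract describes."""
    permuted = list(x)
    rng.shuffle(permuted)
    return moment_loss(permuted) + spectral_loss(permuted)
```

A constant vector of ones has the right second moment but the wrong mean, third and fourth moments, and all of its power sits at frequency zero, so both terms penalize it. The permutation matters only for the spectral term, because moments do not depend on element order.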

[53] SAM$^{*}$: Task-Adaptive SAM with Physics-Guided Rewards

Kamyar Barakati,Utkarsh Pratiush,Sheryl L. Sanchez,Aditya Raghavan,Delia J. Milliron,Mahshid Ahmadi,Philip D. Rack,Sergei V. Kalinin

Main category: cs.CV

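TL;DR: This study proposes a reward-function-based optimization method that fine-tunes the SAM model to improve its performance on microscopy image segmentation, and that is particularly suited to real-time streaming data.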

Details

Motivation: Foundation models often require extensive manual optimization, which limits their usability for real-time streaming data analysis, so a more efficient optimization method is needed. Method: Constructs reward functions that represent the physics of the imaged system and fine-tunes the SAM model through a reward-driven optimization framework. Result: The proposed optimization improves SAM's adaptability and performance, aligning it better with the requirements of different segmentation tasks, and achieves precise segmentation in microscopy image analysis in particular. Conclusion: With reward-function-based optimization, SAM$^{*}$ shows stronger adaptability and performance across diverse segmentation tasks and is particularly suited to real-time streaming data segmentation. Abstract: Image segmentation is a critical task in microscopy, essential for accurately analyzing and interpreting complex visual data. This task can be performed using custom models trained on domain-specific datasets, transfer learning from pre-trained models, or foundational models that offer broad applicability. However, foundational models often present a considerable number of non-transparent tuning parameters that require extensive manual optimization, limiting their usability for real-time streaming data analysis. Here, we introduce a reward function-based optimization to fine-tune foundational models and illustrate this approach for the SAM (Segment Anything Model) framework by Meta. The reward functions can be constructed to represent the physics of the imaged system, including particle size distributions, geometries, and other criteria. By integrating a reward-driven optimization framework, we enhance SAM's adaptability and performance, leading to an optimized variant, SAM$^{*}$, that better aligns with the requirements of diverse segmentation tasks and particularly allows for real-time streaming data segmentation. We demonstrate the effectiveness of this approach in microscopy imaging, where precise segmentation is crucial for analyzing cellular structures, material interfaces, and nanoscale features.

[54] Enhancing Classification of Streaming Data with Image Distillation

Rwad Khatib,Yehudit Aperstein

Main category: cs.CV

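TL;DR: This paper studies the use of data distillation to improve streaming image data classification in resource-constrained environments, and reports a new standard for accuracy and efficiency.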

Details

Motivation: Efficiently classifying streaming data is challenging in environments with limited memory and computational resources, so the authors explore data distillation as a way to improve the accuracy of streaming image data classification. Method: Uses Distillation Based Classification (DBC), which focuses on distilling essential features from data streams. Result: DBC achieved 73.1% accuracy, surpassing the traditional Hoeffding Tree and Adaptive Random Forest methods as well as the Reservoir Sampling Based Classification technique. Conclusion: DBC performs well on streaming data classification, handles complex data streams, and sets a new standard for accuracy and efficiency. Abstract: This study tackles the challenge of efficiently classifying streaming data in environments with limited memory and computational resources. It delves into the application of data distillation as an innovative approach to improve the precision of streaming image data classification. By focusing on distilling essential features from data streams, our method aims to minimize computational demands while preserving crucial information for accurate classification. Our investigation compares this approach against traditional algorithms like Hoeffding Trees and Adaptive Random Forest, adapted through embeddings for image data. The Distillation Based Classification (DBC) demonstrated superior performance, achieving a 73.1% accuracy rate, surpassing both traditional methods and the Reservoir Sampling Based Classification (RBC) technique. This marks a significant advancement in streaming data classification, showcasing the effectiveness of our method in processing complex data streams and setting a new standard for accuracy and efficiency.
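The RBC baseline is named but not described in the abstract. For orientation, the Python sketch below shows only the classic reservoir sampling step (Algorithm R), which keeps a fixed-size uniform random subset of a stream of unknown length in bounded memory. How the paper builds a classifier on top of the reservoir is not stated, so nothing beyond the sampling step is shown, and the function name and parameters are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream in O(k) memory.

    stream: any iterable, consumed once.
    k:      reservoir size (the memory budget).
    rng:    optional random.Random instance for reproducibility.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical usage: retain 10 of 1,000 streamed sample ids.
kept = reservoir_sample(range(1000), 10, random.Random(0))
```

Every item in the stream ends up in the reservoir with equal probability, which makes this a natural memory-bounded baseline. A distillation approach instead tries to synthesize a small set that summarizes the stream, not just subsample it.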

[55] Automated Evaluation of Gender Bias Across 13 Large Multimodal Models

Juan Manuel Contreras

Main category: cs.CV

TL;DR: This study evaluates gender bias in large multimodal models (LMMs) using a new benchmark called the Aymara Image Fairness Evaluation. It finds that LMMs amplify gender stereotypes, exhibit a default-male bias, and vary significantly in their levels of bias, with some models approaching gender parity. This underscores the need for standardized tools to ensure fairness in AI development.

Details

Motivation: The motivation is to address the lack of large-scale, comparable, and cross-model analysis of gender bias in large multimodal models (LMMs), which risk perpetuating harmful social biases from their training data. Method: The study introduces the Aymara Image Fairness Evaluation benchmark and uses 75 procedurally-generated, gender-neutral prompts to test 13 commercially available LMMs. These models generate images of people in various professions, categorized as stereotypically male, female, or non-stereotypical. A validated LLM-as-a-judge system scores the 965 resulting images for gender representation. Result: The results show that LMMs systematically reproduce and amplify occupational gender stereotypes compared to real-world labor data. They exhibit a default-male bias, generating men 68.3% of the time for non-stereotypical professions, and show significant variation in bias levels across models, with some achieving near gender parity. Conclusion: The study concludes that large multimodal models (LMMs) are not inevitably biased but can reflect design choices, as evidenced by a top-performing model that reduced gender stereotypes and approached gender parity. This highlights the need for standardized evaluation tools to ensure fairness and accountability in AI development. Abstract: Large multimodal models (LMMs) have revolutionized text-to-image generation, but they risk perpetuating the harmful social biases in their training data. Prior work has identified gender bias in these models, but methodological limitations prevented large-scale, comparable, cross-model analysis. To address this gap, we introduce the Aymara Image Fairness Evaluation, a benchmark for assessing social bias in AI-generated images. We test 13 commercially available LMMs using 75 procedurally-generated, gender-neutral prompts to generate people in stereotypically-male, stereotypically-female, and non-stereotypical professions. 
We then use a validated LLM-as-a-judge system to score the 965 resulting images for gender representation. Our results reveal (p < .001 for all): 1) LMMs systematically not only reproduce but amplify occupational gender stereotypes relative to real-world labor data, generating men in 93.0% of images for male-stereotyped professions but only 22.5% for female-stereotyped professions; 2) Models exhibit a strong default-male bias, generating men 68.3% of the time for non-stereotyped professions; and 3) The extent of bias varies dramatically across models, with overall male representation ranging from 46.7% to 73.3%. Notably, the top-performing model de-amplified gender stereotypes and approached gender parity, achieving the highest fairness scores. This variation suggests high bias is not an inevitable outcome but a consequence of design choices. Our work provides the most comprehensive cross-model benchmark of gender bias to date and underscores the necessity of standardized, automated evaluation tools for promoting accountability and fairness in AI development.

[56] Faster VGGT with Block-Sparse Global Attention

Chung-Shien Brian Wang,Christian Schmidt,Jens Piekenbrinck,Bastian Leibe

Main category: cs.CV

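TL;DR: This paper proposes an efficient block-sparse attention mechanism that addresses the runtime bottleneck of transformer-based multi-view reconstruction models and yields a substantial inference speedup.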

Details

Motivation: Transformer-based models such as VGGT and $\pi^3$ achieve good results, but the quadratic complexity of their global attention layers creates a runtime bottleneck that limits scalability to large image sets. This paper aims to address that problem. Method: An empirical analysis of the global attention matrix shows that probability mass concentrates on a small subset of blocks corresponding to cross-view geometric matches. Motivated by this structure and by recent advances in large language models, the paper proposes a block-sparse attention mechanism. Result: The method is evaluated on multiple multi-view benchmarks, which demonstrate its effectiveness. It supports large image collections and applies to both VGGT and $\pi^3$ without retraining. Conclusion: The paper proposes a replacement for the dense global attention operation based on highly optimized block-sparse kernels, achieving up to 4× faster inference while preserving task performance and requiring no retraining of the backbone. Abstract: Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT and $\pi^3$ have achieved impressive results with simple architectures, yet they face an inherent runtime bottleneck, due to the quadratic complexity of the global attention layers, that limits the scalability to large image sets. In this paper, we empirically analyze the global attention matrix of these models and observe that probability mass concentrates on a small subset of patch-patch interactions that correspond to cross-view geometric matches. Motivated by the structured attention and inspired by recent advancements in large language models, we propose a replacement for the dense global attention operation based on highly optimized block-sparse kernels, yielding up to $4\times$ faster inference with comparable task performance. Our retrofit requires no retraining of the backbone, extends to both VGGT and $\pi^3$, and supports large image collections. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate the effectiveness of our approach.
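The idea of skipping most key blocks can be illustrated with a small pure-Python sketch. It scores each (query block, key block) pair using mean-pooled queries and keys, keeps the top-scoring key blocks per query block, and runs softmax attention over only those keys. The abstract does not describe the paper's block selection rule or its kernels, so the pooling heuristic, the function names, and the toy tensors are assumptions. This reference loop shows the arithmetic only and gives no speedup by itself.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_pool(rows):
    """Average a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def block_sparse_attention(q, k, v, block_size, keep_blocks):
    """Single-head attention where each query block attends to its top key blocks only.

    q, k, v:     lists of token vectors (same number of tokens).
    block_size:  tokens per block.
    keep_blocks: number of key blocks retained per query block.
    """
    n, d = len(q), len(q[0])
    blocks = [list(range(s, min(s + block_size, n))) for s in range(0, n, block_size)]
    q_pooled = [mean_pool([q[i] for i in b]) for b in blocks]
    k_pooled = [mean_pool([k[i] for i in b]) for b in blocks]
    scale = 1.0 / math.sqrt(d)

    out = [None] * n
    for qb, query_ids in enumerate(blocks):
        # Rank key blocks by a cheap pooled score (an assumed heuristic).
        scores = [dot(q_pooled[qb], kp) for kp in k_pooled]
        ranked = sorted(range(len(blocks)), key=lambda kb: scores[kb], reverse=True)
        key_ids = [i for kb in sorted(ranked[:keep_blocks]) for i in blocks[kb]]
        for i in query_ids:
            logits = [dot(q[i], k[j]) * scale for j in key_ids]
            peak = max(logits)  # subtract the max for a numerically stable softmax
            weights = [math.exp(l - peak) for l in logits]
            total = sum(weights)
            out[i] = [
                sum(w * v[j][c] for w, j in zip(weights, key_ids)) / total
                for c in range(len(v[0]))
            ]
    return out


# Toy input (hypothetical): two blocks of two tokens whose queries match their own keys.
toy_q = [[5.0, 0.0], [5.0, 0.0], [0.0, 5.0], [0.0, 5.0]]
toy_v = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
sparse_out = block_sparse_attention(toy_q, toy_q, toy_v, block_size=2, keep_blocks=1)
```

With `keep_blocks` equal to the number of blocks, the function reduces to ordinary dense attention. With fewer, the work per query block falls in proportion to the fraction of key blocks kept, which is where a real block-sparse kernel gets its savings.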

[57] Detection and Recovery of Adversarial Slow-Pose Drift in Offloaded Visual-Inertial Odometry

Soruya Saha,Md Nurul Absur,Saptarshi Debroy

Main category: cs.CV

TL;DR: This paper proposes an unsupervised mechanism to detect and recover from pose spoofing attacks in offloaded VIO systems, significantly improving accuracy under attack conditions.

Details

Motivation: The motivation stems from the increasing trend of offloading Visual-Inertial Odometry (VIO) to edge servers, which exposes a threat surface where pose spoofing can cause significant drift while evading heuristic checks. Method: The method involves training a model on attack-free sessions to learn temporal regularities of motion, which is then used to detect deviations during runtime and initiate recovery procedures. Result: The experimental results demonstrated substantial reductions in trajectory and pose error across multiple spoofing intensities on a realistic offloaded-VIO setup using the ILLIXR testbed. Conclusion: The paper concludes that their proposed unsupervised, label-free detection and recovery mechanism effectively reduces trajectory and pose error in an offloaded-VIO environment under spoofing attacks, compared to a no-defense baseline. Abstract: Visual-Inertial Odometry (VIO) supports immersive Virtual Reality (VR) by fusing camera and Inertial Measurement Unit (IMU) data for real-time pose. However, the current trend of offloading VIO to edge servers can create a server-side threat surface where subtle pose spoofing can accumulate into substantial drift while evading heuristic checks. In this paper, we study this threat and present an unsupervised, label-free detection and recovery mechanism. The proposed model is trained on attack-free sessions to learn temporal regularities of motion, which it uses to detect runtime deviations and initiate recovery to restore pose consistency. We evaluate the approach in a realistic offloaded-VIO environment using the ILLIXR testbed across multiple spoofing intensities. Experimental results in terms of well-known performance metrics show substantial reductions in trajectory and pose error compared to a no-defense baseline.

[58] Realism to Deception: Investigating Deepfake Detectors Against Face Enhancement

Muhammad Saad Saeed,Ijaz Ul Haq,Khalid Malik

Main category: cs.CV

TL;DR: Face enhancement techniques can act as anti-forensic tools, significantly reducing deepfake detection accuracy, highlighting the need for more robust forensic methods.

Details

Motivation: The study investigates whether face enhancement techniques, which improve perceptual quality, inadvertently distort biometric features and degrade the performance of deepfake detectors. Method: The study evaluates traditional image processing methods and GAN-based enhancements to assess their impact on deepfake detectors. The evaluation focuses on Naïve, Spatial, and Frequency-based detection methods, with additional adversarial training experiments. Result: Basic enhancement filters reduced detection accuracy with an ASR of up to 64.63%, while GAN-based techniques achieved an ASR of up to 75.12%. Adversarial training experiments were conducted to evaluate model robustness. Conclusion: Face enhancement techniques can serve as effective anti-forensic tools, significantly reducing the accuracy of deepfake detectors. This highlights the need for more robust and adaptive forensic methods. Abstract: Face enhancement techniques are widely used to enhance facial appearance. However, they can inadvertently distort biometric features, leading to a significant decrease in the accuracy of deepfake detectors. This study hypothesizes that these techniques, while improving perceptual quality, can degrade the performance of deepfake detectors. To investigate this, we systematically evaluate whether commonly used face enhancement methods can serve an anti-forensic role by reducing detection accuracy. We use both traditional image processing methods and advanced GAN-based enhancements to evaluate the robustness of deepfake detectors. We provide a comprehensive analysis of the effectiveness of these enhancement techniques, focusing on their impact on Naïve, Spatial, and Frequency-based detection methods. Furthermore, we conduct adversarial training experiments to assess whether exposure to face enhancement transformations improves model robustness. 
Experiments conducted on the FaceForensics++, DeepFakeDetection, and CelebDF-v2 datasets indicate that even basic enhancement filters can significantly reduce detection accuracy, achieving ASR up to 64.63%. GAN-based techniques exploit these vulnerabilities further, achieving ASR up to 75.12%. Our results demonstrate that face enhancement methods can effectively function as anti-forensic tools, emphasizing the need for more resilient and adaptive forensic methods.

[59] Dimensionally Reduced Open-World Clustering: DROWCULA

Erencem Ozbey,Dimitrios I. Diochnos

Main category: cs.CV

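TL;DR: The paper proposes an unsupervised novel class discovery method based on Vision Transformers and manifold learning, achieving state-of-the-art results on multiple datasets.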

Details

Motivation: Labeling instances requires significant human effort, and novel classes may appear in real-world applications, which makes supervised learning methods unsuitable. Method: The approach estimates the number of clusters using Vision Transformers, which generate vector embeddings through attention mechanisms, and refines these embeddings with manifold learning techniques. Result: The method achieves state-of-the-art results on CIFAR-10, CIFAR-100, ImageNet-100, and Tiny ImageNet, both when the number of clusters is known and when it is unknown. Conclusion: The paper proposes a fully unsupervised approach to determining the novel categories in a particular dataset and establishes new state-of-the-art results on single-modal clustering and Novel Class Discovery. Abstract: Working with annotated data is the cornerstone of supervised learning. Nevertheless, providing labels to instances is a task that requires significant human effort. Several critical real-world applications make things more complicated because no matter how many labels may have been identified in a task of interest, it could be the case that examples corresponding to novel classes may appear in the future. Not surprisingly, prior work in this so-called 'open-world' context has focused a lot on semi-supervised approaches. Focusing on image classification, somewhat paradoxically, we propose a fully unsupervised approach to the problem of determining the novel categories in a particular dataset. Our approach relies on estimating the number of clusters using Vision Transformers, which utilize attention mechanisms to generate vector embeddings. Furthermore, we incorporate manifold learning techniques to refine these embeddings by exploiting the intrinsic geometry of the data, thereby enhancing the overall image clustering performance. Overall, we establish new State-of-the-Art results on single-modal clustering and Novel Class Discovery on CIFAR-10, CIFAR-100, ImageNet-100, and Tiny ImageNet. We do so both when the number of clusters is known and when it is unknown ahead of time. The code is available at: https://github.com/DROWCULA/DROWCULA.

[60] XBusNet: Text-Guided Breast Ultrasound Segmentation via Multimodal Vision-Language Learning

Raja Mallina,Bryar Shareef

Main category: cs.CV

TL;DR: XBusNet is a novel dual-branch model combining image and text inputs to improve breast ultrasound segmentation, particularly for small and low-contrast lesions, achieving state-of-the-art results on the BLU dataset.

Details

Motivation: Precise breast ultrasound segmentation is challenging for small or low-contrast lesions. Text prompts can add clinical context, but current methods produce coarse results without additional mechanisms to recover fine edges. Method: XBusNet, a novel dual-prompt, dual-branch multimodal model combining image features with clinically grounded text is proposed. It includes a global pathway encoding whole-image semantics and a local U-Net pathway focusing on precise boundaries, both modulated by prompts describing lesion attributes. Result: XBusNet achieved state-of-the-art performance on the BLU dataset with a mean Dice of 0.8765 and IoU of 0.8149, outperforming baselines. It showed significant improvements for small lesions, with fewer missed regions and spurious activations. Conclusion: A dual-prompt, dual-branch multimodal design that merges global semantics with local precision yields accurate BUS segmentation masks and improves robustness for small, low-contrast lesions. Abstract: Background: Precise breast ultrasound (BUS) segmentation supports reliable measurement, quantitative analysis, and downstream classification, yet remains difficult for small or low-contrast lesions with fuzzy margins and speckle noise. Text prompts can add clinical context, but directly applying weakly localized text-image cues (e.g., CAM/CLIP-derived signals) tends to produce coarse, blob-like responses that smear boundaries unless additional mechanisms recover fine edges. Methods: We propose XBusNet, a novel dual-prompt, dual-branch multimodal model that combines image features with clinically grounded text. A global pathway based on a CLIP Vision Transformer encodes whole-image semantics conditioned on lesion size and location, while a local U-Net pathway emphasizes precise boundaries and is modulated by prompts that describe shape, margin, and Breast Imaging Reporting and Data System (BI-RADS) terms. 
Prompts are assembled automatically from structured metadata, requiring no manual clicks. We evaluate on the Breast Lesions USG (BLU) dataset using five-fold cross-validation. Primary metrics are Dice and Intersection over Union (IoU); we also conduct size-stratified analyses and ablations to assess the roles of the global and local paths and the text-driven modulation. Results: XBusNet achieves state-of-the-art performance on BLU, with mean Dice of 0.8765 and IoU of 0.8149, outperforming six strong baselines. Small lesions show the largest gains, with fewer missed regions and fewer spurious activations. Ablation studies show complementary contributions of global context, local boundary modeling, and prompt-based modulation. Conclusions: A dual-prompt, dual-branch multimodal design that merges global semantics with local precision yields accurate BUS segmentation masks and improves robustness for small, low-contrast lesions.
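The two primary metrics named in this abstract, Dice and Intersection over Union (IoU), can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `dice_and_iou`, the nested-list mask format, and the convention for two empty masks are assumptions made for illustration.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as nested lists of 0/1."""
    pred_flat = [bool(v) for row in pred for v in row]
    target_flat = [bool(v) for row in target for v in row]
    if len(pred_flat) != len(target_flat):
        raise ValueError("masks must have the same number of pixels")

    intersection = sum(p and t for p, t in zip(pred_flat, target_flat))
    pred_area = sum(pred_flat)
    target_area = sum(target_flat)
    union = pred_area + target_area - intersection

    # Assumed convention: two empty masks count as a perfect match.
    if union == 0:
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy 2x3 masks: 3 predicted pixels, 3 ground-truth pixels, 2 shared.
pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
print(f"Dice={dice:.4f} IoU={iou:.4f}")
```

For a single pair of masks the two metrics are tied by Dice = 2·IoU / (1 + IoU). That identity does not carry over to averages across images, which is why a reported mean Dice and mean IoU need not satisfy it.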

[61] Breast Cancer Detection in Thermographic Images via Diffusion-Based Augmentation and Nonlinear Feature Fusion

Sepehr Salem,M. Moein Esfahani,Jingyu Liu,Vince Calhoun

Main category: cs.CV

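TL;DR: This paper proposes a framework based on a Diffusion Probabilistic Model (DPM) to address data scarcity in breast cancer thermogram classification, achieving high accuracy and sensitivity.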

Details

Motivation: Data scarcity limits deep learning applications in medical imaging. To address this, the authors propose a new framework that uses a Diffusion Probabilistic Model for data augmentation. Method: The framework uses a DPM for data augmentation, extracts deep features with a pre-trained ResNet-50, fuses them with handcrafted nonlinear features (e.g., fractal dimension) derived from U-Net-segmented tumors, and classifies the result with an XGBoost classifier. Result: The XGBoost classifier trained on the fused features achieves 98.0% accuracy and 98.1% sensitivity. Ablation studies and statistical tests show that both the DPM augmentation and the nonlinear feature fusion contribute significantly to the results. Conclusion: This work validates the synergy between advanced generative models and interpretable features for creating highly accurate medical diagnostic tools. Abstract: Data scarcity hinders deep learning for medical imaging. We propose a framework for breast cancer classification in thermograms that addresses this using a Diffusion Probabilistic Model (DPM) for data augmentation. Our DPM-based augmentation is shown to be superior to both traditional methods and a ProGAN baseline. The framework fuses deep features from a pre-trained ResNet-50 with handcrafted nonlinear features (e.g., Fractal Dimension) derived from U-Net segmented tumors. An XGBoost classifier trained on these fused features achieves 98.0% accuracy and 98.1% sensitivity. Ablation studies and statistical tests confirm that both the DPM augmentation and the nonlinear feature fusion are critical, statistically significant components of this success. This work validates the synergy between advanced generative models and interpretable features for creating highly accurate medical diagnostic tools.

[62] Reconstruction Alignment Improves Unified Multimodal Models

Ji Xie,Trevor Darrell,Luke Zettlemoyer,XuDong Wang

Main category: cs.CV

TL;DR: The paper proposes RecA, a post-training method for unified multimodal models (UMMs) that improves image generation and editing by using visual embeddings as dense prompts, without requiring detailed captions.

Details

Motivation: The motivation stems from the limitations of conventional training for UMMs, which relies on image-text pairs with sparse captions that often miss fine-grained visual details. This necessitates a more effective and resource-efficient post-training method. Method: The paper introduces Reconstruction Alignment (RecA), a method that leverages visual understanding encoder embeddings as dense 'text prompts' to provide rich supervision without captions. It conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image using a self-supervised reconstruction loss. Result: RecA significantly improves image generation performance on GenEval (0.73→0.90) and DPGBench (80.93→88.15), as well as editing benchmarks (ImgEdit 3.38→3.75, GEdit 6.94→7.25), with only 27 GPU-hours. It also surpasses much larger open-source models. Conclusion: RecA is a broadly applicable and efficient post-training alignment strategy for Unified multimodal models (UMMs) that enhances image generation and editing fidelity across diverse UMM architectures. Abstract: Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details--even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts," providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. 
Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73→0.90) and DPGBench (80.93→88.15), while also boosting editing benchmarks (ImgEdit 3.38→3.75, GEdit 6.94→7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.

[63] DEPF: A UAV Multispectral Object Detector with Dual-Domain Enhancement and Priority-Guided Mamba Fusion

Shucong Li,Zhenyu Liu,Zijie Hong,Zhiheng Zhou,Xianghai Cao

Main category: cs.CV

TL;DR: This paper proposes DEPF, a UAV multispectral object detector that addresses low-light enhancement, redundant information reduction, and computational efficiency using a dual-domain enhancement module and a priority-guided mamba fusion module.

Details

Motivation: Multispectral remote sensing object detection faces challenges such as low-light image quality, redundant information interference, and high computational complexity. These issues hinder the effectiveness and applicability of existing methods, especially on UAV platforms. Method: DEPF integrates a Dual-Domain Enhancement Module (DDE) and a Priority-Guided Mamba Fusion Module (PGMF). DDE enhances low-light images using Cross-Scale Wavelet Mamba (CSWM) and Fourier Details Recovery (FDR), while PGMF improves local target modeling by prioritizing feature scanning based on modality differences. Result: Experiments on the DroneVehicle and VEDAI datasets show that DEPF performs well on object detection compared with state-of-the-art methods, while its Mamba-based design keeps computational complexity linear, which suits UAV deployment. Conclusion: The proposed DEPF detector addresses the challenges of low-light remote sensing images, local small target modeling, and computational complexity in UAV multispectral object detection. Abstract: Multispectral remote sensing object detection is one of the important applications of unmanned aerial vehicles (UAVs). However, it faces three challenges. Firstly, low-light remote sensing images reduce the complementarity during multi-modality fusion. Secondly, local small target modeling is easily disrupted by redundant information in the fusion stage. Thirdly, due to their quadratic computational complexity, transformer-based methods are hard to apply on the UAV platform. To address these limitations, motivated by Mamba with its linear complexity, a UAV multispectral object detector with dual-domain enhancement and priority-guided mamba fusion (DEPF) is proposed. Firstly, to enhance low-light remote sensing images, a Dual-Domain Enhancement Module (DDE) is designed, which contains a Cross-Scale Wavelet Mamba (CSWM) and a Fourier Details Recovery block (FDR). 
CSWM applies cross-scale mamba scanning to the low-frequency components to enhance the global brightness of images, while FDR constructs a spectrum recovery network to enhance the frequency spectra features for recovering texture details. Secondly, to enhance local target modeling and reduce the impact of redundant information during fusion, a Priority-Guided Mamba Fusion Module (PGMF) is designed. PGMF introduces the concept of priority scanning, which starts from local target features according to the priority scores obtained from modality difference. Experiments on the DroneVehicle and VEDAI datasets show that DEPF performs well on object detection compared with state-of-the-art methods. Our code is available in the supplementary material.

Haiqing Ren,Zhongkai Luo,Heng Fan,Xiaohui Yuan,Guanchen Wang,Libo Zhang

Main category: cs.CV

TL;DR: This paper proposes G$^{3}$CN, a novel method for skeleton-based action recognition that improves the representation of ambiguous actions by refining graph topology with a Gaussian filter and enhancing information propagation using GRUs, demonstrating strong performance across multiple benchmarks.

Details

Motivation: Despite the effectiveness of Graph Convolutional Networks (GCNs) in skeleton-based action recognition, they struggle to distinguish between ambiguous actions due to limitations in representing topological and spatial features. This motivates the development of a novel approach to address this challenge. Method: The method introduces G$^{3}$CN, which combines a Gaussian filter to refine the skeleton topology graph and Gated Recurrent Units (GRUs) to enhance information propagation between skeleton points within the GCN framework. Result: Extensive experiments on NTU RGB+D, NTU RGB+D 120, and NW-UCLA benchmarks show that G$^{3}$CN effectively improves action recognition performance, especially for ambiguous samples. Conclusion: The proposed G$^{3}$CN method effectively improves skeleton-based action recognition, particularly for ambiguous samples, and demonstrates strong generalization across various GCN backbones. Abstract: Graph Convolutional Networks (GCNs) have proven to be highly effective for skeleton-based action recognition, primarily due to their ability to leverage graph topology for feature aggregation, a key factor in extracting meaningful representations. However, despite their success, GCNs often struggle to effectively distinguish between ambiguous actions, revealing limitations in the representation of learned topological and spatial features. To address this challenge, we propose a novel approach, Gaussian Topology Refinement Gated Graph Convolution (G$^{3}$CN), to address the challenge of distinguishing ambiguous actions in skeleton-based action recognition. G$^{3}$CN incorporates a Gaussian filter to refine the skeleton topology graph, improving the representation of ambiguous actions. Additionally, Gated Recurrent Units (GRUs) are integrated into the GCN framework to enhance information propagation between skeleton points. Our method shows strong generalization across various GCN backbones. 
Extensive experiments on NTU RGB+D, NTU RGB+D 120, and NW-UCLA benchmarks demonstrate that G$^{3}$CN effectively improves action recognition, particularly for ambiguous samples.

[65] Parse Graph-Based Visual-Language Interaction for Human Pose Estimation

Shibang Liu,Xuemei Xie,Guangming Shi

Main category: cs.CV

TL;DR: This paper introduces PGVL, a novel framework for human pose estimation that leverages visual and language modalities through a Guided Module and recursive cross-attention mechanisms.

Details

Motivation: Current human pose estimation methods primarily focus on single modality modeling, neglecting the potential of multimodal fusion, particularly with language, which provides rich priors such as spatial relations. Method: PGVL employs a Guided Module (GM) through which high semantic nodes guide the feature update of low semantic nodes, and uses recursive bidirectional cross-attention to fuse visual and language features. Result: PGVL and the accompanying network are validated on major pose estimation datasets. Conclusion: PGVL integrates visual and language information for human pose estimation, with a design aimed at occluded scenes. Abstract: Parse graphs boost human pose estimation (HPE) by integrating context and hierarchies, yet prior work mostly focuses on single modality modeling, ignoring the potential of multimodal fusion. Notably, language offers rich HPE priors like spatial relations for occluded scenes, but existing visual-language fusion via global feature integration weakens occluded region responses and causes alignment and location failures. To address this issue, we propose Parse Graph-based Visual-Language interaction (PGVL) with a core novel Guided Module (GM). In PGVL, low-level nodes focus on local features, maximizing the maintenance of responses in occluded areas, and high-level nodes integrate global features to infer occluded or invisible parts. GM enables high semantic nodes to guide the feature update of low semantic nodes that have undergone cross attention. This ensures effective fusion of diverse information. PGVL includes top-down decomposition and bottom-up composition. In the first stage, modality-specific parse graphs are constructed. In the next stage, recursive bidirectional cross-attention is used, purified by GM. We also design a network based on PGVL. PGVL and our network are validated on major pose estimation datasets. We will release the code soon.

[66] DreamLifting: A Plug-in Module Lifting MV Diffusion Models for 3D Asset Generation

Ze-Xin Yin,Jiaxiong Qiu,Liu Liu,Xinjie Wang,Wei Sui,Zhizhong Su,Jian Yang,Jin Xie

Main category: cs.CV

TL;DR: LGAA is a modular, end-to-end framework for generating PBR-ready 3D assets using multi-view diffusion models, enabling efficient training and high-quality output.

Details

Motivation: The creation of 3D assets with physically based rendering materials is labor-intensive, and existing methods often neglect end-to-end integration of geometry and material modeling. LGAA aims to automate this process by leveraging multi-view diffusion priors for unified modeling. Method: LGAA introduces a modular framework with three components: the LGAA Wrapper, which reuses and adapts network layers from MV diffusion models; the LGAA Switcher, which aligns multiple diffusion priors; and the LGAA Decoder, a tamed variational autoencoder that predicts 2D Gaussian Splatting with PBR channels. A post-processing step extracts high-quality mesh assets. Result: LGAA achieves superior performance in generating PBR-ready 3D assets using both text- and image-conditioned multi-view diffusion models. It enables efficient convergence using only 69k multi-view instances and supports flexible integration of multiple diffusion priors. Conclusion: LGAA provides an efficient and modular approach for end-to-end PBR-ready 3D asset generation, demonstrating superior performance with both text- and image-conditioned multi-view diffusion models while enabling flexibility and efficient training. Abstract: The labor- and experience-intensive creation of 3D assets with physically based rendering (PBR) materials demands an autonomous 3D asset creation pipeline. However, most existing 3D generation methods focus on geometry modeling, either baking textures into simple vertex colors or leaving texture synthesis to post-processing with image diffusion models. To achieve end-to-end PBR-ready 3D asset generation, we present Lightweight Gaussian Asset Adapter (LGAA), a novel framework that unifies the modeling of geometry and PBR materials by exploiting multi-view (MV) diffusion priors from a novel perspective. The LGAA features a modular design with three components. 
Specifically, the LGAA Wrapper reuses and adapts network layers from MV diffusion models, which encapsulate knowledge acquired from billions of images, enabling better convergence in a data-efficient manner. To incorporate multiple diffusion priors for geometry and PBR synthesis, the LGAA Switcher aligns multiple LGAA Wrapper layers encapsulating different knowledge. Then, a tamed variational autoencoder (VAE), termed LGAA Decoder, is designed to predict 2D Gaussian Splatting (2DGS) with PBR channels. Finally, we introduce a dedicated post-processing procedure to effectively extract high-quality, relightable mesh assets from the resulting 2DGS. Extensive quantitative and qualitative experiments demonstrate the superior performance of LGAA with both text- and image-conditioned MV diffusion models. Additionally, the modular design enables flexible incorporation of multiple diffusion priors, and the knowledge-preserving scheme leads to efficient convergence when trained on merely 69k multi-view instances. Our code, pre-trained weights, and the dataset used will be publicly available via our project page: https://zx-yin.github.io/dreamlifting/.

[67] In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting

Taiying Peng,Jiacheng Hua,Miao Liu,Feng Lu

Main category: cs.CV

TL;DR: This paper introduces EgoGazeVQA, a new benchmark leveraging gaze information to improve AI assistants' understanding of user intentions in egocentric videos, showing that incorporating gaze data enhances the performance of multimodal large language models.

Details

Motivation: Existing benchmarks for MLLMs processing egocentric videos overlook gaze as an indicator of user intent. Method: The researchers introduce EgoGazeVQA, a benchmark for gaze-guided video question answering whose QA pairs are generated by MLLMs and refined by human annotators, and conduct experiments on gaze-related fine-tuning. Result: Existing MLLMs struggle to interpret user intentions in egocentric videos, but performance improves with gaze-guided intent prompting methods. Conclusion: Integrating gaze information significantly enhances the performance of MLLMs in interpreting user intentions in egocentric video settings. Abstract: The emergence of advanced multimodal large language models (MLLMs) has significantly enhanced AI assistants' ability to process complex information across modalities. Recently, egocentric videos, by directly capturing user focus, actions, and context in a unified coordinate, offer an exciting opportunity to enable proactive and personalized AI user experiences with MLLMs. However, existing benchmarks overlook the crucial role of gaze as an indicator of user intent. To address this gap, we introduce EgoGazeVQA, an egocentric gaze-guided video question answering benchmark that leverages gaze information to improve the understanding of longer daily-life videos. EgoGazeVQA consists of gaze-based QA pairs generated by MLLMs and refined by human annotators. Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions. In contrast, our gaze-guided intent prompting methods significantly enhance performance by integrating spatial, temporal, and intent-related cues. We further conduct experiments on gaze-related fine-tuning and analyze how gaze estimation accuracy impacts prompting effectiveness. These results underscore the value of gaze for more personalized and effective AI assistants in egocentric settings.

[68] GLEAM: Learning to Match and Explain in Cross-View Geo-Localization

Xudong Lu,Zhi Zheng,Yi Wan,Yongxiang Yao,Annan Wang,Renrui Zhang,Panwang Xia,Qiong Wu,Qingyun Li,Weifeng Lin,Xiangyu Zhao,Xue Yang,Hongsheng Li

Main category: cs.CV

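TL;DR: This paper presents GLEAM-C and GLEAM-X, a new geo-localization approach that combines multimodal alignment with explainable reasoning, improving matching accuracy and model transparency.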

Details

Motivation: Traditional CVGL methods are restricted to a single view or modality and lack interpretability. A more efficient, accurate, and interpretable geo-localization method is needed. Method: A two-phase training strategy aligns multiple views and modalities (UAV imagery, street maps, panoramic views, and ground photographs) with satellite imagery, and multimodal large language models (MLLMs) are used for explainable cross-view reasoning. Result: The approach achieves accuracy comparable to existing modality-specific CVGL models while improving training efficiency. A bilingual benchmark supporting the explainability task is constructed, and its test set is refined through human revision. Conclusion: GLEAM-C and GLEAM-X form a new CVGL pipeline that integrates multi-modal, multi-view alignment with interpretable correspondence analysis, improving the accuracy and interpretability of geo-localization. Abstract: Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. However, existing CVGL approaches are typically restricted to a single view or modality, and their direct visual matching strategy lacks interpretability: they merely predict whether two images correspond, without explaining the rationale behind the match. In this paper, we present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities, including UAV imagery, street maps, panoramic views, and ground photographs, by aligning them exclusively with satellite imagery. Our framework enhances training efficiency through optimized implementation while achieving accuracy comparable to prior modality-specific CVGL models through a two-phase training strategy. Moreover, to address the lack of interpretability in traditional CVGL methods, we leverage the reasoning capabilities of multimodal large language models (MLLMs) to propose a new task, GLEAM-X, which combines cross-view correspondence prediction with explainable reasoning. To support this task, we construct a bilingual benchmark using GPT-4o and Doubao-1.5-Thinking-Vision-Pro to generate training and testing data. The test set is further refined through detailed human revision, enabling systematic evaluation of explainable cross-view reasoning and advancing transparency and scalability in geo-localization. 
Together, GLEAM-C and GLEAM-X form a comprehensive CVGL pipeline that integrates multi-modal, multi-view alignment with interpretable correspondence analysis, unifying accurate cross-view matching with explainable reasoning and advancing Geo-Localization by enabling models to better Explain And Match. Code and datasets used in this work will be made publicly accessible at https://github.com/Lucky-Lance/GLEAM.

[69] XOCT: Enhancing OCT to OCTA Translation via Cross-Dimensional Supervised Multi-Scale Feature Learning

Pooya Khosravi,Kun Han,Anthony T. Wu,Arghavan Rezvani,Zexin Feng,Xiaohui Xie

Main category: cs.CV

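TL;DR: This paper proposes XOCT, a new deep learning framework for improving the quality of optical coherence tomography angiography (OCTA). It integrates cross-dimensional supervision with a multi-scale feature fusion network to improve retinal vessel reconstruction, strengthening the use of OCTA in detecting and monitoring ophthalmic diseases.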

Details

Motivation: Acquiring high-quality OCTA images is challenging because of motion sensitivity and the high cost of software modifications for conventional OCT devices. In addition, current deep learning methods for OCT-to-OCTA translation often overlook vascular differences across retinal layers and struggle to reconstruct intricate, dense vascular details. Method: The paper proposes XOCT, a deep learning framework that integrates Cross-Dimensional Supervision (CDS) with a Multi-Scale Feature Fusion (MSFF) network. It uses 2D layer-wise en-face projections as supervisory signals and enhances vessel delineation through multi-scale feature extraction and a channel reweighting strategy. Result: Experiments on the OCTA-500 dataset show that XOCT yields marked improvements in en-face projections, which are particularly important for the clinical evaluation of retinal pathologies. Conclusion: XOCT is a new deep learning framework that combines cross-dimensional supervision with a multi-scale feature fusion network for retinal-layer-aware vascular reconstruction, with the potential to enhance the accessibility, reliability, and diagnostic value of OCTA. Abstract: Optical Coherence Tomography Angiography (OCTA) and its derived en-face projections provide high-resolution visualization of the retinal and choroidal vasculature, which is critical for the rapid and accurate diagnosis of retinal diseases. However, acquiring high-quality OCTA images is challenging due to motion sensitivity and the high costs associated with software modifications for conventional OCT devices. Moreover, current deep learning methods for OCT-to-OCTA translation often overlook the vascular differences across retinal layers and struggle to reconstruct the intricate, dense vascular details necessary for reliable diagnosis. To overcome these limitations, we propose XOCT, a novel deep learning framework that integrates Cross-Dimensional Supervision (CDS) with a Multi-Scale Feature Fusion (MSFF) network for layer-aware vascular reconstruction. Our CDS module leverages 2D layer-wise en-face projections, generated via segmentation-weighted z-axis averaging, as supervisory signals to compel the network to learn distinct representations for each retinal layer through fine-grained, targeted guidance. Meanwhile, the MSFF module enhances vessel delineation through multi-scale feature extraction combined with a channel reweighting strategy, effectively capturing vascular details at multiple spatial scales. 
Our experiments on the OCTA-500 dataset demonstrate XOCT's improvements, especially for the en-face projections which are significant for clinical evaluation of retinal pathologies, underscoring its potential to enhance OCTA accessibility, reliability, and diagnostic value for ophthalmic disease detection and monitoring. The code is available at https://github.com/uci-cbcl/XOCT.
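The abstract says the layer-wise en-face projections are "generated via segmentation-weighted z-axis averaging" without giving a formula. One plausible reading, sketched below as an assumption and not as the authors' implementation, is a per-pixel weighted mean along depth, where the weights come from a segmentation mask for one retinal layer. The function name `enface_projection`, the `[z][y][x]` indexing, and the zero fill for columns without any layer voxels are all hypothetical.

```python
def enface_projection(volume, layer_weights):
    """Collapse a [z][y][x] volume into a [y][x] image by weighted z-averaging.

    layer_weights has the same shape as volume and holds the (soft or binary)
    segmentation weight of one retinal layer at every voxel. Both the indexing
    and the weighting scheme are assumptions made for this sketch.
    """
    depth = len(volume)
    height = len(volume[0])
    width = len(volume[0][0])

    projection = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weighted_sum = 0.0
            weight_total = 0.0
            for z in range(depth):
                w = layer_weights[z][y][x]
                weighted_sum += w * volume[z][y][x]
                weight_total += w
            # Columns with no voxel in this layer are left at 0.0 (assumed).
            if weight_total > 0:
                projection[y][x] = weighted_sum / weight_total
    return projection


# Toy volume: depth 3, height 1, width 2.
volume = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]
# The layer covers the two shallowest voxels of the first column only.
layer_weights = [[[1, 0]], [[1, 0]], [[0, 0]]]
projection = enface_projection(volume, layer_weights)
print(projection)
```

Running the same function once per layer mask would give one 2D supervisory target per retinal layer, which is how the sketch connects to the layer-aware supervision described above.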

[70] Bias-Aware Machine Unlearning: Towards Fairer Vision Models via Controllable Forgetting

Sai Siddhartha Chary Aylapuram,Veeraraju Elluru,Shivang Agarwal

Main category: cs.CV

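TL;DR: This paper studies how machine unlearning can reduce bias in vision models by selectively removing biased samples or feature representations, thereby mitigating diverse forms of bias.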

Details

Motivation: Deep neural networks rely on spurious correlations in training data, leading to biased or unfair predictions in safety-critical domains such as medicine and autonomous driving. Conventional bias mitigation typically requires retraining from scratch or redesigning data pipelines, whereas machine unlearning offers a promising alternative. Method: Building on privacy-preserving unlearning techniques, the paper evaluates several strategies, including Gradient Ascent, LoRA, and Teacher-Student distillation. Result: Empirical analysis on three benchmark datasets shows that post-hoc unlearning can substantially reduce subgroup disparities, with demographic parity improvements of up to 94.86% on CUB-200, 30.28% on CIFAR-10, and 97.37% on CelebA. Conclusion: Machine unlearning can serve as a practical framework for improving fairness in deployed vision systems without full retraining. Abstract: Deep neural networks often rely on spurious correlations in training data, leading to biased or unfair predictions in safety-critical domains such as medicine and autonomous driving. While conventional bias mitigation typically requires retraining from scratch or redesigning data pipelines, recent advances in machine unlearning provide a promising alternative for post-hoc model correction. In this work, we investigate Bias-Aware Machine Unlearning, a paradigm that selectively removes biased samples or feature representations to mitigate diverse forms of bias in vision models. Building on privacy-preserving unlearning techniques, we evaluate various strategies including Gradient Ascent, LoRA, and Teacher-Student distillation. Through empirical analysis on three benchmark datasets, CUB-200-2011 (pose bias), CIFAR-10 (synthetic patch bias), and CelebA (gender bias in smile detection), we demonstrate that post-hoc unlearning can substantially reduce subgroup disparities, with improvements in demographic parity of up to 94.86% on CUB-200, 30.28% on CIFAR-10, and 97.37% on CelebA. These gains are achieved with minimal accuracy loss and with methods scoring an average of 0.62 across the 3 settings on the joint evaluation of utility, fairness, quality, and privacy. Our findings establish machine unlearning as a practical framework for enhancing fairness in deployed vision systems without necessitating full retraining.
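The demographic parity figures above are reported as percentage improvements. As a rough illustration only, the sketch below computes a demographic parity gap (the spread in positive-prediction rates across groups) and its relative reduction. The abstract does not state how the authors compute the improvement, so the helper names and the relative-reduction formula are assumptions.

```python
def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups."""
    by_group = {}
    for pred, group in zip(predictions, groups):
        by_group.setdefault(group, []).append(pred)
    rates = [sum(preds) / len(preds) for preds in by_group.values()]
    return max(rates) - min(rates)


def relative_improvement(gap_before, gap_after):
    """Percentage reduction of the gap (assumed definition of 'improvement')."""
    return 100.0 * (gap_before - gap_after) / gap_before


# Toy binary predictions for two subgroups, before and after unlearning.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
before = [1, 1, 1, 0, 0, 0, 0, 1]   # positive rates: a = 0.75, b = 0.25
after = [1, 1, 0, 0, 1, 0, 0, 0]    # positive rates: a = 0.50, b = 0.25

gap_before = demographic_parity_gap(before, groups)
gap_after = demographic_parity_gap(after, groups)
improvement = relative_improvement(gap_before, gap_after)
print(gap_before, gap_after, improvement)
```

A gap of zero means every group receives positive predictions at the same rate. The metric says nothing about accuracy, which is why the abstract reports utility separately.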

[71] ANYPORTAL: Zero-Shot Consistent Video Background Replacement

Wenshuo Gao,Xicheng Lan,Shuai Yang

Main category: cs.CV

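TL;DR: ANYPORTAL is a novel zero-shot video background replacement framework that combines pre-trained diffusion models with a Refinement Projection Algorithm for efficient video content creation and editing.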

Details

Motivation: Despite rapid progress in video generation, creating high-quality videos that precisely match user intentions remains a significant challenge. Existing methods often lack fine-grained control over video details, which limits their practical applicability. Method: In a zero-shot setting, ANYPORTAL combines the temporal prior of video diffusion models with the relighting capabilities of image diffusion models, and proposes a Refinement Projection Algorithm to ensure precise foreground preservation. Result: Experiments show that ANYPORTAL achieves high-quality results on consumer-grade GPUs. Conclusion: ANYPORTAL offers a practical and efficient solution for video content creation and editing, and overcomes the challenges of achieving foreground consistency and temporally coherent relighting. Abstract: Despite the rapid advancements in video generation technology, creating high-quality videos that precisely align with user intentions remains a significant challenge. Existing methods often fail to achieve fine-grained control over video details, limiting their practical applicability. We introduce ANYPORTAL, a novel zero-shot framework for video background replacement that leverages pre-trained diffusion models. Our framework collaboratively integrates the temporal prior of video diffusion models with the relighting capabilities of image diffusion models in a zero-shot setting. To address the critical challenge of foreground consistency, we propose a Refinement Projection Algorithm, which enables pixel-level detail manipulation to ensure precise foreground preservation. ANYPORTAL is training-free and overcomes the challenges of achieving foreground consistency and temporally coherent relighting. Experimental results demonstrate that ANYPORTAL achieves high-quality results on consumer-grade GPUs, offering a practical and efficient solution for video content creation and editing.

[72] MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification

Patrick Wienholt,Christiane Kuhl,Jakob Nikolas Kather,Sven Nebelung,Daniel Truhn

Main category: cs.CV

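TL;DR: MedicalPatchNet is a new self-explainable chest X-ray classification model that substantially improves pathology localization accuracy while maintaining classification performance, thereby strengthening clinical trust.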

Details

Motivation: Deep neural networks excel at radiological image classification but lack interpretability, which limits their acceptance in clinical settings. An inherently interpretable model is therefore needed to improve clinical trust. Method: The image is split into non-overlapping patches, each patch is classified independently, and the predictions are aggregated, which allows each patch's diagnostic contribution to be visualized intuitively. Result: On the CheXpert dataset, MedicalPatchNet matches the classification performance of EfficientNet-B0 (AUROC 0.907 vs. 0.908), while pathology localization accuracy on the CheXlocalize dataset improves substantially (mean hit-rate 0.485 vs. 0.376 with Grad-CAM). Conclusion: MedicalPatchNet is an inherently interpretable architecture for chest X-ray classification that improves clinical trust by providing explicit and reliable explanations, and contributes to safer, explainable AI-assisted diagnostics. Abstract: Deep neural networks excel in radiological image classification but frequently suffer from poor interpretability, limiting clinical acceptance. We present MedicalPatchNet, an inherently self-explainable architecture for chest X-ray classification that transparently attributes decisions to distinct image regions. MedicalPatchNet splits images into non-overlapping patches, independently classifies each patch, and aggregates predictions, enabling intuitive visualization of each patch's diagnostic contribution without post-hoc techniques. Trained on the CheXpert dataset (223,414 images), MedicalPatchNet matches the classification performance (AUROC 0.907 vs. 0.908) of EfficientNet-B0, while substantially improving interpretability, with higher pathology localization accuracy (mean hit-rate 0.485 vs. 0.376 with Grad-CAM) on the CheXlocalize dataset. By providing explicit, reliable explanations accessible even to non-AI experts, MedicalPatchNet mitigates risks associated with shortcut learning, thus improving clinical trust. Our model is publicly available with reproducible training and inference scripts and contributes to safer, explainable AI-assisted diagnostics across medical imaging domains. We make the code publicly available: https://github.com/TruhnLab/MedicalPatchNet
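
The split, classify, aggregate pipeline described above can be sketched in a few lines. This is an illustration only: `toy_patch_logit` is a hypothetical stand-in for the learned per-patch classifier, and averaging the patch logits is an assumed aggregation rule, since the abstract does not state which one is used.

```python
def split_into_patches(image, patch):
    """Split a 2D list (H x W) into non-overlapping patch x patch tiles, row-major."""
    h, w = len(image), len(image[0])
    tiles = []
    for r in range(0, h - h % patch, patch):
        for c in range(0, w - w % patch, patch):
            tiles.append([row[c:c + patch] for row in image[r:r + patch]])
    return tiles


def toy_patch_logit(tile):
    """Hypothetical per-patch classifier: mean intensity minus 0.5."""
    values = [v for row in tile for v in row]
    return sum(values) / len(values) - 0.5


def classify(image, patch):
    """Return the image-level logit (mean of patch logits) and the per-patch logits."""
    logits = [toy_patch_logit(t) for t in split_into_patches(image, patch)]
    return sum(logits) / len(logits), logits


image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
image_logit, patch_logits = classify(image, patch=2)
print(patch_logits)  # [-0.5, 0.5, -0.5, -0.5]
print(image_logit)   # -0.25
```

Because the image-level score is a plain average, each entry of `patch_logits` is directly that patch's contribution, which is what makes a saliency map available without a post-hoc method.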

[73] LINR Bridge: Vector Graphic Animation via Neural Implicits and Video Diffusion Priors

Wenshuo Gao,Xicheng Lan,Luyao Zhang,Shuai Yang

Main category: cs.CV

TL;DR: This paper introduces an automated method for animating vector graphics by combining implicit neural representations with text-to-video diffusion models, resulting in high-quality, flexible animations with minimal manual input.

Details

Motivation: Vector graphics offer scalability and user-friendliness but animating them typically requires significant manual effort. Automating this process enhances comprehensibility, controllability, and efficiency. Method: The approach uses layered implicit neural representations to reconstruct vector graphics, preserving their properties, and optimizes these representations using video score distillation sampling based on motion priors from pretrained text-to-video diffusion models. The vector graphics are then warped to produce smooth animation. Result: Experimental results show that the method generates vivid and natural vector graphic animations, demonstrating significant improvements in animation quality and flexibility over existing techniques. Conclusion: The proposed method successfully automates the animation of vector graphics by integrating implicit neural representations with text-to-video diffusion models, leading to improved flexibility and animation quality compared to existing techniques. Abstract: Vector graphics, known for their scalability and user-friendliness, provide a unique approach to visual content compared to traditional pixel-based images. Animation of these graphics, driven by the motion of their elements, offers enhanced comprehensibility and controllability but often requires substantial manual effort. To automate this process, we propose a novel method that integrates implicit neural representations with text-to-video diffusion models for vector graphic animation. Our approach employs layered implicit neural representations to reconstruct vector graphics, preserving their inherent properties such as infinite resolution and precise color and shape constraints, which effectively bridges the large domain gap between vector graphics and diffusion models. The neural representations are then optimized using video score distillation sampling, which leverages motion priors from pretrained text-to-video diffusion models. 
Finally, the vector graphics are warped to match the representations resulting in smooth animation. Experimental results validate the effectiveness of our method in generating vivid and natural vector graphic animations, demonstrating significant improvement over existing techniques that suffer from limitations in flexibility and animation quality.

Xiao Li,Bharat Gandhi,Ming Zhan,Mohit Nehra,Zhicheng Zhang,Yuchen Sun,Meijia Song,Naisheng Zhang,Xi Wang

Main category: cs.CV

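TL;DR: This paper fine-tunes the BLIP-2 model and refines the evaluation metric, significantly improving vision-language-driven indoor navigation, particularly the generation of directional instructions.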

Details

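Motivation: Traditional navigation systems are ineffective indoors because precise location data is lacking. This work aims to improve the accessibility and independence of indoor navigation for visually impaired individuals through image and natural language guidance. Method: The study integrates vision and language models, fine-tunes BLIP-2 on a manually annotated indoor navigation dataset, and proposes an evaluation metric that refines the BERT F1 score by emphasizing directional and sequential variables. Result: After applying LoRA, the model improved significantly at generating directional instructions, and the proposed metric provides a more comprehensive measure of navigational performance. Conclusion: Fine-tuning BLIP-2 with LoRA yields significant improvements in generating directional navigation instructions, overcoming limitations of the original BLIP-2 model. Abstract: We address vision-language-driven indoor navigation to assist visually impaired individuals in reaching a target location using images and natural language guidance. Traditional navigation systems are ineffective indoors due to the lack of precise location data. Our approach integrates vision and language models to generate step-by-step navigational instructions, enhancing accessibility and independence. We fine-tune the BLIP-2 model with Low Rank Adaptation (LoRA) on a manually annotated indoor navigation dataset. We propose an evaluation metric that refines the BERT F1 score by emphasizing directional and sequential variables, providing a more comprehensive measure of navigational performance. After applying LoRA, the model significantly improved in generating directional instructions, overcoming limitations in the original BLIP-2 model.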

[75] DiGS: Accurate and Complete Surface Reconstruction from 3D Gaussians via Direct SDF Learning

Wenzhi Guo,Bing Wang

Main category: cs.CV

TL;DR: DiGS improves surface reconstruction for 3D Gaussian Splatting by introducing SDF learning and a geometry-guided growth strategy.

Details

Motivation: 3DGS struggles with surface reconstruction because of its unstructured representation and the absence of explicit geometric supervision. Method: DiGS embeds Signed Distance Field (SDF) learning into the 3DGS pipeline and designs a geometry-guided grid growth strategy. Result: Experiments show that DiGS improves reconstruction accuracy and completeness on standard datasets including DTU, Mip-NeRF 360, and Tanks & Temples. Conclusion: DiGS improves the surface reconstruction capability of 3D Gaussian Splatting while retaining high rendering quality. Abstract: 3D Gaussian Splatting (3DGS) has recently emerged as a powerful paradigm for photorealistic view synthesis, representing scenes with spatially distributed Gaussian primitives. While highly effective for rendering, achieving accurate and complete surface reconstruction remains challenging due to the unstructured nature of the representation and the absence of explicit geometric supervision. In this work, we propose DiGS, a unified framework that embeds Signed Distance Field (SDF) learning directly into the 3DGS pipeline, thereby enforcing strong and interpretable surface priors. By associating each Gaussian with a learnable SDF value, DiGS explicitly aligns primitives with underlying geometry and improves cross-view consistency. To further ensure dense and coherent coverage, we design a geometry-guided grid growth strategy that adaptively distributes Gaussians along geometry-consistent regions under a multi-scale hierarchy. Extensive experiments on standard benchmarks, including DTU, Mip-NeRF 360, and Tanks & Temples, demonstrate that DiGS consistently improves reconstruction accuracy and completeness while retaining high rendering fidelity.

[76] Generating Transferrable Adversarial Examples via Local Mixing and Logits Optimization for Remote Sensing Object Recognition

Chun Liu,Hailong Wang,Bingqian Zhu,Panpan Ding,Zheng Zheng,Tao Xu,Zhigang Han,Jiayao Wang

Main category: cs.CV

TL;DR: This paper proposes a new framework for adversarial attacks on DNNs in remote sensing, using local mixing and logit optimization to improve transferability and overcome gradient diminishing issues, showing superior performance over existing methods.

Details

Motivation: The motivation is to address the limitations of current mixing-based strategies in adversarial attacks on DNNs, particularly their impact on global semantic features and the gradient diminishing problem. Method: The paper introduces a local mixing strategy and adapts logit loss for non-targeted attacks, along with a perturbation smoothing loss to enhance transferability. Result: Extensive experiments show that the proposed method significantly improves the black-box attack success rate over state-of-the-art methods on multiple datasets and surrogate models. Conclusion: The paper concludes that the proposed framework for adversarial attacks on DNNs in remote sensing applications demonstrates superior performance compared to existing methods. Abstract: Deep Neural Networks (DNNs) are vulnerable to adversarial attacks, posing significant security threats to their deployment in remote sensing applications. Research on adversarial attacks not only reveals model vulnerabilities but also provides critical insights for enhancing robustness. Although current mixing-based strategies have been proposed to increase the transferability of adversarial examples, they either perform global blending or directly exchange a region in the images, which may destroy global semantic features and mislead the optimization of adversarial examples. Furthermore, their reliance on cross-entropy loss for perturbation optimization leads to gradient diminishing during iterative updates, compromising adversarial example quality. To address these limitations, we focus on non-targeted attacks and propose a novel framework via local mixing and logits optimization. First, we present a local mixing strategy to generate diverse yet semantically consistent inputs. Different from MixUp, which globally blends two images, and MixCut, which stitches images together, our method merely blends local regions to preserve global semantic information. 
Second, we adapt the logit loss from targeted attacks to non-targeted scenarios, mitigating the gradient vanishing problem of cross-entropy loss. Third, a perturbation smoothing loss is applied to suppress high-frequency noise and enhance transferability. Extensive experiments on FGSCR-42 and MTARSI datasets demonstrate superior performance over 12 state-of-the-art methods across 6 surrogate models. Notably, with ResNet as the surrogate on MTARSI, our method achieves a 17.28% average improvement in black-box attack success rate.
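
The contrast between global blending and the local mixing strategy described above can be illustrated with a small sketch. The window position, window size, and blending weight `lam` are hypothetical parameters chosen for illustration; the abstract does not specify how the paper selects regions or weights.

```python
def local_mix(image, other, top, left, size, lam):
    """Blend only a size x size window of `image` with `other`.

    Pixels outside the window are copied unchanged, so the global
    content of `image` is preserved (unlike a MixUp-style global blend).
    """
    mixed = [row[:] for row in image]  # copy so the input stays untouched
    for r in range(top, top + size):
        for c in range(left, left + size):
            mixed[r][c] = lam * image[r][c] + (1.0 - lam) * other[r][c]
    return mixed


image = [[1.0] * 4 for _ in range(4)]
other = [[0.0] * 4 for _ in range(4)]
mixed = local_mix(image, other, top=1, left=1, size=2, lam=0.5)
for row in mixed:
    print(row)
# [1.0, 1.0, 1.0, 1.0]
# [1.0, 0.5, 0.5, 1.0]
# [1.0, 0.5, 0.5, 1.0]
# [1.0, 1.0, 1.0, 1.0]
```

Only the central 2x2 window changes; a global blend with the same `lam` would have set every pixel to 0.5.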

[77] MVAT: Multi-View Aware Teacher for Weakly Supervised 3D Object Detection

Saad Lahlali,Alexandre Fournier Montgieux,Nicolas Granger,Hervé Le Borgne,Quoc Cuong Pham

Main category: cs.CV

TL;DR: MVAT addresses the challenges of weakly supervised 3D object detection by leveraging temporal multi-view data and a Teacher-Student paradigm, achieving state-of-the-art results without requiring 3D box annotations.

Details

Motivation: Annotating 3D data is costly and serves as a bottleneck for 3D object detection. Existing weakly supervised methods relying on 2D box annotations face challenges due to projection ambiguities and partial object visibility. Method: MVAT uses a Teacher-Student distillation paradigm that leverages temporal multi-view data to generate high-quality pseudo-labels for 3D object detection. The framework also incorporates a multi-view 2D projection loss for consistency between predicted 3D boxes and 2D annotations. Result: Experiments on the nuScenes and Waymo Open datasets demonstrate that MVAT achieves excellent performance in weakly supervised 3D object detection, coming close to fully supervised methods without using 3D box annotations. Conclusion: MVAT achieves state-of-the-art performance for weakly supervised 3D object detection, significantly narrowing the gap with fully supervised methods without requiring any 3D box annotations. Abstract: Annotating 3D data remains a costly bottleneck for 3D object detection, motivating the development of weakly supervised annotation methods that rely on more accessible 2D box annotations. However, relying solely on 2D boxes introduces projection ambiguities since a single 2D box can correspond to multiple valid 3D poses. Furthermore, partial object visibility under a single viewpoint setting makes accurate 3D box estimation difficult. We propose MVAT, a novel framework that leverages temporal multi-view present in sequential data to address these challenges. Our approach aggregates object-centric point clouds across time to build 3D object representations as dense and complete as possible. A Teacher-Student distillation paradigm is employed: The Teacher network learns from single viewpoints but targets are derived from temporally aggregated static objects. Then the Teacher generates high quality pseudo-labels that the Student learns to predict from a single viewpoint for both static and moving objects. 
The whole framework incorporates a multi-view 2D projection loss to enforce consistency between predicted 3D boxes and all available 2D annotations. Experiments on the nuScenes and Waymo Open datasets demonstrate that MVAT achieves state-of-the-art performance for weakly supervised 3D object detection, significantly narrowing the gap with fully supervised methods without requiring any 3D box annotations. Our code is available in our public repository: https://github.com/CEA-LIST/MVAT
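
A projection-consistency term of the kind described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's loss: boxes are axis-aligned in the camera frame (no yaw), the camera is an ideal pinhole with hypothetical intrinsics, and the penalty is a mean absolute error averaged over views.

```python
from itertools import product


def box_corners(center, size):
    """Eight corners of an axis-aligned 3D box in the camera frame (z forward)."""
    cx, cy, cz = center
    w, h, l = size
    return [(cx + sx * w / 2, cy + sy * h / 2, cz + sz * l / 2)
            for sx, sy, sz in product((-1, 1), repeat=3)]


def project_to_2d_box(corners, focal, cx0, cy0):
    """Pinhole projection of the corners, then the tight enclosing 2D box."""
    us = [focal * x / z + cx0 for x, _, z in corners]
    vs = [focal * y / z + cy0 for _, y, z in corners]
    return (min(us), min(vs), max(us), max(vs))


def projection_loss(pred_box2d, gt_box2d):
    """Mean absolute difference between projected and annotated box coordinates."""
    return sum(abs(p - g) for p, g in zip(pred_box2d, gt_box2d)) / 4.0


def multi_view_loss(views, focal, cx0, cy0):
    """Average the single-view loss over (corners_in_camera_frame, gt_box) pairs."""
    losses = [projection_loss(project_to_2d_box(c, focal, cx0, cy0), gt)
              for c, gt in views]
    return sum(losses) / len(losses)


# The same predicted box seen from two frames (different distances to the camera).
view_a = (box_corners((0.0, 0.0, 10.0), (2.0, 2.0, 4.0)), (305.5, 227.5, 332.5, 254.5))
view_b = (box_corners((0.0, 0.0, 6.0), (2.0, 2.0, 4.0)), (295.0, 215.0, 345.0, 265.0))
loss = multi_view_loss([view_a, view_b], focal=100.0, cx0=320.0, cy0=240.0)
print(loss)  # 0.5
```

In the first view the projected box is (307.5, 227.5, 332.5, 252.5), which is 2 pixels off on two edges; in the second view it matches the annotation exactly, so the averaged loss is 0.5.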

[78] EHWGesture -- A dataset for multimodal understanding of clinical gestures

Gianluca Amprimo,Alberto Ancilotto,Alessandro Savino,Fabio Quazzolo,Claudia Ferraris,Gabriella Olmo,Elisabetta Farella,Stefano Di Carlo

Main category: cs.CV

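TL;DR: This paper introduces EHWGesture, a new multimodal video dataset for gesture understanding, with a particular focus on dynamic gesture understanding and action quality assessment in clinical applications.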

Details

Motivation: Dynamic gesture understanding remains challenging because of complex spatiotemporal variations, and existing datasets often lack multimodal and multi-view diversity, precise ground-truth tracking, and an action quality component embedded within gestures. Method: The paper introduces EHWGesture, a multimodal video dataset covering five clinically relevant gestures, recorded with two high-resolution RGB-Depth cameras and an event camera. Result: Baseline experiments highlight the dataset's potential for gesture classification, gesture trigger detection, and action quality assessment. Conclusion: EHWGesture can serve as a comprehensive benchmark for advancing multimodal clinical gesture understanding. Abstract: Hand gesture understanding is essential for several applications in human-computer interaction, including automatic clinical assessment of hand dexterity. While deep learning has advanced static gesture recognition, dynamic gesture understanding remains challenging due to complex spatiotemporal variations. Moreover, existing datasets often lack multimodal and multi-view diversity, precise ground-truth tracking, and an action quality component embedded within gestures. This paper introduces EHWGesture, a multimodal video dataset for gesture understanding featuring five clinically relevant gestures. It includes over 1,100 recordings (6 hours), captured from 25 healthy subjects using two high-resolution RGB-Depth cameras and an event camera. A motion capture system provides precise ground-truth hand landmark tracking, and all devices are spatially calibrated and synchronized to ensure cross-modal alignment. Moreover, to embed an action quality task within gesture understanding, collected recordings are organized in classes of execution speed that mirror clinical evaluations of hand dexterity. Baseline experiments highlight the dataset's potential for gesture classification, gesture trigger detection, and action quality assessment. Thus, EHWGesture can serve as a comprehensive benchmark for advancing multimodal clinical gesture understanding.

[79] Universal Few-Shot Spatial Control for Diffusion Models

Kiet T. Nguyen,Chanhuyk Lee,Donggyun Kim,Dong Hoon Lee,Seunghoon Hong

Main category: cs.CV

TL;DR: Universal Few-Shot Control (UFC) enables efficient adaptation of text-to-image diffusion models to new spatial control tasks with minimal data, achieving strong performance across various architectures.

Details

Motivation: Existing spatial control adapters in text-to-image diffusion models are limited in adaptability and expensive to train when applied to novel spatial conditions, necessitating a more efficient and versatile solution. Method: UFC uses a few image-condition pairs to construct task-specific control features through a matching mechanism and a small set of task-specific parameter updates, enabling generalization to novel spatial conditions. Result: UFC achieves fine-grained spatial control with only 30 annotated examples of a novel task and performs competitively with fully supervised baselines using just 0.1% of the full training data, and it works effectively across different diffusion architectures like UNet and DiT. Conclusion: UFC is an efficient and adaptable few-shot control adapter for spatial conditioning in text-to-image diffusion models, capable of achieving performance comparable to fully supervised methods with minimal training data and applicable across different diffusion architectures. Abstract: Spatial conditioning in pretrained text-to-image diffusion models has significantly improved fine-grained control over the structure of generated images. However, existing control adapters exhibit limited adaptability and incur high training costs when encountering novel spatial control conditions that differ substantially from the training tasks. To address this limitation, we propose Universal Few-Shot Control (UFC), a versatile few-shot control adapter capable of generalizing to novel spatial conditions. Given a few image-condition pairs of an unseen task and a query condition, UFC leverages the analogy between query and support conditions to construct task-specific control features, instantiated by a matching mechanism and an update on a small set of task-specific parameters. 
Experiments on six novel spatial control tasks show that UFC, fine-tuned with only 30 annotated examples of novel tasks, achieves fine-grained control consistent with the spatial conditions. Notably, when fine-tuned with 0.1% of the full training data, UFC achieves competitive performance with the fully supervised baselines in various control tasks. We also show that UFC is applicable agnostically to various diffusion backbones and demonstrate its effectiveness on both UNet and DiT architectures. Code is available at https://github.com/kietngt00/UFC.

[80] HU-based Foreground Masking for 3D Medical Masked Image Modeling

Jin Lee,Vu Dang,Gwang-Hyun Yu,Anh Le,Zahid Rahman,Jin-Ho Jang,Heonzoo Lee,Kun-Yung Kim,Jin-Sul Kim,Jin-Young Kim

Main category: cs.CV

TL;DR: The paper introduces HU-based Foreground Masking in MIM for 3D medical images, improving segmentation performance across multiple datasets by focusing on anatomically meaningful regions.

Details

Motivation: The motivation is to overcome the limitations of random masking in 3D medical image computing, which overlooks the density of anatomical objects, by adopting a more targeted masking strategy. Method: The method involves enhancing the pretext task in MIM using HU-based Foreground Masking, which focuses on the intensity distribution of visceral organs and excludes non-tissue regions. Result: The experiments showed consistent improvement in segmentation quality across five public 3D medical imaging datasets, with Dice scores of BTCV: 84.64%, Flare22: 92.43%, MM-WHS: 90.67%, Amos22: 88.64%, and BraTS: 78.55%. Conclusion: The paper concludes that domain-centric MIM with HU-based Foreground Masking significantly enhances the performance of medical image segmentation, highlighting a promising direction for representation learning in this field. Abstract: While Masked Image Modeling (MIM) has revolutionized fields of computer vision, its adoption in 3D medical image computing has been limited by the use of random masking, which overlooks the density of anatomical objects. To address this limitation, we enhance the pretext task with a simple yet effective masking strategy. Leveraging Hounsfield Unit (HU) measurements, we implement an HU-based Foreground Masking, which focuses on the intensity distribution of visceral organs and excludes non-tissue regions, such as air and fluid, that lack diagnostically meaningful features. Extensive experiments on five public 3D medical imaging datasets demonstrate that our masking consistently improves performance, both in quality of segmentation and Dice score (BTCV: 84.64%, Flare22: 92.43%, MM-WHS: 90.67%, Amos22: 88.64%, BraTS: 78.55%). These results underscore the importance of domain-centric MIM and suggest a promising direction for representation learning in medical image segmentation. Implementation is available at github.com/AISeedHub/SubFore/.
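
The idea of restricting masked patches to an HU-defined foreground can be sketched as below. The HU window (`lo`, `hi`), the per-patch mean as the foreground test, and the mask ratio are all assumptions made for illustration; the abstract does not give the thresholds or the patch-level rule the authors use.

```python
import random


def foreground_patches(patch_means_hu, lo, hi):
    """Indices of patches whose mean HU falls inside the assumed [lo, hi] window."""
    return [i for i, m in enumerate(patch_means_hu) if lo <= m <= hi]


def sample_mask(patch_means_hu, mask_ratio, lo, hi, seed=0):
    """Choose patches to mask, drawing only from foreground patches."""
    fg = foreground_patches(patch_means_hu, lo, hi)
    k = min(len(fg), round(mask_ratio * len(patch_means_hu)))
    return sorted(random.Random(seed).sample(fg, k))


# Mean HU per patch: air is near -1000 and water near 0 on the Hounsfield scale.
patch_means = [-1000.0, -950.0, 40.0, 60.0, 55.0, 5.0, -990.0, 70.0]
fg = foreground_patches(patch_means, lo=20.0, hi=300.0)
masked = sample_mask(patch_means, mask_ratio=0.25, lo=20.0, hi=300.0)
print(fg)      # [2, 3, 4, 7]
print(masked)  # two indices drawn from fg
```

With random masking, three of the eight patches here would be wasted on air; restricting the draw to `fg` spends the whole masking budget on tissue-like regions.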

[81] TextlessRAG: End-to-End Visual Document RAG by Speech Without Text

Peijin Xie,Shun Qian,Bingquan Liu,Dexin Wang,Lin Sun,Xiangzheng Zhang

Main category: cs.CV

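TL;DR: This paper proposes TextlessRAG, an end-to-end framework for speech-based question answering over large-scale document images that requires no ASR, TTS, or OCR: it interprets speech directly, retrieves relevant visual knowledge, and generates answers with the help of a reranking mechanism.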

Details

Motivation: Document images contain a wealth of knowledge, and the portability of spoken queries enables broader and more flexible applications. However, no prior work has explored knowledge base question answering over visual document images with queries given directly in speech. Method: The paper proposes TextlessRAG, the first end-to-end framework for speech-based question answering over large-scale document images. TextlessRAG eliminates ASR, TTS, and OCR; it interprets speech directly, retrieves relevant visual knowledge, and generates answers with the help of a reranking mechanism. Result: Experiments show substantial improvements in both efficiency and accuracy. The authors also release a bilingual (Chinese and English) speech-document RAG dataset together with the framework code. Conclusion: TextlessRAG is an effective and efficient approach to speech-driven question answering over document images, and future research can build on it. Abstract: Document images encapsulate a wealth of knowledge, while the portability of spoken queries enables broader and flexible application scenarios. Yet, no prior work has explored knowledge base question answering over visual document images with queries provided directly in speech. We propose TextlessRAG, the first end-to-end framework for speech-based question answering over large-scale document images. Unlike prior methods, TextlessRAG eliminates ASR, TTS and OCR, directly interpreting speech, retrieving relevant visual knowledge, and generating answers in a fully textless pipeline. To further boost performance, we integrate a layout-aware reranking mechanism to refine retrieval. Experiments demonstrate substantial improvements in both efficiency and accuracy. To advance research in this direction, we also release the first bilingual speech--document RAG dataset, featuring Chinese and English voice queries paired with multimodal document content. Both the dataset and our pipeline will be made available at https://github.com/xiepeijinhit-hue/textlessrag

[82] PanoLAM: Large Avatar Model for Gaussian Full-Head Synthesis from One-shot Unposed Image

Peng Li,Yisheng He,Yingdong Hu,Yuan Dong,Weihao Yuan,Yuan Liu,Zilong Dong,Yike Guo

Main category: cs.CV

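TL;DR: This paper proposes an efficient Gaussian full-head synthesis framework that quickly generates a high-quality 3D head avatar from a single image without a costly optimization process.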

Details

Motivation: To remove the reliance of earlier methods on time-consuming GAN inversion and test-time optimization, and to mitigate the shortage of large-scale 3D head data. Method: The paper proposes a feed-forward framework for Gaussian full-head synthesis from a single unposed image, combining a coarse-to-fine generation pipeline with a dual-branch structure to improve reconstruction. Result: The method completes reconstruction in a single forward pass with fast inference, and achieves high-quality generation while training only on synthetic data. Conclusion: Experimental results indicate that the framework is effective and enables efficient, high-quality Gaussian full-head generation. Abstract: We present a feed-forward framework for Gaussian full-head synthesis from a single unposed image. Unlike previous work that relies on time-consuming GAN inversion and test-time optimization, our framework can reconstruct the Gaussian full-head model given a single unposed image in a single forward pass. This enables fast reconstruction and rendering during inference. To mitigate the lack of large-scale 3D head assets, we propose a large-scale synthetic dataset from trained 3D GANs and train our framework using only synthetic data. For efficient high-fidelity generation, we introduce a coarse-to-fine Gaussian head generation pipeline, where sparse points from the FLAME model interact with the image features by transformer blocks for feature extraction and coarse shape reconstruction, which are then densified for high-fidelity reconstruction. To fully leverage the prior knowledge residing in pretrained 3D GANs for effective reconstruction, we propose a dual-branch framework that effectively aggregates the structured spherical triplane feature and unstructured point-based features for more effective Gaussian head reconstruction. Experimental results show the effectiveness of our framework compared with existing work.

[83] Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images

Boammani Aser Lompo,Marc Haraoui

Main category: cs.CV

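TL;DR: Visual-TableQA is a large-scale, open-domain multimodal dataset designed to evaluate and enhance the visual reasoning of vision-language models over complex tabular data.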

Details

Motivation: Existing benchmarks remain limited in scale, diversity, or reasoning depth, especially for rendered table images. Method: The study uses a modular, scalable, and fully autonomous generation pipeline in which multiple reasoning LLMs collaborate across distinct roles (generation, validation, and inspiration). Multi-model collaborative data generation is achieved through cross-model prompting ('inspiration') and LLM-jury filtering. Result: Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, produced at a cost of under USD 100. Empirical results show that models fine-tuned on Visual-TableQA generalize robustly to external benchmarks and outperform several proprietary models, even though the dataset is synthetic. Conclusion: Visual-TableQA offers an effective way to evaluate and enhance the visual reasoning of vision-language models over complex tabular data, and its pipeline promotes dataset diversity and creativity. Abstract: Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially when it comes to rendered table images. Addressing this gap, we introduce Visual-TableQA, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is modular, scalable, and fully autonomous, involving multiple reasoning LLMs collaborating across distinct roles: generation, validation, and inspiration. Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100. To promote diversity and creativity, our pipeline performs multi-model collaborative data generation via cross-model prompting ('inspiration') and LLM-jury filtering. Stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on Visual-TableQA generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset's synthetic nature. The full pipeline and resources are publicly available at https://github.com/AI-4-Everyone/Visual-TableQA.

[84] Attention Maps in 3D Shape Classification for Dental Stage Estimation with Class Node Graph Attention Networks

Barkin Buyukcakir,Rocharles Cavalcante Fontenele,Reinhilde Jacobs,Jannick De Tobel,Patrick Thevissen,Dirk Vandermeulen,Peter Claes

Main category: cs.CV

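TL;DR: This paper proposes CGAT, an interpretable deep learning architecture for 3D shape recognition that uses attention visualization to build trust in model decisions, targeting high-stakes applications such as medicine and forensics.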

Details

Motivation: The black-box nature of deep learning models hinders their adoption in high-stakes domains, so a transparent model with good performance is needed to support trust and accountability. Method: The paper proposes the Class Node Graph Attention Network (CGAT), which uses graph attention convolutions and an attention mechanism to classify 3D shapes (such as third molars) and explains its decisions through attention visualization. Result: Combining local mean curvature and distance to centroid as node features, the model achieved a weighted F1 score of 0.76 and produced more comprehensive attention visualizations. Models with directed edges to a global CLS node produced more intuitive attention maps while also yielding desirable classification performance. Conclusion: CGAT generates human-understandable attention maps while maintaining good classification performance, which improves interpretability and supports the adoption of deep learning in high-stakes settings. Abstract: Deep learning offers a promising avenue for automating many recognition tasks in fields such as medicine and forensics. However, the black-box nature of these models hinders their adoption in high-stakes applications where trust and accountability are required. For 3D shape recognition tasks in particular, this paper introduces the Class Node Graph Attention Network (CGAT) architecture to address this need. Applied to 3D meshes of third molars derived from CBCT images, for Demirjian stage allocation, CGAT utilizes graph attention convolutions and an inherent attention mechanism, visualized via attention rollout, to explain its decision-making process. We evaluated the local mean curvature and distance to centroid node features, both individually and in combination, as well as model depth, finding that models incorporating directed edges to a global CLS node produced more intuitive attention maps, while also yielding desirable classification performance. We analyzed the attention-based explanations of the models, and their predictive performances to propose optimal settings for the CGAT. The combination of local mean curvature and distance to centroid as node features yielded a slight performance increase with 0.76 weighted F1 score, and more comprehensive attention visualizations. The CGAT architecture's ability to generate human-understandable attention maps can enhance trust and facilitate expert validation of model decisions. 
While demonstrated on dental data, CGAT is broadly applicable to graph-based classification and regression tasks, promoting wider adoption of transparent and competitive deep learning models in high-stakes environments.

[85] Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Xin Lai,Junyi Li,Wei Li,Tao Liu,Tianjian Li,Hengshuang Zhao

Main category: cs.CV

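TL;DR: Mini-o3 is a system capable of multi-turn reasoning that addresses the limited interaction turns and monotonous reasoning patterns of existing approaches to visual problem solving.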

Details

Motivation: Existing open-source approaches to visual problem solving show monotonous reasoning patterns and allow only a limited number of interaction turns, which makes them inadequate for difficult tasks that require trial-and-error exploration. Method: The authors construct the Visual Probe Dataset, develop an iterative data collection pipeline, and propose an over-turn masking strategy. Result: At inference time Mini-o3 naturally scales to tens of turns, accuracy improves as the number of turns increases, and the model shows rich reasoning patterns and deep thinking paths on visual search tasks. Conclusion: By scaling up tool-based interactions, Mini-o3 achieves deep multi-turn reasoning and performs strongly on challenging visual search tasks. Abstract: Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning -- spanning tens of steps -- and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.
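
The over-turn masking idea described above can be sketched as a per-trajectory loss mask. This is an illustration under stated assumptions: the REINFORCE-style objective, the trajectory fields (`turns`, `answered`, `advantage`, `logprob`), and the normalization over unmasked trajectories are hypothetical choices, since the abstract does not specify the RL algorithm or how the mask enters the loss.

```python
def over_turn_mask(trajectories, max_turns):
    """1.0 for trajectories that ended with an answer, 0.0 for those cut off at the turn limit."""
    return [0.0 if (t["turns"] >= max_turns and not t["answered"]) else 1.0
            for t in trajectories]


def masked_policy_loss(trajectories, max_turns):
    """REINFORCE-style loss averaged over unmasked trajectories only."""
    mask = over_turn_mask(trajectories, max_turns)
    total = sum(m * -t["advantage"] * t["logprob"]
                for m, t in zip(mask, trajectories))
    return total / max(sum(mask), 1.0)


trajectories = [
    {"turns": 3, "answered": True, "advantage": 1.0, "logprob": -2.0},
    # Hit the six-turn cap without answering: excluded instead of penalized.
    {"turns": 6, "answered": False, "advantage": -1.0, "logprob": -4.0},
    {"turns": 5, "answered": True, "advantage": -1.0, "logprob": -1.0},
]
mask = over_turn_mask(trajectories, max_turns=6)
loss = masked_policy_loss(trajectories, max_turns=6)
print(mask)  # [1.0, 0.0, 1.0]
print(loss)  # 0.5
```

Without the mask, the second trajectory's negative advantage would push the policy away from long explorations; zeroing its contribution leaves that behavior unpunished, which is consistent with the reported scaling to tens of turns at inference time.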

[86] Temporal Image Forensics: A Review and Critical Evaluation

Robert Jöchl,Andreas Uhl

Main category: cs.CV

TL;DR: This paper reviews temporal image forensics, highlighting challenges like content bias and the importance of Explainable AI to ensure reliable age estimation techniques.

Details

Motivation: To provide a comprehensive overview of temporal image forensics, address the issue of content bias, and evaluate the reliability of AI-based techniques in estimating image age. Method: The paper reviews existing temporal image forensics techniques, re-analyzes or re-implements previous methods, proposes a new forensic setting, and investigates neural network behavior in learning age traces. Result: Key findings include verification of sensor defect properties, evidence that some methods exploit content bias instead of age traces, and that neural networks can be easily distracted from learning accurate age indicators. Conclusion: Temporal image forensics faces challenges like content bias, and the reliability of techniques can be enhanced through Explainable AI methods. Neural networks can be easily distracted from learning actual age traces. Abstract: Temporal image forensics is the science of estimating the age of a digital image. Usually, time-dependent traces (age traces) introduced by the image acquisition pipeline are exploited for this purpose. In this review, a comprehensive overview of the field of temporal image forensics based on time-dependent traces from the image acquisition pipeline is given. This includes a detailed insight into the properties of known age traces (i.e., in-field sensor defects and sensor dust) and temporal image forensics techniques. Another key aspect of this work is to highlight the problem of content bias and to illustrate how important eXplainable Artificial Intelligence methods are to verify the reliability of temporal image forensics techniques. 
Apart from reviewing material presented in previous works, in this review: (i) a new (probably more realistic) forensic setting is proposed; (ii) the main properties (growth rate and spatial distribution) of in-field sensor defects are verified; (iii) it is shown that a method proposed to utilize in-field sensor defects for image age approximation actually exploits other traces (most likely content bias); (iv) the features learned by a neural network dating palmprint images are further investigated; (v) it is shown how easily a neural network can be distracted from learning age traces. For this purpose, previous work is analyzed, re-implemented if required and experiments are conducted.

[87] Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation

Yusuke Hirota,Ryo Hachiuma,Boyi Li,Ximing Lu,Michael Ross Boone,Boris Ivanovic,Yejin Choi,Marco Pavone,Yu-Chiang Frank Wang,Noa Garcia,Yuta Nakashima,Chao-Han Huck Yang

Main category: cs.CV

TL;DR: Spurious features in benchmarks distort gender bias evaluations in vision-language models, making current assessments unreliable.

Details

Motivation: The motivation stems from concerns about the safe deployment of vision-language foundation models and the potential for spurious correlations to distort gender bias evaluation. Method: The researchers systematically perturbed non-gender features across multiple benchmarks and vision-language models to quantify the impact on bias evaluation. Result: Minimal perturbations, such as masking 10% of objects or background blurring, were found to dramatically alter bias scores, with metrics shifting up to 175% in generative VLMs and 43% in CLIP variants. Conclusion: The study concludes that spurious features significantly affect the evaluation of gender bias in vision-language models, suggesting that current benchmarks may not reliably assess true gender bias. Abstract: Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backgrounds, we identify a critical oversight in gender bias evaluation: Do spurious features distort gender bias evaluation? To address this question, we systematically perturb non-gender features across four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and various VLMs to quantify their impact on bias evaluation. Our findings reveal that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores, shifting metrics by up to 175% in generative VLMs and 43% in CLIP variants. This suggests that current bias evaluations often reflect model responses to spurious features rather than gender bias, undermining their reliability. 
Since creating spurious feature-free benchmarks is fundamentally challenging, we recommend reporting bias metrics alongside feature-sensitivity measurements to enable a more reliable bias assessment.

[88] Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer's Disease

Fangqi Cheng,Surajit Ray,Xiaochen Yang

Main category: cs.CV

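TL;DR: This paper proposes a data-efficient fine-tuning method for 3D medical vision-language models that uses synthetic reports and auxiliary supervision from MMSE score prediction, significantly improving Alzheimer's disease diagnosis.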

Details

Motivation: Existing medical vision-language models underutilize patient metadata, lack integration of clinical diagnostic knowledge, and are of limited effectiveness on 3D medical imaging. Method: A data-efficient fine-tuning pipeline that converts structured metadata into synthetic reports to enrich the text input and adds an auxiliary token that predicts the MMSE score as additional supervision. Result: The method achieves state-of-the-art performance on two Alzheimer's disease datasets; code will be released upon publication. Conclusion: Experiments show that the proposed fine-tuning method reaches SOTA performance on Alzheimer's disease diagnosis and, with only 1,500 training images, outperforms existing methods fine-tuned on 10,000 images. Abstract: Medical vision-language models (Med-VLMs) have shown impressive results in tasks such as report generation and visual question answering, but they still face several limitations. Most notably, they underutilize patient metadata and lack integration of clinical diagnostic knowledge. Moreover, most existing models are typically trained from scratch or fine-tuned on large-scale 2D image-text pairs, requiring extensive computational resources, and their effectiveness on 3D medical imaging is often limited due to the absence of structural information. To address these gaps, we propose a data-efficient fine-tuning pipeline to adapt 3D CT-based Med-VLMs for 3D MRI and demonstrate its application in Alzheimer's disease (AD) diagnosis. Our system introduces two key innovations. First, we convert structured metadata into synthetic reports, enriching textual input for improved image-text alignment. Second, we add an auxiliary token trained to predict the mini-mental state examination (MMSE) score, a widely used clinical measure of cognitive function that correlates with AD severity. This provides additional supervision for fine-tuning. Applying lightweight prompt tuning to both image and text modalities, our approach achieves state-of-the-art performance on two AD datasets using 1,500 training images, outperforming existing methods fine-tuned on 10,000 images. Code will be released upon publication.

[89] Self-Supervised Cross-Encoder for Neurodegenerative Disease Diagnosis

Fangqi Cheng,Yingying Zhao,Xiaochen Yang

Main category: cs.CV

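TL;DR: This paper proposes a self-supervised framework that exploits the temporal continuity of longitudinal MRI to improve both diagnostic accuracy for neurodegenerative diseases and model interpretability.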

Details

Motivation: Existing deep learning methods for diagnosing neurodegenerative diseases rely on large amounts of labeled data and yield representations that lack interpretability. This paper aims to address both problems. Method: The temporal continuity in longitudinal MRI scans is used for supervision, with contrastive learning constraining the static representation and input-gradient regularization guiding the dynamic representation. Result: The method achieves superior classification accuracy on the ADNI dataset and shows zero-shot generalization on OASIS and cross-task generalization on PPMI. Conclusion: The paper proposes a new self-supervised cross-encoder framework that learns interpretable representations from longitudinal MRI scans and delivers strong classification performance and generalization. Abstract: Deep learning has shown significant potential in diagnosing neurodegenerative diseases from MRI data. However, most existing methods rely heavily on large volumes of labeled data and often yield representations that lack interpretability. To address both challenges, we propose a novel self-supervised cross-encoder framework that leverages the temporal continuity in longitudinal MRI scans for supervision. This framework disentangles learned representations into two components: a static representation, constrained by contrastive learning, which captures stable anatomical features; and a dynamic representation, guided by input-gradient regularization, which reflects temporal changes and can be effectively fine-tuned for downstream classification tasks. Experimental results on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset demonstrate that our method achieves superior classification accuracy and improved interpretability. Furthermore, the learned representations exhibit strong zero-shot generalization on the Open Access Series of Imaging Studies (OASIS) dataset and cross-task generalization on the Parkinson Progression Marker Initiative (PPMI) dataset. The code for the proposed method will be made publicly available.

[90] Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity

Sung Ju Lee,Nam Ik Cho

Main category: cs.CV

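TL;DR: This paper proposes a new semantic watermark embedding method that improves robustness and retrieval accuracy by preserving frequency integrity and reducing vulnerability to cropping attacks.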

Details

Motivation: Semantic watermarking is robust against regeneration attacks, but often suffers from degraded detection performance due to the loss of frequency integrity. Method: A new embedding method, Hermitian Symmetric Fourier Watermarking (SFW), together with a center-aware embedding strategy. Result: The method achieves state-of-the-art verification and identification performance across various attack scenarios, surpassing previous approaches. Ablation studies confirm the impact of SFW on detection capabilities, the effectiveness of center-aware embedding against cropping, and how message capacity influences identification accuracy. Conclusion: SFW is an effective framework for balancing robustness and image fidelity, addressing the trade-off inherent in semantic watermarking. Abstract: Semantic watermarking techniques for latent diffusion models (LDMs) are robust against regeneration attacks, but often suffer from detection performance degradation due to the loss of frequency integrity. To tackle this problem, we propose a novel embedding method called Hermitian Symmetric Fourier Watermarking (SFW), which maintains frequency integrity by enforcing Hermitian symmetry. Additionally, we introduce a center-aware embedding strategy that reduces the vulnerability of semantic watermarking due to cropping attacks by ensuring robust information retention. To validate our approach, we apply these techniques to existing semantic watermarking schemes, enhancing their frequency-domain structures for better robustness and retrieval accuracy. Extensive experiments demonstrate that our methods achieve state-of-the-art verification and identification performance, surpassing previous approaches across various attack scenarios. Ablation studies confirm the impact of SFW on detection capabilities, the effectiveness of the center-aware embedding against cropping, and how message capacity influences identification accuracy. Notably, our method achieves the highest detection accuracy while maintaining superior image fidelity, as evidenced by FID and CLIP scores. Conclusively, our proposed SFW is shown to be an effective framework for balancing robustness and image fidelity, addressing the inherent trade-offs in semantic watermarking. Code available at https://github.com/thomas11809/SFWMark
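
Why Hermitian symmetry matters for frequency integrity can be seen in a one-dimensional toy example. The sketch below is not the SFW method itself (which operates on the latent Fourier domain of a diffusion model); it only demonstrates the underlying property that a spectrum satisfying X[k] = conj(X[-k mod N]) has a purely real inverse transform, so nothing is lost when the imaginary part is discarded. The function names are hypothetical.

```python
import cmath


def idft(spec):
    """Naive inverse discrete Fourier transform of a list of complex values."""
    n = len(spec)
    return [
        sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def hermitian_symmetrize(spec):
    """Project a spectrum onto the Hermitian-symmetric set.

    The result Y satisfies Y[k] == conj(Y[(-k) % n]) for every k.
    """
    n = len(spec)
    return [(spec[k] + spec[(-k) % n].conjugate()) / 2 for k in range(n)]


# An arbitrary complex pattern written into the frequency domain.
pattern = [1 + 2j, 3 - 1j, 0.5j, -2 + 0j]

raw_signal = idft(pattern)                        # complex-valued in general
sym_signal = idft(hermitian_symmetrize(pattern))  # real up to rounding error
```

If the raw pattern were embedded and the spatial signal then forced to be real, the imaginary part of `raw_signal` would be thrown away and the frequency content would change. The symmetrized pattern survives that step unchanged.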

[91] Beyond Motion Cues and Structural Sparsity: Revisiting Small Moving Target Detection

Guoyi Zhang,Siyang Chen,Guangsheng Xu,Zhihua Shen,Han Wang,Xiaohu Zhang

Main category: cs.CV

TL;DR: This paper proposes TenRPCANet, a deep learning framework for small moving target detection that leverages tensor decomposition and self-attention mechanisms, achieving excellent results in challenging scenarios.

Details

Motivation: Detecting small moving targets is challenging due to low signal-to-noise ratios and cluttered backgrounds; existing methods lack robustness in complex environments. Method: The method reformulates small target detection as a tensor-based low-rank and sparse decomposition problem, using a deep neural network called TenRPCANet with a self-attention mechanism and feature refinement module. Result: The method achieves state-of-the-art performance on two challenging tasks: multi-frame infrared small target detection and space object detection. Conclusion: The proposed TenRPCANet effectively detects small moving targets by leveraging low-rank background priors and enhancing target saliency, showing strong performance and generalizability. Abstract: Small moving target detection is crucial for many defense applications but remains highly challenging due to low signal-to-noise ratios, ambiguous visual cues, and cluttered backgrounds. In this work, we propose a novel deep learning framework that differs fundamentally from existing approaches, which often rely on target-specific features or motion cues and tend to lack robustness in complex environments. Our key insight is that small target detection and background discrimination are inherently coupled: even cluttered video backgrounds often exhibit strong low-rank structures that can serve as stable priors for detection. We reformulate the task as a tensor-based low-rank and sparse decomposition problem and conduct a theoretical analysis of the background, target, and noise components to guide model design. Building on these insights, we introduce TenRPCANet, a deep neural network that requires minimal assumptions about target characteristics. Specifically, we propose a tokenization strategy that implicitly enforces multi-order tensor low-rank priors through a self-attention mechanism. 
This mechanism captures both local and non-local self-similarity to model the low-rank background without relying on explicit iterative optimization. In addition, inspired by the sparse component update in tensor RPCA, we design a feature refinement module to enhance target saliency. The proposed method achieves state-of-the-art performance on two highly distinct and challenging tasks: multi-frame infrared small target detection and space object detection. These results demonstrate both the effectiveness and the generalizability of our approach.

[92] EDFFDNet: Towards Accurate and Efficient Unsupervised Multi-Grid Image Registration

Haokai Zhu,Bo Qu,Si-Yuan Cao,Runmin Zhang,Shujie Chen,Bailin Yang,Hui-Liang Shen

Main category: cs.CV

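TL;DR: This paper proposes a new deep image registration method, EDFFDNet, and its extended version EDFFDNet-2, which use free-form deformation and sparse motion aggregation to improve registration accuracy and generalization while reducing computational cost.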

Details

Motivation: Deep image registration methods based on single homography, multi-grid homography, or thin-plate spline are limited when handling real scenes with depth disparities, so this method is proposed to improve efficiency and performance. Method: The authors propose the Exponential-Decay Free-Form Deformation Network (EDFFDNet), built on an exponential-decay basis function, and the Adaptive Sparse Motion Aggregator (ASMA), and adopt a progressive correlation refinement strategy for coarse-to-fine motion estimation. Result: Experiments show that EDFFDNet reduces parameters, memory, and total runtime by 70.5%, 32.6%, and 33.7%, respectively, while gaining 0.5 dB PSNR over the state-of-the-art method; EDFFDNet-2 further improves PSNR by 1.06 dB while keeping computational cost low. Conclusion: EDFFDNet and EDFFDNet-2 perform well on image registration, reducing computational cost while improving registration accuracy and generalizing well across datasets. Abstract: Previous deep image registration methods that employ single homography, multi-grid homography, or thin-plate spline often struggle with real scenes containing depth disparities due to their inherent limitations. To address this, we propose an Exponential-Decay Free-Form Deformation Network (EDFFDNet), which employs free-form deformation with an exponential-decay basis function. This design achieves higher efficiency and performs well in scenes with depth disparities, benefiting from its inherent locality. We also introduce an Adaptive Sparse Motion Aggregator (ASMA), which replaces the MLP motion aggregator used in previous methods. By transforming dense interactions into sparse ones, ASMA reduces parameters and improves accuracy. Additionally, we propose a progressive correlation refinement strategy that leverages global-local correlation patterns for coarse-to-fine motion estimation, further enhancing efficiency and accuracy. Experiments demonstrate that EDFFDNet reduces parameters, memory, and total runtime by 70.5%, 32.6%, and 33.7%, respectively, while achieving a 0.5 dB PSNR gain over the state-of-the-art method. With an additional local refinement stage, EDFFDNet-2 further improves PSNR by 1.06 dB while maintaining lower computational costs. Our method also demonstrates strong generalization ability across datasets, outperforming previous deep learning methods.

[93] Nearest Neighbor Projection Removal Adversarial Training

Himanshu Singh,A. V. Subramanyam,Shivank Rajput,Mohan Kankanhalli

Main category: cs.CV

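TL;DR: This paper proposes a new method that improves adversarial robustness by reducing inter-class dependencies in feature space, with strong results on multiple datasets.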

Details

Motivation: Existing adversarial training methods typically fail to explicitly address inter-class feature overlap, a significant contributor to adversarial susceptibility. Method: The method actively mitigates inter-class proximity by projecting out inter-class dependencies from adversarial and clean samples in the feature space, and introduces a logits correction. Result: Experiments show strong results on standard benchmarks including CIFAR-10, CIFAR-100, and SVHN, with good robust and clean accuracy. Conclusion: The study proposes a new adversarial training framework that improves the adversarial robustness of deep neural networks by reducing inter-class dependencies in feature space, and validates its benefit for robustness and accuracy on multiple standard benchmarks. Abstract: Deep neural networks have exhibited impressive performance in image classification tasks but remain vulnerable to adversarial examples. Standard adversarial training enhances robustness but typically fails to explicitly address inter-class feature overlap, a significant contributor to adversarial susceptibility. In this work, we introduce a novel adversarial training framework that actively mitigates inter-class proximity by projecting out inter-class dependencies from adversarial and clean samples in the feature space. Specifically, our approach first identifies the nearest inter-class neighbors for each adversarial sample and subsequently removes projections onto these neighbors to enforce stronger feature separability. Theoretically, we demonstrate that our proposed logits correction reduces the Lipschitz constant of neural networks, thereby lowering the Rademacher complexity, which directly contributes to improved generalization and robustness. Extensive experiments across standard benchmarks including CIFAR-10, CIFAR-100, and SVHN show that our method demonstrates strong performance that is competitive with leading adversarial training techniques, highlighting significant achievements in both robust and clean accuracy. Our findings reveal the importance of addressing inter-class feature proximity explicitly to bolster adversarial robustness in DNNs.
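
The "find the nearest inter-class neighbor, then remove the projection onto it" step can be sketched in a few lines. This is an illustrative simplification rather than the paper's procedure: it uses a single neighbor, Euclidean distance, and plain Python lists, and the helper names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def nearest_other_class(i, feats, labels):
    """Index of the feature closest to feats[i] among samples of other classes."""
    best, best_dist = None, float("inf")
    for j, (g, lab) in enumerate(zip(feats, labels)):
        if lab == labels[i]:
            continue  # same class (including i itself): not an inter-class neighbor
        dist = sum((x - y) ** 2 for x, y in zip(feats[i], g))
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def remove_projection(f, neighbor):
    """Return f minus its orthogonal projection onto `neighbor`."""
    norm_sq = dot(neighbor, neighbor)
    if norm_sq == 0.0:
        return list(f)  # nothing to project onto
    coeff = dot(f, neighbor) / norm_sq
    return [x - coeff * y for x, y in zip(f, neighbor)]


feats = [[1.0, 1.0], [2.0, 0.0], [0.0, 5.0]]
labels = [0, 1, 1]

j = nearest_other_class(0, feats, labels)       # index of nearest class-1 feature
cleaned = remove_projection(feats[0], feats[j])  # component along it is removed
```

After the removal, `cleaned` is orthogonal to the neighbor's feature direction, which is the sense in which the sample no longer "depends" on the competing class.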

[94] CAViAR: Critic-Augmented Video Agentic Reasoning

Sachit Menon,Ahmet Iscen,Arsha Nagrani,Tobias Weyand,Carl Vondrick,Cordelia Schmid

Main category: cs.CV

TL;DR: This paper proposes a large language model agent with a critic to improve complex video reasoning, achieving strong results on LVBench, Neptune, and ActivityNet-RTL.

Details

Motivation: To explore whether existing perception capabilities can be leveraged to successfully perform more complex video reasoning, as performance wanes for tasks requiring complex reasoning on videos according to multiple recent benchmarks. Method: Developing a large language model agent that uses video modules as subagents or tools and introducing a critic to distinguish between successful and unsuccessful sequences from the agent. Result: The agent and critic achieve strong performance on LVBench, Neptune, and ActivityNet-RTL. Conclusion: Combining the large language model agent with a critic improves complex video reasoning performance on these benchmarks. Abstract: Video understanding has seen significant progress in recent years, with models' performance on perception from short clips continuing to rise. Yet, multiple recent benchmarks, such as LVBench, Neptune, and ActivityNet-RTL, show performance wanes for tasks requiring complex reasoning on videos as queries grow more complex and videos grow longer. In this work, we ask: can existing perception capabilities be leveraged to successfully perform more complex video reasoning? In particular, we develop a large language model agent given access to video modules as subagents or tools. Rather than following a fixed procedure to solve queries as in previous work such as Visual Programming, ViperGPT, and MoReVQA, the agent uses the results of each call to a module to determine subsequent steps. Inspired by work in the textual reasoning domain, we introduce a critic to distinguish between instances of successful and unsuccessful sequences from the agent. We show that the combination of our agent and critic achieves strong performance on the previously mentioned datasets.

[95] SEEC: Segmentation-Assisted Multi-Entropy Models for Learned Lossless Image Compression

Chunhang Zheng,Zichang Ren,Dou Li

Main category: cs.CV

TL;DR: SEEC improves lossless image compression by using multiple entropy models guided by semantic segmentation, leading to better compression ratios and ROI coding support.

Details

Motivation: Existing image compression methods use a single entropy model for all pixel values, limiting their ability to capture diverse statistical characteristics across different semantic regions in images. Method: SEEC utilizes semantic segmentation to identify different regions in an image, assigns each region a specialized entropy model, and employs a multi-channel discrete logistic mixture likelihood to model pixel value distributions. Result: SEEC achieves state-of-the-art compression ratios on benchmark datasets while maintaining minimal encoding and decoding latency, and supports ROI coding based on segmentation masks. Conclusion: The proposed SEEC framework enhances lossless image compression by leveraging semantic segmentation to adapt multiple entropy models, achieving state-of-the-art compression ratios with minimal latency and supporting ROI coding. Abstract: Recently, learned image compression has attracted considerable attention due to its superior performance over traditional methods. However, most existing approaches employ a single entropy model to estimate the probability distribution of pixel values across the entire image, which limits their ability to capture the diverse statistical characteristics of different semantic regions. To overcome this limitation, we propose Segmentation-Assisted Multi-Entropy Models for Lossless Image Compression (SEEC). Our framework utilizes semantic segmentation to guide the selection and adaptation of multiple entropy models, enabling more accurate probability distribution estimation for distinct semantic regions. Specifically, SEEC first extracts image features and then applies semantic segmentation to identify different regions, each assigned a specialized entropy model to better capture its unique statistical properties. Finally, a multi-channel discrete logistic mixture likelihood is employed to model the pixel value distributions effectively. 
Experimental results on benchmark datasets demonstrate that SEEC achieves state-of-the-art compression ratios while introducing only minimal encoding and decoding latency. With superior performance, the proposed model also supports Regions of Interest (ROIs) coding conditioned on the provided segmentation mask. Our code is available at https://github.com/chunbaobao/SEEC.
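
The intuition behind region-specific entropy models can be shown with a toy calculation. The sketch below is not SEEC: it replaces the learned entropy models with empirical symbol frequencies and ignores the cost of transmitting the models and the segmentation mask. It only illustrates that the ideal code length under per-region distributions is never larger than under a single shared distribution. All names are hypothetical.

```python
import math
from collections import Counter


def ideal_bits(values):
    """Ideal code length in bits when coding `values` with their own
    empirical distribution: sum over symbols of count * log2(n / count)."""
    n = len(values)
    return sum(c * math.log2(n / c) for c in Counter(values).values())


def per_region_bits(values, mask):
    """Total ideal code length when each mask region gets its own distribution."""
    regions = {}
    for v, m in zip(values, mask):
        regions.setdefault(m, []).append(v)
    return sum(ideal_bits(vs) for vs in regions.values())


# Toy 'image': a dark region (label 0) and a bright region (label 1).
pixels = [0, 0, 0, 0, 255, 255, 255, 255]
mask = [0, 0, 0, 0, 1, 1, 1, 1]

single_model_bits = ideal_bits(pixels)            # one distribution for everything
multi_model_bits = per_region_bits(pixels, mask)  # one distribution per region
```

In this extreme case each region is constant, so the per-region code length drops to zero while the shared model still spends one bit per pixel.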

[96] XSRD-Net: EXplainable Stroke Relapse Detection

Christian Gapp,Elias Tappeiner,Martin Welk,Karl Fritscher,Stephanie Mangesius,Constantin Eisenschink,Philipp Deisl,Michael Knoflach,Astrid E. Grams,Elke R. Gizewski,Rainer Schubert

Main category: cs.CV

TL;DR: This study uses multimodal deep learning to predict stroke recurrence and survival time by combining medical imaging and patient data, showing promising results in early risk detection and clinical decision-making.

Details

Motivation: Stroke is a leading cause of death globally, with high recurrence and mortality rates. Early detection of patients at risk of stroke recurrence is crucial for effective therapy planning and reducing relapse rates. Method: The researchers collected 3D intracranial CTA image data along with tabular data including concomitant heart diseases, age, and gender of stroke patients from 2010 to 2024. They trained single- and multimodal deep learning neural networks for two tasks: binary relapse detection (Task 1) and relapse-free survival (RFS) time prediction with subsequent classification (Task 2). Result: For Task 1 (binary relapse detection), the model achieved an AUC of 0.84 using tabular data. For Task 2 (RFS time prediction), the multimodal XSRD-net model processed vision and tabular data with a modality contribution ratio of 0.68:0.32, achieving a c-index of 0.68 and an AUC of 0.71 on the test dataset. Interpretability analysis showed a link between heart diseases (tabular data) and carotid arteries (vision data) in predicting relapses. Conclusion: The study concludes that multimodal deep learning models, particularly the XSRD-net, can effectively predict stroke recurrence and relapse-free survival time by integrating 3D intracranial CTA image data with tabular patient information, highlighting a link between heart diseases and carotid arteries for stroke relapse detection. Abstract: Stroke is the second most frequent cause of death worldwide with an annual mortality of around 5.5 million. Recurrence rates of stroke are between 5 and 25% in the first year. As mortality rates for relapses are extraordinarily high (40%), it is of utmost importance to reduce the recurrence rates. We address this issue by detecting patients at risk of stroke recurrence at an early stage in order to enable appropriate therapy planning. 
To this end, we collected 3D intracranial CTA image data and recorded concomitant heart diseases, the age and the gender of stroke patients between 2010 and 2024. We trained single- and multimodal deep-learning-based neural networks for binary relapse detection (Task 1) and for relapse-free survival (RFS) time prediction together with a subsequent classification (Task 2). The separation of relapse from non-relapse patients (Task 1) could be solved with tabular data (AUC on test dataset: 0.84). However, for the main task, the regression (Task 2), our multimodal XSRD-net processed the modalities vision:tabular with 0.68:0.32 according to modality contribution measures. The c-index with respect to relapses for the multimodal model reached 0.68, and the AUC is 0.71 for the test dataset. Finally, the results of a deeper interpretability analysis could highlight a link between heart diseases (tabular) and carotid arteries (vision) for the detection of relapses and the prediction of the RFS time. This is a central outcome that we strive to strengthen with ongoing data collection and model retraining.
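
For readers unfamiliar with the c-index reported above, the following sketch computes a simplified concordance index for predicted survival times. It is a generic textbook formulation, not the evaluation code of the paper: pairs with tied observed times are skipped, tied predictions count as half-concordant, and the function name is hypothetical.

```python
def concordance_index(times, events, predicted_times):
    """Simplified concordance index for predicted survival times.

    A pair (i, j) is comparable when subject i had an observed event
    (events[i] == 1) and times[i] < times[j]. It is concordant when the
    model also predicts a shorter time for i than for j.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if predicted_times[i] < predicted_times[j]:
                    concordant += 1.0
                elif predicted_times[i] == predicted_times[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Three patients: relapse at 2 and 4 months, one censored at 6 months.
observed = [2.0, 4.0, 6.0]
relapsed = [1, 1, 0]
predicted = [1.0, 5.0, 3.0]

c_index = concordance_index(observed, relapsed, predicted)
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ranking of relapse times, so the reported 0.68 indicates a ranking that is clearly better than chance but far from perfect.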

[97] HairGS: Hair Strand Reconstruction based on 3D Gaussian Splatting

Yimin Pan,Matthias Nießner,Tobias Kirschstein

Main category: cs.CV

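TL;DR: This paper proposes a hair reconstruction method based on 3D Gaussian Splatting that achieves strand-level geometry reconstruction through a multi-stage pipeline and introduces a new evaluation metric for topological accuracy; experiments show the method is efficient and handles a wide range of hairstyles.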

Details

Motivation: Hair reconstruction is a challenging problem in computer vision and is increasingly important for applications such as virtual reality and digital human modeling. Existing methods typically evaluate reconstruction quality at the geometric level but neglect the connectivity and topology of hair strands, so a new method and evaluation metric are needed. Method: The method has three main stages: it first reconstructs detailed hair geometry with a differentiable Gaussian rasterizer, then merges individual Gaussian segments into coherent strands through a new merging scheme, and finally refines and grows the strands under photometric supervision. Result: Experiments show that the method robustly handles a wide range of hairstyles and achieves efficient reconstruction on both synthetic and real-world datasets, typically completing within one hour. The proposed evaluation metric also reflects the topological accuracy of strand reconstruction. Conclusion: The paper proposes a multi-stage pipeline based on 3D Gaussian Splatting that reconstructs strand-level hair geometry from multi-view images, and validates the method's topological accuracy with a new evaluation metric. Abstract: Human hair reconstruction is a challenging problem in computer vision, with growing importance for applications in virtual reality and digital human modeling. Recent advances in 3D Gaussian Splatting (3DGS) provide efficient and explicit scene representations that naturally align with the structure of hair strands. In this work, we extend the 3DGS framework to enable strand-level hair geometry reconstruction from multi-view images. Our multi-stage pipeline first reconstructs detailed hair geometry using a differentiable Gaussian rasterizer, then merges individual Gaussian segments into coherent strands through a novel merging scheme, and finally refines and grows the strands under photometric supervision. While existing methods typically evaluate reconstruction quality at the geometric level, they often neglect the connectivity and topology of hair strands. To address this, we propose a new evaluation metric that serves as a proxy for assessing topological accuracy in strand reconstruction. Extensive experiments on both synthetic and real-world datasets demonstrate that our method robustly handles a wide range of hairstyles and achieves efficient reconstruction, typically completing within one hour. The project page can be found at: https://yimin-pan.github.io/hair-gs/

[98] RayGaussX: Accelerating Gaussian-Based Ray Marching for Real-Time and High-Quality Novel View Synthesis

Hugo Blanc,Jean-Emmanuel Deschaud,Alexis Paljic

Main category: cs.CV

TL;DR: RayGaussX improves on RayGauss with several key techniques, substantially increasing rendering speed while maintaining high-quality rendering.

Details

Motivation: RayGauss performs well on synthetic and indoor scenes, but its computational cost is too high for real-time rendering on real-world scenes. Method: RayGaussX builds on RayGauss by adding empty-space skipping, adaptive sampling, enhanced ray coherence, scale regularization, and an improved densification criterion. Result: RayGaussX achieves 5x to 12x faster training and 50x to 80x higher rendering speed (FPS), while improving visual quality by up to +0.56 dB in PSNR. Conclusion: By accelerating both training and inference while improving visual quality, RayGaussX renders much faster than RayGauss on real-world datasets. Abstract: RayGauss has achieved state-of-the-art rendering quality for novel-view synthesis on synthetic and indoor scenes by representing radiance and density fields with irregularly distributed elliptical basis functions, rendered via volume ray casting using a Bounding Volume Hierarchy (BVH). However, its computational cost prevents real-time rendering on real-world scenes. Our approach, RayGaussX, builds on RayGauss by introducing key contributions that accelerate both training and inference. Specifically, we incorporate volumetric rendering acceleration strategies such as empty-space skipping and adaptive sampling, enhance ray coherence, and introduce scale regularization to reduce false-positive intersections. Additionally, we propose a new densification criterion that improves density distribution in distant regions, leading to enhanced graphical quality on larger scenes. As a result, RayGaussX achieves 5x to 12x faster training and 50x to 80x higher rendering speeds (FPS) on real-world datasets while improving visual quality by up to +0.56 dB in PSNR. Project page with videos and code: https://raygaussx.github.io/.

[99] Faster, Self-Supervised Super-Resolution for Anisotropic Multi-View MRI Using a Sparse Coordinate Loss

Maja Schlereth,Moritz Schillinger,Katharina Breininger

Main category: cs.CV

TL;DR: This paper introduces a self-supervised deep learning method to fuse low-resolution MR images, enhancing resolution and detail without high-resolution training data, leading to faster and more accurate reconstructions.

Details

Motivation: High-resolution MR imaging is challenging due to time and patient comfort constraints. Current methods analyzing LR scans individually are inefficient and error-prone. Method: A self-supervised multi-view neural network with a sparse coordinate-based loss was developed to fuse LR images and reconstruct high-resolution details. Result: The method achieved comparable or better SR performance than SOTA self-supervised methods across different upsampling scales with a 10x speed-up in patient-specific reconstruction. Conclusion: The proposed method effectively fuses two orthogonal anisotropic LR MR images for improved SR performance without requiring HR data, offering faster patient-specific reconstruction. Abstract: Acquiring images in high resolution is often a challenging task. Especially in the medical sector, image quality has to be balanced with acquisition time and patient comfort. To strike a compromise between scan time and quality for Magnetic Resonance (MR) imaging, two anisotropic scans with different low-resolution (LR) orientations can be acquired. Typically, LR scans are analyzed individually by radiologists, which is time consuming and can lead to inaccurate interpretation. To tackle this, we propose a novel approach for fusing two orthogonal anisotropic LR MR images to reconstruct anatomical details in a unified representation. Our multi-view neural network is trained in a self-supervised manner, without requiring corresponding high-resolution (HR) data. To optimize the model, we introduce a sparse coordinate-based loss, enabling the integration of LR images with arbitrary scaling. We evaluate our method on MR images from two independent cohorts. Our results demonstrate comparable or even improved super-resolution (SR) performance compared to state-of-the-art (SOTA) self-supervised SR methods for different upsampling scales. 
By combining a patient-agnostic offline and a patient-specific online phase, we achieve a substantial speed-up of up to ten times for patient-specific reconstruction while achieving similar or better SR quality. Code is available at https://github.com/MajaSchle/tripleSR.

[100] SplatFill: 3D Scene Inpainting via Depth-Guided Gaussian Splatting

Mahtab Dahaghin,Milind G. Padalkar,Matteo Toso,Alessio Del Bue

Main category: cs.CV

TL;DR: SplatFill is a depth-guided approach for 3D Gaussian Splatting scene inpainting that improves perceptual quality and efficiency.

Details

Motivation: Inpainting missing regions in 3DGS remains challenging, often leading to blurry details, artifacts, and inconsistent geometry. Method: SplatFill combines joint depth-based and object-based supervision with a consistency-aware refinement scheme. Result: Evaluations on the SPIn-NeRF dataset show SplatFill surpasses existing methods in visual fidelity and reduces training time by 24.5%. Conclusion: SplatFill provides a novel depth-guided approach for 3DGS scene inpainting that outperforms existing methods in visual fidelity and reduces training time. Abstract: 3D Gaussian Splatting (3DGS) has enabled the creation of highly realistic 3D scene representations from sets of multi-view images. However, inpainting missing regions, whether due to occlusion or scene editing, remains a challenging task, often leading to blurry details, artifacts, and inconsistent geometry. In this work, we introduce SplatFill, a novel depth-guided approach for 3DGS scene inpainting that achieves state-of-the-art perceptual quality and improved efficiency. Our method combines two key ideas: (1) joint depth-based and object-based supervision to ensure inpainted Gaussians are accurately placed in 3D space and aligned with surrounding geometry, and (2) a consistency-aware refinement scheme that selectively identifies and corrects inconsistent regions without disrupting the rest of the scene. Evaluations on the SPIn-NeRF dataset demonstrate that SplatFill not only surpasses existing NeRF-based and 3DGS-based inpainting methods in visual fidelity but also reduces training time by 24.5%. Qualitative results show our method delivers sharper details, fewer artifacts, and greater coherence across challenging viewpoints.

[101] Point Linguist Model: Segment Any Object via Bridged Large 3D-Language Model

Zhuoxu Huang,Mingqi Gao,Jungong Han

Main category: cs.CV

TL;DR: The paper introduces PLM, a framework that aligns LLM semantics with 3D point clouds for improved 3D object segmentation without requiring large-scale pre-alignment.

Details

Motivation: 3D object segmentation with LLMs is limited by misalignment between high-level semantic tokens and dense geometric point clouds, affecting both input and output stages. Method: The Point Linguist Model (PLM) uses Object-centric Discriminative Representation (OcDR) and a Geometric Reactivation Decoder (GRD) to align semantic tokens with dense 3D features. Result: PLM achieves significant improvements, with +7.3 mIoU on ScanNetv2 and +6.0 mIoU on Multi3DRefer, showing consistent gains across 7 benchmarks. Conclusion: PLM effectively bridges the representation gap between LLMs and 3D point clouds, enabling robust 3D understanding without large-scale pre-alignment. Abstract: 3D object segmentation with Large Language Models (LLMs) has become a prevailing paradigm due to its broad semantics, task flexibility, and strong generalization. However, this paradigm is hindered by representation misalignment: LLMs process high-level semantic tokens, whereas 3D point clouds convey only dense geometric structures. In prior methods, misalignment limits both input and output. At the input stage, dense point patches require heavy pre-alignment, weakening object-level semantics and confusing similar distractors. At the output stage, predictions depend only on dense features without explicit geometric cues, leading to a loss of fine-grained accuracy. To address these limitations, we present the Point Linguist Model (PLM), a general framework that bridges the representation gap between LLMs and dense 3D point clouds without requiring large-scale pre-alignment between 3D-text or 3D-images. Specifically, we introduce Object-centric Discriminative Representation (OcDR), which learns object-centric tokens that capture target semantics and scene relations under a hard negative-aware training objective. This mitigates the misalignment between LLM tokens and 3D points, enhances resilience to distractors, and facilitates semantic-level reasoning within LLMs. 
For accurate segmentation, we introduce the Geometric Reactivation Decoder (GRD), which predicts masks by combining OcDR tokens carrying LLM-inferred geometry with corresponding dense features, preserving comprehensive dense features throughout the pipeline. Extensive experiments show that PLM achieves significant improvements of +7.3 mIoU on ScanNetv2 and +6.0 mIoU on Multi3DRefer for 3D referring segmentation, with consistent gains across 7 benchmarks spanning 4 different tasks, demonstrating the effectiveness of comprehensive object-centric reasoning for robust 3D understanding.

[102] Deep Learning-Based Burned Area Mapping Using Bi-Temporal Siamese Networks and AlphaEarth Foundation Datasets

Seyd Teymoor Seydi

Main category: cs.CV

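TL;DR: This study proposes a new automated burned area mapping method based on the AlphaEarth dataset and a Siamese U-Net architecture, reporting high accuracy and good generalization.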

Details

Motivation: Accurate and timely burned area mapping is crucial for environmental monitoring, disaster management, and climate change assessment, but traditional methods have limitations, so a more efficient automated approach is needed. Method: The AlphaEarth dataset is combined with a Siamese U-Net deep learning model, trained on the MTBS dataset and evaluated on 17 regions across Europe. Result: The proposed method reaches an overall accuracy of 95%, an IoU of 0.6, and an F1-score of 74% on the test dataset. It identifies burned areas against complex backgrounds and performs particularly well on partially burned vegetation and fire boundaries. Conclusion: Using the AlphaEarth dataset and a Siamese U-Net deep learning architecture, the study develops an efficient automated burned area mapping method that can advance global burned area monitoring. Abstract: Accurate and timely mapping of burned areas is crucial for environmental monitoring, disaster management, and assessment of climate change. This study presents a novel approach to automated burned area mapping using the AlphaEarth dataset combined with the Siamese U-Net deep learning architecture. The AlphaEarth dataset, comprising high-resolution optical and thermal infrared imagery with comprehensive ground-truth annotations, provides an unprecedented resource for training robust burned area detection models. We trained our model with the Monitoring Trends in Burn Severity (MTBS) dataset in the contiguous US and evaluated it on 17 regions across Europe. Our experimental results demonstrate that the proposed ensemble approach achieves superior performance with an overall accuracy of 95%, IoU of 0.6, and F1-score of 74% on the test dataset. The model successfully identifies burned areas across diverse ecosystems with complex backgrounds, showing particular strength in detecting partially burned vegetation and fire boundaries, and demonstrating transferability and strong generalization in burned area mapping. This research contributes to the advancement of automated fire damage assessment and provides a scalable solution for global burn area monitoring using the AlphaEarth dataset.
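
The reported IoU and F1-score can be checked against each other. For a single binary class, when both metrics are computed from the same pooled pixel counts, they satisfy F1 = 2·IoU / (1 + IoU). The sketch below is not from the paper: the function names and the pixel counts are made up for illustration, chosen so that the IoU comes out at 0.6.

```python
def iou(tp: int, fp: int, fn: int) -> float:
    """Intersection over union for the positive (burned) class."""
    return tp / (tp + fp + fn)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1-score (Dice coefficient) for the positive class."""
    return 2 * tp / (2 * tp + fp + fn)


def f1_from_iou(j: float) -> float:
    """Convert IoU to F1 when both come from the same confusion counts."""
    return 2 * j / (1 + j)


# Hypothetical pixel counts, not taken from the paper.
tp, fp, fn = 60, 25, 15

print(iou(tp, fp, fn))                 # IoU from counts
print(f1(tp, fp, fn))                  # F1 from counts
print(f1_from_iou(iou(tp, fp, fn)))    # F1 derived from IoU
```

An IoU of 0.6 corresponds to an F1 of 0.75 under this identity, which is close to the reported 74% given that the IoU is quoted to one decimal place. The identity no longer holds exactly if the metrics are averaged per image or per region instead of pooled.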

[103] D-LEAF: Localizing and Correcting Hallucinations in Multimodal LLMs via Layer-to-head Attention Diagnostics

Tiancheng Yang,Lin Zhang,Jiaye Lin,Guimin Hu,Di Wang,Lijie Hu

Main category: cs.CV

TL;DR: This paper proposes D-LEAF, a new method to detect and correct hallucinations in MLLMs by dynamically localizing errors using Layer Image Attention Entropy (LIAE) and Image Attention Focus (IAF), resulting in significant performance improvements while maintaining efficiency.

Details

Motivation: Multimodal Large Language Models (MLLMs) suffer from hallucinations due to insufficient visual attention. Existing attention-based correction methods apply uniform adjustments, making it hard to identify the source of errors. This work aims to address this limitation by localizing and correcting errors more precisely. Method: The authors introduce two diagnostics, Layer Image Attention Entropy (LIAE) and Image Attention Focus (IAF), to identify problematic layers and attention heads in MLLMs. Based on these diagnostics, they propose a task-agnostic correction method called Dynamic Layer-wise Entropy and Attention Fusion (D-LEAF). Result: D-LEAF achieves a 53% relative improvement on standard captioning benchmarks and improves both accuracy and F1-score by approximately 4% on VQA, significantly suppressing hallucinations with negligible overhead. Conclusion: The proposed D-LEAF method effectively localizes and corrects hallucinations in MLLMs during inference, achieving significant improvements in performance while maintaining efficiency. Abstract: Multimodal Large Language Models (MLLMs) achieve strong performance on tasks like image captioning and visual question answering, but remain prone to hallucinations, where generated text conflicts with the visual input. Prior work links this partly to insufficient visual attention, but existing attention-based detectors and mitigation typically apply uniform adjustments across layers and heads, obscuring where errors originate. In this paper, we first show these methods fail to accurately localize problematic layers. Then, we introduce two diagnostics: Layer Image Attention Entropy (LIAE) which flags anomalous layers, and Image Attention Focus (IAF) which scores attention heads within those layers. Analysis shows that LIAE pinpoints faulty layers and IAF reliably ranks heads that warrant correction. 
Guided by these signals, we propose Dynamic Layer-wise Entropy and Attention Fusion (D-LEAF), a task-agnostic, attention-guided method that dynamically localizes and corrects errors during inference with negligible overhead. Results show our D-LEAF delivers a 53% relative improvement on standard captioning benchmarks, and on VQA both accuracy and F1-score improve by approximately 4%, substantially suppressing hallucinations while preserving efficiency.
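
The abstract does not give the formula for Layer Image Attention Entropy, so the following is only a guess at the kind of quantity involved: the Shannon entropy of a head's attention restricted to image tokens, averaged over the heads of a layer. The function names, the renormalization step, and the toy attention rows are all assumptions for illustration.

```python
import math
from typing import List, Sequence


def image_attention_entropy(attn: Sequence[float], image_idx: Sequence[int]) -> float:
    """Shannon entropy (nats) of one head's attention over image tokens.

    The attention mass on image tokens is renormalized to sum to 1 first;
    this renormalization is an assumption, not the paper's definition.
    """
    weights = [attn[i] for i in image_idx]
    total = sum(weights)
    if total == 0:
        return 0.0
    entropy = 0.0
    for w in weights:
        p = w / total
        if p > 0:
            entropy -= p * math.log(p)
    return entropy


def layer_entropy(heads: List[Sequence[float]], image_idx: Sequence[int]) -> float:
    """Average the per-head entropy over all heads in a layer."""
    return sum(image_attention_entropy(h, image_idx) for h in heads) / len(heads)


# Toy example: tokens 0..3 are image tokens, token 4 is text.
image_tokens = [0, 1, 2, 3]
diffuse_head = [0.2, 0.2, 0.2, 0.2, 0.2]   # spread evenly over the image
focused_head = [0.8, 0.0, 0.0, 0.0, 0.2]   # locked onto one image token

print(image_attention_entropy(diffuse_head, image_tokens))
print(image_attention_entropy(focused_head, image_tokens))
print(layer_entropy([diffuse_head, focused_head], image_tokens))
```

A diffuse head gives the maximum entropy for four tokens and a fully focused head gives zero, so a per-layer average of this kind could in principle flag layers whose image attention looks unusual.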

[104] Active Membership Inference Test (aMINT): Enhancing Model Auditability with Multi-Task Learning

Daniel DeAlcala,Aythami Morales,Julian Fierrez,Gonzalo Mancera,Ruben Tolosana,Javier Ortega-Garcia

Main category: cs.CV

TL;DR: Active MINT is a multitask learning method that improves model auditability by accurately detecting if data was used in training, achieving over 80% accuracy.

Details

Motivation: The motivation is to improve the auditability of machine learning models, enabling better security, privacy, and copyright protection by detecting whether specific data was used in training. Method: Active MINT uses a multitask learning framework, training two models simultaneously: the Audited Model and the MINT Model, which uses intermediate activation maps to detect training data. Result: Active MINT achieves over 80% accuracy in detecting training data usage across various neural network architectures and outperforms existing methods. Conclusion: Active MINT contributes to enhancing transparency in AI models and improves detection of training data usage, offering better performance than previous methods. Abstract: Active Membership Inference Test (aMINT) is a method designed to detect whether given data were used during the training of machine learning models. In Active MINT, we propose a novel multitask learning process that involves training simultaneously two models: the original or Audited Model, and a secondary model, referred to as the MINT Model, responsible for identifying the data used for training the Audited Model. This novel multi-task learning approach has been designed to incorporate the auditability of the model as an optimization objective during the training process of neural networks. The proposed approach incorporates intermediate activation maps as inputs to the MINT layers, which are trained to enhance the detection of training data. We present results using a wide range of neural networks, from lighter architectures such as MobileNet to more complex ones such as Vision Transformers, evaluated in 5 public benchmarks. Our proposed Active MINT achieves over 80% accuracy in detecting if given data was used for training, significantly outperforming previous approaches in the literature. 
Our aMINT and related methodological developments contribute to increasing transparency in AI models, facilitating stronger safeguards in AI deployments to achieve proper security, privacy, and copyright protection.

[105] Object-level Correlation for Few-Shot Segmentation

Chunlin Wen,Yu Zhang,Jie Fan,Hongyuan Zhu,Xiu-Shen Wei,Yijun Wang,Zhiqiang Kou,Shuzhou Sun

Main category: cs.CV

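TL;DR: This paper proposes an Object-level Correlation Network (OCNet) for few-shot semantic segmentation. By establishing object-level correlation between the support target object and query general objects, it suppresses background noise and improves segmentation performance.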

Details

Motivation: Existing methods mainly build image-level correlation between the support target object and the entire query image, but this correlation contains irrelevant background objects (hard pixel noise) that are difficult to trace and suppress, leading to overfitting to the background. A more effective approach is therefore needed for few-shot semantic segmentation in the low-data regime. Method: An Object-level Correlation Network (OCNet) is designed, consisting mainly of the General Object Mining Module (GOMM) and the Correlation Construction Module (CCM). GOMM constructs the query general object feature by learning saliency and high-level similarity cues, while CCM establishes the object-level correlation by allocating target prototypes to match the general object feature. Result: Extensive experiments on PASCAL-${5}^{i}$ and COCO-${20}^{i}$ show that the model achieves state-of-the-art performance. Conclusion: OCNet uses object-level correlation to suppress hard pixel noise and performs well on few-shot semantic segmentation, especially in the low-data regime. Abstract: Few-shot semantic segmentation (FSS) aims to segment objects of novel categories in the query images given only a few annotated support samples. Existing methods primarily build the image-level correlation between the support target object and the entire query image. However, this correlation contains the hard pixel noise, i.e., irrelevant background objects, that is intractable to trace and suppress, leading to the overfitting of the background. To address the limitation of this correlation, we imitate the biological vision process to identify novel objects in the object-level information. Target identification in the general objects is more valid than in the entire image, especially in the low-data regime. Inspired by this, we design an Object-level Correlation Network (OCNet) by establishing the object-level correlation between the support target object and query general objects, which is mainly composed of the General Object Mining Module (GOMM) and Correlation Construction Module (CCM). Specifically, GOMM constructs the query general object feature by learning saliency and high-level similarity cues, where the general objects include the irrelevant background objects and the target foreground object. Then, CCM establishes the object-level correlation by allocating the target prototypes to match the general object feature. The generated object-level correlation can mine the query target feature and suppress the hard pixel noise for the final prediction. Extensive experiments on PASCAL-${5}^{i}$ and COCO-${20}^{i}$ show that our model achieves the state-of-the-art performance.

[106] ScoreHOI: Physically Plausible Reconstruction of Human-Object Interaction via Score-Guided Diffusion

Ao Li,Jinpeng Liu,Yixuan Zhu,Yansong Tang

Main category: cs.CV

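TL;DR: This paper proposes ScoreHOI, a diffusion-based optimizer for reconstructing human-object interactions. By introducing diffusion priors and physical constraints, it outperforms existing methods on standard datasets.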

Details

Motivation: Previous optimization methods struggle to produce physically plausible reconstructions because they lack prior knowledge about human-object interactions. Method: The paper proposes ScoreHOI, which uses score-guided sampling to control a diffusion model that reconstructs the conditional distribution of human and object pose given the image observation and object feature. During inference it adds physical constraints and a contact-driven iterative refinement strategy. Result: On standard benchmarks, ScoreHOI outperforms state-of-the-art methods in both reconstruction accuracy and contact plausibility. Conclusion: By introducing diffusion priors and physical constraints, ScoreHOI improves the precision and robustness of human-object interaction reconstruction. Abstract: Joint reconstruction of human-object interaction marks a significant milestone in comprehending the intricate interrelations between humans and their surrounding environment. Nevertheless, previous optimization methods often struggle to achieve physically plausible reconstruction results due to the lack of prior knowledge about human-object interactions. In this paper, we introduce ScoreHOI, an effective diffusion-based optimizer that introduces diffusion priors for the precise recovery of human-object interactions. By harnessing the controllability within score-guided sampling, the diffusion model can reconstruct a conditional distribution of human and object pose given the image observation and object feature. During inference, ScoreHOI effectively improves the reconstruction results by guiding the denoising process with specific physical constraints. Furthermore, we propose a contact-driven iterative refinement approach to enhance the contact plausibility and improve the reconstruction accuracy. Extensive evaluations on standard benchmarks demonstrate ScoreHOI's superior performance over state-of-the-art methods, highlighting its ability to achieve a precise and robust improvement in joint human-object interaction reconstruction.

[107] Multimodal Contrastive Pretraining of CBCT and IOS for Enhanced Tooth Segmentation

Moo Hyun Son,Juyoung Bae,Zelin Qiu,Jiale Peng,Kai Xin Li,Yifan Lin,Hao Chen

Main category: cs.CV

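TL;DR: This paper introduces ToothMCL, a multimodal pretraining framework for tooth segmentation that integrates the CBCT and IOS modalities and surpasses existing methods in performance and generalizability.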

Details

Motivation: Existing tooth segmentation methods lack rigorous validation and show limited performance and clinical applicability. Method: The paper introduces ToothMCL, a multimodal pretraining framework that integrates volumetric (CBCT) and surface-based (IOS) modalities and captures modality-invariant representations through multimodal contrastive learning. Result: ToothMCL improves the Dice Similarity Coefficient (DSC) by 12% for CBCT segmentation and 8% for IOS segmentation, and consistently surpasses existing methods across tooth groups. Conclusion: ToothMCL achieves state-of-the-art performance on CBCT and IOS segmentation and generalizes well across varying imaging conditions and clinical scenarios. Abstract: Digital dentistry represents a transformative shift in modern dental practice. The foundational step in this transformation is the accurate digital representation of the patient's dentition, which is obtained from segmented Cone-Beam Computed Tomography (CBCT) and Intraoral Scans (IOS). Despite the growing interest in digital dental technologies, existing segmentation methodologies frequently lack rigorous validation and demonstrate limited performance and clinical applicability. To the best of our knowledge, this is the first work to introduce a multimodal pretraining framework for tooth segmentation. We present ToothMCL, a Tooth Multimodal Contrastive Learning framework for pretraining that integrates volumetric (CBCT) and surface-based (IOS) modalities. By capturing modality-invariant representations through multimodal contrastive learning, our approach effectively models fine-grained anatomical features, enabling precise multi-class segmentation and accurate identification of Fédération Dentaire Internationale (FDI) tooth numbering. Along with the framework, we curated CBCT-IOS3.8K, the largest paired CBCT and IOS dataset to date, comprising 3,867 patients. We then evaluated ToothMCL on a comprehensive collection of independent datasets, representing the largest and most diverse evaluation to date. Our method achieves state-of-the-art performance in both internal and external testing, with an increase of 12% for CBCT segmentation and 8% for IOS segmentation in the Dice Similarity Coefficient (DSC). Furthermore, ToothMCL consistently surpasses existing approaches in tooth groups and demonstrates robust generalizability across varying imaging conditions and clinical scenarios.
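
The abstract does not state which contrastive objective ToothMCL uses. A common choice for paired-modality pretraining is a symmetric InfoNCE loss, where the embedding of each CBCT sample should be closer to its own IOS partner than to the other samples in the batch. The sketch below assumes that objective. The function names, the temperature value, and the toy 2-D embeddings are illustrative and not taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce(anchors: List[Vector], targets: List[Vector], temperature: float = 0.1) -> float:
    """Mean cross-entropy of picking targets[i] as the match for anchors[i]."""
    a = [normalize(v) for v in anchors]
    t = [normalize(v) for v in targets]
    loss = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, target) / temperature for target in t]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(a)


def symmetric_contrastive_loss(cbct: List[Vector], ios: List[Vector], temperature: float = 0.1) -> float:
    """Average the CBCT-to-IOS and IOS-to-CBCT directions."""
    return 0.5 * (info_nce(cbct, ios, temperature) + info_nce(ios, cbct, temperature))


# Toy batch of two patients with 2-D embeddings (hypothetical values).
cbct_emb = [[1.0, 0.0], [0.0, 1.0]]
ios_aligned = [[1.0, 0.0], [0.0, 1.0]]   # each IOS matches its CBCT partner
ios_swapped = [[0.0, 1.0], [1.0, 0.0]]   # partners are mismatched

print(symmetric_contrastive_loss(cbct_emb, ios_aligned))
print(symmetric_contrastive_loss(cbct_emb, ios_swapped))
```

The aligned batch yields a loss near zero and the swapped batch a much larger one, which is the pressure that pushes the two encoders toward a shared, modality-invariant embedding space.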

[108] Accelerating Local AI on Consumer GPUs: A Hardware-Aware Dynamic Strategy for YOLOv10s

Mahmudul Islam Masum,Miad Islam,Arif I. Sarwat

Main category: cs.CV

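TL;DR: This paper proposes a Two-Pass Adaptive Inference algorithm that enables efficient real-time AI inference on consumer-grade devices without changing the model architecture.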

Details

Motivation: As local AI grows in popularity, there is a critical gap between the benchmark performance of object detectors and their practical viability on consumer-grade hardware. Models like YOLOv10s promise real-time speeds, but these metrics are typically achieved on high-power, desktop-class GPUs. On resource-constrained systems, such as laptops with RTX 4060 GPUs, performance is not compute-bound but dominated by system-level bottlenecks. Method: The paper introduces a Two-Pass Adaptive Inference algorithm, a model-independent approach that requires no architectural changes. It uses a fast, low-resolution pass and escalates to a high-resolution model pass only when detection confidence is low. Result: On a 5000-image COCO dataset, the method achieves a 1.85x speedup over a PyTorch Early-Exit baseline with an mAP loss of only 5.51%. Conclusion: The paper offers a practical and reproducible blueprint for deploying high-performance real-time AI on consumer-grade devices by shifting the focus from pure model optimization to hardware-aware inference strategies. Abstract: As local AI grows in popularity, there is a critical gap between the benchmark performance of object detectors and their practical viability on consumer-grade hardware. While models like YOLOv10s promise real-time speeds, these metrics are typically achieved on high-power, desktop-class GPUs. This paper reveals that on resource-constrained systems, such as laptops with RTX 4060 GPUs, performance is not compute-bound but is instead dominated by system-level bottlenecks, as illustrated by a simple bottleneck test. To overcome this hardware-level constraint, we introduce a Two-Pass Adaptive Inference algorithm, a model-independent approach that requires no architectural changes. This study mainly focuses on adaptive inference strategies and undertakes a comparative analysis of architectural early-exit and resolution-adaptive routing, highlighting their respective trade-offs within a unified evaluation framework. The system uses a fast, low-resolution pass and only escalates to a high-resolution model pass when detection confidence is low. On a 5000-image COCO dataset, our method achieves a 1.85x speedup over a PyTorch Early-Exit baseline, with a modest mAP loss of 5.51%. This work provides a practical and reproducible blueprint for deploying high-performance, real-time AI on consumer-grade devices by shifting the focus from pure model optimization to hardware-aware inference strategies that maximize throughput.
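
The control flow of a two-pass strategy like the one described can be sketched as below. This is a minimal illustration, not the authors' implementation: the abstract does not define "low confidence", so the escalation rule here (no detections, or any detection under a threshold) is an assumption, as are the function name, the threshold value, and the stub detectors that stand in for the low-resolution and high-resolution model passes.

```python
from typing import Callable, List, Tuple

Detection = Tuple[str, float]                  # (label, confidence)
Detector = Callable[[str], List[Detection]]    # image id -> detections


def two_pass_infer(
    image: str,
    fast_pass: Detector,
    accurate_pass: Detector,
    threshold: float = 0.5,
) -> Tuple[List[Detection], bool]:
    """Run the cheap pass first; escalate only when it looks unreliable.

    Returns the detections and a flag saying whether the second pass ran.
    """
    detections = fast_pass(image)
    confident = bool(detections) and min(c for _, c in detections) >= threshold
    if confident:
        return detections, False
    return accurate_pass(image), True


# Stub detectors standing in for real low-res / high-res model calls.
def fast_stub(image: str) -> List[Detection]:
    return [("cat", 0.9)] if image == "easy" else [("cat", 0.3)]


def accurate_stub(image: str) -> List[Detection]:
    return [("cat", 0.95)]


for name in ("easy", "hard"):
    result, escalated = two_pass_infer(name, fast_stub, accurate_stub)
    print(name, result, "escalated" if escalated else "fast path")
```

The speedup of such a scheme depends on how often the fast pass is trusted: every escalated image pays for both passes, so the threshold trades throughput against accuracy.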

[109] Dynamic Scene 3D Reconstruction of an Uncooperative Resident Space Object

Bala Prenith Reddy Gopu,Timothy Jacob Huber,George M. Nehma,Patrick Quinn,Madhur Tiwari,Matt Ueckermann,David Hinckley,Christopher McKenna

Main category: cs.CV

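TL;DR: This study evaluates 3D reconstruction algorithms for modeling uncooperative tumbling satellites, develops a simulation environment, and validates Neuralangelo's reconstruction capability on static scenes.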

Details

Motivation: To address the challenges of reconstructing tumbling uncooperative targets, the study evaluates how well state-of-the-art 3D reconstruction algorithms for dynamic scenes can generate high-fidelity geometric models. Method: A simulation environment built on Isaac Sim generates 2D image sequences of a tumbling satellite under realistic orbital lighting conditions, and the reconstruction quality of Neuralangelo is assessed using Cloud Compare. Result: Preliminary results on static scenes show that Neuralangelo generates 3D meshes that closely match the original CAD models, with minimal errors and artifacts, and captures the fine details needed for mission planning. Conclusion: The study gives a preliminary demonstration of high-quality 3D reconstruction with Neuralangelo on static scenes and provides a baseline for the ongoing evaluation of dynamic scene reconstruction. Abstract: Characterization of uncooperative Resident Space Objects (RSO) plays a crucial role in On-Orbit Servicing (OOS) and Active Debris Removal (ADR) missions to assess the geometry and motion properties. To address the challenges of reconstructing tumbling uncooperative targets, this study evaluates the performance of existing state-of-the-art 3D reconstruction algorithms for dynamic scenes, focusing on their ability to generate geometrically accurate models with high fidelity. To support our evaluation, we developed a simulation environment using Isaac Sim to generate physics-accurate 2D image sequences of a tumbling satellite under realistic orbital lighting conditions. Our preliminary results on static scenes using Neuralangelo demonstrate promising reconstruction quality. The generated 3D meshes closely match the original CAD models with minimal errors and artifacts when compared using Cloud Compare (CC). The reconstructed models were able to capture critical fine details for mission planning. This provides a baseline for our ongoing evaluation of dynamic scene reconstruction.

[110] Feature Space Analysis by Guided Diffusion Model

Kimiaki Shirahama,Miki Yanobu,Kaduki Yamashita,Miho Ohsaki

Main category: cs.CV

TL;DR: This paper introduces a decoder based on a guided diffusion model that generates images with features closely matching a user-specified feature, offering insights into how DNNs encode image attributes.

Details

Motivation: The motivation stems from the black-box nature of DNNs' internal feature extraction process, making it difficult to understand which image attributes are encoded into a feature. Method: The method involves implementing a guided diffusion model that minimizes the Euclidean distance between the feature of a clean image and a user-specified feature during the reverse image generation process. Result: The experimental results show that the generated images have features remarkably similar to the user-specified ones, demonstrating the decoder's effectiveness in analyzing DNN feature spaces. Conclusion: The paper concludes that their decoder provides a reliable way to analyze the feature space of DNNs, revealing valuable insights into how different attributes of images are encoded. Abstract: One of the key issues in Deep Neural Networks (DNNs) is the black-box nature of their internal feature extraction process. Targeting vision-related domains, this paper focuses on analysing the feature space of a DNN by proposing a decoder that can generate images whose features are guaranteed to closely match a user-specified feature. Owing to this guarantee that is missed in past studies, our decoder allows us to evidence which of various attributes in an image are encoded into a feature by the DNN, by generating images whose features are in proximity to that feature. Our decoder is implemented as a guided diffusion model that guides the reverse image generation of a pre-trained diffusion model to minimise the Euclidean distance between the feature of a clean image estimated at each step and the user-specified feature. One practical advantage of our decoder is that it can analyse feature spaces of different DNNs with no additional training and run on a single COTS GPU. 
The experimental results targeting CLIP's image encoder, ResNet-50 and vision transformer demonstrate that images generated by our decoder have features remarkably similar to the user-specified ones and reveal valuable insights into these DNNs' feature spaces.

[111] One View, Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation

Zheng Geng,Nan Wang,Shaocong Xu,Chongjie Ye,Bohan Li,Zhaoxi Chen,Sida Peng,Hao Zhao

Main category: cs.CV

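TL;DR: OnePoseViaGen is a new method for 6D pose estimation from a single reference view. It combines multi-view feature matching with a text-guided domain randomization strategy to achieve efficient and reliable pose estimation, with state-of-the-art performance on multiple benchmarks.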

Details

Motivation: In the long tail of real-world instances, robotic manipulation requires estimating the 6D pose of arbitrary unseen objects from a single reference image. This setting is very challenging: 3D models are scarce, single-view reconstructions lack metric scale, and domain gaps between generated models and real-world images undermine robustness. Method: The paper proposes OnePoseViaGen, a pipeline with two core components: (1) a coarse-to-fine alignment module that jointly refines scale and pose by combining multi-view feature matching with render-and-compare refinement, and (2) a text-guided generative domain randomization strategy that diversifies textures so that pose estimators can be fine-tuned effectively on synthetic data. Result: OnePoseViaGen far surpasses prior methods on challenging benchmarks (YCBInEOAT, Toyota-Light, LM-O), and dexterous grasping experiments with a real robot hand validate its practicality in real-world manipulation. Conclusion: Through its coarse-to-fine alignment module and text-guided generative domain randomization strategy, OnePoseViaGen lets high-fidelity single-view 3D generation support reliable one-shot 6D pose estimation and reaches state-of-the-art performance on multiple challenging benchmarks. Abstract: Estimating the 6D pose of arbitrary unseen objects from a single reference image is critical for robotics operating in the long-tail of real-world instances. However, this setting is notoriously challenging: 3D models are rarely available, single-view reconstructions lack metric scale, and domain gaps between generated models and real-world images undermine robustness. We propose OnePoseViaGen, a pipeline that tackles these challenges through two key components. First, a coarse-to-fine alignment module jointly refines scale and pose by combining multi-view feature matching with render-and-compare refinement. Second, a text-guided generative domain randomization strategy diversifies textures, enabling effective fine-tuning of pose estimators with synthetic data. Together, these steps allow high-fidelity single-view 3D generation to support reliable one-shot 6D pose estimation. On challenging benchmarks (YCBInEOAT, Toyota-Light, LM-O), OnePoseViaGen achieves state-of-the-art performance far surpassing prior approaches. We further demonstrate robust dexterous grasping with a real robot hand, validating the practicality of our method in real-world manipulation. Project page: https://gzwsama.github.io/OnePoseviaGen.github.io/

[112] Visual Representation Alignment for Multimodal Large Language Models

Heeji Yoon,Jaewoo Jung,Junwan Kim,Hyungyu Choi,Heeseong Shin,Sangbeom Lim,Honggyu An,Chaehyun Kim,Jisang Han,Donghyun Kim,Chanho Eom,Sunghwan Hong,Seungryong Kim

Main category: cs.CV

TL;DR: VIRAL improves the visual reasoning ability of MLLMs by aligning their visual representations with pre-trained models, leading to better performance on vision-centric tasks.

Details

Motivation: MLLMs struggle with vision-centric tasks due to text-only supervision, which leads to the loss of fine-grained visual details. Method: VIRAL aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). Result: Experiments show consistent improvements across multimodal benchmarks, validated by ablation studies. Conclusion: VIRAL is a promising regularization strategy that improves the integration of visual information in MLLMs by aligning their visual representations with those of pre-trained VFMs. Abstract: Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
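
The abstract describes VIRAL as a regularizer that aligns an MLLM's internal visual representations with those of a vision foundation model, but it does not state the loss. One simple way such an alignment term could be written is one minus the mean cosine similarity between matched token features. The sketch below assumes that form and assumes both feature sets already share a dimension (a real system would likely need a learned projection). All names and values are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(mllm_tokens: List[Vector], vfm_tokens: List[Vector]) -> float:
    """1 - mean cosine similarity over matched visual tokens.

    0 means the two sets of features point in the same directions;
    1 means they are orthogonal on average.
    """
    sims = [cosine(m, v) for m, v in zip(mllm_tokens, vfm_tokens)]
    return 1.0 - sum(sims) / len(sims)


def total_loss(language_loss: float, align: float, weight: float = 0.5) -> float:
    """Text supervision plus a weighted alignment regularizer (weight is made up)."""
    return language_loss + weight * align


# Toy features for two visual tokens (hypothetical values).
vfm = [[1.0, 0.0], [0.0, 2.0]]
mllm_matching = [[3.0, 0.0], [0.0, 1.0]]      # same directions, different scale
mllm_orthogonal = [[0.0, 1.0], [1.0, 0.0]]    # directions lost

print(alignment_loss(mllm_matching, vfm))
print(alignment_loss(mllm_orthogonal, vfm))
print(total_loss(2.0, alignment_loss(mllm_orthogonal, vfm)))
```

Because cosine similarity ignores scale, a term like this constrains only the direction of the intermediate features, leaving the language modeling loss free to set their magnitude.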

Table of Contents

cs.CL

[1] MedBench-IT: A Comprehensive Benchmark for Evaluating Large Language Models on Italian Medical Entrance Examinations

[2] The ML-SUPERB 2.0 Challenge: Towards Inclusive ASR Benchmarking for All Language Varieties

[3] Toward Purpose-oriented Topic Model Evaluation enabled by Large Language Models

[4] Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector

[5] DischargeSim: A Simulation Benchmark for Educational Doctor-Patient Communication at Discharge

[6] Rule-Based Moral Principles for Explaining Uncertainty in Natural Language Generation

[7] LLM Analysis of 150+ years of German Parliamentary Debates on Migration Reveals Shift from Post-War Solidarity to Anti-Solidarity in the Last Decade

[8] Causal Attention with Lookahead Keys

[9] Basis Vector Metric: A Method for Robust Open-Ended State Change Detection

[10] Instance-level Performance Prediction for Long-form Generation Tasks

[11] Does This Look Familiar to You? Knowledge Analysis via Model Internal Representations

[12] Mitigating Attention Localization in Small Scale: Self-Attention Refinement via One-step Belief Propagation

[13] PersonaFuse: A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions

[14] Talking with Oompa Loompas: A novel framework for evaluating linguistic acquisition of LLM agents

[15] The Role of Exploration Modules in Small Language Models for Knowledge Graph Question Answering

[16] LongEmotion: Measuring Emotional Intelligence of Large Language Models in Long-Context Interaction

[17] AIxcellent Vibes at GermEval 2025 Shared Task on Candy Speech Detection: Improving Model Performance by Span-Level Training

[18] Understanding Stigmatizing Language Lexicons: A Comparative Analysis in Clinical Contexts

[19] From Scarcity to Efficiency: Investigating the Effects of Data Augmentation on African Machine Translation

[20] HALT-RAG: A Task-Adaptable Framework for Hallucination Detection with Calibrated NLI Ensembles and Abstention

[21] ALLabel: Three-stage Active Learning for LLM-based Entity Recognition using Demonstration Retrieval

[22] VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents

[23] Avoiding Knowledge Edit Skipping in Multi-hop Question Answering with Guided Decomposition

[24] BALI: Enhancing Biomedical Language Representations through Knowledge Graph and Language Model Alignment

[25] MaLei at MultiClinSUM: Summarisation of Clinical Documents using Perspective-Aware Iterative Self-Prompting with LLMs

[26] MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval

[27] M-BRe: Discovering Training Samples for Relation Extraction from Unlabeled Texts with Large Language Models

[28] Factuality Beyond Coherence: Evaluating LLM Watermarking Methods for Medical Texts

[29] Are LLMs Enough for Hyperpartisan, Fake, Polarized and Harmful Content Detection? Evaluating In-Context Learning vs. Fine-Tuning

[30] SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP

[31] Dual Knowledge-Enhanced Two-Stage Reasoner for Multimodal Dialog Systems

[32] Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost

[33] Are Humans as Brittle as Large Language Models?

[34] From Detection to Mitigation: Addressing Gender Bias in Chinese Texts via Efficient Tuning and Voting-Based Rebalancing

[35] Biased Tales: Cultural and Topic Bias in Generating Children's Stories

[36] GENUINE: Graph Enhanced Multi-level Uncertainty Estimation for Large Language Models

[37] SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge

[38] Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

cs.CV

[39] CellPainTR: Generalizable Representation Learning for Cross-Dataset Cell Painting Analysis

[40] FusWay: Multimodal hybrid fusion approach. Application to Railway Defect Detection

[41] Frustratingly Easy Feature Reconstruction for Out-of-Distribution Detection

[42] DIET-CP: Lightweight and Data Efficient Self Supervised Continued Pretraining

[43] FedAPT: Federated Adversarial Prompt Tuning for Vision-Language Models

[44] Geospatial Foundational Embedder: Top-1 Winning Solution on EarthVision Embed2Scale Challenge (CVPR 2025)

[45] VLMs-in-the-Wild: Bridging the Gap Between Academic Benchmarks and Enterprise Reality

[46] The Protocol Genome A Self Supervised Learning Framework from DICOM Headers

[47] Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems

[48] K-Syn: K-space Data Synthesis in Ultra Low-data Regimes

[49] Not All Splits Are Equal: Rethinking Attribute Generalization Across Unrelated Categories

[50] Human-in-the-Loop: Quantitative Evaluation of 3D Models Generation by Large Language Models

[51] MEGS$^{2}$: Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning

[52] Moment- and Power-Spectrum-Based Gaussianity Regularization for Text-to-Image Models

[53] SAM$^{*}$: Task-Adaptive SAM with Physics-Guided Rewards

[54] Enhancing Classification of Streaming Data with Image Distillation

[55] Automated Evaluation of Gender Bias Across 13 Large Multimodal Models

[56] Faster VGGT with Block-Sparse Global Attention

[57] Detection and Recovery of Adversarial Slow-Pose Drift in Offloaded Visual-Inertial Odometry

[58] Realism to Deception: Investigating Deepfake Detectors Against Face Enhancement

[59] Dimensionally Reduced Open-World Clustering: DROWCULA

[60] XBusNet: Text-Guided Breast Ultrasound Segmentation via Multimodal Vision-Language Learning

[61] Breast Cancer Detection in Thermographic Images via Diffusion-Based Augmentation and Nonlinear Feature Fusion

[62] Reconstruction Alignment Improves Unified Multimodal Models

[63] DEPF: A UAV Multispectral Object Detector with Dual-Domain Enhancement and Priority-Guided Mamba Fusion

[64] G3CN: Gaussian Topology Refinement Gated Graph Convolutional Network for Skeleton-Based Action Recognition

[65] Parse Graph-Based Visual-Language Interaction for Human Pose Estimation

[66] DreamLifting: A Plug-in Module Lifting MV Diffusion Models for 3D Asset Generation

[67] In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting

[68] GLEAM: Learning to Match and Explain in Cross-View Geo-Localization

[69] XOCT: Enhancing OCT to OCTA Translation via Cross-Dimensional Supervised Multi-Scale Feature Learning

[70] Bias-Aware Machine Unlearning: Towards Fairer Vision Models via Controllable Forgetting

[71] ANYPORTAL: Zero-Shot Consistent Video Background Replacement

[72] MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification

[73] LINR Bridge: Vector Graphic Animation via Neural Implicits and Video Diffusion Priors

[74] Fine-Tuning Vision-Language Models for Visual Navigation Assistance

[75] DiGS: Accurate and Complete Surface Reconstruction from 3D Gaussians via Direct SDF Learning

[76] Generating Transferrable Adversarial Examples via Local Mixing and Logits Optimization for Remote Sensing Object Recognition

[77] MVAT: Multi-View Aware Teacher for Weakly Supervised 3D Object Detection