cs.CL

[1] INSEva: A Comprehensive Chinese Benchmark for Large Language Models in Insurance

Shisong Chen,Qian Zhu,Wenyan Yang,Chengyi Yang,Zhong Wang,Ping Wang,Xuan Lin,Bo Xu,Daqian Li,Chao Yuan,Licai Qi,Wanqing Xu,sun zhenxing,Xin Lu,Shiqiang Xiong,Chao Chen,Haixiang Hu,Yanghua Xiao

Main category: cs.CL

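TL;DR: This paper introduces INSEva, a new Chinese benchmark for evaluating AI systems in the insurance domain, and finds that existing models still show significant gaps when handling complex insurance scenarios.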

Details

Motivation: Existing benchmarks fail to capture the unique characteristics and requirements of the insurance domain. Method: The authors develop a multi-dimensional evaluation taxonomy covering business areas, task formats, difficulty levels, and a cognitive-knowledge dimension, and evaluate 8 state-of-the-art large language models on 38,704 high-quality evaluation examples. Result: The evaluation reveals significant performance variations across dimensions; general LLMs show basic insurance domain competency, but substantial gaps remain in handling complex, real-world insurance scenarios. Conclusion: INSEva is a comprehensive Chinese benchmark for evaluating AI systems' knowledge and capabilities in insurance, and it will be made public soon. Abstract: Insurance, as a critical component of the global financial system, demands high standards of accuracy and reliability in AI applications. While existing benchmarks evaluate AI capabilities across various domains, they often fail to capture the unique characteristics and requirements of the insurance domain. To address this gap, we present INSEva, a comprehensive Chinese benchmark specifically designed for evaluating AI systems' knowledge and capabilities in insurance. INSEva features a multi-dimensional evaluation taxonomy covering business areas, task formats, difficulty levels, and cognitive-knowledge dimension, comprising 38,704 high-quality evaluation examples sourced from authoritative materials. Our benchmark implements tailored evaluation methods for assessing both faithfulness and completeness in open-ended responses. Through extensive evaluation of 8 state-of-the-art Large Language Models (LLMs), we identify significant performance variations across different dimensions. While general LLMs demonstrate basic insurance domain competency with average scores above 80, substantial gaps remain in handling complex, real-world insurance scenarios. The benchmark will be public soon.

[2] Mentalic Net: Development of RAG-based Conversational AI and Evaluation Framework for Mental Health Support

Anandi Dutta,Shivani Mruthyunjaya,Jessica Saddington,Kazi Sifatul Islam

Main category: cs.CL

TL;DR: This paper presents Mentalic Net Conversational AI, a mental health support chatbot developed using a retrieval-augmented generation (RAG) framework, integrated prompt engineering, and fine-tuning on novel datasets. The chatbot achieved a BERT Score of 0.898 and satisfactory results in other evaluation metrics. The study emphasizes the importance of a human-in-the-loop approach and a responsible strategy in developing such transformative technologies.

Details

Motivation: The motivation behind the research is to address the challenges posed by the emergence of large language models (LLMs) and to develop a safe and meaningful mental health support chatbot that augments professional healthcare. Method: The researchers used a retrieval-augmented generation (RAG) framework, integrated prompt engineering, and fine-tuned a pre-trained model on novel datasets to develop the chatbot. Result: The resulting system, Mentalic Net Conversational AI, achieved a BERT Score of 0.898, with other evaluation metrics falling within satisfactory ranges. Conclusion: The paper concludes that a human-in-the-loop approach and a long-term, responsible strategy are essential in developing transformative technologies like Mentalic Net Conversational AI, as they have the potential to change lives but also pose risks if not carefully managed. Abstract: The emergence of large language models (LLMs) has unlocked boundless possibilities, along with significant challenges. In response, we developed a mental health support chatbot designed to augment professional healthcare, with a strong emphasis on safe and meaningful application. Our approach involved rigorous evaluation, covering accuracy, empathy, trustworthiness, privacy, and bias. We employed a retrieval-augmented generation (RAG) framework, integrated prompt engineering, and fine-tuned a pre-trained model on novel datasets. The resulting system, Mentalic Net Conversational AI, achieved a BERT Score of 0.898, with other evaluation metrics falling within satisfactory ranges. We advocate for a human-in-the-loop approach and a long-term, responsible strategy in developing such transformative technologies, recognizing both their potential to change lives and the risks they may pose if not carefully managed.

[3] Do MLLMs Really Understand the Charts?

Xiao Zhang,Dongyuan Li,Liuyu Xiang,Yao Zhang,Cheng Zhong,Zhaofeng He

Main category: cs.CL

TL;DR: This paper introduces ChartReasoner, which improves MLLMs' chart understanding and visual reasoning, outperforming leading models like GPT-4o and Gemini-2.5-Flash on a new benchmark called CRBench.

Details

Motivation: MLLMs exhibit hallucinations and performance degradation when handling non-annotated charts, raising the question of whether they truly understand charts. Method: A comprehensive Chart Reasoning Benchmark (CRBench) was established to evaluate the visual reasoning abilities of MLLMs on non-annotated charts. ChartReasoner was proposed to mimic human behavior by grounding estimation in chart understanding. Result: ChartReasoner-3B/7B achieved superior performance in chart reasoning on CRBench and demonstrated improved visual reasoning abilities in general chart comprehension on public benchmarks. Conclusion: ChartReasoner enables MLLMs to achieve better chart understanding and visual reasoning abilities, even outperforming GPT-4o and Gemini-2.5-Flash on the CRBench benchmark. Abstract: Although Multimodal Large Language Models (MLLMs) have demonstrated increasingly impressive performance in chart understanding, most of them exhibit alarming hallucinations and significant performance degradation when handling non-annotated charts. Therefore, a question arises: Do MLLMs really understand the charts? Since a human is capable of understanding charts and estimating the values by visual reasoning, we first carefully establish a comprehensive Chart Reasoning Benchmark CRBench to rigorously evaluate the visual reasoning abilities of MLLMs on non-annotated charts. We argue that MLLMs are primarily relying on recognition rather than reasoning to interpret the charts. To steer MLLMs to reasonable chart understanding, we propose ChartReasoner that mimics human behavior by grounding their estimation in chart understanding. Extensive results on the proposed CRBench show that ChartReasoner-3B/7B achieves superior performance in chart reasoning, even compared to GPT-4o and Gemini-2.5-Flash. More importantly, ChartReasoner also demonstrates the visual reasoning abilities in general chart comprehension on public benchmarks, leading to significant performance gains and enabling MLLMs to rationally understand the charts. The code and dataset will be publicly available upon publication.

[4] Predicting Failures of LLMs to Link Biomedical Ontology Terms to Identifiers: Evidence Across Models and Ontologies

Daniel B. Hier,Steven Keith Platt,Tayo Obafemi-Ajayi

Main category: cs.CL

TL;DR: This study identifies exposure to ontology identifiers as the key factor influencing the success of linking ontology terms to identifiers in large language models.

Details

Motivation: Large language models often perform well on biomedical NLP tasks but may fail to link ontology terms to their correct identifiers. This study aims to understand the reasons behind these failures. Method: The researchers analyzed predictions from two high-performing models, GPT-4o and LLaMa 3.1 405B, across two major ontologies (Human Phenotype Ontology and Gene Ontology). They evaluated nine candidate features using univariate and multivariate analyses to determine factors influencing linking success. Result: Exposure to ontology identifiers emerged as the strongest predictor of linking success. The analysis of various features related to term familiarity, identifier usage, morphology, and ontology structure provided insights into model performance. Conclusion: The study concludes that exposure to ontology identifiers is the most significant factor in successfully linking ontology terms to their correct identifiers in large language models. Abstract: Large language models often perform well on biomedical NLP tasks but may fail to link ontology terms to their correct identifiers. We investigate why these failures occur by analyzing predictions across two major ontologies, Human Phenotype Ontology and Gene Ontology, and two high-performing models, GPT-4o and LLaMa 3.1 405B. We evaluate nine candidate features related to term familiarity, identifier usage, morphology, and ontology structure. Univariate and multivariate analyses show that exposure to ontology identifiers is the strongest predictor of linking success.

[5] Uncertainty-Aware Collaborative System of Large and Small Models for Multimodal Sentiment Analysis

Shiqin Han,Manning Gao,Menghua Jiang,Yuncheng Jiang,Haifeng Hu,Sijie Mai

Main category: cs.CL

TL;DR: This paper proposes U-ACS, a collaborative system that combines a lightweight model and an MLLM for efficient and accurate multimodal sentiment analysis by dynamically allocating resources based on input difficulty.

Details

Motivation: Multimodal Large Language Models (MLLMs) offer high performance but suffer from high computational demands, while smaller models are efficient but less accurate. The motivation is to reconcile this performance-efficiency trade-off in multimodal sentiment analysis. Method: The U-ACS system uses an uncertainty-driven cascade mechanism, where a lightweight model filters inputs and only difficult samples are escalated to a powerful MLLM. Advanced strategies like weighted averaging and prompt-based cross-verification are used to handle ambiguous or conflicting predictions. Result: Extensive experiments show that the proposed U-ACS system achieves state-of-the-art performance on benchmark datasets while requiring only a fraction of the computational resources compared to using a standalone MLLM. Conclusion: The proposed U-ACS system effectively balances performance and efficiency in multimodal sentiment analysis by dynamically allocating computational resources based on sample difficulty, achieving state-of-the-art results with reduced inference costs. Abstract: The advent of Multimodal Large Language Models (MLLMs) has significantly advanced the state-of-the-art in multimodal machine learning, yet their substantial computational demands present a critical barrier to real-world deployment. Conversely, smaller, specialized models offer high efficiency but often at the cost of performance. To reconcile this performance-efficiency trade-off, we propose a novel Uncertainty-Aware Collaborative System (U-ACS) that synergistically orchestrates a powerful MLLM (e.g., HumanOmni) and a lightweight baseline model for multimodal sentiment analysis. The core of our system is an uncertainty-driven cascade mechanism, where the efficient small model first acts as a rapid filter for all input samples. Only those samples yielding high predictive uncertainty, thereby indicating greater difficulty, are selectively escalated to the MLLM for more sophisticated analysis. 
Furthermore, our system introduces advanced strategies to handle ambiguous or conflicting predictions, including weighted averaging for predictions of similar polarity and a prompt-based cross-verification to resolve conflicting predictions when both models exhibit high uncertainty. This sample-difficulty-aware approach allows for a dynamic allocation of computational resources, drastically reducing inference costs while retaining the high accuracy of MLLM. Extensive experiments on benchmark datasets demonstrate that our proposed method achieves state-of-the-art performance, while requiring only a fraction of the computational resources compared to using a standalone MLLM.
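The uncertainty-driven cascade described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a classification setting, uses Shannon entropy as the uncertainty measure, and takes a hypothetical `threshold`; `cascade_predict`, `small_model`, and `large_model` are illustrative names, and the weighted-averaging and cross-verification strategies are omitted.

```python
import math
from typing import Callable, Sequence, Tuple

Probs = Sequence[float]


def entropy(probs: Probs) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cascade_predict(
    sample: object,
    small_model: Callable[[object], Probs],
    large_model: Callable[[object], Probs],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> Tuple[int, str]:
    """Return (predicted class index, name of the model that decided)."""
    probs = small_model(sample)
    if entropy(probs) <= threshold:
        # The small model is confident enough: stop here.
        return max(range(len(probs)), key=probs.__getitem__), "small"
    # High uncertainty: escalate the sample to the large model.
    probs = large_model(sample)
    return max(range(len(probs)), key=probs.__getitem__), "large"
```

Under these assumptions only samples whose small-model entropy exceeds the threshold incur the cost of the large model, which is the source of the inference savings the abstract reports.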

[6] CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection

Yihan Chen,Jiawei Chen,Guozhao Mo,Xuanang Chen,Ben He,Xianpei Han,Le Sun

Main category: cs.CL

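TL;DR: This paper proposes a new content-based approach to detecting AI-generated peer reviews, and builds an accompanying dataset and detection tool to address the limitations of existing detection techniques.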

Details

Motivation: Current AI-text detectors are vulnerable to paraphrasing attacks and struggle to distinguish language polishing from substantive content generation. This can cast unfair suspicion on reviews that used AI only for language refinement, while failing to catch humanized AI-generated content. Method: The study builds a fine-grained dataset of AI-generated peer reviews (CoCoNUTS) and develops a detector based on a multi-task learning framework (CoCoDet). Result: CoCoDet, the detector developed in the paper, identifies AI-generated content in peer reviews with higher accuracy and robustness. Conclusion: The paper proposes a content-based detection approach for identifying AI-generated content in peer reviews, aiming to make detection more accurate, fair, and reliable. Abstract: The growing integration of large language models (LLMs) into the peer review process presents potential risks to the fairness and reliability of scholarly evaluation. While LLMs offer valuable assistance for reviewers with language refinement, there is growing concern over their use to generate substantive review content. Existing general AI-generated text detectors are vulnerable to paraphrasing attacks and struggle to distinguish between surface language refinement and substantial content generation, suggesting that they primarily rely on stylistic cues. When applied to peer review, this limitation can result in unfairly suspecting reviews with permissible AI-assisted language enhancement, while failing to catch deceptively humanized AI-generated reviews. To address this, we propose a paradigm shift from style-based to content-based detection. Specifically, we introduce CoCoNUTS, a content-oriented benchmark built upon a fine-grained dataset of AI-generated peer reviews, covering six distinct modes of human-AI collaboration. Furthermore, we develop CoCoDet, an AI review detector via a multi-task learning framework, designed to achieve more accurate and robust detection of AI involvement in review content. Our work offers a practical foundation for evaluating the use of LLMs in peer review, and contributes to the development of more precise, equitable, and reliable detection methods for real-world scholarly applications. Our code and data will be publicly available at https://github.com/Y1hanChen/COCONUTS.

Tian Ma,Kaiyu Feng,Yu Rong,Kangfei Zhao

Main category: cs.CL

TL;DR: This paper introduces PostToPersonality (PtoP), an improved method for predicting MBTI personality types from social media posts using Large Language Models (LLMs), addressing challenges like hallucination and imbalanced data. It achieves better performance than existing techniques.

Details

Motivation: The paper aims to improve MBTI prediction from social media posts by leveraging the capabilities of LLMs while addressing their limitations, such as hallucination and bias due to class imbalance. Method: PtoP uses Retrieval-Augmented Generation with in-context learning to reduce LLM hallucinations and employs fine-tuning with synthetic minority oversampling to handle class imbalance. Result: Experiments on a real-world dataset show that PtoP outperforms 10 ML and DL baselines in MBTI prediction. Conclusion: The proposed PostToPersonality (PtoP) framework effectively addresses key challenges in MBTI prediction using LLMs, achieving state-of-the-art performance compared to existing methods. Abstract: Personality prediction from social media posts is a critical task that implies diverse applications in psychology and sociology. The Myers-Briggs Type Indicator (MBTI), a popular personality inventory, has been traditionally predicted by machine learning (ML) and deep learning (DL) techniques. Recently, the success of Large Language Models (LLMs) has revealed their huge potential in understanding and inferring personality traits from social media content. However, directly exploiting LLMs for MBTI prediction faces two key challenges: the hallucination problem inherent in LLMs and the naturally imbalanced distribution of MBTI types in the population. In this paper, we propose PostToPersonality (PtoP), a novel LLM-based framework for MBTI prediction from social media posts of individuals. Specifically, PtoP leverages Retrieval-Augmented Generation with in-context learning to mitigate hallucination in LLMs. Furthermore, we fine-tune a pretrained LLM to improve model specification in MBTI understanding with synthetic minority oversampling, which balances the class imbalance by generating synthetic samples. Experiments conducted on a real-world social media dataset demonstrate that PtoP achieves state-of-the-art performance compared with 10 ML and DL baselines.
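Synthetic minority oversampling, as mentioned in this abstract, is commonly done by interpolating between a minority-class sample and one of its nearest minority-class neighbours. The sketch below illustrates that general idea only; it assumes samples are numeric feature vectors (the abstract does not say what representation PtoP oversamples), and `smote_like`, `k`, and `seed` are illustrative names rather than anything from the paper.

```python
import random
from typing import List

Vector = List[float]


def _sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like(minority: List[Vector], n_new: int, k: int = 2, seed: int = 0) -> List[Vector]:
    """Generate n_new synthetic points by interpolating between minority samples."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority-class neighbours of the chosen base sample.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: _sq_dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment from base to other
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])
    return synthetic
```

Each synthetic point lies on a line segment between two real minority samples, so the minority class grows without exact duplicates being added.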

[8] Benchmarking GPT-5 for biomedical natural language processing

Yu Hou,Zaifu Zhan,Rui Zhang

Main category: cs.CL

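TL;DR: This paper updates a standardized BioNLP benchmark, evaluates GPT-5 and GPT-4o under several prompting conditions, and compares them with earlier models; GPT-5 shows significant gains on multiple tasks.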

Details

Motivation: The rapid expansion of biomedical literature has increased the need for scalable natural language processing solutions; although GPT-4 performed well on some tasks, its performance in other areas remained uneven. Method: The authors update a standardized BioNLP benchmark and evaluate GPT-5 and GPT-4o under zero-, one-, and five-shot prompting, comparing against prior results for GPT-4, GPT-3.5, and LLaMA-2-13B. Result: GPT-5's macro-average score under five-shot prompting rose to 0.557, exceeding GPT-4 (0.506) and GPT-4o (0.508); it reached 94.1% accuracy on MedQA and 0.734 on PubMedQA, on par with supervised systems. Conclusion: GPT-5 performs strongly on biomedical NLP tasks and reaches deployment-ready performance on question answering in particular, but it still trails domain-specific models on precision-critical extraction and evidence-dense summarization. Abstract: The rapid expansion of biomedical literature has heightened the need for scalable natural language processing (NLP) solutions. While GPT-4 substantially narrowed the gap with task-specific systems, especially in question answering, its performance across other domains remained uneven. We updated a standardized BioNLP benchmark to evaluate GPT-5 and GPT-4o under zero-, one-, and five-shot prompting across 12 datasets spanning six task families: named entity recognition, relation extraction, multi-label document classification, question answering, text summarization, and text simplification. Using fixed prompt templates, identical decoding parameters, and batch inference, we report primary metrics per dataset and include prior results for GPT-4, GPT-3.5, and LLaMA-2-13B for comparison. GPT-5 achieved the strongest overall benchmark performance, with macro-average scores rising to 0.557 under five-shot prompting versus 0.506 for GPT-4 and 0.508 for GPT-4o. On MedQA, GPT-5 reached 94.1% accuracy, exceeding the previous supervised state of the art by over fifty points, and attained parity with supervised systems on PubMedQA (0.734). In extraction tasks, GPT-5 delivered major gains in chemical NER (0.886 F1) and ChemProt relation extraction (0.616 F1), outperforming GPT-4 and GPT-4o, though summarization and disease NER still lagged behind domain-specific baselines. 
These results establish GPT-5 as a general-purpose model now offering deployment-ready performance for reasoning-oriented biomedical QA, while precision-critical extraction and evidence-dense summarization continue to favor fine-tuned or hybrid approaches. The benchmark delineates where simple prompting suffices and where retrieval-augmented or planning-based scaffolds are likely required, providing actionable guidance for BioNLP system design as frontier models advance.

[9] Can Multiple Responses from an LLM Reveal the Sources of Its Uncertainty?

Yang Nan,Pengfei He,Ravi Tandon,Han Xu

Main category: cs.CL

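TL;DR: This paper proposes using an auxiliary LLM to analyze patterns of disagreement among a target LLM's responses, which effectively diagnoses sources of uncertainty such as input ambiguity or missing knowledge; the approach is validated on multiple datasets.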

Details

Motivation: Although large language models (LLMs) have achieved breakthroughs across many domains, their outputs can still be unreliable or misleading, which poses critical challenges for real-world applications. Prior work has focused on quantifying model uncertainty, but little work has addressed diagnosing its source. Method: Multiple responses are collected from a target LLM, and an auxiliary LLM analyzes their patterns of disagreement to infer the likely source of uncertainty, such as ambiguity in the input question or missing knowledge; where knowledge is missing, it also identifies the specific missing facts or concepts. Result: The method is validated on AmbigQA, OpenBookQA, and MMLU-Pro, confirming its generality in diagnosing distinct uncertainty sources. The diagnoses support targeted manual interventions that improve LLM performance and reliability. Conclusion: Using an auxiliary LLM to analyze patterns of disagreement among a target LLM's multiple responses can effectively diagnose sources of uncertainty, such as input ambiguity or lack of knowledge; the approach's generality and effectiveness are validated on multiple datasets. Abstract: Large language models (LLMs) have delivered significant breakthroughs across diverse domains but can still produce unreliable or misleading outputs, posing critical challenges for real-world applications. While many recent studies focus on quantifying model uncertainty, relatively little work has been devoted to diagnosing the source of uncertainty. In this study, we show that, when an LLM is uncertain, the patterns of disagreement among its multiple generated responses contain rich clues about the underlying cause of uncertainty. To illustrate this point, we collect multiple responses from a target LLM and employ an auxiliary LLM to analyze their patterns of disagreement. The auxiliary model is tasked to reason about the likely source of uncertainty, such as whether it stems from ambiguity in the input question, a lack of relevant knowledge, or both. In cases involving knowledge gaps, the auxiliary model also identifies the specific missing facts or concepts contributing to the uncertainty. In our experiment, we validate our framework on AmbigQA, OpenBookQA, and MMLU-Pro, confirming its generality in diagnosing distinct uncertainty sources. Such diagnosis shows the potential for relevant manual interventions that improve LLM performance and reliability.

[10] Emotionally-Aware Agents for Dispute Resolution

Sushrita Rakshit,James Hale,Kushal Chawla,Jeanne M. Brett,Jonathan Gratch

Main category: cs.CL

TL;DR: This paper investigates the role of emotional expressions in conflict resolution using advanced text emotion recognition methods, showing that such approaches can enhance understanding of conflict dynamics and aid in developing systems to manage disputes effectively.

Details

Motivation: The paper is motivated by the need to understand how emotional expressions influence conflict dynamics, particularly in the context of dispute resolution where emotions are typically stronger and social processes are different from negotiations. Method: The paper uses a large corpus of buyer-seller dispute dialogues and employs large-language models to analyze the impact of emotional expressions on conflict outcomes. Result: The research demonstrates that large-language models offer greater explanatory power for emotion intensity annotation and align better with human annotator decisions, supporting existing theoretical models on the role of emotions in conflict. Conclusion: This paper concludes that emotional expressions play a significant role in conflict escalation and resolution, and that agent-based systems can be effective in managing disputes by recognizing and mitigating emotional escalation. Abstract: In conflict, people use emotional expressions to shape their counterparts' thoughts, feelings, and actions. This paper explores whether automatic text emotion recognition offers insight into this influence in the context of dispute resolution. Prior work has shown the promise of such methods in negotiations; however, disputes evoke stronger emotions and different social processes. We use a large corpus of buyer-seller dispute dialogues to investigate how emotional expressions shape subjective and objective outcomes. We further demonstrate that large-language models yield considerably greater explanatory power than previous methods for emotion intensity annotation and better match the decisions of human annotators. Findings support existing theoretical models for how emotional expressions contribute to conflict escalation and resolution and suggest that agent-based systems could be useful in managing disputes by recognizing and potentially mitigating emotional escalation.

[11] Just-in-time and distributed task representations in language models

Yuxuan Li,Declan Campbell,Stephanie C. Y. Chan,Andrew Kyle Lampinen

Main category: cs.CL

TL;DR: When language models learn new tasks in context, their transferrable task representations are local in both time and semantics.

Details

Motivation: To investigate how language models form representations of new tasks through in-context learning, without weight updates. Method: The study examines how transferrable task representations evolve over the course of the context and analyzes their properties. Result: Transferrable task representations evolve in non-monotonic and sporadic ways and exhibit strong locality along the sequence dimension. Conclusion: Language models exhibit a just-in-time computational process when adapting to new evidence and learning new tasks, reflected in the temporal and semantic locality of transferrable task representations. Abstract: Many of language models' impressive capabilities originate from their in-context learning: based on instructions or examples, they can infer and perform new tasks without weight updates. In this work, we investigate when representations for new tasks are formed in language models, and how these representations change over the course of context. We focus on "transferrable" task representations -- vector representations that can restore task context in another instance of the model, even without the full prompt. We show that these representations evolve in non-monotonic and sporadic ways, and are distinct from a more inert representation of high-level task categories that persists throughout the context. Specifically, models often condense multiple evidence into these transferrable task representations, which align well with the performance improvement based on more examples in the context. However, this accrual process exhibits strong locality along the sequence dimension, coming online only at certain tokens -- despite task identity being reliably decodable throughout the context. Moreover, these local but transferrable task representations tend to capture minimal "task scopes", such as a semantically-independent subtask, and models rely on more temporally-distributed representations to support longer and composite tasks. This two-fold locality (temporal and semantic) underscores a kind of just-in-time computational process underlying language models' ability to adapt to new evidence and learn new tasks on the fly.

[12] Enhancing LLM Efficiency: Targeted Pruning for Prefill-Decode Disaggregation in Inference

Hao Zhang,Mengsi Lyu,Yulong Ao,Yonghua Lin

Main category: cs.CL

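TL;DR: This paper proposes a new pruning method for large language models that targets prefill-decode disaggregated inference, significantly improving inference efficiency and reducing data transmission bandwidth consumption.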

Details

Motivation: Existing model pruning methods often ignore the characteristics of prefill-decode disaggregation in practice; this paper targets those characteristics to improve LLM inference efficiency and reduce computational and memory costs. Method: The approach constructs pruning and distillation sets, performs iterative block removal independently for the prefill and decode stages, and introduces a token-aware cache pruning mechanism that, during decode, selectively reuses KV Cache entries for the first and last token sequences in selected layers. Result: Experiments show strong performance in both PD-disaggregated and non-disaggregated settings. Under the default settings, the method achieves a 20.56% inference speedup and a 4.95-fold reduction in data transmission bandwidth consumption. Conclusion: The paper proposes a novel pruning method for prefill-decode disaggregated inference that performs block removal independently for the prefill and decode stages and introduces a token-aware cache pruning mechanism, reducing communication costs and enabling more precise and efficient block and KV Cache pruning. Abstract: Large Language Models (LLMs) demonstrate exceptional capabilities across various tasks, but their deployment is constrained by high computational and memory costs. Model pruning provides an effective means to alleviate these demands. However, existing methods often ignore the characteristics of prefill-decode (PD) disaggregation in practice. In this paper, we propose a novel pruning method for PD disaggregation inference, enabling more precise and efficient block and KV Cache pruning. Our approach constructs pruning and distillation sets to perform iterative block removal independently for the prefill and decode stages, obtaining better pruning solutions. Moreover, we introduce a token-aware cache pruning mechanism that retains all KV Cache in the prefill stage but selectively reuses entries for the first and last token sequences in selected layers during decode, reducing communication costs with minimal overhead. Extensive experiments demonstrate that our approach consistently achieves strong performance in both PD disaggregation and PD unified settings without disaggregation. Under the default settings, our method achieves a 20.56% inference speedup and a 4.95 times reduction in data transmission bandwidth consumption.

[13] Evaluating Large Language Models for Financial Reasoning: A CFA-Based Benchmark Study

Xuan Yao,Qianteng Wang,Xinbo Liu,Ke-Wei Huang

Main category: cs.CL

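TL;DR: This paper evaluates large language models with different design priorities on CFA exams and proposes a Retrieval-Augmented Generation pipeline to improve reasoning accuracy in the financial domain.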

Details

Motivation: Systematic evaluation of large language models in financial applications remains limited, particularly in professional financial certification settings. Method: State-of-the-art language models are evaluated on 1,560 CFA mock exam questions, and a Retrieval-Augmented Generation (RAG) pipeline built on official CFA curriculum content is proposed. Result: Reasoning-oriented models perform best in zero-shot settings, and the RAG pipeline substantially improves performance in complex scenarios. Conclusion: The findings provide evidence-based guidance for deploying language models in finance, helping practitioners with model selection and performance optimization. Abstract: The rapid advancement of large language models presents significant opportunities for financial applications, yet systematic evaluation in specialized financial contexts remains limited. This study presents the first comprehensive evaluation of state-of-the-art LLMs using 1,560 multiple-choice questions from official mock exams across Levels I-III of CFA, one of the most rigorous professional certifications globally, which mirrors real-world financial analysis complexity. We compare models distinguished by core design priorities: multi-modal and computationally powerful, reasoning-specialized and highly accurate, and lightweight efficiency-optimized. We assess models under zero-shot prompting and through a novel Retrieval-Augmented Generation pipeline that integrates official CFA curriculum content. The RAG system achieves precise domain-specific knowledge retrieval through hierarchical knowledge organization and structured query generation, significantly enhancing reasoning accuracy in professional financial certification evaluation. Results reveal that reasoning-oriented models consistently outperform others in zero-shot settings, while the RAG pipeline provides substantial improvements particularly for complex scenarios. Comprehensive error analysis identifies knowledge gaps as the primary failure mode, with minimal impact from text readability. These findings provide actionable insights for LLM deployment in finance, offering practitioners evidence-based guidance for model selection and cost-performance optimization.

David Berghaus,Armin Berger,Lars Hillebrand,Kostadin Cvejoski,Rafet Sifa

Main category: cs.CL

TL;DR: This paper benchmarks multi-modal models for document processing, finding that direct image processing generally works better than structured parsing.

Details

Motivation: To determine the best models and processing strategies for automated document systems by comparing performance across different multi-modal large language models and methods. Method: Benchmarked eight multi-modal models from GPT-5, Gemini 2.5, and open-source Gemma 3 families on three invoice datasets using zero-shot prompting; compared direct image processing with structured parsing. Result: Direct image processing outperformed structured parsing, with results varying by model type and document features. Conclusion: Native image processing typically surpasses structured parsing approaches in multi-modal large language models, offering insights for model and strategy selection in automated document systems. Abstract: This paper benchmarks eight multi-modal large language models from three families (GPT-5, Gemini 2.5, and open-source Gemma 3) on three diverse openly available invoice document datasets using zero-shot prompting. We compare two processing strategies: direct image processing using multi-modal capabilities and a structured parsing approach converting documents to markdown first. Results show native image processing generally outperforms structured approaches, with performance varying across model types and document characteristics. This benchmark provides insights for selecting appropriate models and processing strategies for automated document systems. Our code is available online.

[15] COCORELI: Cooperative, Compositional Reconstitution & Execution of Language Instructions

Swarnadeep Bhar,Omar Naim,Eleni Metheniti,Bastien Navarri,Loïc Cabannes,Morteza Ezzabady,Nicholas Asher

Main category: cs.CL

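TL;DR: COCORELI is a hybrid agent framework designed to address the limitations of large language models in following complex instructions, minimizing hallucination, and spatial reasoning.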

Details

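Motivation: Large language models have limitations on tasks that require following complex instructions, minimizing hallucination, and spatial reasoning, so a new framework is needed to overcome them. Method: COCORELI integrates medium-sized LLM agents with novel abstraction mechanisms and a discourse module for parsing instructions, learning dynamic, high-level representations of the environment in context. Result: In experiments on natural collaborative construction tasks, COCORELI outperforms single-LLM chain-of-thought (CoT) and agentic LLM systems that use larger LLMs; it largely avoids hallucinations, identifies missing information, asks for clarifications, and updates its learned objects. Conclusion: COCORELI offers an effective solution for applying LLMs to complex tasks, and its abstraction abilities extend beyond the environment domain. Abstract: We present COCORELI, a hybrid agent framework designed to tackle the limitations of large language models (LLMs) in tasks requiring: following complex instructions, minimizing hallucination, and spatial reasoning. COCORELI integrates medium-sized LLM agents with novel abstraction mechanisms and a discourse module to parse instructions to in-context learn dynamic, high-level representations of the environment. Experiments on natural collaborative construction tasks show that COCORELI outperforms single-LLM CoT and agentic LLM systems, all using larger LLMs. It manages to largely avoid hallucinations, identify missing information, ask for clarifications, and update its learned objects. COCORELI's abstraction abilities extend beyond ENVIRONMENT, as shown in the ToolBench API completion task.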

[16] MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification

Alice Schiavone,Marco Fraccaro,Lea Marie Pehrson,Silvia Ingala,Rasmus Bonnevie,Michael Bachmann Nielsen,Vincent Beliveau,Melanie Ganz,Desmond Elliott

Main category: cs.CL

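TL;DR: MOSAIC is a multilingual, taxonomy-agnostic, and computationally efficient approach to radiological report classification. It is built on MedGemma-4B, can be deployed on consumer-grade GPUs, and approaches expert-level performance across multiple datasets.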

Details

Motivation: Existing approaches to radiological report classification face critical limitations: rule-based methods struggle with linguistic variability, supervised models require large annotated datasets, and LLM-based systems depend on closed-source or resource-intensive models. Moreover, current solutions are largely restricted to English and to single-modality, single-taxonomy datasets. Method: MOSAIC is built on a compact open-access language model (MedGemma-4B) and supports zero-/few-shot prompting and lightweight fine-tuning, enabling deployment on consumer-grade GPUs. Result: MOSAIC is evaluated on seven datasets in English, Spanish, French, and Danish. It achieves a mean macro F1 score of 88 across five chest X-ray datasets, approaching or exceeding expert-level performance. With data augmentation, as few as 80 annotated samples reach a weighted F1 score of 82 on Danish reports, compared to 86 with the full 1600-sample training set. Conclusion: MOSAIC offers a practical alternative to large or proprietary LLMs for radiological report classification in clinical settings, while remaining multilingual, taxonomy-agnostic, and computationally efficient. Abstract: Radiology reports contain rich clinical information that can be used to train imaging models without relying on costly manual annotation. However, existing approaches face critical limitations: rule-based methods struggle with linguistic variability, supervised models require large annotated datasets, and recent LLM-based systems depend on closed-source or resource-intensive models that are unsuitable for clinical use. Moreover, current solutions are largely restricted to English and single-modality, single-taxonomy datasets. We introduce MOSAIC, a multilingual, taxonomy-agnostic, and computationally efficient approach for radiological report classification. Built on a compact open-access language model (MedGemma-4B), MOSAIC supports both zero-/few-shot prompting and lightweight fine-tuning, enabling deployment on consumer-grade GPUs. We evaluate MOSAIC across seven datasets in English, Spanish, French, and Danish, spanning multiple imaging modalities and label taxonomies. The model achieves a mean macro F1 score of 88 across five chest X-ray datasets, approaching or exceeding expert-level performance, while requiring only 24 GB of GPU memory. With data augmentation, as few as 80 annotated samples are sufficient to reach a weighted F1 score of 82 on Danish reports, compared to 86 with the full 1600-sample training set. MOSAIC offers a practical alternative to large or proprietary LLMs in clinical settings. Code and models are open-source. We invite the community to evaluate and extend MOSAIC on new languages, taxonomies, and modalities.

[17] RECAP: REwriting Conversations for Intent Understanding in Agentic Planning

Kushan Mitra,Dan Zhang,Hannah Kim,Estevam Hruschka

Main category: cs.CL

TL;DR: RECAP is introduced as a new benchmark to evaluate intent rewriting in conversational assistants, showing that improved intent rewriting enhances agent planning in open-domain dialogue systems.

Details

Motivation: User intent is often ambiguous, underspecified, or dynamic in real-world dialogues, making it challenging for traditional classification-based approaches to generalize and provide accurate planning in conversational assistants. Method: The researchers introduced RECAP, a new benchmark for evaluating intent rewriting, and developed an LLM-based evaluator along with a prompt-based rewriting approach and fine-tuned DPO-based rewriters. Result: The prompt-based rewriting approach outperformed baselines, and fine-tuning two DPO-based rewriters provided additional utility gains in evaluating and advancing intent rewriting. Conclusion: The study concludes that intent rewriting is a critical and manageable component for enhancing agent planning in open-domain dialogue systems. Abstract: Understanding user intent is essential for effective planning in conversational assistants, particularly those powered by large language models (LLMs) coordinating multiple agents. However, real-world dialogues are often ambiguous, underspecified, or dynamic, making intent detection a persistent challenge. Traditional classification-based approaches struggle to generalize in open-ended settings, leading to brittle interpretations and poor downstream planning. We propose RECAP (REwriting Conversations for Agent Planning), a new benchmark designed to evaluate and advance intent rewriting, reframing user-agent dialogues into concise representations of user goals. RECAP captures diverse challenges such as ambiguity, intent drift, vagueness, and mixed-goal conversations. Alongside the dataset, we introduce an LLM-based evaluator that assesses planning utility given the rewritten intent. Using RECAP, we develop a prompt-based rewriting approach that outperforms baselines. We further demonstrate that fine-tuning two DPO-based rewriters yields additional utility gains. 
Our results highlight intent rewriting as a critical and tractable component for improving agent planning in open-domain dialogue systems.

[18] SpeechLLM: Unified Speech and Language Model for Enhanced Multi-Task Understanding in Low Resource Settings

Jaekwon Yoo,Kunal Chandiramani,Divya Tadimeti,Abenezer Girma,Chandra Dhir

Main category: cs.CL

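TL;DR: The paper proposes a parameter-efficient solution for integrating a speech encoder with a large language model, using an adapter and LLM-based synthetic data annotation to achieve significant gains on several speech tasks.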

Details

Motivation: Integrating a speech encoder with an LLM requires substantial data and resources, and practical use cases are limited by their insufficient availability, so an efficient solution is needed. Method: The paper proposes a parameter-efficient adapter that converts speech embeddings into LLM-compatible tokens, uses an LLM-based synthetic dataset annotation technique to reduce labeling costs, and additionally applies a classifier regularizer and optimizes the LLM with Low-Rank Adaptation (LoRA). Result: The adapter achieves a 26% relative word error rate (WER) improvement on the LibriSpeech ASR task, a 6.3% relative F1 score increase on the NER task, and a 32% relative F1 score increase on the SA task; the advanced techniques further improve the Spoken Language Understanding Evaluation (SLUE) score by 6.6% and 9.5%. Conclusion: With a parameter-efficient adapter and advanced techniques such as a classifier regularizer and LoRA optimization of the LLM, the paper achieves significant performance gains on speech recognition, named entity recognition, and sentiment analysis. Abstract: While integrating speech encoder with LLM requires substantial data and resources, use cases face limitations due to insufficient availability. To address this, we propose a solution with a parameter-efficient adapter that converts speech embeddings into LLM-compatible tokens, focusing on end-to-end automatic speech recognition (ASR), named entity recognition (NER), and sentiment analysis (SA). To reduce labeling costs, we employ an LLM-based synthetic dataset annotation technique. The proposed adapter, using 7x fewer trainable parameters, achieves significant performance gains: a 26% relative Word Error Rate (WER) improvement on the LibriSpeech ASR task, a 6.3% relative F1 score increase on the NER task, and a 32% relative F1 score boost on the SA task. Moreover, using advanced techniques such as adding a classifier regularizer and optimizing the LLM with Low-Rank Adaptation (LoRA) yields notable performance gains, with Spoken Language Understanding Evaluation (SLUE) score improvement of 6.6% and 9.5%.

[19] Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling

Shengyin Sun,Yiming Li,Xing Li,Yingzhao Lian,Weizhe Lin,Hui-Ling Zhen,Zhiyuan Yang,Chen Chen,Xianzhi Yu,Mingxuan Yuan,Chen Ma

Main category: cs.CL

TL;DR: This paper introduces a benchmark for evaluating speculative decoding methods in LLM test-time scaling, showing that n-gram-based approaches effectively accelerate reasoning by handling repetitive patterns.

Details

Motivation: Test-time scaling often leads to redundant reasoning traces and high computational overhead. While speculative decoding is a promising solution, its effectiveness in structured, repetition-rich scenarios like test-time scaling is not well studied. Method: The researchers introduced a comprehensive benchmark to evaluate speculative decoding methods for LLM test-time scaling. They compared three categories of methods—model-based, training-based, and n-gram-based—using consistent protocols across paradigms like Best-of-N sampling and multi-round thinking. Result: Experiments showed that simple n-gram-based methods are effective at capturing repetitive reasoning patterns, offering significant acceleration potential. This suggests that combining n-gram-based methods with other approaches can improve both efficiency and reasoning quality. Conclusion: The study concludes that n-gram-based speculative decoding methods show unique potential in accelerating test-time scaling by capturing repetitive patterns, and integrating these with model-based or training-based methods can balance acceleration for both repetitive and diverse reasoning. Abstract: Test-time scaling has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs) by allocating additional computational resources during inference. However, this paradigm is inherently inefficient due to the generation of redundant and repetitive reasoning traces, leading to significant computational overhead. Speculative decoding offers a promising avenue for mitigating this inefficiency, yet its efficacy in the structured, repetition-rich context of test-time scaling remains largely unexplored. To bridge this gap, we introduce the first comprehensive benchmark designed to evaluate speculative decoding methods for accelerating LLM test-time scaling. 
Our benchmark provides consistent experimental protocols across representative test-time scaling paradigms (e.g., Best-of-N sampling and multi-round thinking), enabling a fair comparison of three major categories of speculative decoding: model-based, training-based, and n-gram-based methods. Extensive experiments reveal that simple n-gram-based methods effectively capture repetitive patterns, demonstrating unique potential in accelerating test-time scaling. This phenomenon demonstrates the value of integrating n-gram-based methods with model-based or training-based approaches to balance acceleration for both repetitive and diverse reasoning in test-time scaling. We hope this benchmark spurs further research on speculative decoding for test-time scaling, enabling faster and more practical reasoning in LLMs through better handling of repetitive and diverse reasoning paths.
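The abstract does not say which n-gram methods the benchmark covers, so the following is only a minimal sketch of the general idea, assuming a lookup-style drafter: find the most recent earlier occurrence of the last `n` tokens in the context and propose the tokens that followed it as the draft. The function name `ngram_draft` and its parameters are hypothetical, and the verification step, in which the target model accepts or rejects draft tokens, is omitted.

```python
from typing import List


def ngram_draft(tokens: List[str], n: int = 3, max_draft: int = 5) -> List[str]:
    """Propose draft tokens by matching the last n tokens against earlier context.

    Illustrative lookup-style drafter (an assumption, not the benchmark's code).
    Returns up to max_draft tokens that followed the most recent earlier match,
    or an empty list when no match exists.
    """
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Scan backwards, skipping the suffix's own position at len(tokens) - n.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + max_draft]
    return []


# A repetitive reasoning trace: the suffix "a b c" was seen before.
trace = "a b c d a b c".split()
print(ngram_draft(trace, n=3, max_draft=2))  # ['d', 'a']
```

Because the draft is copied from the context itself, this kind of drafter needs no extra model, which is consistent with the abstract's observation that simple n-gram methods suit repetition-rich reasoning traces.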

[20] ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute

Hao Wen,Yifan Su,Feifei Zhang,Yunxin Liu,Yunhao Liu,Ya-Qin Zhang,Yuanchun Li

Main category: cs.CL

TL;DR: ParaThinker introduces parallel reasoning in LLMs to overcome the 'Tunnel Vision' limitation of sequential processing, achieving better accuracy with minimal additional computational cost.

Details

Motivation: Traditional test-time compute scaling strategies hit a performance ceiling due to 'Tunnel Vision' where imperfect initial reasoning steps lock the model into suboptimal paths. Method: Developed an end-to-end framework called ParaThinker that enables LLMs to generate and synthesize multiple, diverse reasoning paths in parallel. Result: ParaThinker achieved significant accuracy improvements over sequential LLMs (12.3% for 1.5B and 7.5% for 7B models on average with 8 parallel paths), with minimal latency overhead (7.1%). Conclusion: ParaThinker's parallel thought process effectively overcomes the 'Tunnel Vision' problem in traditional LLMs, providing a more efficient way to improve reasoning performance compared to sequential computation scaling. Abstract: Recent advances in Large Language Models (LLMs) have been driven by test-time compute scaling - a strategy that improves reasoning by generating longer, sequential thought processes. While effective, this approach encounters a significant bottleneck as computation increases, where further computation offers only marginal performance gains. We argue this ceiling is not an inherent limit of the model's capability but a flaw in the scaling strategy itself, a phenomenon we term "Tunnel Vision", where a model's imperfect initial steps lock it into a suboptimal reasoning path. To overcome this, we introduce a new scaling paradigm: native thought parallelism. We present ParaThinker, an end-to-end framework that trains an LLM to generate multiple, diverse reasoning paths in parallel and synthesize them into a superior final answer. By exploring different lines of thoughts simultaneously, ParaThinker effectively sidesteps the Tunnel Vision issue and unlocks the model's latent reasoning potential. Our approach demonstrates that scaling compute in parallel (width) is a more effective and efficient way to superior reasoning than simply scaling sequentially (depth). 
On challenging reasoning benchmarks, ParaThinker achieves substantial accuracy improvements over sequential LLMs (12.3% for 1.5B and 7.5% for 7B models on average with 8 parallel paths), while adding only negligible latency overhead (7.1%). This enables smaller models to surpass much larger counterparts and establishes parallel thinking as a critical, efficient dimension for scaling future LLMs.

[21] Training Text-to-Molecule Models with Context-Aware Tokenization

Seojin Kim,Hyeontae Song,Jaehyun Nam,Jinwoo Shin

Main category: cs.CL

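TL;DR: This paper proposes CAMT5, a new text-to-molecule model that uses substructure-level tokenization and an importance-based training strategy to better capture molecular semantics and improve generation performance.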

Details

Motivation: Existing text-to-molecule models rely on atom-level tokenization, which focuses mainly on local connectivity and limits their ability to capture the global structural context within molecules. Method: The authors propose a substructure-level tokenization scheme and, building on it, an importance-based training strategy that prioritizes key substructures. Result: CAMT5 performs strongly across text-to-molecule generation tasks and outperforms state-of-the-art methods using only 2% of the training tokens. The authors also propose a simple yet effective ensemble strategy that further boosts generation performance. Conclusion: CAMT5 offers a more effective way to capture molecular semantics; its substructure-level tokenization and importance-based training strategy improve text-to-molecule generation performance. Abstract: Recently, text-to-molecule models have shown great potential across various chemical applications, e.g., drug-discovery. These models adapt language models to molecular data by representing molecules as sequences of atoms. However, they rely on atom-level tokenizations, which primarily focus on modeling local connectivity, thereby limiting the ability of models to capture the global structural context within molecules. To tackle this issue, we propose a novel text-to-molecule model, coined Context-Aware Molecular T5 (CAMT5). Inspired by the significance of the substructure-level contexts in understanding molecule structures, e.g., ring systems, we introduce substructure-level tokenization for text-to-molecule models. Building on our tokenization scheme, we develop an importance-based training strategy that prioritizes key substructures, enabling CAMT5 to better capture the molecular semantics. Extensive experiments verify the superiority of CAMT5 in various text-to-molecule generation tasks. Intriguingly, we find that CAMT5 outperforms the state-of-the-art methods using only 2% of training tokens. In addition, we propose a simple yet effective ensemble strategy that aggregates the outputs of text-to-molecule models to further boost the generation performance. Code is available at https://github.com/Songhyeontae/CAMT5.git.

[22] An End-to-End System for Culturally-Attuned Driving Feedback using a Dual-Component NLG Engine

Iniakpokeikiye Peter Thompson,Yi Dewei,Reiter Ehud

Main category: cs.CL

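TL;DR: This paper presents an end-to-end mobile system that delivers culturally-attuned safe driving feedback to drivers in Nigeria, a low-resource environment with infrastructural challenges.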

Details

Motivation: Nigeria, a low-resource environment with significant infrastructural challenges, needs a system that delivers culturally-attuned safe driving feedback to improve local driving behaviour and road safety. Method: The core of the system is a novel dual-component Natural Language Generation (NLG) engine that provides legally-grounded safety tips and persuasive behavioural reports. The architecture includes an automatic trip detection service, on-device behaviour analysis, and an NLG pipeline that uses a two-step reflection process to produce high-quality feedback. The system also integrates a specialized machine learning model for detecting alcohol-influenced driving and was deployed in a pilot with 90 drivers. Result: The pilot deployment demonstrates the viability of the approach and yields initial results on detected unsafe driving behaviours. Conclusion: This work provides a framework for applying data-to-text and AI systems to achieve social good. Abstract: This paper presents an end-to-end mobile system that delivers culturally-attuned safe driving feedback to drivers in Nigeria, a low-resource environment with significant infrastructural challenges. The core of the system is a novel dual-component Natural Language Generation (NLG) engine that provides both legally-grounded safety tips and persuasive, theory-driven behavioural reports. We describe the complete system architecture, including an automatic trip detection service, on-device behaviour analysis, and a sophisticated NLG pipeline that leverages a two-step reflection process to ensure high-quality feedback. The system also integrates a specialized machine learning model for detecting alcohol-influenced driving, a key local safety issue. The architecture is engineered for robustness against intermittent connectivity and noisy sensor data. A pilot deployment with 90 drivers demonstrates the viability of our approach, and initial results on detected unsafe behaviours are presented. This work provides a framework for applying data-to-text and AI systems to achieve social good.

[23] No Clustering, No Routing: How Transformers Actually Process Rare Tokens

Jing Liu

Main category: cs.CL

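TL;DR: The study shows that rare token processing in large language models arises through distributed, training-driven neuron differentiation rather than architectural modularity, preserving context-sensitive flexibility while achieving adaptive capacity allocation.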

Details

Motivation: Large language models struggle to predict rare tokens, but the mechanisms behind their specialization remain unclear. Prior work identified specialized "plateau" neurons, but their functional organization is unknown. Method: The study uses neuron influence analyses, graph-based clustering, and attention head ablations in GPT-2 XL and Pythia models. Result: (1) Rare token processing requires additional plateau neurons beyond the power-law regime sufficient for common tokens, forming dual computational regimes; (2) plateau neurons are spatially distributed rather than forming modular clusters; (3) attention mechanisms exhibit no preferential routing to specialist neurons. Conclusion: Rare token specialization arises through distributed, training-driven differentiation rather than architectural modularity, preserving context-sensitive flexibility while achieving adaptive capacity allocation. Abstract: Large language models struggle with rare token prediction, yet the mechanisms driving their specialization remain unclear. Prior work identified specialized "plateau" neurons for rare tokens following distinctive three-regime influence patterns \cite{liu2025emergent}, but their functional organization is unknown. We investigate this through neuron influence analyses, graph-based clustering, and attention head ablations in GPT-2 XL and Pythia models. Our findings show that: (1) rare token processing requires additional plateau neurons beyond the power-law regime sufficient for common tokens, forming dual computational regimes; (2) plateau neurons are spatially distributed rather than forming modular clusters; and (3) attention mechanisms exhibit no preferential routing to specialists. These results demonstrate that rare token specialization arises through distributed, training-driven differentiation rather than architectural modularity, preserving context-sensitive flexibility while achieving adaptive capacity allocation.

[24] Discrete Prompt Tuning via Recursive Utilization of Black-box Multimodal Large Language Model for Personalized Visual Emotion Recognition

Ryo Takahashi,Naoki Saito,Keisuke Maeda,Takahiro Ogawa,Miki Haseyama

Main category: cs.CL

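TL;DR: This study addresses personalized visual emotion recognition (VER) and proposes a discrete prompt tuning method to improve the use of multimodal large language models (MLLMs) for personalized VER.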

Details

Motivation: Existing multimodal large language models favor majority viewpoints and familiar patterns in visual emotion recognition, which limits their performance on personalized emotion recognition and calls for improvement to meet practical needs. Method: Inspired by human prompt engineering, the method applies discrete prompt tuning: it selects the best natural language representation from the generated prompts and uses it to update the prompt, aiming at accurate personalized visual emotion recognition. Result: The proposed method is presented as improving MLLM performance on personalized visual emotion recognition and as countering the bias of existing models toward majority emotion judgments. Conclusion: The study indicates that discrete prompt tuning can improve the accuracy and applicability of MLLMs for personalized visual emotion recognition, offering a new option for practical applications. Abstract: Visual Emotion Recognition (VER) is an important research topic due to its wide range of applications, including opinion mining and advertisement design. Extending this capability to recognize emotions at the individual level further broadens its potential applications. Recently, Multimodal Large Language Models (MLLMs) have attracted increasing attention and demonstrated performance comparable to that of conventional VER methods. However, MLLMs are trained on large and diverse datasets containing general opinions, which causes them to favor majority viewpoints and familiar patterns. This tendency limits their performance in personalized VER, which is crucial for practical and real-world applications, and indicates a key area for improvement. To address this limitation, the proposed method employs discrete prompt tuning inspired by the process of humans' prompt engineering to adapt the VER task to each individual. Our method selects the best natural language representation from the generated prompts and uses it to update the prompt for the realization of accurate personalized VER.

[25] Energy Landscapes Enable Reliable Abstention in Retrieval-Augmented Large Language Models for Healthcare

Ravi Shankar,Sheng Wong,Lin Li,Magdalena Bachmann,Alex Silverthorne,Beth Albert,Gabriel Davis Jones

Main category: cs.CL

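TL;DR: This paper proposes an energy-based model (EBM) approach for reliable abstention in retrieval-augmented generation (RAG) systems, particularly in safety-critical domains such as women's health.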

Details

Motivation: In safety-critical domains, incorrect answers can have serious consequences, so a reliable abstention mechanism is needed. Method: The study trains an energy-based model (EBM) over a dense semantic corpus of 2.6M guideline-derived questions to decide when to generate an answer and when to abstain. Result: On semantically hard cases the EBM reaches an AUROC of 0.961 versus 0.950 for softmax, and reduces FPR@95 to 0.235 versus 0.331 for softmax. Conclusion: Energy-based abstention scoring provides a more reliable confidence signal than probability-based softmax confidence, offering a scalable and interpretable foundation for safe RAG systems. Abstract: Reliable abstention is critical for retrieval-augmented generation (RAG) systems, particularly in safety-critical domains such as women's health, where incorrect answers can lead to harm. We present an energy-based model (EBM) that learns a smooth energy landscape over a dense semantic corpus of 2.6M guideline-derived questions, enabling the system to decide when to generate or abstain. We benchmark the EBM against a calibrated softmax baseline and a k-nearest neighbour (kNN) density heuristic across both easy and hard abstention splits, where hard cases are semantically challenging near-distribution queries. The EBM achieves superior abstention performance on semantically hard cases, reaching AUROC 0.961 versus 0.950 for softmax, while also reducing FPR@95 (0.235 vs 0.331). On easy negatives, performance is comparable across methods, but the EBM's advantage becomes most pronounced in safety-critical hard distributions. A comprehensive ablation with controlled negative sampling and fair data exposure shows that robustness stems primarily from the energy scoring head, while the inclusion or exclusion of specific negative types (hard, easy, mixed) sharpens decision boundaries but is not essential for generalisation to hard cases. These results demonstrate that energy-based abstention scoring offers a more reliable confidence signal than probability-based softmax confidence, providing a scalable and interpretable foundation for safe RAG systems.
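For readers unfamiliar with the FPR@95 metric quoted in this entry, here is a minimal sketch of how it is commonly computed: fix the threshold at which 95% of positives are accepted, then measure the fraction of negatives that also pass. The abstract does not state which class is treated as positive or how scores are oriented, so the convention below (higher score means "answerable") and the function name `fpr_at_tpr` are assumptions; an energy score where lower means more in-distribution would need its sign flipped first.

```python
import math
from typing import Sequence


def fpr_at_tpr(pos_scores: Sequence[float],
               neg_scores: Sequence[float],
               target_tpr: float = 0.95) -> float:
    """False-positive rate at the threshold that accepts >= target_tpr of positives.

    Assumed convention: a higher score means "more likely positive".
    """
    ranked = sorted(pos_scores, reverse=True)
    # Number of positives that must be accepted; the epsilon guards float round-off.
    k = max(1, math.ceil(target_tpr * len(ranked) - 1e-9))
    threshold = ranked[k - 1]
    accepted_negatives = sum(1 for s in neg_scores if s >= threshold)
    return accepted_negatives / len(neg_scores)


# Toy scores (made up for illustration, not from the paper).
answerable = [0.9, 0.8, 0.7, 0.6]
should_abstain = [0.85, 0.5, 0.3, 0.1]
print(fpr_at_tpr(answerable, should_abstain, target_tpr=0.5))  # 0.25
```

A lower value is better: it means fewer queries that should trigger abstention slip through while 95% of answerable queries are still served.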

[26] DecMetrics: Structured Claim Decomposition Scoring for Factually Consistent LLM Outputs

Minghui Huang

Main category: cs.CL

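TL;DR: This paper proposes DecMetrics, three new metrics for evaluating and optimizing claim decomposition models, with the goal of improving the reliability of fact-checking.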

Details

Motivation: Current research focuses mainly on generative decomposition methods and pays too little attention to evaluating the quality of the resulting atomic claims, so effective evaluation metrics are needed. Method: The paper develops DecMetrics, comprising three new metrics (COMPLETENESS, CORRECTNESS, and SEMANTIC ENTROPY), and integrates them as a reward function into a lightweight claim decomposition model. Result: Using automatic evaluation, the approach aims to set a benchmark for claim decomposition, with the metrics used to optimize the decomposition model's performance. Conclusion: DecMetrics assesses the quality of claim decomposition models and, by integrating the metrics into training, is intended to improve the reliability and effectiveness of fact-checking systems. Abstract: Claim decomposition plays a crucial role in the fact-checking process by breaking down complex claims into simpler atomic components and identifying their unfactual elements. Despite its importance, current research primarily focuses on generative methods for decomposition, with insufficient emphasis on evaluating the quality of these decomposed atomic claims. To bridge this gap, we introduce DecMetrics, which comprises three new metrics: COMPLETENESS, CORRECTNESS, and SEMANTIC ENTROPY, designed to automatically assess the quality of claims produced by decomposition models. Utilizing these metrics, we develop a lightweight claim decomposition model, optimizing its performance through the integration of these metrics as a reward function. Through automatic evaluation, our approach aims to set a benchmark for claim decomposition, enhancing both the reliability and effectiveness of fact-checking systems.

[27] The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors

Abdelrahman Sadallah,Tim Baumgärtner,Iryna Gurevych,Ted Briscoe

Main category: cs.CL

TL;DR: This paper introduces the RevUtil dataset and uses it to evaluate and develop models that assess review comments.

Details

Motivation: As reviewers have less time to perform reviews, automated support systems are needed to ensure high reviewing quality and make feedback useful for authors. Method: The paper introduces the RevUtil dataset, containing 1,430 human-labeled review comments and 10k synthetically labeled comments, for evaluating and developing models that assess review comments. Result: Experiments show that the fine-tuned models reach agreement levels with humans comparable to, and in some cases exceeding, those of powerful closed models such as GPT-4o. Conclusion: Machine-generated reviews generally underperform human reviews on the four aspects. Abstract: Providing constructive feedback to paper authors is a core component of peer review. With reviewers increasingly having less time to perform reviews, automated support systems are required to ensure high reviewing quality, thus making the feedback in reviews useful for authors. To this end, we identify four key aspects of review comments (individual points in weakness sections of reviews) that drive the utility for authors: Actionability, Grounding & Specificity, Verifiability, and Helpfulness. To enable evaluation and development of models assessing review comments, we introduce the RevUtil dataset. We collect 1,430 human-labeled review comments and scale our data with 10k synthetically labeled comments for training purposes. The synthetic data additionally contains rationales, i.e., explanations for the aspect score of a review comment. Employing the RevUtil dataset, we benchmark fine-tuned models for assessing review comments on these aspects and generating rationales. Our experiments demonstrate that these fine-tuned models achieve agreement levels with humans comparable to, and in some cases exceeding, those of powerful closed models like GPT-4o. Our analysis further reveals that machine-generated reviews generally underperform human reviews on our four aspects.

[28] ASCENDgpt: A Phenotype-Aware Transformer Model for Cardiovascular Risk Prediction from Electronic Health Records

Chris Sainsbury,Andreas Karwath

Main category: cs.CL

TL;DR: ASCENDgpt is a transformer-based model for cardiovascular risk prediction from EHRs that uses a phenotype-aware tokenization scheme and achieves strong performance across five cardiovascular outcomes.

Details

Motivation: The motivation of the paper is to improve cardiovascular risk prediction from longitudinal electronic health records (EHRs) by developing a model that can consolidate diagnosis codes while preserving semantic information. Method: The paper introduces ASCENDgpt, a transformer-based model that uses a novel phenotype-aware tokenization scheme to map raw ICD codes to clinically meaningful phenotype tokens. The model is pretrained on sequences derived from 19,402 unique individuals using a masked language modeling objective and then fine-tuned for time-to-event prediction of five cardiovascular outcomes. Result: ASCENDgpt achieves excellent discrimination on the held-out test set with an average C-index of 0.816, demonstrating strong performance across all outcomes (MI: 0.792, stroke: 0.824, MACE: 0.800, cardiovascular death: 0.842, all-cause mortality: 0.824). Conclusion: The paper concludes that ASCENDgpt, a transformer-based model using a phenotype-based tokenization and pretraining approach, is effective for cardiovascular risk prediction from EHRs. The method enables clinically interpretable predictions while maintaining computational efficiency. Abstract: We present ASCENDgpt, a transformer-based model specifically designed for cardiovascular risk prediction from longitudinal electronic health records (EHRs). Our approach introduces a novel phenotype-aware tokenization scheme that maps 47,155 raw ICD codes to 176 clinically meaningful phenotype tokens, achieving 99.6% consolidation of diagnosis codes while preserving semantic information. This phenotype mapping contributes to a total vocabulary of 10,442 tokens - a 77.9% reduction when compared with using raw ICD codes directly.
We pretrain ASCENDgpt on sequences derived from 19,402 unique individuals using a masked language modeling objective, then fine-tune for time-to-event prediction of five cardiovascular outcomes: myocardial infarction (MI), stroke, major adverse cardiovascular events (MACE), cardiovascular death, and all-cause mortality. Our model achieves excellent discrimination on the held-out test set with an average C-index of 0.816, demonstrating strong performance across all outcomes (MI: 0.792, stroke: 0.824, MACE: 0.800, cardiovascular death: 0.842, all-cause mortality: 0.824). The phenotype-based approach enables clinically interpretable predictions while maintaining computational efficiency. Our work demonstrates the effectiveness of domain-specific tokenization and pretraining for EHR-based risk prediction tasks.
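The 99.6% consolidation figure in this entry can be checked directly from the two counts the abstract gives. The sketch below assumes "consolidation" means the fractional reduction in distinct diagnosis symbols, which the abstract does not define explicitly; the helper name `consolidation_rate` is hypothetical.

```python
def consolidation_rate(raw_codes: int, phenotype_tokens: int) -> float:
    """Fraction of distinct diagnosis symbols eliminated by the mapping.

    Assumes consolidation = 1 - (tokens after mapping / codes before mapping).
    """
    return 1.0 - phenotype_tokens / raw_codes


# Counts quoted in the abstract: 47,155 raw ICD codes mapped to 176 phenotype tokens.
rate = consolidation_rate(47_155, 176)
print(f"{rate:.1%}")  # 99.6%
```

The separate 77.9% vocabulary reduction depends on the size of the non-diagnosis vocabulary, which the abstract does not report, so it is not recomputed here.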

[29] Serialized Output Prompting for Large Language Model-based Multi-Talker Speech Recognition

Hao Shi,Yusuke Fujita,Tomoya Mizumoto,Lianbo Liu,Atsushi Kojima,Yui Sudo

Main category: cs.CL

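TL;DR: This paper proposes a multi-talker automatic speech recognition system based on structured prompts; by inserting a Separator and serialized CTC layers and designing a three-stage training strategy, it significantly improves performance in two- and three-talker scenarios.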

Details

Motivation: Existing LLM-based multi-talker ASR systems have not explored prompt design as a way to improve performance; this paper aims to improve them with an effective prompting method. Method: The paper proposes serialized output prompts (SOP), inserts a Separator and serialized CTC layers after the speech encoder, and optimizes the model with a three-stage training strategy. Result: Experiments on the LibriMix dataset show that the proposed SOP approach significantly improves performance in both two- and three-talker scenarios. Conclusion: With structured prompts and a three-stage training strategy, the proposed SOP-MT-ASR system improves multi-talker speech recognition performance. Abstract: Prompts are crucial for task definition and for improving the performance of large language model (LLM)-based systems. However, existing LLM-based multi-talker (MT) automatic speech recognition (ASR) systems either omit prompts or rely on simple task-definition prompts, with no prior work exploring the design of prompts to enhance performance. In this paper, we propose extracting serialized output prompts (SOP) and explicitly guiding the LLM using structured prompts to improve system performance (SOP-MT-ASR). A Separator and serialized Connectionist Temporal Classification (CTC) layers are inserted after the speech encoder to separate and extract MT content from the mixed speech encoding in a first-speaking-first-out manner. Subsequently, the SOP, which serves as a prompt for LLMs, is obtained by decoding the serialized CTC outputs using greedy search. To train the model effectively, we design a three-stage training strategy, consisting of serialized output training (SOT) fine-tuning, serialized speech information extraction, and SOP-based adaptation. Experimental results on the LibriMix dataset show that, although the LLM-based SOT model performs well in the two-talker scenario, it fails to fully leverage LLMs under more complex conditions, such as the three-talker scenario. The proposed SOP approach significantly improved performance under both two- and three-talker conditions.

[30] Refining Transcripts With TV Subtitles by Prompt-Based Weakly Supervised Training of ASR

Xinnian Zhao,Hugo Van Hamme

Main category: cs.CL

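TL;DR: This study proposes a new weakly supervised speech recognition method that uses TV subtitles as context prompts to generate pseudo transcripts and applies a weighted attention mechanism to improve transcription accuracy.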

Details

Motivation: TV subtitles are widely available, but their imprecise alignment with the audio limits their use for direct supervised learning, so a new method is needed to make better use of subtitle information. Method: The study proposes a weakly supervised ASR framework that uses TV subtitles as prompts for generating pseudo transcripts, and strengthens transcription through iterative refinement and a weighted attention mechanism. Result: Experiments show significant improvements in transcription accuracy, supporting the method's effectiveness in refining transcripts. Conclusion: Treating TV subtitles as context prompts rather than direct supervision signals, combined with a weighted attention mechanism, can significantly improve speech recognition accuracy. Abstract: This study proposes a novel approach to using TV subtitles within a weakly supervised (WS) Automatic Speech Recognition (ASR) framework. Although TV subtitles are readily available, their imprecise alignment with corresponding audio limits their applicability as supervised targets for verbatim transcription. Rather than using subtitles as direct supervision signals, our method reimagines them as context-rich prompts. This design enables the model to handle discrepancies between spoken audio and subtitle text. Instead, generated pseudo transcripts become the primary targets, with subtitles acting as guiding cues for iterative refinement. To further enhance the process, we introduce a weighted attention mechanism that emphasizes relevant subtitle tokens during inference. Our experiments demonstrate significant improvements in transcription accuracy, highlighting the effectiveness of the proposed method in refining transcripts. These enhanced pseudo-labeled datasets provide high-quality foundational resources for training robust ASR systems.

[31] Learned Hallucination Detection in Black-Box LLMs using Token-level Entropy Production Rate

Charles Moslonka,Hicham Randrianarivo,Arthur Garnier,Emmanuel Malherbe

Main category: cs.CL

TL;DR: This paper proposes a method for detecting hallucinations in LLM outputs for QA tasks, especially useful in data-limited scenarios. It uses entropy-based metrics and supervised learning to improve detection, showing high performance and practicality in real-world applications like finance.

Details

Motivation: The motivation behind this research is to address the issue of hallucinations in LLM outputs for QA tasks, which undermines the reliability of these models in real-world scenarios. The focus is on developing a robust, one-shot detection methodology suitable for environments with limited data access, such as when interacting with black-box LLM APIs. Method: The paper introduces an applied methodology for hallucination detection that uses uncertainty indicators derived from log-probabilities generated during non-greedy decoding. It employs an Entropy Production Rate (EPR) metric and enhances it with supervised learning. The model uses features based on the entropic contributions of top-ranked tokens within a single generated sequence without requiring multiple query re-runs. Result: The proposed methodology significantly improves hallucination detection compared to using EPR alone. The approach demonstrates high performance using only a small set of available log-probabilities, confirming its practical efficiency. The utility of the method is showcased in a finance framework analyzing responses to queries on annual reports from an industrial dataset. Conclusion: This paper concludes that the proposed methodology for hallucination detection in LLM outputs is effective, particularly in scenarios with limited data access. It demonstrates that utilizing entropy-based features and supervised learning can significantly enhance the detection of hallucinations, making the approach practical and efficient for real-world applications such as QA and RAG systems in finance. Abstract: Hallucinations in Large Language Model (LLM) outputs for Question Answering (QA) tasks critically undermine their real-world reliability. 
This paper introduces an applied methodology for robust, one-shot hallucination detection, specifically designed for scenarios with limited data access, such as interacting with black-box LLM APIs that typically expose only a few top candidate log-probabilities per token. Our approach derives uncertainty indicators directly from these readily available log-probabilities generated during non-greedy decoding. We first derive an Entropy Production Rate (EPR) metric that offers baseline performance, later augmented with supervised learning. Our learned model uses features representing the entropic contributions of the accessible top-ranked tokens within a single generated sequence, requiring no multiple query re-runs. Evaluated across diverse QA datasets and multiple LLMs, this estimator significantly improves hallucination detection over using EPR alone. Crucially, high performance is demonstrated using only the typically small set of available log-probabilities (e.g., top <10 per token), confirming its practical efficiency and suitability for these API-constrained deployments. This work provides a readily deployable technique to enhance the trustworthiness of LLM responses from a single generation pass in QA and Retrieval-Augmented Generation (RAG) systems, with its utility further demonstrated in a finance framework analyzing responses to queries on annual reports from an industrial dataset.
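
The abstract does not give the exact EPR formula, so the following is only a minimal sketch of the general idea: an uncertainty score computed from the few top-candidate log-probabilities an API returns per token. It assumes the score is the mean Shannon entropy of the renormalized top-k distribution; the function names and sample values are hypothetical, not the authors' implementation.

```python
import math

def topk_entropy(logprobs):
    """Shannon entropy (in nats) of one token's top-k candidates.

    Assumption: the top-k probabilities are renormalized to sum to 1,
    since the rest of the vocabulary is not exposed by the API.
    """
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)

def mean_token_entropy(sequence_logprobs):
    """Average per-token entropy over a single generated sequence."""
    entropies = [topk_entropy(tok) for tok in sequence_logprobs]
    return sum(entropies) / len(entropies)

# Hypothetical top-3 log-probabilities for two short generations.
confident = [[-0.01, -5.0, -6.0], [-0.02, -4.5, -7.0]]
uncertain = [[-1.1, -1.1, -1.1], [-0.9, -1.2, -1.3]]

print(mean_token_entropy(confident), mean_token_entropy(uncertain))
```

A supervised detector of the kind the abstract describes could take the per-token entropic contributions as features rather than only their mean; that step is omitted here.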

[32] A Narrative-Driven Computational Framework for Clinician Burnout Surveillance

Syed Ahmad Chan Bukhari,Fazel Keshtkar,Alyssa Meczkowska

Main category: cs.CL

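TL;DR: By analyzing ICU discharge summaries, this study develops a hybrid approach that uses natural language processing to identify risk signals of clinician burnout, outperforming metadata-only methods.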

Details

Motivation: Existing research relies mainly on retrospective survey tools or broad EHR metadata and overlooks the narrative information in clinical notes. Method: A hybrid pipeline that combines BioBERT sentiment embeddings with a clinical stress lexicon and LDA topic modeling. Result: A provider-level logistic regression classifier achieves a precision of 0.80, a recall of 0.89, and an F1 score of 0.84 on a stratified hold-out set. Conclusion: ICU clinical narratives contain actionable signals for proactive well-being monitoring. Abstract: Clinician burnout poses a substantial threat to patient safety, particularly in high-acuity intensive care units (ICUs). Existing research predominantly relies on retrospective survey tools or broad electronic health record (EHR) metadata, often overlooking the valuable narrative information embedded in clinical notes. In this study, we analyze 10,000 ICU discharge summaries from MIMIC-IV, a publicly available database derived from the electronic health records of Beth Israel Deaconess Medical Center. The dataset encompasses diverse patient data, including vital signs, medical orders, diagnoses, procedures, treatments, and deidentified free-text clinical notes. We introduce a hybrid pipeline that combines BioBERT sentiment embeddings fine-tuned for clinical narratives, a lexical stress lexicon tailored for clinician burnout surveillance, and five-topic latent Dirichlet allocation (LDA) with workload proxies. A provider-level logistic regression classifier achieves a precision of 0.80, a recall of 0.89, and an F1 score of 0.84 on a stratified hold-out set, surpassing metadata-only baselines by at least 0.17 F1. Specialty-specific analysis indicates elevated burnout risk among providers in Radiology, Psychiatry, and Neurology. Our findings demonstrate that ICU clinical narratives contain actionable signals for proactive well-being monitoring.
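
As a quick consistency check on the reported metrics, the F1 score can be recomputed from the stated precision and recall using the standard harmonic-mean definition. This is only a sanity check on the published numbers, not part of the authors' pipeline.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the abstract.
f1 = f1_score(0.80, 0.89)
print(round(f1, 2))  # 0.84
```

The result agrees with the F1 of 0.84 given in the abstract.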

[33] Where Should I Study? Biased Language Models Decide! Evaluating Fairness in LMs for Academic Recommendations

Krithi Shailya,Akhilesh Kumar Mishra,Gokul S Krishnan,Balaraman Ravindran

Main category: cs.CL

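TL;DR: This paper studies bias in large language models' university and program recommendations, finds that recommendations favor the Global North and reinforce gender stereotypes, and proposes a new evaluation framework for measuring fairness and diversity.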

Details

Motivation: Large language models are increasingly used in everyday recommendation systems, including education planning, but their recommendations may perpetuate societal biases. The paper therefore examines whether these recommendations show geographic, demographic, and economic bias, and proposes solutions. Method: Using 360 simulated user profiles (varying by gender, nationality, and economic status), the paper analyzes more than 25,000 recommendations generated by three open-source LLMs, LLaMA-3.1-8B, Gemma-7B, and Mistral-7B, and proposes a new multi-dimensional evaluation framework for measuring the diversity and fairness of recommendations. Result: Universities in the Global North are disproportionately recommended, gender stereotypes are reinforced, and recommendations are highly repetitive. Although LLaMA-3.1 recommends the most universities (481, across 58 countries), systemic disparities persist. Conclusion: LLM recommendations in education show systemic bias, and model design needs to pay more attention to fairness to ensure equitable global access to higher education. Abstract: Large Language Models (LLMs) are increasingly used as daily recommendation systems for tasks like education planning, yet their recommendations risk perpetuating societal biases. This paper empirically examines geographic, demographic, and economic biases in university and program suggestions from three open-source LLMs: LLaMA-3.1-8B, Gemma-7B, and Mistral-7B. Using 360 simulated user profiles varying by gender, nationality, and economic status, we analyze over 25,000 recommendations. Results show strong biases: institutions in the Global North are disproportionately favored, recommendations often reinforce gender stereotypes, and institutional repetition is prevalent. While LLaMA-3.1 achieves the highest diversity, recommending 481 unique universities across 58 countries, systemic disparities persist. To quantify these issues, we propose a novel, multi-dimensional evaluation framework that goes beyond accuracy by measuring demographic and geographic representation. Our findings highlight the urgent need for bias consideration in educational LMs to ensure equitable global access to higher education.

[34] DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence

Pranav Narayanan Venkit,Philippe Laban,Yilun Zhou,Kung-Hsiang Huang,Yixin Mao,Chien-Sheng Wu

Main category: cs.CL

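TL;DR: The DeepTRACE evaluation shows that generative search engines and deep research agents are commonly one-sided, overconfident, and insufficiently accurate in their citations when answering contested questions.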

Details

Motivation: Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, but users regularly encounter overconfidence, weak sourcing, and confusing citations. Method: The authors develop the DeepTRACE audit framework, which evaluates systems' citation and reasoning behavior end-to-end by decomposing answers into statements, scoring confidence, and building citation and factual-support matrices. Result: Generative search engines and deep research agents frequently give one-sided, highly confident answers to debate queries, with large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and improve citation thoroughness, but they remain one-sided on contested questions, and citation accuracy ranges from only 40% to 80%. Conclusion: The DeepTRACE framework reveals that generative search engines and deep research agents are often one-sided, overconfident, and insufficiently accurate in their citations when answering contested questions. Abstract: Generative search engines and deep research LLM agents promise trustworthy, source-grounded synthesis, yet users regularly encounter overconfidence, weak sourcing, and confusing citation practices. We introduce DeepTRACE, a novel sociotechnically grounded audit framework that turns prior community-identified failure cases into eight measurable dimensions spanning answer text, sources, and citations. DeepTRACE uses statement-level analysis (decomposition, confidence scoring) and builds citation and factual-support matrices to audit how systems reason with and attribute evidence end-to-end. Using automated extraction pipelines for popular public models (e.g., GPT-4.5/5, You.com, Perplexity, Copilot/Bing, Gemini) and an LLM-judge with validated agreement to human raters, we evaluate both web-search engines and deep-research configurations. Our findings show that generative search engines and deep research agents frequently produce one-sided, highly confident responses on debate queries and include large fractions of statements unsupported by their own listed sources. Deep-research configurations reduce overconfidence and can attain high citation thoroughness, but they remain highly one-sided on debate queries and still exhibit large fractions of unsupported statements, with citation accuracy ranging from 40-80% across systems.

[35] Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts

Rushi Wang,Jiateng Liu,Cheng Qian,Yifan Shen,Yanzhou Pan,Zhaozhuo Xu,Ahmed Abbasi,Heng Ji,Denghui Zhang

Main category: cs.CL

TL;DR: This study identifies LLMs' vulnerability to mixed contexts and proposes RW-Steering, a robust solution to enhance response quality by ignoring inappropriate signals.

Details

Motivation: Real-world contexts often mix relevant and inappropriate content, which poses reliability risks for LLMs. This study aims to understand and mitigate LLMs' susceptibility to such risks. Method: Poisoned Context Testbed was introduced, and the Rescorla-Wagner model was adapted to analyze LLM behavior. RW-Steering, a two-stage finetuning method, was proposed. Result: LLMs tend to incorporate less prevalent, inappropriate information. RW-Steering improves response quality by 39.8% and reduces vulnerability to mixed contexts. Conclusion: RW-Steering effectively improves LLMs' ability to ignore inappropriate context, enhancing response quality and safety. Abstract: Incorporating external context can significantly enhance the response quality of Large Language Models (LLMs). However, real-world contexts often mix relevant information with disproportionate inappropriate content, posing reliability risks. How do LLMs process and prioritize mixed context? To study this, we introduce the Poisoned Context Testbed, pairing queries with real-world contexts containing relevant and inappropriate content. Inspired by associative learning in animals, we adapt the Rescorla-Wagner (RW) model from neuroscience to quantify how competing contextual signals influence LLM outputs. Our adapted model reveals a consistent behavioral pattern: LLMs exhibit a strong tendency to incorporate information that is less prevalent in the context. This susceptibility is harmful in real-world settings, where small amounts of inappropriate content can substantially degrade response quality. Empirical evaluations on our testbed further confirm this vulnerability. To tackle this, we introduce RW-Steering, a two-stage finetuning-based approach that enables the model to internally identify and ignore inappropriate signals. Unlike prior methods that rely on extensive supervision across diverse context mixtures, RW-Steering generalizes robustly across varying proportions of inappropriate content. 
Experiments show that our best fine-tuned model improves response quality by 39.8% and reverses the undesirable behavior curve, establishing RW-Steering as a robust, generalizable context engineering solution for improving LLM safety in real-world use.
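
For readers unfamiliar with the Rescorla-Wagner model, the sketch below implements the classical associative-learning update, in which every cue present on a trial moves toward a shared prediction target in proportion to its salience. The abstract does not specify how the authors adapt the model to LLM contexts, so this is the textbook rule only; the cue names and parameter values are hypothetical.

```python
def rescorla_wagner(alphas, beta=0.5, lam=1.0, trials=50):
    """Classical Rescorla-Wagner update for cues presented together.

    alphas: dict mapping cue name -> salience (learning rate of that cue)
    beta:   learning rate associated with the outcome
    lam:    maximum associative strength the outcome supports
    """
    strengths = {cue: 0.0 for cue in alphas}
    for _ in range(trials):
        # Shared prediction error: target minus the summed current prediction.
        error = lam - sum(strengths.values())
        for cue, alpha in alphas.items():
            strengths[cue] += alpha * beta * error
    return strengths

# Two competing cues with hypothetical saliences.
result = rescorla_wagner({"salient_cue": 0.4, "weak_cue": 0.1})
print(result)
```

Because the cues share one error term, they compete for a fixed amount of associative strength, which is the property that makes the model a natural lens for studying competing contextual signals.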

[36] Understanding Reinforcement Learning for Model Training, and future directions with GRAPE

Rohit Patel

Main category: cs.CL

TL;DR: This paper provides a detailed, step-by-step explanation of key algorithms for instruction tuning of models, aiming to provide a clear understanding and reduce cognitive overhead.

Details

Motivation: The motivation is to provide a clear and intuitive understanding of key algorithms for instruction tuning of models, eliminating ambiguity and reducing cognitive overhead. Method: The paper uses a step-by-step approach with simplified and explicit notation focused on LLMs to explain key algorithms for instruction tuning of models. Result: The result is a self-contained, from-scratch exposition of key algorithms for instruction tuning of models, including SFT, Rejection Sampling, REINFORCE, TRPO, PPO, GRPO, and DPO, as well as a literature review of new techniques and approaches. Conclusion: The paper concludes by presenting new ideas for research and exploration, specifically introducing GRAPE (Generalized Relative Advantage Policy Evolution). Abstract: This paper provides a self-contained, from-scratch, exposition of key algorithms for instruction tuning of models: SFT, Rejection Sampling, REINFORCE, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). Explanations of these algorithms often assume prior knowledge, lack critical details, and/or are overly generalized and complex. Here, each method is discussed and developed step by step using simplified and explicit notation focused on LLMs, aiming to eliminate ambiguity and provide a clear and intuitive understanding of the concepts. By minimizing detours into the broader RL literature and connecting concepts to LLMs, we eliminate superfluous abstractions and reduce cognitive overhead. Following this exposition, we provide a literature review of new techniques and approaches beyond those detailed. Finally, new ideas for research and exploration in the form of GRAPE (Generalized Relative Advantage Policy Evolution) are presented.
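
To make one of the listed algorithms concrete, the sketch below shows the group-relative advantage commonly used in GRPO: several responses are sampled for the same prompt, and each reward is standardized against the group's mean and standard deviation instead of a learned value function. This follows the commonly published form of GRPO rather than this paper's notation, which the abstract does not show; the use of the population standard deviation and the `eps` term are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    rewards: scalar rewards for several responses to one prompt.
    eps:     small constant to avoid division by zero (assumption).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled responses to one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that beat the group average receive positive advantages and the rest negative ones, so no separate critic model is needed.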

[37] VaccineRAG: Boosting Multimodal Large Language Models' Immunity to Harmful RAG Samples

Qixin Sun,Ziqin Wang,Hengyuan Zhao,Yilin Li,Kaiyou Song,Linjiang Huang,Xiaolin Hu,Qingpei Guo,Si Liu

Main category: cs.CL

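TL;DR: This paper proposes VaccineRAG and Partial-GRPO, which use chain-of-thought analysis and an improved preference-selection mechanism to substantially improve the performance of retrieval-augmented generation models.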

Details

Motivation: The effectiveness of retrieval-augmented generation (RAG) is often limited by retriever precision: many retrieved samples are irrelevant or misleading at the generation stage, which creates a bottleneck for LLM performance. Method: VaccineRAG uses a benchmark to evaluate models on data with varying positive/negative sample ratios and improves models' sample discrimination by having them generate explicit chain-of-thought analysis; Partial-GRPO models LLM outputs as multiple components to improve preference selection over complex sequences. Result: Experiments validate the effectiveness of VaccineRAG and Partial-GRPO in improving model performance, particularly on long-sequence, complex chain-of-thought content. Conclusion: By introducing a chain-of-thought-based RAG dataset and the Partial-GRPO method, VaccineRAG improves models' sample discrimination and complex-sequence learning, addressing the problem of limited retriever precision in RAG. Abstract: Retrieval Augmented Generation enhances the response accuracy of Large Language Models (LLMs) by integrating retrieval and generation modules with external knowledge, demonstrating particular strength in real-time queries and Visual Question Answering tasks. However, the effectiveness of RAG is frequently hindered by the precision of the retriever: many retrieved samples fed into the generation phase are irrelevant or misleading, posing a critical bottleneck to LLMs' performance. To address this challenge, we introduce VaccineRAG, a novel Chain-of-Thought-based retrieval-augmented generation dataset. On one hand, VaccineRAG employs a benchmark to evaluate models using data with varying positive/negative sample ratios, systematically exposing inherent weaknesses in current LLMs. On the other hand, it enhances models' sample-discrimination capabilities by prompting LLMs to generate explicit Chain-of-Thought (CoT) analysis for each sample before producing final answers. Furthermore, to enhance the model's ability to learn long-sequence complex CoT content, we propose Partial-GRPO. By modeling the outputs of LLMs as multiple components rather than a single whole, our model can make more informed preference selections for complex sequences, thereby enhancing its capacity to learn complex CoT. Comprehensive evaluations and ablation studies on VaccineRAG validate the effectiveness of the proposed scheme. The code and dataset will be publicly released soon.

[38] Behavioral Fingerprinting of Large Language Models

Zehua Pei,Hui-Ling Zhen,Ying Zhang,Zhiyuan Yang,Xing Li,Xianzhi Yu,Mingxuan Yuan,Bei Yu

Main category: cs.CL

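TL;DR: This paper proposes a "Behavioral Fingerprinting" framework for analyzing the intrinsic cognitive and interactive styles of large language models, revealing deep behavioral differences between models and the link between their interactive character and developers' alignment strategies.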

Details

Motivation: Current LLM benchmarks focus mainly on performance metrics and often fail to capture the nuanced behavioral characteristics that distinguish models. The paper therefore proposes a new framework that goes beyond traditional evaluation. Method: The paper uses a "Behavioral Fingerprinting" framework consisting of a curated Diagnostic Prompt Suite and an automated evaluation pipeline in which a powerful LLM acts as an impartial judge, and analyzes eighteen models across capability tiers. Result: The results reveal a critical divergence in the LLM landscape: top models are converging on core capabilities such as abstract and causal reasoning, but differ markedly in alignment-related behaviors such as sycophancy and semantic robustness. The study also documents a cross-model default persona clustering (ISTJ/ESTJ) that likely reflects common alignment incentives. Conclusion: A model's interactive nature is not an emergent property of its scale or reasoning power, but a direct consequence of specific, highly variable developer alignment strategies. Abstract: Current benchmarks for Large Language Models (LLMs) primarily focus on performance metrics, often failing to capture the nuanced behavioral characteristics that differentiate them. This paper introduces a novel "Behavioral Fingerprinting" framework designed to move beyond traditional evaluation by creating a multi-faceted profile of a model's intrinsic cognitive and interactive styles. Using a curated Diagnostic Prompt Suite and an innovative, automated evaluation pipeline where a powerful LLM acts as an impartial judge, we analyze eighteen models across capability tiers. Our results reveal a critical divergence in the LLM landscape: while core capabilities like abstract and causal reasoning are converging among top models, alignment-related behaviors such as sycophancy and semantic robustness vary dramatically. We further document a cross-model default persona clustering (ISTJ/ESTJ) that likely reflects common alignment incentives. Taken together, this suggests that a model's interactive nature is not an emergent property of its scale or reasoning power, but a direct consequence of specific, and highly variable, developer alignment strategies. Our framework provides a reproducible and scalable methodology for uncovering these deep behavioral differences. Project: https://github.com/JarvisPei/Behavioral-Fingerprinting

[39] From Silent Signals to Natural Language: A Dual-Stage Transformer-LLM Approach

Nithyashree Sivasubramaniam

Main category: cs.CL

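TL;DR: We propose an automatic speech recognition framework for silent speech interfaces that combines a transformer-based acoustic model with an LLM. The transformer captures full utterance context while the LLM ensures linguistic consistency, yielding substantial improvements in intelligibility for silent speech interfaces.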

Details

Motivation: To address the recognition and downstream processing of synthesized speech, which often suffers from phonetic ambiguity and noise. Method: An enhanced automatic speech recognition framework that combines a transformer-based acoustic model with an LLM for post-processing. Result: Experiments show a 16% relative and 6% absolute reduction in word error rate (WER) over a 36% baseline. Conclusion: Combining a transformer-based acoustic model with a large language model (LLM) for post-processing can substantially improve the intelligibility of silent speech interfaces. Abstract: Silent Speech Interfaces (SSIs) have gained attention for their ability to generate intelligible speech from non-acoustic signals. While significant progress has been made in advancing speech generation pipelines, limited work has addressed the recognition and downstream processing of synthesized speech, which often suffers from phonetic ambiguity and noise. To overcome these challenges, we propose an enhanced automatic speech recognition framework that combines a transformer-based acoustic model with a large language model (LLM) for post-processing. The transformer captures full utterance context, while the LLM ensures linguistic consistency. Experimental results show a 16% relative and 6% absolute reduction in word error rate (WER) over a 36% baseline, demonstrating substantial improvements in intelligibility for silent speech interfaces.
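
The relationship between the relative and absolute WER reductions can be checked with simple arithmetic. The resulting WER of about 30% is inferred here from the stated figures and is not quoted in the abstract.

```python
baseline_wer = 0.36        # baseline WER stated in the abstract
absolute_reduction = 0.06  # 6 percentage points, as stated

# Inferred WER after post-processing (not quoted in the abstract).
new_wer = baseline_wer - absolute_reduction

# Relative reduction is the absolute drop divided by the baseline.
relative_reduction = absolute_reduction / baseline_wer

print(f"new WER: {new_wer:.2%}, relative reduction: {relative_reduction:.1%}")
```

The relative reduction works out to roughly one sixth of the baseline, consistent with the reported 16% figure.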

[40] ProST: Progressive Sub-task Training for Pareto-Optimal Multi-agent Systems Using Small Language Models

Biddut Sarker Bijoy,Mohammad Saqib Hasan,Pegah Alipoormolabashi,Avirup Sil,Aruna Balasubramanian,Niranjan Balasubramanian

Main category: cs.CL

TL;DR: This paper explores multi-agent systems with smaller language models (SLMs) as an alternative to single-agent systems with large language models (LLMs). It introduces a progressive sub-task training method that improves performance and efficiency, showing that multi-agent setups can outperform single-agent ones when using SLMs.

Details

Motivation: The study aims to explore whether multi-agent systems with smaller language models (SLMs) can serve as an effective and efficient alternative to single-agent systems using large language models (LLMs) for complex problem-solving. Method: The authors compare single-agent and multi-agent systems using different-sized language models on complex tasks in the AppWorld environment. They introduce a progressive sub-task training approach inspired by curriculum learning and perform ablation studies to evaluate its impact. Result: The authors find that SLMs struggle with long-trajectory learning and subtask mastery. Their proposed progressive training strategy improves multi-agent effectiveness across all configurations, and Pareto analysis confirms better trade-offs between effectiveness and efficiency. Conclusion: Multi-agent systems using smaller language models (SLMs) can achieve better effectiveness-efficiency trade-offs when applying a progressive sub-task training strategy, compared to single-agent systems using large language models (LLMs). Abstract: Multi-agent systems with smaller language models (SLMs) present a viable alternative to single agent systems powered by large language models (LLMs) for addressing complex problems. In this work, we study how these alternatives compare in terms of both effectiveness and efficiency. To study this trade-off, we instantiate single and multi-agent systems for the complex problems in the AppWorld environment using different sized language models. We find that difficulties with long-trajectory learning in smaller language models (SLMs) limit their performance. Even when trained for specialized roles, SLMs fail to learn all subtasks effectively. To address this issue, we introduce a simple progressive sub-task training strategy, which introduces new sub-tasks progressively in each training epoch. 
We find that this novel strategy, analogous to instance-level curriculum learning, consistently improves the effectiveness of multi-agent systems across all configurations. Our Pareto analysis shows that fine-tuned multi-agent systems yield better effectiveness-efficiency trade-offs. Additional ablations and analyses show the importance of our progressive training strategy and its ability to reduce subtask error rates.
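
The progressive strategy can be pictured as a schedule that widens the set of trained sub-tasks from one epoch to the next. The sketch below is one possible reading: it assumes one new sub-task is added per epoch and that earlier sub-tasks stay in the training mix, neither of which the abstract confirms. The sub-task names are hypothetical.

```python
def progressive_schedule(subtasks, epochs):
    """Return the list of active sub-tasks for each training epoch.

    Assumption: epoch 0 trains on the first sub-task only, and each later
    epoch adds the next one while keeping all earlier sub-tasks.
    """
    schedule = []
    for epoch in range(epochs):
        active = subtasks[: min(epoch + 1, len(subtasks))]
        schedule.append(active)
    return schedule

# Hypothetical sub-tasks of an agent trajectory.
plan = progressive_schedule(["plan", "call_api", "verify"], epochs=4)
for epoch, active in enumerate(plan):
    print(epoch, active)
```

Once every sub-task has been introduced, the remaining epochs train on the full set.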

[41] Combine Virtual Reality and Machine-Learning to Identify the Presence of Dyslexia: A Cross-Linguistic Approach

Michele Materazzini,Gianluca Morciano,Jose Manuel Alcalde-Llergo,Enrique Yeguas-Bolivar,Giuseppe Calabro,Andrea Zingoni,Juri Taborri

Main category: cs.CL

TL;DR: This study explores the use of VR and AI to predict dyslexia in Italian and Spanish university students, using VR-based tasks and ML models to assess reading performance and self-esteem, achieving varying levels of classification accuracy.

Details

Motivation: The research investigates whether VR-derived data from Silent Reading tests and self-esteem assessments can differentiate between students affected by dyslexia and those who are not. Method: Participants completed VR-based tasks measuring reading performance and self-esteem. A preliminary statistical analysis (t tests and Mann Whitney tests) was performed, and supervised ML models were trained and tested. Result: Statistical analysis revealed significant differences in completion time for the SR test but not in accuracy or self-esteem. ML models classified the presence/absence of dyslexia with an accuracy of 87.5% for Italian, 66.6% for Spanish, and 75.0% for the pooled group. Conclusion: VR and ML can be used as supporting tools for assessing dyslexia, particularly by capturing differences in task completion speed, but language-specific factors may influence classification accuracy. Abstract: This study explores the use of virtual reality (VR) and artificial intelligence (AI) to predict the presence of dyslexia in Italian and Spanish university students. In particular, the research investigates whether VR-derived data from Silent Reading (SR) tests and self-esteem assessments can differentiate between students that are affected by dyslexia and students that are not, employing machine learning (ML) algorithms. Participants completed VR-based tasks measuring reading performance and self-esteem. A preliminary statistical analysis (t tests and Mann Whitney tests) on these data was performed, to compare the obtained scores between individuals with and without dyslexia, revealing significant differences in completion time for the SR test, but not in accuracy, nor in self esteem. Then, supervised ML models were trained and tested, demonstrating an ability to classify the presence/absence of dyslexia with an accuracy of 87.5 per cent for Italian, 66.6 per cent for Spanish, and 75.0 per cent for the pooled group. 
These findings suggest that VR and ML can effectively be used as supporting tools for assessing dyslexia, particularly by capturing differences in task completion speed, but language-specific factors may influence classification accuracy.

[42] Scaling behavior of large language models in emotional safety classification across sizes and tasks

Edoardo Pinzuti,Oliver Tüscher,André Ferreira Castro

Main category: cs.CL

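TL;DR: This paper studies how large language models handle emotional safety content and finds that smaller models, once fine-tuned, can reach performance comparable to larger models.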

Details

Motivation: Understanding how large language models handle emotionally sensitive content is critical for building safe and reliable systems, particularly in mental health contexts. Method: A new dataset of more than 15K samples was constructed and augmented with emotion re-interpretation prompts generated via ChatGPT. Four LLaMA models were evaluated in zero-shot, few-shot, and fine-tuning settings. Result: Larger language models show stronger average performance in nuanced multi-label classification and in zero-shot settings. After lightweight fine-tuning, however, the 1B model performs comparably to larger models and BERT in several high-data categories. Conclusion: Smaller, on-device models can serve as viable alternatives for sensitive applications, with the ability to interpret emotional context and maintain safe conversational boundaries. Abstract: Understanding how large language models (LLMs) process emotionally sensitive content is critical for building safe and reliable systems, particularly in mental health contexts. We investigate the scaling behavior of LLMs on two key tasks: trinary classification of emotional safety (safe vs. unsafe vs. borderline) and multi-label classification using a six-category safety risk taxonomy. To support this, we construct a novel dataset by merging several human-authored mental health datasets (> 15K samples) and augmenting them with emotion re-interpretation prompts generated via ChatGPT. We evaluate four LLaMA models (1B, 3B, 8B, 70B) across zero-shot, few-shot, and fine-tuning settings. Our results show that larger LLMs achieve stronger average performance, particularly in nuanced multi-label classification and in zero-shot settings. However, lightweight fine-tuning allowed the 1B model to achieve performance comparable to larger models and BERT in several high-data categories, while requiring <2GB VRAM at inference. These findings suggest that smaller, on-device models can serve as viable, privacy-preserving alternatives for sensitive applications, offering the ability to interpret emotional context and maintain safe conversational boundaries. This work highlights key implications for therapeutic LLM applications and the scalable alignment of safety-critical systems.

[43] Mitigation of Gender and Ethnicity Bias in AI-Generated Stories through Model Explanations

Martha O. Dimgba,Sharon Oba,Ameeta Agrawal,Philippe J. Giabbanelli

Main category: cs.CL

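TL;DR: This paper proposes a bias mitigation strategy called BAME, which uses model-generated explanations to guide targeted prompt engineering and reduces gender and ethnicity bias in AI-generated occupational stories without modifying model parameters.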

Details

Motivation: Language models propagate social bias through their output, particularly in the representation of gender and ethnicity. Method: The effect of the bias mitigation strategy BAME is evaluated by analyzing stories generated across 25 occupational groups, three large language models (Claude 3.5 Sonnet, Llama 3.1 70B Instruct, and GPT-4 Turbo), and multiple demographic dimensions. Result: After applying BAME, demographic representation improves by 2% to 20%. Conclusion: Guiding models with their own internal reasoning mechanisms can significantly enhance demographic parity and contributes to the development of more transparent generative AI systems. Abstract: Language models have been shown to propagate social bias through their output, particularly in the representation of gender and ethnicity. This paper investigates gender and ethnicity biases in AI-generated occupational stories. Representation biases are measured before and after applying our proposed mitigation strategy, Bias Analysis and Mitigation through Explanation (BAME), revealing improvements in demographic representation ranging from 2% to 20%. BAME leverages model-generated explanations to inform targeted prompt engineering, effectively reducing biases without modifying model parameters. By analyzing stories generated across 25 occupational groups, three large language models (Claude 3.5 Sonnet, Llama 3.1 70B Instruct, and GPT-4 Turbo), and multiple demographic dimensions, we identify persistent patterns of overrepresentation and underrepresentation linked to training data stereotypes. Our findings demonstrate that guiding models with their own internal reasoning mechanisms can significantly enhance demographic parity, thereby contributing to the development of more transparent generative AI systems.

[44] Artificially Fluent: Swahili AI Performance Benchmarks Between English-Trained and Natively-Trained Datasets

Sophie Jaffer,Simeon Sayer

Main category: cs.CL

TL;DR: The study finds that models trained in a language perform better than those processing translated inputs, emphasizing the need for equitable training data for underrepresented languages.

Details

Motivation: The motivation stems from concerns about the equity of performance of large language models (LLMs) across languages. Given the dominance of English in training data, there is a risk of disadvantaging non-English speakers, prompting the need to investigate whether data disparities affect model performance. Method: This study compares two monolingual BERT models—one trained and tested entirely on Swahili data, and another on comparable English news data. The Swahili news data is translated into English and evaluated using the English-trained model to simulate how multilingual LLMs process non-English queries. Result: The results show that despite high-quality translation, the native Swahili-trained model performed better than the Swahili-to-English translated model, producing nearly four times fewer errors (0.36% vs. 1.47%). This gap indicates that translation alone does not resolve representational differences between languages. Conclusion: The study concludes that native-language training is crucial for reliable outcomes as translation alone does not bridge representational differences between languages. Models trained in one language may struggle with translated inputs, highlighting the importance of addressing dataset disparities for underrepresented languages to reduce digital divides. Abstract: As large language models (LLMs) expand multilingual capabilities, questions remain about the equity of their performance across languages. While many communities stand to benefit from AI systems, the dominance of English in training data risks disadvantaging non-English speakers. To test the hypothesis that such data disparities may affect model performance, this study compares two monolingual BERT models: one trained and tested entirely on Swahili data, and another on comparable English news data. 
To simulate how multilingual LLMs process non-English queries through internal translation and abstraction, we translated the Swahili news data into English and evaluated it using the English-trained model. This approach tests the hypothesis by evaluating whether translating Swahili inputs for evaluation on an English model yields better or worse performance compared to training and testing a model entirely in Swahili, thus isolating the effect of language consistency versus cross-lingual abstraction. The results prove that, despite high-quality translation, the native Swahili-trained model performed better than the Swahili-to-English translated model, producing nearly four times fewer errors: 0.36% vs. 1.47% respectively. This gap suggests that translation alone does not bridge representational differences between languages and that models trained in one language may struggle to accurately interpret translated inputs due to imperfect internal knowledge representation, suggesting that native-language training remains important for reliable outcomes. In educational and informational contexts, even small performance gaps may compound inequality. Future research should focus on addressing broader dataset development for underrepresented languages and renewed attention to multilingual model evaluation, ensuring the reinforcing effect of global AI deployment on existing digital divides is reduced.

[45] Analysis of Voluntarily Reported Data Post Mesh Implantation for Detecting Public Emotion and Identifying Concern Reports

Indu Bala,Lewis Mitchell,Marianne H Gillam

Main category: cs.CL

TL;DR: 本研究通过分析患者报告，揭示了网状植入物手术后患者情感体验的变化趋势，并强调了情感因素在医疗实践中的重要性。

Details

Motivation: 研究旨在分析接受网状植入物手术的患者情感体验，并探究与医疗设备监管变化及医疗技术进步相关的情感变化趋势。 Method: 利用自然语言处理（NLP）技术，结合加拿大国家研究委员会（NRC）情感词典和TextBlob进行情感分析，对2000年至2021年期间的患者报告进行分类和评估情感极性。 Result: 研究发现2011-2012年和2017-2018年期间，“关注报告”数量增加且情感强度更高，为医疗从业者提供了有价值的见解，有助于改善术前咨询、术后护理及患者准备。 Conclusion: 该研究强调了医疗实践中情感因素的重要性，并指出情感分析在改善患者护理方面的潜力。 Abstract: Mesh implants are widely utilized in hernia repair surgeries, but postoperative complications present a significant concern. This study analyzes patient reports from the Manufacturer and User Facility Device Experience (MAUDE) database spanning 2000 to 2021 to investigate the emotional aspects of patients following mesh implantation using Natural Language Processing (NLP). Employing the National Research Council Canada (NRC) Emotion Lexicon and TextBlob for sentiment analysis, the research categorizes patient narratives into eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and assesses sentiment polarity. The goal is to discern patterns in patient sentiment over time and to identify reports signaling urgent concerns, referred to as "Concern Reports," thereby understanding shifts in patient experiences in relation to changes in medical device regulation and technological advancements in healthcare. The study detected an increase in Concern Reports and higher emotional intensity during the periods of 2011-2012 and 2017-2018. Through temporal analysis of Concern Reports and overall sentiment, this research provides valuable insights for healthcare practitioners, enhancing their understanding of patient experiences post-surgery, which is critical for improving preoperative counselling, postoperative care, and preparing patients for mesh implant surgeries. The study underscores the importance of emotional considerations in medical practices and the potential for sentiment analysis to inform and enhance patient care.

[46] Advancing SLM Tool-Use Capability using Reinforcement Learning

Dhruvi Paprunia,Vansh Kharidia,Pankti Doshi

Main category: cs.CL

TL;DR: This paper explores using GRPO-based reinforcement learning to enhance tool-use abilities in small language models, improving their effectiveness for complex tasks.

Details

Motivation: Small Language Models (SLMs) struggle with tool use compared to Large Language Models (LLMs) due to limited training data and contextual understanding, creating a need for methods to enhance their efficiency and adaptability. Method: The research employs Reinforcement Learning (RL), specifically Group Relative Policy Optimization (GRPO), to improve the tool-use capabilities of Small Language Models (SLMs). Result: The proposed GRPO method effectively boosts SLM tool-use accuracy, providing a more efficient and adaptable solution compared to traditional fine-tuning approaches. Conclusion: The study concludes that using GRPO-based reinforcement learning significantly enhances the tool-use proficiency of small language models (SLMs), making them more practical and effective for complex tasks. Abstract: Large Language Models (LLMs) have progressed beyond simple text creation, and tool use has become increasingly important for complex, real-world tasks. Tool use in LLMs refers to their ability to utilize external resources such as APIs, databases, or software functions to extend their functionality beyond generating text. Tools are used for tasks such as performing calculations, making API calls to retrieve the current time and date, and more. This capability enables models to fetch real-time data, execute commands, or solve problems requiring dynamic interaction, making it indispensable for applications like AI agents in virtual assistants, robotic control, or automated workflows. However, while LLMs are usually adept at tool use, their vast resource requirements and computational complexity restrict their use in every use case. As a result, there is an increasing need for more compact and efficient Small Language Models (SLMs). Small language models (SLMs) struggle in tool use compared to large language models (LLMs), as seen in Table 1.
SLMs are typically trained on smaller, more specific datasets, resulting in a narrower knowledge base and limited contextual understanding compared to LLMs. This research addresses these challenges by using Reinforcement Learning (RL), specifically Group Relative Policy Optimization (GRPO), to enhance tool-use proficiency in SLMs. Unlike conventional fine-tuning approaches that require heavy computation and often lack adaptability, our method provides an efficient, effective solution that significantly boosts SLM tool-use accuracy, increasing their practical utility.
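
As a rough illustration of the "group relative" idea behind GRPO in entry [46]: several completions are sampled for the same prompt, and each reward is normalized against its own group rather than against a learned value function. The sketch below shows one common form of that normalization. The reward values are hypothetical (the abstract does not describe the reward design), and the function name is an assumption made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four tool calls sampled for one prompt
# (1.0 = correct tool call, 0.0 = incorrect); not taken from the paper.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that beat their group average get a positive advantage and are reinforced; when every completion in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.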

[47] Hierarchical Section Matching Prediction (HSMP) BERT for Fine-Grained Extraction of Structured Data from Hebrew Free-Text Radiology Reports in Crohn's Disease

Zvi Badash,Hadas Ben-Atya,Naama Gavrielov,Liam Hazan,Gili Focht,Ruth Cytter-Kuint,Talar Hagopian,Dan Turner,Moti Freiman

Main category: cs.CL

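TL;DR: HSMP-BERT is a prompt-based model for extracting information from Hebrew radiology text; it offers a scalable solution for structured extraction in radiology, enables population-level analysis of Crohn's disease, and demonstrates AI's potential in low-resource settings.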

Details

Motivation: Extracting structured clinical information from radiology reports is challenging, especially in low-resource languages. This is particularly pronounced in Crohn's disease, where multi-organ findings are sparsely represented. Method: HSMP-BERT, a prompt-based model for extraction from Hebrew radiology text. Result: On 24 organ-finding combinations, HSMP-BERT achieved mean F1 0.83±0.08 and κ 0.65±0.17, outperforming the SMP zero-shot baseline (F1 0.49±0.07, κ 0.06±0.07) and standard fine-tuning (F1 0.30±0.27, κ 0.27±0.34; paired t-test p < 10^{-7}). Hierarchical inference cut runtime by 5.1×. Conclusion: HSMP-BERT offers a scalable solution for structured extraction in radiology, enabling population-level analysis of Crohn's disease and demonstrating AI's potential in low-resource settings. Abstract: Extracting structured clinical information from radiology reports is challenging, especially in low-resource languages. This is pronounced in Crohn's disease, with sparsely represented multi-organ findings. We developed Hierarchical Structured Matching Prediction BERT (HSMP-BERT), a prompt-based model for extraction from Hebrew radiology text. In an administrative database study, we analyzed 9,683 reports from Crohn's patients imaged 2010-2023 across Israeli providers. A subset of 512 reports was radiologist-annotated for findings across six gastrointestinal organs and 15 pathologies, yielding 90 structured labels per subject. Multilabel-stratified split (66% train+validation; 33% test), preserving label prevalence. Performance was evaluated with accuracy, F1, Cohen's $\kappa$, AUC, PPV, NPV, and recall. On 24 organ-finding combinations with $>$15 positives, HSMP-BERT achieved mean F1 0.83$\pm$0.08 and $\kappa$ 0.65$\pm$0.17, outperforming the SMP zero-shot baseline (F1 0.49$\pm$0.07, $\kappa$ 0.06$\pm$0.07) and standard fine-tuning (F1 0.30$\pm$0.27, $\kappa$ 0.27$\pm$0.34; paired t-test $p < 10^{-7}$). Hierarchical inference cuts runtime 5.1$\times$ vs. traditional inference. Applied to all reports, it revealed associations among ileal wall thickening, stenosis, and pre-stenotic dilatation, plus age- and sex-specific trends in inflammatory findings. HSMP-BERT offers a scalable solution for structured extraction in radiology, enabling population-level analysis of Crohn's disease and demonstrating AI's potential in low-resource settings.
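
Entry [47] reports both F1 and Cohen's κ, and the two can diverge sharply (for example, the zero-shot baseline's F1 of 0.49 against a κ of 0.06). A minimal sketch of how the two metrics are computed for one binary label may help explain why: κ subtracts the agreement expected by chance, while F1 does not. The label vectors below are hypothetical and are not data from the paper.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for one binary label: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    p_chance = sum(
        (y_true.count(label) / n) * (y_pred.count(label) / n) for label in labels
    )
    if p_chance == 1.0:  # both raters used a single identical label
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical annotations for one organ-finding label over eight reports.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
f1 = f1_score(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
print(f1, kappa)
```

In this toy case six of eight predictions agree, but half of that agreement is expected by chance, so κ comes out lower than F1.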

[48] Using LLMs to create analytical datasets: A case study of reconstructing the historical memory of Colombia

David Anderson,Galia Benitez,Margret Bjarnadottir,Shriyan Reyya

Main category: cs.CL

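TL;DR: This study uses a large language model to analyze a large corpus of Spanish-language news articles, reconstructing the historical memory of Colombia's conflict and examining the relationship between violence and the eradication of coca crops.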

Details

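Motivation: Colombia has been mired in armed conflict for decades, yet until recently the government did not prioritize the systematic documentation of violence, resulting in a lack of publicly available conflict information and historical accounts. Method: GPT, a large language model (LLM), is used to read and answer questions about over 200,000 violence-related newspaper articles in Spanish; the resulting dataset supports a descriptive analysis and a study of the relationship between violence and the eradication of coca crops. Result: Using the generated dataset, the study conducts a descriptive analysis, examines the relationship between violence and coca crop eradication, and shows how such data can support policy analysis. Conclusion: The study demonstrates how LLMs open new research opportunities by enabling the examination of large text corpora at a previously infeasible depth. Abstract: Colombia has been submerged in decades of armed conflict, yet until recently, the systematic documentation of violence was not a priority for the Colombian government. This has resulted in a lack of publicly available conflict information and, consequently, a lack of historical accounts. This study contributes to Colombia's historical memory by utilizing GPT, a large language model (LLM), to read and answer questions about over 200,000 violence-related newspaper articles in Spanish. We use the resulting dataset to conduct both descriptive analysis and a study of the relationship between violence and the eradication of coca crops, offering an example of policy analyses that such data can support. Our study demonstrates how LLMs have opened new research opportunities by enabling examinations of large text corpora at a previously infeasible depth.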

[49] Quantized Large Language Models in Biomedical Natural Language Processing: Evaluation and Recommendation

Zaifu Zhan,Shuang Zhou,Min Zeng,Kai Yu,Meijia Song,Xiaoyi Chen,Jun Wang,Yu Hou,Rui Zhang

Main category: cs.CL

TL;DR: This study demonstrates that quantization can significantly reduce the computational demands of large biomedical language models while maintaining performance, making them practical for secure, local deployment in healthcare settings.

Details

Motivation: The motivation is to address the barriers in deploying large language models in healthcare settings due to their computational demands and data privacy concerns, by exploring the effectiveness of quantization. Method: The study systematically evaluated the impact of quantization on 12 state-of-the-art large language models, including both general-purpose and biomedical-specific models, across eight benchmark datasets covering four key tasks. Result: Quantization was found to reduce GPU memory requirements by up to 75% while maintaining model performance across various tasks, enabling deployment of 70B-parameter models on 40GB consumer-grade GPUs. Domain-specific knowledge and responsiveness to prompting methods were also preserved. Conclusion: This study concludes that quantization is a practical and effective strategy for deploying large biomedical language models on local, resource-limited settings without compromising performance, thereby bridging the gap between AI advancements and clinical applications. Abstract: Large language models have demonstrated remarkable capabilities in biomedical natural language processing, yet their rapid growth in size and computational requirements present a major barrier to adoption in healthcare settings where data privacy precludes cloud deployment and resources are limited. In this study, we systematically evaluated the impact of quantization on 12 state-of-the-art large language models, including both general-purpose and biomedical-specific models, across eight benchmark datasets covering four key tasks: named entity recognition, relation extraction, multi-label classification, and question answering. We show that quantization substantially reduces GPU memory requirements, by up to 75%, while preserving model performance across diverse tasks, enabling the deployment of 70B-parameter models on 40GB consumer-grade GPUs. 
In addition, domain-specific knowledge and responsiveness to advanced prompting methods are largely maintained. These findings provide significant practical and guiding value, highlighting quantization as a practical and effective strategy for enabling the secure, local deployment of large yet high-capacity language models in biomedical contexts, bridging the gap between technical advances in AI and real-world clinical translation.
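
The 75% figure and the 40GB claim in entry [49] are consistent with simple weight-storage arithmetic. The sketch below assumes a 16-bit baseline and 4-bit quantized weights; those bit widths are an assumption made for illustration, since the abstract does not name the precisions. It counts weight storage only and ignores activations, the KV cache, and quantization metadata.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 70e9  # a 70B-parameter model, as mentioned in the abstract

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # assumed 16-bit baseline
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # assumed 4-bit quantization
reduction = 1 - int4_gb / fp16_gb

print(f"16-bit: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB, saved: {reduction:.0%}")
```

Under these assumptions the weights shrink from about 140 GB to about 35 GB, which is a 75% reduction and fits under a 40 GB budget with a small margin for runtime overhead.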

[50] Sample-efficient Integration of New Modalities into Large Language Models

Osman Batur İnce,André F. T. Martins,Oisin Mac Aodha,Edoardo M. Ponti

Main category: cs.CL

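TL;DR: This paper proposes SEMI, a sample-efficient method for integrating new modalities, which uses a hypernetwork to adapt a projector and enables few-shot learning for low-resource modalities.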

Details

Motivation: Because the space of modalities is large and constantly evolving, training a model from scratch to cover all modalities is infeasible, and existing methods require large amounts of paired data, which is often unavailable for low-resource modalities. Method: A hypernetwork is designed to adapt a shared projector to any modality. Isometric transformations increase the diversity of training modalities, and a few samples are used at inference time to generate a suitable adapter. Result: SEMI significantly improves the sample efficiency of few-shot integration of new modalities. For example, to reach the same accuracy as 32-shot SEMI, training the projector from scratch needs 64× more data. Conclusion: SEMI is a sample-efficient method for integrating new modalities into large language models that substantially reduces the amount of data required and shows promise for extending the modality coverage of foundation models. Abstract: Multimodal foundation models can process several modalities. However, since the space of possible modalities is large and evolving over time, training a model from scratch to encompass all modalities is unfeasible. Moreover, integrating a modality into a pre-existing foundation model currently requires a significant amount of paired data, which is often not available for low-resource modalities. In this paper, we introduce a method for sample-efficient modality integration (SEMI) into Large Language Models (LLMs). To this end, we devise a hypernetwork that can adapt a shared projector -- placed between modality-specific encoders and an LLM -- to any modality. The hypernetwork, trained on high-resource modalities (i.e., text, speech, audio, video), is conditioned on a few samples from any arbitrary modality at inference time to generate a suitable adapter. To increase the diversity of training modalities, we artificially multiply the number of encoders through isometric transformations. We find that SEMI achieves a significant boost in sample efficiency during few-shot integration of new modalities (i.e., satellite images, astronomical images, inertial measurements, and molecules) with encoders of arbitrary embedding dimensionality. For instance, to reach the same accuracy as 32-shot SEMI, training the projector from scratch needs 64$\times$ more data. As a result, SEMI holds promise to extend the modality coverage of foundation models.
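
To make the "isometric transformations" step in entry [50] concrete: an isometry changes the coordinates of an embedding space while preserving all pairwise distances, so a transformed encoder looks like a new encoder with the same geometry. The sketch below assumes the isometry is a random orthogonal matrix, built here with Gram-Schmidt; the abstract does not say how the authors construct theirs, and the helper names and toy vectors are hypothetical.

```python
import math
import random


def random_orthogonal(dim, seed=0):
    """Build a random orthogonal matrix (orthonormal rows) via Gram-Schmidt."""
    rng = random.Random(seed)
    basis = []
    while len(basis) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:  # remove components along rows already accepted
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-8:  # skip the rare near-degenerate draw
            basis.append([x / norm for x in v])
    return basis


def apply(matrix, vec):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two hypothetical 4-dimensional encoder embeddings.
a = [0.2, -1.0, 0.5, 3.0]
b = [1.5, 0.3, -0.7, 0.0]

Q = random_orthogonal(4, seed=0)
d_before = distance(a, b)
d_after = distance(apply(Q, a), apply(Q, b))
print(d_before, d_after)
```

Each new seed yields a differently oriented copy of the same embedding space, which is one plausible way to present a hypernetwork with many "encoders" derived from a few real ones.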

[51] Manipulating Transformer-Based Models: Controllability, Steerability, and Robust Interventions

Faruk Alpay,Taylan Alpay

Main category: cs.CL

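TL;DR: This paper explores interventions at three levels (prompts, activations, and weights) to achieve fine-grained control over Transformer models, addressing the problem of controllable text generation.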

Details

Motivation: Transformer models excel at natural language processing tasks, but achieving fine-grained control remains challenging. Method: Controllable text generation is formalized as an optimization problem addressable via prompt engineering, parameter-efficient fine-tuning, model editing, and reinforcement learning, and a unified framework is introduced that covers prompt-level steering, activation interventions, and weight-space edits. Result: Theoretically, minimal weight updates are shown to achieve targeted behavior changes with limited side effects; empirically, sentiment control and factual edits succeed more than 90% of the time while base performance is preserved. Conclusion: The paper lays groundwork for designing controllable and robust language models, while discussing ethical dual-use risks and the need for rigorous evaluation. Abstract: Transformer-based language models excel in NLP tasks, but fine-grained control remains challenging. This paper explores methods for manipulating transformer models through principled interventions at three levels: prompts, activations, and weights. We formalize controllable text generation as an optimization problem addressable via prompt engineering, parameter-efficient fine-tuning, model editing, and reinforcement learning. We introduce a unified framework encompassing prompt-level steering, activation interventions, and weight-space edits. We analyze robustness and safety implications, including adversarial attacks and alignment mitigations. Theoretically, we show minimal weight updates can achieve targeted behavior changes with limited side-effects. Empirically, we demonstrate >90% success in sentiment control and factual edits while preserving base performance, though generalization-specificity trade-offs exist. We discuss ethical dual-use risks and the need for rigorous evaluation. This work lays groundwork for designing controllable and robust language models.

[52] Phonological Representation Learning for Isolated Signs Improves Out-of-Vocabulary Generalization

Lee Kezar,Zed Sehyr,Jesse Thomason

Main category: cs.CL

TL;DR: This research explores how linguistically-motivated biases can enhance the generalization capabilities of models for sign language, particularly for handling unseen signs.

Details

Motivation: The motivation stems from the issue that sign language datasets often lack representativeness in vocabulary, necessitating models that can generalize to unseen signs. Method: The study uses a vector-quantized autoencoder with two phonological inductive biases: Parameter Disentanglement as an architectural bias and Phonological Semi-Supervision as a regularization technique. Result: The proposed model shows improved performance in both one-shot reconstruction of unseen signs and sign identification compared to a baseline model. Conclusion: This work concludes that explicit, linguistically-motivated biases can significantly improve the generalization of learned representations in sign language models, particularly for unseen signs. Abstract: Sign language datasets are often not representative in terms of vocabulary, underscoring the need for models that generalize to unseen signs. Vector quantization is a promising approach for learning discrete, token-like representations, but it has not been evaluated whether the learned units capture spurious correlations that hinder out-of-vocabulary performance. This work investigates two phonological inductive biases: Parameter Disentanglement, an architectural bias, and Phonological Semi-Supervision, a regularization technique, to improve isolated sign recognition of known signs and reconstruction quality of unseen signs with a vector-quantized autoencoder. The primary finding is that the learned representations from the proposed model are more effective for one-shot reconstruction of unseen signs and more discriminative for sign identification compared to a controlled baseline. This work provides a quantitative analysis of how explicit, linguistically-motivated biases can improve the generalization of learned representations of sign language.

[53] Spoken in Jest, Detected in Earnest: A Systematic Review of Sarcasm Recognition -- Multimodal Fusion, Challenges, and Future Prospects

Xiyuan Gao,Shekhar Nayak,Matt Coler

Main category: cs.CL

TL;DR: This systematic review explores speech-based sarcasm recognition, emphasizing the need for multimodal approaches, cross-cultural studies, and improved datasets to better understand sarcasm in spoken language.

Details

Motivation: Sarcasm is a complex aspect of human communication with challenges in interpersonal and human-machine interactions, and the role of speech data in recognizing sarcasm has been underexplored despite its importance in improving machine understanding and aiding individuals with neurodegenerative conditions. Method: The paper conducts a systematic review of speech-based sarcasm recognition, analyzing datasets, feature extraction techniques, and classification methods ranging from unimodal to multimodal approaches. Result: The findings highlight limitations in current datasets, the evolution from traditional acoustic features to deep learning-based representations, and the shift from unimodal to multimodal classification techniques. Conclusion: The paper concludes that sarcasm recognition should shift focus from text-based to multimodal approaches, emphasizing the importance of cross-cultural and multilingual studies. Abstract: Sarcasm, a common feature of human communication, poses challenges in interpersonal interactions and human-machine interactions. Linguistic research has highlighted the importance of prosodic cues, such as variations in pitch, speaking rate, and intonation, in conveying sarcastic intent. Although previous work has focused on text-based sarcasm detection, the role of speech data in recognizing sarcasm has been underexplored. Recent advancements in speech technology emphasize the growing importance of leveraging speech data for automatic sarcasm recognition, which can enhance social interactions for individuals with neurodegenerative conditions and improve machine understanding of complex human language use, leading to more nuanced interactions. This systematic review is the first to focus on speech-based sarcasm recognition, charting the evolution from unimodal to multimodal approaches. It covers datasets, feature extraction, and classification methods, and aims to bridge gaps across diverse research domains. 
The findings include limitations in datasets for sarcasm recognition in speech, the evolution of feature extraction techniques from traditional acoustic features to deep learning-based representations, and the progression of classification methods from unimodal approaches to multimodal fusion techniques. In so doing, we identify the need for greater emphasis on cross-cultural and multilingual sarcasm recognition, as well as the importance of addressing sarcasm as a multimodal phenomenon, rather than a text-based challenge.

[54] PRIM: Towards Practical In-Image Multilingual Machine Translation

Yanzhi Tian,Zeming Liu,Zhengyang Liu,Chong Feng,Xin Li,Heyan Huang,Yuhang Guo

Main category: cs.CL

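TL;DR: This paper proposes the VisTrans model and the PRIM dataset, advancing research on in-image multilingual machine translation in real-world scenarios.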

Details

Motivation: Existing end-to-end IIMT research is based mainly on synthetic data, leaving a significant gap with the real world, so multilingual IIMT needs to be studied in practical scenarios. Method: The VisTrans model is proposed, which processes the visual text and the background information in an image separately to handle the challenges of practical conditions. Result: Experimental results show that VisTrans outperforms other models in translation quality and visual effect. Conclusion: VisTrans achieves better translation quality and visual effect on the PRIM dataset, advancing IIMMT research in real-world scenarios. Abstract: In-Image Machine Translation (IIMT) aims to translate images containing text from one language to another. Current research on end-to-end IIMT is mainly conducted on synthetic data, with simple backgrounds, a single font, fixed text positions, and bilingual translation, which cannot fully reflect the real world, causing a significant gap between research and practical conditions. To facilitate research on IIMT in real-world scenarios, we explore Practical In-Image Multilingual Machine Translation (IIMMT). To address the lack of publicly available data, we annotate the PRIM dataset, which contains real-world captured one-line text images with complex backgrounds, various fonts, and diverse text positions, and supports multilingual translation directions. We propose an end-to-end model, VisTrans, to handle the challenges of practical conditions in PRIM; it processes visual text and background information in the image separately, ensuring the capability of multilingual translation while improving the visual quality. Experimental results indicate that VisTrans achieves better translation quality and visual effect compared to other models. The code and dataset are available at: https://github.com/BITHLP/PRIM.

[55] Breaking to Build: A Threat Model of Prompt-Based Attacks for Securing LLMs

Brennen Hill,Surendra Parla,Venkata Abhijeeth Balabhadruni,Atharv Prajod Padmalayam,Sujay Chandra Shekara Sharma

Main category: cs.CL

TL;DR: This paper surveys prompt-based attack methodologies on Large Language Models to understand security threats and develop robust countermeasures.

Details

Motivation: The proliferation of Large Language Models (LLMs) has introduced critical security challenges, where adversarial actors can manipulate input prompts to cause significant harm and circumvent safety alignments. Method: The paper presents a comprehensive literature survey of prompt-based attack methodologies, categorizing them to provide a clear threat model. Result: The paper provides a systematic understanding of prompt-based attack vectors and aims to inform the research community's efforts in building the next generation of secure LLMs. Conclusion: The paper concludes that understanding prompt-based attack vectors is crucial for developing robust countermeasures and building secure LLMs resistant to unauthorized distillation, fine-tuning, and editing. Abstract: The proliferation of Large Language Models (LLMs) has introduced critical security challenges, where adversarial actors can manipulate input prompts to cause significant harm and circumvent safety alignments. These prompt-based attacks exploit vulnerabilities in a model's design, training, and contextual understanding, leading to intellectual property theft, misinformation generation, and erosion of user trust. A systematic understanding of these attack vectors is the foundational step toward developing robust countermeasures. This paper presents a comprehensive literature survey of prompt-based attack methodologies, categorizing them to provide a clear threat model. By detailing the mechanisms and impacts of these exploits, this survey aims to inform the research community's efforts in building the next generation of secure LLMs that are inherently resistant to unauthorized distillation, fine-tuning, and editing.

[56] Comparative Analysis of Transformer Models in Disaster Tweet Classification for Public Safety

Sharif Noor Zisad,Ragib Hasan

Main category: cs.CL

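TL;DR: Transformer-based models such as BERT understand and classify disaster-related social media text better than traditional machine learning methods, which can help emergency services respond faster and more effectively.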

Details

Motivation: Traditional machine learning models (such as Logistic Regression, Naive Bayes, and Support Vector Machines) fall short in understanding the context or deeper meaning of informal, metaphorical, or ambiguous language, whereas Transformer models may perform better on such tasks. Method: The effectiveness of Transformer-based models (BERT, DistilBERT, RoBERTa, and DeBERTa) for classifying disaster-related tweets is evaluated and compared against traditional machine learning approaches. Result: Experimental results show that BERT achieved the highest accuracy (91%), significantly outperforming traditional models such as Logistic Regression and Naive Bayes (both at 82%). Conclusion: Transformer models outperform traditional machine learning models at classifying disaster-related tweets, offering higher accuracy, deeper language understanding, and better generalization. Abstract: Twitter and other social media platforms have become vital sources of real time information during disasters and public safety emergencies. Automatically classifying disaster related tweets can help emergency services respond faster and more effectively. Traditional Machine Learning (ML) models such as Logistic Regression, Naive Bayes, and Support Vector Machines have been widely used for this task, but they often fail to understand the context or deeper meaning of words, especially when the language is informal, metaphorical, or ambiguous. We posit that, in this context, transformer based models can perform better than traditional ML models. In this paper, we evaluate the effectiveness of transformer based models, including BERT, DistilBERT, RoBERTa, and DeBERTa, for classifying disaster related tweets. These models are compared with traditional ML approaches to highlight the performance gap. Experimental results show that BERT achieved the highest accuracy (91%), significantly outperforming traditional models like Logistic Regression and Naive Bayes (both at 82%). The use of contextual embeddings and attention mechanisms allows transformer models to better understand subtle language in tweets, where traditional ML models fall short. This research demonstrates that transformer architectures are far more suitable for public safety applications, offering improved accuracy, deeper language understanding, and better generalization across real world social media text.

[57] Polysemantic Dropout: Conformal OOD Detection for Specialized LLMs

Ayush Gupta,Ramneet Kaur,Anirban Roy,Adam D. Cobb,Rama Chellappa,Susmit Jha

Main category: cs.CL

TL;DR: The paper proposes an out-of-domain detection method for specialized large language models using the ICAD framework and dropout tolerance, achieving better performance than baseline methods.

Details

Motivation: Specialized large language models (LLMs) are vulnerable to unreliable outputs when handling out-of-domain (OOD) inputs, which poses risks in critical applications. The paper aims to address this issue by proposing a novel OOD detection algorithm. Method: The method uses the Inductive Conformal Anomaly Detection (ICAD) framework with a new non-conformity measure based on the model's dropout tolerance. Dropout tolerance is aggregated across multiple layers using an ensemble approach. Result: Experiments on medical-specialized LLMs demonstrate that the proposed method outperforms baseline methods in detecting OOD inputs, achieving AUROC improvements of 2% to 37%. Conclusion: The paper concludes that the proposed OOD detection method improves detection performance compared to baseline methods, with significant AUROC improvements in experiments on medical-specialized LLMs. Abstract: We propose a novel inference-time out-of-domain (OOD) detection algorithm for specialized large language models (LLMs). Despite achieving state-of-the-art performance on in-domain tasks through fine-tuning, specialized LLMs remain vulnerable to incorrect or unreliable outputs when presented with OOD inputs, posing risks in critical applications. Our method leverages the Inductive Conformal Anomaly Detection (ICAD) framework, using a new non-conformity measure based on the model's dropout tolerance. Motivated by recent findings on polysemanticity and redundancy in LLMs, we hypothesize that in-domain inputs exhibit higher dropout tolerance than OOD inputs. We aggregate dropout tolerance across multiple layers via a valid ensemble approach, improving detection while maintaining theoretical false alarm bounds from ICAD. Experiments with medical-specialized LLMs show that our approach detects OOD inputs better than baseline methods, with AUROC improvements of $2\%$ to $37\%$ when treating OOD datapoints as positives and in-domain test datapoints as negatives.
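
The false-alarm guarantee that entry [57] inherits from ICAD comes from a conformal p-value: a test input's non-conformity score is ranked against scores from a held-out in-domain calibration set. Below is a minimal sketch of that standard step. The scores are hypothetical placeholders; in the paper they would be derived from dropout tolerance, which is not reproduced here.

```python
def conformal_p_value(calibration_scores, test_score):
    """Fraction of calibration scores at least as non-conforming as the test score."""
    n = len(calibration_scores)
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least + 1) / (n + 1)


def is_ood(calibration_scores, test_score, epsilon):
    """Flag the input as out-of-domain when its p-value falls below epsilon."""
    return conformal_p_value(calibration_scores, test_score) < epsilon


# Hypothetical non-conformity scores from in-domain calibration inputs
# (higher means less typical of the domain).
calibration = [0.10, 0.12, 0.15, 0.20, 0.22, 0.25, 0.30, 0.31, 0.35]

p_typical = conformal_p_value(calibration, 0.21)
p_unusual = conformal_p_value(calibration, 0.90)
print(p_typical, p_unusual)
print(is_ood(calibration, 0.90, epsilon=0.2))
```

With only nine calibration points the smallest possible p-value is 1/10, so a practical deployment would need a calibration set large enough for the chosen significance level.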

[58] AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs

Aisha Alansari,Hamzah Luqman

Main category: cs.CL

TL;DR: This paper evaluates hallucinations in Arabic and multilingual LLMs, finding that factual hallucinations are common and that the Arabic pre-trained model Allam performs well compared to other models.

Details

Motivation: The motivation stems from the knowledge gap in evaluating LLM hallucinations in the Arabic context despite its widespread use and importance in global communication, along with the increasing number of multilingual and Arabic-specific LLMs. Method: The study uses a fine-grained hallucination evaluation framework with 12 hallucination indicators to assess 12 LLMs, including Arabic pre-trained, multilingual, and reasoning-based models, on Arabic natural language generation tasks. Result: The results indicate that factual hallucinations are more common than faithfulness errors across all models and tasks, with the Allam Arabic pre-trained model outperforming multilingual models and performing comparably to reasoning-based models. Conclusion: The paper concludes that factual hallucinations are more prevalent than faithfulness errors in Arabic and multilingual LLMs, with the Arabic pre-trained model Allam showing consistently lower hallucination rates compared to multilingual models and performing comparably with reasoning-based models. Abstract: Recently, extensive research on the hallucination of the large language models (LLMs) has mainly focused on the English language. Despite the growing number of multilingual and Arabic-specific LLMs, evaluating LLMs' hallucination in the Arabic context remains relatively underexplored. The knowledge gap is particularly pressing given Arabic's widespread use across many regions and its importance in global communication and media. This paper presents the first comprehensive hallucination evaluation of Arabic and multilingual LLMs on two critical Arabic natural language generation tasks: generative question answering (GQA) and summarization. This study evaluates a total of 12 LLMs, including 4 Arabic pre-trained models, 4 multilingual models, and 4 reasoning-based models. 
To assess the factual consistency and faithfulness of LLMs' outputs, we developed a fine-grained hallucination evaluation framework consisting of 12 fine-grained hallucination indicators that represent the varying characteristics of each task. The results reveal that factual hallucinations are more prevalent than faithfulness errors across all models and tasks. Notably, the Arabic pre-trained model Allam consistently demonstrates lower hallucination rates than multilingual models and performance comparable to reasoning-based models. The code is available at: https://github.com/aishaalansari57/AraHalluEval.

[59] Evaluating NL2SQL via SQL2NL

Mohammadtaher Safarzadeh,Afshin Oroojlooyjadid,Dan Roth

Main category: cs.CL

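TL;DR: This paper proposes a new schema-aligned paraphrasing framework for evaluating the robustness of NL2SQL models to linguistic variation, revealing the brittleness of existing models.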

Details

Motivation: Existing benchmarks rarely address, in a systematic or controlled manner, how linguistic variation affects the generalization of NL2SQL models. Method: A novel schema-aligned paraphrasing framework is proposed that uses SQL2NL to automatically generate semantically equivalent, lexically diverse queries while maintaining alignment with the original schema and intent. Result: The analysis shows that state-of-the-art models are far more brittle than standard benchmarks suggest. For example, LLaMa3.3-70B exhibits a 10.23% drop in execution accuracy (from 77.11% to 66.9%) on paraphrased Spider queries, while LLaMa3.1-8B suffers an even larger drop of nearly 20% (from 62.9% to 42.5%). Conclusion: Evaluating the robustness of NL2SQL models to linguistic variation is key to understanding their generalization capabilities, yet existing benchmarks rarely address this factor systematically. Abstract: Robust evaluation in the presence of linguistic variation is key to understanding the generalization capabilities of Natural Language to SQL (NL2SQL) models, yet existing benchmarks rarely address this factor in a systematic or controlled manner. We propose a novel schema-aligned paraphrasing framework that leverages SQL-to-NL (SQL2NL) to automatically generate semantically equivalent, lexically diverse queries while maintaining alignment with the original schema and intent. This enables the first targeted evaluation of NL2SQL robustness to linguistic variation in isolation, distinct from prior work that primarily investigates ambiguity or schema perturbations. Our analysis reveals that state-of-the-art models are far more brittle than standard benchmarks suggest. For example, LLaMa3.3-70B exhibits a 10.23% drop in execution accuracy (from 77.11% to 66.9%) on paraphrased Spider queries, while LLaMa3.1-8B suffers an even larger drop of nearly 20% (from 62.9% to 42.5%). Smaller models (e.g., GPT-4o mini) are disproportionately affected. We also find that robustness degradation varies significantly with query complexity, dataset, and domain -- highlighting the need for evaluation frameworks that explicitly measure linguistic generalization to ensure reliable performance in real-world settings.
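
The "execution accuracy" reported in entry [59] compares the rows returned by a predicted query with those returned by the gold query, rather than comparing SQL strings. A minimal sketch using Python's built-in sqlite3 module follows. The table, rows, and predicted queries are hypothetical, and the order-insensitive row comparison is a simplifying assumption rather than the benchmark's exact rule.

```python
import sqlite3
from collections import Counter


def execution_match(conn, gold_sql, predicted_sql):
    """True when both queries return the same multiset of rows."""
    gold_rows = conn.execute(gold_sql).fetchall()
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # an invalid predicted query counts as a miss
    return Counter(gold_rows) == Counter(predicted_rows)


# Hypothetical toy database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO singer VALUES (?, ?)", [("Ann", 25), ("Bo", 30), ("Cy", 41)]
)

gold = "SELECT name FROM singer WHERE age > 30"
# Hypothetical model outputs for the original question and for a paraphrase of it.
pred_original = "SELECT name FROM singer WHERE age > 30"
pred_paraphrase = "SELECT name FROM singer WHERE age >= 30"

results = [
    execution_match(conn, gold, pred_original),
    execution_match(conn, gold, pred_paraphrase),
]
accuracy = sum(results) / len(results)
print(results, accuracy)
```

Running the same check on original and paraphrased questions, and comparing the two accuracies, gives the kind of robustness gap the abstract reports.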

[60] Why Language Models Hallucinate

Adam Tauman Kalai,Ofir Nachum,Santosh S. Vempala,Edwin Zhang

Main category: cs.CL

TL;DR: This paper argues that hallucinations in language models are a result of current training and evaluation practices that reward guessing, and proposes modifying benchmark scoring to create more trustworthy AI systems.

Details

Motivation: Hallucinations in large language models undermine trust and persist even in state-of-the-art systems, which is problematic as these models are increasingly used in important applications. Method: The paper analyzes the statistical causes of hallucinations in the modern training pipeline of language models and examines how the current evaluation procedures contribute to the persistence of hallucinations. Result: The authors find that hallucinations originate as errors in binary classification and persist due to the way most evaluations are graded, which rewards guessing over acknowledging uncertainty. Conclusion: Modifying the scoring of existing benchmarks can address the persistence of hallucinations in language models and lead to more trustworthy AI systems. Abstract: Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. 
This change may steer the field toward more trustworthy AI systems.
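The grading argument can be illustrated with a small expected-value calculation. The sketch below is not the paper's proposed scoring rule: the point values (1 for a correct answer, 0 for abstaining, and a hypothetical `wrong_penalty` for an incorrect answer) are assumptions chosen only to show why accuracy-only grading makes guessing the better policy.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of answering: +1 if right, -wrong_penalty if wrong.

    Abstaining is assumed to score 0 (an illustrative convention).
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def best_action(p_correct: float, wrong_penalty: float) -> str:
    """Pick the action with the higher expected score; ties go to abstaining."""
    return "guess" if expected_score(p_correct, wrong_penalty) > 0.0 else "abstain"


if __name__ == "__main__":
    # Accuracy-only grading (no penalty): guessing wins even at 30% confidence.
    print(best_action(0.3, wrong_penalty=0.0))
    # With a penalty of 1 point per wrong answer, abstaining wins at 30% confidence.
    print(best_action(0.3, wrong_penalty=1.0))
```

Under the zero-penalty setting, any nonzero chance of being right makes guessing at least as good as abstaining, which is the incentive the abstract describes.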

[61] ODKE+: Ontology-Guided Open-Domain Knowledge Extraction with LLMs

Samira Khorshidi,Azadeh Nikfarjam,Suprita Shankar,Yisi Sang,Yash Govind,Hyun Jang,Ali Kasgari,Alexis McClimans,Mohamed Soliman,Vishnu Konda,Ahmed Fakhry,Xiaoguang Qi

Main category: cs.CL

TL;DR: ODKE+ is a production-grade system that efficiently extracts and ingests millions of facts into knowledge graphs with high precision, leveraging large language models and ontology-guided workflows.

Details

Motivation: Maintaining the freshness and completeness of knowledge graphs (KGs) is costly, and traditional methods have limitations in coverage and update frequency. Method: ODKE+ uses a scalable pipeline with modular components, including an Extraction Initiator, Evidence Retriever, hybrid Knowledge Extractors, a Grounder, and a Corroborator. It dynamically generates ontology snippets and supports batch and streaming modes. Result: ODKE+ processes over 9 million Wikipedia pages and ingests 19 million high-confidence facts with 98.8% precision, achieving up to 48% overlap with third-party KGs and reducing update lag by 50 days on average. Conclusion: ODKE+ demonstrates that LLM-based extraction, grounded in ontological structure and verification workflows, can deliver trustworthiness and production-scale knowledge ingestion with broad real-world applicability. Abstract: Knowledge graphs (KGs) are foundational to many AI applications, but maintaining their freshness and completeness remains costly. We present ODKE+, a production-grade system that automatically extracts and ingests millions of open-domain facts from web sources with high precision. ODKE+ combines modular components into a scalable pipeline: (1) the Extraction Initiator detects missing or stale facts, (2) the Evidence Retriever collects supporting documents, (3) hybrid Knowledge Extractors apply both pattern-based rules and ontology-guided prompting for large language models (LLMs), (4) a lightweight Grounder validates extracted facts using a second LLM, and (5) the Corroborator ranks and normalizes candidate facts for ingestion. ODKE+ dynamically generates ontology snippets tailored to each entity type to align extractions with schema constraints, enabling scalable, type-consistent fact extraction across 195 predicates. The system supports batch and streaming modes, processing over 9 million Wikipedia pages and ingesting 19 million high-confidence facts with 98.8% precision. 
ODKE+ significantly improves coverage over traditional methods, achieving up to 48% overlap with third-party KGs and reducing update lag by 50 days on average. Our deployment demonstrates that LLM-based extraction, grounded in ontological structure and verification workflows, can deliver trustworthy, production-scale knowledge ingestion with broad real-world applicability. A recording of the system demonstration is included with the submission and is also available at https://youtu.be/UcnE3_GsTWs.

[62] OleSpeech-IV: A Large-Scale Multispeaker and Multilingual Conversational Speech Dataset with Diverse Topics

Wei Chu,Yuanzhe Dong,Ke Tan,Dong Han,Xavier Menendez-Pidal,Ruchao Fan,Chenfeng Miao,Chanwoo Kim,Bhiksha Raj,Rita Singh

Main category: cs.CL

TL;DR: OleSpeech-IV is a large-scale multispeaker, multilingual conversational speech dataset suitable for research and non-commercial use.

Details

Motivation: To advance speech recognition and natural language processing by providing high-quality multilingual conversational data. Method: The audio comes from publicly available English podcasts, talk shows, teleconferences, and similar sources, and is processed by a proprietary pipeline to extract speaker information, transcripts, timestamps, and confidence scores. Result: The OleSpeech-IV dataset was released, and a subset, OleSpeech-IV-2025-EN-AR-100, was open-sourced for non-commercial research. Conclusion: OleSpeech-IV provides a valuable resource for speech processing research, especially for multilingual and multispeaker scenarios. Abstract: The OleSpeech-IV dataset is a large-scale multispeaker and multilingual conversational speech dataset with diverse topics. The audio content comes from publicly available English podcasts, talk shows, teleconferences, and other conversations. Speaker names, turns, and transcripts are human-sourced and refined by a proprietary pipeline, while additional information such as timestamps and confidence scores is derived from the pipeline. The IV denotes its position as Tier IV in the Olewave dataset series. In addition, we have open-sourced a subset, OleSpeech-IV-2025-EN-AR-100, for non-commercial research use.

[63] KERAG: Knowledge-Enhanced Retrieval-Augmented Generation for Advanced Question Answering

Yushi Sun,Kai Sun,Yifan Ethan Xu,Xiao Yang,Xin Luna Dong,Nan Tang,Lei Chen

Main category: cs.CL

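TL;DR: This paper proposes KERAG, a new knowledge-graph-based question answering approach that retrieves a broader subgraph and combines it with fine-tuned models, significantly improving QA coverage and quality.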

Details

Motivation: Traditional knowledge graph question answering methods rely on semantic parsing and often have low coverage because of rigid schema requirements and semantic ambiguity, so a new approach is needed to mitigate these problems. Method: Proposes KERAG, a new KG-based RAG pipeline that combines a retrieval-filtering-summarization approach with LLMs fine-tuned for Chain-of-Thought reasoning. Result: Experiments show that KERAG surpasses state-of-the-art solutions by about 7% in quality and exceeds GPT-4o (Tool) by 10-21%. Conclusion: KERAG is a knowledge-graph-based retrieval-augmented generation pipeline that improves QA coverage by retrieving a broader subgraph likely to contain relevant information. It combines retrieval-filtering-summarization with LLMs fine-tuned for Chain-of-Thought reasoning, which reduces noise and improves QA for both simple and complex questions. Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucination in Large Language Models (LLMs) by incorporating external data, with Knowledge Graphs (KGs) offering crucial information for question answering. Traditional Knowledge Graph Question Answering (KGQA) methods rely on semantic parsing, which typically retrieves knowledge strictly necessary for answer generation, and thus often suffer from low coverage due to rigid schema requirements and semantic ambiguity. We present KERAG, a novel KG-based RAG pipeline that enhances QA coverage by retrieving a broader subgraph likely to contain relevant information. Our retrieval-filtering-summarization approach, combined with fine-tuned LLMs for Chain-of-Thought reasoning on knowledge sub-graphs, reduces noise and improves QA for both simple and complex questions. Experiments demonstrate that KERAG surpasses state-of-the-art solutions by about 7% in quality and exceeds GPT-4o (Tool) by 10-21%.

[64] A Study of Large Language Models for Patient Information Extraction: Model Architecture, Fine-Tuning Strategy, and Multi-task Instruction Tuning

Cheng Peng,Xinyu Dong,Mengxian Lyu,Daniel Paredes,Yaoyun Zhang,Yonghui Wu

Main category: cs.CL

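TL;DR: This study analyzes the use of large language models (LLMs) for clinical patient information extraction and finds that decoder-based models and multi-task instruction tuning perform well in few-shot and zero-shot learning, offering a new direction for building efficient, generalizable clinical information extraction systems.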

Details

Motivation: Although large language models (LLMs) have made significant progress in natural language processing, how best to apply them to clinical patient information extraction requires further study. This study examines how LLM architecture, fine-tuning strategy, and multi-task instruction tuning affect system robustness and generalization. Method: The study evaluates a range of models, including encoder-based LLMs (such as BERT and GatorTron) and decoder-based LLMs (such as GatorTronGPT and Llama 3.1), and compares traditional full-size fine-tuning with prompt-based parameter-efficient fine-tuning (PEFT). It also explores a multi-task instruction tuning framework and evaluates zero-shot and few-shot learning performance using a leave-one-dataset-out strategy. Result: Decoder-based LLMs outperform encoder-based LLMs on some clinical information extraction tasks. Prompt-based PEFT performs well in few-shot settings, and the multi-task instruction tuning framework significantly improves zero-shot and few-shot performance. Conclusion: By comparing LLM architectures and fine-tuning strategies, the study explores how best to build robust and generalizable systems for clinical patient information extraction. It finds that multi-task instruction tuning and parameter-efficient fine-tuning can significantly improve zero-shot and few-shot performance. Abstract: Natural language processing (NLP) is a key technology to extract important patient information from clinical narratives to support healthcare applications. The rapid development of large language models (LLMs) has revolutionized many NLP tasks in the clinical domain, yet their optimal use in patient information extraction tasks requires further exploration. This study examines LLMs' effectiveness in patient information extraction, focusing on LLM architectures, fine-tuning strategies, and multi-task instruction tuning techniques for developing robust and generalizable patient information extraction systems. This study aims to explore key concepts of using LLMs for clinical concept and relation extraction tasks, including: (1) encoder-only or decoder-only LLMs, (2) prompt-based parameter-efficient fine-tuning (PEFT) algorithms, and (3) multi-task instruction tuning on few-shot learning performance. We benchmarked a suite of LLMs, including encoder-based LLMs (BERT, GatorTron) and decoder-based LLMs (GatorTronGPT, Llama 3.1, GatorTronLlama), across five datasets. We compared traditional full-size fine-tuning and prompt-based PEFT. We explored a multi-task instruction tuning framework that combines both tasks across four datasets to evaluate the zero-shot and few-shot learning performance using the leave-one-dataset-out strategy.
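The leave-one-dataset-out strategy named in the abstract can be sketched as a simple fold generator. This is a minimal illustration, not the authors' code: the dataset names and example strings are hypothetical placeholders, and real folds would hold annotated clinical notes rather than strings.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_dataset_out(
    datasets: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_name, train_examples, test_examples) for each dataset.

    Each fold trains on every dataset except one and evaluates on the
    held-out dataset, which the model has never seen during tuning.
    """
    for held_out in datasets:
        train = [
            example
            for name, examples in datasets.items()
            if name != held_out
            for example in examples
        ]
        yield held_out, train, list(datasets[held_out])


if __name__ == "__main__":
    # Hypothetical dataset names and contents, for illustration only.
    toy = {
        "ds_a": ["a1", "a2"],
        "ds_b": ["b1"],
        "ds_c": ["c1", "c2"],
        "ds_d": ["d1"],
    }
    for name, train, test in leave_one_dataset_out(toy):
        print(name, len(train), len(test))
```

Evaluating on the held-out dataset with no examples from it corresponds to the zero-shot setting; adding a handful of its examples to the training side would give a few-shot variant.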

[65] Research on Multi-hop Inference Optimization of LLM Based on MQUAKE Framework

Zucheng Liang,Wenxin Wei,Kaijie Zhang,Hongyi Chen

Main category: cs.CL

TL;DR: This paper proposes a multi-hop question decomposition method to improve LLMs' ability to answer complex questions. Using the MQUAKE framework and LLAMA3 model, it demonstrates that decomposing questions into multiple steps significantly enhances performance, especially before training, and maintains superiority even after fine-tuning with LoRA.

Details

Motivation: Accurately answering complex questions is a significant challenge for Large Language Models (LLMs), and this paper explores how multi-hop question decomposition within knowledge graphs can improve model comprehension and reasoning accuracy. Method: A multi-hop question decomposition method was developed using the MQUAKE framework and applied to the LLAMA3 model. The MQUAKE-T dataset was converted into single-hop and multi-hop datasets for comparison. The LLAMA3 model was fine-tuned using the LoRA method. Result: Without fine-tuning, the multi-hop question decomposition method outperformed the direct answering method. After fine-tuning with LoRA, both methods improved, but the multi-hop decomposition method consistently remained superior. Conclusion: The multi-hop decomposition method enhances the LLM's ability to answer complex questions effectively both before and after training. Abstract: Accurately answering complex questions has consistently been a significant challenge for Large Language Models (LLMs). To address this, this paper proposes a multi-hop question decomposition method for complex questions, building upon research within the MQUAKE framework. Utilizing the LLAMA3 model, we systematically investigate the impact of multi-hop question decomposition within knowledge graphs on model comprehension and reasoning accuracy, both before and after model training. In our experiments, we systematically partitioned and converted the MQUAKE-T dataset into two distinct formats: a single-hop dataset designed for directly answering complex questions, and a multi-hop dataset constructed using the multi-hop question decomposition method. We then fine-tuned the LLAMA3 model on these datasets and conducted inference tests. Our results demonstrate that, without fine-tuning the LLM, the prediction performance based on the multi-hop question decomposition method significantly outperforms the method of directly answering complex questions. 
After fine-tuning using the LoRA (Low-Rank Adaptation) method, the performance of both approaches improved compared to the untrained baseline. Crucially, the method utilizing multi-hop decomposition consistently maintained its superiority. These findings validate the effectiveness of the multi-hop decomposition method both before and after training, demonstrating its capability to effectively enhance the LLM's ability to answer complex questions.
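The idea of decomposing a multi-hop question into a chain of single-hop steps can be sketched on a toy knowledge graph. This is an illustrative assumption, not the paper's pipeline: in the paper an LLM answers each sub-question, whereas here a dictionary lookup stands in for the model, and the relation names `country_of_birth` and `capital` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Toy knowledge graph mapping (subject, relation) -> object.
KG: Dict[Tuple[str, str], str] = {
    ("Marie Curie", "country_of_birth"): "Poland",
    ("Poland", "capital"): "Warsaw",
}


def answer_multi_hop(
    entity: str,
    relations: List[str],
    kg: Dict[Tuple[str, str], str],
) -> Optional[str]:
    """Resolve a multi-hop question one hop at a time.

    The answer of each hop becomes the subject of the next hop.
    Returns None as soon as any hop cannot be answered.
    """
    current = entity
    for relation in relations:
        next_entity = kg.get((current, relation))
        if next_entity is None:
            return None
        current = next_entity
    return current


if __name__ == "__main__":
    # "What is the capital of the country where Marie Curie was born?"
    print(answer_multi_hop("Marie Curie", ["country_of_birth", "capital"], KG))
```

Answering the question directly would require producing the final entity in one step; the decomposed form makes each intermediate answer explicit and checkable.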

[66] Decoders Laugh as Loud as Encoders

Eli Borodach,Raj Dandekar,Rajat Dandekar,Sreedath Panat

Main category: cs.CL

TL;DR: This study shows that GPT-4o, a fine-tuned decoder, performs almost as well as RoBERTa, a fine-tuned encoder, in understanding humor, suggesting that modern LLMs may grasp nuanced human communication.

Details

Motivation: The motivation stems from the ongoing question of whether computers can truly understand nuanced human communication, such as humor, especially with the recent advancements in Large Language Models (LLMs) showing human-like performance in various NLP tasks. Method: The researchers evaluated the performance of a fine-tuned decoder model (GPT-4o) in understanding humor and compared it with the performance of a fine-tuned encoder model (RoBERTa) using the F1-macro score as a metric. Result: The fine-tuned decoder (GPT-4o) achieved a Mean F1-macro score of 0.85, performing comparably to the best fine-tuned encoder (RoBERTa), which had a Mean F1-score of 0.86. Conclusion: The study concludes that fine-tuned decoders, such as GPT-4o, can perform comparably to encoder models like RoBERTa in understanding nuanced themes such as humor, as indicated by similar F1-score results. Abstract: From the dawn of the computer, Alan Turing dreamed of a robot that could communicate using language as a human being. The recent advances in the field of Large Language Models (LLMs) shocked the scientific community when a single model could be applied to various natural language processing (NLP) tasks, with outputs that are sometimes even better than most human communication skills. Models such as GPT, Claude, Grok, etc. have left their mark on the scientific community. However, it is unclear how much these models understand what they produce, especially in a nuanced theme such as humor. The question of whether computers understand humor is still open (among the decoders, the latest to be checked was GPT-2). We addressed this issue in this paper; we have shown that a fine-tuned decoder (GPT-4o) performed (Mean F1-macro score of 0.85) as well as the best fine-tuned encoder (RoBERTa with a Mean F1-score of 0.86).
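For readers unfamiliar with the metric, macro-averaged F1 is the unweighted mean of the per-class F1 scores, so each class counts equally regardless of its size. The sketch below is a plain-Python illustration with made-up labels; it is not the authors' evaluation code, and treating a class with no true or predicted instances as F1 = 0 is a convention assumed here.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred), key=repr)
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Made-up labels: 1 = humorous, 0 = not humorous.
    y_true = [1, 1, 1, 0]
    y_pred = [1, 1, 0, 0]
    # Class 0 has F1 = 2/3 and class 1 has F1 = 0.8, so the macro mean is about 0.733.
    print(round(macro_f1(y_true, y_pred), 3))
```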

[67] Enhancing Diversity in Large Language Models via Determinantal Point Processes

Yilei Chen,Souradip Chakraborty,Lorenz Wolf,Ioannis Ch. Paschalidis,Aldo Pacchiano

Main category: cs.CL

TL;DR: Proposes DQO, a method that uses determinantal point processes to improve the semantic diversity of large language models.

Details

Motivation: Supervised fine-tuning and reinforcement learning improve model performance but reduce output diversity, and existing methods for enhancing diversity are limited. Method: A method based on determinantal point processes (DPPs) that samples and embeds a group of responses for each prompt and measures diversity with the determinant of a kernel-based similarity matrix. Result: Experiments show that DQO substantially improves semantic diversity on instruction-following, summarization, story generation, and reasoning tasks. Conclusion: DQO substantially improves semantic diversity without sacrificing model quality. Abstract: Supervised fine-tuning and reinforcement learning are two popular methods for post-training large language models (LLMs). While improving the model's performance on downstream tasks, they often reduce the model's output diversity, leading to narrow, canonical responses. Existing methods to enhance diversity are limited, either by operating at inference time or by focusing on lexical differences. We propose a novel training method named DQO based on determinantal point processes (DPPs) to jointly optimize LLMs for quality and semantic diversity. Our approach samples and embeds a group of responses for each prompt, then uses the determinant of a kernel-based similarity matrix to measure diversity as the volume spanned by the embeddings of these responses. Experiments across instruction-following, summarization, story generation, and reasoning tasks demonstrate that our method substantially improves semantic diversity without sacrificing model quality.
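The determinant-as-volume idea can be sketched with a few toy vectors. This is not the DQO training objective: the cosine-similarity kernel, the small `eps` ridge added for numerical stability, and the example embeddings are all assumptions made for illustration, and the log-determinant is computed with a hand-written Cholesky factorization to keep the example dependency-free.

```python
import math
from typing import List


def cosine_kernel(embeddings: List[List[float]]) -> List[List[float]]:
    """Gram matrix of unit-normalized embeddings (rows must be nonzero)."""
    unit = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        unit.append([x / norm for x in vec])
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def logdet_diversity(embeddings: List[List[float]], eps: float = 1e-6) -> float:
    """log det(K + eps*I) via Cholesky; larger means more spread-out responses."""
    kernel = cosine_kernel(embeddings)
    n = len(kernel)
    lower = [[0.0] * n for _ in range(n)]
    logdet = 0.0
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                diag = kernel[i][i] + eps - partial
                lower[i][j] = math.sqrt(diag)
                # det(K) is the product of squared Cholesky diagonal entries.
                logdet += 2.0 * math.log(lower[i][j])
            else:
                lower[i][j] = (kernel[i][j] - partial) / lower[j][j]
    return logdet


if __name__ == "__main__":
    diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    similar = [[1.0, 0.0, 0.0], [1.0, 0.01, 0.0], [1.0, 0.0, 0.01]]
    print(logdet_diversity(diverse) > logdet_diversity(similar))
```

Orthogonal embeddings span the largest volume, so their log-determinant is near zero, while near-duplicate embeddings collapse the volume and drive the log-determinant strongly negative.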

[68] Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects

Gunmay Handa,Zekun Wu,Adriano Koshiyama,Philip Treleaven

Main category: cs.CL

TL;DR: This paper systematically studies personality control in LLMs through the Big Five traits, comparing methods like ICL, PEFT, and MS, and identifies their trade-offs, effectiveness, and trait-level challenges, positioning mechanistic steering as a lightweight alternative to fine-tuning.

Details

Motivation: Personality manipulation in LLMs is increasingly used in customer service and agentic scenarios, but the mechanisms and trade-offs involved are not well understood. Method: The study constructs a contrastive dataset using the Big Five traits, introduces a unified evaluation framework based on within-run Δ analysis, develops trait purification techniques, and proposes a three-level stability framework for quantifying robustness. Result: Experiments revealed trade-offs among in-context learning (ICL), parameter-efficient fine-tuning (PEFT), and mechanistic steering (MS). ICL aligns well with minimal capability loss, PEFT offers the highest alignment but degrades task performance, and MS provides lightweight runtime control with competitive effectiveness. Openness was uniquely challenging, agreeableness resistant to ICL, and personality encoding consolidated around intermediate layers. Conclusion: Personality manipulation in LLMs serves as a multi-level probe into behavioral representation and positions mechanistic steering as a lightweight alternative to fine-tuning for deployment and interpretability. Abstract: Personality manipulation in large language models (LLMs) is increasingly applied in customer service and agentic scenarios, yet its mechanisms and trade-offs remain unclear. We present a systematic study of personality control using the Big Five traits, comparing in-context learning (ICL), parameter-efficient fine-tuning (PEFT), and mechanistic steering (MS). Our contributions are fourfold. First, we construct a contrastive dataset with balanced high/low trait responses, enabling effective steering vector computation and fair cross-method evaluation. Second, we introduce a unified evaluation framework based on within-run $\Delta$ analysis that disentangles reasoning capability, agent performance, and demographic bias across MMLU, GAIA, and BBQ benchmarks. 
Third, we develop trait purification techniques to separate openness from conscientiousness, addressing representational overlap in trait encoding. Fourth, we propose a three-level stability framework that quantifies method-, trait-, and combination-level robustness, offering practical guidance under deployment constraints. Experiments on Gemma-2-2B-IT and LLaMA-3-8B-Instruct reveal clear trade-offs: ICL achieves strong alignment with minimal capability loss, PEFT delivers the highest alignment at the cost of degraded task performance, and MS provides lightweight runtime control with competitive effectiveness. Trait-level analysis shows openness as uniquely challenging, agreeableness as most resistant to ICL, and personality encoding consolidating around intermediate layers. Taken together, these results establish personality manipulation as a multi-level probe into behavioral representation, linking surface conditioning, parameter encoding, and activation-level steering, and positioning mechanistic steering as a lightweight alternative to fine-tuning for both deployment and interpretability.

[69] Knowledge Collapse in LLMs: When Fluency Survives but Facts Fail under Recursive Synthetic Training

Figarri Keisha,Zekun Wu,Ze Wang,Adriano Koshiyama,Philip Treleaven

Main category: cs.CL

TL;DR: This paper identifies a three-stage phenomenon called knowledge collapse in large language models and proposes domain-specific synthetic training as a solution to improve collapse resistance while maintaining efficiency.

Details

Motivation: The motivation is to address the issue of knowledge collapse in large language models, which threatens factual reliability when models are trained recursively on their own outputs, especially in accuracy-dependent domains. Method: The authors conducted controlled experiments with recursive synthetic training to study the collapse trajectory and timing, using an evaluation framework that combines model-centric indicators with task-centric metrics. Result: The study demonstrates that collapse trajectory and timing are critically dependent on instruction format, and proposes domain-specific synthetic training as a mitigation strategy that significantly improves collapse resistance. Conclusion: The paper concludes that knowledge collapse in large language models can be mitigated through domain-specific synthetic training, which improves collapse resistance while maintaining computational efficiency. Abstract: Large language models increasingly rely on synthetic data due to human-written content scarcity, yet recursive training on model-generated outputs leads to model collapse, a degenerative process threatening factual reliability. We define knowledge collapse as a distinct three-stage phenomenon where factual accuracy deteriorates while surface fluency persists, creating "confidently wrong" outputs that pose critical risks in accuracy-dependent domains. Through controlled experiments with recursive synthetic training, we demonstrate that collapse trajectory and timing depend critically on instruction format, distinguishing instruction-following collapse from traditional model collapse through its conditional, prompt-dependent nature. We propose domain-specific synthetic training as a targeted mitigation strategy that achieves substantial improvements in collapse resistance while maintaining computational efficiency. 
Our evaluation framework combines model-centric indicators with task-centric metrics to detect distinct degradation phases, enabling reproducible assessment of epistemic deterioration across different language models. These findings provide both theoretical insights into collapse dynamics and practical guidance for sustainable AI training in knowledge-intensive applications where accuracy is paramount.

[70] Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

Ilham Wicaksono,Zekun Wu,Theo King,Adriano Koshiyama,Philip Treleaven

Main category: cs.CL

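TL;DR: This paper proposes AgentSeer, a framework for evaluating the safety risks of agentic systems in deployment. It finds that agentic-level vulnerability profiles differ significantly from model-level ones, underscoring the importance of agentic-situational evaluation.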

Details

Motivation: Large language models are transitioning to agentic systems, but existing safety evaluation frameworks have critical gaps in assessing deployment-specific risks. Method: Through cross-model validation on GPT-OSS-20B and Gemini-2.0-flash using HarmBench single-turn and iterative refinement attacks, the study systematically compares model-level and agentic-level vulnerability profiles. Result: The study finds vulnerabilities that exist only in agentic contexts, with tool-calling showing 24-60% higher attack success rate (ASR) than at the model level; agentic-level evaluation exposes specific risks that traditional methods cannot detect. Conclusion: AgentSeer fills the gap left by existing evaluation paradigms in agentic settings, reveals significant differences between model-level and agentic-level vulnerability profiles, and underscores the need for agent-specific evaluation methods. Abstract: As large language models transition to agentic systems, current safety evaluation frameworks face critical gaps in assessing deployment-specific risks. We introduce AgentSeer, an observability-based evaluation framework that decomposes agentic executions into granular action and component graphs, enabling systematic agentic-situational assessment. Through cross-model validation on GPT-OSS-20B and Gemini-2.0-flash using HarmBench single-turn and iterative refinement attacks, we demonstrate fundamental differences between model-level and agentic-level vulnerability profiles. Model-level evaluation reveals baseline differences: GPT-OSS-20B (39.47% ASR) versus Gemini-2.0-flash (50.00% ASR), with both models showing susceptibility to social engineering while maintaining logic-based attack resistance. However, agentic-level assessment exposes agent-specific risks invisible to traditional evaluation. We discover "agentic-only" vulnerabilities that emerge exclusively in agentic contexts, with tool-calling showing 24-60% higher ASR across both models. Cross-model analysis reveals universal agentic patterns, agent transfer operations as highest-risk tools, semantic rather than syntactic vulnerability mechanisms, and context-dependent attack effectiveness, alongside model-specific security profiles in absolute ASR levels and optimal injection strategies. Direct attack transfer from model-level to agentic contexts shows degraded performance (GPT-OSS-20B: 57% human injection ASR; Gemini-2.0-flash: 28%), while context-aware iterative attacks successfully compromise objectives that failed at model-level, confirming systematic evaluation gaps. 
These findings establish the urgent need for agentic-situation evaluation paradigms, with AgentSeer providing the standardized methodology and empirical validation.

[71] Analyzing Finnish Inflectional Classes through Discriminative Lexicon and Deep Learning Models

Alexandre Nikolaev,Yu-Ying Chuang,R. Harald Baayen

Main category: cs.CL

TL;DR: This study investigates if the Discriminative Lexicon Model can process Finnish noun inflections without inflectional classes, finding that performance depends on word frequency and class productivity.

Details

Motivation: The study aims to determine whether inflectional classes are cognitively real by testing if a model can learn to inflect nouns without explicit inflectional classes, addressing implications for language teaching and computational modeling. Method: The study uses the Discriminative Lexicon Model (DLM) to analyze comprehension and production of Finnish inflected nouns, utilizing a dataset of 55,271 inflected nouns from 2000 high-frequency Finnish nouns across 49 inflectional classes. Models are tested both with and without frequency-informed learning. Result: Models performed with high accuracy on training data but showed decreased accuracy on held-out test data. Performance generally increased for inflectional classes with more types and lower-frequency words, indicating better productivity for those classes. However, frequency was found to be the dominant predictor of performance in usage-based models. Conclusion: The study concludes that the Discriminative Lexicon Model can understand and produce Finnish inflected nouns without explicit inflectional classes, with performance varying based on frequency and productivity of the classes. Abstract: Descriptions of complex nominal or verbal systems make use of inflectional classes. Inflectional classes bring together nouns which have similar stem changes and use similar exponents in their paradigms. Although inflectional classes can be very useful for language teaching as well as for setting up finite state morphological systems, it is unclear whether inflectional classes are cognitively real, in the sense that native speakers would need to discover these classes in order to learn how to properly inflect the nouns of their language. This study investigates whether the Discriminative Lexicon Model (DLM) can understand and produce Finnish inflected nouns without setting up inflectional classes, using a dataset with 55,271 inflected nouns of 2000 high-frequency Finnish nouns from 49 inflectional classes. 
Several DLM comprehension and production models were set up. Some models were not informed about frequency of use, and provide insight into learnability with infinite exposure (endstate learning). Other models were set up from a usage based perspective, and were trained with token frequencies being taken into consideration (frequency-informed learning). On training data, models performed with very high accuracies. For held-out test data, accuracies decreased, as expected, but remained acceptable. Across most models, performance increased for inflectional classes with more types, more lower-frequency words, and more hapax legomena, mirroring the productivity of the inflectional classes. The model struggles more with novel forms of unproductive and less productive classes, and performs far better for unseen forms belonging to productive classes. However, for usage-based production models, frequency was the dominant predictor of model performance, and correlations with measures of productivity were tenuous or absent.

[72] AFD-SLU: Adaptive Feature Distillation for Spoken Language Understanding

Yan Xie,Yibo Cui,Liang Xie,Erwei Yin

Main category: cs.CL

TL;DR: This paper proposes an efficient Adaptive Feature Distillation framework for SLU systems, using a dynamic adapter and DDC to transfer semantic representations and adaptively adjust distillation strength, achieving state-of-the-art results on a Chinese profile-based benchmark.

Details

Motivation: SLU systems face challenges due to limited labeled data and the computational burden of deploying LLMs, prompting the need for a more efficient framework. Method: Introduces a dynamic adapter with RPNN to align feature spaces and uses DDC to adaptively adjust distillation strength based on performance feedback. Result: AFD-SLU achieves 95.67% intent accuracy, 92.02% slot F1 score, and 85.50% overall accuracy on the ProSLU benchmark. Conclusion: The proposed Adaptive Feature Distillation framework improves SLU systems by transferring semantic representations from a teacher model to a lightweight student model, achieving state-of-the-art results on a Chinese profile-based ProSLU benchmark. Abstract: Spoken Language Understanding (SLU) is a core component of conversational systems, enabling machines to interpret user utterances. Despite its importance, developing effective SLU systems remains challenging due to the scarcity of labeled training data and the computational burden of deploying Large Language Models (LLMs) in real-world applications. To further alleviate these issues, we propose an Adaptive Feature Distillation framework that transfers rich semantic representations from a General Text Embeddings (GTE)-based teacher model to a lightweight student model. Our method introduces a dynamic adapter equipped with a Residual Projection Neural Network (RPNN) to align heterogeneous feature spaces, and a Dynamic Distillation Coefficient (DDC) that adaptively modulates the distillation strength based on real-time feedback from intent and slot prediction performance. Experiments on the Chinese profile-based ProSLU benchmark demonstrate that AFD-SLU achieves state-of-the-art results, with 95.67% intent accuracy, 92.02% slot F1 score, and 85.50% overall accuracy.

[73] Memorization $\neq$ Understanding: Do Large Language Models Have the Ability of Scenario Cognition?

Boxiang Ma,Ru Li,Yuanlong Wang,Hongye Tan,Xiaoli Li

Main category: cs.CL

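TL;DR: The study proposes a new bi-perspective evaluation framework for assessing the scenario cognition of large language models (LLMs) and finds that current LLMs rely mainly on superficial memorization rather than deep semantic understanding.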

Details

Motivation: To investigate whether LLMs' generalization arises from memorization of training data or from deep semantic understanding. Method: Proposes a bi-perspective evaluation framework that assesses LLMs' scenario cognition by having them answer scenario-related questions and by probing their internal representations. Result: Experiments show that current LLMs rely mainly on superficial memorization even on simple scenario cognition tasks and fail to achieve robust semantic scenario cognition. Conclusion: Current LLMs rely mainly on superficial memorization and fail to achieve robust semantic scenario cognition, even in simple cases. Abstract: Driven by vast and diverse textual data, large language models (LLMs) have demonstrated impressive performance across numerous natural language processing (NLP) tasks. Yet, a critical question persists: does their generalization arise from mere memorization of training data or from deep semantic understanding? To investigate this, we propose a bi-perspective evaluation framework to assess LLMs' scenario cognition - the ability to link semantic scenario elements with their arguments in context. Specifically, we introduce a novel scenario-based dataset comprising diverse textual descriptions of fictional facts, annotated with scenario elements. LLMs are evaluated through their capacity to answer scenario-related questions (model output perspective) and via probing their internal representations for encoded scenario elements-argument associations (internal representation perspective). Our experiments reveal that current LLMs predominantly rely on superficial memorization, failing to achieve robust semantic scenario cognition, even in simple cases. These findings expose critical limitations in LLMs' semantic understanding and offer cognitive insights for advancing their capabilities.

[74] Using LLMs for Multilingual Clinical Entity Linking to ICD-10

Sylvia Vassileva,Ivan Koychev,Svetla Boytcheva

Main category: cs.CL

TL;DR: This paper proposes a multilingual clinical term linking system using LLMs and clinical dictionaries, achieving high performance on Spanish and Greek ICD-10 coding benchmarks.

Details

Motivation: Automatically assigning accurate ICD-10 codes to clinical texts simplifies healthcare professionals' work and ensures coding consistency, particularly across different languages. Method: The approach combines clinical dictionaries for matching unambiguous terms and uses in-context learning with GPT-4.1 for predicting ICD-10 codes for unmatched terms. Result: The system achieved strong results on benchmark datasets: 0.89 F1 for categories and 0.78 F1 for subcategories on CodiEsp (Spanish), and 0.85 F1 on ElCardioCC (Greek). Conclusion: The proposed multistage pipeline using clinical dictionaries and in-context learning with GPT-4.1 demonstrates promising performance in linking clinical terms to ICD-10 codes across different languages. Abstract: The linking of clinical entities is a crucial part of extracting structured information from clinical texts. It is the process of assigning a code from a medical ontology or classification to a phrase in the text. The International Classification of Diseases - 10th revision (ICD-10) is an international standard for classifying diseases for statistical and insurance purposes. Automatically assigning the correct ICD-10 code to terms in discharge summaries will simplify the work of healthcare professionals and ensure consistent coding in hospitals. Our paper proposes an approach for linking clinical terms to ICD-10 codes in different languages using Large Language Models (LLMs). The approach consists of a multistage pipeline that uses clinical dictionaries to match unambiguous terms in the text and then applies in-context learning with GPT-4.1 to predict the ICD-10 code for the terms that do not match the dictionary. Our system shows promising results in predicting ICD-10 codes on different benchmark datasets in Spanish - 0.89 F1 for categories and 0.78 F1 for subcategories on CodiEsp, and Greek - 0.85 F1 on ElCardioCC.
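The two-stage control flow described in the abstract (dictionary match first, model prediction only for unmatched terms) can be sketched as follows. This is a hypothetical illustration, not the authors' system: the two-entry dictionary, the lowercase normalization, and the `stub_llm` fallback are assumptions, and a real pipeline would call an LLM with in-context examples where the stub appears.

```python
from typing import Callable, Dict, List, Tuple


def link_terms(
    terms: List[str],
    dictionary: Dict[str, str],
    fallback: Callable[[str], str],
) -> List[Tuple[str, str, str]]:
    """Return (term, code, source) triples.

    Terms found in the dictionary are coded directly; all other terms are
    passed to the fallback predictor.
    """
    results = []
    for term in terms:
        key = term.strip().lower()  # simple normalization (an assumption)
        code = dictionary.get(key)
        if code is not None:
            results.append((term, code, "dictionary"))
        else:
            results.append((term, fallback(term), "llm"))
    return results


def stub_llm(term: str) -> str:
    """Placeholder for a model call; always returns a sentinel value."""
    return "UNKNOWN"


if __name__ == "__main__":
    # Tiny illustrative dictionary of Spanish terms and ICD-10 categories.
    toy_dictionary = {
        "diabetes mellitus tipo 2": "E11",
        "hipertensión esencial": "I10",
    }
    for row in link_terms(["Diabetes mellitus tipo 2", "dolor torácico"], toy_dictionary, stub_llm):
        print(row)
```

Routing unambiguous terms through the dictionary keeps model calls for the cases that actually need contextual judgment.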

[75] L1RA: Dynamic Rank Assignment in LoRA Fine-Tuning

Raul Singh,Nicolo Brunello,Vincenzo Scotti,Mark James Carman

Main category: cs.CL

TL;DR: L1RA is a new method for fine-tuning large language models more efficiently by dynamically redistributing low-rank adapters using L1 regularization, achieving better performance with less computational cost.

Details

Motivation: Fine-tuning large language models (LLMs) on downstream tasks has high computational demands, posing challenges when resources are limited. Method: L1RA uses L1 regularization to dynamically redistribute the rank of low-rank adapters during fine-tuning, optimizing resource utilization. Result: L1RA maintains comparable or even reduced computational overhead compared to other LoRA variants while achieving equal or better performance. Post-training analysis revealed that feed-forward layers and attention output projection required the most adaptation. Conclusion: L1RA is a promising technique for improving the efficiency and interpretability of LLM adaptation, especially in resource-constrained scenarios. Abstract: The ability of Large Language Models (LLMs) to solve complex tasks has made them crucial in the development of AI-based applications. However, the high computational requirements to fine-tune these LLMs on downstream tasks pose significant challenges, particularly when resources are limited. In response to this challenge, we introduce L1RA, a novel technique aimed at dynamically distributing the rank of low-rank adapters during fine-tuning using LoRA. Given a rank budget (i.e., total sum of adapters rank), L1RA leverages L1 regularisation to prune redundant ranks and redistribute them across adapters, thereby optimising resource utilisation. Through a series of comprehensive experiments, we empirically demonstrate that L1RA maintains comparable or even reduced computational overhead compared to other LoRA variants, including the vanilla approach, while achieving same or better performances. Moreover, the post-training analysis of rank distribution unveiled insights into the specific model components requiring the most adaptation to align with the task objective: the feed-forward layers and the attention output projection. 
These results highlight the efficacy of L1RA in not only enhancing the efficiency of LLM fine-tuning, but also in providing valuable diagnostic information for model refinement and customisation. In conclusion, L1RA stands as a promising technique for advancing the performance and interpretability of LLM adaptation, particularly in scenarios where computational resources are constrained.
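As a rough illustration of the idea in the abstract above, the sketch below treats each adapter as having one scalar gate per rank, applies an L1 penalty to those gates, prunes ranks whose gates have shrunk to (near) zero, and hands the freed ranks to other adapters so the total rank budget is preserved. The gate representation, the threshold, and the round-robin redistribution rule are assumptions made for illustration; the abstract does not specify them.

```python
from typing import Dict, List


def l1_penalty(gates: Dict[str, List[float]], lam: float) -> float:
    """L1 regularization term over all per-rank gate values."""
    return lam * sum(abs(c) for g in gates.values() for c in g)


def prune_and_redistribute(
    gates: Dict[str, List[float]], threshold: float = 1e-3
) -> Dict[str, int]:
    """Return a new rank per adapter, keeping the total rank budget fixed."""
    budget = sum(len(g) for g in gates.values())
    # Keep only ranks whose gate magnitude is above the threshold.
    kept = {name: sum(1 for c in g if abs(c) > threshold)
            for name, g in gates.items()}
    freed = budget - sum(kept.values())
    # Assumption: adapters that pruned nothing are the ones that need
    # more capacity, so they receive the freed ranks round-robin.
    saturated = [n for n in gates if kept[n] == len(gates[n])]
    targets = saturated or list(gates)
    new_ranks = dict(kept)
    for i in range(freed):
        new_ranks[targets[i % len(targets)]] += 1
    return new_ranks
```

With gates `{"attn_q": [0.5, 0.0, 0.0, 0.2], "ffn_up": [0.9, 0.7, 0.4, 0.3]}`, the two zeroed ranks of `attn_q` move to `ffn_up`, and the total stays at eight.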

[76] PLaMo 2 Technical Report

Preferred Networks: Kaizaburo Chubachi,Yasuhiro Fujita,Shinichi Hemmi,Yuta Hirokawa,Toshiki Kataoka,Goro Kobayashi,Kenichi Maehashi,Calvin Metzger,Hiroaki Mikami,Shogo Murai,Daisuke Nishino,Kento Nozawa,Shintarou Okada,Daisuke Okanohara,Shunta Saito,Shotaro Sano,Shuji Suzuki,Daisuke Tanaka,Avinash Ummadisingu,Hanqin Wang,Sixue Wang,Tianqi Xu

Main category: cs.CL

TL;DR: PLaMo 2 introduces efficient, large Japanese language models using a hybrid architecture, synthetic data training, and post-training optimization techniques to achieve state-of-the-art performance on Japanese benchmarks.

Details

Motivation: To overcome data scarcity and computational inefficiency in developing large Japanese language models. Method: PLaMo 2 uses a hybrid Samba-based architecture, continual pre-training with synthetic corpora, weight reuse, structured pruning, and post-training optimization techniques such as supervised fine-tuning (SFT) and direct preference optimization (DPO). Result: An 8B PLaMo 2 model achieves performance comparable to the previous 100B model, while being optimized for inference with minimal accuracy loss. It achieves state-of-the-art results on Japanese benchmarks, outperforming similarly-sized open models in instruction-following, language fluency, and Japanese-specific knowledge. Conclusion: PLaMo 2 is an efficient and effective solution for Japanese language modeling, combining architectural innovation, training strategies, and optimization techniques to deliver superior performance. Abstract: In this report, we introduce PLaMo 2, a series of Japanese-focused large language models featuring a hybrid Samba-based architecture that transitions to full attention via continual pre-training to support 32K token contexts. Training leverages extensive synthetic corpora to overcome data scarcity, while computational efficiency is achieved through weight reuse and structured pruning. This efficient pruning methodology produces an 8B model that achieves performance comparable to our previous 100B model. Post-training further refines the models using a pipeline of supervised fine-tuning (SFT) and direct preference optimization (DPO), enhanced by synthetic Japanese instruction data and model merging techniques. Optimized for inference using vLLM and quantization with minimal accuracy loss, the PLaMo 2 models achieve state-of-the-art results on Japanese benchmarks, outperforming similarly-sized open models in instruction-following, language fluency, and Japanese-specific knowledge.

[77] ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning

Jianghao Chen,Wei Sun,Qixiang Yin,Lingxing Kong,Zhixing Tan,Jiajun Zhang

Main category: cs.CL

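TL;DR: The ACE-RL framework improves the quality of long-form generation by large language models through an adaptive constraint-enhanced reward mechanism, significantly outperforming existing methods.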

Details

Motivation: Existing work has two main limitations: a reliance on scarce, high-quality long-form response data, and a focus on coarse-grained quality dimensions that overlooks the fine-grained specifics of different long-form generation scenarios. Method: ACE-RL identifies the underlying intents and demands of each instruction and automatically decomposes it into fine-grained, adaptive constraint criteria. A reward mechanism then converts subjective quality evaluation into constraint verification, and reinforcement learning is used to improve the model's long-form generation. Result: On WritingBench, ACE-RL outperforms existing SFT and RL baselines by 20.70% and 7.32% respectively, and its best model surpasses GPT-4o by 7.10%. Conclusion: ACE-RL offers an effective training paradigm that lets large language models generate high-quality content across diverse long-form generation scenarios, and it significantly outperforms existing SFT and RL baselines on WritingBench. Abstract: Large Language Models (LLMs) have demonstrated remarkable progress in long-context understanding, yet they face significant challenges in high-quality long-form generation. Existing studies primarily suffer from two limitations: (1) A heavy reliance on scarce, high-quality long-form response data for supervised fine-tuning (SFT) or for pairwise preference reward in reinforcement learning (RL). (2) Focus on coarse-grained quality optimization dimensions, such as relevance, coherence, and helpfulness, overlooking the fine-grained specifics inherent to diverse long-form generation scenarios. To address this issue, we propose a framework using Adaptive Constraint-Enhanced reward for long-form generation Reinforcement Learning (ACE-RL). ACE-RL first automatically deconstructs each instruction into a set of fine-grained, adaptive constraint criteria by identifying its underlying intents and demands. Subsequently, we design a reward mechanism that quantifies the quality of long-form responses based on their satisfaction over corresponding constraints, converting subjective quality evaluation into constraint verification. Finally, we utilize reinforcement learning to guide models toward superior long-form generation capabilities. Experimental results demonstrate that our ACE-RL framework significantly outperforms existing SFT and RL baselines by 20.70% and 7.32% on WritingBench, and our top-performing model even surpasses proprietary systems like GPT-4o by 7.10%, providing a more effective training paradigm for LLMs to generate high-quality content across diverse long-form generation scenarios.
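The reward described in the abstract above, which scores a response by how well it satisfies a set of instruction-specific constraints, could look something like the sketch below. It assumes the simplest possible aggregation (the fraction of constraints satisfied) and represents each constraint as a plain Python predicate. Both choices are illustrative assumptions: the abstract states neither how constraints are verified nor how they are aggregated.

```python
from typing import Callable, List

Constraint = Callable[[str], bool]


def constraint_reward(response: str, constraints: List[Constraint]) -> float:
    """Fraction of constraints the response satisfies, in [0, 1]."""
    if not constraints:
        return 0.0  # no constraints: no signal (a design assumption)
    satisfied = sum(1 for check in constraints if check(response))
    return satisfied / len(constraints)


# Hypothetical constraints for an instruction such as
# "Write a short explanation of insurance that ends with a full stop."
example_constraints: List[Constraint] = [
    lambda r: len(r.split()) >= 5,       # minimum length
    lambda r: "insurance" in r.lower(),  # stays on topic
    lambda r: r.endswith("."),           # formatting requirement
]
```

The scalar returned here would serve as the RL reward for one sampled response; swapping the predicates for verifier-model calls does not change the aggregation logic.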

Midhun Shyam,Jim Basilakis,Kieran Luken,Steven Thomas,John Crozier,Paul M. Middleton,X. Rosalind Wang

Main category: cs.CL

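TL;DR: This paper presents a pipeline for classifying triage data with limited compute resources; by leveraging a pre-trained large language model and carefully curated datasets, it enables analysis of sensitive hospital data.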

Details

Motivation: Triage notes contain a wealth of information that helps in understanding Emergency Department patient epidemiology and the degree of time-dependent illness or injury, but applying modern natural language processing and machine learning techniques to this data faces challenges around privacy, compute resources, and the need for expert input. Method: A pre-trained large language model is first fine-tuned on a GPU using a small (2k) open-source dataset, then further fine-tuned on a CPU using a hospital-specific dataset of 1000 samples. Result: By carefully curating the datasets and leveraging existing models and open-source data, the authors classify triage data with limited compute resources. Conclusion: Even with limited compute, sensitive hospital triage data can be analysed and classified effectively by leveraging pre-trained models and carefully curated datasets. Abstract: Triage notes, created at the start of a patient's hospital visit, contain a wealth of information that can help medical staff and researchers understand Emergency Department patient epidemiology and the degree of time-dependent illness or injury. Unfortunately, applying modern Natural Language Processing and Machine Learning techniques to analyse triage data faces some challenges: Firstly, hospital data contains highly sensitive information that is subject to privacy regulation thus need to be analysed on site; Secondly, most hospitals and medical facilities lack the necessary hardware to fine-tune a Large Language Model (LLM), much less training one from scratch; Lastly, to identify the records of interest, expert inputs are needed to manually label the datasets, which can be time-consuming and costly. We present in this paper a pipeline that enables the classification of triage data using LLM and limited compute resources. We first fine-tuned a pre-trained LLM with a classifier using a small (2k) open sourced dataset on a GPU; and then further fine-tuned the model with a hospital specific dataset of 1000 samples on a CPU. We demonstrated that by carefully curating the datasets and leveraging existing models and open sourced data, we can successfully classify triage data with limited compute resources.

[79] Optimizing Small Transformer-Based Language Models for Multi-Label Sentiment Analysis in Short Texts

Julius Neumann,Robert Lange,Yuni Susanti,Michael Färber

Main category: cs.CL

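TL;DR: This paper studies the effectiveness of small BERT-based models for multi-label sentiment classification of short texts and examines optimization strategies.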

Details

Motivation: Sentiment classification of short texts faces challenges such as class imbalance, limited training samples, and the subjectivity of sentiment labels, all of which are intensified by the limited context in short texts. Method: The paper evaluates three key factors affecting model performance: continued domain-specific pre-training, data augmentation with automatically generated examples, and architectural variations of the classification head. Result: Data augmentation improves classification performance, whereas continued pre-training on augmented datasets can introduce noise rather than improve accuracy. Conclusion: Modifying the classification head yields only marginal gains, data augmentation effectively improves classification, and continued pre-training can introduce noise. Abstract: Sentiment classification in short text datasets faces significant challenges such as class imbalance, limited training samples, and the inherent subjectivity of sentiment labels -- issues that are further intensified by the limited context in short texts. These factors make it difficult to resolve ambiguity and exacerbate data sparsity, hindering effective learning. In this paper, we evaluate the effectiveness of small Transformer-based models (i.e., BERT and RoBERTa, with fewer than 1 billion parameters) for multi-label sentiment classification, with a particular focus on short-text settings. Specifically, we evaluated three key factors influencing model performance: (1) continued domain-specific pre-training, (2) data augmentation using automatically generated examples, specifically generative data augmentation, and (3) architectural variations of the classification head. Our experiment results show that data augmentation improves classification performance, while continued pre-training on augmented datasets can introduce noise rather than boost accuracy. Furthermore, we confirm that modifications to the classification head yield only marginal benefits. These findings provide practical guidance for optimizing BERT-based models in resource-constrained settings and refining strategies for sentiment classification in short-text datasets.

[80] Do Large Language Models Need Intent? Revisiting Response Generation Strategies for Service Assistant

Inbal Bolshinsky,Shani Kupiec,Almog Sasson,Yehudit Aperstein,Alexander Apartsin

Main category: cs.CL

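TL;DR: This study examines whether explicit intent recognition is necessary for generating service responses, benchmarking state-of-the-art language models to compare intent-first and direct response generation.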

Details

Motivation: In the era of conversational AI, generating accurate and contextually appropriate responses remains a key challenge. A central question is whether explicit intent recognition is a prerequisite for generating high-quality responses. Method: Using two publicly available service interaction datasets, the paper benchmarks several state-of-the-art language models, including a fine-tuned T5 variant, under two paradigms: intent-first response generation and direct response generation. Result: Evaluation covers both linguistic quality and task success rate, shedding light on whether explicit intent modelling is necessary or redundant. Conclusion: The findings challenge conventional assumptions in conversational AI pipelines and offer guidelines for designing more efficient and effective response generation systems. Abstract: In the era of conversational AI, generating accurate and contextually appropriate service responses remains a critical challenge. A central question remains: Is explicit intent recognition a prerequisite for generating high-quality service responses, or can models bypass this step and produce effective replies directly? This paper conducts a rigorous comparative study to address this fundamental design dilemma. Leveraging two publicly available service interaction datasets, we benchmark several state-of-the-art language models, including a fine-tuned T5 variant, across both paradigms: Intent-First Response Generation and Direct Response Generation. Evaluation metrics encompass both linguistic quality and task success rates, revealing surprising insights into the necessity or redundancy of explicit intent modelling. Our findings challenge conventional assumptions in conversational AI pipelines, offering actionable guidelines for designing more efficient and effective response generation systems.

[81] Masked Diffusion Language Models with Frequency-Informed Training

Despoina Kosmopoulou,Efthymios Georgiou,Vaggelis Dorovatas,Georgios Paraskevopoulos,Alexandros Potamianos

Main category: cs.CL

TL;DR: This work develops a masked diffusion language modeling framework for data-efficient training and validates it on the BabyLM benchmark.

Details

Motivation: To explore diffusion training objectives for language modeling under strict data constraints, with the aim of improving learning efficiency. Method: Diffusion training objectives are applied to language modeling together with a frequency-informed masking strategy that prioritizes learning from rare tokens. Result: On the BabyLM benchmark, performance is competitive with hybrid autoregressive-masked baselines, supporting the viability of diffusion-based training. Conclusion: Diffusion models offer a viable alternative for data-restricted language learning. Abstract: We present a masked diffusion language modeling framework for data-efficient training for the BabyLM 2025 Challenge. Our approach applies diffusion training objectives to language modeling under strict data constraints, incorporating frequency-informed masking that prioritizes learning from rare tokens while maintaining theoretical validity. We explore multiple noise scheduling strategies, including two-mode approaches, and investigate different noise weighting schemes within the NELBO objective. We evaluate our method on the BabyLM benchmark suite, measuring linguistic competence, world knowledge, and human-likeness. Results show performance competitive to hybrid autoregressive-masked baselines, demonstrating that diffusion-based training offers a viable alternative for data-restricted language learning.
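One way to picture frequency-informed masking is to give each token a masking probability that grows as its corpus frequency falls, then rescale so that the average masking rate still equals the rate requested by the noise schedule. The sketch below does exactly that; the inverse-power weighting, the `power` parameter, and the rescaling rule are assumptions chosen for illustration, since the abstract does not give the actual scheme.

```python
from collections import Counter
from typing import List, Mapping


def masking_probs(
    tokens: List[str],
    mask_rate: float,
    counts: Mapping[str, int],
    power: float = 0.5,
) -> List[float]:
    """Per-token masking probabilities that favour rare tokens."""
    # Rarer tokens (smaller counts) receive larger weights.
    weights = [counts[t] ** -power for t in tokens]
    mean_w = sum(weights) / len(weights)
    # Rescale so the mean probability equals mask_rate; clipping at 1.0
    # can pull the mean slightly below the target for extreme weights.
    return [min(1.0, mask_rate * w / mean_w) for w in weights]


# Toy usage: counts taken from the sequence itself for simplicity.
toy_tokens = ["the", "the", "the", "zebra"]
toy_probs = masking_probs(toy_tokens, mask_rate=0.2, counts=Counter(toy_tokens))
```

In the toy sequence the rare token "zebra" ends up with a higher masking probability than "the", while the sequence-level average stays at 0.2.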

[82] Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations

Patrick Amadeus Irawan,Ryandito Diandaru,Belati Jagad Bintang Syuhada,Randy Zakya Suchrady,Alham Fikri Aji,Genta Indra Winata,Fajri Koto,Samuel Cahyawijaya

Main category: cs.CL

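TL;DR: Entropy2Vec uses the entropy of language models to derive cross-lingual language representations, overcoming limitations of traditional approaches and performing well on multilingual tasks.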

Details

Motivation: Traditional typological inventories suffer from feature sparsity and static snapshots; Entropy2Vec aims to overcome these issues with more flexible and adaptable language representations. Method: Monolingual language models are trained, and the entropy of their predictions is used as a signal of structural similarity between languages, from which language embeddings are derived. Result: The resulting embeddings align with established typological categories and are competitive on multilingual NLP tasks. Conclusion: Entropy2Vec captures typological relationships between languages through the inherent uncertainty of language models, producing dense, non-sparse language embeddings that are adaptable to different timeframes, free from missing values, and perform well on downstream multilingual NLP tasks. Abstract: We introduce Entropy2Vec, a novel framework for deriving cross-lingual language representations by leveraging the entropy of monolingual language models. Unlike traditional typological inventories that suffer from feature sparsity and static snapshots, Entropy2Vec uses the inherent uncertainty in language models to capture typological relationships between languages. By training a language model on a single language, we hypothesize that the entropy of its predictions reflects its structural similarity to other languages: Low entropy indicates high similarity, while high entropy suggests greater divergence. This approach yields dense, non-sparse language embeddings that are adaptable to different timeframes and free from missing values. Empirical evaluations demonstrate that Entropy2Vec embeddings align with established typological categories and achieved competitive performance in downstream multilingual NLP tasks, such as those addressed by the LinguAlchemy framework.
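The quantity at the core of this approach is the Shannon entropy of a model's next-token distribution. The sketch below computes it and averages it over a sequence of predictions. Building a language's embedding by stacking such averages, one per monolingual model, is an assumed reading of the abstract rather than a documented recipe, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability outcomes contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(distributions: List[Sequence[float]]) -> float:
    """Average predictive entropy over a sequence of token positions."""
    return sum(entropy(d) for d in distributions) / len(distributions)


def language_vector(per_model_distributions: List[List[Sequence[float]]]) -> List[float]:
    """Assumed embedding: one mean-entropy value per monolingual model."""
    return [mean_entropy(dists) for dists in per_model_distributions]
```

Under the hypothesis in the abstract, a low value in some coordinate would suggest that the corresponding monolingual model finds the target language predictable, that is, structurally similar to its training language.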

Matteo Bortoletto,Constantin Ruhdorfer,Andreas Bulling

Main category: cs.CL

TL;DR: The paper introduces ToM-SSI, a new benchmark for testing Theory of Mind capabilities in complex social and spatial environments, revealing significant limitations in current models' performance and highlighting areas for future research.

Details

Motivation: Most existing ToM benchmarks for foundation models rely on variations of the Sally-Anne test, offering only a limited perspective on ToM and neglecting the complexity of human social interactions. This work aims to address this gap. Method: The paper introduces ToM-SSI, a new benchmark designed to test ToM capabilities in environments rich with social interactions and spatial dynamics, differing from current benchmarks by being multimodal and involving group interactions of up to four agents in situated environments. Result: Evaluations using ToM-SSI reveal that current models' performance is still severely limited, especially in the new tasks introduced by the benchmark, highlighting critical gaps for future research. Conclusion: The proposed ToM-SSI benchmark reveals that current models' performance, especially in new tasks, is still severely limited, highlighting critical gaps for future research in the area of Theory of Mind (ToM) for foundation models. Abstract: Most existing Theory of Mind (ToM) benchmarks for foundation models rely on variations of the Sally-Anne test, offering only a very limited perspective on ToM and neglecting the complexity of human social interactions. To address this gap, we propose ToM-SSI: a new benchmark specifically designed to test ToM capabilities in environments rich with social interactions and spatial dynamics. While current ToM benchmarks are limited to text-only or dyadic interactions, ToM-SSI is multimodal and includes group interactions of up to four agents that communicate and move in situated environments. This unique design allows us to study, for the first time, mixed cooperative-obstructive settings and reasoning about multiple agents' mental state in parallel, thus capturing a wider range of social cognition than existing benchmarks. Our evaluations reveal that the current models' performance is still severely limited, especially in these new tasks, highlighting critical gaps for future research.

[84] ICR: Iterative Clarification and Rewriting for Conversational Search

Zhiyu Cao,Peifeng Li,Qiaoming Zhu

Main category: cs.CL

TL;DR: This paper proposes the ICR framework for conversational query rewriting, which uses an iterative process of clarification and rewriting to handle multiple fuzzy expressions, achieving top performance on two datasets.

Details

Motivation: Previous end-to-end conversational query rewriting methods struggle with handling multiple fuzzy expressions in a single query, making it difficult to accurately identify and rewrite all parts simultaneously. Method: The ICR (Iterative Clarification and Rewriting) framework alternates between generating clarification questions and rewritten queries, addressing the challenge of multiple fuzzy expressions in conversational query rewriting. Result: Experimental results show that the ICR framework effectively improves retrieval performance iteratively, outperforming existing methods and achieving state-of-the-art results on two widely used datasets. Conclusion: The proposed ICR framework achieves state-of-the-art performance on two popular datasets by continuously improving retrieval performance through an iterative clarification and rewriting process. Abstract: Most previous work on Conversational Query Rewriting employs an end-to-end rewriting paradigm. However, this approach is hindered by the issue of multiple fuzzy expressions within the query, which complicates the simultaneous identification and rewriting of multiple positions. To address this issue, we propose a novel framework ICR (Iterative Clarification and Rewriting), an iterative rewriting scheme that pivots on clarification questions. Within this framework, the model alternates between generating clarification questions and rewritten queries. The experimental results show that our ICR can continuously improve retrieval performance in the clarification-rewriting iterative process, thereby achieving state-of-the-art performance on two popular datasets.
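The alternation between clarification and rewriting described above can be outlined as a simple loop. The sketch is only a control-flow illustration: `clarify` and `rewrite` stand in for model calls, the stopping rule (stop when no clarification question is produced, or after `max_rounds`) is an assumption, and the toy referent table exists purely so the example runs.

```python
from typing import Callable, List, Optional, Tuple


def iterative_rewrite(
    query: str,
    clarify: Callable[[str], Optional[str]],
    rewrite: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, List[Tuple[str, str]]]:
    """Alternate clarification and rewriting until nothing is left to clarify."""
    trace: List[Tuple[str, str]] = []
    for _ in range(max_rounds):
        question = clarify(query)  # pick one fuzzy expression to resolve
        if question is None:
            break
        query = rewrite(query, question)  # rewrite only that expression
        trace.append((question, query))
    return query, trace


# Toy stand-ins for the model: resolve one fuzzy word per round.
REFERENTS = {"it": "the Eiffel Tower", "there": "in Paris"}


def toy_clarify(query: str) -> Optional[str]:
    for word in query.split():
        if word in REFERENTS:
            return word
    return None


def toy_rewrite(query: str, fuzzy_word: str) -> str:
    return " ".join(REFERENTS[w] if w == fuzzy_word else w for w in query.split())
```

Resolving one expression per round is what distinguishes this from end-to-end rewriting, where all fuzzy positions must be identified and fixed in a single pass.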

[85] Triadic Fusion of Cognitive, Functional, and Causal Dimensions for Explainable LLMs: The TAXAL Framework

David Herrera-Poyatos,Carlos Peláez-González,Cristina Zuheros,Virilo Tejedor,Rosana Montes,Francisco Herrera

Main category: cs.CL

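TL;DR: This paper proposes the TAXAL framework, which combines cognitive, functional, and causal dimensions into a unified framework for improving the explainability of agentic LLMs, and demonstrates its applicability across several domains through case studies.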

Details

Motivation: Traditional explainability methods do not capture the reasoning pathways, planning logic, and systemic impacts of agentic LLMs, so a new unified framework is needed to strengthen transparency and trust. Method: The paper proposes TAXAL (Triadic Alignment for eXplainability in Agentic LLMs), which combines cognitive, functional, and causal dimensions, and demonstrates its applicability through case studies. Result: Case studies in law, education, healthcare, and public services show that TAXAL's explanation strategies adapt to institutional constraints and stakeholder roles. Conclusion: TAXAL provides a unified, role-sensitive foundation for designing, evaluating, and deploying explanations in diverse sociotechnical settings, advancing explainability as a technical and sociotechnical practice. Abstract: Large Language Models (LLMs) are increasingly being deployed in high-risk domains where opacity, bias, and instability undermine trust and accountability. Traditional explainability methods, focused on surface outputs, do not capture the reasoning pathways, planning logic, and systemic impacts of agentic LLMs. We introduce TAXAL (Triadic Alignment for eXplainability in Agentic LLMs), a triadic fusion framework that unites three complementary dimensions: cognitive (user understanding), functional (practical utility), and causal (faithful reasoning). TAXAL provides a unified, role-sensitive foundation for designing, evaluating, and deploying explanations in diverse sociotechnical settings. Our analysis synthesizes existing methods, ranging from post-hoc attribution and dialogic interfaces to explanation-aware prompting, and situates them within the TAXAL triadic fusion model. We further demonstrate its applicability through case studies in law, education, healthcare, and public services, showing how explanation strategies adapt to institutional constraints and stakeholder roles. By combining conceptual clarity with design patterns and deployment pathways, TAXAL advances explainability as a technical and sociotechnical practice, supporting trustworthy and context-sensitive LLM applications in the era of agentic AI.

[86] Hunyuan-MT Technical Report

Mao Zheng,Zheng Li,Bingxin Qu,Mingyang Song,Yang Du,Mingrui Sun,Di Wang

Main category: cs.CL

TL;DR: This paper presents Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B, two multilingual translation models that excel in diverse translation tasks, especially for Mandarin to minority language/dialect translation, achieving state-of-the-art results in the WMT2025 shared task.

Details

Motivation: To introduce an open-source multilingual translation model supporting diverse languages, especially focusing on Mandarin to minority languages/dialects translation, and to improve performance in diverse translation scenarios. Method: Development of Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B models using a holistic training process including pre-training, Supervised Fine-Tuning (SFT), and Reinforcement Learning (RL); Chimera-7B integrates multiple outputs for enhanced performance. Result: Both models outperform all translation-specific models of similar parameter size and most SOTA large models, achieving first place in 30 out of 31 language pairs in the WMT2025 shared task. Conclusion: Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B models demonstrate state-of-the-art performance across multiple language pairs, particularly excelling in translation involving Mandarin and minority languages or dialects. Abstract: In this report, we introduce Hunyuan-MT-7B, our first open-source multilingual translation model, which supports bidirectional translation across 33 major languages and places a special emphasis on translation between Mandarin and several ethnic minority languages as well as dialects. Furthermore, to serve and address diverse translation scenarios and enhance model performance at test time, we introduce Hunyuan-MT-Chimera-7B, a translation model inspired by the slow thinking mode. This model integrates multiple outputs generated by the Hunyuan-MT-7B model under varying parameter settings, thereby achieving performance superior to that of conventional slow-thinking models based on Chain-of-Thought (CoT). 
The development of our models follows a holistic training process specifically engineered for multilingual translation, which begins with general and MT-oriented pre-training to build foundational capabilities, proceeds to Supervised Fine-Tuning (SFT) for task-specific adaptation, and culminates in advanced alignment through Reinforcement Learning (RL) and weak-to-strong RL. Through comprehensive experimentation, we demonstrate that both Hunyuan-MT-7B and Hunyuan-MT-Chimera-7B significantly outperform all translation-specific models of comparable parameter size and most of the SOTA large models, particularly on the task of translation between Mandarin and minority languages as well as dialects. In the WMT2025 shared task (General Machine Translation), our models demonstrate state-of-the-art performance, ranking first in 30 out of 31 language pairs. This result highlights the robustness of our models across a diverse linguistic spectrum, encompassing high-resource languages such as Chinese, English, and Japanese, as well as low-resource languages including Czech, Marathi, Estonian, and Icelandic.

[87] BEDTime: A Unified Benchmark for Automatically Describing Time Series

Medhasweta Sen,Zachary Gottesman,Jiaxing Qiu,C. Bayan Bruss,Nam Nguyen,Tom Hartvigsen

Main category: cs.CL

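TL;DR: The paper proposes a standardized benchmark for testing models' ability to describe time series in natural language, and identifies limitations of current models and directions for improvement.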

Details

Motivation: Many studies have proposed general-purpose foundation models for time series analysis tasks, but inconsistent datasets and overly broad evaluations limit direct comparison between models and analysis of their capabilities. Method: The paper unifies four recent datasets and experimentally evaluates thirteen state-of-the-art language, vision-language, and time series-language models to enable model comparison. Result: Language-only methods underperform, vision-language models (VLMs) perform well, and multimodal time series-language models outperform LLMs, but all approaches show clear fragility in robustness tests. Conclusion: The paper highlights the need for time series-specific architectures and the fragility that current methods show in robustness tests; pretrained multimodal time series-language models outperform LLMs but still have significant room for improvement. Abstract: Many recent studies have proposed general-purpose foundation models designed for a variety of time series analysis tasks. While several established datasets already exist for evaluating these models, previous works frequently introduce their models in conjunction with new datasets, limiting opportunities for direct, independent comparisons and obscuring insights into the relative strengths of different methods. Additionally, prior evaluations often cover numerous tasks simultaneously, assessing a broad range of model abilities without clearly pinpointing which capabilities contribute to overall performance. To address these gaps, we formalize and evaluate 3 tasks that test a model's ability to describe time series using generic natural language: (1) recognition (True/False question-answering), (2) differentiation (multiple choice question-answering), and (3) generation (open-ended natural language description). We then unify 4 recent datasets to enable head-to-head model comparisons on each task. Experimentally, in evaluating 13 state-of-the-art language, vision--language, and time series--language models, we find that (1) popular language-only methods largely underperform, indicating a need for time series-specific architectures, (2) VLMs are quite successful, as expected, identifying the value of vision models for these tasks and (3) pretrained multimodal time series--language models successfully outperform LLMs, but still have significant room for improvement. We also find that all approaches exhibit clear fragility in a range of robustness tests.
Overall, our benchmark provides a standardized evaluation on a task necessary for time series reasoning systems.

[88] HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models

Chang Dai,Hongyu Shan,Mingyang Song,Di Liang

Main category: cs.CL

TL;DR: This paper introduces HoPE (Hyperbolic Rotary Positional Encoding), a novel positional encoding method inspired by hyperbolic geometry, which improves upon RoPE by enabling more stable and accurate modeling of long-range dependencies in Transformers.

Details

Motivation: The motivation stems from the limitations of existing positional encoding methods—absolute encodings struggle with extrapolation, relative approaches degrade in performance on very long sequences, and RoPE introduces oscillatory attention patterns that hinder stable modeling of long-distance dependencies. The authors aim to address these issues with a more effective positional encoding mechanism. Method: The paper proposes Hyperbolic Rotary Positional Encoding (HoPE), inspired by Lorentz transformations in hyperbolic geometry, to improve positional encoding in Transformers. It theoretically analyzes the advantages of HoPE over RoPE and conducts extensive experiments, including perplexity evaluations on extended sequence benchmarks, to validate its effectiveness. Result: Experimental results show that HoPE consistently outperforms existing positional encoding methods in modeling long-range dependencies, with improved perplexity evaluations on extended sequence benchmarks, demonstrating its superior generalization and representation capabilities. Conclusion: The paper concludes that HoPE effectively addresses the limitations of existing positional encoding methods, particularly in modeling long-range dependencies, outperforming RoPE and other approaches in experimental evaluations. Abstract: Positional encoding mechanisms enable Transformers to model sequential structure and long-range dependencies in text. While absolute positional encodings struggle with extrapolation to longer sequences due to fixed positional representations, and relative approaches like Alibi exhibit performance degradation on extremely long contexts, the widely-used Rotary Positional Encoding (RoPE) introduces oscillatory attention patterns that hinder stable long-distance dependency modelling. We address these limitations through a geometric reformulation of positional encoding. 
Drawing inspiration from Lorentz transformations in hyperbolic geometry, we propose Hyperbolic Rotary Positional Encoding (HoPE), which leverages hyperbolic functions to implement Lorentz rotations on token representations. Theoretical analysis demonstrates that RoPE is a special case of our generalized formulation. HoPE fundamentally resolves RoPE's oscillation issues by enforcing monotonic decay of attention weights with increasing token distances. Extensive experimental results, including perplexity evaluations under several extended sequence benchmarks, show that HoPE consistently exceeds existing positional encoding methods. These findings underscore HoPE's enhanced capacity for representing and generalizing long-range dependencies. Data and code will be available.
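The oscillation that the abstract attributes to RoPE can be seen in a toy two-dimensional case. With standard rotary encoding, a query at position m and a key at position n are each rotated by an angle proportional to their position, so their dot product depends only on the distance m - n and, for identical unit vectors, equals cos((m - n)θ). The sketch below checks this numerically. It illustrates the RoPE behaviour only and does not implement HoPE, whose exact formulation is not given in the abstract; the single frequency `theta` is a simplification of the multi-frequency scheme used in practice.

```python
import math
from typing import List, Tuple

Vec2 = Tuple[float, float]


def rotate(vec: Vec2, angle: float) -> Vec2:
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)


def rope_score(q: Vec2, k: Vec2, m: int, n: int, theta: float) -> float:
    """Dot product of a query at position m and a key at position n under RoPE."""
    qm = rotate(q, m * theta)
    kn = rotate(k, n * theta)
    return qm[0] * kn[0] + qm[1] * kn[1]


def scores_by_distance(max_dist: int, theta: float) -> List[float]:
    """Attention logits for identical unit vectors at growing distances."""
    unit: Vec2 = (1.0, 0.0)
    return [rope_score(unit, unit, d, 0, theta) for d in range(max_dist + 1)]
```

Calling `scores_by_distance(6, 1.0)` gives values that fall, turn negative, and rise again as distance grows, which is the non-monotonic pattern that a monotonically decaying scheme is meant to avoid.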

[89] Less is More Tokens: Efficient Math Reasoning via Difficulty-Aware Chain-of-Thought Distillation

Abdul Waheed,Chancharik Mitra,Laurie Z. Wang,Deva Ramanan,Bhiksha Raj

Main category: cs.CL

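TL;DR: This paper proposes a difficulty-aware reasoning framework that uses post-training to let models dynamically adjust reasoning depth to problem complexity, reducing verbose output on simple problems.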

Details

Motivation: Chain-of-thought reasoning can produce unnecessarily verbose output on simple problems; the paper proposes a difficulty-aware reasoning framework to address this. Method: Models are post-trained with supervised fine-tuning (SFT) and direct preference optimization (DPO) so that they dynamically adjust their reasoning paths. Result: Quantitative metrics and qualitative assessments both confirm that models can learn to "think proportionally", shortening reasoning on simple problems while maintaining depth on complex ones. Conclusion: Through post-training, models learn to adjust reasoning depth to problem difficulty, reasoning minimally on simple problems while maintaining depth on complex ones. Abstract: Chain-of-thought reasoning, while powerful, can produce unnecessarily verbose output for simpler problems. We present a framework for difficulty-aware reasoning that teaches models to dynamically adjust reasoning depth based on problem complexity. Remarkably, we show that models can be endowed with such dynamic inference pathways without any architectural modifications; we simply post-train on data that is carefully curated to include chain-of-thought traces that are proportional in length to problem difficulty. Our analysis reveals that post-training via supervised fine-tuning (SFT) primarily captures patterns like reasoning length and format, while direct preference optimization (DPO) preserves reasoning accuracy, with their combination reducing length and maintaining or improving performance. Both quantitative metrics and qualitative assessments confirm that models can learn to "think proportionally", reasoning minimally on simple problems while maintaining depth for complex ones.

[90] CURE: Controlled Unlearning for Robust Embeddings -- Mitigating Conceptual Shortcuts in Pre-Trained Language Models

Aysenur Kocak,Shuo Yang,Bardh Prenkaj,Gjergji Kasneci

Main category: cs.CL

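TL;DR: This paper proposes CURE, a lightweight framework that disentangles and suppresses conceptual shortcuts to improve the robustness and fairness of pre-trained language models.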

Details

Motivation: Pre-trained language models are susceptible to spurious, concept-driven correlations that impair their robustness and fairness. Method: CURE first extracts concept-irrelevant representations with a dedicated content extractor, then processes them further with a controllable debiasing module. Result: Experiments on the IMDB and Yelp datasets show that CURE improves F1 by 10 points on IMDB and by 2 points on Yelp. Conclusion: By disentangling and suppressing conceptual shortcuts, CURE improves model robustness and fairness while preserving task-relevant information. Abstract: Pre-trained language models have achieved remarkable success across diverse applications but remain susceptible to spurious, concept-driven correlations that impair robustness and fairness. In this work, we introduce CURE, a novel and lightweight framework that systematically disentangles and suppresses conceptual shortcuts while preserving essential content information. Our method first extracts concept-irrelevant representations via a dedicated content extractor reinforced by a reversal network, ensuring minimal loss of task-relevant information. A subsequent controllable debiasing module employs contrastive learning to finely adjust the influence of residual conceptual cues, enabling the model to either diminish harmful biases or harness beneficial correlations as appropriate for the target task. Evaluated on the IMDB and Yelp datasets using three pre-trained architectures, CURE achieves an absolute improvement of +10 points in F1 score on IMDB and +2 points on Yelp, while introducing minimal computational overhead. Our approach establishes a flexible, unsupervised blueprint for combating conceptual biases, paving the way for more reliable and fair language understanding systems.

[91] Uniform Information Density and Syntactic Reduction: Revisiting $\textit{that}$-Mentioning in English Complement Clauses

Hailin Hao,Elsi Kaiser

Main category: cs.CL

TL;DR: This paper explores how speakers regulate information flow by varying syntactic structures, focusing on the optional use of the complementizer 'that'. It finds that while traditional measures of information density are informative, estimates from contextual word embeddings better explain variation in complementizer usage patterns.

Details

Motivation: The motivation stems from the Uniform Information Density (UID) hypothesis, which suggests that speakers regulate information transmission rates during language production. Prior research linked UID to syntactic reduction, particularly in relation to the omission of the complementizer 'that'. Method: The researchers used machine learning and neural language models to analyze a large-scale, contemporary conversational corpus, aiming to refine estimates of information density and understand their impact on the optional use of the complementizer 'that'. Result: The results replicated the known relationship between low information density and the omission of 'that'. However, it was found that previous measures based on matrix verbs' subcategorization probability captured significant idiosyncratic lexical variation, while estimates from contextual word embeddings explained additional variance in complementizer usage. Conclusion: The study concludes that while prior measures of information density are valuable, estimates derived from contextual word embeddings better account for variance in complementizer usage patterns. Abstract: Speakers often have multiple ways to express the same meaning. The Uniform Information Density (UID) hypothesis suggests that speakers exploit this variability to maintain a consistent rate of information transmission during language production. Building on prior work linking UID to syntactic reduction, we revisit the finding that the optional complementizer $\textit{that}$ in English complement clauses is more likely to be omitted when the clause has low information density (i.e., more predictable). We advance this line of research by analyzing a large-scale, contemporary conversational corpus and using machine learning and neural language models to refine estimates of information density. Our results replicated the established relationship between information density and $\textit{that}$-mentioning. 
However, we found that previous measures of information density based on matrix verbs' subcategorization probability capture substantial idiosyncratic lexical variation. By contrast, estimates derived from contextual word embeddings account for additional variance in patterns of complementizer usage.
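As a rough illustration of the subcategorization-based measure discussed above, information density at the onset of a complement clause is commonly quantified as surprisal, the negative log probability of a clause following the matrix verb. The sketch below assumes that definition; the function name and the probabilities are hypothetical and are not taken from the paper's corpus.

```python
import math


def surprisal_bits(probability):
    """Surprisal in bits: -log2(p). Lower values mean a more predictable event."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical probabilities that a given matrix verb takes a complement clause.
predictable_clause = surprisal_bits(0.5)    # verb that often takes a clause
surprising_clause = surprisal_bits(0.125)   # verb that rarely takes a clause
```

Under UID, the complementizer would be expected to be omitted more often in the low-surprisal case, because the clause onset carries little information.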

[92] Elucidating the Design Space of Decay in Linear Attention

Zhen Qin,Xuyang Shen,Yiran Zhong

Main category: cs.CL

TL;DR: This paper explores how decay mechanisms affect linear sequence models, revealing that careful design is crucial and that RoPE often offers little benefit.

Details

Motivation: To understand and improve the performance of linear complexity sequence models by investigating the impact of different decay mechanism designs. Method: The authors systematically analyzed decay mechanisms across four dimensions: parameterization strategy, parameter sharing, decay granularity (scalar vs. vector), and compatibility with RoPE. They conducted extensive experiments on language modeling tasks. Result: Key findings include the sensitivity of performance to parameterization strategy, the limitations of parameter sharing, the general superiority of vector-based decay over scalar decay (with exceptions depending on strategy), and the limited effectiveness of RoPE in enhancing linear attention mechanisms. Conclusion: RoPE does not provide tangible benefits to most linear attention mechanisms, and decay mechanisms must be carefully designed considering parameterization strategy, parameter sharing, and decay granularity. Abstract: This paper presents a comprehensive investigation into the decay mechanisms inherent in linear complexity sequence models. We systematically delineate the design space of decay mechanisms across four pivotal dimensions: parameterization strategy, which refers to the computational methodology for decay; parameter sharing, which involves the utilization of supplementary parameters for decay computation; decay granularity, comparing scalar versus vector-based decay; and compatibility with relative positional encoding methods, such as Rotary Position Embedding (RoPE). Through an extensive series of experiments conducted on diverse language modeling tasks, we uncovered several critical insights. Firstly, the design of the parameterization strategy for decay requires meticulous consideration. Our findings indicate that effective configurations are typically confined to a specific range of parameters. 
Secondly, parameter sharing cannot be used arbitrarily, as it may cause decay values to be too large or too small, thereby significantly impacting performance. Thirdly, under identical parameterization strategies, scalar decay generally underperforms compared to its vector-based counterpart. However, in certain scenarios with alternative parameterization strategies, scalar decay may unexpectedly surpass vector decay in efficacy. Lastly, our analysis reveals that RoPE, a commonly employed relative positional encoding method, typically fails to provide tangible benefits to the majority of linear attention mechanisms.
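To make the scalar-versus-vector distinction concrete, here is a minimal sketch of recurrent linear attention with a decay factor, written in plain Python. It uses a generic textbook recurrence (state updated as decay times previous state plus the key-value outer product) and is not the authors' implementation; the function name is hypothetical, and the paper's parameterization strategies for computing the decay values are not reproduced here.

```python
def linear_attention_with_decay(queries, keys, values, decay):
    """Recurrent linear attention with decay.

    State update: S_t = diag(decay) @ S_{t-1} + k_t v_t^T, output: o_t = S_t^T q_t.
    `decay` is either one float (scalar decay, shared by all key dimensions)
    or a list with one float per key dimension (vector decay).
    """
    d_k, d_v = len(keys[0]), len(values[0])
    lam = [decay] * d_k if isinstance(decay, (int, float)) else list(decay)
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Decay the old state row by row, then add the new outer product.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = lam[i] * state[i][j] + k[i] * v[j]
        # Read out with the query.
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs


q = [[1.0, 1.0], [1.0, 1.0]]
k = [[1.0, 1.0], [0.0, 0.0]]
v = [[1.0], [1.0]]
scalar_out = linear_attention_with_decay(q, k, v, 0.5)
vector_out = linear_attention_with_decay(q, k, v, [0.5, 1.0])
```

With vector decay, each key dimension forgets at its own rate, so the second dimension in this toy example retains its contribution while the first halves it.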

[93] Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining

Deniz Bayazit,Aaron Mueller,Antoine Bosselut

Main category: cs.CL

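TL;DR: This paper proposes a new method that uses sparse crosscoders and the RelIE metric to track how features evolve during language model pretraining, improving the interpretability of model training.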

Details

Motivation: Traditional evaluation methods such as benchmarking cannot reveal how models acquire concepts and capabilities, so a method is needed to better understand model training at the concept level. Method: Sparse crosscoders are used to discover and align features across model checkpoints, and a new metric, Relative Indirect Effects (RelIE), traces the causal importance of features for task performance. Result: Crosscoders can detect the emergence, maintenance, and discontinuation of features during pretraining, and the RelIE metric effectively tracks changes in their causal importance. Conclusion: The paper proposes a method based on sparse crosscoders and the new RelIE metric to track the evolution of linguistic features during pretraining, offering an interpretable and fine-grained path to concept-level understanding of model training. Abstract: Large language models (LLMs) learn non-trivial abstractions during pretraining, like detecting irregular plural noun subjects. However, it is not well understood when and how specific linguistic abilities emerge as traditional evaluation methods such as benchmarking fail to reveal how models acquire concepts and capabilities. To bridge this gap and better understand model training at the concept level, we use sparse crosscoders to discover and align features across model checkpoints. Using this approach, we track the evolution of linguistic features during pretraining. We train crosscoders between open-sourced checkpoint triplets with significant performance and representation shifts, and introduce a novel metric, Relative Indirect Effects (RelIE), to trace training stages at which individual features become causally important for task performance. We show that crosscoders can detect feature emergence, maintenance, and discontinuation during pretraining. Our approach is architecture-agnostic and scalable, offering a promising path toward more interpretable and fine-grained analysis of representation learning throughout pretraining.

cs.CV [Back]

[94] Facial Emotion Recognition does not detect feeling unsafe in automated driving

Abel van Elburg,Konstantinos Gkentsidis,Mathieu Sarrazin,Sarah Barendswaard,Varun Kotian,Riender Happee

Main category: cs.CV

TL;DR: The study found that perceived risk in automated vehicles is influenced by driving style and critical events, while facial expressions are unreliable for risk assessment. A neural network model using physiological and motion data was effective in predicting perceived risk.

Details

Motivation: To understand the role of trust and perceived safety in public acceptance of automated vehicles by analyzing perceived risk through subjective ratings, facial expressions, and physiological signals. Method: An experiment was conducted using a driving simulator with two automated driving styles and optional introduction of a crossing pedestrian. Data collected included continuous subjective comfort ratings, vehicle motion, webcam footage for facial expression, skin conductance, heart rate, and eye tracking. A neural network model was implemented to predict perceived risk using vehicle motion and skin conductance. Result: Continuous subjective perceived risk ratings showed discomfort during cornering and braking followed by relief. Dynamic driving style induced stronger discomfort compared to calm driving style. The crossing pedestrian doubled the comfort decrement with dynamic driving but didn't affect calm driving. Facial expression analysis was mostly ineffective with most participants showing no reaction and very few showing Surprise or Happy expressions. Fear was never dominant. Conclusion: Facial expression recognition is not a reliable method for assessing perceived risk in automated vehicles. A neural network model using vehicle motion and skin conductance was shown to correlate well with reported perceived risk, indicating potential for objective perceived risk assessment. Abstract: Trust and perceived safety play a crucial role in the public acceptance of automated vehicles. To understand perceived risk, an experiment was conducted using a driving simulator under two automated driving styles and optionally introducing a crossing pedestrian. Data was collected from 32 participants, consisting of continuous subjective comfort ratings, motion, webcam footage for facial expression, skin conductance, heart rate, and eye tracking. 
The continuous subjective perceived risk ratings showed significant discomfort associated with perceived risk during cornering and braking followed by relief or even positive comfort on continuing the ride. The dynamic driving style induced a stronger discomfort as compared to the calm driving style. The crossing pedestrian did not affect discomfort with the calm driving style but doubled the comfort decrement with the dynamic driving style. This illustrates the importance of consequences of critical interactions in risk perception. Facial expression was successfully analyzed for 24 participants but most (15/24) did not show any detectable facial reaction to the critical event. Among the 9 participants who did, 8 showed a Happy expression, and only 4 showed a Surprise expression. Fear was never dominant. This indicates that facial expression recognition is not a reliable method for assessing perceived risk in automated vehicles. To predict perceived risk a neural network model was implemented using vehicle motion and skin conductance. The model correlated well with reported perceived risk, demonstrating its potential for objective perceived risk assessment in automated vehicles, reducing subjective bias and highlighting areas for future research.

[95] PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting

Linqing Wang,Ximing Xing,Yiji Cheng,Zhiyuan Zhao,Jiale Tao,Qixun Wang,Ruihuang Li,Xin Li,Mingrui Wu,Xinchi Deng,Chunyu Wang,Qinglin Lu

Main category: cs.CV

TL;DR: PromptEnhancer is a novel and universal prompt rewriting framework that enhances text-to-image models' ability to align images with complex user prompts without modifying the model's weights.

Details

Motivation: Recent text-to-image diffusion models struggle to faithfully render complex user prompts, leading to a mismatch between user intent and the generated output. PromptEnhancer addresses this challenge by enhancing any pretrained T2I model without modifying its weights. Method: The framework uses a Chain-of-Thought (CoT) rewriter trained through reinforcement learning, guided by a dedicated reward model called AlignEvaluator, which provides explicit and fine-grained feedback based on a taxonomy of key points derived from common T2I failure modes. Result: Extensive experiments on the HunyuanImage 2.1 model demonstrate that PromptEnhancer significantly improves image-text alignment across various semantic and compositional challenges. Additionally, a new, high-quality human preference benchmark was introduced. Conclusion: PromptEnhancer significantly improves image-text alignment across a wide range of semantic and compositional challenges in text-to-image models. Abstract: Recent advancements in text-to-image (T2I) diffusion models have demonstrated remarkable capabilities in generating high-fidelity images. However, these models often struggle to faithfully render complex user prompts, particularly in aspects like attribute binding, negation, and compositional relationships. This leads to a significant mismatch between user intent and the generated output. To address this challenge, we introduce PromptEnhancer, a novel and universal prompt rewriting framework that enhances any pretrained T2I model without requiring modifications to its weights. Unlike prior methods that rely on model-specific fine-tuning or implicit reward signals like image-reward scores, our framework decouples the rewriter from the generator. We achieve this by training a Chain-of-Thought (CoT) rewriter through reinforcement learning, guided by a dedicated reward model we term the AlignEvaluator. 
The AlignEvaluator is trained to provide explicit and fine-grained feedback based on a systematic taxonomy of 24 key points, which are derived from a comprehensive analysis of common T2I failure modes. By optimizing the CoT rewriter to maximize the reward from our AlignEvaluator, our framework learns to generate prompts that are more precisely interpreted by T2I models. Extensive experiments on the HunyuanImage 2.1 model demonstrate that PromptEnhancer significantly improves image-text alignment across a wide range of semantic and compositional challenges. Furthermore, we introduce a new, high-quality human preference benchmark to facilitate future research in this direction.

[96] Skywork UniPic 2.0: Building Kontext Model with Online RL for Unified Multimodal Model

Hongyang Wei,Baixin Xu,Hongbo Liu,Cyrus Wu,Jie Liu,Yi Peng,Peiyu Wang,Zexiang Liu,Jingwen He,Yidan Xietian,Chuanxin Tang,Zidong Wang,Yichen Wei,Liang Hu,Boyi Jiang,William Li,Ying He,Yang Liu,Xuchen Song,Eric Li,Yahui Zhou

Main category: cs.CV

TL;DR: This paper introduces Skywork UniPic 2.0, a training paradigm that enhances multimodal image generation and editing through architectural changes, a novel reinforcement strategy, and joint training, resulting in a highly effective and scalable model.

Details

Motivation: The motivation is to optimize training strategies rather than merely scaling model parameters to improve efficiency and performance in multimodal models for unified image generation and editing. Method: The method involves architectural modifications, large-scale pre-training, a Progressive Dual-Task Reinforcement strategy (PDTR), and joint training with Qwen2.5-VL-7B to create a unified multimodal model. Result: UniPic2-SD3.5M-Kontext outperforms larger models in image generation and editing, while UniPic2-Metaquery achieves top-tier performance across diverse tasks with a scalable training approach. Conclusion: The proposed training paradigm, Skywork UniPic 2.0, is effective and generalizable, as validated by the top-tier performance of the unified multimodal model, UniPic2-Metaquery. Abstract: Recent advances in multimodal models have demonstrated impressive capabilities in unified image generation and editing. However, many prominent open-source models prioritize scaling model parameters over optimizing training strategies, limiting their efficiency and performance. In this work, we present UniPic2-SD3.5M-Kontext, a 2B-parameter DiT model based on SD3.5-Medium, which achieves state-of-the-art image generation and editing while extending seamlessly into a unified multimodal framework. Our approach begins with architectural modifications to SD3.5-Medium and large-scale pre-training on high-quality data, enabling joint text-to-image generation and editing capabilities. To enhance instruction following and editing consistency, we propose a novel Progressive Dual-Task Reinforcement strategy (PDTR), which effectively strengthens both tasks in a staged manner. We empirically validate that the reinforcement phases for different tasks are mutually beneficial and do not induce negative interference. 
After pre-training and reinforcement strategies, UniPic2-SD3.5M-Kontext demonstrates stronger image generation and editing capabilities than models with significantly larger generation parameters, including BAGEL (7B) and Flux-Kontext (12B). Furthermore, following MetaQuery, we connect UniPic2-SD3.5M-Kontext and Qwen2.5-VL-7B via a connector and perform joint training to launch a unified multimodal model, UniPic2-Metaquery. UniPic2-Metaquery integrates understanding, generation, and editing, achieving top-tier performance across diverse tasks with a simple and scalable training paradigm. This consistently validates the effectiveness and generalizability of our proposed training paradigm, which we formalize as Skywork UniPic 2.0.

[97] Inpaint4Drag: Repurposing Inpainting Models for Drag-Based Image Editing via Bidirectional Warping

Jingyi Lu,Kai Han

Main category: cs.CV

TL;DR: Inpaint4Drag improves drag-based image editing by using pixel-space warping and inpainting, enabling real-time performance and universal compatibility with inpainting models.

Details

Motivation: Existing drag-based image editing methods rely on latent space manipulation of generative models, which results in limited precision, delayed feedback, and model-specific constraints. A more flexible and efficient approach is needed. Method: Inpaint4Drag decomposes drag-based editing into pixel-space bidirectional warping and image inpainting. It treats image regions as deformable materials inspired by physical elasticity, enabling real-time previews and efficient inpainting. Result: The method achieves real-time warping previews (0.01s) and efficient inpainting (0.3s) at 512x512 resolution, outperforming existing approaches in both speed and visual quality while being universally compatible with any inpainting model. Conclusion: Inpaint4Drag offers a new, efficient, and universally adaptable method for drag-based image editing by decomposing the task into warping and inpainting, significantly improving performance and user experience. Abstract: Drag-based image editing has emerged as a powerful paradigm for intuitive image manipulation. However, existing approaches predominantly rely on manipulating the latent space of generative models, leading to limited precision, delayed feedback, and model-specific constraints. Accordingly, we present Inpaint4Drag, a novel framework that decomposes drag-based editing into pixel-space bidirectional warping and image inpainting. Inspired by elastic object deformation in the physical world, we treat image regions as deformable materials that maintain natural shape under user manipulation. Our method achieves real-time warping previews (0.01s) and efficient inpainting (0.3s) at 512x512 resolution, significantly improving the interaction experience compared to existing methods that require minutes per edit. 
By transforming drag inputs directly into standard inpainting formats, our approach serves as a universal adapter for any inpainting model without architecture modification, automatically inheriting all future improvements in inpainting technology. Extensive experiments demonstrate that our method achieves superior visual quality and precise control while maintaining real-time performance. Project page: https://visual-ai.github.io/inpaint4drag/

[98] DisPatch: Disarming Adversarial Patches in Object Detection with Diffusion Models

Jin Ma,Mohammed Aldeen,Christopher Salas,Feng Luo,Mashrur Chowdhury,Mert Pesé,Long Cheng

Main category: cs.CV

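TL;DR: DISPATCH is the first diffusion-based defense framework for object detection; it regenerates the image with a generative model and rectifies adversarial regions, achieving strong performance and robustness.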

Details

Motivation: State-of-the-art object detectors are vulnerable to adversarial patch attacks, so a defense is needed that is effective, generalizable, and able to handle potential unknown threats. Method: DISPATCH adopts a diffusion-based "regenerate and rectify" strategy, regenerating the image with a generative model and rectifying potentially adversarial regions. Result: DISPATCH performs strongly across multiple detectors and attacks, reaching an mAP.5 score of 89.3%, lowering the attack success rate to 24.8%, and remaining robust against adaptive attacks. Conclusion: DISPATCH is a practical and reliable defense framework for object detection that counters a variety of attacks and shows good generalization and robustness. Abstract: Object detection is fundamental to various real-world applications, such as security monitoring and surveillance video analysis. Despite their advancements, state-of-the-art object detectors are still vulnerable to adversarial patch attacks, which can be easily applied to real-world objects to either conceal actual items or create non-existent ones, leading to severe consequences. Given the current diversity of adversarial patch attacks and potential unknown threats, an ideal defense method should be effective, generalizable, and robust against adaptive attacks. In this work, we introduce DISPATCH, the first diffusion-based defense framework for object detection. Unlike previous works that aim to "detect and remove" adversarial patches, DISPATCH adopts a "regenerate and rectify" strategy, leveraging generative models to disarm attack effects while preserving the integrity of the input image. Specifically, we utilize the in-distribution generative power of diffusion models to regenerate the entire image, aligning it with benign data. A rectification process is then employed to identify and replace adversarial regions with their regenerated benign counterparts. DISPATCH is attack-agnostic and requires no prior knowledge of the existing patches. Extensive experiments across multiple detectors and attacks demonstrate that DISPATCH consistently outperforms state-of-the-art defenses on both hiding attacks and creating attacks, achieving the best overall mAP.5 score of 89.3% on hiding attacks, and lowering the attack success rate to 24.8% on untargeted creating attacks. 
Moreover, it maintains strong robustness against adaptive attacks, making it a practical and reliable defense for object detection systems.

[99] WATCH: World-aware Allied Trajectory and pose reconstruction for Camera and Human

Qijun Ying,Zhongyuan Hu,Rui Zhang,Ronghui Li,Yu Lu,Zijiao Zeng

Main category: cs.CV

TL;DR: This paper introduces WATCH, a new framework for global human motion reconstruction from monocular videos, which effectively integrates camera orientation and translation information, achieving superior results.

Details

Motivation: The motivation is to overcome the limitations of human-motion-centric approaches in exploiting camera orientation information and integrating camera translation cues, which are critical for accurate global human motion reconstruction from monocular videos. Method: The paper proposes WATCH, a unified framework that uses an analytical heading angle decomposition technique and a camera trajectory integration mechanism inspired by world models to better utilize camera orientation and translation information. Result: WATCH achieves state-of-the-art performance in end-to-end trajectory reconstruction on in-the-wild benchmarks, demonstrating the effectiveness of the proposed methods. Conclusion: The paper concludes that jointly modeling camera-human motion relationships effectively addresses the challenge of camera translation integration in global human motion reconstruction, with WATCH achieving state-of-the-art results. Abstract: Global human motion reconstruction from in-the-wild monocular videos is increasingly demanded across VR, graphics, and robotics applications, yet requires accurate mapping of human poses from camera to world coordinates, a task challenged by depth ambiguity, motion ambiguity, and the entanglement between camera and human movements. While human-motion-centric approaches excel in preserving motion details and physical plausibility, they suffer from two critical limitations: insufficient exploitation of camera orientation information and ineffective integration of camera translation cues. We present WATCH (World-aware Allied Trajectory and pose reconstruction for Camera and Human), a unified framework addressing both challenges. Our approach introduces an analytical heading angle decomposition technique that offers superior efficiency and extensibility compared to existing geometric methods. 
Additionally, we design a camera trajectory integration mechanism inspired by world models, providing an effective pathway for leveraging camera translation information beyond naive hard-decoding approaches. Through experiments on in-the-wild benchmarks, WATCH achieves state-of-the-art performance in end-to-end trajectory reconstruction. Our work demonstrates the effectiveness of jointly modeling camera-human motion relationships and offers new insights for addressing the long-standing challenge of camera translation integration in global human motion reconstruction. The code will be available publicly.

[100] Sali4Vid: Saliency-Aware Video Reweighting and Adaptive Caption Retrieval for Dense Video Captioning

MinJu Jeon,Si-Woo Kim,Ye-Chan Kim,HyunGee Kim,Dong-Jin Kim

Main category: cs.CV

TL;DR: Sali4Vid proposes a simple yet effective saliency-aware framework that addresses the limitations of existing dense video captioning methods.

Details

Motivation: Existing end-to-end models have two limitations: they treat all video frames equally and they overlook scene transitions. This paper aims to address both. Method: The Sali4Vid framework introduces saliency-aware video reweighting and semantic-based adaptive caption retrieval. Result: Sali4Vid achieves state-of-the-art results on the YouCook2 and ViTT datasets. Conclusion: By improving video weighting and retrieval, Sali4Vid achieves state-of-the-art results in dense video captioning. Abstract: Dense video captioning aims to temporally localize events in video and generate captions for each event. While recent works propose end-to-end models, they suffer from two limitations: (1) applying timestamp supervision only to text while treating all video frames equally, and (2) retrieving captions from fixed-size video chunks, overlooking scene transitions. To address these, we propose Sali4Vid, a simple yet effective saliency-aware framework. We introduce Saliency-aware Video Reweighting, which converts timestamp annotations into sigmoid-based frame importance weights, and Semantic-based Adaptive Caption Retrieval, which segments videos by frame similarity to capture scene transitions and improve caption retrieval. Sali4Vid achieves state-of-the-art results on YouCook2 and ViTT, demonstrating the benefit of jointly improving video weighting and retrieval for dense video captioning.
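One plausible reading of "sigmoid-based frame importance weights" is a soft window around each annotated event, built as the product of a rising and a falling sigmoid. The sketch below illustrates that reading only: the exact formula used by Sali4Vid is not given in the abstract, and the function names and the `sharpness` parameter are assumptions.

```python
import math


def sigmoid(x):
    """Logistic function; adequate for the moderate inputs used in this sketch."""
    return 1.0 / (1.0 + math.exp(-x))


def frame_weights(frame_times, start, end, sharpness=2.0):
    """Soft importance weight per frame for an event annotated as [start, end].

    Frames well inside the event get weights near 1, frames far outside get
    weights near 0, and `sharpness` (assumed) controls how soft the edges are.
    """
    return [
        sigmoid(sharpness * (t - start)) * sigmoid(sharpness * (end - t))
        for t in frame_times
    ]


# Hypothetical event from 4 s to 6 s, sampled at three frame timestamps.
weights = frame_weights([0.0, 5.0, 10.0], start=4.0, end=6.0)
```

A soft window like this lets frames near an event boundary still contribute, in contrast to the uniform frame treatment the abstract criticizes.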

[101] UAV-Based Intelligent Traffic Surveillance System: Real-Time Vehicle Detection, Classification, Tracking, and Behavioral Analysis

Ali Khanpour,Tianyi Wang,Afra Vahidi-Shams,Wim Ectors,Farzam Nakhaie,Amirhossein Taheri,Christian Claudel

Main category: cs.CV

TL;DR: This paper presents an advanced UAV-based traffic surveillance system that accurately detects, classifies, and tracks vehicles while identifying traffic violations in urban environments.

Details

Motivation: Traditional traffic monitoring systems face limitations in coverage, adaptability, and scalability, prompting the need for a more advanced solution. Method: The system uses multi-scale and multi-angle template matching, Kalman filtering, and homography-based calibration to process aerial video data. It also integrates geofencing, motion filtering, and trajectory deviation analysis for traffic violation detection. Result: The system achieved high performance metrics, including 91.8% detection precision, 90.5% F1-score, and strong tracking metrics (MOTA/MOTP of 92.1% and 93.7%). It also effectively classified vehicle types and detected traffic violations. Conclusion: The UAV-based traffic surveillance system is a scalable, accurate, and practical solution for urban mobility analytics and traffic violation detection. Abstract: Traffic congestion and violations pose significant challenges for urban mobility and road safety. Traditional traffic monitoring systems, such as fixed cameras and sensor-based methods, are often constrained by limited coverage, low adaptability, and poor scalability. To address these challenges, this paper introduces an advanced unmanned aerial vehicle (UAV)-based traffic surveillance system capable of accurate vehicle detection, classification, tracking, and behavioral analysis in real-world, unconstrained urban environments. The system leverages multi-scale and multi-angle template matching, Kalman filtering, and homography-based calibration to process aerial video data collected from altitudes of approximately 200 meters. A case study in an urban area demonstrates robust performance, achieving a detection precision of 91.8%, an F1-score of 90.5%, and tracking metrics (MOTA/MOTP) of 92.1% and 93.7%, respectively. 
Beyond precise detection, the system classifies five vehicle types and automatically detects critical traffic violations, including unsafe lane changes, illegal double parking, and crosswalk obstructions, through the fusion of geofencing, motion filtering, and trajectory deviation analysis. The integrated analytics module supports origin-destination tracking, vehicle count visualization, inter-class correlation analysis, and heatmap-based congestion modeling. Additionally, the system enables entry-exit trajectory profiling, vehicle density estimation across road segments, and movement direction logging, supporting comprehensive multi-scale urban mobility analytics. Experimental results confirm the system's scalability, accuracy, and practical relevance, highlighting its potential as an enforcement-aware, infrastructure-independent traffic monitoring solution for next-generation smart cities.
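Homography-based calibration, as mentioned above, generally means mapping pixel coordinates in the aerial image to ground-plane coordinates through a 3x3 matrix, after which distances and speeds can be measured in meters. The sketch below shows that general technique, not the paper's pipeline; the matrix `H`, the 0.5 m/pixel scale, and the function names are hypothetical.

```python
import math


def apply_homography(H, x, y):
    """Map pixel (x, y) to ground-plane coordinates using a 3x3 homography H."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to get Cartesian ground coordinates.
    return u / w, v / w


def ground_speed(H, pixel_a, pixel_b, dt_seconds):
    """Speed in m/s between two tracked pixel positions observed dt_seconds apart."""
    ax, ay = apply_homography(H, *pixel_a)
    bx, by = apply_homography(H, *pixel_b)
    return math.hypot(bx - ax, by - ay) / dt_seconds


# Hypothetical calibration: a pure scaling of 0.5 m per pixel.
H = [[0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0],
     [0.0, 0.0, 1.0]]
speed = ground_speed(H, (100.0, 200.0), (140.0, 230.0), dt_seconds=1.0)
```

In practice the matrix would be estimated from known ground reference points, and the tracked positions would come from the Kalman-filtered vehicle tracks.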

[102] VCMamba: Bridging Convolutions with Multi-Directional Mamba for Efficient Visual Representation

Mustafa Munir,Alex Zhang,Radu Marculescu

Main category: cs.CV

TL;DR: VCMamba combines CNNs and multi-directional Mamba SSMs to effectively capture both local features and global context, achieving state-of-the-art performance on image classification and semantic segmentation tasks with fewer parameters.

Details

Motivation: The motivation is to address the limitations of existing models: CNNs are good at capturing local features but lack global reasoning, while ViTs and SSMs like Mamba excel at global context but do not capture fine-grained local features as effectively. Method: The method involves a hybrid architecture that uses a convolutional stem and hierarchical structure with convolutional blocks in early stages for local feature extraction, followed by multi-directional Mamba blocks in later stages for capturing long-range dependencies and global context. Result: VCMamba-B achieved 82.6% top-1 accuracy on ImageNet-1K, surpassing PlainMamba-L3 by 0.3% with 37% fewer parameters and Vision GNN-B by 0.3% with 64% fewer parameters. On ADE20K, it obtained 47.1 mIoU, exceeding EfficientFormer-L7 by 2.0 mIoU with 62% fewer parameters. Conclusion: VCMamba is a new vision backbone that combines the strengths of CNNs and multi-directional Mamba SSMs to effectively capture both local features and global context, outperforming existing models in terms of accuracy and parameter efficiency. Abstract: Recent advances in Vision Transformers (ViTs) and State Space Models (SSMs) have challenged the dominance of Convolutional Neural Networks (CNNs) in computer vision. ViTs excel at capturing global context, and SSMs like Mamba offer linear complexity for long sequences, yet they do not capture fine-grained local features as effectively as CNNs. Conversely, CNNs possess strong inductive biases for local features but lack the global reasoning capabilities of transformers and Mamba. To bridge this gap, we introduce \textit{VCMamba}, a novel vision backbone that integrates the strengths of CNNs and multi-directional Mamba SSMs. VCMamba employs a convolutional stem and a hierarchical structure with convolutional blocks in its early stages to extract rich local features. 
These convolutional blocks are then processed by later stages incorporating multi-directional Mamba blocks designed to efficiently model long-range dependencies and global context. This hybrid design allows for superior feature representation while maintaining linear complexity with respect to image resolution. We demonstrate VCMamba's effectiveness through extensive experiments on ImageNet-1K classification and ADE20K semantic segmentation. Our VCMamba-B achieves 82.6% top-1 accuracy on ImageNet-1K, surpassing PlainMamba-L3 by 0.3% with 37% fewer parameters, and outperforming Vision GNN-B by 0.3% with 64% fewer parameters. Furthermore, VCMamba-B obtains 47.1 mIoU on ADE20K, exceeding EfficientFormer-L7 by 2.0 mIoU while utilizing 62% fewer parameters. Code is available at https://github.com/Wertyuui345/VCMamba.

Vanshika Vats,Ashwani Rathee,James Davis

Main category: cs.CV

TL;DR: A training-free, multi-agent framework improves semantic segmentation by iteratively refining masks to align with complex labeling guidelines using a Worker-Supervisor architecture and reinforcement learning.

Details

Motivation: The motivation stems from the limitations of traditional and recent segmentation methods in following complex textual labeling guidelines, which are crucial for real-world applications. Method: The method uses a Worker-Supervisor refinement architecture with a reinforcement learning stop policy to iteratively refine segmentation masks based on retrieved guidelines. Result: Evaluated on Waymo and ReasonSeg datasets, the method significantly outperforms state-of-the-art baselines in generalization and adherence to instructions. Conclusion: The proposed multi-agent framework effectively addresses the challenge of adhering to complex labeling guidelines in semantic segmentation without task-specific retraining. Abstract: Semantic segmentation in real-world applications often requires not only accurate masks but also strict adherence to textual labeling guidelines. These guidelines are typically complex and long, and both human and automated labeling often fail to follow them faithfully. Traditional approaches depend on expensive task-specific retraining that must be repeated as the guidelines evolve. Although recent open-vocabulary segmentation methods excel with simple prompts, they often fail when confronted with sets of paragraph-length guidelines that specify intricate segmentation rules. To address this, we introduce a multi-agent, training-free framework that coordinates general-purpose vision-language models within an iterative Worker-Supervisor refinement architecture. The Worker performs the segmentation, the Supervisor critiques it against the retrieved guidelines, and a lightweight reinforcement learning stop policy decides when to terminate the loop, ensuring guideline-consistent masks while balancing resource use. Evaluated on the Waymo and ReasonSeg datasets, our method notably outperforms state-of-the-art baselines, demonstrating strong generalization and instruction adherence.
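
The Worker-Supervisor loop described in this abstract can be sketched as a small control loop. This is only an illustrative sketch, not the authors' implementation: the names `refine_mask`, `toy_worker`, `toy_supervisor`, and `threshold_stop` are hypothetical, the mask is a plain list standing in for a real segmentation mask, and a fixed score threshold stands in for the paper's learned reinforcement learning stop policy.

```python
from typing import Callable, List, Tuple

Mask = List[int]  # hypothetical stand-in for a real segmentation mask


def refine_mask(
    worker: Callable[[Mask, str], Mask],
    supervisor: Callable[[Mask], Tuple[float, str]],
    stop_policy: Callable[[float, int], bool],
    mask: Mask,
    max_rounds: int = 5,
) -> Tuple[Mask, int]:
    """Iterate Worker -> Supervisor until the stop policy fires or rounds run out."""
    feedback = ""
    for round_idx in range(max_rounds):
        mask = worker(mask, feedback)        # Worker proposes or revises the mask
        score, feedback = supervisor(mask)   # Supervisor critiques it against guidelines
        if stop_policy(score, round_idx):    # stop policy decides whether to terminate
            return mask, round_idx + 1
    return mask, max_rounds


# Toy components (assumptions for illustration only).
def toy_worker(mask: Mask, feedback: str) -> Mask:
    return [v + 1 for v in mask]


def toy_supervisor(mask: Mask) -> Tuple[float, str]:
    return min(1.0, mask[0] / 4), "grow the region"


def threshold_stop(score: float, round_idx: int) -> bool:
    # Simple threshold rule used here in place of a learned policy.
    return score >= 0.75


final_mask, rounds_used = refine_mask(toy_worker, toy_supervisor, threshold_stop, [0])
```

Capping the loop with `max_rounds` reflects the trade-off the abstract mentions between guideline consistency and resource use; a learned policy would replace `threshold_stop` without changing the loop.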

[104] Domain Adaptation for Different Sensor Configurations in 3D Object Detection

Satoshi Tanaka,Kok Seang Tan,Isamu Yamashita

Main category: cs.CV

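TL;DR: This paper studies domain adaptation across different sensor configurations in 3D object detection and proposes two techniques to improve cross-configuration generalization.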

Details

Motivation: Different vehicle platforms often deploy distinct sensor configurations, so a model trained on one configuration degrades when applied to another. Method: The paper proposes two techniques: Downstream Fine-tuning (dataset-specific fine-tuning after multi-dataset training) and Partial Layer Fine-tuning (updating only a subset of layers to improve cross-configuration generalization). Result: Using paired datasets collected in the same geographic region, the paper shows that joint training combined with Downstream Fine-tuning and Partial Layer Fine-tuning consistently outperforms naive joint training for each configuration. Conclusion: The paper offers a practical and scalable solution for adapting 3D object detection models to different vehicle platforms. Abstract: Recent advances in autonomous driving have underscored the importance of accurate 3D object detection, with LiDAR playing a central role due to its robustness under diverse visibility conditions. However, different vehicle platforms often deploy distinct sensor configurations, causing performance degradation when models trained on one configuration are applied to another because of shifts in the point cloud distribution. Prior work on multi-dataset training and domain adaptation for 3D object detection has largely addressed environmental domain gaps and density variation within a single LiDAR; in contrast, the domain gap for different sensor configurations remains largely unexplored. In this work, we address domain adaptation across different sensor configurations in 3D object detection. We propose two techniques: Downstream Fine-tuning (dataset-specific fine-tuning after multi-dataset training) and Partial Layer Fine-tuning (updating only a subset of layers to improve cross-configuration generalization). Using paired datasets collected in the same geographic region with multiple sensor configurations, we show that joint training with Downstream Fine-tuning and Partial Layer Fine-tuning consistently outperforms naive joint training for each configuration. Our findings provide a practical and scalable solution for adapting 3D object detection models to diverse vehicle platforms.

[105] CD-Mamba: Cloud detection with long-range spatial dependency modeling

Tianxiang Xue,Jiayi Zhao,Jingsheng Li,Changlu Chen,Kun Zhan

Main category: cs.CV

TL;DR: CD-Mamba is a hybrid model that combines convolution and Mamba's state-space modeling to improve cloud detection accuracy by capturing both pixel-wise textural details and long-term patch-wise dependencies.

Details

Motivation: Remote sensing images are frequently obscured by cloud cover, which poses significant challenges to data integrity and reliability. Effective cloud detection requires addressing both short-range spatial redundancies and long-range atmospheric similarities among cloud patches. Method: CD-Mamba integrates convolution and Mamba's state-space modeling into a unified cloud detection network to capture both pixel-wise textural details and long-term patch-wise dependencies. Result: Extensive experiments validated the effectiveness of CD-Mamba and demonstrated its superior performance over existing methods in cloud detection. Conclusion: CD-Mamba manages both pixel-wise interactions and extensive patch-wise dependencies, leading to improved cloud detection accuracy across diverse spatial scales. Abstract: Remote sensing images are frequently obscured by cloud cover, posing significant challenges to data integrity and reliability. Effective cloud detection requires addressing both short-range spatial redundancies and long-range atmospheric similarities among cloud patches. Convolutional neural networks are effective at capturing local spatial dependencies, while Mamba has strong capabilities in modeling long-range dependencies. To fully leverage both local spatial relations and long-range dependencies, we propose CD-Mamba, a hybrid model that integrates convolution and Mamba's state-space modeling into a unified cloud detection network. CD-Mamba is designed to comprehensively capture pixel-wise textural details and long-term patch-wise dependencies for cloud detection. This design enables CD-Mamba to manage both pixel-wise interactions and extensive patch-wise dependencies simultaneously, improving detection accuracy across diverse spatial scales. Extensive experiments validate the effectiveness of CD-Mamba and demonstrate its superior performance over existing methods.

[106] Exploiting Unlabeled Structures through Task Consistency Training for Versatile Medical Image Segmentation

Shengqian Zhu,Jiafei Wu,Xiaogang Xu,Chengrong Yu,Ying Song,Zhang Yi,Guangjun Li,Junjie Hu

Main category: cs.CV

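TL;DR: This study proposes a Task Consistency Training (TCT) framework that addresses class imbalance in medical image segmentation without requiring extra models.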

Details

Motivation: Because fully annotated data is expensive to obtain, leveraging partially labeled datasets (PLDs) is a promising alternative. However, existing methods suffer performance degradation when handling class imbalance. Method: The proposed Task Consistency Training (TCT) framework comprises a main segmentation head (MSH) and multiple auxiliary task heads (ATHs) and exploits unlabeled anatomical structures through a consistency constraint. A filtering strategy and a unified auxiliary uncertainty-weighted loss (UAUWL) are introduced to reduce error propagation and declines in segmentation quality. Result: Extensive experiments on eight abdominal datasets demonstrate the method's effectiveness in addressing class imbalance. Conclusion: The TCT framework effectively addresses class imbalance in medical image segmentation without using extra models. Abstract: Versatile medical image segmentation (VMIS) targets the segmentation of multiple classes, while obtaining full annotations for all classes is often impractical due to the time and labor required. Leveraging partially labeled datasets (PLDs) presents a promising alternative; however, current VMIS approaches face significant class imbalance due to the unequal category distribution in PLDs. Existing methods attempt to address this by generating pseudo-full labels. Nevertheless, these typically require additional models and often result in potential performance degradation from label noise. In this work, we introduce a Task Consistency Training (TCT) framework to address class imbalance without requiring extra models. TCT includes a backbone network with a main segmentation head (MSH) for multi-channel predictions and multiple auxiliary task heads (ATHs) for task-specific predictions. By enforcing a consistency constraint between the MSH and ATH predictions, TCT effectively utilizes unlabeled anatomical structures. To avoid error propagation from low-consistency, potentially noisy data, we propose a filtering strategy to exclude such data. Additionally, we introduce a unified auxiliary uncertainty-weighted loss (UAUWL) to mitigate segmentation quality declines caused by the dominance of specific tasks. Extensive experiments on eight abdominal datasets from diverse clinical sites demonstrate our approach's effectiveness.

[107] Enhancing Self-Driving Segmentation in Adverse Weather Conditions: A Dual Uncertainty-Aware Training Approach to SAM Optimization

Dharsan Ravindran,Kevin Wang,Zhuoyuan Cao,Saleh Abdelrahman,Jeffery Wu

Main category: cs.CV

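TL;DR: This paper investigates how uncertainty-aware methods can improve the robustness of autonomous-driving scene segmentation models in adverse weather.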

Details

Motivation: Existing vision foundation models such as SAM and SAM2 perform poorly in adverse weather, where visual ambiguity is high, and lack uncertainty quantification. Inspired by uncertainty-aware training in medical imaging, the authors seek to improve the reliability of driving-scene segmentation. Method: The study introduces a multi-step fine-tuning procedure that incorporates uncertainty metrics directly into the loss function, and adapts the Uncertainty-Aware Adapter (UAT), originally designed for medical image segmentation, to driving scenes. Result: Experiments show that UAT-SAM outperforms standard SAM in extreme weather, while SAM2 with an uncertainty-aware loss achieves improved performance across diverse driving scenes. Conclusion: Explicit uncertainty modeling improves the robustness of driving-scene segmentation in adverse weather; both UAT-SAM and SAM2 with an uncertainty-aware loss outperform the standard models. Abstract: Recent advances in vision foundation models, such as the Segment Anything Model (SAM) and its successor SAM2, have achieved state-of-the-art performance on general image segmentation benchmarks. However, these models struggle in adverse weather conditions where visual ambiguity is high, largely due to their lack of uncertainty quantification. Inspired by progress in medical imaging, where uncertainty-aware training has improved reliability in ambiguous cases, we investigate two approaches to enhance segmentation robustness for autonomous driving. First, we introduce a multi-step finetuning procedure for SAM2 that incorporates uncertainty metrics directly into the loss function, improving overall scene recognition. Second, we adapt the Uncertainty-Aware Adapter (UAT), originally designed for medical image segmentation, to driving contexts. We evaluate both methods on CamVid, BDD100K, and GTA driving datasets. Experiments show that UAT-SAM outperforms standard SAM in extreme weather, while SAM2 with uncertainty-aware loss achieves improved performance across diverse driving scenes. These findings underscore the value of explicit uncertainty modeling for safety-critical autonomous driving in challenging environments.

[108] WatchHAR: Real-time On-device Human Activity Recognition System for Smartwatches

Taeyoung Yeon,Vasco Xu,Henry Hoffmann,Karan Ahuja

Main category: cs.CV

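TL;DR: WatchHAR is an end-to-end human activity recognition (HAR) system that runs entirely on smartwatches, addressing privacy and latency concerns while delivering strong accuracy.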

Details

Motivation: Despite advances in HAR, a system that runs entirely on smartwatches in unconstrained environments remains elusive. Method: A novel architecture unifies sensor data preprocessing and inference into an end-to-end trainable module, and each component of the pipeline is optimized. Result: WatchHAR maintains over 90% accuracy across more than 25 activity classes, with processing times of 9.3 ms for activity event detection and 11.8 ms for multimodal activity classification. Conclusion: WatchHAR realizes a HAR system that runs fully on smartwatches, addressing privacy and latency issues while achieving strong performance. Abstract: Despite advances in practical and multimodal fine-grained Human Activity Recognition (HAR), a system that runs entirely on smartwatches in unconstrained environments remains elusive. We present WatchHAR, an audio and inertial-based HAR system that operates fully on smartwatches, addressing privacy and latency issues associated with external data processing. By optimizing each component of the pipeline, WatchHAR achieves compounding performance gains. We introduce a novel architecture that unifies sensor data preprocessing and inference into an end-to-end trainable module, achieving 5x faster processing while maintaining over 90% accuracy across more than 25 activity classes. WatchHAR outperforms state-of-the-art models for event detection and activity classification while running directly on the smartwatch, achieving 9.3 ms processing time for activity event detection and 11.8 ms for multimodal activity classification. This research advances on-device activity recognition, realizing smartwatches' potential as standalone, privacy-aware, and minimally-invasive continuous activity tracking devices.

[109] MCANet: A Multi-Scale Class-Specific Attention Network for Multi-Label Post-Hurricane Damage Assessment using UAV Imagery

Zhangding Liu,Neda Mohammadi,John E. Taylor

Main category: cs.CV

TL;DR: MCANet improves post-hurricane damage assessment by accurately capturing multi-scale spatial features and distinguishing similar damage types.

Details

Motivation: Existing CNN-based methods struggle to capture multi-scale spatial features and distinguish visually similar or co-occurring damage types, which MCANet aims to overcome. Method: MCANet uses a Res2Net-based hierarchical backbone and a multi-head class-specific residual attention module to capture multi-scale spatial features and enhance damage classification accuracy. Result: MCANet achieved a mean average precision (mAP) of 91.75% on the RescueNet dataset, outperforming other models, and improved performance to 92.35% with eight attention heads, especially boosting accuracy for challenging classes. Conclusion: MCANet is an effective framework for multi-label damage classification, offering improved accuracy and interpretability in post-hurricane damage assessment. Abstract: Rapid and accurate post-hurricane damage assessment is vital for disaster response and recovery. Yet existing CNN-based methods struggle to capture multi-scale spatial features and to distinguish visually similar or co-occurring damage types. To address these issues, we propose MCANet, a multi-label classification framework that learns multi-scale representations and adaptively attends to spatially relevant regions for each damage category. MCANet employs a Res2Net-based hierarchical backbone to enrich spatial context across scales and a multi-head class-specific residual attention module to enhance discrimination. Each attention branch focuses on different spatial granularities, balancing local detail with global context. We evaluate MCANet on the RescueNet dataset of 4,494 UAV images collected after Hurricane Michael. MCANet achieves a mean average precision (mAP) of 91.75%, outperforming ResNet, Res2Net, VGG, MobileNet, EfficientNet, and ViT. With eight attention heads, performance further improves to 92.35%, boosting average precision for challenging classes such as Road Blocked by over 6%. 
Class activation mapping confirms MCANet's ability to localize damage-relevant regions, supporting interpretability. Outputs from MCANet can inform post-disaster risk mapping, emergency routing, and digital twin-based disaster response. Future work could integrate disaster-specific knowledge graphs and multimodal large language models to improve adaptability to unseen disasters and enrich semantic understanding for real-world decision-making.

[110] Dynamic Group Detection using VLM-augmented Temporal Groupness Graph

Kaname Yokoyama,Chihiro Nakatani,Norimichi Ukita

Main category: cs.CV

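TL;DR: This paper proposes a method for detecting dynamic human groups in videos that combines local and global features and improves detection consistency through global optimization.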

Details

Motivation: Detecting complex groups requires considering both the local appearance features of in-group members and the global context of the scene. Method: A Vision-Language Model (VLM) augmented for group detection extracts local and global appearance features in each frame, and dynamically changing groups are detected through global optimization over a graph. Result: Experiments show that the method outperforms state-of-the-art group detection methods on public datasets. Conclusion: The proposed method outperforms existing group detection methods on public datasets. Abstract: This paper proposes dynamic human group detection in videos. For detecting complex groups, not only the local appearance features of in-group members but also the global context of the scene are important. Such local and global appearance features in each frame are extracted using a Vision-Language Model (VLM) augmented for group detection in our method. For further improvement, the group structure should be consistent over time. While previous methods are stabilized on the assumption that groups are not changed in a video, our method detects dynamically changing groups by global optimization using a graph with all frames' groupness probabilities estimated by our groupness-augmented CLIP features. Our experimental results demonstrate that our method outperforms state-of-the-art group detection methods on public datasets. Code: https://github.com/irajisamurai/VLM-GroupDetection.git

[111] FloodVision: Urban Flood Depth Estimation Using Foundation Vision-Language Models and Domain Knowledge Graph

Zhangding Liu,Neda Mohammadi,John E. Taylor

Main category: cs.CV

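TL;DR: FloodVision is a zero-shot framework that combines the foundation vision-language model GPT-4o with a structured domain knowledge graph to achieve accurate and generalizable flood depth estimation.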

Details

Motivation: Timely and accurate floodwater depth estimation is critical for road accessibility and emergency response, but existing computer vision methods suffer from limited accuracy and poor generalization because they depend on fixed object detectors and task-specific training. Method: FloodVision dynamically identifies visible reference objects in RGB images, retrieves verified heights from the knowledge graph, estimates submergence ratios, and applies statistical outlier filtering to compute the final depth value. Result: Evaluated on 110 crowdsourced images from MyCoast New York, FloodVision achieves a mean absolute error of 8.17 cm, a 20.5% reduction from the GPT-4o baseline of 10.28 cm, and surpasses prior CNN-based methods. Conclusion: FloodVision generalizes well across scenes and operates in near real-time, making it suitable for future integration into digital twin platforms and citizen-reporting apps to improve smart city flood resilience. Abstract: Timely and accurate floodwater depth estimation is critical for road accessibility and emergency response. While recent computer vision methods have enabled flood detection, they suffer from both accuracy limitations and poor generalization due to dependence on fixed object detectors and task-specific training. To enable accurate depth estimation that can generalize across diverse flood scenarios, this paper presents FloodVision, a zero-shot framework that combines the semantic reasoning abilities of the foundation vision-language model GPT-4o with a structured domain knowledge graph. The knowledge graph encodes canonical real-world dimensions for common urban objects including vehicles, people, and infrastructure elements to ground the model's reasoning in physical reality. FloodVision dynamically identifies visible reference objects in RGB images, retrieves verified heights from the knowledge graph to mitigate hallucination, estimates submergence ratios, and applies statistical outlier filtering to compute final depth values. Evaluated on 110 crowdsourced images from MyCoast New York, FloodVision achieves a mean absolute error of 8.17 cm, reducing the GPT-4o baseline of 10.28 cm by 20.5% and surpassing prior CNN-based methods. The system generalizes well across varying scenes and operates in near real-time, making it suitable for future integration into digital twin platforms and citizen-reporting apps for smart city flood resilience.
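
The depth computation described in this abstract (reference height times submergence ratio, followed by outlier filtering) can be illustrated with a short sketch. Everything below is an assumption for illustration: the abstract does not say which outlier filter or aggregation FloodVision uses, so a median-absolute-deviation filter and a median aggregate are swapped in, and the function names and reference values are hypothetical. The last function only reproduces the reported relative error reduction from the two MAE figures in the abstract.

```python
from statistics import median
from typing import List, Tuple


def depth_estimates_cm(references: List[Tuple[float, float]]) -> List[float]:
    """Each reference is (known_height_cm, submergence_ratio in [0, 1])."""
    return [height * ratio for height, ratio in references]


def filter_outliers(values: List[float], k: float = 3.0) -> List[float]:
    """Keep values within k median absolute deviations of the median (assumed filter)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)
    return [v for v in values if abs(v - med) <= k * mad]


def final_depth_cm(references: List[Tuple[float, float]]) -> float:
    """Aggregate the surviving per-object estimates with a median (assumed aggregate)."""
    return median(filter_outliers(depth_estimates_cm(references)))


def relative_reduction_pct(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is lower than `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Hypothetical reference objects: (height in cm, estimated submergence ratio).
refs = [(120.0, 0.25), (56.0, 0.5), (124.0, 0.25), (170.0, 0.875)]
depth = final_depth_cm(refs)

# MAE figures reported in the abstract: 10.28 cm baseline, 8.17 cm for FloodVision.
reduction = relative_reduction_pct(10.28, 8.17)
```

In the toy data, the fourth reference yields an estimate far from the other three and is dropped by the filter, which is the role the abstract assigns to statistical outlier filtering.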

[112] Hybrid-Tower: Fine-grained Pseudo-query Interaction and Generation for Text-to-Video Retrieval

Bangxiang Lan,Ruobing Xie,Ruixiang Zhao,Xingwu Sun,Zhanhui Kang,Gang Yang,Xirong Li

Main category: cs.CV

TL;DR: This paper introduces a new Hybrid-Tower framework for Text-to-Video Retrieval (T2VR), called PIG, which improves retrieval effectiveness without compromising efficiency, outperforming existing Two-Tower and approaching state-of-the-art Single-Tower methods.

Details

Motivation: To overcome the limitations of existing CLIP-based frameworks for T2VR, namely the low effectiveness of the Two-Tower framework and the low efficiency of the Single-Tower framework. Method: A hybrid method named Fine-grained Pseudo-query Interaction and Generation (PIG) is proposed. It involves a pseudo-query generator that creates pseudo-queries for each video, enabling fine-grained interaction between video and textual features. Result: Experiments on five benchmarks show that the proposed method improves R@1 by 1.6% to 3.9% over the baseline while maintaining the efficiency of the Two-Tower framework. Conclusion: The Hybrid-Tower framework, specifically the PIG method, effectively combines the strengths of both Two-Tower and Single-Tower frameworks, achieving both high effectiveness and efficiency in T2VR. Abstract: The Text-to-Video Retrieval (T2VR) task aims to retrieve unlabeled videos by textual queries with the same semantic meanings. Recent CLIP-based approaches have explored two frameworks, Two-Tower and Single-Tower, yet the former suffers from low effectiveness, while the latter suffers from low efficiency. In this study, we explore a new Hybrid-Tower framework that can hybridize the advantages of the Two-Tower and Single-Tower frameworks, achieving high effectiveness and efficiency simultaneously. We propose a novel hybrid method, Fine-grained Pseudo-query Interaction and Generation for T2VR, i.e., PIG, which includes a new pseudo-query generator designed to generate a pseudo-query for each video. This enables the video feature and the textual features of the pseudo-query to interact in a fine-grained manner, as in Single-Tower approaches, to retain high effectiveness even before the real textual query is received. Simultaneously, our method introduces no additional storage or computational overhead compared to the Two-Tower framework during the inference stage, thus maintaining high efficiency. 
Extensive experiments on five commonly used text-video retrieval benchmarks demonstrate that our method achieves a significant improvement over the baseline, with an increase of 1.6% to 3.9% in R@1. Furthermore, our method matches the efficiency of Two-Tower models while achieving near state-of-the-art performance, highlighting the advantages of the Hybrid-Tower framework.
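
For readers unfamiliar with the R@1 metric reported here, the following is a minimal sketch of Recall@K for text-to-video retrieval. It assumes one ground-truth video per text query, placed on the diagonal of a query-by-video similarity matrix; the function name and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import List


def recall_at_k(similarity: List[List[float]], k: int = 1) -> float:
    """similarity[i][j] is the score of text query i against video j.

    Assumes the correct video for query i is video i. Returns the percentage
    of queries whose correct video appears among the top-k ranked videos.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank video indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy similarity matrix: queries 0 and 2 rank their own video first, query 1 does not.
toy_scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(toy_scores, k=1)
r2 = recall_at_k(toy_scores, k=2)
```

Under this definition, an increase of 1.6 to 3.9 points in R@1 means that many more queries per hundred retrieve their matching video at rank one.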

[113] Comparative Evaluation of Traditional and Deep Learning Feature Matching Algorithms using Chandrayaan-2 Lunar Data

R. Makharia,J. G. Singla,Amitabh,N. Dube,H. Sharma

Main category: cs.CV

TL;DR: This paper evaluates feature matching algorithms for lunar image registration, showing that SuperGlue outperforms classical methods, especially under polar lighting conditions, with an emphasis on the importance of preprocessing.

Details

Motivation: Accurate image registration is essential for lunar exploration tasks such as surface mapping, resource localization, and mission planning. However, aligning images from different sensors is challenging due to variations in resolution, illumination, and sensor distortion. Method: Five feature matching algorithms (SIFT, ASIFT, AKAZE, RIFT2, SuperGlue) were evaluated using cross-modality lunar image pairs from equatorial and polar regions. A preprocessing pipeline was implemented, including georeferencing, resolution alignment, intensity normalization, and enhancements such as adaptive histogram equalization, PCA, and shadow correction. Result: SuperGlue achieved the lowest root mean square error and fastest runtimes. Classical methods (SIFT, AKAZE) performed well in equatorial regions but degraded under polar lighting conditions. Conclusion: The study concludes that SuperGlue, a deep learning-based matcher, offers the most accurate and efficient solution for lunar image registration, particularly under challenging polar lighting conditions. Classical methods like SIFT and AKAZE are effective near the equator but struggle with polar data. Preprocessing is essential for improving registration accuracy. Abstract: Accurate image registration is critical for lunar exploration, enabling surface mapping, resource localization, and mission planning. Aligning data from diverse lunar sensors -- optical (e.g., Orbital High Resolution Camera, Narrow and Wide Angle Cameras), hyperspectral (Imaging Infrared Spectrometer), and radar (e.g., Dual-Frequency Synthetic Aperture Radar, Selene/Kaguya mission) -- is challenging due to differences in resolution, illumination, and sensor distortion. We evaluate five feature matching algorithms: SIFT, ASIFT, AKAZE, RIFT2, and SuperGlue (a deep learning-based matcher), using cross-modality image pairs from equatorial and polar regions. 
A preprocessing pipeline is proposed, including georeferencing, resolution alignment, intensity normalization, and enhancements like adaptive histogram equalization, principal component analysis, and shadow correction. SuperGlue consistently yields the lowest root mean square error and fastest runtimes. Classical methods such as SIFT and AKAZE perform well near the equator but degrade under polar lighting. The results highlight the importance of preprocessing and learning-based approaches for robust lunar image registration across diverse conditions.

[114] Toward Accessible Dermatology: Skin Lesion Classification Using Deep Learning Models on Mobile-Acquired Images

Asif Newaz,Masum Mushfiq Ishti,A Z M Ashraful Azam,Asif Ur Rahman Adib

Main category: cs.CV

TL;DR: This paper introduces a large dataset of mobile-acquired skin disease images and demonstrates that Transformer-based models, particularly the Swin Transformer, outperform traditional methods in classification accuracy while enhancing interpretability through Grad-CAM.

Details

Motivation: Conventional diagnostic methods for skin diseases are often costly, complex, and unavailable in low-resource settings, while existing automated classification studies are limited in scope and data representation. Method: The study curated a large dataset of over 50 skin disease categories captured with mobile devices and evaluated multiple convolutional neural networks and Transformer-based architectures, particularly focusing on the Swin Transformer. Additionally, Gradient-weighted Class Activation Mapping (Grad-CAM) was incorporated to enhance interpretability. Result: Transformer models, especially the Swin Transformer, demonstrated superior performance by effectively capturing global contextual features in classifying skin diseases. Conclusion: Transformer-based approaches have significant potential for classifying mobile-acquired skin lesions, enabling accessible AI-assisted dermatological screening and early diagnosis, especially in resource-limited environments. Abstract: Skin diseases are among the most prevalent health concerns worldwide, yet conventional diagnostic methods are often costly, complex, and unavailable in low-resource settings. Automated classification using deep learning has emerged as a promising alternative, but existing studies are mostly limited to dermoscopic datasets and a narrow range of disease classes. In this work, we curate a large dataset of over 50 skin disease categories captured with mobile devices, making it more representative of real-world conditions. We evaluate multiple convolutional neural networks and Transformer-based architectures, demonstrating that Transformer models, particularly the Swin Transformer, achieve superior performance by effectively capturing global contextual features. To enhance interpretability, we incorporate Gradient-weighted Class Activation Mapping (Grad-CAM), which highlights clinically relevant regions and provides transparency in model predictions. 
Our results underscore the potential of Transformer-based approaches for mobile-acquired skin lesion classification, paving the way toward accessible AI-assisted dermatological screening and early diagnosis in resource-limited environments.

[115] Extracting Uncertainty Estimates from Mixtures of Experts for Semantic Segmentation

Svetlana Pavlitska,Beyza Keskin,Alwin Faßbender,Christian Hubschneider,J. Marius Zöllner

Main category: cs.CV

TL;DR: This paper demonstrates that mixture of experts (MoEs) can effectively estimate reliable predictive uncertainty without architectural changes, outperforming ensembles, especially with out-of-distribution data. Increasing the number of experts enhances uncertainty calibration.

Details

Motivation: Accurate and well-calibrated predictive uncertainty is crucial for reliable and safe computer vision models, particularly in safety-critical applications like traffic scene perception. Method: Investigated predictive uncertainty estimates using predictive entropy, mutual information, and expert variance from MoEs. Evaluated routing uncertainty through gate entropy. Tested on A2D2 and Cityscapes datasets. Result: MoEs provide better calibrated uncertainty estimates than ensembles, especially under OOD data. Simple gating mechanisms yield better routing uncertainty calibration than complex classwise gates. Increasing experts improves uncertainty calibration. Conclusion: Mixture of experts (MoEs) can produce reliable predictive uncertainty estimates without architectural changes, outperforming ensemble methods, particularly under out-of-distribution conditions. Increasing the number of experts improves uncertainty calibration. Abstract: Estimating accurate and well-calibrated predictive uncertainty is important for enhancing the reliability of computer vision models, especially in safety-critical applications like traffic scene perception. While ensemble methods are commonly used to quantify uncertainty by combining multiple models, a mixture of experts (MoE) offers an efficient alternative by leveraging a gating network to dynamically weight expert predictions based on the input. Building on the promising use of MoEs for semantic segmentation in our previous works, we show that well-calibrated predictive uncertainty estimates can be extracted from MoEs without architectural modifications. We investigate three methods to extract predictive uncertainty estimates: predictive entropy, mutual information, and expert variance. We evaluate these methods for an MoE with two experts trained on a semantical split of the A2D2 dataset. 
Our results show that MoEs yield more reliable uncertainty estimates than ensembles in terms of conditional correctness metrics under out-of-distribution (OOD) data. Additionally, we evaluate routing uncertainty computed via gate entropy and find that simple gating mechanisms lead to better calibration of routing uncertainty estimates than more complex classwise gates. Finally, our experiments on the Cityscapes dataset suggest that increasing the number of experts can further enhance uncertainty calibration. Our code is available at https://github.com/KASTEL-MobilityLab/mixtures-of-experts/.
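
The uncertainty measures named in this abstract can be sketched for a single pixel as follows. This is a sketch under assumptions, not the authors' code: it uses common textbook definitions (entropy of the gate-weighted mixture, mutual information as mixture entropy minus gate-weighted expert entropy, and gate-weighted variance of expert probabilities averaged over classes), which may differ in detail from the paper; `moe_uncertainty` and its inputs are hypothetical names.

```python
import math
from typing import Dict, List


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def moe_uncertainty(expert_probs: List[List[float]], gate: List[float]) -> Dict[str, float]:
    """expert_probs[e][c]: class probabilities of expert e; gate[e]: gating weight (sums to 1)."""
    n_classes = len(expert_probs[0])
    # Gate-weighted mixture prediction over classes.
    mixture = [
        sum(g * probs[c] for g, probs in zip(gate, expert_probs))
        for c in range(n_classes)
    ]
    predictive_entropy = entropy(mixture)
    expected_entropy = sum(g * entropy(probs) for g, probs in zip(gate, expert_probs))
    # Gate-weighted squared deviation from the mixture, averaged over classes.
    expert_variance = sum(
        g * (probs[c] - mixture[c]) ** 2
        for g, probs in zip(gate, expert_probs)
        for c in range(n_classes)
    ) / n_classes
    return {
        "predictive_entropy": predictive_entropy,
        "mutual_information": predictive_entropy - expected_entropy,
        "expert_variance": expert_variance,
        "gate_entropy": entropy(gate),  # routing uncertainty
    }


# Two experts that fully disagree, weighted equally by the gate.
disagree = moe_uncertainty([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
# Two experts that agree on a uniform prediction.
agree = moe_uncertainty([[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5])
```

In the two toy cases the predictive entropy is identical, but mutual information and expert variance are nonzero only when the experts disagree, which is why these quantities are useful for separating expert disagreement from inherent ambiguity.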

[116] Exploring Non-Local Spatial-Angular Correlations with a Hybrid Mamba-Transformer Framework for Light Field Super-Resolution

Haosong Liu,Xiancheng Zhu,Huanqiang Zeng,Jianqing Zhu,Jiuwen Cao,Junhui Hou

Main category: cs.CV

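TL;DR: This paper proposes LFMT, a new light field image super-resolution method that combines the strengths of Mamba and Transformer models through a subspace simple scanning strategy and a dual-stage modeling strategy; experiments show that it outperforms existing methods.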

Details

Motivation: 现有的基于Mamba的方法在应用于复杂的光场数据时，其多方向扫描策略导致特征提取效率低下且冗余。 Method: 提出了一种子空间简单扫描策略和双阶段建模策略，并设计了相应的模块，包括子空间简单Mamba块、空间-角度残差子空间Mamba块、极平面Mamba块和极平面Transformer块。 Result: LFMT在真实世界和合成光场数据集上的实验结果表明，其在保持低计算复杂度的同时，显著优于当前最先进的方法。 Conclusion: LFMT通过结合Mamba和Transformer模型的优势，在光场图像超分辨率任务中实现了对空间、角度和极平面域的全面信息探索。 Abstract: Recently, Mamba-based methods, with its advantage in long-range information modeling and linear complexity, have shown great potential in optimizing both computational cost and performance of light field image super-resolution (LFSR). However, current multi-directional scanning strategies lead to inefficient and redundant feature extraction when applied to complex LF data. To overcome this challenge, we propose a Subspace Simple Scanning (Sub-SS) strategy, based on which we design the Subspace Simple Mamba Block (SSMB) to achieve more efficient and precise feature extraction. Furthermore, we propose a dual-stage modeling strategy to address the limitation of state space in preserving spatial-angular and disparity information, thereby enabling a more comprehensive exploration of non-local spatial-angular correlations. Specifically, in stage I, we introduce the Spatial-Angular Residual Subspace Mamba Block (SA-RSMB) for shallow spatial-angular feature extraction; in stage II, we use a dual-branch parallel structure combining the Epipolar Plane Mamba Block (EPMB) and Epipolar Plane Transformer Block (EPTB) for deep epipolar feature refinement. Building upon meticulously designed modules and strategies, we introduce a hybrid Mamba-Transformer framework, termed LFMT. LFMT integrates the strengths of Mamba and Transformer models for LFSR, enabling comprehensive information exploration across spatial, angular, and epipolar-plane domains. 
Experimental results demonstrate that LFMT significantly outperforms current state-of-the-art methods in LFSR, achieving substantial improvements in performance while maintaining low computational complexity on both real-world and synthetic LF datasets.

[117] PropVG: End-to-End Proposal-Driven Visual Grounding with Multi-Granularity Discrimination

Ming Dai,Wenxuan Cheng,Jiedong Zhuang,Jiang-jiang Liu,Hongshen Zhao,Zhenhua Feng,Wankou Yang

Main category: cs.CV

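TL;DR: This paper proposes PropVG, a new end-to-end visual grounding framework that improves object identification accuracy by combining foreground object proposals with referential comprehension and a multi-granularity discrimination module.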

Details

Motivation: Traditional proposal-based two-stage frameworks are inefficient and computationally expensive, while existing end-to-end methods overlook the supervision available from other prominent prospective targets and lack multi-granularity discrimination. Method: The paper proposes PropVG, an end-to-end proposal-based framework that integrates foreground object proposal generation with referential object comprehension, and introduces a Contrastive-based Refer Scoring (CRS) module and a Multi-granularity Target Discrimination (MTD) module. Result: Extensive experiments on multiple benchmark datasets show that PropVG outperforms existing methods. Conclusion: PropVG is effective for visual grounding, particularly for identifying target objects in complex scenes. Abstract: Recent advances in visual grounding have largely shifted away from traditional proposal-based two-stage frameworks due to their inefficiency and high computational complexity, favoring end-to-end direct reference paradigms. However, these methods rely exclusively on the referred target for supervision, overlooking the potential benefits of prominent prospective targets. Moreover, existing approaches often fail to incorporate multi-granularity discrimination, which is crucial for robust object identification in complex scenarios. To address these limitations, we propose PropVG, an end-to-end proposal-based framework that, to the best of our knowledge, is the first to seamlessly integrate foreground object proposal generation with referential object comprehension without requiring additional detectors. Furthermore, we introduce a Contrastive-based Refer Scoring (CRS) module, which employs contrastive learning at both sentence and word levels to enhance the capability in understanding and distinguishing referred objects. Additionally, we design a Multi-granularity Target Discrimination (MTD) module that fuses object- and semantic-level information to improve the recognition of absent targets. Extensive experiments on gRefCOCO (GREC/GRES), Ref-ZOM, R-RefCOCO, and RefCOCO (REC/RES) benchmarks demonstrate the effectiveness of PropVG. The code and models are available at https://github.com/Dmmm1997/PropVG.

[118] TemporalFlowViz: Parameter-Aware Visual Analytics for Interpreting Scramjet Combustion Evolution

Yifei Jia,Shiyu Cheng,Yu Dong,Guan Li,Dong Tian,Ruixiao Peng,Xuyi Lu,Yu Wang,Wei Yao,Guihua Shan

Main category: cs.CV

TL;DR: This paper introduces TemporalFlowViz, a visual analytics system that leverages machine learning techniques to analyze and interpret complex temporal flow field data from scramjet combustion simulations, aiding in hypothesis generation and knowledge discovery.

Details

Motivation: The large scale and high dimensionality of simulation-generated temporal flow field data present significant challenges for visual interpretation, feature differentiation, and cross-case comparison in scramjet engine combustion dynamics. Method: TemporalFlowViz uses pretrained Vision Transformers for embedding extraction, dimensionality reduction and density-based clustering for latent combustion mode detection, and vision-language model-based summarization for interpretation. Result: A parameter-aware visual analytics workflow and system that enables expert-driven clustering, visualization, and interpretation of temporal flow fields from scramjet combustion simulations, allowing for parameter-based filtering, similarity-based case retrieval, and coordinated multi-view exploration. Conclusion: TemporalFlowViz enhances hypothesis generation, supports interpretable pattern discovery, and enhances knowledge discovery in large-scale scramjet combustion analysis. Abstract: Understanding the complex combustion dynamics within scramjet engines is critical for advancing high-speed propulsion technologies. However, the large scale and high dimensionality of simulation-generated temporal flow field data present significant challenges for visual interpretation, feature differentiation, and cross-case comparison. In this paper, we present TemporalFlowViz, a parameter-aware visual analytics workflow and system designed to support expert-driven clustering, visualization, and interpretation of temporal flow fields from scramjet combustion simulations. Our approach leverages hundreds of simulated combustion cases with varying initial conditions, each producing time-sequenced flow field images. 
We use pretrained Vision Transformers to extract high-dimensional embeddings from these frames, apply dimensionality reduction and density-based clustering to uncover latent combustion modes, and construct temporal trajectories in the embedding space to track the evolution of each simulation over time. To bridge the gap between latent representations and expert reasoning, domain specialists annotate representative cluster centroids with descriptive labels. These annotations are used as contextual prompts for a vision-language model, which generates natural-language summaries for individual frames and full simulation cases. The system also supports parameter-based filtering, similarity-based case retrieval, and coordinated multi-view exploration to facilitate in-depth analysis. We demonstrate the effectiveness of TemporalFlowViz through two expert-informed case studies and expert feedback, showing TemporalFlowViz enhances hypothesis generation, supports interpretable pattern discovery, and enhances knowledge discovery in large-scale scramjet combustion analysis.
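
The clustering and trajectory steps described in this abstract can be illustrated with a minimal sketch. It assumes the frame embeddings have already been reduced to 2D points (the Vision Transformer and dimensionality-reduction stages are omitted) and uses a plain DBSCAN-style procedure as one example of density-based clustering; the abstract does not name the algorithm actually used. All names (`dbscan`, `mode_transitions`, `eps`, `min_pts`) and the toy coordinates are hypothetical.

```python
import math
from itertools import groupby


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force radius query; includes the point itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may still become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels


def mode_transitions(frame_labels):
    """Collapse a per-frame label sequence into the ordered list of visited modes."""
    return [label for label, _ in groupby(frame_labels)]


# Hypothetical 2D embeddings of seven consecutive frames from one simulation.
frames = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (50.0, 50.0)]
labels = dbscan(frames, eps=1.5, min_pts=3)
print(labels)
print(mode_transitions(labels))
```

Reading the labels in frame order gives a crude temporal trajectory: the toy simulation stays in one latent mode, moves to a second, and ends on a frame that the clustering treats as an outlier.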

[119] Pose-Free 3D Quantitative Phase Imaging of Flowing Cellular Populations

Enze Ye,Wei Lin,Shaochi Ren,Yakun Liu,Xiaoping Li,Hao Wang,He Sun,Feng Pan

Main category: cs.CV

TL;DR: OmniFHT is a novel 3D imaging framework that enables accurate, high-throughput tomographic imaging of flowing cells without assumptions about their orientation or shape.

Details

Motivation: Current imaging methods assume uniform, single-axis rotation of cells, limiting their applicability to near-spherical cells and preventing accurate imaging of irregularly shaped cells with complex rotations. Method: OmniFHT uses a pose-free 3D refractive index reconstruction framework based on the Fourier diffraction theorem and implicit neural representations (INRs). Result: OmniFHT supports arbitrary cell geometries and multi-axis rotations, allowing accurate reconstruction from sparsely sampled projections and restricted angular coverage, achieving high-fidelity results with as few as 10 views or 120 degrees of angular range. Conclusion: OmniFHT enables unbiased, high-throughput tomographic imaging of entire flowing cell populations, providing a scalable solution for label-free morphometric analysis in flow cytometry platforms. Abstract: High-throughput 3D quantitative phase imaging (QPI) in flow cytometry enables label-free, volumetric characterization of individual cells by reconstructing their refractive index (RI) distributions from multiple viewing angles during flow through microfluidic channels. However, current imaging methods assume that cells undergo uniform, single-axis rotation, which requires their poses to be known at each frame. This assumption restricts applicability to near-spherical cells and prevents accurate imaging of irregularly shaped cells with complex rotations. As a result, only a subset of the cellular population can be analyzed, limiting the ability of flow-based assays to perform robust statistical analysis. We introduce OmniFHT, a pose-free 3D RI reconstruction framework that leverages the Fourier diffraction theorem and implicit neural representations (INRs) for high-throughput flow cytometry tomographic imaging. By jointly optimizing each cell's unknown rotational trajectory and volumetric structure under weak scattering assumptions, OmniFHT supports arbitrary cell geometries and multi-axis rotations. 
Its continuous representation also allows accurate reconstruction from sparsely sampled projections and restricted angular coverage, producing high-fidelity results with as few as 10 views or only 120 degrees of angular range. OmniFHT enables, for the first time, in situ, high-throughput tomographic imaging of entire flowing cell populations, providing a scalable and unbiased solution for label-free morphometric analysis in flow cytometry platforms.

[120] CoRe-GS: Coarse-to-Refined Gaussian Splatting with Semantic Object Focus

Hannah Schieber,Dominik Frischmann,Simon Boche,Victor Schaack,Angela Schoellig,Stefan Leutenegger,Daniel Roth

Main category: cs.CV

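TL;DR: CoRe-GS is an efficient 3D reconstruction method that focuses on points of interest, combining semantic segmentation with color-based filtering to achieve fast, high-quality novel view synthesis.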

Details

Motivation: Applications such as disaster response and tele-guidance require fast and accurate 3D reconstruction. Focusing on points of interest (PoIs) is more efficient than reconstructing the entire scene. Method: A coarse, semantics-aware scene is generated first, and the object of interest is then refined using a color-based filtering method. Result: CoRe-GS shortens training time by about a quarter compared with a full semantic GS training cycle and shows higher novel-view-synthesis quality on the SCRREAM and NeRDS 360 datasets. Conclusion: CoRe-GS balances high-quality reconstruction against reduced training time, using semantic GS and color-based filtering to isolate and reconstruct objects quickly and at high quality. Abstract: Mobile reconstruction for autonomous aerial robotics holds strong potential for critical applications such as tele-guidance and disaster response. These tasks demand both accurate 3D reconstruction and fast scene processing. Instead of reconstructing the entire scene in detail, it is often more efficient to focus on specific objects, i.e., points of interest (PoIs). Mobile robots equipped with advanced sensing can usually detect these early during data acquisition or preliminary analysis, reducing the need for full-scene optimization. Gaussian Splatting (GS) has recently shown promise in delivering high-quality novel view synthesis and 3D representation by an incremental learning process. Extending GS with scene editing, semantics adds useful per-splat features to isolate objects effectively. Semantic 3D Gaussian editing can already be achieved before the full training cycle is completed, reducing the overall training time. Moreover, the semantically relevant area, the PoI, is usually already known during capturing. To balance high-quality reconstruction with reduced training time, we propose CoRe-GS. We first generate a coarse segmentation-ready scene with semantic GS and then refine it for the semantic object using our novel color-based filtering for effective object isolation. This shortens training time by about a quarter compared with a full training cycle for semantic GS. We evaluate our approach on two datasets, SCRREAM (real-world, outdoor) and NeRDS 360 (synthetic, indoor), showing reduced runtime and higher novel-view-synthesis quality.

[121] Cryo-RL: automating prostate cancer cryoablation planning with reinforcement learning

Trixia Simangan,Ahmed Nadeem Abbasi,Yipeng Hu,Shaheer U. Saeed

Main category: cs.CV

TL;DR: Cryo-RL is an automated reinforcement learning framework for prostate cancer cryoablation planning that matches expert performance, improves upon existing automated methods, and drastically reduces planning time.

Details

Motivation: Cryoablation planning is currently a manual, expertise-dependent, and time-consuming process, leading to variability in treatment quality. There is a need for an automated solution to improve consistency, scalability, and efficiency. Method: The study developed Cryo-RL, a reinforcement learning framework that models cryoablation planning as a Markov decision process. The framework learns an optimal policy for cryoprobe placement in a simulated environment that accounts for clinical constraints and intraoperative variability. Result: Evaluated on 583 retrospective prostate cancer cases, Cryo-RL achieved over 8 percentage-point Dice improvements compared to the best automated baselines and matched human expert performance with significantly less planning time. Conclusion: The study concludes that Cryo-RL, a reinforcement learning framework, can efficiently and accurately plan cryoablation for prostate cancer, achieving results comparable to human experts while significantly reducing planning time. Abstract: Cryoablation is a minimally invasive localised treatment for prostate cancer that destroys malignant tissue during de-freezing, while sparing surrounding healthy structures. Its success depends on accurate preoperative planning of cryoprobe placements to fully cover the tumour and avoid critical anatomy. This planning is currently manual, expertise-dependent, and time-consuming, leading to variability in treatment quality and limited scalability. In this work, we introduce Cryo-RL, a reinforcement learning framework that models cryoablation planning as a Markov decision process and learns an optimal policy for cryoprobe placement. Within a simulated environment that models clinical constraints and stochastic intraoperative variability, an agent sequentially selects cryoprobe positions and ice sphere diameters. 
Guided by a reward function based on tumour coverage, this agent learns a cryoablation strategy that leads to optimal cryoprobe placements without the need for any manually-designed plans. Evaluated on 583 retrospective prostate cancer cases, Cryo-RL achieved over 8 percentage-point Dice improvements compared with the best automated baselines, based on geometric optimisation, and matched human expert performance while requiring substantially less planning time. These results highlight the potential of reinforcement learning to deliver clinically viable, reproducible, and efficient cryoablation plans.
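
The Dice comparison reported for Cryo-RL can be made concrete with a small sketch. It assumes that Dice is computed between a planned ablation zone and the tumour mask, both represented here as sets of voxel coordinates; the masks, the function names `dice` and `percentage_point_gain`, and the numbers are hypothetical and are not taken from the paper.

```python
def dice(pred, target):
    """Dice overlap between two voxel sets: 2|A ∩ B| / (|A| + |B|)."""
    pred, target = set(pred), set(target)
    if not pred and not target:
        return 1.0  # two empty masks agree perfectly by convention
    return 2 * len(pred & target) / (len(pred) + len(target))


def percentage_point_gain(baseline_score, new_score):
    """Absolute difference between two scores in [0, 1], in percentage points."""
    return 100.0 * (new_score - baseline_score)


# Hypothetical 2D tumour mask and two candidate ablation zones.
tumour = {(0, 0), (0, 1), (1, 0), (1, 1)}
baseline_plan = {(0, 0), (0, 1)}
improved_plan = {(0, 0), (0, 1), (1, 0)}

baseline_dice = dice(baseline_plan, tumour)   # 2*2 / (2+4)
improved_dice = dice(improved_plan, tumour)   # 2*3 / (3+4)
print(round(baseline_dice, 3), round(improved_dice, 3))
print(round(percentage_point_gain(baseline_dice, improved_dice), 1))
```

A gain quoted in percentage points is an absolute difference between two Dice scores, not a relative change, which is how the "over 8 percentage-point" figure should be read.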

Dominik Pegler,David Steyrl,Mengfan Zhang,Alexander Karner,Jozsef Arato,Frank Scharnowski,Filip Melinscak

Main category: cs.CV

TL;DR: The paper investigates the use of explainable computer vision models to predict fear levels from spider-related images, showing promising results but highlighting the need for sufficient data and model interpretability for therapeutic applications.

Details

Motivation: The motivation is to explore the feasibility of using pretrained computer vision models in dynamically adjusting visual stimuli for computerized exposure therapy based on patient fear levels. Method: Three pretrained computer vision models were adapted using transfer learning to predict human fear ratings from a dataset of 313 spider-related images. The models were evaluated using cross-validation, and their explainability and learning curves were analyzed. Result: The models achieved an average mean absolute error (MAE) between 10.1 and 11.0. Reducing dataset size significantly decreased performance, while increasing dataset size beyond a certain point offered no significant improvement. Error analysis revealed higher errors for images with distant views and artificial/painted spiders. Conclusion: The study concludes that explainable computer vision models hold potential in predicting human fear levels from spider-related images, emphasizing the importance of model explainability and adequate dataset size for emotion-aware therapeutic technologies. Abstract: Advances in computer vision have opened new avenues for clinical applications, particularly in computerized exposure therapy where visual stimuli can be dynamically adjusted based on patient responses. As a critical step toward such adaptive systems, we investigated whether pretrained computer vision models can accurately predict fear levels from spider-related images. We adapted three diverse models using transfer learning to predict human fear ratings (on a 0-100 scale) from a standardized dataset of 313 images. The models were evaluated using cross-validation, achieving an average mean absolute error (MAE) between 10.1 and 11.0. Our learning curve analysis revealed that reducing the dataset size significantly harmed performance, though further increases yielded no substantial gains. Explainability assessments showed the models' predictions were based on spider-related features. 
A category-wise error analysis further identified visual conditions associated with higher errors (e.g., distant views and artificial/painted spiders). These findings demonstrate the potential of explainable computer vision models in predicting fear ratings, highlighting the importance of both model explainability and a sufficient dataset size for developing effective emotion-aware therapeutic technologies.
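
As a rough illustration of how a cross-validated MAE on a 0-100 rating scale is obtained, the sketch below runs k-fold cross-validation with a deliberately trivial predictor (the training-fold mean) in place of the pretrained vision models. The ratings, the fold assignment, and the names `k_fold_indices` and `cross_validated_mae` are assumptions made for illustration only.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def cross_validated_mae(ratings, k):
    """Average held-out MAE of a mean-rating predictor over k folds."""
    fold_maes = []
    for held_out in k_fold_indices(len(ratings), k):
        held = set(held_out)
        train = [r for i, r in enumerate(ratings) if i not in held]
        prediction = sum(train) / len(train)  # stand-in for a fitted model
        errors = [abs(ratings[i] - prediction) for i in held_out]
        fold_maes.append(sum(errors) / len(errors))
    return sum(fold_maes) / len(fold_maes)


# Hypothetical fear ratings for four images.
ratings = [10, 20, 30, 40]
print(cross_validated_mae(ratings, k=2))
```

Because the predictor is fitted only on the training folds, the reported error reflects performance on images the model has not seen, which is the point of evaluating with cross-validation.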

[123] SynGen-Vision: Synthetic Data Generation for training industrial vision models

Alpana Dubey,Suma Mani Kuriakose,Nitish Bhardwaj

Main category: cs.CV

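TL;DR: This paper proposes a synthetic data generation method based on a vision language model and 3D simulation for industrial wear and tear detection, addressing data scarcity and achieving good results.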

Details

Motivation: Datasets covering different wear and tear scenarios are hard to obtain, which makes data curation for training such models expensive and time-consuming and motivates synthetic data generation. Method: A vision language model is combined with a 3D simulation and rendering engine to generate synthetic data, which is used to train a computer vision model for rust detection. Result: The model trained on the generated synthetic data performs well when tested on real images, with a mAP50 score of 0.87. Conclusion: The proposed approach is highly practical for industrial wear and tear detection and can be extended to other industrial wear and tear scenarios. Abstract: We propose an approach to generate synthetic data to train computer vision (CV) models for industrial wear and tear detection. Wear and tear detection is an important CV problem for predictive maintenance tasks in any industry. However, data curation for training such models is expensive and time-consuming due to the unavailability of datasets for different wear and tear scenarios. Our approach employs a vision language model along with a 3D simulation and rendering engine to generate synthetic data for varying rust conditions. We evaluate our approach by training a CV model for rust detection using the generated dataset and testing the trained model on real images of rusted industrial objects. The model trained with the synthetic data generated by our approach outperforms the other approaches with a mAP50 score of 0.87. The approach is customizable and can be easily extended to other industrial wear and tear detection scenarios.
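
The "50" in mAP50 refers to an intersection-over-union (IoU) threshold of 0.5: a predicted box counts as a correct detection only if it overlaps a ground-truth box at least that much. The sketch below shows that matching test for axis-aligned boxes given as `(x1, y1, x2, y2)`; the box coordinates and the helper names `iou` and `is_match` are hypothetical, and the full mAP computation (ranking by confidence and averaging precision over classes) is omitted.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, true_box, threshold=0.5):
    """True if the prediction overlaps the ground truth at or above the IoU threshold."""
    return iou(pred_box, true_box) >= threshold


# Hypothetical ground-truth rust region and two predictions.
rust_region = (0, 0, 10, 10)
loose_prediction = (5, 0, 15, 10)   # overlaps half of the region
tight_prediction = (2, 0, 12, 10)   # overlaps most of the region

print(round(iou(loose_prediction, rust_region), 3), is_match(loose_prediction, rust_region))
print(round(iou(tight_prediction, rust_region), 3), is_match(tight_prediction, rust_region))
```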

[124] Evaluating Multiple Instance Learning Strategies for Automated Sebocyte Droplet Counting

Maryam Adelipour,Gustavo Carneiro,Jeongkwon Kim

Main category: cs.CV

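TL;DR: This paper introduces an attention-based multiple instance learning (MIL) framework for sebocyte image analysis and compares it with a baseline multi-layer perceptron (MLP) trained on aggregated patch-level counts. Simple bag-level aggregation proves robust for slide-level lipid droplet counting, while attention-based MIL needs task-aligned pooling and regularization to reach its full potential.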

Details

Motivation: Manual counting is labor-intensive and subjective, which motivates automated solutions. Method: A simple attention-based multiple instance learning (MIL) framework is introduced for sebocyte image analysis, with Nile Red-stained sebocyte images annotated into classes by droplet count. The data were expanded to about 50,000 cells through data augmentation. Two models were benchmarked: a baseline multi-layer perceptron (MLP) trained on aggregated patch-level counts, and an attention-based MIL model using ResNet-50 features with instance weighting. Result: Five-fold cross-validation showed that the baseline MLP was more stable (mean MAE = 5.6), while the attention-based MIL model was less consistent (mean MAE = 10.7) but performed better in some folds. Conclusion: Simple bag-level aggregation provides a robust baseline for slide-level droplet counting, while attention-based MIL requires task-aligned pooling and regularization to reach its full potential. Abstract: Sebocytes are lipid-secreting cells whose differentiation is marked by the accumulation of intracellular lipid droplets, making their quantification a key readout in sebocyte biology. Manual counting is labor-intensive and subjective, motivating automated solutions. Here, we introduce a simple attention-based multiple instance learning (MIL) framework for sebocyte image analysis. Nile Red-stained sebocyte images were annotated into 14 classes according to droplet counts, expanded via data augmentation to about 50,000 cells. Two models were benchmarked: a baseline multi-layer perceptron (MLP) trained on aggregated patch-level counts, and an attention-based MIL model leveraging ResNet-50 features with instance weighting. Experiments using five-fold cross-validation showed that the baseline MLP achieved more stable performance (mean MAE = 5.6) compared with the attention-based MIL, which was less consistent (mean MAE = 10.7) but occasionally superior in specific folds. These findings indicate that simple bag-level aggregation provides a robust baseline for slide-level droplet counting, while attention-based MIL requires task-aligned pooling and regularization to fully realize its potential in sebocyte image analysis.
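
One possible reading of "task-aligned pooling" is sketched below. Standard attention pooling normalizes instance weights with a softmax, so the bag-level output is a weighted average of patch values, whereas a slide-level count is naturally a sum over patches. The example contrasts the two on toy numbers; it is an interpretation, not the authors' model, and the attention scores, patch counts, and function names are hypothetical.

```python
import math


def attention_pool(instance_scores, instance_values):
    """Softmax-weighted average of instance values; returns (pooled, weights)."""
    peak = max(instance_scores)
    exps = [math.exp(s - peak) for s in instance_scores]  # shift for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = sum(w * v for w, v in zip(weights, instance_values))
    return pooled, weights


def sum_pool(instance_values):
    """Bag-level aggregation by summing patch-level counts."""
    return sum(instance_values)


# Hypothetical droplet counts predicted for three patches of one slide.
patch_counts = [2.0, 4.0, 6.0]
uniform_scores = [0.0, 0.0, 0.0]  # attention that treats every patch equally

pooled, weights = attention_pool(uniform_scores, patch_counts)
print(weights)                 # each weight is one third
print(pooled)                  # close to the mean patch count
print(sum_pool(patch_counts))  # the slide-level total
```

With weights that sum to one, the attention output stays on the scale of a single patch, so matching a slide-level total would require a different pooling or an extra rescaling step.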

[125] UniView: Enhancing Novel View Synthesis From A Single Image By Unifying Reference Features

Haowang Cui,Rui Chen,Tao Luo,Rui Li,Jiaze Wang

Main category: cs.CV

TL;DR: UniView enhances single-image novel view synthesis by leveraging reference images and advanced feature integration techniques, outperforming existing methods.

Details

Motivation: Synthesizing novel views from a single image is ill-posed due to ambiguity in unobserved regions. Existing methods suffer from distortions, so a more robust approach using reference priors is needed. Method: UniView uses a retrieval and augmentation system with a multimodal large language model (MLLM) to select reference images and introduces a plug-and-play adapter module with multi-level isolation layers. It also employs a decoupled triple attention mechanism to align and integrate features. Result: UniView achieves superior performance on challenging datasets, significantly improving novel view synthesis compared to existing state-of-the-art methods. Conclusion: UniView effectively improves novel view synthesis performance by leveraging reference images and a decoupled triple attention mechanism, surpassing state-of-the-art methods. Abstract: The task of synthesizing novel views from a single image is highly ill-posed due to multiple explanations for unobserved areas. Most current methods tend to generate unseen regions from ambiguity priors and interpolation near input views, which often leads to severe distortions. To address this limitation, we propose a novel model dubbed UniView, which can leverage reference images from a similar object to provide strong prior information during view synthesis. More specifically, we construct a retrieval and augmentation system and employ a multimodal large language model (MLLM) to assist in selecting reference images that meet our requirements. Additionally, a plug-and-play adapter module with multi-level isolation layers is introduced to dynamically generate reference features for the target views. Moreover, in order to preserve the details of an original input image, we design a decoupled triple attention mechanism, which can effectively align and integrate multi-branch features into the synthesis process. 
Extensive experiments demonstrate that UniView significantly improves novel view synthesis performance and outperforms state-of-the-art methods on challenging datasets.

[126] Efficient Video-to-Audio Generation via Multiple Foundation Models Mapper

Gehui Chen,Guan'an Wang,Xiaowen Huang,Jitao Sang

Main category: cs.CV

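TL;DR: MFM-Mapper achieves efficient video-to-audio generation by fusing features from dual visual encoders and using GPT-2 to improve feature alignment, with a smaller training scale and better performance.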

Details

Motivation: To avoid the high resource cost of training video-to-audio generation models from scratch, the paper proposes a more effective way to leverage foundation models. Method: MFM-Mapper fuses features from dual visual encoders and replaces the linear mapper with GPT-2 to improve feature alignment, thereby using foundation models for video-to-audio generation. Result: MFM-Mapper needs only 16% of the training scale of the previous mapper-based method while achieving better semantic and temporal consistency. Conclusion: MFM-Mapper is not only efficient to train but also reaches semantic and temporal consistency comparable to models trained on a much larger scale. Abstract: Recent Video-to-Audio (V2A) generation relies on extracting semantic and temporal features from video to condition generative models. Training these models from scratch is resource intensive. Consequently, leveraging foundation models (FMs) has gained traction due to their cross-modal knowledge transfer and generalization capabilities. One prior work has explored fine-tuning a lightweight mapper network to connect a pre-trained visual encoder with a text-to-audio generation model for V2A. Inspired by this, we introduce the Multiple Foundation Model Mapper (MFM-Mapper). Compared to the previous mapper approach, MFM-Mapper benefits from richer semantic and temporal information by fusing features from dual visual encoders. Furthermore, by replacing a linear mapper with GPT-2, MFM-Mapper improves feature alignment, drawing parallels between cross-modal feature mapping and autoregressive translation tasks. Our MFM-Mapper exhibits remarkable training efficiency. It achieves better semantic and temporal consistency at lower training cost, requiring only 16% of the training scale of previous mapper-based work, while remaining competitive with models trained on a much larger scale.

[127] Dual-Domain Perspective on Degradation-Aware Fusion: A VLM-Guided Robust Infrared and Visible Image Fusion Framework

Tianpei Zhang,Jufeng Zhao,Yiming Zhu,Guangmang Cui

Main category: cs.CV

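TL;DR: This paper proposes GD^2Fusion, a new framework that combines vision-language models with dual-domain joint optimization to handle infrared-visible image fusion under dual-source degradation.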

Details

Motivation: Traditional IVIF methods rely on high-quality inputs, perform poorly in dual-source degraded scenarios, and are prone to error accumulation and performance degradation. Method: The paper proposes a framework named GD^2Fusion, comprising a GFMSE module for frequency-domain degradation perception and suppression and a GSMAF module for cross-modal degradation filtering and adaptive multi-source feature aggregation. Result: Extensive qualitative and quantitative experiments show that GD^2Fusion achieves better fusion results than existing algorithms and strategies in dual-source degraded scenarios. Conclusion: GD^2Fusion overcomes the limitations of traditional IVIF methods in dual-source degraded scenarios and achieves superior fusion performance. Abstract: Most existing infrared-visible image fusion (IVIF) methods assume high-quality inputs, and therefore struggle to handle dual-source degraded scenarios, typically requiring manual selection and sequential application of multiple pre-enhancement steps. This decoupled pre-enhancement-to-fusion pipeline inevitably leads to error accumulation and performance degradation. To overcome these limitations, we propose Guided Dual-Domain Fusion (GD^2Fusion), a novel framework that synergistically integrates vision-language models (VLMs) for degradation perception with dual-domain (frequency/spatial) joint optimization. Concretely, the designed Guided Frequency Modality-Specific Extraction (GFMSE) module performs frequency-domain degradation perception and suppression and discriminatively extracts fusion-relevant sub-band features. Meanwhile, the Guided Spatial Modality-Aggregated Fusion (GSMAF) module carries out cross-modal degradation filtering and adaptive multi-source feature aggregation in the spatial domain to enhance modality complementarity and structural consistency. Extensive qualitative and quantitative experiments demonstrate that GD^2Fusion achieves superior fusion performance compared with existing algorithms and strategies in dual-source degraded scenarios. The code will be publicly released after acceptance of this paper.

[128] Interpretable Deep Transfer Learning for Breast Ultrasound Cancer Detection: A Multi-Dataset Study

Mohammad Abbadi,Yassine Himeur,Shadi Atalla,Wathiq Mansoor

Main category: cs.CV

TL;DR: This paper demonstrates that machine learning and deep learning techniques, particularly ResNet-18, can achieve high accuracy in breast cancer classification using ultrasound images, supporting the integration of AI tools into clinical diagnostics.

Details

Motivation: Breast cancer is a leading cause of cancer-related mortality in women, and early detection using safe, cost-effective methods like ultrasound imaging is crucial, especially for those with dense breast tissue. Method: Classical machine learning models and deep convolutional neural networks were evaluated on datasets BUSI, BUS-BRA, and BrEaST-Lesions USG, with model transparency enhanced by Grad-CAM visualizations. Result: ResNet-18 achieved the highest accuracy (99.7%) and perfect sensitivity for malignant lesions; classical ML models showed competitive performance when combined with deep feature extraction. Conclusion: AI-based diagnostic tools are feasible for breast cancer detection using ultrasound images, with high-performing and interpretable systems demonstrated. Abstract: Breast cancer remains a leading cause of cancer-related mortality among women worldwide. Ultrasound imaging, widely used due to its safety and cost-effectiveness, plays a key role in early detection, especially in patients with dense breast tissue. This paper presents a comprehensive study on the application of machine learning and deep learning techniques for breast cancer classification using ultrasound images. Using datasets such as BUSI, BUS-BRA, and BrEaST-Lesions USG, we evaluate classical machine learning models (SVM, KNN) and deep convolutional neural networks (ResNet-18, EfficientNet-B0, GoogLeNet). Experimental results show that ResNet-18 achieves the highest accuracy (99.7%) and perfect sensitivity for malignant lesions. Classical ML models, though outperformed by CNNs, achieve competitive performance when enhanced with deep feature extraction. Grad-CAM visualizations further improve model transparency by highlighting diagnostically relevant image regions. 
These findings support the integration of AI-based diagnostic tools into clinical workflows and demonstrate the feasibility of deploying high-performing, interpretable systems for ultrasound-based breast cancer detection.

[129] A biologically inspired separable learning vision model for real-time traffic object perception in Dark

Hulin Li,Qiliang Ren,Jun Li,Hanbing Wei,Zheng Liu,Linfang Fan

Main category: cs.CV

TL;DR: This paper introduces SLVM, a biologically inspired framework for enhancing object perception in low-light traffic scenes, achieving state-of-the-art results on a new large-scale dataset called Dark-traffic.

Details

Motivation: The motivation is to address the challenges of fast and accurate object perception in low-light traffic scenes where existing models struggle due to illumination degradation and lack of reliable visual cues. Method: The paper introduces a physically grounded illumination degradation method and proposes the Separable Learning Vision Model (SLVM), which includes a light-adaptive pupillary mechanism, a feature-level separable learning strategy, task-specific decoupled branches, and a spatial misalignment-aware fusion module. Result: Extensive experiments demonstrate that SLVM outperforms existing models such as RT-DETR and YOLOv12 by significant margins in detection and instance segmentation tasks, and also reduces endpoint error on optical flow estimation. Additionally, SLVM surpasses other methods on the LIS benchmark. Conclusion: The paper concludes that SLVM, a new biologically inspired framework, significantly enhances perception under adverse lighting conditions and achieves state-of-the-art performance on the Dark-traffic dataset. Abstract: Fast and accurate object perception in low-light traffic scenes has attracted increasing attention. However, due to severe illumination degradation and the lack of reliable visual cues, existing perception models and methods struggle to quickly adapt to and accurately predict in low-light environments. Moreover, no large-scale benchmark specifically focused on low-light traffic scenes is available. To bridge this gap, we introduce a physically grounded illumination degradation method tailored to real-world low-light settings and construct Dark-traffic, the largest densely annotated dataset to date for low-light traffic scenes, supporting object detection, instance segmentation, and optical flow estimation. We further propose the Separable Learning Vision Model (SLVM), a biologically inspired framework designed to enhance perception under adverse lighting. 
SLVM integrates four key components: a light-adaptive pupillary mechanism for illumination-sensitive feature extraction, a feature-level separable learning strategy for efficient representation, task-specific decoupled branches for multi-task separable learning, and a spatial misalignment-aware fusion module for precise multi-feature alignment. Extensive experiments demonstrate that SLVM achieves state-of-the-art performance with reduced computational overhead. Notably, it outperforms RT-DETR by 11.2 percentage points in detection, YOLOv12 by 6.1 percentage points in instance segmentation, and reduces the baseline's endpoint error (EPE) by 12.37% on Dark-traffic. On the LIS benchmark, the end-to-end trained SLVM surpasses Swin Transformer+EnlightenGAN and ConvNeXt-T+EnlightenGAN by an average of 11 percentage points across key metrics, and exceeds Mask RCNN (with light enhancement) by 3.1 percentage points. The Dark-traffic dataset and complete code are released at https://github.com/alanli1997/slvm.
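
The SLVM results mix two kinds of figures: detection and segmentation gains in percentage points (absolute differences) and an EPE reduction in percent (a relative change). The sketch below computes endpoint error for optical flow as the mean Euclidean distance between predicted and ground-truth flow vectors, then shows both ways of expressing an improvement. The flow vectors and function names are hypothetical and the numbers are not from the paper.

```python
import math


def endpoint_error(pred_flow, true_flow):
    """Mean Euclidean distance between predicted and true 2D flow vectors."""
    distances = [math.dist(p, t) for p, t in zip(pred_flow, true_flow)]
    return sum(distances) / len(distances)


def relative_reduction(baseline, improved):
    """Percent reduction relative to the baseline value."""
    return 100.0 * (baseline - improved) / baseline


def percentage_point_gain(baseline_pct, improved_pct):
    """Absolute difference between two metrics already expressed in percent."""
    return improved_pct - baseline_pct


# Hypothetical flow vectors (dx, dy) at two pixels.
true_flow = [(1.0, 0.0), (0.0, 2.0)]
baseline_flow = [(4.0, 4.0), (0.0, 2.0)]
improved_flow = [(1.0, 3.0), (0.0, 2.0)]

baseline_epe = endpoint_error(baseline_flow, true_flow)
improved_epe = endpoint_error(improved_flow, true_flow)
print(baseline_epe, improved_epe)
print(relative_reduction(baseline_epe, improved_epe))
print(percentage_point_gain(40.0, 51.2))
```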

[130] Leveraging Transfer Learning and Mobile-enabled Convolutional Neural Networks for Improved Arabic Handwritten Character Recognition

Mohsine El Khayati,Ayyad Maafiri,Yassine Himeur,Hamzah Ali Alkhazaleh,Shadi Atalla,Wathiq Mansoor

Main category: cs.CV

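TL;DR: This study analyzes combining transfer learning with mobile-enabled convolutional neural networks for Arabic handwritten character recognition, finding that MobileNet and ShuffleNet perform well across datasets and that full fine-tuning works best.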

Details

Motivation: The study aims to address the heavy computational requirements and dataset scarcity that challenge Arabic handwritten character recognition. Method: The study evaluates three transfer learning strategies (full fine-tuning, partial fine-tuning, and training from scratch) using four lightweight mobile-enabled CNNs: MobileNet, SqueezeNet, MnasNet, and ShuffleNet. Result: MobileNet performed best in accuracy, robustness, and efficiency, while ShuffleNet excelled under full fine-tuning. IFHCDB reached 99% accuracy with MnasNet, AHCD reached 97% with ShuffleNet, and HIJJA peaked at 92% with ShuffleNet. Full fine-tuning performed best overall, while partial fine-tuning underperformed. Conclusion: Combining transfer learning with mobile-enabled convolutional neural networks shows great potential for Arabic handwritten character recognition and lays a foundation for resource-efficient character recognition. Abstract: The study explores the integration of transfer learning (TL) with mobile-enabled convolutional neural networks (MbNets) to enhance Arabic Handwritten Character Recognition (AHCR). Addressing challenges like extensive computational requirements and dataset scarcity, this research evaluates three TL strategies--full fine-tuning, partial fine-tuning, and training from scratch--using four lightweight MbNets: MobileNet, SqueezeNet, MnasNet, and ShuffleNet. Experiments were conducted on three benchmark datasets: AHCD, HIJJA, and IFHCDB. MobileNet emerged as the top-performing model, consistently achieving superior accuracy, robustness, and efficiency, with ShuffleNet excelling in generalization, particularly under full fine-tuning. The IFHCDB dataset yielded the highest results, with 99% accuracy using MnasNet under full fine-tuning, highlighting its suitability for robust character recognition. The AHCD dataset achieved competitive accuracy (97%) with ShuffleNet, while HIJJA posed significant challenges due to its variability, achieving a peak accuracy of 92% with ShuffleNet. Notably, full fine-tuning demonstrated the best overall performance, balancing accuracy and convergence speed, while partial fine-tuning underperformed across metrics. These findings underscore the potential of combining TL and MbNets for resource-efficient AHCR, paving the way for further optimizations and broader applications. 
Future work will explore architectural modifications, in-depth dataset feature analysis, data augmentation, and advanced sensitivity analysis to enhance model robustness and generalizability.

[131] LUIVITON: Learned Universal Interoperable VIrtual Try-ON

Cong Cao,Xianhang Cheng,Jingyuan Liu,Yujian Zheng,Zhenhui Lin,Meriem Chkir,Hao Li

Main category: cs.CV

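TL;DR: LUIVITON is a fully automated virtual try-on system that drapes complex clothing onto diverse humanoid characters without human intervention and supports fast customization of clothing size and material.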

Details

Motivation: To address the challenge of aligning complex, multi-layer clothing with diverse and arbitrarily posed humanoid characters, the authors develop a fully automated virtual try-on system. Method: Using SMPL as a proxy representation, the clothing-to-body draping problem is split into two correspondence tasks: 1) clothing-to-SMPL and 2) body-to-SMPL. The clothing-to-SMPL fitting problem is solved with a geometric learning-based approach, while body-to-SMPL correspondence is handled by a diffusion model-based approach that combines multi-view consistent appearance features with a pre-trained 2D foundation model. Result: LUIVITON efficiently handles complex geometries and non-manifold meshes, and generalizes to a wide range of humanoid characters (including humans, robots, cartoon subjects, creatures, and aliens). The system also supports fast customization, allowing users to adjust clothing sizes and material properties after draping. Conclusion: LUIVITON is an end-to-end system for fully automated virtual try-on that handles clothing-to-body alignment in a wide variety of complex cases and produces high-quality 3D clothing fittings even when 2D clothing sewing patterns are not available. Abstract: We present LUIVITON, an end-to-end system for fully automated virtual try-on, capable of draping complex, multi-layer clothing onto diverse and arbitrarily posed humanoid characters. To address the challenge of aligning complex garments with arbitrary and highly diverse body shapes, we use SMPL as a proxy representation and separate the clothing-to-body draping problem into two correspondence tasks: 1) clothing-to-SMPL and 2) body-to-SMPL correspondence, where each has its unique challenges. While we address the clothing-to-SMPL fitting problem using a geometric learning-based approach for partial-to-complete shape correspondence prediction, we introduce a diffusion model-based approach for body-to-SMPL correspondence using multi-view consistent appearance features and a pre-trained 2D foundation model. Our method can handle complex geometries, non-manifold meshes, and generalizes effectively to a wide range of humanoid characters -- including humans, robots, cartoon subjects, creatures, and aliens, while maintaining computational efficiency for practical adoption. In addition to offering a fully automatic fitting solution, LUIVITON supports fast customization of clothing size, allowing users to adjust clothing sizes and material properties after they have been draped. We show that our system can produce high-quality 3D clothing fittings without any human labor, even when 2D clothing sewing patterns are not available.

[132] Towards Efficient Pixel Labeling for Industrial Anomaly Detection and Localization

Jingqi Wu,Hanxi Li,Lin Yuanbo Wu,Hao Chen,Deyin Liu,Peng Wang

Main category: cs.CV

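TL;DR: ADClick is an interactive image segmentation algorithm that generates pixel-level anomaly annotations from a few user clicks and a brief textual description, and ADClick-Seg is a cross-modal framework for anomaly detection and localization; together they significantly improve the performance of industrial anomaly detection models.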

Details

Motivation: In industrial product inspection, defective samples can be collected during production, but leveraging them usually requires pixel-level annotations, which limits scalability. Method: The paper proposes ADClick, an interactive image segmentation algorithm driven by user clicks and a textual description, and ADClick-Seg, a cross-modal framework that aligns visual features and textual prompts via a prototype-based approach. Result: ADClick significantly improves AD model performance (e.g., AP = 96.1% on MVTec AD); ADClick-Seg achieves state-of-the-art results on the multi-class AD task (AP = 80.0%, PRO = 97.5%, Pixel-AUROC = 99.1% on MVTec AD). Conclusion: ADClick and ADClick-Seg offer an efficient and precise solution for industrial anomaly detection, particularly in generating pixel-level anomaly annotations from user interaction and textual descriptions. Abstract: Industrial product inspection is often performed using Anomaly Detection (AD) frameworks trained solely on non-defective samples. Although defective samples can be collected during production, leveraging them usually requires pixel-level annotations, limiting scalability. To address this, we propose ADClick, an Interactive Image Segmentation (IIS) algorithm for industrial anomaly detection. ADClick generates pixel-wise anomaly annotations from only a few user clicks and a brief textual description, enabling precise and efficient labeling that significantly improves AD model performance (e.g., AP = 96.1% on MVTec AD). We further introduce ADClick-Seg, a cross-modal framework that aligns visual features and textual prompts via a prototype-based approach for anomaly detection and localization. By combining pixel-level priors with language-guided cues, ADClick-Seg achieves state-of-the-art results on the challenging "Multi-class" AD task (AP = 80.0%, PRO = 97.5%, Pixel-AUROC = 99.1% on MVTec AD).

[133] Systematic Review and Meta-analysis of AI-driven MRI Motion Artifact Detection and Correction

Mojtaba Safari,Zach Eidex,Richard L. J. Qiu,Matthew Goette,Tonghe Wang,Xiaofeng Yang

Main category: cs.CV

TL;DR: This paper explores AI-driven methods, especially deep learning generative models, for detecting and correcting MRI motion artifacts. While these methods show promise in improving image quality, challenges like limited generalizability and reliance on paired training data persist. The study emphasizes the need for standardized datasets and reporting protocols to enhance MRI diagnostic accuracy and patient care outcomes.

Details

Motivation: To systematically review and perform a meta-analysis of artificial intelligence (AI)-driven methods for detecting and correcting magnetic resonance imaging (MRI) motion artifacts, assessing current developments, effectiveness, challenges, and future research directions. Method: A comprehensive systematic review and meta-analysis were conducted, focusing on deep learning (DL) approaches, particularly generative models, for the detection and correction of MRI motion artifacts. Result: DL, particularly generative models, show promise for reducing motion artifacts and improving image quality; however, limited generalizability, reliance on paired training data, and risk of visual distortions remain key challenges. Conclusion: AI-driven methods, particularly DL generative models, show significant potential for improving MRI image quality by effectively addressing motion artifacts. However, critical challenges must be addressed, including the need for comprehensive public datasets, standardized reporting protocols for artifact levels, and more advanced, adaptable DL techniques to reduce reliance on extensive paired datasets. Abstract: Background: To systematically review and perform a meta-analysis of artificial intelligence (AI)-driven methods for detecting and correcting magnetic resonance imaging (MRI) motion artifacts, assessing current developments, effectiveness, challenges, and future research directions. Methods: A comprehensive systematic review and meta-analysis were conducted, focusing on deep learning (DL) approaches, particularly generative models, for the detection and correction of MRI motion artifacts. Quantitative data were extracted regarding utilized datasets, DL architectures, and performance metrics. 
Results: DL, particularly generative models, show promise for reducing motion artifacts and improving image quality; however, limited generalizability, reliance on paired training data, and risk of visual distortions remain key challenges that motivate standardized datasets and reporting. Conclusions: AI-driven methods, particularly DL generative models, show significant potential for improving MRI image quality by effectively addressing motion artifacts. However, critical challenges must be addressed, including the need for comprehensive public datasets, standardized reporting protocols for artifact levels, and more advanced, adaptable DL techniques to reduce reliance on extensive paired datasets. Addressing these aspects could substantially enhance MRI diagnostic accuracy, reduce healthcare costs, and improve patient care outcomes.

[134] GeoSplat: A Deep Dive into Geometry-Constrained Gaussian Splatting

Yangming Li,Chaoyu Liu,Lihao Liu,Simon Masnou,Carola-Bibian Schönlieb

Main category: cs.CV

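TL;DR: This paper proposes GeoSplat, a new framework that uses geometric constraints to optimize Gaussian splatting training; by introducing more efficient and noise-robust estimation of geometric priors, it achieves better results on novel view synthesis.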

Details

Motivation: Prior work mainly uses low-order geometric priors (e.g., normal vectors) that are unreliably estimated by noise-sensitive methods such as local principal component analysis, so a more robust and efficient geometry-constrained optimization method is needed. Method: The GeoSplat framework combines first-order and second-order geometric quantities to improve Gaussian initialization, gradient update, and densification; it initializes Gaussian scales from principal curvatures and introduces efficient, noise-robust estimation of geometric priors based on local manifold structure. Result: Extensive experiments on multiple datasets show that GeoSplat significantly improves the performance of Gaussian splatting and outperforms existing methods on novel view synthesis. Conclusion: GeoSplat is a geometry-constrained optimization framework that uses first-order and second-order geometric quantities to improve the Gaussian splatting training pipeline, significantly improving performance on novel view synthesis and outperforming previous baselines. Abstract: A few recent works explored incorporating geometric priors to regularize the optimization of Gaussian splatting, further improving its performance. However, those early studies mainly focused on the use of low-order geometric priors (e.g., normal vector), and they are also unreliably estimated by noise-sensitive methods, like local principal component analysis. To address their limitations, we first present GeoSplat, a general geometry-constrained optimization framework that exploits both first-order and second-order geometric quantities to improve the entire training pipeline of Gaussian splatting, including Gaussian initialization, gradient update, and densification. As an example, we initialize the scales of 3D Gaussian primitives in terms of principal curvatures, leading to a better coverage of the object surface than random initialization. Secondly, based on certain geometric structures (e.g., local manifold), we introduce efficient and noise-robust estimation methods that provide dynamic geometric priors for our framework. We conduct extensive experiments on multiple datasets for novel view synthesis, showing that our framework, GeoSplat, significantly improves the performance of Gaussian splatting and outperforms previous baselines.

[135] Scale-interaction transformer: a hybrid cnn-transformer model for facial beauty prediction

Djamel Eddine Boukhari

Main category: cs.CV

TL;DR: This paper introduces the Scale-Interaction Transformer (SIT), a hybrid deep learning model that combines CNNs and Transformers to improve Automated Facial Beauty Prediction by explicitly modeling the interplay between multi-scale facial features.

Details

Motivation: Convolutional Neural Networks (CNNs) process information at a fixed scale, potentially overlooking the inter-dependencies between facial features at different levels of granularity, which is crucial for Automated Facial Beauty Prediction (FBP). Method: The authors proposed a hybrid architecture named Scale-Interaction Transformer (SIT), which combines CNNs for multi-scale feature extraction and Transformers for modeling inter-dependencies between features. Result: The proposed SIT model achieved a Pearson Correlation of 0.9187 on the SCUT-FBP5500 benchmark dataset, establishing a new state-of-the-art performance. Conclusion: The SIT architecture demonstrates the importance of explicitly modeling interactions between multi-scale facial features for high-performance FBP, and it shows the potential of hybrid CNN-Transformer models in complex image regression tasks. Abstract: Automated Facial Beauty Prediction (FBP) is a challenging computer vision task due to the complex interplay of local and global facial features that influence human perception. While Convolutional Neural Networks (CNNs) excel at feature extraction, they often process information at a fixed scale, potentially overlooking the critical inter-dependencies between features at different levels of granularity. To address this limitation, we introduce the Scale-Interaction Transformer (SIT), a novel hybrid deep learning architecture that synergizes the feature extraction power of CNNs with the relational modeling capabilities of Transformers. The SIT first employs a multi-scale module with parallel convolutions to capture facial characteristics at varying receptive fields. These multi-scale representations are then framed as a sequence and processed by a Transformer encoder, which explicitly models their interactions and contextual relationships via a self-attention mechanism. 
We conduct extensive experiments on the widely-used SCUT-FBP5500 benchmark dataset, where the proposed SIT model establishes a new state-of-the-art. It achieves a Pearson Correlation of 0.9187, outperforming previous methods. Our findings demonstrate that explicitly modeling the interplay between multi-scale visual cues is crucial for high-performance FBP. The success of the SIT architecture highlights the potential of hybrid CNN-Transformer models for complex image regression tasks that demand a holistic, context-aware understanding.
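The Pearson Correlation reported above measures how linearly the predicted beauty scores track the human ratings. As a minimal sketch of how such a figure is computed (the function name and the sample values in the usage note are illustrative assumptions, not taken from the paper):

```python
import math


def pearson_correlation(predicted, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(target) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    # Covariance and variances are left unnormalized; the 1/n factors cancel.
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_t)
```

For example, `pearson_correlation([2.1, 3.4, 4.0], [2.0, 3.5, 4.2])` returns a value close to 1 because the hypothetical predictions rise and fall with the ratings.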

[136] Robust Experts: the Effect of Adversarial Training on CNNs with Sparse Mixture-of-Experts Layers

Svetlana Pavlitska,Haixi Fan,Konstantin Ditschuneit,J. Marius Zöllner

Main category: cs.CV

TL;DR: This paper explores the use of sparse mixture-of-experts (MoE) layers in convolutional neural networks (CNNs) to improve robustness against adversarial attacks without increasing inference cost.

Details

Motivation: Robustifying CNNs against adversarial attacks is challenging and often resource-intensive. The authors aim to find a solution that increases model capacity without adding inference cost. Method: The authors replace selected residual blocks or convolutional layers in ResNet architectures with sparse MoE layers and train the models on CIFAR-100. They evaluate robustness under PGD and AutoPGD attacks when combined with adversarial training. Result: Inserting a single MoE layer in the deeper stages of ResNet leads to consistent improvements in robustness under PGD and AutoPGD attacks when combined with adversarial training. The use of switch loss for balancing causes routing to collapse onto a small set of overused experts, concentrating adversarial training on these paths and inadvertently making them more robust. Conclusion: The study suggests that robust subpaths emerge through specialization, as some individual experts outperform the gated MoE model in robustness. The code is available at a specified GitHub repository. Abstract: Robustifying convolutional neural networks (CNNs) against adversarial attacks remains challenging and often requires resource-intensive countermeasures. We explore the use of sparse mixture-of-experts (MoE) layers to improve robustness by replacing selected residual blocks or convolutional layers, thereby increasing model capacity without additional inference cost. On ResNet architectures trained on CIFAR-100, we find that inserting a single MoE layer in the deeper stages leads to consistent improvements in robustness under PGD and AutoPGD attacks when combined with adversarial training. Furthermore, we discover that when switch loss is used for balancing, it causes routing to collapse onto a small set of overused experts, thereby concentrating adversarial training on these paths and inadvertently making them more robust. 
As a result, some individual experts outperform the gated MoE model in robustness, suggesting that robust subpaths emerge through specialization. Our code is available at https://github.com/KASTEL-MobilityLab/robust-sparse-moes.
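To make the routing-collapse observation concrete, the sketch below measures how top-1 routing spreads inputs across experts and evaluates a load-balancing term. It assumes the auxiliary loss form popularized by the Switch Transformer (number of experts times the sum over experts of dispatch fraction times mean gate probability); the paper's exact loss, gate design, and all function names here are assumptions, and the authors' repository remains the reference.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expert_usage(batch_gate_logits, num_experts):
    """Fraction of inputs sent to each expert under top-1 routing."""
    counts = [0] * num_experts
    for logits in batch_gate_logits:
        probs = softmax(logits)
        winner = max(range(num_experts), key=probs.__getitem__)
        counts[winner] += 1
    return [c / len(batch_gate_logits) for c in counts]


def balance_loss(batch_gate_logits, num_experts):
    """Assumed Switch-style term: N * sum_i f_i * P_i (1.0 when uniform)."""
    fractions = expert_usage(batch_gate_logits, num_experts)
    mean_probs = [0.0] * num_experts
    for logits in batch_gate_logits:
        for i, p in enumerate(softmax(logits)):
            mean_probs[i] += p / len(batch_gate_logits)
    return num_experts * sum(f * p for f, p in zip(fractions, mean_probs))
```

A usage vector such as `[1.0, 0.0]` from `expert_usage` indicates that every input reached the same expert, which is the kind of concentration the abstract describes.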

[137] Semi-supervised Deep Transfer for Regression without Domain Alignment

Mainak Biswas,Ambedkar Dukkipati,Devarajan Sridharan

Main category: cs.CV

TL;DR: This paper introduces CRAFT, a source-free, semi-supervised transfer learning method for regression tasks, which improves model generalization in scenarios with limited labeled data and no access to source data, particularly in neuroscience and medicine.

Details

Motivation: The study addresses the challenge of domain shifts in real-world applications where source data are unavailable, labeled target data are scarce, and predictions involve continuous-valued outputs. Method: CRAFT builds upon the Contradistinguisher (CUDA) framework, adapting it for source-free, semi-supervised transfer of pretrained models in regression tasks without intermediate representation alignment. Result: CRAFT demonstrated up to 9% improvement in RMSE over fine-tuned models when labeled examples were scarce, outperformed four state-of-the-art source-free DA models by more than 3%, and showed efficacy on two additional real-world regression benchmarks. Conclusion: CRAFT is proposed as an efficient approach for source-free, semi-supervised deep transfer in regression tasks, showing improved performance in neuroscience and other real-world applications. Abstract: Deep learning models deployed in real-world applications (e.g., medicine) face challenges because source models do not generalize well to domain-shifted target data. Many successful domain adaptation (DA) approaches require full access to source data. Yet, such requirements are unrealistic in scenarios where source data cannot be shared either because of privacy concerns or because it is too large and incurs prohibitive storage or computational costs. Moreover, resource constraints may limit the availability of labeled targets. We illustrate this challenge in a neuroscience setting where source data are unavailable, labeled target data are meager, and predictions involve continuous-valued outputs. We build upon Contradistinguisher (CUDA), an efficient framework that learns a shared model across the labeled source and unlabeled target samples, without intermediate representation alignment. Yet, CUDA was designed for unsupervised DA, with full access to source data, and for classification tasks. 
We develop CRAFT -- a Contradistinguisher-based Regularization Approach for Flexible Training -- for source-free (SF), semi-supervised transfer of pretrained models in regression tasks. We showcase the efficacy of CRAFT in two neuroscience settings: gaze prediction with electroencephalography (EEG) data and "brain age" prediction with structural MRI data. For both datasets, CRAFT yielded up to 9% improvement in root-mean-squared error (RMSE) over fine-tuned models when labeled training examples were scarce. Moreover, CRAFT leveraged unlabeled target data and outperformed four competing state-of-the-art source-free domain adaptation models by more than 3%. Lastly, we demonstrate the efficacy of CRAFT on two other real-world regression benchmarks. We propose CRAFT as an efficient approach for source-free, semi-supervised deep transfer for regression that is ubiquitous in biology and medicine.
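The improvements above are stated in terms of RMSE. A minimal sketch of the metric follows, together with one plausible reading of "improvement" as a relative reduction against a baseline. That reading is an assumption, since the abstract does not say whether the percentages are relative or absolute, and the function names are illustrative.

```python
import math


def rmse(predictions, targets):
    """Root-mean-squared error between two equal-length sequences."""
    if len(predictions) != len(targets) or not predictions:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(baseline_rmse, new_rmse):
    """Fractional RMSE reduction relative to the baseline (assumed reading)."""
    if baseline_rmse <= 0:
        raise ValueError("baseline RMSE must be positive")
    return (baseline_rmse - new_rmse) / baseline_rmse
```

Under this reading, a hypothetical baseline RMSE of 2.0 reduced to 1.82 corresponds to a 9% improvement.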

[138] A Scalable Attention-Based Approach for Image-to-3D Texture Mapping

Arianna Rampini,Kanika Madan,Bruno Roy,AmirHossein Zamani,Derek Cheung

Main category: cs.CV

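TL;DR: This paper proposes a fast and efficient 3D texture generation method that predicts high-quality textures from a single image and a mesh, outperforming existing methods.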

Details

Motivation: Existing generative methods are slow, rely on UV maps, and often fail to remain faithful to the reference image. Method: The method combines a triplane representation with depth-based backprojection losses for efficient training and fast inference. Result: The method generates high-fidelity textures in a single forward pass, requiring only 0.2s per shape, and outperforms existing methods on single-image texture reconstruction. Conclusion: The paper presents a transformer-based framework that predicts a 3D texture field directly from a single image and a mesh, enabling high-quality, controllable 3D content creation. Abstract: High-quality textures are critical for realistic 3D content creation, yet existing generative methods are slow, rely on UV maps, and often fail to remain faithful to a reference image. To address these challenges, we propose a transformer-based framework that predicts a 3D texture field directly from a single image and a mesh, eliminating the need for UV mapping and differentiable rendering, and enabling faster texture generation. Our method integrates a triplane representation with depth-based backprojection losses, enabling efficient training and faster inference. Once trained, it generates high-fidelity textures in a single forward pass, requiring only 0.2s per shape. Extensive qualitative, quantitative, and user preference evaluations demonstrate that our method outperforms state-of-the-art baselines on single-image texture reconstruction in terms of both fidelity to the input image and perceptual quality, highlighting its practicality for scalable, high-quality, and controllable 3D content creation.

[139] SGS-3D: High-Fidelity 3D Instance Segmentation via Reliable Semantic Mask Splitting and Growing

Chaolei Wang,Yang Luo,Jing Du,Siyu Chen,Yiping Chen,Ting Han

Main category: cs.CV

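TL;DR: This paper proposes SGS-3D, a new 3D instance segmentation method that combines semantic and geometric information to improve segmentation accuracy and robustness.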

Details

Motivation: 3D instance segmentation based on 2D-to-3D lifting suffers from accumulated errors in instance-level segmentation, mainly caused by ambiguous semantic guidance and insufficient depth constraints. Method: The paper proposes SGS-3D, a new "split-then-grow" framework that uses geometric primitives to filter and split ambiguous lifted masks, then grows them into complete instances within the scene. Result: On ScanNet200, ScanNet++, and KITTI-360, SGS-3D shows higher segmentation accuracy and robustness, especially against inaccurate masks produced by pre-trained models. Conclusion: By combining semantic and geometric information, SGS-3D effectively resolves ambiguity in 3D instance segmentation, and its gains in segmentation accuracy and robustness are validated on multiple datasets. Abstract: Accurate 3D instance segmentation is crucial for high-quality scene understanding in the 3D vision domain. However, 3D instance segmentation based on 2D-to-3D lifting approaches struggles to produce precise instance-level segmentation, due to accumulated errors introduced during the lifting process from ambiguous semantic guidance and insufficient depth constraints. To tackle these challenges, we propose splitting and growing reliable semantic mask for high-fidelity 3D instance segmentation (SGS-3D), a novel "split-then-grow" framework that first purifies and splits ambiguous lifted masks using geometric primitives, and then grows them into complete instances within the scene. Unlike existing approaches that directly rely on raw lifted masks and sacrifice segmentation accuracy, SGS-3D serves as a training-free refinement method that jointly fuses semantic and geometric information, enabling effective cooperation between the two levels of representation. Specifically, for semantic guidance, we introduce a mask filtering strategy that leverages the co-occurrence of 3D geometry primitives to identify and remove ambiguous masks, thereby ensuring more reliable semantic consistency with the 3D object instances. For the geometric refinement, we construct fine-grained object instances by exploiting both spatial continuity and high-level features, particularly in the case of semantic ambiguity between distinct objects.
Experimental results on ScanNet200, ScanNet++, and KITTI-360 demonstrate that SGS-3D substantially improves segmentation accuracy and robustness against inaccurate masks from pre-trained models, yielding high-fidelity object instances while maintaining strong generalization across diverse indoor and outdoor environments. Code is available in the supplementary materials.

[140] SL-SLR: Self-Supervised Representation Learning for Sign Language Recognition

Ariel Basso Madjoukeng,Jérôme Fink,Pierre Poitier,Edith Belise Kenmogne,Benoit Frenay

Main category: cs.CV

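TL;DR: This paper proposes a self-supervised learning framework for sign language recognition that significantly improves recognition performance through free-negative pairs and a data augmentation technique.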

Details

Motivation: Sign language recognition is challenging because annotated data are scarce, and conventional contrastive learning methods fail to account for the differing importance of regions within a video and for movements shared between different signs. Method: The paper proposes a self-supervised approach with free-negative pairs and a new data augmentation technique, designed to work together to improve representation learning for sign language recognition. Result: Across several tasks (linear evaluation, semi-supervised learning, and transfer between sign languages), the proposed method shows a considerable gain in accuracy over contrastive and other self-supervised methods. Conclusion: The paper proposes a new self-supervised learning framework for sign language recognition that addresses key issues of contrastive learning and performs well across multiple evaluation tasks. Abstract: Sign language recognition (SLR) is a machine learning task aiming to identify signs in videos. Due to the scarcity of annotated data, unsupervised methods like contrastive learning have become promising in this field. They learn meaningful representations by pulling positive pairs (two augmented versions of the same instance) closer and pushing negative pairs (different from the positive pairs) apart. In SLR, in a sign video, only certain parts provide information that is truly useful for its recognition. Applying contrastive methods to SLR raises two issues: (i) contrastive learning methods treat all parts of a video in the same way, without taking into account the relevance of certain parts over others; (ii) shared movements between different signs make negative pairs highly similar, complicating sign discrimination. These issues lead to learning non-discriminative features for sign recognition and poor results in downstream tasks. In response, this paper proposes a self-supervised learning framework designed to learn meaningful representations for SLR. This framework consists of two key components designed to work together: (i) a new self-supervised approach with free-negative pairs; (ii) a new data augmentation technique. This approach shows a considerable gain in accuracy compared to several contrastive and self-supervised methods, across linear evaluation, semi-supervised learning, and transferability between sign languages.

[141] Enhancing 3D Point Cloud Classification with ModelNet-R and Point-SkipNet

Mohammad Saeid,Amir Salarpour,Pedram MohajerAnsari

Main category: cs.CV

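TL;DR: This paper introduces the ModelNet-R dataset and the Point-SkipNet model, addressing dataset issues in 3D point cloud classification and reducing computational requirements while maintaining high accuracy.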

Details

Motivation: The ModelNet40 dataset suffers from inconsistent labeling, 2D data, size mismatches, and inadequate class differentiation, which hinder model performance. To address these issues, the paper introduces ModelNet-R, a meticulously refined version of ModelNet40. Method: The paper proposes Point-SkipNet, a lightweight graph-based neural network that uses efficient sampling, neighborhood grouping, and skip connections to achieve high classification accuracy with reduced computational overhead. Result: Point-SkipNet achieves state-of-the-art accuracy on ModelNet-R with a substantially lower parameter count than contemporary models. Conclusion: Dataset quality plays a crucial role in optimizing model efficiency for 3D point cloud classification. Abstract: The classification of 3D point clouds is crucial for applications such as autonomous driving, robotics, and augmented reality. However, the commonly used ModelNet40 dataset suffers from limitations such as inconsistent labeling, 2D data, size mismatches, and inadequate class differentiation, which hinder model performance. This paper introduces ModelNet-R, a meticulously refined version of ModelNet40 designed to address these issues and serve as a more reliable benchmark. Additionally, this paper proposes Point-SkipNet, a lightweight graph-based neural network that leverages efficient sampling, neighborhood grouping, and skip connections to achieve high classification accuracy with reduced computational overhead. Extensive experiments demonstrate that models trained on ModelNet-R exhibit significant performance improvements. Notably, Point-SkipNet achieves state-of-the-art accuracy on ModelNet-R with a substantially lower parameter count compared to contemporary models. This research highlights the crucial role of dataset quality in optimizing model efficiency for 3D point cloud classification. For more details, see the code at: https://github.com/m-saeid/ModeNetR_PointSkipNet.
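The sampling and neighborhood grouping steps can be pictured with a small sketch. It assumes farthest point sampling and k-nearest-neighbor grouping, which are common choices in point cloud networks; the abstract does not state which variants Point-SkipNet uses, so both functions are hypothetical stand-ins and the linked repository remains the reference.

```python
import math


def farthest_point_sampling(points, k):
    """Greedy farthest point sampling starting from index 0; returns indices."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    selected = [0]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[0]) for p in points]
    while len(selected) < k:
        idx = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(idx)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[idx]))
    return selected


def group_neighbors(points, center_indices, num_neighbors):
    """For each center, return the indices of its nearest points (itself first)."""
    groups = []
    for c in center_indices:
        order = sorted(range(len(points)),
                       key=lambda i: math.dist(points[i], points[c]))
        groups.append(order[:num_neighbors])
    return groups
```

This brute-force version is quadratic in the number of points and is meant only to show the data flow: sampled centers first, then one local neighborhood per center.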

[142] Symbolic Graphics Programming with Large Language Models

Yamei Chen,Haoquan Zhang,Yangyi Huang,Zeju Qiu,Kaipeng Zhang,Yandong Wen,Weiyang Liu

Main category: cs.CV

TL;DR: This paper introduces a reinforcement learning approach to improve large language models' ability to generate scalable vector graphics from natural-language descriptions, achieving results comparable to leading models.

Details

Motivation: To explore how LLMs understand the visual world by generating symbolic graphics programs (SGPs) and bridge the performance gap between open-source and proprietary models. Method: A reinforcement learning approach with verifiable rewards, using a format-validity gate and cross-modal reward to align text and rendered images. Result: The proposed method improves SVG generation on Qwen-2.5-7B, achieving performance comparable to frontier systems, while also enhancing object decomposition and scene coherence. Conclusion: Symbolic graphics programming provides a precise and interpretable way to understand cross-modal grounding, and the proposed RL-based method effectively improves SVG generation quality and semantics. Abstract: Large language models (LLMs) excel at program synthesis, yet their ability to produce symbolic graphics programs (SGPs) that render into precise visual content remains underexplored. We study symbolic graphics programming, where the goal is to generate an SGP from a natural-language description. This task also serves as a lens into how LLMs understand the visual world by prompting them to generate images rendered from SGPs. Among various SGPs, our paper sticks to scalable vector graphics (SVGs). We begin by examining the extent to which LLMs can generate SGPs. To this end, we introduce SGP-GenBench, a comprehensive benchmark covering object fidelity, scene fidelity, and compositionality (attribute binding, spatial relations, numeracy). On SGP-GenBench, we discover that frontier proprietary models substantially outperform open-source models, and performance correlates well with general coding capabilities. Motivated by this gap, we aim to improve LLMs' ability to generate SGPs. 
We propose a reinforcement learning (RL) with verifiable rewards approach, where a format-validity gate ensures renderable SVG, and a cross-modal reward aligns text and the rendered image via strong vision encoders (e.g., SigLIP for text-image and DINO for image-image). Applied to Qwen-2.5-7B, our method substantially improves SVG generation quality and semantics, achieving performance on par with frontier systems. We further analyze training dynamics, showing that RL induces (i) finer decomposition of objects into controllable primitives and (ii) contextual details that improve scene coherence. Our results demonstrate that symbolic graphics programming offers a precise and interpretable lens on cross-modal grounding.
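The format-validity gate can be pictured as a check that runs before any cross-modal score is computed. The sketch below uses XML well-formedness with an `svg` root element as a cheap proxy for "renderable", which is an assumption rather than the paper's actual check; `similarity_fn` is a hypothetical placeholder for a vision-encoder score such as the SigLIP or DINO similarities mentioned above.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def is_renderable_svg(text):
    """Proxy check: the text parses as XML and its root element is <svg>."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    return root.tag in ("svg", "{%s}svg" % SVG_NS)


def gated_reward(svg_text, similarity_fn):
    """Return 0.0 for invalid output, otherwise the (assumed) similarity score."""
    if not is_renderable_svg(svg_text):
        return 0.0
    return float(similarity_fn(svg_text))
```

Gating in this way means malformed programs never receive partial credit, so the policy cannot raise its reward without first producing output that parses.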

[143] COGITAO: A Visual Reasoning Framework To Study Compositionality & Generalization

Yassine Taoudi-Benchekroun,Klim Troyan,Pascal Sager,Stefan Gerber,Lukas Tuggener,Benjamin Grewe

Main category: cs.CV

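TL;DR: COGITAO is a framework for studying compositionality and generalization in visual domains; it can generate a large number of task rules across a range of difficulty levels to test the generalization of state-of-the-art vision models.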

Details

Motivation: To address the limitations of current machine learning models in composing learned concepts and applying them in novel settings, COGITAO aims to provide a platform for systematically studying compositionality and generalization. Method: Drawing on ARC-AGI's problem setting, COGITAO constructs rule-based tasks that apply a set of 28 interoperable transformations to objects in grid-like environments, with adjustable composition depth. Result: Experiments show that state-of-the-art vision models, despite strong in-domain performance, consistently fail to generalize to novel combinations of familiar elements. Conclusion: COGITAO provides a flexible and powerful tool for studying and improving compositionality and generalization in machine learning models, and its open-source release supports continued research in this field. Abstract: The ability to compose learned concepts and apply them in novel settings is key to human intelligence, but remains a persistent limitation in state-of-the-art machine learning models. To address this issue, we introduce COGITAO, a modular and extensible data generation framework and benchmark designed to systematically study compositionality and generalization in visual domains. Drawing inspiration from ARC-AGI's problem-setting, COGITAO constructs rule-based tasks which apply a set of transformations to objects in grid-like environments. It supports composition, at adjustable depth, over a set of 28 interoperable transformations, along with extensive control over grid parametrization and object properties. This flexibility enables the creation of millions of unique task rules -- surpassing concurrent datasets by several orders of magnitude -- across a wide range of difficulties, while allowing virtually unlimited sample generation per rule. We provide baseline experiments using state-of-the-art vision models, highlighting their consistent failures to generalize to novel combinations of familiar elements, despite strong in-domain performance. COGITAO is fully open-sourced, including all code and datasets, to support continued research in this field.
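Composition at adjustable depth can be illustrated with a toy example. The three grid transformations and the `compose` helper below are hypothetical and are not COGITAO's API or its actual set of 28 transformations; they act on whole grids, whereas COGITAO applies transformations to objects within the grid. The sketch only shows how a task rule can be built by chaining transformations.

```python
def rotate90(grid):
    """Rotate a grid (list of row lists) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid):
    """Mirror each row left to right."""
    return [row[::-1] for row in grid]


def translate_right(grid, fill=0):
    """Shift every row one cell right, dropping the last cell."""
    return [[fill] + row[:-1] for row in grid]


def compose(*transforms):
    """Build a rule that applies the transforms in order; depth = len(transforms)."""
    def rule(grid):
        for transform in transforms:
            grid = transform(grid)
        return grid
    return rule
```

A depth-2 rule such as `compose(translate_right, rotate90)` is distinct from either transformation alone, which is what allows the number of unique rules to grow rapidly with depth.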

[144] WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool

Zizun Li,Jianjun Zhou,Yifan Wang,Haoyu Guo,Wenzheng Chang,Yang Zhou,Haoyi Zhu,Junyi Chen,Chunhua Shen,Tong He

Main category: cs.CV

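TL;DR: WinT3R is a feed-forward reconstruction model that achieves high-quality online reconstruction and precise camera pose estimation through a sliding window mechanism and a compact camera representation.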

Details

Motivation: Previous methods suffer from a trade-off between reconstruction quality and real-time performance, which WinT3R aims to resolve. Method: A sliding window mechanism ensures information exchange among frames within the window, and a compact camera representation together with a global camera token pool improves the reliability of camera pose estimation. Result: Experiments on multiple datasets validate WinT3R's strong performance in online reconstruction quality, camera pose estimation, and reconstruction speed. Conclusion: Through its sliding window mechanism and compact camera representation, WinT3R achieves state-of-the-art performance in online reconstruction quality, camera pose estimation, and reconstruction speed. Abstract: We present WinT3R, a feed-forward reconstruction model capable of online prediction of precise camera poses and high-quality point maps. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding window mechanism that ensures sufficient information exchange among frames within the window, thereby improving the quality of geometric predictions without large computation. In addition, we leverage a compact representation of cameras and maintain a global camera token pool, which enhances the reliability of camera pose estimation without sacrificing efficiency. These designs enable WinT3R to achieve state-of-the-art performance in terms of online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets. Code and model are publicly available at https://github.com/LiZizun/WinT3R.
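The windowing idea can be sketched as a generator that groups an incoming frame stream into fixed-size windows. The window size, the stride, whether consecutive windows overlap, and the handling of trailing frames are all assumptions made for illustration; the abstract does not specify them, and the linked repository is the reference for the actual mechanism.

```python
def stream_windows(frames, window_size, stride):
    """Yield windows of frames from a stream; consecutive windows share
    window_size - stride frames, and trailing frames are flushed at the end."""
    if not 0 < stride <= window_size:
        raise ValueError("require 0 < stride <= window_size")
    buffer = []
    unseen = 0  # frames received since the last emitted window
    for frame in frames:
        buffer.append(frame)
        unseen += 1
        if len(buffer) == window_size:
            yield list(buffer)
            buffer = buffer[stride:]  # keep the overlap for the next window
            unseen = 0
    if unseen:
        yield list(buffer)
```

Because each window is emitted as soon as it fills, a downstream model can process frames online without waiting for the full sequence.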

[145] FlowSeek: Optical Flow Made Easier with Depth Foundation Models and Motion Bases

Matteo Poggi,Fabio Tosi

Main category: cs.CV

TL;DR: FlowSeek is a resource-efficient optical flow framework that outperforms existing methods in performance while being trained on much lower hardware budgets.

Details

Motivation: The motivation is to develop an optical flow framework that is both accurate and resource-efficient, enabling training on low-budget hardware. Method: FlowSeek combines recent advances in optical flow network design with single-image depth foundation models and classical low-dimensional motion parametrization. Result: FlowSeek achieves superior cross-dataset generalization on Sintel Final, KITTI, Spring, and LayeredFlow datasets, with a 10% and 15% improvement over SEA-RAFT on Sintel Final and KITTI, respectively. Conclusion: FlowSeek is a novel and compact framework for optical flow that achieves superior performance while requiring significantly less hardware resources for training. Abstract: We present FlowSeek, a novel framework for optical flow requiring minimal hardware resources for training. FlowSeek marries the latest advances on the design space of optical flow networks with cutting-edge single-image depth foundation models and classical low-dimensional motion parametrization, implementing a compact, yet accurate architecture. FlowSeek is trained on a single consumer-grade GPU, a hardware budget about 8x lower compared to most recent methods, and still achieves superior cross-dataset generalization on Sintel Final and KITTI, with a relative improvement of 10 and 15% over the previous state-of-the-art SEA-RAFT, as well as on Spring and LayeredFlow datasets.

Table of Contents

cs.CL [Back]

[1] INSEva: A Comprehensive Chinese Benchmark for Large Language Models in Insurance

[2] Mentalic Net: Development of RAG-based Conversational AI and Evaluation Framework for Mental Health Support

[3] Do MLLMs Really Understand the Charts?

[4] Predicting Failures of LLMs to Link Biomedical Ontology Terms to Identifiers Evidence Across Models and Ontologies

[5] Uncertainty-Aware Collaborative System of Large and Small Models for Multimodal Sentiment Analysis

[6] CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection

[7] From Post To Personality: Harnessing LLMs for MBTI Prediction in Social Media

[8] Benchmarking GPT-5 for biomedical natural language processing

[9] Can Multiple Responses from an LLM Reveal the Sources of Its Uncertainty?

[10] Emotionally-Aware Agents for Dispute Resolution

[11] Just-in-time and distributed task representations in language models

[12] Enhancing LLM Efficiency: Targeted Pruning for Prefill-Decode Disaggregation in Inference

[13] Evaluating Large Language Models for Financial Reasoning: A CFA-Based Benchmark Study

[14] Multi-Modal Vision vs. Text-Based Parsing: Benchmarking LLM Strategies for Invoice Processing

[15] COCORELI: Cooperative, Compositional Reconstitution & Execution of Language Instructions

[16] MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification

[17] RECAP: REwriting Conversations for Intent Understanding in Agentic Planning

[18] SpeechLLM: Unified Speech and Language Model for Enhanced Multi-Task Understanding in Low Resource Settings

[19] Scaling Up, Speeding Up: A Benchmark of Speculative Decoding for Efficient LLM Test-Time Scaling

[20] ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute

[21] Training Text-to-Molecule Models with Context-Aware Tokenization

[22] An End-to-End System for Culturally-Attuned Driving Feedback using a Dual-Component NLG Engine

[23] No Clustering, No Routing: How Transformers Actually Process Rare Tokens

[24] Discrete Prompt Tuning via Recursive Utilization of Black-box Multimodal Large Language Model for Personalized Visual Emotion Recognition

[25] Energy Landscapes Enable Reliable Abstention in Retrieval-Augmented Large Language Models for Healthcare

[26] DecMetrics: Structured Claim Decomposition Scoring for Factually Consistent LLM Outputs

[27] The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors

[28] ASCENDgpt: A Phenotype-Aware Transformer Model for Cardiovascular Risk Prediction from Electronic Health Records

[29] Serialized Output Prompting for Large Language Model-based Multi-Talker Speech Recognition

[30] Refining Transcripts With TV Subtitles by Prompt-Based Weakly Supervised Training of ASR

[31] Learned Hallucination Detection in Black-Box LLMs using Token-level Entropy Production Rate

[32] A Narrative-Driven Computational Framework for Clinician Burnout Surveillance

[33] Where Should I Study? Biased Language Models Decide! Evaluating Fairness in LMs for Academic Recommendations

[34] DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence

[35] Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts

[36] Understanding Reinforcement Learning for Model Training, and future directions with GRAPE

[37] VaccineRAG: Boosting Multimodal Large Language Models' Immunity to Harmful RAG Samples

[38] Behavioral Fingerprinting of Large Language Models

[39] From Silent Signals to Natural Language: A Dual-Stage Transformer-LLM Approach

[40] ProST: Progressive Sub-task Training for Pareto-Optimal Multi-agent Systems Using Small Language Models

[41] Combine Virtual Reality and Machine-Learning to Identify the Presence of Dyslexia: A Cross-Linguistic Approach

[42] Scaling behavior of large language models in emotional safety classification across sizes and tasks

[43] Mitigation of Gender and Ethnicity Bias in AI-Generated Stories through Model Explanations

[44] Artificially Fluent: Swahili AI Performance Benchmarks Between English-Trained and Natively-Trained Datasets

[45] Analysis of Voluntarily Reported Data Post Mesh Implantation for Detecting Public Emotion and Identifying Concern Reports

[46] Advancing SLM Tool-Use Capability using Reinforcement Learning

[47] Hierarchical Section Matching Prediction (HSMP) BERT for Fine-Grained Extraction of Structured Data from Hebrew Free-Text Radiology Reports in Crohn's Disease

[48] Using LLMs to create analytical datasets: A case study of reconstructing the historical memory of Colombia

[49] Quantized Large Language Models in Biomedical Natural Language Processing: Evaluation and Recommendation

[50] Sample-efficient Integration of New Modalities into Large Language Models

[51] Manipulating Transformer-Based Models: Controllability, Steerability, and Robust Interventions

[52] Phonological Representation Learning for Isolated Signs Improves Out-of-Vocabulary Generalization

[53] Spoken in Jest, Detected in Earnest: A Systematic Review of Sarcasm Recognition -- Multimodal Fusion, Challenges, and Future Prospects

[54] PRIM: Towards Practical In-Image Multilingual Machine Translation

[55] Breaking to Build: A Threat Model of Prompt-Based Attacks for Securing LLMs

[56] Comparative Analysis of Transformer Models in Disaster Tweet Classification for Public Safety

[57] Polysemantic Dropout: Conformal OOD Detection for Specialized LLMs

[58] AraHalluEval: A Fine-grained Hallucination Evaluation Framework for Arabic LLMs

[59] Evaluating NL2SQL via SQL2NL

[60] Why Language Models Hallucinate

[61] ODKE+: Ontology-Guided Open-Domain Knowledge Extraction with LLMs

[62] OleSpeech-IV: A Large-Scale Multispeaker and Multilingual Conversational Speech Dataset with Diverse Topics

[63] KERAG: Knowledge-Enhanced Retrieval-Augmented Generation for Advanced Question Answering

[64] A Study of Large Language Models for Patient Information Extraction: Model Architecture, Fine-Tuning Strategy, and Multi-task Instruction Tuning

[65] Research on Multi-hop Inference Optimization of LLM Based on MQUAKE Framework

[66] Decoders Laugh as Loud as Encoders

[67] Enhancing Diversity in Large Language Models via Determinantal Point Processes

[68] Personality as a Probe for LLM Evaluation: Method Trade-offs and Downstream Effects

[69] Knowledge Collapse in LLMs: When Fluency Survives but Facts Fail under Recursive Synthetic Training

[70] Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

[71] Analyzing Finnish Inflectional Classes through Discriminative Lexicon and Deep Learning Models

[72] AFD-SLU: Adaptive Feature Distillation for Spoken Language Understanding

[73] Memorization $\neq$ Understanding: Do Large Language Models Have the Ability of Scenario Cognition?

[74] Using LLMs for Multilingual Clinical Entity Linking to ICD-10

[75] L1RA: Dynamic Rank Assignment in LoRA Fine-Tuning

[76] PLaMo 2 Technical Report

[77] ACE-RL: Adaptive Constraint-Enhanced Reward for Long-form Generation Reinforcement Learning

[78] Classification of kinetic-related injury in hospital triage data using NLP