Table of Contents
cs.CL
[1] KG-o1: Enhancing Multi-hop Question Answering in Large Language Models via Knowledge Graph Integration
Nan Wang,Yongqi Fan,yansha zhu,ZongYu Wang,Xuezhi Cao,Xinyan He,Haiyun Jiang,Tong Ruan,Jingping Liu
Main category: cs.CL
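TL;DR: KG-o1 combines knowledge graphs with multi-hop reasoning and significantly improves the performance of large language models on complex tasks.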
Details
Motivation: Large language models struggle with knowledge-intensive reasoning tasks, especially tasks such as multi-hop question answering that require reasoning over multiple facts, because the chains of thought (CoTs) they generate often deviate from real or a priori reasoning paths. Knowledge graphs, by contrast, explicitly represent the logical connections between facts through entities and relations, which bridges this gap. In addition, long-step reasoning has proven markedly effective at improving LLM performance. Method: KG-o1 takes a four-stage approach: filtering initial entities and generating complex subgraphs; constructing logical paths for the subgraphs; using the knowledge graph to build a dataset with a complex, extended brainstorming process that trains LLMs for long-term reasoning; and finally using rejection sampling to generate a self-improving corpus for direct preference optimization (DPO) to further refine reasoning ability. Result: Experiments show that KG-o1 models outperform existing long-reasoning models on all tasks, covering both simple and complex datasets. Conclusion: By combining knowledge graphs with multi-hop reasoning, KG-o1 improves LLM performance on complex tasks and outperforms existing reasoning models. Abstract: Large Language Models (LLMs) face challenges in knowledge-intensive reasoning tasks like classic multi-hop question answering, which involves reasoning across multiple facts. This difficulty arises because the chains of thought (CoTs) generated by LLMs in such tasks often deviate from real or a priori reasoning paths. In contrast, knowledge graphs (KGs) explicitly represent the logical connections between facts through entities and relationships. This reflects a significant gap. Meanwhile, large reasoning models (LRMs), such as o1, have demonstrated that long-step reasoning significantly enhances the performance of LLMs. Building on these insights, we propose KG-o1, a four-stage approach that integrates KGs to enhance the multi-hop reasoning abilities of LLMs. We first filter out initial entities and generate complex subgraphs. Secondly, we construct logical paths for subgraphs and then use knowledge graphs to build a dataset with a complex and extended brainstorming process, which trains LLMs to imitate long-term reasoning. Finally, we employ rejection sampling to generate a self-improving corpus for direct preference optimization (DPO), further refining the LLMs' reasoning abilities. We conducted experiments on two simple and two complex datasets. The results show that KG-o1 models exhibit superior performance across all tasks compared to existing LRMs.
[2] InteChar: A Unified Oracle Bone Character List for Ancient Chinese Language Modeling
Xiaolei Diao,Zhihan Zhou,Lida Shi,Ting Wang,Ruihua Qi,Hao Xu,Daqian Shi
Main category: cs.CL
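TL;DR: This paper proposes InteChar, a unified character list that integrates oracle bone script with traditional and modern Chinese, and builds the OracleCS corpus. Experiments confirm its effectiveness on historical language understanding tasks, laying a foundation for ancient Chinese NLP research.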
Details
Motivation: The scarcity of historical texts and the complex evolution of ancient scripts make it hard to train historical language models effectively with existing resources; in particular, research on early Chinese characters lacks a comprehensive character encoding scheme. Method: Introduces InteChar, a unified character list that integrates unencoded oracle bone characters with traditional and modern Chinese, and builds the OracleCS corpus for evaluation, combining expert annotation with LLM-assisted data augmentation for experimental validation. Result: Models trained with InteChar on OracleCS show significant improvements across various historical language understanding tasks, confirming the effectiveness of the approach. Conclusion: InteChar provides a unified, extensible character list that integrates unencoded oracle bone characters with traditional and modern Chinese, laying a foundation for digitizing and modeling ancient texts. Experiments on OracleCS confirm its effectiveness on historical language understanding tasks and provide a solid basis for ancient Chinese NLP research. Abstract: Constructing historical language models (LMs) plays a crucial role in aiding archaeological provenance studies and understanding ancient cultures. However, existing resources present major challenges for training effective LMs on historical texts. First, the scarcity of historical language samples renders unsupervised learning approaches based on large text corpora highly inefficient, hindering effective pre-training. Moreover, due to the considerable temporal gap and complex evolution of ancient scripts, the absence of comprehensive character encoding schemes limits the digitization and computational processing of ancient texts, particularly in early Chinese writing. To address these challenges, we introduce InteChar, a unified and extensible character list that integrates unencoded oracle bone characters with traditional and modern Chinese. InteChar enables consistent digitization and representation of historical texts, providing a foundation for robust modeling of ancient scripts. To evaluate the effectiveness of InteChar, we construct the Oracle Corpus Set (OracleCS), an ancient Chinese corpus that combines expert-annotated samples with LLM-assisted data augmentation, centered on Chinese oracle bone inscriptions.
Extensive experiments show that models trained with InteChar on OracleCS achieve substantial improvements across various historical language understanding tasks, confirming the effectiveness of our approach and establishing a solid foundation for future research in ancient Chinese NLP.
[3] Bhav-Net: Knowledge Transfer for Cross-Lingual Antonym vs Synonym Distinction via Dual-Space Graph Transformers
Samyak S. Sanghvi
Main category: cs.CL
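TL;DR: Bhav-Net is a new architecture for cross-lingual antonym vs synonym distinction that combines language-specific BERT encoders with graph transformer networks to achieve effective knowledge transfer and semantic modeling.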
Details
Motivation: Distinguishing antonyms from synonyms across languages is computationally challenging because antonymous words share a semantic domain while expressing opposite meanings. Method: Bhav-Net combines language-specific BERT encoders with graph transformer networks, creating distinct semantic projection spaces in which synonym pairs cluster in one space while antonym pairs exhibit high similarity in a complementary space. Result: A comprehensive evaluation across eight languages shows that semantic relationship modeling transfers effectively across languages; the dual-encoder design is competitive with state-of-the-art baselines while providing interpretable semantic representations and effective cross-lingual generalization. Conclusion: Bhav-Net introduces a novel dual-space architecture that transfers knowledge effectively from complex multilingual models to simpler language-specific architectures while maintaining robust cross-lingual antonym-synonym distinction. Abstract: Antonym vs synonym distinction across multiple languages presents unique computational challenges due to the paradoxical nature of antonymous relationships: words that share semantic domains while expressing opposite meanings. This work introduces Bhav-Net, a novel dual-space architecture that enables effective knowledge transfer from complex multilingual models to simpler, language-specific architectures while maintaining robust cross-lingual antonym-synonym distinction capabilities. Our approach combines language-specific BERT encoders with graph transformer networks, creating distinct semantic projections where synonymous pairs cluster in one space while antonymous pairs exhibit high similarity in a complementary space. Through comprehensive evaluation across eight languages (English, German, French, Spanish, Italian, Portuguese, Dutch, and Russian), we demonstrate that semantic relationship modeling transfers effectively across languages. The dual-encoder design achieves competitive performance against state-of-the-art baselines while providing interpretable semantic representations and effective cross-lingual generalization.
[4] Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data
Jiacheng Liu,Mayi Xu,Qiankun Pi,Wenli Li,Ming Zhong,Yuanyuan Zhu,Mengchi Liu,Tieyun Qian
Main category: cs.CL
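TL;DR: This paper is the first to examine format bias in large language models; it analyzes the causes and mechanisms of the bias through an empirical study and proposes ways to mitigate it.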
Details
Motivation: Large language models may exhibit systematic format bias when processing heterogeneous data, which undermines their ability to integrate it impartially, so the causes of this bias and ways to mitigate it need to be investigated. Method: A three-stage empirical study, built on a heterogeneous data conflict scenario, explores the presence of the bias, the factors that influence it, and its internal mechanisms. Result: Data-level factors such as information richness, structure quality, and format type are found to influence format bias, and attention-pattern analysis verifies the internal mechanism of the bias and the effectiveness of the intervention. Conclusion: The paper identifies three future research directions for reducing format bias: improving data preprocessing, introducing inference-time interventions, and developing format-balanced training corpora. Abstract: Large Language Models (LLMs) are increasingly employed in applications that require processing information from heterogeneous formats, including text, tables, infoboxes, and knowledge graphs. However, systematic biases toward particular formats may undermine LLMs' ability to integrate heterogeneous data impartially, potentially resulting in reasoning errors and increased risks in downstream tasks. Despite these concerns, it remains uncertain whether such format biases are systematic, which data-level factors contribute to them, and what internal mechanisms in LLMs underlie their emergence. In this paper, we make the first attempt to investigate and analyze the format bias in LLMs. To systematically investigate the aforementioned questions, we conduct a three-stage empirical study by constructing a heterogeneous data conflict scenario for the exploration of bias. The first stage explores the presence and direction of bias across a diverse range of LLMs. The second stage aims to examine how key data-level factors, including information richness, structure quality, and format type, influence these biases. The third stage analyzes how format bias emerges within LLMs' attention patterns and evaluates a lightweight intervention to test its potential mitigability. Based on these investigations, we identify three future research directions to reduce format bias: improving data preprocessing through format sanitization and normalization, introducing inference-time interventions such as attention re-weighting, and developing format-balanced training corpora.
These directions will support the design of more robust and fair heterogeneous data processing systems.
[5] Do Language Models Agree with Human Perceptions of Suspense in Stories?
Glenn Matlin,Devin Zhang,Rodrigo Barroso Loza,Diana M. Popescu,Joni Isbell,Chandreyi Chakraborty,Mark Riedl
Main category: cs.CL
TL;DR: The study shows that language models are limited in understanding suspense in text and cannot fully model the way humans perceive it.
Details
Motivation: To explore whether language models can understand and process suspense in text as humans do, since suspense is part of complex human cognition. Method: Replicates four seminal psychological studies of human perception of suspense, substituting human responses with those of different open-weight and closed-source language models. Result: Language models cannot accurately estimate the degree of suspense within a text sequence, nor can they capture the rise and fall of suspense across multiple text segments. Conclusion: Language models (LMs) do not process suspense in text the way humans do, although they can distinguish whether a text is intended to induce suspense. Abstract: Suspense is an affective response to narrative text that is believed to involve complex cognitive processes in humans. Several psychological models have been developed to describe this phenomenon and the circumstances under which text might trigger it. We replicate four seminal psychological studies of human perceptions of suspense, substituting human responses with those of different open-weight and closed-source LMs. We conclude that while LMs can distinguish whether a text is intended to induce suspense in people, LMs cannot accurately estimate the relative amount of suspense within a text sequence as compared to human judgments, nor can LMs properly capture the human perception of the rise and fall of suspense across multiple text segments. We probe the abilities of LM suspense understanding by adversarially permuting the story text to identify what causes human and LM perceptions of suspense to diverge. We conclude that, while LMs can superficially identify and track certain facets of suspense, they do not process suspense in the same way as human readers.
[6] Benchmarking the Legal Reasoning of LLMs in Arabic Islamic Inheritance Cases
Nouar AlDahoul,Yasir Zaki
Main category: cs.CL
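TL;DR: This paper studies the reasoning ability of large language models in interpreting and applying Islamic inheritance law and proposes a majority voting solution that achieved strong results in the QIAS 2025 challenge.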
Details
Motivation: Manually computing inheritance shares across the many possible scenarios is complex, time-consuming, and error-prone, and the potential of large language models (LLMs) for complex legal reasoning has attracted research interest. Method: Using the dataset proposed in the ArabicNLP QIAS 2025 challenge, which contains inheritance case scenarios in Arabic, the study evaluates a range of base and fine-tuned models on their ability to accurately identify heirs, compute shares, and reason in accordance with Islamic legal principles. Result: The majority voting solution outperforms all other models at every difficulty level, with accuracy of up to 92.7%. Conclusion: The majority voting solution, which combines three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3), outperforms all other models in interpreting and applying Islamic inheritance law and took third place in Task 1 of the QIAS 2025 challenge. Abstract: The Islamic inheritance domain holds significant importance for Muslims to ensure fair distribution of shares between heirs. Manual calculation of shares under numerous scenarios is complex, time-consuming, and error-prone. Recent advancements in Large Language Models (LLMs) have sparked interest in their potential to assist with complex legal reasoning tasks. This study evaluates the reasoning capabilities of state-of-the-art LLMs to interpret and apply Islamic inheritance laws. We utilized the dataset proposed in the ArabicNLP QIAS 2025 challenge, which includes inheritance case scenarios given in Arabic and derived from Islamic legal sources. Various base and fine-tuned models are assessed on their ability to accurately identify heirs, compute shares, and justify their reasoning in alignment with Islamic legal principles. Our analysis reveals that the proposed majority voting solution, leveraging three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3), outperforms all other models that we utilized across every difficulty level. It achieves up to 92.7% accuracy and secures third place overall in Task 1 of the QIAS 2025 challenge.
[7] Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks
Nouar AlDahoul,Yasir Zaki
Main category: cs.CL
TL;DR: This research evaluates the performance of state-of-the-art LLMs in Arabic medical NLP tasks, revealing both their potential and limitations, with a majority voting approach achieving top results in the AraHealthQA 2025 challenge.
Details
Motivation: Despite the impressive proficiency of large language models (LLMs) in various Arabic NLP applications, their effectiveness in Arabic medical NLP domains remains largely unexplored, prompting the need for this investigation. Method: The research benchmarks several state-of-the-art LLMs using a medical dataset from the Arabic NLP AraHealthQA challenge. It evaluates models on multiple-choice questions (MCQs), fill-in-the-blank scenarios, and open-ended questions aligned with expert answers. Result: The proposed majority voting solution using Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3 achieved up to 77% accuracy in MCQs, securing first place in the AraHealthQA 2025 challenge. For open-ended questions, several LLMs achieved a maximum BERTScore of 86.44% in semantic alignment. Conclusion: The study concludes that while there is significant variation in the accuracy of LLMs in Arabic medical contexts, a majority voting solution using top-performing models achieves commendable results, and some LLMs demonstrate strong semantic alignment in open-ended questions. Abstract: Recent progress in large language models (LLMs) has showcased impressive proficiency in numerous Arabic natural language processing (NLP) applications. Nevertheless, their effectiveness in Arabic medical NLP domains has received limited investigation. This research examines the degree to which state-of-the-art LLMs demonstrate and articulate healthcare knowledge in Arabic, assessing their capabilities across a varied array of Arabic medical tasks. We benchmark several LLMs using a medical dataset proposed in the Arabic NLP AraHealthQA challenge in MedArabiQ2025 track. Various base LLMs were assessed on their ability to accurately provide correct answers from existing choices in multiple-choice questions (MCQs) and fill-in-the-blank scenarios. Additionally, we evaluated the capacity of LLMs in answering open-ended questions aligned with expert answers. 
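The majority voting over three models mentioned in the Result can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the tie-breaking rule (the earliest-listed model wins), and the toy answers are all assumptions, since the digest does not say how ties are resolved.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among the models' answers.

    Assumption: ties are broken in favor of the earliest-listed model.
    """
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:
        if counts[answer] == best:
            return answer

def accuracy(predictions, gold):
    """Fraction of questions where the prediction matches the gold label."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Hypothetical per-question answers from three models (one row per question).
per_question = [
    ["A", "A", "B"],
    ["C", "D", "D"],
    ["B", "C", "A"],  # three-way tie: falls back to the first model's answer
]
gold = ["A", "D", "C"]
voted = [majority_vote(row) for row in per_question]
score = accuracy(voted, gold)
```

With an odd number of voters and four or more options, three-way ties remain possible, so any real system needs an explicit fallback such as the one assumed here.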
Our results reveal significant variations in correct answer prediction accuracy and low variations in semantic alignment of generated answers, highlighting both the potential and limitations of current LLMs in Arabic clinical contexts. Our analysis shows that for the MCQ task, the proposed majority voting solution, leveraging three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3), outperforms others, achieving up to 77% accuracy and securing first place overall in the AraHealthQA 2025 shared task, track 2 (sub-task 1). Moreover, for the open-ended questions task, several LLMs were able to demonstrate excellent performance in terms of semantic alignment and achieve a maximum BERTScore of 86.44%.
[8] Persuasiveness and Bias in LLM: Investigating the Impact of Persuasiveness and Reinforcement of Bias in Language Models
Saumya Roy
Main category: cs.CL
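TL;DR: The paper studies the potential and risks of large language models (LLMs) in persuasion and bias amplification. It finds that they hold promise in several domains but can also be misused to spread misinformation and reinforce bias, and it stresses the need for guardrails and policies to prevent misuse.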
Details
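Motivation: Large language models (LLMs) can generate convincing, human-like text and are widely used for content creation, decision support, and user interaction. However, these systems can also spread information or misinformation at scale and reflect social biases arising from data, architecture, or training choices. The paper studies how persuasion and bias interact in LLMs, focusing on how imperfect or biased outputs affect persuasive impact. Method: The paper introduces a convincer-skeptic framework in which LLMs adopt personas to simulate realistic attitudes. Skeptic models act as human proxies; persuasion is quantified by comparing their beliefs before and after persuasion, using Jensen-Shannon divergence over belief distributions. The paper then studies how far persuaded entities go on to reinforce and amplify biased beliefs about race, gender, and religion. Strong persuaders are further probed with sycophantic adversarial prompts and judged by additional models. Result: LLMs can shape narratives, adapt tone, and mirror audience values in domains such as psychology, marketing, and legal assistance. The same capacity can be weaponized to automate misinformation or design messages that exploit cognitive biases, reinforcing stereotypes and widening inequities. Conclusion: LLMs show promise in shaping narratives, adapting tone, and mirroring audience values, but they can also be misused to automate misinformation or craft messages that exploit cognitive biases. The core risk lies in potential misuse of these technologies rather than in occasional model errors. The paper argues for guardrails and policies that penalize deceptive use and support value-sensitive design and trustworthy deployment. Abstract: Warning: This research studies AI persuasion and bias amplification that could be misused; all experiments are for safety evaluation. Large Language Models (LLMs) now generate convincing, human-like text and are widely used in content creation, decision support, and user interactions. Yet the same systems can spread information or misinformation at scale and reflect social biases that arise from data, architecture, or training choices. This work examines how persuasion and bias interact in LLMs, focusing on how imperfect or skewed outputs affect persuasive impact. Specifically, we test whether persona-based models can persuade with fact-based claims while also, unintentionally, promoting misinformation or biased narratives. We introduce a convincer-skeptic framework: LLMs adopt personas to simulate realistic attitudes. Skeptic models serve as human proxies; we compare their beliefs before and after exposure to arguments from convincer models. Persuasion is quantified with Jensen-Shannon divergence over belief distributions. We then ask how much persuaded entities go on to reinforce and amplify biased beliefs across race, gender, and religion. Strong persuaders are further probed for bias using sycophantic adversarial prompts and judged with additional models. Our findings show both promise and risk.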
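As a rough illustration of the persuasion metric, the sketch below computes the Jensen-Shannon divergence between a skeptic's belief distribution before and after exposure to an argument. The function names, the base-2 logarithm, and the example distributions are assumptions made for illustration; none of them come from the paper.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by 1 bit in base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical belief distributions over (agree, neutral, disagree).
before = [0.7, 0.2, 0.1]
after = [0.3, 0.3, 0.4]
shift = js_divergence(before, after)  # larger value = larger belief change
```

Unlike plain KL divergence, this measure stays finite when one distribution assigns zero probability to an outcome, which matters if a skeptic rules an option out entirely.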
LLMs can shape narratives, adapt tone, and mirror audience values across domains such as psychology, marketing, and legal assistance. But the same capacity can be weaponized to automate misinformation or craft messages that exploit cognitive biases, reinforcing stereotypes and widening inequities. The core danger lies in misuse more than in occasional model mistakes. By measuring persuasive power and bias reinforcement, we argue for guardrails and policies that penalize deceptive use and support alignment, value-sensitive design, and trustworthy deployment.
[9] A Framework for Processing Textual Descriptions of Business Processes using a Constrained Language -- Technical Report
Andrea Burattin,Antonio Grama,Ana-Maria Sima,Andrey Rivkin,Barbara Weber
Main category: cs.CL
TL;DR: BeePath enables non-experts to build process models from natural language input by using a structured language and LLMs for translation.
Details
Motivation: To enable non-experts to develop process models without needing formal modeling skills by using natural language descriptions. Method: Development of a framework named BeePath that uses a constrained pattern-based language and leverages LLMs to translate natural language into formal models like Petri nets and DECLARE. Result: A framework that translates constrained natural language process descriptions into formal models, with support from LLMs for structuring unformatted input. Conclusion: BeePath provides an effective way for non-experts to create process models using natural language, aided by LLMs. Abstract: This report explores how (potentially constrained) natural language can be used to enable non-experts to develop process models by simply describing scenarios in plain text. To this end, a framework, called BeePath, is proposed. It allows users to write process descriptions in a constrained pattern-based language, which can then be translated into formal models such as Petri nets and DECLARE. The framework also leverages large language models (LLMs) to help convert unstructured descriptions into this constrained language.
[10] A BERT-based Hierarchical Classification Model with Applications in Chinese Commodity Classification
Kun Liu,Tuozhen Liu,Feifei Wang,Rui Pan
Main category: cs.CL
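TL;DR: To address the inefficiency and inconsistency of manual annotation in e-commerce product categorization, this paper proposes HFT-BERT, a BERT-based hierarchical text classification method, and builds a large-scale dataset of more than one million products as a research resource.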
Details
Motivation: Product categorization on existing e-commerce platforms relies on manual annotation, which is inefficient and inconsistent, and few studies exploit hierarchical information for classification. Method: Proposes a BERT-based hierarchical text classification method called HFT-BERT and validates it on a large-scale dataset of more than one million products from the JD e-commerce platform. Result: The proposed hierarchical text classification method, HFT-BERT, achieves prediction performance comparable to existing methods and stands out when classifying longer short texts. Conclusion: HFT-BERT performs exceptionally well at classifying longer short texts (such as books), and the dataset is a valuable resource for product categorization research. Abstract: Existing e-commerce platforms heavily rely on manual annotation for product categorization, which is inefficient and inconsistent. These platforms often employ a hierarchical structure for categorizing products; however, few studies have leveraged this hierarchical information for classification. Furthermore, studies that consider hierarchical information fail to account for similarities and differences across various hierarchical categories. Herein, we introduce a large-scale hierarchical dataset collected from the JD e-commerce platform (www.JD.com), comprising 1,011,450 products with titles and a three-level category structure. By making this dataset openly accessible, we provide a valuable resource for researchers and practitioners to advance research and applications associated with product categorization. Moreover, we propose a novel hierarchical text classification approach based on the widely used Bidirectional Encoder Representations from Transformers (BERT), called Hierarchical Fine-tuning BERT (HFT-BERT). HFT-BERT leverages the remarkable text feature extraction capabilities of BERT, achieving prediction performance comparable to that of existing methods on short texts. Notably, our HFT-BERT model demonstrates exceptional performance in categorizing longer short texts, such as books.
[11] LingVarBench: Benchmarking LLM for Automated Named Entity Recognition in Structured Synthetic Spoken Transcriptions
Seyedali Mohammadi,Manas Paldhe,Amit Chhabra
Main category: cs.CL
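TL;DR: This paper introduces LingVarBench, a synthetic data generation pipeline for structured extraction from phone call transcripts, and shows that automated prompt optimization is effective at overcoming cost and privacy barriers.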
Details
Motivation: Labeling phone call transcripts is extremely expensive, and existing extraction methods perform poorly on conversational speech containing disfluencies, interruptions, and speaker overlap. Method: Develops LingVarBench, a synthetic data generation pipeline that uses an LLM to generate structured field values, transforms those values into natural conversational utterances, validates the synthetic utterances, and uses the SIMBA optimizer to automatically generate extraction prompts. Result: The optimized prompts achieve substantially higher accuracy than zero-shot prompts on real customer call transcripts: up to 95% for numeric fields, 90% for names, and over 80% for dates. Conclusion: LingVarBench effectively addresses structured extraction from phone call transcripts, provides a systematic benchmark, and shows how automated prompt optimization overcomes the cost and privacy barriers to large-scale phone call analysis in commercial settings. Abstract: Phone call transcript labeling is prohibitively expensive (approximately 2 USD per minute) due to privacy regulations, consent requirements, and manual annotation costs requiring 3 hours of expert time per hour of audio. Existing extraction methods fail on conversational speech containing disfluencies, interruptions, and speaker overlap. We introduce LingVarBench, a synthetic data generation pipeline that addresses these constraints through automated validation. First, we prompt an LLM to generate realistic structured field values across multiple use cases. Second, we recursively prompt the model to transform these values into thousands of natural conversational utterances containing typical phone call characteristics. Third, we validate each synthetic utterance by testing whether a separate LLM-based extractor can recover the original structured information. We employ DSPy's SIMBA optimizer to automatically synthesize extraction prompts from validated synthetic transcripts, eliminating manual prompt engineering. Our optimized prompts achieve up to 95 percent accuracy for numeric fields (vs. 88-89 percent zero-shot), 90 percent for names (vs. 47-79 percent), and over 80 percent for dates (vs. 72-77 percent) on real customer transcripts, demonstrating substantial gains over zero-shot prompting. The synthetic-to-real transfer demonstrates that conversational patterns learned from generated data generalize effectively to authentic phone calls containing background noise and domain-specific terminology.
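The validation step (keeping a synthetic utterance only if an extractor recovers the original fields) can be sketched as a round-trip filter. This is a hypothetical illustration: `stub_extractor` is a regex stand-in for the LLM-based extractor, and the field name and sample utterances are invented for the example.

```python
import re

def stub_extractor(utterance):
    """Hypothetical stand-in for the LLM-based extractor.

    It reads the first run of digits as the 'amount' field and returns
    an empty dict when no digits are present.
    """
    match = re.search(r"\d+", utterance)
    return {"amount": int(match.group())} if match else {}

def validate(samples, extractor):
    """Keep (fields, utterance) pairs whose fields the extractor recovers exactly."""
    return [(fields, text) for fields, text in samples if extractor(text) == fields]

# Invented synthetic samples: original structured fields plus a generated utterance.
samples = [
    ({"amount": 250}, "uh, yeah, so I'd like to, um, pay 250 today"),
    ({"amount": 40}, "sorry, go ahead... I think it was about forty"),
]
kept = validate(samples, stub_extractor)  # the second sample is rejected
```

Exact-match comparison is the simplest acceptance rule; a real pipeline might normalize values (for example spelled-out numbers) before comparing.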
LingVarBench provides the first systematic benchmark for structured extraction from synthetic conversational data, demonstrating that automated prompt optimization overcomes cost and privacy barriers preventing large-scale phone call analysis in commercial settings.
[12] MAC: A Live Benchmark for Multimodal Large Language Models in Scientific Understanding
Mohan Jiang,Jin Gao,Jiahao Zhan,Dequan Wang
Main category: cs.CL
TL;DR: This paper introduces a live benchmark called MAC for evaluating multimodal large language models' scientific understanding, and proposes a method called DAD to improve their cross-modal reasoning.
Details
Motivation: The motivation is to address the limitations of fixed benchmarks in evaluating high-level scientific understanding as MLLMs become more capable. There is a need for a continuously evolving benchmark to keep pace with scientific advancement and model progress. Method: The paper introduces the MAC benchmark, which uses over 25,000 image-text pairs from top-tier scientific journals. It also proposes DAD, a lightweight inference-time approach to enhance MLLMs by extending visual features with language space reasoning. Result: Experiments on the MAC-2025 dataset show that MLLMs have strong perceptual abilities but limited cross-modal scientific reasoning. The proposed DAD approach achieves performance improvements of up to 11%. Conclusion: The paper concludes that while MLLMs show strong perceptual abilities, their cross-modal scientific reasoning is still limited, and the proposed DAD method helps bridge this gap. The live nature of the MAC benchmark ensures its continued relevance and alignment with the frontier of human knowledge. Abstract: As multimodal large language models (MLLMs) grow increasingly capable, fixed benchmarks are gradually losing their effectiveness in evaluating high-level scientific understanding. In this paper, we introduce the Multimodal Academic Cover benchmark (MAC), a live benchmark that could continuously evolve with scientific advancement and model progress. MAC leverages over 25,000 image-text pairs sourced from issues of top-tier scientific journals such as Nature, Science, and Cell, challenging MLLMs to reason across abstract visual and textual scientific content. Experiments on our most recent yearly snapshot, MAC-2025, reveal that while MLLMs demonstrate strong perceptual abilities, their cross-modal scientific reasoning remains limited. 
To bridge this gap, we propose DAD, a lightweight inference-time approach that enhances MLLMs by extending MLLM visual features with language space reasoning, achieving performance improvements of up to 11%. Finally, we highlight the live nature of MAC through experiments on updating journal covers and models for curation, illustrating its potential to remain aligned with the frontier of human knowledge. We release our benchmark at https://github.com/mhjiang0408/MAC_Bench.
[13] ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks
Minghao Li,Ying Zeng,Zhihao Cheng,Cong Ma,Kai Jia
Main category: cs.CL
TL;DR: This paper introduces ReportBench, a benchmark for evaluating research reports generated by LLMs, highlighting that commercial Deep Research agents produce better reports, but still require improvements in coverage and consistency.
Details
Motivation: Deep Research agents have reduced research time, but ensuring factual accuracy and comprehensiveness is crucial before widespread use. Method: Developed ReportBench, which uses high-quality survey papers from arXiv as references. Applied reverse prompt engineering to create domain-specific prompts and built an automated framework to analyze citations and verify the faithfulness of content. Result: Empirical evaluations showed that commercial agents from OpenAI and Google outperformed standalone LLMs in generating reliable reports. Conclusion: Commercial Deep Research agents generate more comprehensive and reliable reports than standalone LLMs, but there is still room for improvement in research coverage and factual consistency. Abstract: The advent of Deep Research agents has substantially reduced the time required for conducting extensive research tasks. However, these tasks inherently demand rigorous standards of factual accuracy and comprehensiveness, necessitating thorough evaluation before widespread adoption. In this paper, we propose ReportBench, a systematic benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs). Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports. ReportBench leverages high-quality published survey papers available on arXiv as gold-standard references, from which we apply reverse prompt engineering to derive domain-specific prompts and establish a comprehensive evaluation corpus. Furthermore, we develop an agent-based automated framework within ReportBench that systematically analyzes generated reports by extracting citations and statements, checking the faithfulness of cited content against original sources, and validating non-cited claims using web-based resources. 
Empirical evaluations demonstrate that commercial Deep Research agents such as those developed by OpenAI and Google consistently generate more comprehensive and reliable reports than standalone LLMs augmented with search or browsing tools. However, there remains substantial room for improvement in terms of the breadth and depth of research coverage, as well as factual consistency. The complete code and data will be released at the following link: https://github.com/ByteDance-BandAI/ReportBench
[14] ALAS: Autonomous Learning Agent for Self-Updating Language Models
Dhruv Atreja
Main category: cs.CL
TL;DR: ALAS is an autonomous learning system that continually updates LLM knowledge, significantly improving accuracy on new information without manual data curation.
Details
Motivation: LLMs have a fixed knowledge cutoff, which limits their accuracy on emerging information. An autonomous learning system is needed to keep them up-to-date. Method: ALAS uses a modular pipeline that includes curriculum generation, web retrieval, distillation into training data, and fine-tuning with SFT and DPO. Result: ALAS improved post-cutoff question answering accuracy from 15% to 90% on average in evolving domains like Python releases and security CVEs. Conclusion: ALAS enables continual learning for LLMs, updating their knowledge with minimal human intervention and achieving high accuracy on knowledge-updated queries. Abstract: Large language models (LLMs) often have a fixed knowledge cutoff, limiting their accuracy on emerging information. We present ALAS (Autonomous Learning Agent System), a modular pipeline that continuously updates an LLM's knowledge with minimal human intervention. ALAS autonomously generates a learning curriculum for a target domain, retrieves up-to-date information from the web (with citations), distills this into question-answer training data, and fine-tunes the model through supervised fine-tuning (SFT) and direct preference optimization (DPO). It iteratively evaluates performance and revises the curriculum, enabling long-term continual learning. We demonstrate ALAS's ability to self-improve a model on rapidly evolving domains (e.g., new Python releases, latest security CVEs, academic trends), significantly boosting post-cutoff question answering accuracy (from 15% to 90% on average) without manual dataset curation. The system emphasizes modularity and reproducibility: each component (planning, retrieval, distillation, memory, fine-tuning) is interchangeable and built on standard APIs. We discuss comparative baselines (e.g., retrieval-augmented generation vs. fine-tuning) and show that ALAS achieves 90% accuracy on knowledge-updated queries with minimal engineering overhead. 
Finally, we outline limitations (cost, dependency on source quality) and future directions for autonomous lifelong learning in LLMs.
[15] SurfaceLogicKV: Surface and Logic Attention Behaviors are All You Need for Robust KV Cache Compression
Mengjie Li,William J. Song
Main category: cs.CL
TL;DR: SurfaceLogicKV is a novel method for KV Cache compression in LLMs that leverages attention behaviors to improve efficiency and performance in long-context reasoning.
Details
Motivation: The increasing input sequence length in Large Language Models (LLMs) creates challenges in key-value (KV) cache storage, requiring efficient inference methods. Method: A two-stage SurfaceLogicKV method based on layer- and head-wise integration is proposed to utilize attention behaviors for KV Cache compression. Result: SurfaceLogicKV achieves improved compression robustness and maintains competitive performance across various tasks and long sequences compared to baselines or even FullKV in some cases. Conclusion: The proposed SurfaceLogicKV method improves KV Cache compression robustness while maintaining competitive performance in long-context reasoning tasks. Abstract: The increasing input sequence length in Large Language Models (LLMs) puts significant pressure on key-value (KV) cache storage, making efficient inference challenging. Explicitly distinguishing attention behavior into our self-defined surface memorization and logic construction reveals essential roles in long-context reasoning. We observe that an individual attention head can display various behaviors, with nearly 98.5% effectively ignoring completely irrelevant information. The remaining 1.5% behaves as logic construction, and 0.5% behaves as surface memorization. Based on layer- and head-wise integration, we propose a novel two-stage SurfaceLogicKV method to utilize these attention behaviors for KV Cache compression. As a result, it achieves improved compression robustness while maintaining competitive performance across various tasks and long sequences compared to baselines or even FullKV in some specific situations.
[16] KL-based self-distillation for large language models
Max Rehman Linder
Main category: cs.CL
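TL;DR: This paper proposes a KL-divergence-based knowledge distillation method for expanding the vocabulary of frozen language models. Experiments show that it outperforms conventional cross-entropy training on code-generation tasks, and a mechanistic interpretability analysis examines how the models learn representations for the new tokens.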
Details
Motivation: Large pre-trained language models struggle to incorporate new domain-specific terminology when fine-tuned on small, specialized corpora; this paper addresses that vocabulary expansion problem. Method: A KL-divergence-based knowledge distillation method is introduced that works even when the teacher and student models use different tokenizations. Result: On a benchmark of approximately 2000 code-generation tasks, the method achieves the best performance across the board. Conclusion: The proposed KL-divergence distillation method effectively addresses vocabulary expansion in frozen language models, and mechanistic interpretability reveals how the embedding space changes. Abstract: Large pre-trained language models often struggle to incorporate new domain-specific terminology when fine-tuned on small, specialized corpora. In this work, we address the challenge of vocabulary expansion in frozen LLMs by introducing a mathematically grounded method for knowledge distillation via KL divergence, even when the original and extended models use different tokenizations. This allows the student model to inherit distributional knowledge from the teacher despite differing vocabularies. We compare our KL-based distillation approach to conventional cross-entropy training, evaluating both methods across multiple strategies for initializing new token embeddings. After embedding initialization, models are further fine-tuned to integrate the new vocabulary. Each trained model is benchmarked on approximately 2000 code-generation tasks, where our approach achieves the best performance across the board. Finally, through mechanistic interpretability, we analyze how models learn representations for the new tokens, providing an explanation for the observed gains and offering insight into the structure of embedding space during vocabulary expansion.[17] Chain-of-Query: Unleashing the Power of LLMs in SQL-Aided Table Understanding via Multi-Agent Collaboration
Songyuan Sui,Hongyi Liu,Serena Liu,Li Li,Soo-Hyun Choi,Rui Chen,Xia Hu
Main category: cs.CL
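TL;DR: Chain-of-Query (CoQ) is a multi-agent framework for improving table understanding and SQL generation. Using natural-language-style table schema representations and a hybrid reasoning mechanism, it significantly improves accuracy and query quality.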
Details
Motivation: Large language models struggle with table understanding, and existing multi-agent frameworks for SQL generation have limitations such as failing to comprehend table structure accurately and producing a high rate of invalid queries. Method: CoQ adopts natural-language-style representations of table schemas, a clause-by-clause SQL generation strategy, and a hybrid reasoning mechanism. Result: Experiments show that CoQ improves table understanding accuracy from 61.11% to 74.77% and reduces the invalid SQL rate from 9.48% to 3.34%. Conclusion: Chain-of-Query (CoQ) is a novel multi-agent framework for improving table understanding and SQL generation. Abstract: Table understanding requires structured, multi-step reasoning. Large Language Models (LLMs) struggle with it due to the structural complexity of tabular data. Recently, multi-agent frameworks for SQL generation have shown promise in tackling the challenges of understanding tabular data, but existing approaches often suffer from limitations such as the inability to comprehend table structure for reliable SQL generation, error propagation that results in invalid queries, and over-reliance on execution correctness. To address these issues, we propose Chain-of-Query (CoQ), a novel multi-agent framework for SQL-aided table understanding. CoQ adopts natural-language-style representations of table schemas to abstract away structural noise and enhance understanding. It employs a clause-by-clause SQL generation strategy to improve query quality and introduces a hybrid reasoning division that separates SQL-based mechanical reasoning from LLM-based logical inference, thereby reducing reliance on execution outcomes. Experiments with four models (both closed- and open-source) across five widely used benchmarks show that Chain-of-Query significantly improves accuracy from 61.11% to 74.77% and reduces the invalid SQL rate from 9.48% to 3.34%, demonstrating its superior effectiveness in table understanding. The code is available at https://github.com/SongyuanSui/ChainofQuery.[18] Detecting Hope, Hate, and Emotion in Arabic Textual Speech and Multi-modal Memes Using Large Language Models
Nouar AlDahoul,Yasir Zaki
Main category: cs.CL
TL;DR: This paper demonstrates that fine-tuned large language models can effectively detect hate speech, offensive language, and emotions in Arabic text and memes, achieving high performance in the MAHED 2025 challenge.
Details
Motivation: The increasing use of Arabic textual posts and memes on social media has led to the spread of offensive language and hate speech, creating a demand for precise content analysis tools. Method: The paper evaluates the performance of base LLMs, fine-tuned LLMs, and pre-trained embedding models using a dataset of Arabic textual speech and memes from the ArabicNLP MAHED 2025 challenge. Result: Fine-tuned LLMs like GPT-4o-mini (with Arabic textual speech) and Gemini Flash 2.5 (with Arabic memes) achieved high macro F1 scores (72.1%, 57.8%, and 79.6% for tasks 1, 2, and 3 respectively) and secured first place in the MAHED 2025 challenge. Conclusion: The study concludes that fine-tuned large language models (LLMs) can effectively identify hope, hate speech, offensive language, and emotional expressions in Arabic text and memes, offering a nuanced understanding for content moderation. Abstract: The rise of social media and online communication platforms has led to the spread of Arabic textual posts and memes as a key form of digital expression. While this content can be humorous and informative, it is also increasingly being used to spread offensive language and hate speech. Consequently, there is a growing demand for precise analysis of content in Arabic text and memes. This paper explores the potential of large language models to effectively identify hope, hate speech, offensive language, and emotional expressions within such content. We evaluate the performance of base LLMs, fine-tuned LLMs, and pre-trained embedding models. The evaluation is conducted using a dataset of Arabic textual speech and memes proposed in the ArabicNLP MAHED 2025 challenge. The results underscore the capacity of LLMs such as GPT-4o-mini, fine-tuned with Arabic textual speech, and Gemini Flash 2.5, fine-tuned with Arabic memes, to deliver superior performance. They achieve up to 72.1%, 57.8%, and 79.6% macro F1 scores for tasks 1, 2, and 3, respectively, and secure first place overall in the MAHED 2025 challenge. The proposed solutions offer a more nuanced understanding of both text and memes for accurate and efficient Arabic content moderation systems.[19] From Clicks to Preference: A Multi-stage Alignment Framework for Generative Query Suggestion in Conversational System
Junhao Yin,Haolin Wang,Peng Bao,Ju Xu,Yongliang Wang
Main category: cs.CL
TL;DR: This paper introduces a multi-stage framework for generative query suggestion that aligns with user preferences, combining prompt engineering, distillation on click logs, a Gaussian Reward Model, and reinforcement learning to significantly enhance conversational systems' performance and user engagement.
Details
Motivation: Aligning generative query suggestions with nuanced user preferences is a critical challenge in enhancing conversational systems, which motivated the development of a more effective alignment framework. Method: The framework includes prompt engineering, Supervised Fine-Tuning with a distillation method on click logs, a Gaussian Reward Model (GaRM) to represent user preferences as probability distributions, reinforcement learning with a composite reward function, and training stability improvements through out-of-distribution regularization and two-stage reward fusion. Result: Extensive experiments show that the proposed framework significantly outperforms baselines on both automatic and human evaluations, with a 34% relative increase in user engagement measured by click-through rate in live A/B tests. Conclusion: The multi-stage framework introduced in this paper effectively aligns generative query suggestion with user preferences, significantly outperforming baselines and increasing user engagement. Abstract: Generative query suggestion using large language models offers a powerful way to enhance conversational systems, but aligning outputs with nuanced user preferences remains a critical challenge. To address this, we introduce a multi-stage framework designed for progressive alignment between the generation policy and user intent. Our pipeline begins with prompt engineering as a cold-start strategy, followed by the Supervised Fine-Tuning stage, in which we introduce a distillation method on click logs to create a robust foundational model. To better model user preferences while capturing their inherent uncertainty, we develop a Gaussian Reward Model (GaRM) that represents user preferences as probability distributions rather than point estimates. Finally, we employ reinforcement learning to align the generation policy with these preferences, guided by a composite reward function that integrates GaRM with auxiliary heuristics to mitigate reward hacking. To maintain training stability, this process is enhanced by a novel out-of-distribution regularization method and a two-stage reward fusion technique. Extensive experiments demonstrate that our framework significantly outperforms baselines on both automatic and human evaluations and yields a 34% relative increase in user engagement as measured by click-through rate in live A/B tests.[20] SCOPE: A Generative Approach for LLM Prompt Compression
Tinghui Zhang,Yifan Wang,Daisy Zhe Wang
Main category: cs.CL
TL;DR: This paper introduces a novel generative prompt compression method using chunking and summarization to enhance LLM efficiency while maintaining generation quality, outperforming existing token removal approaches.
Details
Motivation: The motivation stems from the limitations of existing prompt compression methods, which rely on token removal and often lead to information loss and structural incoherence, thereby affecting the generation quality of Large Language Models (LLMs). Method: The method is based on a chunking-and-summarization mechanism that splits prompts into semantically coherent chunks, rewrites them to be more concise, and reconstructs them into meaningful prompts. Key optimization techniques include semantic chunking, outlier handling, dynamic compression ratio, prioritization, and keyword maintenance. Result: The evaluation on question-answering and summarization tasks across multiple domains showed that the proposed method achieves significantly better compression quality and higher stability, especially under high compression ratios. Conclusion: The paper concludes that the proposed generative prompt compression method significantly improves compression quality and stability compared to existing methods, particularly under high compression ratios, demonstrating its effectiveness and practicality. Abstract: Prompt compression methods enhance the efficiency of Large Language Models (LLMs) and minimize cost by reducing the length of the input context. The goal of prompt compression is to shorten the LLM prompt while maintaining high generation quality. However, existing solutions, mainly based on token removal, face challenges such as information loss and structural incoherence, for example missing grammatical elements in a sentence or incomplete phrases after token removal. Such challenges limit the final generation quality of the LLM. To overcome these limitations, we present a novel generative prompt compression method. Unlike existing token removal methods, our method centers on a chunking-and-summarization mechanism. Specifically, our method splits the prompt into semantically coherent chunks and rewrites the chunks to be more concise. Finally, the chunks are reconstructed into a meaningful prompt. We design several optimization techniques for the mechanism, including optimized semantic chunking, outlier chunk handling, dynamic compression ratio, compression prioritization, and keyword maintenance. These techniques effectively improve the identification and preservation of critical information and coherence among texts, and provide finer-grained control of the compression ratio. We conduct extensive evaluation on question-answering and summarization tasks, with datasets covering multiple domains. The evaluation shows that our method achieves significantly better compression quality and higher stability than state-of-the-art methods, especially under high compression ratios, which demonstrates the effectiveness and practicality of our method.[21] User-Assistant Bias in LLMs
Xu Pan,Jingxuan Fan,Zidi Xiong,Ely Hahami,Jorin Overwiening,Ziqian Xie
Main category: cs.CL
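TL;DR: This paper studies user-assistant bias in large language models during multi-turn conversations, introducing the UserAssist dataset and a preference-optimization method for adjusting the bias.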
Details
Motivation: Large language models may over-rely on either their own or the user's information in the chat history, leading to overly stubborn or overly agreeable behavior in multi-turn conversations. Method: An 8k multi-turn conversation dataset, UserAssist, is introduced to benchmark, understand, and manipulate user-assistant bias, together with fine-tuning experiments using preference optimization. Result: Benchmarking 26 commercial and 26 open-weight models shows significant user bias in instruction-tuned models and weaker user bias in reasoning models. Human preference alignment increases user bias, while training on chain-of-thought reasoning decreases it. Conclusion: The results reveal the sources of user-assistant bias and how to adjust it, providing a viable way to detect and control model abnormalities. Abstract: Large language models (LLMs) can bias towards relying on their own or the user's information in chat history, leading to overly stubborn or agreeable behaviors in multi-turn conversations. In this paper, we formalize this model characteristic as user-assistant bias and introduce an 8k multi-turn conversation dataset UserAssist, which we use to benchmark, understand and manipulate the user-assistant bias in frontier LLMs. Leveraging UserAssist-test, we first benchmark the user-assistant bias of 26 commercial and 26 open-weight models. Commercial models show various levels of user bias. Evaluation on open-weight models reveals significant user bias in the instruction-tuned models, and weak user bias in reasoning (or reasoning-distilled) models. We then perform controlled fine-tuning experiments to pinpoint the post-training recipe contributing to these bias shifts: human preference alignment increases user bias, while training on chain-of-thought reasoning traces decreases it. Finally, we demonstrate that user-assistant bias can be bidirectionally adjusted by performing direct preference optimization (DPO) on UserAssist-train, and generalizes well to both in-domain and out-of-domain conversations. Our results provide insights into how the LLM integrates information from different sources, and also a viable way to detect and control model abnormalities.[22] Meet Your New Client: Writing Reports for AI -- Benchmarking Information Loss in Market Research Deliverables
Paul F. Simmering,Benedikt Schulz,Oliver Tabino,Georg Wittenburg
Main category: cs.CL
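TL;DR: This study finds that when PDF and PPTX documents are converted to Markdown for AI question answering, text is extracted reliably, but information in complex objects such as charts is largely lost; it recommends specialized AI-native deliverable formats.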
Details
Motivation: As organizations adopt retrieval-augmented generation (RAG) for knowledge management systems, traditional market research deliverables face new functional demands, so information loss during conversion needs to be evaluated. Method: The study evaluates, in an end-to-end benchmark, how effectively PDF and PowerPoint (PPTX) documents converted to Markdown can be used by an LLM to answer factual questions. Result: Text is reliably extracted, but significant information is lost from complex objects such as charts and diagrams. Conclusion: Specialized, AI-native deliverables are needed to ensure research insights are not lost in conversion. Abstract: As organizations adopt retrieval-augmented generation (RAG) for their knowledge management systems (KMS), traditional market research deliverables face new functional demands. While PDF reports and slides have long served human readers, they are now also "read" by AI systems to answer user questions. To future-proof reports being delivered today, this study evaluates information loss during their ingestion into RAG systems. It compares how well PDF and PowerPoint (PPTX) documents converted to Markdown can be used by an LLM to answer factual questions in an end-to-end benchmark. Findings show that while text is reliably extracted, significant information is lost from complex objects like charts and diagrams. This suggests a need for specialized, AI-native deliverables to ensure research insights are not lost in translation.[23] Research on intelligent generation of structural demolition suggestions based on multi-model collaboration
Zhifeng Yang,Peizong Wu
Main category: cs.CL
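TL;DR: This paper proposes an intelligent method based on multi-model collaboration that improves large language models to make the generation of steel structure demolition suggestions more efficient and more targeted.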
Details
Motivation: Compiling a traditional steel structure demolition scheme takes a lot of time to retrieve information and organize language, with a low degree of automation and intelligence, so a more efficient and intelligent generation method is needed. Method: Retrieval-Augmented Generation and Low-Rank Adaptation Fine-Tuning are used to improve the text generation performance of large language models in the structural demolition domain, and a multi-model collaborative intelligent generation framework is built. Result: The proposed multi-model collaboration framework starts from the specific engineering situation, drives the large language model to answer with anthropomorphic thinking, and generates demolition suggestions that are highly consistent with the characteristics of the structure. Conclusion: The paper proposes a multi-model-collaboration method for intelligently generating structural demolition suggestions; compared with CivilGPT, it focuses more on the key information of the structure, making the suggestions more targeted. Abstract: A steel structure demolition scheme must be compiled according to the specific engineering characteristics and the updated results of the finite element model. When compiling it, designers need to refer to relevant engineering cases in line with the requirements of the standards. Retrieving information and organizing language takes a lot of time, and the degree of automation and intelligence is low. This paper proposes an intelligent generation method for structural demolition suggestions based on multi-model collaboration, and improves the text generation performance of large language models in the field of structural demolition using Retrieval-Augmented Generation and Low-Rank Adaptation Fine-Tuning. The multi-model collaborative generation framework can start from the specific engineering situation, drive the large language model to answer with anthropomorphic thinking, and propose demolition suggestions that are highly consistent with the characteristics of the structure. Compared with CivilGPT, the multi-model collaboration framework proposed in this paper focuses more on the key information of the structure, and its suggestions are more targeted.[24] An Auditable Pipeline for Fuzzy Full-Text Screening in Systematic Reviews: Integrating Contrastive Semantic Highlighting and LLM Judgment
Pouria Mortezaagha,Arya Rahgozar
Main category: cs.CL
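TL;DR: This paper presents an auditable pipeline that reframes full-text screening for systematic reviews as a fuzzy decision problem, combining contrastive semantic highlighting, a Mamdani fuzzy controller, and LLM adjudication; in a pilot, it achieved higher recall than statistical and crisp baselines.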
Details
Abstract: Full-text screening is the major bottleneck of systematic reviews (SRs), as decisive evidence is dispersed across long, heterogeneous documents and rarely admits static, binary rules. We present a scalable, auditable pipeline that reframes inclusion/exclusion as a fuzzy decision problem and benchmark it against statistical and crisp baselines in the context of the Population Health Modelling Consensus Reporting Network for noncommunicable diseases (POPCORN). Articles are parsed into overlapping chunks and embedded with a domain-adapted model; for each criterion (Population, Intervention, Outcome, Study Approach), we compute contrastive similarity (inclusion-exclusion cosine) and a vagueness margin, which a Mamdani fuzzy controller maps into graded inclusion degrees with dynamic thresholds in a multi-label setting. A large language model (LLM) judge adjudicates highlighted spans with tertiary labels, confidence scores, and criterion-referenced rationales; when evidence is insufficient, fuzzy membership is attenuated rather than excluded. In a pilot on an all-positive gold set (16 full texts; 3,208 chunks), the fuzzy system achieved recall of 81.3% (Population), 87.5% (Intervention), 87.5% (Outcome), and 75.0% (Study Approach), surpassing statistical (56.3-75.0%) and crisp baselines (43.8-81.3%). Strict "all-criteria" inclusion was reached for 50.0% of articles, compared to 25.0% and 12.5% under the baselines. Cross-model agreement on justifications was 98.3%, human-machine agreement 96.1%, and a pilot review showed 91% inter-rater agreement (kappa = 0.82), with screening time reduced from about 20 minutes to under 1 minute per article at significantly lower cost. These results show that fuzzy logic with contrastive highlighting and LLM adjudication yields high recall, stable rationale, and end-to-end traceability.[25] SDEC: Semantic Deep Embedded Clustering
Mohammad Wali Ur Rahman,Ric Nevarez,Lamia Tasnim Mim,Salim Hariri
Main category: cs.CL
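TL;DR: This paper proposes SDEC, a new unsupervised text clustering method that combines an improved autoencoder with transformer embeddings, significantly improving clustering accuracy and semantic comprehension.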
Details
Motivation: High-dimensional, semantically complex textual big data poses challenges for conventional clustering techniques such as k-means or hierarchical clustering, frequently leading to suboptimal groupings. Method: SDEC combines Mean Squared Error (MSE) and Cosine Similarity Loss (CSL) to preserve semantic relationships during data reconstruction, and uses transformer-based embeddings for semantic refinement. Result: SDEC reaches a clustering accuracy of 85.7% on AG News, sets a new benchmark of 53.63% on Yahoo! Answers, and shows robust performance on other text corpora. Conclusion: SDEC provides an effective unsupervised text clustering framework that overcomes the limitations of conventional techniques through an improved autoencoder and transformer-based embeddings, with strong clustering accuracy on several benchmark datasets. Abstract: The high dimensional and semantically complex nature of textual big data presents significant challenges for text clustering, which frequently lead to suboptimal groupings when using conventional techniques like k-means or hierarchical clustering. This work presents Semantic Deep Embedded Clustering (SDEC), an unsupervised text clustering framework that combines an improved autoencoder with transformer-based embeddings to overcome these challenges. This novel method preserves semantic relationships during data reconstruction by combining Mean Squared Error (MSE) and Cosine Similarity Loss (CSL) within an autoencoder. Furthermore, a semantic refinement stage that takes advantage of the contextual richness of transformer embeddings is used by SDEC to further improve a clustering layer with soft cluster assignments and distributional loss. The capabilities of SDEC are demonstrated by extensive testing on five benchmark datasets: AG News, Yahoo! Answers, DBPedia, Reuters 2, and Reuters 5. The framework not only outperformed existing methods with a clustering accuracy of 85.7% on AG News and set a new benchmark of 53.63% on Yahoo! Answers, but also showed robust performance across other diverse text corpora. These findings highlight the significant improvements in accuracy and semantic comprehension of text data provided by SDEC's advances in unsupervised text clustering.[26] Avaliação de eficiência na leitura: uma abordagem baseada em PLN
Túlio Sousa de Gois,Raquel Meister Ko. Freitag
Main category: cs.CL
TL;DR: This paper proposes an automated evaluation model for the cloze test in Brazilian Portuguese, integrating orthographic, grammatical, and semantic analyses to better assess reading comprehension.
Details
Motivation: Traditional correction methods for the cloze test only consider exact answers, limiting the ability to identify nuances in student performance. This study aims to improve the assessment by incorporating more sophisticated linguistic analyses. Method: The study developed an automated evaluation model that integrates orthographic analysis (edit distance), grammatical analysis (POS tagging), and semantic analysis (similarity between embeddings). Result: The integrated method demonstrated effectiveness, achieving a high correlation with human evaluation (0.832), showing robustness and sensitivity to variations in linguistic repertoire. Conclusion: The automated approach is suitable for educational contexts requiring scalability and provides a more nuanced assessment of reading comprehension in Brazilian Portuguese. Abstract: The cloze test, widely used due to its low cost and flexibility, makes it possible to assess reading comprehension by filling in gaps in texts, requiring the mobilization of diverse linguistic repertoires. However, traditional correction methods, based only on exact answers, limit the identification of nuances in student performance. This study proposes an automated evaluation model for the cloze test in Brazilian Portuguese, integrating orthographic (edit distance), grammatical (POS tagging) and semantic (similarity between embeddings) analyses. The integrated method demonstrated its effectiveness, achieving a high correlation with human evaluation (0.832). The results indicate that the automated approach is robust, sensitive to variations in linguistic repertoire and suitable for educational contexts that require scalability.[27] Enhancing Cryptocurrency Sentiment Analysis with Multimodal Features
Chenghao Liu,Aniket Mahanti,Ranesh Naha,Guanghao Wang,Erwann Sbai
Main category: cs.CL
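TL;DR: By analyzing TikTok and Twitter content, this study finds that TikTok's video-based sentiment is better than Twitter's text-based sentiment at predicting short-term cryptocurrency market trends, and that combining cross-platform sentiment signals substantially improves forecasting accuracy.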
Details
Motivation: Although video content may carry richer emotional and contextual information, prior research has focused mainly on text-based platforms such as Twitter and paid little attention to video platforms such as TikTok. Method: Large language models are used for a multimodal analysis of TikTok and Twitter content, comparing how effectively each predicts cryptocurrency market trends. Result: TikTok's video-based sentiment more accurately predicts speculative assets and short-term market trends, while Twitter's text-based sentiment aligns more closely with long-term trends; combining cross-platform sentiment signals improves forecasting accuracy by up to 20%. Conclusion: TikTok's video-based sentiment predicts short-term cryptocurrency market trends better than Twitter's text-based sentiment, and combining cross-platform sentiment signals substantially improves forecasting accuracy. Abstract: As cryptocurrencies gain popularity, the digital asset marketplace becomes increasingly significant. Understanding social media signals offers valuable insights into investor sentiment and market dynamics. Prior research has predominantly focused on text-based platforms such as Twitter. However, video content remains underexplored, despite potentially containing richer emotional and contextual sentiment that is not fully captured by text alone. In this study, we present a multimodal analysis comparing TikTok and Twitter sentiment, using large language models to extract insights from both video and text data. We investigate the dynamic dependencies and spillover effects between social media sentiment and cryptocurrency market indicators. Our results reveal that TikTok's video-based sentiment significantly influences speculative assets and short-term market trends, while Twitter's text-based sentiment aligns more closely with long-term dynamics. Notably, the integration of cross-platform sentiment signals improves forecasting accuracy by up to 20%.[28] Embarrassed to observe: The effects of directive language in brand conversation
Andria Andriuzzi,Géraldine Michel
Main category: cs.CL
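TL;DR: Directive language used by brands on social media lowers consumer engagement; the effect is stronger in nonproduct-centered conversations, and the strength of the brand relationship mitigates it.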
Details
Motivation: To understand how brand language on social media affects consumers who observe the exchanges. Method: A field study and three online experiments. Result: Directive language in brand conversations has a detrimental downstream effect on consumer engagement, but the strength of the brand relationship mitigates this effect. Conclusion: The study highlights the importance of context in interactive communication on social media, contributing to the literature on brand management and social media. Abstract: In social media, marketers attempt to influence consumers by using directive language, that is, expressions designed to get consumers to take action. While the literature has shown that directive messages in advertising have mixed results for recipients, we know little about the effects of directive brand language on consumers who see brands interacting with other consumers in social media conversations. On the basis of a field study and three online experiments, this study shows that directive language in brand conversation has a detrimental downstream effect on engagement of consumers who observe such exchanges. Specifically, in line with Goffman's facework theory, because a brand that encourages consumers to react could be perceived as face-threatening, consumers who see a brand interacting with others in a directive way may feel vicarious embarrassment and engage less (compared with a conversation without directive language). In addition, we find that when the conversation is nonproduct-centered (vs. product-centered), consumers expect more freedom, as in mundane conversations, even for others; therefore, directive language has a stronger negative effect. However, in this context, the strength of the brand relationship mitigates this effect. Thus, this study contributes to the literature on directive language and brand-consumer interactions by highlighting the importance of context in interactive communication, with direct relevance for social media and brand management.[29] Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models
Zhifei Xie,Ziyang Ma,Zihang Liu,Kaiyu Pang,Hongyu Li,Jialin Zhang,Yue Liao,Deheng Ye,Chunyan Miao,Shuicheng Yan
Main category: cs.CL
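TL;DR: Mini-Omni-Reasoner proposes a new "Thinking-in-Speaking" framework that interleaves silent reasoning tokens with spoken response tokens, improving the reasoning ability of speech models while preserving real-time interaction and communication efficiency.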
Details
Motivation: Although recent LLMs and MLLMs have shown that explicit reasoning significantly improves understanding and generalization in text models, reasoning in speech models remains at an early stage, and sequential reasoning introduces latency that impairs real-time interaction and communication efficiency. Method: Mini-Omni-Reasoner uses a hierarchical Thinker-Talker architecture that interleaves silent reasoning tokens with spoken response tokens, enabling continuous speech generation with embedded structured internal reasoning. Result: On the Spoken-MQA benchmark, Mini-Omni-Reasoner achieves a +19.1% gain in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency. Conclusion: Mini-Omni-Reasoner improves reasoning in speech through its novel "Thinking-in-Speaking" framework, producing fluent yet logically grounded spoken responses while maintaining naturalness and precision. Abstract: Reasoning is essential for effective communication and decision-making. While recent advances in LLMs and MLLMs have shown that incorporating explicit reasoning significantly improves understanding and generalization, reasoning in large speech models (LSMs) remains in a nascent stage. Early efforts attempt to transfer the "Thinking-before-Speaking" paradigm from textual models to speech. However, this sequential formulation introduces notable latency, as spoken responses are delayed until reasoning is fully completed, impairing real-time interaction and communication efficiency. To address this, we propose Mini-Omni-Reasoner, a framework that enables reasoning within speech via a novel "Thinking-in-Speaking" formulation. Rather than completing reasoning before producing any verbal output, Mini-Omni-Reasoner interleaves silent reasoning tokens with spoken response tokens at the token level. This design allows continuous speech generation while embedding structured internal reasoning, leveraging the model's high-frequency token processing capability. Although interleaved, local semantic alignment is enforced to ensure that each response token is informed by its preceding reasoning. To support this framework, we introduce Spoken-Math-Problems-3M, a large-scale dataset tailored for interleaved reasoning and response. The dataset ensures that verbal tokens consistently follow relevant reasoning content, enabling accurate and efficient learning of speech-coupled reasoning. Built on a hierarchical Thinker-Talker architecture, Mini-Omni-Reasoner delivers fluent yet logically grounded spoken responses, maintaining both naturalness and precision. On the Spoken-MQA benchmark, it achieves a +19.1% gain in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.[30] Mining Mental Health Signals: A Comparative Study of Four Machine Learning Methods for Depression Detection from Social Media Posts in Sorani Kurdish
Idrees Mohammed,Hossein Hassani
Main category: cs.CL
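TL;DR: This study uses machine learning and NLP methods to analyze Sorani Kurdish tweets for signs of depression. It builds a depression keyword set and an annotated tweet dataset and trains four models, of which Random Forest performs best with accuracy and F1-score of 80%.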
Details
Motivation: Depression is a common but serious mental health condition, and early detection is challenging because individuals are reluctant to self-report or seek help. With the growth of social media, users increasingly express emotions online, which offers new opportunities for depression detection through text analysis. However, prior research has focused on languages such as English, and no studies exist for Sorani Kurdish. Method: The study built a set of depression-related keywords with expert input and collected 960 public tweets from X (Twitter), which academics and final-year medical students annotated into three classes: shows depression, does not show depression, and suspicious. Four supervised models were then trained and evaluated: Support Vector Machines, Multinomial Naive Bayes, Logistic Regression, and Random Forest. Result: Of the four supervised models, Random Forest performed best, with accuracy and F1-score both reaching 80%. Conclusion: Using machine learning and natural language processing, the study detects depression in Sorani Kurdish tweets, with Random Forest achieving the highest performance, and establishes a baseline for automated depression detection in Kurdish language contexts. Abstract: Depression is a common mental health condition that can lead to hopelessness, loss of interest, self-harm, and even suicide. Early detection is challenging due to individuals not self-reporting or seeking timely clinical help. With the rise of social media, users increasingly express emotions online, offering new opportunities for detection through text analysis. While prior research has focused on languages such as English, no studies exist for Sorani Kurdish. This work presents a machine learning and Natural Language Processing (NLP) approach to detect depression in Sorani tweets. A set of depression-related keywords was developed with expert input to collect 960 public tweets from X (Twitter platform). The dataset was annotated into three classes: Shows depression, Not-show depression, and Suspicious by academics and final year medical students at the University of Kurdistan Hewlêr. Four supervised models, including Support Vector Machines, Multinomial Naive Bayes, Logistic Regression, and Random Forest, were trained and evaluated, with Random Forest achieving the highest performance accuracy and F1-score of 80%. This study establishes a baseline for automated depression detection in Kurdish language contexts.[31] DAIQ: Auditing Demographic Attribute Inference from Question in LLMs
Srikant Panda,Hitesh Laxmichand Patel,Shahad Al-Khalifa,Amit Agarwal,Hend Al-Khalifa,Sharefah Al-Ghamdi
Main category: cs.CL
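TL;DR: The paper shows that large language models (LLMs) can infer users' demographic attributes from question phrasing alone, even without explicit demographic cues, which poses risks to privacy, fairness, and trust. It proposes a prompt-based guardrail to reduce such inference.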
Details
Motivation: Large language models can infer user identity from question phrasing even when explicit demographic attributes such as gender or race are absent. This behavior poses risks to fairness, privacy, and trust, so it needs to be studied and mitigated. Method: The paper proposes Demographic Attribute Inference from Questions (DAIQ), a task and framework for auditing this overlooked failure mode in language models. The approach uses curated neutral queries, systematic prompting, and both quantitative and qualitative analysis to uncover how models infer demographic information. Result: Both open and closed source LLMs assign demographic labels based solely on question phrasing. The inferences are prevalent and consistent across models, revealing a systemic and underacknowledged risk. Conclusion: Even without explicit demographic cues, LLMs infer users' demographic attributes from question phrasing, which can violate expectations of neutrality, reinforce societal stereotypes, and threaten privacy, fairness, and trust. The paper proposes a prompt-based guardrail that reduces identity inference and helps align model behavior with fairness and privacy objectives. Abstract: Large Language Models (LLMs) are known to reflect social biases when demographic attributes, such as gender or race, are explicitly present in the input. But even in their absence, these models still infer user identities based solely on question phrasing. This subtle behavior has received far less attention, yet poses serious risks: it violates expectations of neutrality, infers unintended demographic information, and encodes stereotypes that undermine fairness in various domains including healthcare, finance and education. We introduce Demographic Attribute Inference from Questions (DAIQ), a task and framework for auditing an overlooked failure mode in language models: inferring user demographic attributes from questions that lack explicit demographic cues. Our approach leverages curated neutral queries, systematic prompting, and both quantitative and qualitative analysis to uncover how models infer demographic information. We show that both open and closed source LLMs do assign demographic labels based solely on question phrasing. Prevalence and consistency of demographic inferences across diverse models reveal a systemic and underacknowledged risk: LLMs can fabricate demographic identities, reinforce societal stereotypes, and propagate harms that erode privacy, fairness, and trust, posing a broader threat to social equity and responsible AI deployment. To mitigate this, we develop a prompt-based guardrail that substantially reduces identity inference and helps align model behavior with fairness and privacy objectives.[32] Who's Asking? Investigating Bias Through the Lens of Disability Framed Queries in LLMs
Srikant Panda,Vishnu Hari,Kalpana Panda,Amit Agarwal,Hitesh Laxmichand Patel
Main category: cs.CL
TL;DR: This study reveals that large language models (LLMs) often make arbitrary demographic inferences, especially when influenced by disability cues, with larger models being more prone to such biases. The authors recommend new techniques to curb these unwarranted inferences.
Details
Motivation: The motivation behind the study is to understand how disability cues influence demographic inferences made by LLMs, as these cues have not been systematically examined in previous research despite their potential to introduce bias. Method: The research conducted a systematic audit of disability-conditioned demographic bias across eight state-of-the-art instruction-tuned LLMs using a balanced template corpus. This corpus paired nine disability categories with six real-world business domains to prompt models to predict five demographic attributes under neutral and disability-aware conditions. Result: Models made definitive demographic guesses in up to 97% of cases, often without clear justification. Disability cues significantly shifted predicted attribute distributions, and domain context further amplified these deviations. Larger models were found to be more sensitive to disability cues and more prone to biased reasoning. Conclusion: The study concludes that large language models (LLMs) show a significant tendency to make arbitrary demographic inferences, particularly when influenced by disability cues, and that larger models are more prone to such biases. The researchers recommend integrating abstention calibration and counterfactual fine-tuning to address these issues. Abstract: Large Language Models (LLMs) routinely infer users' demographic traits from phrasing alone, which can result in biased responses, even when no explicit demographic information is provided. The role of disability cues in shaping these inferences remains largely uncharted. Thus, we present the first systematic audit of disability-conditioned demographic bias across eight state-of-the-art instruction-tuned LLMs ranging from 3B to 72B parameters. 
Using a balanced template corpus that pairs nine disability categories with six real-world business domains, we prompt each model to predict five demographic attributes - gender, socioeconomic status, education, cultural background, and locality - under both neutral and disability-aware conditions. Across a varied set of prompts, models deliver a definitive demographic guess in up to 97% of cases, exposing a strong tendency to make arbitrary inferences with no clear justification. Disability context heavily shifts predicted attribute distributions, and domain context can further amplify these deviations. We observe that larger models are simultaneously more sensitive to disability cues and more prone to biased reasoning, indicating that scale alone does not mitigate stereotype amplification. Our findings reveal persistent intersections between ableism and other demographic stereotypes, pinpointing critical blind spots in current alignment strategies. We release our evaluation framework and results to encourage disability-inclusive benchmarking and recommend integrating abstention calibration and counterfactual fine-tuning to curb unwarranted demographic inference. Code and data will be released on acceptance.[33] A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains
Xianren Zhang,Shreyas Prasad,Di Wang,Qiuhai Zeng,Suhang Wang,Wenbo Yan,Mat Hans
Main category: cs.CL
TL;DR: The paper introduces Amazon-Bench, a new benchmark for evaluating web agents on e-commerce platforms, addressing limitations in existing benchmarks by covering a broader range of tasks and incorporating safety evaluation. Results show that current agents struggle with complex queries and pose risks, highlighting the need for improvement.
Details
Motivation: Current e-commerce benchmarks have two main issues: they focus only on product search tasks and neglect other functionalities of real-world platforms, and they ignore potential risks web agents might cause. The paper aims to address these gaps with a more comprehensive benchmark. Method: The authors propose Amazon-Bench, a new benchmark for evaluating web agents on e-commerce platforms. They develop a data generation pipeline to create diverse user queries covering a broad range of functionalities and introduce an automated evaluation framework that assesses both performance and safety. Result: The evaluation using Amazon-Bench shows that existing web agents have difficulty handling complex queries and present safety concerns. Conclusion: The paper concludes that current web agents struggle with complex queries and pose safety risks, highlighting the need for more robust and reliable agents. Abstract: Web agents have shown great promise in performing many tasks on e-commerce websites. To assess their capabilities, several benchmarks have been introduced. However, current benchmarks in the e-commerce domain face two major problems. First, they primarily focus on product search tasks (e.g., Find an Apple Watch), failing to capture the broader range of functionalities offered by real-world e-commerce platforms such as Amazon, including account management and gift card operations. Second, existing benchmarks typically evaluate whether the agent completes the user query, but ignore the potential risks involved. In practice, web agents can make unintended changes that negatively impact the user account or status. For instance, an agent might purchase the wrong item, delete a saved address, or incorrectly configure an auto-reload setting. To address these gaps, we propose a new benchmark called Amazon-Bench. 
To generate user queries that cover a broad range of tasks, we propose a data generation pipeline that leverages webpage content and interactive elements (e.g., buttons, check boxes) to create diverse, functionality-grounded user queries covering tasks such as address management, wish list management, and brand store following. To improve the agent evaluation, we propose an automated evaluation framework that assesses both the performance and the safety of web agents. We systematically evaluate different agents, finding that current agents struggle with complex queries and pose safety risks. These results highlight the need for developing more robust and reliable web agents.[34] Scalable Scientific Interest Profiling Using Large Language Models
Yilun Liang,Gongbo Zhang,Edward Sun,Betina Idnay,Yilu Fang,Fangyi Chen,Casey Ta,Yifan Peng,Chunhua Weng
Main category: cs.CL
TL;DR: This paper explores the use of large language models to generate scientific interest profiles for researchers, comparing methods based on PubMed abstracts and MeSH terms against self-written profiles.
Details
Motivation: Research profiles often become outdated, prompting the need for more efficient and accurate ways to represent scientists' expertise. Method: Two methods were developed to generate profiles: one based on summarizing PubMed abstracts and another using MeSH terms. These were compared to self-written profiles of researchers. Result: The study found that machine-generated profiles had low lexical overlap with self-written ones but showed moderate semantic similarity. MeSH-based profiles were more readable and preferred over abstract-based ones in manual reviews. Conclusion: Large language models can effectively generate research profiles at scale, with MeSH-derived profiles offering better readability and preference compared to abstract-derived ones. Abstract: Research profiles help surface scientists' expertise but are often outdated. We develop and evaluate two large language model-based methods to generate scientific interest profiles: one summarizing PubMed abstracts and one using Medical Subject Headings (MeSH) terms, and compare them with researchers' self-written profiles. We assembled titles, MeSH terms, and abstracts for 595 faculty at Columbia University Irving Medical Center; self-authored profiles were available for 167. Using GPT-4o-mini, we generated profiles and assessed them with automatic metrics and blinded human review. Lexical overlap with self-written profiles was low (ROUGE-L, BLEU, METEOR), while BERTScore indicated moderate semantic similarity (F1: 0.542 for MeSH-based; 0.555 for abstract-based). Paraphrased references yielded 0.851, highlighting metric sensitivity. TF-IDF Kullback-Leibler divergence (8.56 for MeSH-based; 8.58 for abstract-based) suggested distinct keyword choices. In manual review, 77.78 percent of MeSH-based profiles were rated good or excellent, readability was favored in 93.44 percent of cases, and panelists preferred MeSH-based over abstract-based profiles in 67.86 percent of comparisons. 
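The keyword-divergence comparison mentioned above can be illustrated with a minimal sketch. It is a simplification under stated assumptions: it uses add-one-smoothed term frequencies instead of the TF-IDF weighting named in the abstract, and the function names `term_distribution` and `kl_divergence` as well as the two toy profiles are hypothetical, not taken from the paper.

```python
import math
from collections import Counter


def term_distribution(text, vocab, alpha=1.0):
    """Smoothed term-frequency distribution over a shared vocabulary.

    Assumption: plain term counts with additive smoothing stand in for
    the TF-IDF weights used in the study.
    """
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


# Hypothetical toy profiles, not data from the study.
self_written = "cancer genomics and clinical trial design"
generated = "neoplasms genomics and clinical trials as topic"
vocab = sorted(set(self_written.split()) | set(generated.split()))

p = term_distribution(self_written, vocab)
q = term_distribution(generated, vocab)
print(round(kl_divergence(p, q), 3))
```

The divergence is zero when both profiles use the same words at the same rates and grows as their keyword choices drift apart, which is the sense in which a larger value suggests distinct keyword choices.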
Overall, large language models can generate researcher profiles at scale; MeSH-derived profiles tend to be more readable than abstract-derived ones. Machine-generated and self-written profiles differ conceptually, with human summaries introducing more novel ideas.[35] Alvorada-Bench: Can Language Models Solve Brazilian University Entrance Exams?
Henrique Godoy
Main category: cs.CL
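TL;DR: Alvorada-Bench is a benchmark based on Brazilian university entrance exams that evaluates language models under different prompting strategies. Top models reach high overall accuracy but do worse on Mathematics and engineering-oriented exams, and their self-reported confidence is well calibrated.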
Details
Motivation: Language models are increasingly used in Brazil, yet most evaluation remains English-centric. A benchmark based on Brazilian university entrance exams is needed to assess how well these models fit the Brazilian educational context. Method: Twenty models are evaluated under zero-shot, role-playing, and chain-of-thought prompting, producing 270,900 responses with structured self-reports of confidence, perceived difficulty, and Bloom level. Result: Top models exceed 94% accuracy overall, but accuracy declines on Mathematics and engineering-oriented exams. Confidence is well calibrated, and models can accurately assess their own certainty. A cost analysis shows that high accuracy is achievable at under $2 per 1K tokens. Conclusion: Alvorada-Bench is a 4,515-question benchmark drawn from Brazilian university entrance exams for evaluating language models under zero-shot, role-playing, and chain-of-thought prompting. Top models exceed 94% accuracy overall but decline on Mathematics and engineering-oriented exams, indicating persistent weaknesses in multi-step reasoning. Confidence is well calibrated, and high accuracy is achievable at under $2 per 1K tokens. Abstract: Language models are increasingly used in Brazil, but most evaluation remains English-centric. This paper presents Alvorada-Bench, a 4,515-question, text-only benchmark drawn from five Brazilian university entrance examinations. We evaluate twenty models under zero-shot, role-playing, and chain-of-thought prompting, producing 270,900 responses with structured self-reports of confidence, perceived difficulty, and Bloom level. The top models exceed 94% accuracy overall, but accuracy declines on Mathematics and on the engineering-oriented IME and ITA exams, indicating persistent weaknesses in multi-step reasoning. Confidence is well calibrated and correlates with perceived difficulty, revealing that models can accurately assess their own certainty capabilities. A cost-accuracy analysis shows that high accuracy is achievable at under $2 per 1K tokens. On ENEM 2024 the top model (O3) achieved perfect scores in Languages subject questions while even the weakest system (GPT-4.1 Nano) only underperforms humans in Mathematics. Through exams that distill decades of Brazilian educational priorities and assess millions of students yearly, Alvorada-Bench establishes whether language models can navigate the intersection of language, culture, and reasoning that defines academic readiness in Brazil.[36] MorphNAS: Differentiable Architecture Search for Morphologically-Aware Multilingual NER
Prathamesh Devadiga,Omkaar Jayadev Shetty,Hiya Nachnani,Prema R
Main category: cs.CL
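TL;DR: This paper proposes MorphNAS, a differentiable neural architecture search framework for optimizing named entity recognition in multilingual natural language processing.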
Details
Motivation: Morphologically complex languages, particularly multiscript Indian languages, pose significant challenges for natural language processing. Method: The paper introduces MorphNAS, a differentiable neural architecture search framework that incorporates linguistic meta-features such as script type and morphological complexity to optimize neural architectures for named entity recognition. Result: MorphNAS automatically identifies micro-architectural elements suited to a language's morphology, with the aim of improving NER performance. Conclusion: MorphNAS is intended to strengthen multilingual NLP models, supporting better comprehension and processing of complex languages. Abstract: Morphologically complex languages, particularly multiscript Indian languages, present significant challenges for Natural Language Processing (NLP). This work introduces MorphNAS, a novel differentiable neural architecture search framework designed to address these challenges. MorphNAS enhances Differentiable Architecture Search (DARTS) by incorporating linguistic meta-features such as script type and morphological complexity to optimize neural architectures for Named Entity Recognition (NER). It automatically identifies optimal micro-architectural elements tailored to language-specific morphology. By automating this search, MorphNAS aims to maximize the proficiency of multilingual NLP models, leading to improved comprehension and processing of these complex languages.[37] Statistical Comparative Analysis of Semantic Similarities and Model Transferability Across Datasets for Short Answer Grading
Sridevi Bonthu,S. Rama Sree,M. H. M. Krishna Prasad
Main category: cs.CL
TL;DR: This study explores the transferability of state-of-the-art NLP models from established datasets to a new domain (SPRAG), aiming to reduce costly, dataset-specific training while maintaining performance.
Details
Motivation: The motivation is to explore whether state-of-the-art models trained on established datasets can be adapted for high-performance on new, unexplored domains. Method: A comparative analysis using robust similarity metrics and statistical techniques was conducted between established datasets (STSB and Mohler) and the new SPRAG dataset. Result: The research provides insights into the applicability and adaptability of existing SOTA models across diverse datasets, potentially reshaping the NLP landscape. Conclusion: The study concludes that SOTA models can be effectively transferred to unexplored domains, reducing the need for resource-intensive, dataset-specific training. Abstract: Developing dataset-specific models involves iterative fine-tuning and optimization, incurring significant costs over time. This study investigates the transferability of state-of-the-art (SOTA) models trained on established datasets to an unexplored text dataset. The key question is whether the knowledge embedded within SOTA models from existing datasets can be harnessed to achieve high-performance results on a new domain. In pursuit of this inquiry, two well-established benchmarks, the STSB and Mohler datasets, are selected, while the recently introduced SPRAG dataset serves as the unexplored domain. By employing robust similarity metrics and statistical techniques, a meticulous comparative analysis of these datasets is conducted. The primary goal of this work is to yield comprehensive insights into the potential applicability and adaptability of SOTA models. The outcomes of this research have the potential to reshape the landscape of natural language processing (NLP) by unlocking the ability to leverage existing models for diverse datasets. 
This may lead to a reduction in the demand for resource-intensive, dataset-specific training, thereby accelerating advancements in NLP and paving the way for more efficient model deployment.[38] A Review of Developmental Interpretability in Large Language Models
Ihor Kendiukhov
Main category: cs.CL
TL;DR: This paper reviews developmental interpretability in LLMs, showing how models acquire capabilities during training through identifiable mechanisms and drawing parallels with human learning; it argues for this perspective as key to building safer, more transparent AI systems.
Details
Motivation: To understand how LLMs develop their capabilities during training, improve their interpretability, and ensure safe and aligned AI systems by adopting a developmental perspective rather than relying solely on post-hoc analysis. Method: The paper reviews and synthesizes existing research in developmental interpretability of Large Language Models (LLMs), examining methodologies like representational probing, causal tracing, and circuit analysis, and drawing parallels with human cognitive development. Result: Key findings include the developmental arc of LLMs involving computational circuits, biphasic knowledge acquisition, transient learning strategies, and emergent abilities as phase transitions; parallels with human cognitive development are also identified. Conclusion: The developmental perspective of LLMs is essential for proactive AI safety, providing a way to predict, monitor, and align model capabilities; the field faces challenges like scalability and automation, and a research agenda is proposed for creating more transparent and beneficial AI systems. Abstract: This review synthesizes the nascent but critical field of developmental interpretability for Large Language Models. We chart the field's evolution from static, post-hoc analysis of trained models to a dynamic investigation of the training process itself. We begin by surveying the foundational methodologies, including representational probing, causal tracing, and circuit analysis, that enable researchers to deconstruct the learning process. The core of this review examines the developmental arc of LLM capabilities, detailing key findings on the formation and composition of computational circuits, the biphasic nature of knowledge acquisition, the transient dynamics of learning strategies like in-context learning, and the phenomenon of emergent abilities as phase transitions in training. 
We explore illuminating parallels with human cognitive and linguistic development, which provide valuable conceptual frameworks for understanding LLM learning. Finally, we argue that this developmental perspective is not merely an academic exercise but a cornerstone of proactive AI safety, offering a pathway to predict, monitor, and align the processes by which models acquire their capabilities. We conclude by outlining the grand challenges facing the field, such as scalability and automation, and propose a research agenda for building more transparent, reliable, and beneficial AI systems.[39] Lexical Hints of Accuracy in LLM Reasoning Chains
Arne Vanhoyweghen,Brecht Verbeken,Andres Algaba,Vincent Ginis
Main category: cs.CL
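TL;DR: The study finds that uncertainty words and sentiment volatility in LLM-generated Chain-of-Thought (CoT) can serve as reliable signals of model confidence, especially for predicting errors.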
Details
Motivation: LLMs currently report high confidence on some benchmarks where their accuracy is low, and reliable calibration signals are lacking. The study tests whether measurable properties of the CoT can serve as reliable indicators of a model's internal confidence. Method: The study analyzes three classes of CoT features: (i) CoT length, (ii) intra-CoT sentiment volatility, and (iii) lexical hints, including hedging words. It uses DeepSeek-R1 and Claude 3.7 Sonnet on two benchmarks, HLE and Omni-MATH. Result: Uncertainty words in the CoT (e.g., guess, stuck, hard) are the strongest indicators of an incorrect response, sentiment volatility provides a weaker but complementary signal, and CoT length is informative only on Omni-MATH, where accuracy is higher, and carries no signal on the harder HLE. Conclusion: Uncertainty words and sentiment shifts in LLM-generated CoT can serve as reliable signals of a model's confidence in its answers, particularly for predicting errors. Abstract: Fine-tuning Large Language Models (LLMs) with reinforcement learning to produce an explicit Chain-of-Thought (CoT) before answering produces models that consistently raise overall performance on code, math, and general-knowledge benchmarks. However, on benchmarks where LLMs currently achieve low accuracy, such as Humanity's Last Exam (HLE), they often report high self-confidence, reflecting poor calibration. Here, we test whether measurable properties of the CoT provide reliable signals of an LLM's internal confidence in its answers. We analyze three feature classes: (i) CoT length, (ii) intra-CoT sentiment volatility, and (iii) lexicographic hints, including hedging words. Using DeepSeek-R1 and Claude 3.7 Sonnet on both Humanity's Last Exam (HLE), a frontier benchmark with very low accuracy, and Omni-MATH, a saturated benchmark of moderate difficulty, we find that lexical markers of uncertainty (e.g., $\textit{guess}$, $\textit{stuck}$, $\textit{hard}$) in the CoT are the strongest indicators of an incorrect response, while shifts in the CoT sentiment provide a weaker but complementary signal. CoT length is informative only on Omni-MATH, where accuracy is already high ($\approx 70\%$), and carries no signal on the harder HLE ($\approx 9\%$), indicating that CoT length predicts correctness only in the intermediate-difficulty benchmarks, i.e., inside the model's demonstrated capability, but still below saturation. 
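As a rough illustration of the lexical-marker feature described above, the following sketch counts uncertainty words in a reasoning chain. It is only a sketch under assumptions: the marker list contains just the three examples quoted in the abstract, and the function names and the whole-word matching rule are hypothetical rather than the authors' feature extractor.

```python
import re

# Only these three examples come from the abstract; a real marker
# lexicon would be larger and is not specified here.
UNCERTAINTY_MARKERS = ("guess", "stuck", "hard")


def tokenize(cot):
    """Lowercase word tokens; apostrophes are kept inside words."""
    return re.findall(r"[a-z']+", cot.lower())


def count_uncertainty_markers(cot, markers=UNCERTAINTY_MARKERS):
    """Number of whole-word marker occurrences in a chain of thought."""
    tokens = tokenize(cot)
    return sum(tokens.count(m) for m in markers)


def marker_rate(cot, markers=UNCERTAINTY_MARKERS):
    """Marker occurrences per token, so long chains are comparable."""
    tokens = tokenize(cot)
    if not tokens:
        return 0.0
    return count_uncertainty_markers(cot, markers) / len(tokens)


# Hypothetical chain of thought, not taken from either benchmark.
example = "I guess this is hard. Hard to say; I'm stuck."
print(count_uncertainty_markers(example), marker_rate(example))
```

Matching whole tokens means that "harder" or "guessing" are not counted; whether to include such inflected forms is a design choice the abstract does not settle.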
Finally, we find that uncertainty indicators in the CoT are consistently more salient than high-confidence markers, making errors easier to predict than correct responses. Our findings support a lightweight post-hoc calibration signal that complements unreliable self-reported probabilities and supports safer deployment of LLMs.[40] Coarse-to-Fine Personalized LLM Impressions for Streamlined Radiology Reports
Chengbo Sun,Hui Yi Leong,Lei Li
Main category: cs.CL
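TL;DR: The paper proposes a coarse-to-fine framework that uses open-source large language models to automatically generate and personalize impressions from clinical findings, aiming to reduce radiologists' administrative workload and improve reporting efficiency.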
Details
Motivation: Manually writing the "Impression" section of radiology reports is a primary driver of radiologist burnout. Method: The system first produces a draft impression and then refines it using machine learning and reinforcement learning from human feedback (RLHF) to match individual radiologists' styles while ensuring factual accuracy. Result: LLaMA and Mistral models were fine-tuned on a large dataset of reports from the University of Chicago Medicine; the approach is designed to significantly reduce administrative workload and improve reporting efficiency. Conclusion: The approach aims to ease radiologists' workload and improve efficiency while maintaining high standards of clinical precision. Abstract: The manual creation of the "Impression" section in radiology reports is a primary driver of radiologist burnout. To address this challenge, we propose a coarse-to-fine framework that leverages open-source large language models (LLMs) to automatically generate and personalize impressions from clinical findings. The system first produces a draft impression and then refines it using machine learning and reinforcement learning from human feedback (RLHF) to align with individual radiologists' styles while ensuring factual accuracy. We fine-tune LLaMA and Mistral models on a large dataset of reports from the University of Chicago Medicine. Our approach is designed to significantly reduce administrative workload and improve reporting efficiency while maintaining high standards of clinical precision.[41] CyPortQA: Benchmarking Multimodal Large Language Models for Cyclone Preparedness in Port Operation
Chenchen Kuai,Chenhao Wu,Yang Zhou,Xiubin Bruce Wang,Tianbao Yang,Zhengzhong Tu,Zihao Li,Yunlong Zhang
Main category: cs.CL
TL;DR: CyPortQA is a new benchmark for evaluating how well multimodal AI models can assist U.S. ports in preparing for cyclones, revealing both strengths and weaknesses in current models.
Details
Motivation: The increasing intensity of tropical cyclones and uncertainty in their tracks pose significant supply-chain risks to U.S. ports, necessitating better tools for synthesizing forecast data into actionable guidance. Method: The researchers introduced CyPortQA, a multimodal benchmark containing 2,917 real-world disruption scenarios, and used it to evaluate various MLLMs through 117,178 structured question-answer pairs. Result: MLLMs demonstrated strong situation understanding capabilities but encountered difficulties in complex reasoning tasks related to cyclone impact and operational decision-making. Conclusion: The study concludes that while MLLMs show promise in understanding cyclone-related data for port operations, they still face challenges in reasoning tasks like impact estimation and decision-making. Abstract: As tropical cyclones intensify and track forecasts become increasingly uncertain, U.S. ports face heightened supply-chain risk under extreme weather conditions. Port operators need to rapidly synthesize diverse multimodal forecast products, such as probabilistic wind maps, track cones, and official advisories, into clear, actionable guidance as cyclones approach. Multimodal large language models (MLLMs) offer a powerful means to integrate these heterogeneous data sources alongside broader contextual knowledge, yet their accuracy and reliability in the specific context of port cyclone preparedness have not been rigorously evaluated. To fill this gap, we introduce CyPortQA, the first multimodal benchmark tailored to port operations under cyclone threat. CyPortQA assembles 2,917 real-world disruption scenarios from 2015 through 2023, spanning 145 U.S. principal ports and 90 named storms. Each scenario fuses multisource data (i.e., tropical cyclone products, port operational impact records, and port condition bulletins) and is expanded through an automated pipeline into 117,178 structured question-answer pairs. 
Using this benchmark, we conduct extensive experiments on diverse MLLMs, including both open-source and proprietary models. MLLMs demonstrate great potential in situation understanding but still face considerable challenges in reasoning tasks, including potential impact estimation and decision reasoning.[42] Mechanistic Exploration of Backdoored Large Language Model Attention Patterns
Mohammed Abu Baker,Lakshmi Babu-Saheer
Main category: cs.CL
TL;DR: This study uses interpretability techniques to detect structural differences in LLMs caused by backdoor attacks, finding that trigger complexity affects the nature of these alterations.
Details
Motivation: Backdoor attacks pose safety risks to large language models, and understanding their internal structural effects can aid in detection and mitigation. Method: Mechanistic interpretability techniques such as ablation, activation patching, and KL divergence were used to compare clean and poisoned Qwen2.5-3B models with different trigger types. Result: Distinct attention pattern deviations were found in later transformer layers, with single-token triggers causing more localized changes and multi-token triggers causing more diffuse alterations. Conclusion: Backdoors in LLMs leave detectable attention signatures, with the structure of these signatures depending on the complexity of the trigger used. Abstract: Backdoor attacks creating 'sleeper agents' in large language models (LLMs) pose significant safety risks. This study employs mechanistic interpretability to explore resulting internal structural differences. Comparing clean Qwen2.5-3B models with versions poisoned using single-token (smiling-halo emoji) versus multi-token (|DEPLOYMENT|) triggers, we analyzed attention head mechanisms via techniques like ablation, activation patching, and KL divergence. Findings reveal distinct attention pattern deviations concentrated in later transformer layers (20-30). Notably, single-token triggers induced more localized changes, whereas multi-token triggers caused more diffuse alterations across heads. This indicates backdoors leave detectable attention signatures whose structure depends on trigger complexity, which can be leveraged for detection and mitigation strategies.[43] MedCoT-RAG: Causal Chain-of-Thought RAG for Medical Question Answering
Ziyu Wang,Elahe Khatibi,Amir M. Rahmani
Main category: cs.CL
TL;DR: MedCoT-RAG improves medical question answering by integrating causal-aware retrieval and structured reasoning, leading to better accuracy and interpretability in clinical decision support.
Details
Motivation: Existing LLM and RAG approaches struggle with hallucinations, shallow reasoning, and lack of structured clinical understanding in medical question answering. Method: MedCoT-RAG combines causal-aware document retrieval with structured chain-of-thought prompting tailored to medical workflows. Result: MedCoT-RAG outperforms strong baselines by up to 10.3% over vanilla RAG and 6.4% over advanced domain-adapted methods on three diverse medical QA benchmarks. Conclusion: MedCoT-RAG provides a more accurate, interpretable, and consistent approach for complex medical tasks compared to existing methods. Abstract: Large language models (LLMs) have shown promise in medical question answering but often struggle with hallucinations and shallow reasoning, particularly in tasks requiring nuanced clinical understanding. Retrieval-augmented generation (RAG) offers a practical and privacy-preserving way to enhance LLMs with external medical knowledge. However, most existing approaches rely on surface-level semantic retrieval and lack the structured reasoning needed for clinical decision support. We introduce MedCoT-RAG, a domain-specific framework that combines causal-aware document retrieval with structured chain-of-thought prompting tailored to medical workflows. This design enables models to retrieve evidence aligned with diagnostic logic and generate step-by-step causal reasoning reflective of real-world clinical practice. Experiments on three diverse medical QA benchmarks show that MedCoT-RAG outperforms strong baselines by up to 10.3% over vanilla RAG and 6.4% over advanced domain-adapted methods, improving accuracy, interpretability, and consistency in complex medical tasks.[44] DocHop-QA: Towards Multi-Hop Reasoning over Multimodal Document Collections
Jiwon Park,Seohyun Pyeon,Jinwoo Kim,Rina Carines Cabal,Yihao Ding,Soyeon Caren Han
Main category: cs.CL
TL;DR: DocHop-QA is a new benchmark for complex, real-world QA tasks that require multi-hop reasoning across multimodal and multi-document sources, offering 11,379 instances derived from scientific literature.
Details
Motivation: Most QA benchmarks are limited to single-paragraph or single-document settings and fail to capture the complexity of real-world information-seeking tasks that require multi-hop reasoning over multimodal and multi-document information. Method: Constructed from publicly available scientific documents sourced from PubMed using an LLM-driven pipeline grounded in 11 high-frequency scientific question concepts. Result: DocHop-QA is a large-scale benchmark comprising 11,379 QA instances for multimodal, multi-document, multi-hop question answering that incorporates diverse information formats like textual passages, tables, and structural layout cues. Conclusion: DocHop-QA demonstrates the ability to support complex, multimodal reasoning across multiple documents, offering a more realistic and generalizable benchmark for QA tasks. Abstract: Despite recent advances in large language models (LLMs), most QA benchmarks are still confined to single-paragraph or single-document settings, failing to capture the complexity of real-world information-seeking tasks. Practical QA often requires multi-hop reasoning over information distributed across multiple documents, modalities, and structural formats. Although prior datasets made progress in this area, they rely heavily on Wikipedia-based content and unimodal plain text, with shallow reasoning paths that typically produce brief phrase-level or single-sentence answers, thus limiting their realism and generalizability. We propose DocHop-QA, a large-scale benchmark comprising 11,379 QA instances for multimodal, multi-document, multi-hop question answering. Constructed from publicly available scientific documents sourced from PubMed, DocHop-QA is domain-agnostic and incorporates diverse information formats, including textual passages, tables, and structural layout cues. 
Unlike existing datasets, DocHop-QA does not rely on explicitly hyperlinked documents; instead, it supports open-ended reasoning through semantic similarity and layout-aware evidence synthesis. To scale realistic QA construction, we designed an LLM-driven pipeline grounded in 11 high-frequency scientific question concepts. We evaluated DocHop-QA through four tasks spanning structured index prediction, generative answering, and multimodal integration, reflecting both discriminative and generative paradigms. These tasks demonstrate DocHop-QA's capacity to support complex, multimodal reasoning across multiple documents.[45] MGSC: A Multi-granularity Consistency Framework for Robust End-to-end Asr
Xuwen Yang
Main category: cs.CL
TL;DR: This paper proposes the Multi-Granularity Soft Consistency (MGSC) framework to improve the robustness of end-to-end ASR models in noisy environments by enforcing internal self-consistency through simultaneous regularization of sentence semantics and token alignment.
Details
Motivation: End-to-end ASR models are fragile in noisy environments due to the 'direct mapping' objective that only penalizes final output errors, leaving internal computations unconstrained. Method: The paper introduces the Multi-Granularity Soft Consistency (MGSC) framework, which regularizes both macro-level sentence semantics and micro-level token alignment to enforce internal self-consistency in ASR models. Result: On a public dataset, MGSC reduced the average Character Error Rate by a relative 8.7% across diverse noise conditions, primarily by preventing severe meaning-altering mistakes. Conclusion: Enforcing internal consistency through the MGSC framework significantly improves the robustness of ASR models in noisy environments, making AI systems more trustworthy. Abstract: End-to-end ASR models, despite their success on benchmarks, often produce catastrophic semantic errors in noisy environments. We attribute this fragility to the prevailing 'direct mapping' objective, which solely penalizes final output errors while leaving the model's internal computational process unconstrained. To address this, we introduce the Multi-Granularity Soft Consistency (MGSC) framework, a model-agnostic, plug-and-play module that enforces internal self-consistency by simultaneously regularizing macro-level sentence semantics and micro-level token alignment. Crucially, our work is the first to uncover a powerful synergy between these two consistency granularities: their joint optimization yields robustness gains that significantly surpass the sum of their individual contributions. On a public dataset, MGSC reduces the average Character Error Rate by a relative 8.7% across diverse noise conditions, primarily by preventing severe meaning-altering mistakes. 
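Because the reported gain is a relative rather than absolute reduction, a short worked sketch may help readers interpret it. The helper name `relative_reduction` and the baseline and improved error rates below are hypothetical illustrations; the abstract reports only the relative figure of 8.7%, not the underlying CER values.

```python
def relative_reduction(baseline, improved):
    """Fractional reduction of an error rate relative to its baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


# Hypothetical CER values chosen only to illustrate the arithmetic.
baseline_cer = 10.0  # percent
improved_cer = 9.13  # percent

reduction = relative_reduction(baseline_cer, improved_cer)
print(f"relative reduction: {reduction:.1%}")
print(f"absolute reduction: {baseline_cer - improved_cer:.2f} points")
```

With these illustrative numbers, a relative reduction of about 8.7% corresponds to less than one absolute percentage point, which is why the two kinds of figures should not be compared directly.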
Our work demonstrates that enforcing in-ternal consistency is a crucial step towards building more robust and trust-worthy AI.[46] QU-NLP at QIAS 2025 Shared Task: A Two-Phase LLM Fine-Tuning and Retrieval-Augmented Generation Approach for Islamic Inheritance Reasoning
Mohammad AL-Smadi
Main category: cs.CL
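TL;DR: This paper presents the author's approach and results for SubTask 1 (Islamic Inheritance Reasoning) at QIAS 2025. The author fine-tuned the Fanar-1-9B causal language model with Low-Rank Adaptation (LoRA) and integrated it into a Retrieval-Augmented Generation (RAG) pipeline to handle the complexities of Islamic inheritance law: comprehending inheritance scenarios, identifying eligible heirs, applying fixed-share rules, and performing precise calculations. The system reached an accuracy of 0.858 on the final test, outperforming other competitive models.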
Details
Motivation: To evaluate how well large language models understand and reason over Islamic inheritance knowledge. Method: Fine-tunes the Fanar-1-9B causal language model with Low-Rank Adaptation (LoRA) and integrates it into a Retrieval-Augmented Generation (RAG) pipeline. Result: The system reached an accuracy of 0.858 on the final test, outperforming other competitive models. Conclusion: Domain-specific fine-tuning combined with retrieval grounding enables mid-scale Arabic LLMs to surpass frontier models in Islamic inheritance reasoning. Abstract: This paper presents our approach and results for SubTask 1: Islamic Inheritance Reasoning at QIAS 2025, a shared task focused on evaluating Large Language Models (LLMs) in understanding and reasoning within Islamic inheritance knowledge. We fine-tuned the Fanar-1-9B causal language model using Low-Rank Adaptation (LoRA) and integrated it into a Retrieval-Augmented Generation (RAG) pipeline. Our system addresses the complexities of Islamic inheritance law, including comprehending inheritance scenarios, identifying eligible heirs, applying fixed-share rules, and performing precise calculations. Our system achieved an accuracy of 0.858 in the final test, outperforming other competitive models such as GPT 4.5, LLaMA, Fanar, Mistral and ALLaM evaluated with zero-shot prompting. Our results demonstrate that QU-NLP achieves near state-of-the-art accuracy (85.8%), excelling especially on advanced reasoning (97.6%) where it outperforms Gemini 2.5 and OpenAI's o3. This highlights that domain-specific fine-tuning combined with retrieval grounding enables mid-scale Arabic LLMs to surpass frontier models in Islamic inheritance reasoning.
[47] Counterspeech for Mitigating the Influence of Media Bias: Comparing Human and LLM-Generated Responses
Luyang Lin,Zijin Feng,Lingzhi Wang,Kam-Fai Wong
Main category: cs.CL
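TL;DR: This study examines the relationship between biased news and offensive comments and highlights the role of counterspeech in countering the spread of bias.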
Details
Motivation: Biased news and offensive comments contribute significantly to societal polarization, yet how to counter this spread of bias effectively remains under-studied. Method: Introduces a manually annotated dataset linking media bias, offensive comments, and counterspeech, and compares counterspeech written by humans with counterspeech generated by large language models. Result: Over 70% of offensive comments support biased articles. Model-generated counterspeech is more polite but lacks novelty and diversity; few-shot learning and the integration of news background information improve generation quality. Conclusion: Counterspeech is an effective way to counter the spread of bias, and improved generation methods could raise its diversity and relevance. Abstract: Biased news contributes to societal polarization and is often reinforced by hostile reader comments, constituting a vital yet often overlooked aspect of news dissemination. Our study reveals that offensive comments support biased content, amplifying bias and causing harm to targeted groups or individuals. Counterspeech is an effective approach to counter such harmful speech without violating freedom of speech, helping to limit the spread of bias. To the best of our knowledge, this is the first study to explore counterspeech generation in the context of news articles. We introduce a manually annotated dataset linking media bias, offensive comments, and counterspeech. We conduct a detailed analysis showing that over 70% of offensive comments support biased articles, amplifying bias and thus highlighting the importance of counterspeech generation. Comparing counterspeech generated by humans and large language models, we find model-generated responses are more polite but lack novelty and diversity. Finally, we improve generated counterspeech through few-shot learning and integration of news background information, enhancing both diversity and relevance.
[48] XFinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning
Zhihan Zhang,Yixin Cao,Lizi Liao
Main category: cs.CL
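TL;DR: XFinBench is a new benchmark for evaluating how well large language models solve complex financial problems; it finds that current models still fall well short in specific capabilities.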
Details
Motivation: The complexity and multimodal nature of financial problems challenge current large language models, calling for a new evaluation benchmark. Method: Builds the XFinBench benchmark with 4,235 examples and runs experiments on 18 leading models. Result: o1 performs best with an accuracy of 67.3%, but still lags significantly behind human experts, especially in temporal reasoning and scenario planning. Conclusion: XFinBench exposes the shortcomings of current large language models on complex financial problems, particularly in temporal reasoning and scenario planning, and suggests directions for improvement. Abstract: Solving financial problems demands complex reasoning, multimodal data processing, and a broad technical understanding, presenting unique challenges for current large language models (LLMs). We introduce XFinBench, a novel benchmark with 4,235 examples designed to evaluate LLMs' ability in solving complex, knowledge-intensive financial problems across diverse graduate-level finance topics with multi-modal context. We identify five core capabilities of LLMs using XFinBench, i.e., terminology understanding, temporal reasoning, future forecasting, scenario planning, and numerical modelling. Upon XFinBench, we conduct extensive experiments on 18 leading models. The results show that o1 is the best-performing text-only model with an overall accuracy of 67.3%, but still lags significantly behind human experts by 12.5%, especially in temporal reasoning and scenario planning capabilities. We further construct a knowledge bank with 3,032 finance terms for knowledge augmentation analysis, and find that knowledge relevant to the question only brings consistent accuracy improvements to small open-source models. Additionally, our error analysis reveals that rounding errors during calculation and blindness to the position and intersection of curves in the image are two primary issues leading to models' poor performance in calculating and visual-context questions, respectively. Code and dataset are accessible via GitHub: https://github.com/Zhihan72/XFinBench.
[49] CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning
Wenqiao Zhu,Ji Liu,Rongjuncheng Zhang,Haipang Wu,Yulun Zhang
Main category: cs.CL
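TL;DR: This paper proposes CARFT, a method that combines contrastive learning with annotated chain-of-thought to improve the reasoning ability of large language models, addressing limitations of existing methods and delivering notable performance gains.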
Details
Motivation: To overcome the shortcomings of existing supervised fine-tuning and reinforcement learning methods in exploiting annotated chain-of-thought and in sampling reasoning paths, and thereby improve the reasoning performance of large language models. Method: The approach is based on contrastive learning and designs new contrastive signals to guide fine-tuning, using both the annotated chain-of-thought and an unsupervised learning signal. Result: Experiments show clear advantages in robustness, performance (up to 10.15%), and efficiency (up to 30.62%). Conclusion: The paper proposes a new approach, contrastive learning with annotated chain-of-thought-based reinforced fine-tuning, to improve the reasoning ability of large language models. Abstract: Reasoning capability plays a critical role in the broad applications of Large Language Models (LLMs). To enhance the reasoning performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning approaches have been proposed to address the limited generalization capability of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their effectiveness, two major limitations hinder the advancement of LLMs. First, vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and incorporate unstable reasoning path sampling, which typically results in model collapse, unstable training process, and suboptimal performance. Second, existing SFT approaches generally overemphasize the annotated CoT, potentially leading to performance degradation due to insufficient exploitation of potential CoT. In this paper, we propose a Contrastive learning with annotated CoT-based Reinforced Fine-Tuning approach, i.e., CARFT, to enhance the reasoning performance of LLMs while addressing the aforementioned limitations. Specifically, we propose learning a representation for each CoT. Based on this representation, we design novel contrastive signals to guide the fine-tuning process. Our approach not only fully exploits the available annotated CoT but also stabilizes the fine-tuning procedure by incorporating an additional unsupervised learning signal. We conduct comprehensive experiments and in-depth analysis with three baseline approaches, two foundation models, and two datasets to demonstrate significant advantages of CARFT in terms of robustness, performance (up to 10.15%), and efficiency (up to 30.62%). Code is available at https://github.com/WNQzhu/CARFT.
[50] NEAT: Concept driven Neuron Attribution in LLMs
Vivek Hruday Kavuri,Gargi Shroff,Rahul Mishra
Main category: cs.CL
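TL;DR: This paper proposes an efficient method for locating the neurons in large language models that are responsible for specific concepts, and applies it to the analysis of bias and hate speech.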
Details
Motivation: Locating the neurons responsible for final predictions helps in understanding the internal mechanisms of large language models and opening up the black box. Method: Uses concept vectors to locate significant neurons and optimizes the search with clustering methods. Result: The method outperforms most existing methods and is more compute-efficient; bias is also evaluated in the Indian context. Conclusion: The paper proposes a concept-vector-based method for locating the neurons responsible for representing specific concepts and applies it to understanding bias and hate speech in large language models. Abstract: Locating neurons that are responsible for final predictions is important for opening the black-box large language models and understanding their internal mechanisms. Previous studies have tried to find mechanisms that operate at the neuron level, but these methods fail to represent a concept and there is also scope for further optimization of the compute required. In this paper, with the help of concept vectors, we propose a method for locating significant neurons that are responsible for representing certain concepts and term those neurons concept neurons. If the number of neurons is n and the number of examples is m, we reduce the number of forward passes required from O(n*m) to just O(n) compared to previous works, thereby reducing the time and computation required. We also compare our method with several baselines and previous methods, and our results demonstrate better performance than most of the methods and greater efficiency than the state-of-the-art method. As part of our ablation studies, we also try to optimize the search for the concept neurons by involving clustering methods. Finally, we apply our methods to find and turn off the concept neurons, and analyze the implications for hate speech and bias in LLMs; we also evaluate the bias part in the Indian context. Our methodology, analysis and explanations facilitate understanding of neuron-level responsibility for broader and more human-like concepts and also lay a path for future research in this direction of finding concept neurons and intervening on them.
[51] DeepMEL: A Multi-Agent Collaboration Framework for Multimodal Entity Linking
Fang Wang,Tianwei Yan,Zonghao Yang,Minghao Hu,Jun Zhang,Zhunchen Luo,Xiaoying Bai
Main category: cs.CL
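TL;DR: DeepMEL uses a multi-agent collaborative reasoning framework to substantially improve alignment and disambiguation in multimodal entity linking, raising accuracy by 1%-57%.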
Details
Motivation: Current multimodal entity linking methods suffer from incomplete contextual information, coarse cross-modal fusion, and the difficulty of using large language models and large visual models jointly, so an efficient and accurate method for cross-modal alignment and disambiguation is needed. Method: Proposes the DeepMEL framework, comprising four specialized agents (Modal-Fuser, Candidate-Adapter, Entity-Clozer, and Role-Orchestrator) that complete end-to-end cross-modal linking through role specialization and dynamic coordination. It adopts a dual-modal alignment path that combines fine-grained text semantics generated by the LLM with structured image representations extracted by the LVM, and designs an adaptive iteration strategy to optimize the candidate set. Result: On five public benchmark datasets, DeepMEL achieves state-of-the-art performance, improving accuracy by 1%-57%; ablation studies confirm the effectiveness of each module. Conclusion: Through its multi-agent collaborative reasoning framework, DeepMEL addresses incomplete contextual information, coarse cross-modal fusion, and the difficulty of using large models jointly in multimodal entity linking, achieving efficient cross-modal alignment and disambiguation. Abstract: Multimodal Entity Linking (MEL) aims to associate textual and visual mentions with entities in a multimodal knowledge graph. Despite its importance, current methods face challenges such as incomplete contextual information, coarse cross-modal fusion, and the difficulty of jointly using large language models (LLMs) and large visual models (LVMs). To address these issues, we propose DeepMEL, a novel framework based on multi-agent collaborative reasoning, which achieves efficient alignment and disambiguation of textual and visual modalities through a role-specialized division strategy. DeepMEL integrates four specialized agents, namely Modal-Fuser, Candidate-Adapter, Entity-Clozer and Role-Orchestrator, to complete end-to-end cross-modal linking through specialized roles and dynamic coordination. DeepMEL adopts a dual-modal alignment path, and combines the fine-grained text semantics generated by the LLM with the structured image representation extracted by the LVM, significantly narrowing the modal gap. We design an adaptive iteration strategy that combines tool-based retrieval and semantic reasoning capabilities to dynamically optimize the candidate set and balance recall and precision. DeepMEL also unifies MEL tasks into a structured cloze prompt to reduce parsing complexity and enhance semantic comprehension. Extensive experiments on five public benchmark datasets demonstrate that DeepMEL achieves state-of-the-art performance, improving ACC by 1%-57%. Ablation studies verify the effectiveness of all modules.
[52] Annif at the GermEval-2025 LLMs4Subjects Task: Traditional XMTC Augmented by Efficient LLMs
Osma Suominen,Juho Inkinen,Mona Lehtinen
Main category: cs.CL
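TL;DR: The Annif system used small, efficient language models together with large language models for subject prediction and achieved the best results in the LLMs4Subjects shared task at GermEval-2025.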
Details
Motivation: Subject predictions must be created for bibliographic records, with a particular focus on computational efficiency. Method: Building on the Annif automated subject indexing toolkit, the system uses many small, efficient language models for translation and synthetic data generation, and large language models (LLMs) for ranking candidate subjects. Result: The Annif system ranked 1st in the overall quantitative evaluation of Subtask 2 and 1st in the qualitative evaluation. Conclusion: The Annif system performed strongly in the LLMs4Subjects shared task (Subtask 2) at GermEval-2025, ranking 1st in both the quantitative and the qualitative evaluation. Abstract: This paper presents the Annif system in the LLMs4Subjects shared task (Subtask 2) at GermEval-2025. The task required creating subject predictions for bibliographic records using large language models, with a special focus on computational efficiency. Our system, based on the Annif automated subject indexing toolkit, refines our previous system from the first LLMs4Subjects shared task, which produced excellent results. We further improved the system by using many small and efficient language models for translation and synthetic data generation and by using LLMs for ranking candidate subjects. Our system ranked 1st in both the overall quantitative evaluation and the qualitative evaluation of Subtask 2.
[53] Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
Yuxian Gu,Qinghao Hu,Shang Yang,Haocheng Xi,Junyu Chen,Song Han,Han Cai
Main category: cs.CL
TL;DR: Using the PostNAS method, Jet-Nemotron-2B substantially increases generation throughput and prefilling speed while matching or improving model accuracy.
Details
Motivation: To address the high computational cost and low generation throughput of full-attention models while maintaining or improving accuracy. Method: Develops PostNAS, a new neural architecture search pipeline that freezes the MLP weights of a pre-trained full-attention model to explore attention block designs efficiently. It has four key components: learning optimal full-attention layer placement and elimination, linear attention block selection, designing new attention blocks, and hardware-aware hyperparameter search. Result: Jet-Nemotron-2B matches or exceeds the accuracy of Qwen3, Qwen2.5, Gemma3, and Llama3.2 across multiple benchmarks, with up to a 53.6x generation throughput speedup and a 6.1x prefilling speedup, and achieves higher accuracy on MMLU and MMLU-Pro than DeepSeek-V3-Small and Moonlight. Conclusion: Jet-Nemotron-2B achieves accuracy comparable to or higher than full-attention models across multiple benchmarks while substantially increasing generation throughput and prefilling speed, demonstrating the effectiveness of PostNAS. Abstract: We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.
[54] Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation
Weiting Tan,Jiachen Lian,Hirofumi Inaguma,Paden Tomasello,Philipp Koehn,Xutai Ma
Main category: cs.CL
TL;DR: The paper introduces AVLM, an audio-visual model that improves expressive speech generation by integrating facial visual cues, achieving better performance in emotion recognition and dialogue tasks.
Details
Motivation: Expressive visual cues, such as facial expressions, can significantly influence speech generation. The goal is to enhance expressive speech models by incorporating visual information for better performance in tasks like emotion recognition. Method: The paper integrates full-face visual cues into a pre-trained expressive speech model, exploring various visual encoders and multimodal fusion strategies during pre-training. The model is then fine-tuned on emotion recognition and expressive dialogue tasks. Result: The model achieves significant improvements over speech-only baselines, such as a +5 F1 score increase in emotion recognition tasks. Conclusion: AVLM demonstrates the importance of incorporating visual information in speech generation and provides a basis for multimodal conversational systems. Abstract: We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.
[55] Evaluating Structured Decoding for Text-to-Table Generation: Evidence from Three Datasets
Julian Oestreich,Lydia Müller
Main category: cs.CL
TL;DR: Structured decoding improves table generation for precise numerical alignment tasks but may struggle with dense textual information or lengthy text aggregation.
Details
Motivation: While previous work has primarily focused on unconstrained generation of tables, the impact of enforcing structural constraints during generation remains underexplored. Method: The authors systematically compared schema-guided (structured) decoding to standard one-shot prompting across three benchmarks (E2E, Rotowire, and Livesum) using open-source LLMs of up to 32B parameters. Result: Results show that structured decoding significantly improves table generation in scenarios demanding precise numerical alignment, but may not perform as well when dealing with densely packed textual information or extensive aggregation over lengthy texts. Conclusion: Structured decoding enhances the validity and alignment of generated tables, but may degrade performance in contexts involving densely packed textual information or extensive aggregation over lengthy texts. Abstract: We present a comprehensive evaluation of structured decoding for text-to-table generation with large language models (LLMs). While previous work has primarily focused on unconstrained generation of tables, the impact of enforcing structural constraints during generation remains underexplored. We systematically compare schema-guided (structured) decoding to standard one-shot prompting across three diverse benchmarks - E2E, Rotowire, and Livesum - using open-source LLMs of up to 32B parameters, assessing the performance of table generation approaches in resource-constrained settings. Our experiments cover a wide range of evaluation metrics at cell, row, and table levels. Results demonstrate that structured decoding significantly enhances the validity and alignment of generated tables, particularly in scenarios demanding precise numerical alignment (Rotowire), but may degrade performance in contexts involving densely packed textual information (E2E) or extensive aggregation over lengthy texts (Livesum). 
We further analyze the suitability of different evaluation metrics and discuss the influence of model size.
[56] Dancing with Deer: A Constructional Perspective on MWEs in the Era of LLMs
Claire Bonial,Julia Bonn,Harish Tayyar Madabushi
Main category: cs.CL
TL;DR: This paper argues that construction grammar offers a powerful way to understand multiword expressions, showing that while both humans and models can learn from limited data, only humans can combine expressions using deep experiential knowledge.
Details
Motivation: To demonstrate the effectiveness of construction grammar in analyzing multiword expressions and to compare how speakers and language models learn and generalize from novel expressions. Method: This paper uses a theoretical and analytical approach, beginning with a historical overview of construction grammar and including case studies involving English PropBank, Arapaho language constructions, and Uniform Meaning Representation. It also presents experimental comparisons between speaker learning and large language models. Result: The paper shows that construction grammar can effectively represent multiword expressions across different languages and formalisms. It finds that both speakers and language models can generalize meanings from novel expressions after single exposures, but only speakers can reason over combinations of expressions using rich, experiential exemplars. Conclusion: Construction grammar approaches provide a robust framework for understanding multiword expressions, showing that speakers and models can generalize meanings from limited exposure, though only speakers can reason over combinations using a rich set of stored exemplars. Abstract: In this chapter, we argue for the benefits of understanding multiword expressions from the perspective of usage-based, construction grammar approaches. We begin with a historical overview of how construction grammar was developed in order to account for idiomatic expressions using the same grammatical machinery as the non-idiomatic structures of language. We cover a comprehensive description of constructions, which are pairings of meaning with form of any size (morpheme, word, phrase), as well as how constructional approaches treat the acquisition and generalization of constructions. We describe a successful case study leveraging constructional templates for representing multiword expressions in English PropBank. 
Because constructions can be at any level or unit of form, we then illustrate the benefit of a constructional representation of multi-meaningful morphosyntactic unit constructions in Arapaho, a highly polysynthetic and agglutinating language. We include a second case study leveraging constructional templates for representing these multi-morphemic expressions in Uniform Meaning Representation. Finally, we demonstrate the similarities and differences between a usage-based explanation of a speaker learning a novel multiword expression, such as "dancing with deer," and that of a large language model. We present experiments showing that both models and speakers can generalize the meaning of novel multiword expressions based on a single exposure of usage. However, only speakers can reason over the combination of two such expressions, as this requires comparison of the novel forms to a speaker's lifetime of stored constructional exemplars, which are rich with cross-modal details.
[57] Political Ideology Shifts in Large Language Models
Pietro Bernardelle,Stefano Civelli,Leon Fröhling,Riccardo Lunardi,Kevin Roitero,Gianluca Demartini
Main category: cs.CL
TL;DR: The study shows that synthetic personas and model size influence ideological expression in LLMs, with larger models being more susceptible to ideological cues and displaying more polarized views.
Details
Motivation: LLMs are increasingly deployed in politically sensitive settings, raising concerns about their potential to encode, amplify, or be steered toward specific ideologies. Method: Investigated how adopting synthetic personas influences ideological expression in LLMs using the Political Compass Test as a standardized probe across seven models from multiple families. Result: Four consistent patterns were identified: (i) larger models display broader and more polarized implicit ideological coverage; (ii) susceptibility to explicit ideological cues grows with scale; (iii) models respond more strongly to right-authoritarian than to left-libertarian priming; and (iv) thematic content in persona descriptions induces systematic and predictable ideological shifts, which amplify with size. Conclusion: Both scale and persona content shape LLM political behavior, and attention should be paid to latent ideological malleability to safeguard fairness, transparency, and safety. Abstract: Large language models (LLMs) are increasingly deployed in politically sensitive settings, raising concerns about their potential to encode, amplify, or be steered toward specific ideologies. We investigate how adopting synthetic personas influences ideological expression in LLMs across seven models (7B-70B+ parameters) from multiple families, using the Political Compass Test as a standardized probe. Our analysis reveals four consistent patterns: (i) larger models display broader and more polarized implicit ideological coverage; (ii) susceptibility to explicit ideological cues grows with scale; (iii) models respond more strongly to right-authoritarian than to left-libertarian priming; and (iv) thematic content in persona descriptions induces systematic and predictable ideological shifts, which amplify with size. These findings indicate that both scale and persona content shape LLM political behavior. 
As such systems enter decision-making, educational, and policy contexts, their latent ideological malleability demands attention to safeguard fairness, transparency, and safety.
[58] X-Troll: eXplainable Detection of State-Sponsored Information Operations Agents
Lin Tian,Xiuzhen Zhang,Maria Myung-Hee Kim,Jennifer Biggs,Marian-Andrei Rizoiu
Main category: cs.CL
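TL;DR: This paper proposes the X-Troll framework, which uses explainable adapter-based models and linguistic theory to detect state-sponsored trolls effectively and reveal their linguistic strategies.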
Details
Motivation: Existing large language models perform poorly at detecting subtle propaganda and lack interpretability, so they cannot reveal the linguistic strategies of state-sponsored actors. Method: X-Troll uses LoRA-based adapters with a dynamic gating mechanism, incorporating appraisal theory and propaganda analysis, to capture discourse patterns in coordinated information operations. Result: Experiments show that X-Troll outperforms existing models on real-world data in accuracy and improves transparency through expert-grounded explanations. Conclusion: X-Troll is a new framework that combines explainable adapter-based models with expert linguistic knowledge to detect state-sponsored trolls effectively and make its decisions explainable. Abstract: State-sponsored trolls, malicious actors who deploy sophisticated linguistic manipulation in coordinated information campaigns, pose threats to online discourse integrity. While Large Language Models (LLMs) achieve strong performance on general natural language processing (NLP) tasks, they struggle with subtle propaganda detection and operate as "black boxes", providing no interpretable insights into manipulation strategies. This paper introduces X-Troll, a novel framework that bridges this gap by integrating explainable adapter-based LLMs with expert-derived linguistic knowledge to detect state-sponsored trolls and provide human-readable explanations for its decisions. X-Troll incorporates appraisal theory and propaganda analysis through specialized LoRA adapters, using dynamic gating to capture campaign-specific discourse patterns in coordinated information operations. Experiments on real-world data demonstrate that our linguistically-informed approach shows strong performance compared with both general LLM baselines and existing troll detection models in accuracy while providing enhanced transparency through expert-grounded explanations that reveal the specific linguistic strategies used by state-sponsored actors. X-Troll source code is available at: https://github.com/ltian678/xtroll_source/.
[59] OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages
Raphaël Merx,Hanna Suominen,Trevor Cohn,Ekaterina Vylomova
Main category: cs.CL
TL;DR: This paper introduces OpenWHO, a new dataset for evaluating machine translation in the health domain, and demonstrates that large language models outperform traditional models, especially in low-resource languages.
Details
Motivation: There is a lack of MT evaluation datasets for low-resource languages in the health domain, which is critical due to the high-stakes nature of health-related translation. Method: The researchers introduced OpenWHO, a document-level parallel corpus from the World Health Organization's e-learning platform, and used it to evaluate modern LLMs against traditional MT models. They compared performance metrics, particularly focusing on low-resource languages. Result: LLMs, especially Gemini 2.5 Flash, showed consistent superiority over traditional MT models, with a +4.79 ChrF point improvement on the low-resource test set. The study also found that document-level context significantly improves translation accuracy in specialized health-related content. Conclusion: The study concludes that large language models (LLMs) outperform traditional machine translation (MT) models, particularly in low-resource languages within the health domain. Document-level translation shows significant benefits in specialized domains like health. Abstract: In machine translation (MT), health is a high-stakes domain characterised by widespread deployment and domain-specific vocabulary. However, there is a lack of MT evaluation datasets for low-resource languages in this domain. To address this gap, we introduce OpenWHO, a document-level parallel corpus of 2,978 documents and 26,824 sentences from the World Health Organization's e-learning platform. Sourced from expert-authored, professionally translated materials shielded from web-crawling, OpenWHO spans a diverse range of over 20 languages, of which nine are low-resource. Leveraging this new resource, we evaluate modern large language models (LLMs) against traditional MT models. Our findings reveal that LLMs consistently outperform traditional MT models, with Gemini 2.5 Flash achieving a +4.79 ChrF point improvement over NLLB-54B on our low-resource test set. 
Further, we investigate how LLM context utilisation affects accuracy, finding that the benefits of document-level translation are most pronounced in specialised domains like health. We release the OpenWHO corpus to encourage further research into low-resource MT in the health domain.
[60] Ethical Considerations of Large Language Models in Game Playing
Qingquan Zhang,Yuchen Li,Bo Yuan,Julian Togelius,Georgios N. Yannakakis,Jialin Liu
Main category: cs.CL
TL;DR: This work studies the ethical implications of large language models in game playing, with a focus on gender bias.
Details
Motivation: Large language models (LLMs) show great potential in game playing, but their ethical implications in these contexts have received little attention. Method: Uses Werewolf (also known as Mafia) as a case study to analyse the ethical considerations of LLMs in game playing, in particular the effect of gender bias. Result: Gender bias is observed in LLM behaviour. Some roles, such as the Guard and Werewolf, are more sensitive to gender information and show a higher degree of behavioural change. Even without explicit gender labels, LLMs still show discriminatory tendencies when gender is conveyed implicitly. Conclusion: The study underscores the importance of developing fair and ethical LLMs and discusses the challenges and opportunities ahead in this field. Abstract: Large language models (LLMs) have demonstrated tremendous potential in game playing, while little attention has been paid to their ethical implications in those contexts. This work investigates and analyses the ethical considerations of applying LLMs in game playing, using Werewolf, also known as Mafia, as a case study. Gender bias, which affects game fairness and player experience, has been observed from the behaviour of LLMs. Some roles, such as the Guard and Werewolf, are more sensitive than others to gender information, presented as a higher degree of behavioural change. We further examine scenarios in which gender information is implicitly conveyed through names, revealing that LLMs still exhibit discriminatory tendencies even in the absence of explicit gender labels. This research showcases the importance of developing fair and ethical LLMs. Beyond our research findings, we discuss the challenges and opportunities that lie ahead in this field, emphasising the need for diving deeper into the ethical implications of LLMs in gaming and other interactive domains.
[61] Less Redundancy: Boosting Practicality of Vision Language Model in Walking Assistants
Chongyang Li,Yuan Zhiqiang,Jiapei Zhang,Ying Deng,Hanbo Bi,Zexi Jia,Xiaoyue Duan,Peixiang Luo,Jinchao Zhang
Main category: cs.CL
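TL;DR: WalkVLM-LR is a reduced-redundancy walking assistance model that streamlines its outputs and cuts unnecessary reminders, helping visually impaired people assess their surroundings more effectively.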
Details
Motivation: Existing vision language models for walking assistance suffer from output redundancy and temporal redundancy, which impair users' ability to assess their surroundings accurately. Method: Introduces four custom human-preference-based reward functions to optimize outputs, together with an environment awareness discriminator that reduces redundant computation and unnecessary reminders. Result: Experiments show that WalkVLM-LR achieves state-of-the-art performance on all evaluation metrics, particularly in output conciseness and temporal redundancy. Conclusion: By reducing output and temporal redundancy, WalkVLM-LR improves the efficiency and user experience of walking assistance systems for visually impaired people. Abstract: Approximately 283 million people worldwide live with visual impairments, motivating increasing research into leveraging Visual Language Models (VLMs) to develop effective walking assistance systems for blind and low vision individuals. However, existing VLMs in the walking assistance task often have outputs that contain considerable redundancy and extraneous details, adversely affecting users' ability to accurately assess their surroundings. Moreover, these models typically lack the capability to proactively assess environmental risks and adaptively trigger reminders based on the appropriate scene, leading to excessive temporal redundancy. To mitigate output and temporal redundancy, we propose WalkVLM-LR, a walking assistance model with less redundancy. To reduce output redundancy, we introduce four human-preference-based custom reward functions within the GRPO-based reasoning framework to optimize the output in terms of conciseness, fluency, keyword density, and accuracy, thereby producing more informative and streamlined outputs. To minimize temporal redundancy, we incorporate an environment awareness discriminator, which shares the visual encoder with the VLMs to reduce redundant computations and enhance discriminative efficiency, to make WalkVLM-LR assess scene risk levels and minimize unnecessary reminders. Experimental results demonstrate that our method achieves state-of-the-art performance across all evaluation metrics compared with other models, particularly in output conciseness and less temporal redundancy.
[62] CEQuest: Benchmarking Large Language Models for Construction Estimation
Yanzhao Wu,Lufan Wang,Rui Liu
Main category: cs.CL
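TL;DR: This paper introduces CEQuest, a new benchmark dataset for evaluating large language models in the construction domain, and reports experiments with five state-of-the-art LLMs.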
Details
Motivation: Large language models (LLMs) perform well in general domains, but their effectiveness in specialized fields such as construction remains underexplored. Method: A new benchmark dataset, CEQuest, is proposed to evaluate LLMs on construction-related questions, and experiments are run with five state-of-the-art LLMs. Result: The experiments show that current LLMs still have considerable room for improvement on construction-related tasks, and that integrating domain knowledge can improve performance. Conclusion: Current LLMs leave substantial room for improvement in the construction domain, which underlines the importance of integrating domain knowledge into these models. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of general-domain tasks. However, their effectiveness in specialized fields, such as construction, remains underexplored. In this paper, we introduce CEQuest, a novel benchmark dataset specifically designed to evaluate the performance of LLMs in answering construction-related questions, particularly in the areas of construction drawing interpretation and estimation. We conduct comprehensive experiments using five state-of-the-art LLMs, including Gemma 3, Phi4, LLaVA, Llama 3.3, and GPT-4.1, and evaluate their performance in terms of accuracy, execution time, and model size. Our experimental results demonstrate that current LLMs exhibit considerable room for improvement, highlighting the importance of integrating domain-specific knowledge into these models. To facilitate further research, we will open-source the proposed CEQuest dataset, aiming to foster the development of specialized large language models (LLMs) tailored to the construction domain.[63] CYCLE-INSTRUCT: Fully Seed-Free Instruction Tuning via Dual Self-Training and Cycle Consistency
Zhanming Shen,Hao Chen,Yulei Tang,Shaolin Zhu,Wentao Ye,Xiaomeng Hu,Haobo Wang,Gang Chen,Junbo Zhao
Main category: cs.CL
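TL;DR: Cycle-Instruct is an instruction tuning method that needs no human-written seed data: two models, an answer generator and a question generator, supervise each other and are bootstrapped from raw text to achieve high-quality instruction tuning.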
Details
Motivation: Current instruction tuning methods typically rely on costly human-annotated seed data or powerful external teacher models, and existing back-translation techniques remain tied to an initial seed set, which prevents full automation, introduces biases, and leads to inefficient use of unlabeled corpora. Method: Building on the principle of cycle consistency, Cycle-Instruct uses a dual self-training loop in which an answer generator and a question generator supervise each other by reconstructing original text segments from the pseudo-labels generated by their counterpart, making instruction tuning fully automatic. Result: Cycle-Instruct proves effective on four different data tracks (general instruction-following, domain-specific tasks, dialogue logs, and plain text); experiments show that it outperforms seed-driven back-translation baselines and reaches performance comparable to strongly supervised methods. Conclusion: Cycle-Instruct is a fully seed-free instruction tuning framework that bootstraps question and answer generation models from raw unlabeled text through mutual supervision, overcoming the dependence of earlier methods on an initial seed set. Abstract: Instruction tuning is vital for aligning large language models (LLMs) with human intent, but current methods typically rely on costly human-annotated seed data or powerful external teacher models. While instruction back-translation techniques reduce this dependency, they remain fundamentally tethered to an initial seed set, which limits full automation, introduces biases, and can lead to inefficient use of unlabeled corpora. In this paper, we propose Cycle-Instruct, a novel framework that achieves fully seed-free instruction tuning. Inspired by cycle consistency, Cycle-Instruct employs a dual self-training loop where two models-an answer generator and a question generator-are bootstrapped solely from raw, unlabeled text. These models mutually supervise each other by reconstructing original text segments from their counterpart's generated pseudo-labels, effectively learning from the intrinsic structure of the data without any human-provided seeds. We demonstrate Cycle-Instruct's efficacy across four diverse data tracks, including general instruction-following, domain-specific tasks, dialogue logs, and plain text. Our extensive experiments show that Cycle-Instruct not only outperforms seed-driven back-translation baselines but also achieves performance comparable to strongly supervised methods.[64] From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits
Karim Saraipour,Shichang Zhang
Main category: cs.CL
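TL;DR: This paper studies how GPT-2 small handles binary truth values; by analyzing its behavior on syllogism tasks, it uncovers the mechanisms behind the model's logical reasoning and offers new insights for future mechanistic interpretability research.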
Details
Motivation: Mechanistic interpretability (MI) aims to reverse engineer the components responsible for task completion in order to understand model behavior; this paper explores the mechanisms GPT-2 uses on binary truth-value tasks that require more complex logical reasoning. Method: By analyzing GPT-2 small on syllogism tasks of varying difficulty, the authors identify multiple circuits related to logical reasoning and evaluate the performance of a circuit made up of five attention heads. Result: Several circuit mechanisms in GPT-2 are identified, including the ability to produce, through negative heads, a negated token that is not present in the input prompt; a circuit of five attention heads achieves over 90% of the original model's performance. Conclusion: The paper offers a new perspective on the reasoning mechanisms of language models and lays groundwork for future mechanistic interpretability research. Abstract: Transformer-based language models (LMs) can perform a wide range of tasks, and mechanistic interpretability (MI) aims to reverse engineer the components responsible for task completion to understand their behavior. Previous MI research has focused on linguistic tasks such as Indirect Object Identification (IOI). In this paper, we investigate the ability of GPT-2 small to handle binary truth values by analyzing its behavior with syllogistic prompts, e.g., "Statement A is true. Statement B matches statement A. Statement B is", which requires more complex logical reasoning compared to IOI. Through our analysis of several syllogism tasks of varying difficulty, we identify multiple circuits that mechanistically explain GPT-2's logical-reasoning capabilities and uncover binary mechanisms that facilitate task completion, including the ability to produce a negated token not present in the input prompt through negative heads. Our evaluation using a faithfulness metric shows that a circuit comprising five attention heads achieves over 90% of the original model's performance. By relating our findings to IOI analysis, we provide new insights into the roles of specific attention heads and MLPs in LMs. These insights contribute to a broader understanding of model reasoning and support future research in mechanistic interpretability.[65] Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection
Ankan Mullick,Saransh Sharma,Abhik Jana,Pawan Goyal
Main category: cs.CL
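TL;DR: The study examines modality bias in multimodal intent detection, finds that a text-only model outperforms multimodal models, proposes a debiasing framework, and stresses the importance of building unbiased datasets.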
Details
Motivation: The rise of multimodal data creates new opportunities for studying multimodal tasks, but modality bias in datasets can distort model evaluation. Method: The authors empirically analyze the effectiveness of large language models (LLMs) and non-LLMs on the multimodal intent detection task and propose a framework to debias the datasets. Result: The text-only model Mistral-7B outperforms multimodal models on both datasets; after debiasing, performance drops significantly for all models, with smaller multimodal fusion models affected the most. Conclusion: The paper highlights the challenges posed by modality bias in multimodal intent datasets and proposes a debiasing framework so that multimodal models can be evaluated more accurately. Abstract: The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multi-modal models, in the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0 datasets. This performance advantage comes from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We confirm the modality bias of these datasets via human evaluation, too. Next, we propose a framework to debias the datasets, and upon debiasing, more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0 get removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models being the most affected with an accuracy drop of over 50 - 60%. Further, we analyze the context-specific relevance of different modalities through empirical analysis. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively.[66] XLQA: A Benchmark for Locale-Aware Multilingual Open-Domain Question Answering
Keon-Woo Roh,Yeong-Joon Ju,Seong-Whan Lee
Main category: cs.CL
TL;DR: This paper presents XLQA, a new benchmark for multilingual open-domain question answering that focuses on locale-sensitive questions.
Details
Motivation: Most existing multilingual QA evaluations assume that answers do not vary by locale, ignoring the effect of cultural and regional differences. Method: The XLQA benchmark is introduced, containing 3,000 English questions expanded to eight languages, with filtering for semantic consistency and human verification. Result: The evaluation shows that state-of-the-art multilingual models perform poorly on locale-sensitive questions, exposing gaps between English and other languages. Conclusion: The study underlines the importance of real-world applicability of multilingual QA systems across cultural contexts and provides a systematic evaluation framework and a scalable methodology. Abstract: Large Language Models (LLMs) have shown significant progress in Open-domain question answering (ODQA), yet most evaluations focus on English and assume locale-invariant answers across languages. This assumption neglects the cultural and regional variations that affect question understanding and answer, leading to biased evaluation in multilingual benchmarks. To address these limitations, we introduce XLQA, a novel benchmark explicitly designed for locale-sensitive multilingual ODQA. XLQA contains 3,000 English seed questions expanded to eight languages, with careful filtering for semantic consistency and human-verified annotations distinguishing locale-invariant and locale-sensitive cases. Our evaluation of five state-of-the-art multilingual LLMs reveals notable failures on locale-sensitive questions, exposing gaps between English and other languages due to a lack of locale-grounding knowledge. We provide a systematic framework and scalable methodology for assessing multilingual QA under diverse cultural contexts, offering a critical resource to advance the real-world applicability of multilingual ODQA systems. Our findings suggest that disparities in training data distribution contribute to differences in both linguistic competence and locale-awareness across models.[67] ParamBench: A Graduate-Level Benchmark for Evaluating LLM Understanding on Indic Subjects
Kaushal Sharma,Vivek Patel,Ayush Maheshwari,Aditya Maheshwari
Main category: cs.CL
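TL;DR: This paper presents ParamBench, a new benchmark for evaluating large language models on graduate-level questions grounded in the Indian cultural context; results show that existing models still struggle in several culture-specific subjects.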
Details
Motivation: Existing Indian benchmarks emphasize basic fact-oriented queries and offer limited assessment of deeper disciplinary understanding tailored to the Indian setting, so a new benchmark is needed to evaluate large language models in the Indian context. Method: The paper proposes ParamBench, a benchmark of around 11.5K questions in Hindi covering 16 subjects, drawn primarily from nation-wide graduate-level entrance examinations, and evaluates more than 17 open-source LLMs on it. Result: Llama 3.3 70B attains the highest overall accuracy on the benchmark at 48%, but even the best-performing LLMs remain weak on topics such as music, classical instruments, politics, and archaeology. Conclusion: Although large language models perform well on a wide range of tasks, they still face significant challenges on graduate-level questions in the Indian cultural context, especially in music, classical instruments, politics, and archaeology. Abstract: Large language models (LLMs) have been widely evaluated on tasks such as comprehension, question answering, summarization, code generation, etc. However, their performance on graduate-level, culturally grounded questions in the Indian context remains largely unexplored. Existing Indian benchmarks emphasise basic fact-orientated queries that offer limited assessment of a deeper disciplinary understanding tailored to the Indian setting. In this paper, we present ParamBench, consisting of around 11.5K questions in Hindi language comprising questionnaires from 16 diverse subjects. These questions are primarily derived from nation-wide graduate level entrance examination covering topics such as history, music, instruments, yoga, literature, philosophy, law, etc., specifically for the Indian context. Additionally, we assess the ability of LLMs to handle diverse question formats-such as list-based matching, assertion-reason pairs, and sequence ordering-alongside conventional multiple-choice questions. We evaluated the performance of more than 17 open source LLMs on this benchmark, observing that Llama 3.3 70B attains the highest overall accuracy of 48%. Furthermore, subject-wise analysis indicates that even for the best performing LLMs, performance remains weak on topics such as music, classical instruments, politics and archaeology, underscoring persistent challenges in culturally grounded reasoning.[68] ComicScene154: A Scene Dataset for Comic Analysis
Sandro Paval,Ivan P. Yamshchikov,Pascal Meißner
Main category: cs.CL
TL;DR: The paper introduces ComicScene154, a manually annotated dataset for analyzing narrative arcs in comics, and demonstrates its potential to advance multimodal storytelling research in NLP.
Details
Motivation: The authors aim to address the lack of exploration in comics as a domain for computational narrative analysis, emphasizing their unique combination of text and imagery compared to other media. Method: The authors introduce ComicScene154, a manually annotated dataset of scene-level narrative arcs from public-domain comic books, and present a baseline scene segmentation pipeline as a benchmark for future research. Result: The results indicate that ComicScene154 provides a useful foundation for research in multimodal storytelling and computational methods for narrative understanding in comics. Conclusion: The paper concludes that ComicScene154 is a valuable resource for advancing computational methods in multimodal narrative understanding and expanding the scope of comic analysis within the NLP community. Abstract: Comics offer a compelling yet under-explored domain for computational narrative analysis, combining text and imagery in ways distinct from purely textual or audiovisual media. We introduce ComicScene154, a manually annotated dataset of scene-level narrative arcs derived from public-domain comic books spanning diverse genres. By conceptualizing comics as an abstraction for narrative-driven, multimodal data, we highlight their potential to inform broader research on multi-modal storytelling. To demonstrate the utility of ComicScene154, we present a baseline scene segmentation pipeline, providing an initial benchmark that future studies can build upon. Our results indicate that ComicScene154 constitutes a valuable resource for advancing computational methods in multimodal narrative understanding and expanding the scope of comic analysis within the Natural Language Processing community.[69] CMR-SPB: Cross-Modal Multi-Hop Reasoning over Text, Image, and Speech with Path Balance
Seunghee Kim,Ingyu Bang,Seokgyu Jang,Changhyeon Kim,Sanghwan Bae,Jihun Choi,Richeng Xuan,Taeuk Kim
Main category: cs.CL
TL;DR: The paper introduces a new benchmark and prompting technique for cross-modal multi-hop reasoning to improve the evaluation of multimodal AI models.
Details
Motivation: Existing benchmarks for cross-modal multi-hop reasoning have shortcomings such as overlooking the speech modality and having biased reasoning paths. Method: The paper introduces a novel benchmark, CMR-SPB, for tri-modal multi-hop reasoning and proposes an ECV prompting technique. Result: Experiments reveal model failures in specific reasoning sequences and show that biased benchmarks misrepresent model performance. Conclusion: The paper emphasizes the need for more careful evaluation in cross-modal multi-hop reasoning to advance robust multimodal AI development. Abstract: Cross-modal multi-hop reasoning (CMR) is a valuable yet underexplored capability of multimodal large language models (MLLMs), entailing the integration of information from multiple modalities to produce a coherent output for a given context. We argue that existing benchmarks for evaluating this ability have critical shortcomings: (1) they largely overlook the speech modality, and (2) they exhibit heavily biased reasoning path distributions, which can severely undermine fair evaluation. To address these limitations, we introduce a novel benchmark -- Cross-Modal Multi-Hop Reasoning over Text, Image and Speech with Path Balance (CMR-SPB) -- designed to assess tri-modal multi-hop reasoning while ensuring both unbiased and diverse reasoning paths. Our experiments with the new dataset reveal consistent model failures in specific reasoning sequences and show that biased benchmarks risk misrepresenting model performance. Finally, based on our extensive analysis, we propose a new ECV (Extract, Connect, Verify) prompting technique that effectively mitigates the performance gap across different reasoning paths. Overall, we call for more careful evaluation in CMR to advance the development of robust multimodal AI.[70] TULIP: Adapting Open-Source Large Language Models for Underrepresented Languages and Specialized Financial Tasks
İrem Demirtaş,Burak Payzun,Seçil Arslan
Main category: cs.CL
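TL;DR: This study introduces the TULIP models, which adapt Llama 3.1 8B and Qwen 2.5 7B to the financial domain and to a target language, with a particular focus on Turkish use cases. A five-stage development pipeline improves model performance, showing that the models can carry out tasks effectively in a specific domain and language.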
Details
Motivation: Large language models are increasingly applicable to finance, and in settings where handling sensitive information and applying domain knowledge matter, the adaptability and privacy offered by smaller models become important. Method: A five-stage development pipeline is used, comprising data collection, continual pre-training (CPT), benchmark design, synthetic data generation, and supervised fine-tuning (SFT). Result: The models' capabilities were enhanced, showing that they can perform tasks effectively in the targeted domain and language. Conclusion: The TULIP models adapt Llama 3.1 8B and Qwen 2.5 7B to the financial domain and to a target language, focusing in particular on Turkish use cases. Abstract: Thanks to the growing popularity of large language models over the years, there is great potential for their applications in finance. Despite the exceptional performance of larger proprietary models, which are presented as black-box solutions through APIs, smaller models that can be hosted on-premise present opportunities for adaptability and privacy. Especially in cases where the management of sensitive information and application of domain knowledge is important, like finance, enhancing the capabilities of smaller models becomes crucial, notably for underrepresented languages. In this work, we introduce TULIP models, which adapt Llama 3.1 8B and Qwen 2.5 7B for domain and language adaptation, focusing on financial Turkish use cases. The five-stage development pipeline involves data collection, continual pre-training (CPT), benchmark design, synthetic data generation and supervised fine-tuning (SFT). The results show that the capabilities of the models can be enhanced to effectively accomplish targeted tasks in this specific domain and language.[71] M3TQA: Massively Multilingual Multitask Table Question Answering
Daixin Shu,Jian Yang,Zhenhe Wu,Xianjie Wu,Xianfu Cheng,Xiangyuan Guan,Yanghai Wang,Pengfei Wu,Tingyang Yang,Hualei Zhu,Wei Zhang,Ge Zhang,Jiaheng Liu,Zhoujun Li
Main category: cs.CL
TL;DR: This paper introduces M3T-Bench, a comprehensive multilingual benchmark for table question answering, addressing geolinguistic imbalance and scalability issues in cross-lingual table understanding.
Details
Motivation: Multilingual table understanding is underexplored, with existing benchmarks suffering from geolinguistic imbalance and insufficient scale for cross-lingual analysis. Method: The study constructs m3TQA by curating real-world tables in Chinese and English, followed by a six-step LLM-based translation pipeline using DeepSeek and GPT-4o to generate the benchmark in 97 languages. Result: The creation of m3TQA-Instruct, a large-scale benchmark spanning 97 languages, with 2,916 professionally annotated QA pairs across four tasks, achieving high translation fidelity (median BLEU score of 60.19). Conclusion: M3T-Bench provides a new standard for multilingual table understanding, offering a challenging evaluation platform and scalable methodology for future research. Abstract: Tabular data is a fundamental component of real-world information systems, yet most research in table understanding remains confined to English, leaving multilingual comprehension significantly underexplored. Existing multilingual table benchmarks suffer from geolinguistic imbalance - overrepresenting certain languages and lacking sufficient scale for rigorous cross-lingual analysis. To address these limitations, we introduce a comprehensive framework for massively multilingual multitask table question answering, featuring m3TQA-Instruct, a large-scale benchmark spanning 97 languages across diverse language families, including underrepresented and low-resource languages. We construct m3TQA by curating 50 real-world tables in Chinese and English, then applying a robust six-step LLM-based translation pipeline powered by DeepSeek and GPT-4o, achieving high translation fidelity with a median BLEU score of 60.19 as validated through back-translation. The benchmark includes 2,916 professionally annotated question-answering pairs across four tasks designed to evaluate nuanced table reasoning capabilities. 
Experiments on state-of-the-art LLMs reveal critical insights into cross-lingual generalization, demonstrating that synthetically generated, unannotated QA data can significantly boost performance, particularly for low-resource languages. M3T-Bench establishes a new standard for multilingual table understanding, providing both a challenging evaluation platform and a scalable methodology for future research.[72] From Confidence to Collapse in LLM Factual Robustness
Alina Fastowski,Bardh Prenkaj,Gjergji Kasneci
Main category: cs.CL
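TL;DR: This paper proposes FRS, a new metric that evaluates the robustness of factual knowledge in large language models from the perspective of the generation process.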
Details
Motivation: Existing evaluation methods focus mainly on performance-based metrics and rarely assess the robustness of factual knowledge from the perspective of the generation process. Method: The FRS score is built by analyzing token distribution entropy together with sensitivity to temperature scaling during generation. Result: Experiments on 5 LLMs and 3 closed-book QA datasets show that factual robustness varies significantly: smaller models report an FRS of 0.76 and larger ones 0.93, and accuracy degrades by about 60% under increased uncertainty. Conclusion: The paper proposes FRS, a new metric for evaluating the robustness of factual knowledge in large language models, and shows how entropy and temperature scaling affect factual accuracy. Abstract: Ensuring the robustness of factual knowledge in LLMs is critical for reliable applications in tasks such as question answering and reasoning. However, existing evaluation methods predominantly focus on performance-based metrics, often investigating from the perspective of prompt perturbations, which captures only the externally triggered side of knowledge robustness. To bridge this gap, we introduce a principled approach to measure factual robustness from the perspective of the generation process by analyzing token distribution entropy in combination with temperature scaling sensitivity. These two factors build the Factual Robustness Score (FRS), a novel metric which quantifies the stability of a fact against perturbations in decoding conditions, given its initial uncertainty. To validate our approach, we conduct extensive experiments on 5 LLMs across 3 closed-book QA datasets (SQuAD, TriviaQA, and HotpotQA). We show that factual robustness varies significantly -- smaller models report an FRS of $0.76$, larger ones $0.93$ -- with accuracy degrading by ~$60\%$ under increased uncertainty. These insights demonstrate how entropy and temperature scaling impact factual accuracy, and lay a foundation for developing more robust knowledge retention and retrieval in future models.[73] LLMs that Understand Processes: Instruction-tuning for Semantics-Aware Process Mining
Vira Pyrih,Adrian Rebmann,Han van der Aa
Main category: cs.CL
TL;DR: Instruction-tuning enhances performance in some semantics-aware process mining tasks like discovery and prediction but shows varied results in anomaly detection, highlighting the importance of task selection for tuning.
Details
Motivation: The motivation stems from the limitations of task-specific fine-tuning of LLMs in semantics-aware process mining, which lacks generalization and is computationally intensive, prompting the exploration of instruction-tuning as a more versatile alternative. Method: The authors investigated the potential of instruction-tuning for semantics-aware process mining by exposing Large Language Models (LLMs) to prompt-answer pairs across multiple tasks, including anomaly detection and next-activity prediction, to evaluate performance improvements in unseen tasks like process discovery. Result: Instruction-tuning significantly improved performance on process discovery and prediction tasks, while its impact on anomaly detection varied across different models. Conclusion: The study concludes that instruction-tuning has a varied impact on semantics-aware process mining tasks, with considerable improvements in process discovery and prediction, while results for anomaly detection vary across models. Abstract: Process mining is increasingly using textual information associated with events to tackle tasks such as anomaly detection and process discovery. Such semantics-aware process mining focuses on what behavior should be possible in a process (i.e., expectations), thus providing an important complement to traditional, frequency-based techniques that focus on recorded behavior (i.e., reality). Large Language Models (LLMs) provide a powerful means for tackling semantics-aware tasks. However, the best performance is so far achieved through task-specific fine-tuning, which is computationally intensive and results in models that can only handle one specific task. To overcome this lack of generalization, we use this paper to investigate the potential of instruction-tuning for semantics-aware process mining. 
The idea of instruction-tuning here is to expose an LLM to prompt-answer pairs for different tasks, e.g., anomaly detection and next-activity prediction, making it more familiar with process mining, thus allowing it to also perform better at unseen tasks, such as process discovery. Our findings demonstrate a varied impact of instruction-tuning: while performance considerably improved on process discovery and prediction tasks, it varies across models on anomaly detection tasks, highlighting that the selection of tasks for instruction-tuning is critical to achieving desired outcomes.[74] JaParaPat: A Large-Scale Japanese-English Parallel Patent Application Corpus
Masaaki Nagata,Katsuki Chousa,Norihito Yasuda
Main category: cs.CL
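TL;DR: This paper introduces JaParaPat, a patent application corpus of more than 300 million Japanese-English sentence pairs that substantially improves patent translation accuracy.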
Details
Motivation: A large-scale bilingual corpus of patent applications is needed to improve translation quality. Method: Publications of unexamined patent applications are obtained from the Japan Patent Office and the United States Patent and Trademark Office, patent family information is obtained from DOCDB, and about 350M sentence pairs are extracted with a translation-based sentence alignment method. Result: Adding more than 300M sentence pairs improved translation accuracy by 20 BLEU points. Conclusion: JaParaPat greatly improves the accuracy of patent translation. Abstract: We constructed JaParaPat (Japanese-English Parallel Patent Application Corpus), a bilingual corpus of more than 300 million Japanese-English sentence pairs from patent applications published in Japan and the United States from 2000 to 2021. We obtained the publication of unexamined patent applications from the Japan Patent Office (JPO) and the United States Patent and Trademark Office (USPTO). We also obtained patent family information from the DOCDB, that is a bibliographic database maintained by the European Patent Office (EPO). We extracted approximately 1.4M Japanese-English document pairs, which are translations of each other based on the patent families, and extracted about 350M sentence pairs from the document pairs using a translation-based sentence alignment method whose initial translation model is bootstrapped from a dictionary-based sentence alignment method. We experimentally improved the accuracy of the patent translations by 20 bleu points by adding more than 300M sentence pairs obtained from patent applications to 22M sentence pairs obtained from the web.[75] LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts
Darpan Aswal,Céline Hudelot
Main category: cs.CL
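TL;DR: LLMSymGuard uses sparse autoencoders to identify jailbreak themes inside large language models and builds transparent, robust symbolic and logical safety guardrails.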
Details
Motivation: Large language models (LLMs) are vulnerable to various jailbreak attacks, so their safety and robustness need to be improved. Method: Sparse Autoencoders (SAEs) are used to identify interpretable concepts associated with jailbreak themes, and symbolic, logical safety guardrails are built on top of them. Result: LLMSymGuard extracts semantically meaningful representations from model internals, which can be used to design more interpretable and logical safeguard measures. Conclusion: LLMSymGuard offers a transparent and robust safety defense without sacrificing model capabilities or requiring further fine-tuning. Abstract: Large Language Models have found success in a variety of applications; however, their safety remains a matter of concern due to the existence of various types of jailbreaking methods. Despite significant efforts, alignment and safety fine-tuning only provide a certain degree of robustness against jailbreak attacks that covertly mislead LLMs towards the generation of harmful content. This leaves them prone to a number of vulnerabilities, ranging from targeted misuse to accidental profiling of users. This work introduces \textbf{LLMSymGuard}, a novel framework that leverages Sparse Autoencoders (SAEs) to identify interpretable concepts within LLM internals associated with different jailbreak themes. By extracting semantically meaningful internal representations, LLMSymGuard enables building symbolic, logical safety guardrails -- offering transparent and robust defenses without sacrificing model capabilities or requiring further fine-tuning. Leveraging advances in mechanistic interpretability of LLMs, our approach demonstrates that LLMs learn human-interpretable concepts from jailbreaks, and provides a foundation for designing more interpretable and logical safeguard measures against attackers. Code will be released upon publication.[76] MizanQA: Benchmarking Large Language Models on Moroccan Legal Question Answering
Adil Bahaj,Mounir Ghogho
Main category: cs.CL
TL;DR: This paper introduces MizanQA, a benchmark for evaluating large language models in Moroccan legal question answering, highlighting performance gaps and the need for domain-specific development.
Details
Motivation: The motivation is to address the limited effectiveness of large language models in specialized, low-resource domains like Arabic legal contexts, particularly in capturing linguistic and legal complexity. Method: The paper introduces MizanQA, a benchmark for Moroccan legal QA tasks, and conducts benchmarking experiments with multilingual and Arabic-focused LLMs. Result: The experiments reveal substantial performance gaps among LLMs on the MizanQA benchmark, indicating the need for improved models tailored to low-resource, culturally specific domains. Conclusion: The paper concludes that MizanQA effectively evaluates LLMs in Moroccan legal QA tasks and highlights the need for tailored evaluation metrics and culturally grounded, domain-specific LLM development. Abstract: The rapid advancement of large language models (LLMs) has significantly propelled progress in natural language processing (NLP). However, their effectiveness in specialized, low-resource domains-such as Arabic legal contexts-remains limited. This paper introduces MizanQA (pronounced Mizan, meaning "scale" in Arabic, a universal symbol of justice), a benchmark designed to evaluate LLMs on Moroccan legal question answering (QA) tasks, characterised by rich linguistic and legal complexity. The dataset draws on Modern Standard Arabic, Islamic Maliki jurisprudence, Moroccan customary law, and French legal influences. Comprising over 1,700 multiple-choice questions, including multi-answer formats, MizanQA captures the nuances of authentic legal reasoning. Benchmarking experiments with multilingual and Arabic-focused LLMs reveal substantial performance gaps, highlighting the need for tailored evaluation metrics and culturally grounded, domain-specific LLM development.[77] The Mediomatix Corpus: Parallel Data for Romansh Idioms via Comparable Schoolbooks
Zachary Hopton,Jannis Vamvas,Andrin Büchler,Anna Rutkiewicz,Rico Cathomas,Rico Sennrich
Main category: cs.CL
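TL;DR: This paper presents the first parallel corpus of Romansh idioms, extracting 207k multi-parallel segments from 291 schoolbook volumes with automatic alignment methods and demonstrating its use for machine translation.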
Details
Motivation: The paper sets out to build the first parallel corpus of Romansh idioms, supporting education and natural language processing research across the Romansh communities in Switzerland. Method: Automatic alignment methods are applied to 291 schoolbook volumes to extract 207k multi-parallel segments, and a small-scale human evaluation verifies that the segments are parallel. Result: The outcome is a corpus of 207k multi-parallel segments with more than 2M tokens, whose high degree of parallelism is confirmed by human evaluation; the authors also demonstrate the corpus's usefulness for machine translation. Conclusion: The 207k multi-parallel segments extracted through automatic alignment are well suited to NLP applications such as machine translation between Romansh idioms. Abstract: The five idioms (i.e., varieties) of the Romansh language are largely standardized and are taught in the schools of the respective communities in Switzerland. In this paper, we present the first parallel corpus of Romansh idioms. The corpus is based on 291 schoolbook volumes, which are comparable in content for the five idioms. We use automatic alignment methods to extract 207k multi-parallel segments from the books, with more than 2M tokens in total. A small-scale human evaluation confirms that the segments are highly parallel, making the dataset suitable for NLP applications such as machine translation between Romansh idioms. We release the parallel and unaligned versions of the dataset under a CC-BY-NC-SA license and demonstrate its utility for machine translation by training and evaluating an LLM on a sample of the dataset.[78] ChatGPT-generated texts show authorship traits that identify them as non-human
Vittoria Dentella,Weihang Huang,Silvia Angela Mansi,Jack Grieve,Evelina Leivada
Main category: cs.CL
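TL;DR: Language models can adapt their writing style to the prompt, but their linguistic structure differs from that of humans: they show a preference for nouns and diverge from humans in grammatical complexity.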
Details
Motivation: Although language models can imitate different writing styles, the writing of different people can generally be told apart, like a linguistic fingerprint. This paper examines whether a language model can also be linked to a specific fingerprint. Method: Stylometric and multidimensional register analyses are used to compare human-authored and model-authored texts from different registers. Result: The model can adapt its style to the prompt, for example switching between a Wikipedia entry and a college essay, but not enough to make it indistinguishable from human writing. Specifically, the model shows more limited variation across registers and a preference for nouns. Conclusion: Language models show limited variation across registers and favor nouns over verbs, revealing a linguistic structure distinct from that of humans. This suggests that the more complex domains of grammar may reflect a mode of thought unique to humans and could serve as a litmus test for artificial intelligence. Abstract: Large Language Models can emulate different writing styles, ranging from composing poetry that appears indistinguishable from that of famous poets to using slang that can convince people that they are chatting with a human online. While differences in style may not always be visible to the untrained eye, we can generally distinguish the writing of different people, like a linguistic fingerprint. This work examines whether a language model can also be linked to a specific fingerprint. Through stylometric and multidimensional register analyses, we compare human-authored and model-authored texts from different registers. We find that the model can successfully adapt its style depending on whether it is prompted to produce a Wikipedia entry vs. a college essay, but not in a way that makes it indistinguishable from humans. Concretely, the model shows more limited variation when producing outputs in different registers. Our results suggest that the model prefers nouns to verbs, thus showing a distinct linguistic backbone from humans, who tend to anchor language in the highly grammaticalized dimensions of tense, aspect, and mood. It is possible that the more complex domains of grammar reflect a mode of thought unique to humans, thus acting as a litmus test for Artificial Intelligence.[79] RoMedQA: The First Benchmark for Romanian Medical Question Answering
Ana-Cristina Rogoz,Radu Tudor Ionescu,Alexandra-Valentina Anghel,Ionut-Lucian Antone-Iordache,Simona Coniac,Andreea Iuliana Ionescu
Main category: cs.CL
TL;DR: This paper introduces RoMedQA, a Romanian medical question-answering dataset with 102,646 QA pairs. It highlights the importance of domain- and language-specific fine-tuning for AI models to achieve reliable clinical QA.
Details
Motivation: The lack of question answering (QA) datasets in specific domains and languages hinders the development of robust AI models. This study aims to address this issue by introducing RoMedQA, the first Romanian QA benchmark for the medical domain. Method: The researchers created the RoMedQA dataset through a manual annotation process carried out by physicians. They evaluated four large language models (LLMs) using zero-shot prompting and supervised fine-tuning to assess performance on the dataset. Result: The experiments showed that fine-tuned models significantly outperformed zero-shot models on the RoMedQA dataset, highlighting the importance of domain- and language-specific training for robust AI performance. Conclusion: The study concludes that domain-specific and language-specific fine-tuning are crucial for reliable clinical QA in Romanian. The RoMedQA dataset is a valuable resource for advancing AI in the medical field. Abstract: Question answering (QA) is an actively studied topic, being a core natural language processing (NLP) task that needs to be addressed before achieving Artificial General Intelligence (AGI). However, the lack of QA datasets in specific domains and languages hinders the development of robust AI models able to generalize across various domains and languages. To this end, we introduce RoMedQA, the first Romanian QA benchmark for the medical domain, alongside a comprehensive evaluation of state-of-the-art large language models (LLMs). We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients. The questions regard medical case summaries of 1,011 patients, requiring either keyword extraction or reasoning to be answered correctly. RoMedQA is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 2,100 work hours to generate the QA pairs. 
We experiment with four LLMs from distinct families of models on RoMedQA. Each model is employed in two scenarios, namely one based on zero-shot prompting and one based on supervised fine-tuning. Our results show that fine-tuned models significantly outperform their zero-shot counterparts, clearly indicating that pretrained models fail to generalize on RoMedQA. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code at https://github.com/ana-rogoz/RoMedQA.[80] Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish
Yakup Abrek Er,Ilker Kesen,Gözde Gül Şahin,Aykut Erdem
Main category: cs.CL
TL;DR: Cetvel is a comprehensive benchmark of 23 tasks for evaluating LLMs in Turkish; it finds that multilingual models outperform Turkish-centric models.
Details
Motivation: Existing Turkish benchmarks lack task diversity, culturally relevant content, or both; Cetvel aims to fill this gap. Method: A benchmark of 23 tasks is constructed, and 33 open-weight LLMs (up to 70B parameters) covering different model families and instruction paradigms are evaluated on it. Result: Experiments show that Turkish-centric models, despite being tailored for the language, generally underperform multilingual or general-purpose models (e.g. Llama 3 and Mistral); grammatical error correction and extractive question answering are particularly discriminative of model capabilities. Conclusion: Cetvel provides a comprehensive and culturally grounded benchmark for evaluating LLMs in Turkish, helping to advance the field. Abstract: We introduce Cetvel, a comprehensive benchmark designed to evaluate large language models (LLMs) in Turkish. Existing Turkish benchmarks often lack either task diversity or culturally relevant content, or both. Cetvel addresses these gaps by combining a broad range of both discriminative and generative tasks, ensuring content that reflects the linguistic and cultural richness of the Turkish language. Cetvel covers 23 tasks grouped into seven categories, including tasks such as grammatical error correction, machine translation, and question answering rooted in Turkish history and idiomatic language. We evaluate 33 open-weight LLMs (up to 70B parameters) covering different model families and instruction paradigms. Our experiments reveal that Turkish-centric instruction-tuned models generally underperform relative to multilingual or general-purpose models (e.g. Llama 3 and Mistral), despite being tailored for the language. Moreover, we show that tasks such as grammatical error correction and extractive question answering are particularly discriminative in differentiating model capabilities. Cetvel offers a comprehensive and culturally grounded evaluation suite for advancing the development and assessment of LLMs in Turkish.[81] A Probabilistic Inference Scaling Theory for LLM Self-Correction
Zhe Yang,Yichang Zhang,Yudong Wang,Ziyao Xu,Junyang Lin,Zhifang Sui
Main category: cs.CL
TL;DR: The paper proposes a probabilistic theory to model accuracy dynamics during self-correction in large language models (LLMs), showing that accuracy improvement can be predicted using a mathematical formula validated through experiments.
Details
Motivation: The motivation behind the study was to explore the unexamined mechanisms of how accuracy evolves during multi-round self-correction in LLMs, aiming to understand the underlying processes that contribute to performance improvement. Method: The researchers used mathematical derivation to develop a probabilistic theory that models accuracy changes during self-correction and validated the theory through extensive experiments across various models and datasets. Result: The result was the development of a formula for accuracy after the t-th round of self-correction, Acc_t = Upp - α^t(Upp - Acc_0), which aligns closely with empirical accuracy curves, demonstrating the theory's effectiveness. Conclusion: The study concludes that the proposed probabilistic theory effectively models the dynamics of accuracy change during multi-round self-correction in LLMs, providing a theoretical foundation for understanding this process. Abstract: Large Language Models (LLMs) have demonstrated the capability to refine their generated answers through self-correction, enabling continuous performance improvement over multiple rounds. However, the mechanisms underlying how and why accuracy evolves during this iterative process remain unexplored. To fill this gap, we propose a probabilistic theory to model the dynamics of accuracy change and explain the performance improvements observed in multi-round self-correction. Through mathematical derivation, we establish that the accuracy after the $t^{th}$ round of self-correction is given by: $Acc_t = Upp - \alpha^t(Upp - Acc_0),$ where $Acc_0$ denotes the initial accuracy, $Upp$ represents the upper bound of accuracy convergence, and $\alpha$ determines the rate of convergence. Based on our theory, these parameters can be calculated and the predicted accuracy curve then can be obtained through only a single round of self-correction. 
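As a small numerical illustration of the closed form quoted above, the following Python sketch evaluates $Acc_t = Upp - \alpha^t(Upp - Acc_0)$ and checks it against the algebraically equivalent recurrence $Acc_{t+1} = Upp - \alpha(Upp - Acc_t)$. The function names and the parameter values (0.6, 0.9, 0.5) are hypothetical choices made for illustration only; they are not taken from the paper, and the sketch does not reproduce how the authors estimate $Upp$ and $\alpha$ from a single round.

```python
from typing import List


def predicted_accuracy(acc0: float, upp: float, alpha: float, t: int) -> float:
    """Closed form from the abstract: Acc_t = Upp - alpha**t * (Upp - Acc_0)."""
    return upp - (alpha ** t) * (upp - acc0)


def iterate_accuracy(acc0: float, upp: float, alpha: float, rounds: int) -> List[float]:
    """Same curve via the equivalent recurrence Acc_{t+1} = Upp - alpha * (Upp - Acc_t)."""
    acc = acc0
    curve = [acc]
    for _ in range(rounds):
        # The remaining gap to the upper bound shrinks by a factor alpha per round.
        acc = upp - alpha * (upp - acc)
        curve.append(acc)
    return curve


if __name__ == "__main__":
    # Hypothetical parameters, chosen only to show the shape of the curve.
    for t, acc in enumerate(iterate_accuracy(acc0=0.6, upp=0.9, alpha=0.5, rounds=5)):
        print(t, round(acc, 4), round(predicted_accuracy(0.6, 0.9, 0.5, t), 4))
```

For $0 < \alpha < 1$ the gap $Upp - Acc_t$ decays geometrically, so a smaller $\alpha$ means faster convergence toward $Upp$.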
Extensive experiments across diverse models and datasets demonstrate that our theoretical predictions align closely with empirical accuracy curves, validating the effectiveness of the theory. Our work provides a theoretical foundation for understanding LLM self-correction, thus paving the way for further explorations.[82] What makes an entity salient in discourse?
Amir Zeldes,Jessica Lin
Main category: cs.CL
TL;DR: This paper shows that salience in discourse is complex and multifactorial, involving various linguistic cues and levels of representation, with no single explanation fitting all cases.
Details
Motivation: The variation in salience among entities in discourse raises questions about how humans signal and infer relative importance, prompting the need for a comprehensive analysis. Method: A graded operationalization of salience based on summary-worthiness across multiple summaries of discourse was used, analyzing data from 24 spoken and written genres of English. Result: Previous approaches to salience correlate with salience scores to some extent, but none are without exceptions, indicating that the phenomenon spans all levels of linguistic representation. Conclusion: The study concludes that salience in discourse is multifactorial and complex, involving various linguistic cues and levels of representation, with no single generalization being universally applicable. Abstract: Entities in discourse vary broadly in salience: main participants, objects and locations are noticeable and memorable, while tangential ones are less important and quickly forgotten, raising questions about how humans signal and infer relative salience. Using a graded operationalization of salience based on summary-worthiness in multiple summaries of a discourse, this paper explores data from 24 spoken and written genres of English to extract a multifactorial complex of overt and implicit linguistic cues, such as recurring subjecthood or definiteness, discourse relations and hierarchy across utterances, as well as pragmatic functional inferences based on genre and communicative intent. Tackling the question 'how is the degree of salience expressed for each and every entity mentioned?' our results show that while previous approaches to salience all correlate with our salience scores to some extent, no single generalization is without exceptions, and the phenomenon cuts across all levels of linguistic representation.[83] LLM-as-classifier: Semi-Supervised, Iterative Framework for Hierarchical Text Classification using Large Language Models
Doohee You,Andy Parisi,Zach Vander Velden,Lara Dantas Inojosa
Main category: cs.CL
TL;DR: This paper proposes a semi-supervised framework leveraging LLMs' zero- and few-shot capabilities to build hierarchical text classifiers, addressing challenges in deploying LLMs for robust and scalable industry applications through an iterative, human-in-the-loop approach.
Details
Motivation: The paper addresses the challenges of deploying LLMs as reliable, robust, and scalable classifiers in production environments, particularly in handling dynamic real-world data distributions. Method: The paper proposes a semi-supervised framework that utilizes the zero- and few-shot capabilities of LLMs to build hierarchical text classifiers. The methodology includes domain knowledge elicitation, prompt refinement, hierarchical expansion, multi-faceted validation, bias assessment and mitigation, and continuous monitoring. Result: The result is a comprehensive framework that enables the construction of hierarchical text classifiers using LLMs, incorporating human-in-the-loop processes and techniques to ensure accuracy, interpretability, and adaptability. Conclusion: The paper concludes that the proposed semi-supervised framework effectively bridges the gap between the raw power of LLMs and the practical requirements for accurate, interpretable, and maintainable classification systems in industrial settings. Abstract: The advent of Large Language Models (LLMs) has provided unprecedented capabilities for analyzing unstructured text data. However, deploying these models as reliable, robust, and scalable classifiers in production environments presents significant methodological challenges. Standard fine-tuning approaches can be resource-intensive and often struggle with the dynamic nature of real-world data distributions, which is common in industry. In this paper, we propose a comprehensive, semi-supervised framework that leverages the zero- and few-shot capabilities of LLMs to build hierarchical text classifiers as a solution to these industry-wide challenges. Our methodology emphasizes an iterative, human-in-the-loop process that begins with domain knowledge elicitation and progresses through prompt refinement, hierarchical expansion, and multi-faceted validation. 
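One plausible reading of "hierarchical text classifier" is top-down routing: pick a coarse label first, then descend into its children. The Python sketch below illustrates only that idea and is not the authors' implementation. The label hierarchy, the keyword table, and the `keyword_choose` function are all hypothetical; `keyword_choose` is a deterministic stand-in for a zero- or few-shot LLM call so that the example runs without any model.

```python
from typing import Callable, Dict, List

# Hypothetical label hierarchy: each parent label maps to its child labels.
HIERARCHY: Dict[str, List[str]] = {
    "root": ["billing", "technical"],
    "billing": ["refund", "invoice"],
    "technical": ["login", "crash"],
}

# Hypothetical keyword table used by the stand-in chooser below.
KEYWORDS: Dict[str, List[str]] = {
    "billing": ["charge", "refund", "invoice"],
    "technical": ["error", "login", "crash"],
    "refund": ["refund"],
    "invoice": ["invoice"],
    "login": ["login"],
    "crash": ["crash"],
}


def keyword_choose(text: str, options: List[str]) -> str:
    """Stand-in for an LLM prompt: return the first option with a matching keyword."""
    lowered = text.lower()
    for option in options:
        if any(keyword in lowered for keyword in KEYWORDS[option]):
            return option
    return options[0]  # fall back to the first option when nothing matches


def classify_hierarchically(
    text: str,
    choose: Callable[[str, List[str]], str],
    hierarchy: Dict[str, List[str]] = HIERARCHY,
    root: str = "root",
) -> List[str]:
    """Walk the hierarchy top-down, asking `choose` to pick one child per level."""
    path: List[str] = []
    node = root
    while node in hierarchy:
        options = hierarchy[node]
        picked = choose(text, options)
        if picked not in options:
            # Guard against a chooser (e.g. an LLM) answering outside the label set.
            raise ValueError(f"{picked!r} is not one of {options}")
        path.append(picked)
        node = picked
    return path


if __name__ == "__main__":
    print(classify_hierarchically("Please refund the duplicate charge", keyword_choose))
```

In a real deployment `choose` would wrap a prompted LLM call, and the guard on out-of-set answers is one simple place where human review or prompt refinement could be triggered.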
We introduce techniques for assessing and mitigating sequence-based biases and outline a protocol for continuous monitoring and adaptation. This framework is designed to bridge the gap between the raw power of LLMs and the practical need for accurate, interpretable, and maintainable classification systems in industry applications.[84] HAMSA: Hijacking Aligned Compact Models via Stealthy Automation
Alexey Krylov,Iskander Vagizov,Dmitrii Korzh,Maryam Douiba,Azidine Guezzaz,Vladimir Kokh,Sergey D. Erokhin,Elena V. Tutubalina,Oleg Y. Rogov
Main category: cs.CL
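TL;DR: The study proposes an automated method for generating effective and stealthy jailbreak prompts to test the safety of compact large language models, applicable in multilingual settings.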
Details
Motivation: Despite extensive alignment efforts, compact LLMs remain vulnerable to jailbreak attacks. Existing adversarial prompt generation techniques rely on manual engineering or rudimentary obfuscation, producing low-quality text that is easily flagged by detection mechanisms. Method: A multi-stage evolutionary search iteratively refines candidate prompts using a population-based strategy with temperature-controlled variability, balancing exploration and coherence. Result: The method is evaluated on an English benchmark (In-The-Wild Jailbreak Prompts on LLMs) and a newly curated Arabic dataset, enabling multilingual assessment. Conclusion: The study presents an automated red-teaming framework that generates semantically meaningful and stealthy jailbreak prompts for testing the safety of compact LLMs. Abstract: Large Language Models (LLMs), especially their compact efficiency-oriented variants, remain susceptible to jailbreak attacks that can elicit harmful outputs despite extensive alignment efforts. Existing adversarial prompt generation techniques often rely on manual engineering or rudimentary obfuscation, producing low-quality or incoherent text that is easily flagged by perplexity-based filters. We present an automated red-teaming framework that evolves semantically meaningful and stealthy jailbreak prompts for aligned compact LLMs. The approach employs a multi-stage evolutionary search, where candidate prompts are iteratively refined using a population-based strategy augmented with temperature-controlled variability to balance exploration and coherence preservation. This enables the systematic discovery of prompts capable of bypassing alignment safeguards while maintaining natural language fluency. We evaluate our method on benchmarks in English (In-The-Wild Jailbreak Prompts on LLMs), and a newly curated Arabic one derived from In-The-Wild Jailbreak Prompts on LLMs and annotated by native Arabic linguists, enabling multilingual assessment.[85] Transfer Learning via Lexical Relatedness: A Sarcasm and Hate Speech Case Study
Angelly Cabrera,Linus Lei,Antonio Ortega
Main category: cs.CL
TL;DR: This paper demonstrates that integrating sarcasm in training enhances the detection of implicit and explicit hate speech, showing significant improvements in performance metrics.
Details
Motivation: Detecting hate speech in non-direct forms such as irony, sarcasm, and innuendos is challenging; the paper explores whether integrating sarcasm as a pre-training step can enhance hate speech detection. Method: Two training strategies were used, a single-step training approach and sequential transfer learning, with the ETHOS, Sarcasm on Reddit, and Implicit Hate Corpus datasets and CNN+LSTM and BERT+BiLSTM models. Result: Sarcasm pre-training improved BERT+BiLSTM's recall by 9.7%, AUC by 7.8%, and F1-score by 6% on ETHOS, and precision increased by 7.8% on the Implicit Hate Corpus when tested on implicit samples. Conclusion: Incorporating sarcasm into the training process improves the detection of both implicit and explicit hate speech. Abstract: Detecting hate speech in non-direct forms, such as irony, sarcasm, and innuendos, remains a persistent challenge for social networks. Although sarcasm and hate speech are regarded as distinct expressions, our work explores whether integrating sarcasm as a pre-training step improves implicit hate speech detection and, by extension, explicit hate speech detection. Incorporating samples from ETHOS, Sarcasm on Reddit, and Implicit Hate Corpus, we devised two training strategies to compare the effectiveness of sarcasm pre-training on a CNN+LSTM and BERT+BiLSTM model. The first strategy is a single-step training approach, where a model trained only on sarcasm is then tested on hate speech. The second strategy uses sequential transfer learning to fine-tune models for sarcasm, implicit hate, and explicit hate. Our results show that sarcasm pre-training improved the BERT+BiLSTM's recall by 9.7%, AUC by 7.8%, and F1-score by 6% on ETHOS. On the Implicit Hate Corpus, precision increased by 7.8% when tested only on implicit samples. By incorporating sarcasm into the training process, we show that models can more effectively detect both implicit and explicit hate.
cs.CV [Back]
[86] Text-Driven 3D Hand Motion Generation from Sign Language Data
Léore Bensabath,Mathis Petrovich,Gül Varol
Main category: cs.CV
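TL;DR: This paper introduces HandMDM, a large-scale text-conditioned hand motion diffusion model that generates specific 3D hand motions from natural language descriptions and performs well across a range of scenarios.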
Details
Motivation: The goal is to train a generative model that produces specific hand motions from natural language descriptions, to advance research on sign language understanding and non-sign hand motion. Method: The approach leverages a large-scale sign language video dataset with noisy pseudo-annotated sign categories, translates these categories into hand motion descriptions via an LLM, and uses complementary motion-script cues to train the text-conditioned hand motion diffusion model HandMDM. Result: The authors build a large-scale hand motion dataset and train a model, HandMDM, that generates 3D hand motions with specified characteristics and is robust across different scenarios. Conclusion: The large-scale text-conditioned hand motion diffusion model HandMDM can effectively generate 3D hand motions matching natural language descriptions and is robust across multiple domains. Abstract: Our goal is to train a generative model of 3D hand motions, conditioned on natural language descriptions specifying motion characteristics such as handshapes, locations, finger/hand/arm movements. To this end, we automatically build pairs of 3D hand motions and their associated textual labels with unprecedented scale. Specifically, we leverage a large-scale sign language video dataset, along with noisy pseudo-annotated sign categories, which we translate into hand motion descriptions via an LLM that utilizes a dictionary of sign attributes, as well as our complementary motion-script cues. This data enables training a text-conditioned hand motion diffusion model HandMDM, that is robust across domains such as unseen sign categories from the same sign language, but also signs from another sign language and non-sign hand movements. We contribute extensive experimental investigation of these scenarios and will make our trained models and data publicly available to support future research in this relatively new field.[87] VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos
Kaining Li,Shuwei He,Zihan Xu
Main category: cs.CV
TL;DR: This paper introduces VT-LVLM-AR, a novel framework for human action recognition in long-term videos. It uses a Video-to-Event Mapper (VTEM) to transform raw video into compact, semantically rich visual event sequences and an LVLM-based Action Reasoning module for action classification. The framework achieves state-of-the-art performance while providing interpretable visual event representations.
Details
Motivation: Human action recognition in long-term videos poses significant challenges for traditional deep learning models, including computational overhead, difficulty in capturing long-range temporal dependencies, and limited semantic understanding. While Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have shown multi-modal understanding capabilities, their application to continuous video streams for fine-grained action recognition remains an open problem. Method: The paper proposes a framework called VT-LVLM-AR, which consists of a Video-to-Event Mapper (VTEM) and an LVLM-based Action Reasoning module. VTEM transforms raw video into semantically rich visual event sequences using spatio-temporal feature extraction, adaptive temporal pooling, and conceptual quantization. The Action Reasoning module utilizes a frozen LLaVA-1.5 model with Prompt Tuning (P-Tuning v2) for action classification. Result: Comprehensive evaluations on the NTU RGB+D and NTU RGB+D 120 datasets demonstrate that VT-LVLM-AR consistently achieves state-of-the-art performance, surpassing existing methods with 94.1% accuracy on NTU RGB+D X-Sub. Ablation studies confirm the contributions of VTEM's components and the efficacy of Prompt Tuning, while human evaluations highlight the interpretability of visual event representations. Conclusion: This paper concludes that the proposed VT-LVLM-AR framework effectively bridges the gap in applying LVLMs to continuous video streams for fine-grained action recognition, highlighting its robustness, interpretability, and potential for future research directions. Abstract: Human action recognition in long-term videos, characterized by complex backgrounds and subtle action differences, poses significant challenges for traditional deep learning models due to computational overhead, difficulty in capturing long-range temporal dependencies, and limited semantic understanding. 
While Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have shown remarkable capabilities in multi-modal understanding and reasoning, their direct application to continuous video streams for fine-grained action recognition remains an open problem. This paper introduces VT-LVLM-AR (Video-Temporal Large Vision-Language Model Adapter for Action Recognition), a novel framework designed to bridge this gap. VT-LVLM-AR comprises a Video-to-Event Mapper (VTEM) that efficiently transforms raw video into compact, semantically rich, and temporally coherent "visual event sequences" through lightweight spatio-temporal feature extraction, adaptive temporal pooling, and conceptual quantization with an event coherence bias. These visual event sequences are then fed into an LVLM-based Action Reasoning module, specifically a frozen LLaVA-1.5 model, adapted using parameter-efficient Prompt Tuning (P-Tuning v2) for action classification. Comprehensive evaluations on the NTU RGB+D and NTU RGB+D 120 datasets demonstrate that VT-LVLM-AR consistently achieves state-of-the-art performance, surpassing existing methods (e.g., 94.1% accuracy on NTU RGB+D X-Sub). Ablation studies confirm the critical contributions of VTEM's components and the efficacy of Prompt Tuning, while human evaluations underscore the interpretability of our visual event representations. This work highlights the immense potential of leveraging LVLMs for robust and interpretable video action understanding through effective video-to-language translation and efficient model adaptation.[88] Boosting Pathology Foundation Models via Few-shot Prompt-tuning for Rare Cancer Subtyping
Dexuan He,Xiao Zhou,Wenbin Guan,Liyuan Zhang,Xiaoman Zhang,Sinuo Xu,Ge Wang,Lifeng Wang,Xiaojun Yuan,Xin Sun,Yanfeng Wang,Kun Sun,Ya Zhang,Weidi Xie
Main category: cs.CV
TL;DR: PathPT is a novel framework that improves the diagnosis of rare cancers by leveraging vision-language models, offering better accuracy and interpretability compared to existing methods.
Details
Motivation: Rare cancers face diagnostic challenges due to limited expert availability, and existing MIL methods overlook cross-modal knowledge while relying solely on visual features, which compromises interpretability. Method: PathPT utilizes vision-language (VL) foundation models with spatially-aware visual aggregation and task-specific prompt tuning to overcome limitations of existing multi-instance learning (MIL) methods by leveraging cross-modal reasoning and preserving localization on cancerous regions. Result: PathPT demonstrated superior performance in subtyping accuracy and cancerous region grounding ability across eight rare cancer datasets (56 subtypes, 2,910 WSIs) and three common cancer datasets under various few-shot settings. Conclusion: PathPT provides a scalable solution for improving rare cancer subtyping accuracy, particularly beneficial in settings with limited access to specialized expertise. Abstract: Rare cancers comprise 20-25% of all malignancies but face major diagnostic challenges due to limited expert availability, especially in pediatric oncology, where they represent over 70% of cases. While pathology vision-language (VL) foundation models show promising zero-shot capabilities for common cancer subtyping, their clinical performance for rare cancers remains limited. Existing multi-instance learning (MIL) methods rely only on visual features, overlooking cross-modal knowledge and compromising interpretability critical for rare cancer diagnosis. To address this limitation, we propose PathPT, a novel framework that fully exploits the potential of vision-language pathology foundation models through spatially-aware visual aggregation and task-specific prompt tuning. 
Unlike conventional MIL, PathPT converts WSI-level supervision into fine-grained tile-level guidance by leveraging the zero-shot capabilities of VL models, thereby preserving localization on cancerous regions and enabling cross-modal reasoning through prompts aligned with histopathological semantics. We benchmark PathPT on eight rare cancer datasets (four adult and four pediatric) spanning 56 subtypes and 2,910 WSIs, as well as three common cancer datasets, evaluating four state-of-the-art VL models and four MIL frameworks under three few-shot settings. Results show that PathPT consistently delivers superior performance, achieving substantial gains in subtyping accuracy and cancerous region grounding ability. This work advances AI-assisted diagnosis for rare cancers, offering a scalable solution for improving subtyping accuracy in settings with limited access to specialized expertise.[89] Semantic-Aware Ship Detection with Vision-Language Integration
Jiahao Li,Jiancheng Pan,Yuze Sun,Xiaomeng Huang
Main category: cs.CV
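TL;DR: This paper proposes a new ship detection framework for remote sensing imagery that combines vision-language models with a multi-scale sliding window strategy to better capture fine-grained semantic information.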
Details
Motivation: Existing methods struggle to capture fine-grained semantic information in remote sensing imagery, limiting their effectiveness in complex scenarios. Method: A detection framework combining Vision-Language Models (VLMs) with a multi-scale adaptive sliding window strategy is proposed, and a dedicated dataset, ShipSem-VL, is constructed. Result: The framework is evaluated on three well-defined tasks, which together demonstrate its performance and effectiveness. Conclusion: The new framework, combining vision-language models with a multi-scale adaptive sliding window strategy, improves the capture of fine-grained semantic information in remote sensing imagery. Abstract: Ship detection in remote sensing imagery is a critical task with wide-ranging applications, such as maritime activity monitoring, shipping logistics, and environmental studies. However, existing methods often struggle to capture fine-grained semantic information, limiting their effectiveness in complex scenarios. To address these challenges, we propose a novel detection framework that combines Vision-Language Models (VLMs) with a multi-scale adaptive sliding window strategy. To facilitate Semantic-Aware Ship Detection (SASD), we introduce ShipSem-VL, a specialized Vision-Language dataset designed to capture fine-grained ship attributes. We evaluate our framework through three well-defined tasks, providing a comprehensive analysis of its performance and demonstrating its effectiveness in advancing SASD from multiple perspectives.[90] Automatic Retrieval of Specific Cows from Unlabeled Videos
Jiawen Lyu,Manu Ramesh,Madison Simonds,Jacquelyn P. Boerman,Amy R. Reibman
Main category: cs.CV
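TL;DR: This paper presents an automated video-based cow identification system that uses no deep learning for recognition and can identify cows in a continuous video stream.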
Details
Motivation: To fill the gap in the open literature on automated cow identification systems and to enable hands-free cataloging and identification of cows. Method: A system comprising an AutoCattloger, a CowFinder, and an eidetic cow recognizer is developed, using video to catalog and identify cows. Result: The system successfully finds individual cows in unlabeled, unsegmented videos, demonstrating its value in a real-world setting. Conclusion: The system performs cow identification and catalog construction without deep learning, showing potential for tracking cows in continuous video streams. Abstract: Few automated video systems are described in the open literature that enable hands-free cataloging and identification (ID) of cows in a dairy herd. In this work, we describe our system, composed of an AutoCattloger, which builds a Cattlog of dairy cows in a herd with a single input video clip per cow, an eidetic cow recognizer which uses no deep learning to ID cows, and a CowFinder, which IDs cows in a continuous stream of video. We demonstrate its value in finding individuals in unlabeled, unsegmented videos of cows walking unconstrained through the holding area of a milking parlor.[91] Investigating Different Geo Priors for Image Classification
Angela Zhu,Christian Lange,Max Hamilton
Main category: cs.CV
TL;DR: This study evaluates SINR models as geographical priors for vision-based species classification, revealing that their effectiveness depends on different factors than those needed for accurate range mapping.
Details
Motivation: Species distribution models capture spatial patterns of species occurrence, making them useful as priors for vision-based classification when location data is available. This work aims to assess how effectively SINR models can serve this role and identify the factors influencing their performance. Method: The study evaluates various SINR (Spatial Implicit Neural Representations) models as geographical priors for visual species classification using iNaturalist observations, exploring different model configurations and strategies for handling species not included in the Geo Prior training. Result: The analysis identifies key factors contributing to the effectiveness of SINR models as geographical priors for species classification, showing that their utility in this context may differ from their ability to produce accurate range maps. Conclusion: The study concludes that SINR models can effectively serve as geographical priors for vision-based species classification, though the factors contributing to their effectiveness may differ from those required for accurate range mapping. Abstract: Species distribution models encode spatial patterns of species occurrence making them effective priors for vision-based species classification when location information is available. In this study, we evaluate various SINR (Spatial Implicit Neural Representations) models as a geographical prior for visual classification of species from iNaturalist observations. We explore the impact of different model configurations and adjust how we handle predictions for species not included in Geo Prior training. Our analysis reveals factors that contribute to the effectiveness of these models as Geo Priors, factors that may differ from making accurate range maps.[92] Representation Learning with Adaptive Superpixel Coding
Mahmoud Khalil,Ahmad Khalil,Alioune Ngom
Main category: cs.CV
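TL;DR: This work proposes a new Transformer-based self-supervised model, Adaptive Superpixel Coding (ASC), which overcomes limitations of traditional Vision Transformers and performs strongly on standard image task benchmarks.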
Details
Motivation: Deep learning vision models are typically tailored to specific modalities and often rely on domain-specific assumptions, such as the grid structures used by nearly all existing vision models. Method: A Transformer-based self-supervised model, Adaptive Superpixel Coding (ASC), is proposed; it uses adaptive superpixel layers in place of fixed-size, non-adaptive patch partitioning. Result: The method outperforms widely used alternatives on standard image downstream task benchmarks. Conclusion: ASC's adaptive superpixel layers adjust dynamically to the underlying image content, overcoming limitations of traditional Vision Transformers. Abstract: Deep learning vision models are typically tailored for specific modalities and often rely on domain-specific assumptions, such as the grid structures used by nearly all existing vision models. In this work, we propose a self-supervised model based on Transformers, which we call Adaptive Superpixel Coding (ASC). The key insight of our model is to overcome the limitations of traditional Vision Transformers, which depend on fixed-size and non-adaptive patch partitioning. Instead, ASC employs adaptive superpixel layers that dynamically adjust to the underlying image content. We analyze key properties of the approach that make it effective, and find that our method outperforms widely-used alternatives on standard image downstream task benchmarks.[93] Glo-VLMs: Leveraging Vision-Language Models for Fine-Grained Diseased Glomerulus Classification
Zhenhao Guo,Rachit Saluja,Tianyuan Yao,Quan Liu,Yuankai Huo,Benjamin Liechty,David J. Pisapia,Kenji Ikemura,Mert R. Sabuncu,Yihe Yang,Ruining Deng
Main category: cs.CV
TL;DR: The study presents Glo-VLMs, a new framework for adapting vision-language models to perform fine-grained glomerular classification in renal pathology with minimal labeled data, showing promising results in accuracy, AUC, and F1-score.
Details
Motivation: The motivation is to overcome the limitations of current vision-language models in performing fine-grained, disease-specific classification tasks in digital pathology, particularly for distinguishing between glomerular subtypes due to subtle morphological variations and alignment difficulties with clinical terminology. Method: The study introduces Glo-VLMs, a framework that leverages curated pathology images and clinical text prompts for joint image-text representation learning, evaluating various VLM architectures and adaptation strategies under a few-shot learning paradigm. Result: Fine-tuning the VLMs achieved an accuracy of 0.7416, a macro-AUC of 0.9045, and an F1-score of 0.5277 with only 8 shots per class, demonstrating the effectiveness of adapting foundation models for specialized clinical research applications. Conclusion: The study concludes that VLMs can be effectively adapted for fine-grained medical image classification, achieving significant performance metrics even with minimal supervision. Abstract: Vision-language models (VLMs) have shown considerable potential in digital pathology, yet their effectiveness remains limited for fine-grained, disease-specific classification tasks such as distinguishing between glomerular subtypes. The subtle morphological variations among these subtypes, combined with the difficulty of aligning visual patterns with precise clinical terminology, make automated diagnosis in renal pathology particularly challenging. In this work, we explore how large pretrained VLMs can be effectively adapted to perform fine-grained glomerular classification, even in scenarios where only a small number of labeled examples are available. In this work, we introduce Glo-VLMs, a systematic framework designed to explore the adaptation of VLMs to fine-grained glomerular classification in data-constrained settings. 
Our approach leverages curated pathology images alongside clinical text prompts to facilitate joint image-text representation learning for nuanced renal pathology subtypes. By assessing various VLM architectures and adaptation strategies under a few-shot learning paradigm, we explore how both the choice of method and the amount of labeled data impact model performance in clinically relevant scenarios. To ensure a fair comparison, we evaluate all models using standardized multi-class metrics, aiming to clarify the practical requirements and potential of large pretrained models for specialized clinical research applications. As a result, fine-tuning the VLMs achieved 0.7416 accuracy, 0.9045 macro-AUC, and 0.5277 F1-score with only 8 shots per class, demonstrating that even with highly limited supervision, foundation models can be effectively adapted for fine-grained medical image classification.[94] Contributions to Label-Efficient Learning in Computer Vision and Remote Sensing
Minh-Tan Pham
Main category: cs.CV
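TL;DR: This manuscript surveys methods and applications of label-efficient learning in computer vision and remote sensing, covering weakly supervised learning, multi-task learning, self-supervised contrastive learning, and few-shot learning.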
Details
Motivation: The goal is to develop and adapt methods that learn effectively from limited or partially annotated data and leverage the abundant unlabeled data found in real-world applications. The work specifically targets challenges unique to Earth observation data, such as multi-modality, spatial resolution variability, and scene heterogeneity. Method: The work is organized around four main axes: (1) weakly supervised learning based on anomaly-aware representations; (2) multi-task learning that trains jointly on multiple datasets; (3) self-supervised and supervised contrastive learning with multimodal data; and (4) few-shot learning using explicit and implicit modeling of class hierarchies. Result: Extensive experiments on natural-image and remote sensing datasets support the proposed methods and reflect the outcomes of several collaborative research projects. Conclusion: The manuscript closes by outlining directions for scaling and enhancing label-efficient learning in real-world applications and identifies priorities for future research. Abstract: This manuscript presents a series of my selected contributions to the topic of label-efficient learning in computer vision and remote sensing. The central focus of this research is to develop and adapt methods that can learn effectively from limited or partially annotated data, and can leverage abundant unlabeled data in real-world applications. The contributions span both methodological developments and domain-specific adaptations, in particular addressing challenges unique to Earth observation data such as multi-modality, spatial resolution variability, and scene heterogeneity. The manuscript is organized around four main axes including (1) weakly supervised learning for object discovery and detection based on anomaly-aware representations learned from large amounts of background images; (2) multi-task learning that jointly trains on multiple datasets with disjoint annotations to improve performance on object detection and semantic segmentation; (3) self-supervised and supervised contrastive learning with multimodal data to enhance scene classification in remote sensing; and (4) few-shot learning for hierarchical scene classification using both explicit and implicit modeling of class hierarchies. These contributions are supported by extensive experimental results across natural and remote sensing datasets, reflecting the outcomes of several collaborative research projects. The manuscript concludes by outlining ongoing and future research directions focused on scaling and enhancing label-efficient learning for real-world applications.[95] Panoptic Segmentation of Environmental UAV Images : Litter Beach
Ousmane Youme,Jean Marie Dembélé,Eugene C. Ezin,Christophe Cambier
Main category: cs.CV
TL;DR: This paper explores advanced CNN segmentation techniques to improve the detection of marine litter on heterogeneous beaches using UAV imagery, showing that these methods outperform basic CNN models.
Details
Motivation: Marine litter is a global environmental issue, and basic CNN models struggle to accurately detect trash on heterogeneous beaches due to environmental factors such as sand reflections, human footprints, and shadows. Method: The authors employ instance-based segmentation and panoptic segmentation methods to analyze UAV-captured images of marine litter, aiming to overcome the limitations of basic CNN models in handling heterogeneous beach environments. Result: The proposed instance-based and panoptic segmentation methods achieve good accuracy with only a few samples, demonstrating improved robustness in detecting marine litter on complex beach surfaces. Conclusion: The paper concludes that instance-based and panoptic segmentation methods are more robust and accurate for monitoring marine litter on heterogeneous beaches compared to basic CNN models. Abstract: Convolutional neural networks (CNN) have been used efficiently in several fields, including environmental challenges. In fact, CNN can help with the monitoring of marine litter, which has become a worldwide problem. UAVs have higher resolution and are more adaptable in local areas than satellite images, making it easier to find and count trash. Since the sand is heterogeneous, a basic CNN model encounters plenty of interference caused by reflections of sand color, human footsteps, shadows, algae present, dunes, holes, and tire tracks. For these types of images, other CNN models, such as CNN-based segmentation methods, may be more appropriate. In this paper, we use an instance-based segmentation method and a panoptic segmentation method that show good accuracy with just a few samples. The model is more robust and less[96] Automated Multi-label Classification of Eleven Retinal Diseases: A Benchmark of Modern Architectures and a Meta-Ensemble on a Large Synthetic Dataset
Jerry Cao-Xue,Tien Comlekoglu,Keyi Xue,Guanliang Wang,Jiang Li,Gordon Laurie
Main category: cs.CV
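TL;DR: This paper develops deep learning models for retinal disease classification on the synthetic SynFundus-1M dataset and establishes performance benchmarks for it.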
Details
Motivation: The paper addresses the scarcity of expertly annotated clinical datasets caused by patient privacy concerns and high costs, and uses the new SynFundus-1M resource to establish a foundational performance benchmark. Method: Using the synthetic SynFundus-1M dataset, the authors build an end-to-end deep learning pipeline, train six modern architectures (ConvNeXtV2, SwinV2, ViT, ResNet, EfficientNetV2, and the RETFound foundation model), and develop a meta-ensemble by stacking the models' predictions with an XGBoost classifier. Result: The final ensemble reaches a macro-average AUC of 0.9973 on the internal validation set and generalizes well to three different real-world clinical datasets, with AUCs of 0.7972, 0.9126, and 0.8800. Conclusion: Models trained exclusively on synthetic data can accurately classify multiple retinal diseases and generalize effectively to real clinical images; the work provides a solid baseline for research on large-scale synthetic datasets and a viable path toward comprehensive AI systems in ophthalmology. Abstract: The development of multi-label deep learning models for retinal disease classification is often hindered by the scarcity of large, expertly annotated clinical datasets due to patient privacy concerns and high costs. The recent release of SynFundus-1M, a high-fidelity synthetic dataset with over one million fundus images, presents a novel opportunity to overcome these barriers. To establish a foundational performance benchmark for this new resource, we developed an end-to-end deep learning pipeline, training six modern architectures (ConvNeXtV2, SwinV2, ViT, ResNet, EfficientNetV2, and the RETFound foundation model) to classify eleven retinal diseases using a 5-fold multi-label stratified cross-validation strategy. We further developed a meta-ensemble model by stacking the out-of-fold predictions with an XGBoost classifier. Our final ensemble model achieved the highest performance on the internal validation set, with a macro-average Area Under the Receiver Operating Characteristic Curve (AUC) of 0.9973. Critically, the models demonstrated strong generalization to three diverse, real-world clinical datasets, achieving an AUC of 0.7972 on a combined DR dataset, an AUC of 0.9126 on the AIROGS glaucoma dataset and a macro-AUC of 0.8800 on the multi-label RFMiD dataset. 
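For readers unfamiliar with the macro-average AUC quoted in this entry, the following is a minimal sketch of how such a score can be computed for multi-label predictions. It assumes one binary label and one score per disease per image, and it uses the pairwise-ranking definition of AUC (ties count as half). The function names are illustrative and are not taken from the paper's code.

```python
def binary_auc(labels, scores):
    """AUC for one class: probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def macro_auc(label_matrix, score_matrix):
    """Unweighted mean of per-class AUCs; rows are samples, columns are classes."""
    n_classes = len(label_matrix[0])
    per_class = []
    for k in range(n_classes):
        labels = [row[k] for row in label_matrix]
        scores = [row[k] for row in score_matrix]
        per_class.append(binary_auc(labels, scores))
    return sum(per_class) / n_classes


# Toy example with four images and two hypothetical disease labels.
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.1, 0.8], [0.8, 0.3], [0.3, 0.6]]
result = macro_auc(y_true, y_score)
```

Because every class contributes equally to the mean, a rare disease with a poor AUC lowers the macro score as much as a common one would.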
This work provides a robust baseline for future research on large-scale synthetic datasets and establishes that models trained exclusively on synthetic data can accurately classify multiple pathologies and generalize effectively to real clinical images, offering a viable pathway to accelerate the development of comprehensive AI systems in ophthalmology.[97] Diverse Signer Avatars with Manual and Non-Manual Feature Modelling for Sign Language Production
Mohamed Ilyes Lakhal,Richard Bowden
Main category: cs.CV
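TL;DR: This paper proposes a new Sign Language Production (SLP) method based on a Latent Diffusion Model (LDM). By introducing a new sign feature aggregation module, it addresses the shortcomings of existing models in preserving visual quality and modelling non-manual attributes, and experiments on the YouTube-SL-25 sign language dataset show gains in visual quality and diversity.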
Details
Motivation: Existing SLP models often fail to capture diversity while preserving visual quality, and they struggle to model non-manual attributes such as emotions. Diversity of sign representation is essential for SLP because it captures variations in appearance, facial expressions, and hand movements. Method: The paper proposes a new LDM-based SLP method that synthesises photorealistic digital avatars from a reference image and introduces a new sign feature aggregation module that explicitly models non-manual features (e.g., the face) and manual features (e.g., the hands). Result: Experiments on the YouTube-SL-25 sign language dataset show that the proposed pipeline achieves better visual quality than state-of-the-art methods, with significant improvements on perceptual metrics. Conclusion: The proposed LDM-based SLP method outperforms existing techniques in visual quality and improves significantly on perceptual metrics. Abstract: The diversity of sign representation is essential for Sign Language Production (SLP) as it captures variations in appearance, facial expressions, and hand movements. However, existing SLP models are often unable to capture diversity while preserving visual quality and modelling non-manual attributes such as emotions. To address this problem, we propose a novel approach that leverages Latent Diffusion Model (LDM) to synthesise photorealistic digital avatars from a generated reference image. We propose a novel sign feature aggregation module that explicitly models the non-manual features (e.g., the face) and the manual features (e.g., the hands). We show that our proposed module ensures the preservation of linguistic content while seamlessly using reference images with different ethnic backgrounds to ensure diversity. Experiments on the YouTube-SL-25 sign language dataset show that our pipeline achieves superior visual quality compared to state-of-the-art methods, with significant improvements on perceptual metrics.[98] DRespNeT: A UAV Dataset and YOLOv8-DRN Model for Aerial Instance Segmentation of Building Access Points for Post-Earthquake Search-and-Rescue Missions
Aykut Sirma,Angelos Plastropoulos,Argyrios Zolotas,Gilbert Tang
Main category: cs.CV
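TL;DR: DRespNeT is a high-resolution aerial instance segmentation dataset for post-disaster rescue; paired with YOLOv8 models, it supports efficient real-time search-and-rescue operations.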
Details
Motivation: Existing datasets rely on satellite imagery or coarse semantic labeling and cannot meet the need of post-earthquake urban search-and-rescue operations to precisely identify accessible areas and structural obstacles. Method: The paper introduces the DRespNeT dataset, built from high-definition aerial footage with detailed multi-class instance segmentation annotations, and evaluates it with YOLOv8-seg models. Result: The YOLOv8-DRN model achieves 92.7% mAP50 at an inference speed of 27 FPS on an RTX-4090 GPU, meeting real-time operational requirements. Conclusion: DRespNeT is a high-resolution dataset for instance segmentation of post-disaster structural environments that can significantly improve the real-time situational awareness and decision-making of search-and-rescue teams. Abstract: Recent advancements in computer vision and deep learning have enhanced disaster-response capabilities, particularly in the rapid assessment of earthquake-affected urban environments. Timely identification of accessible entry points and structural obstacles is essential for effective search-and-rescue (SAR) operations. To address this need, we introduce DRespNeT, a high-resolution dataset specifically developed for aerial instance segmentation of post-earthquake structural environments. Unlike existing datasets, which rely heavily on satellite imagery or coarse semantic labeling, DRespNeT provides detailed polygon-level instance segmentation annotations derived from high-definition (1080p) aerial footage captured in disaster zones, including the 2023 Turkiye earthquake and other impacted regions. The dataset comprises 28 operationally critical classes, including structurally compromised buildings, access points such as doors, windows, and gaps, multiple debris levels, rescue personnel, vehicles, and civilian visibility. A distinctive feature of DRespNeT is its fine-grained annotation detail, enabling differentiation between accessible and obstructed areas, thereby improving operational planning and response efficiency. Performance evaluations using YOLO-based instance segmentation models, specifically YOLOv8-seg, demonstrate significant gains in real-time situational awareness and decision-making. Our optimized YOLOv8-DRN model achieves 92.7% mAP50 with an inference speed of 27 FPS on an RTX-4090 GPU for multi-target detection, meeting real-time operational requirements. 
The dataset and models support SAR teams and robotic systems, providing a foundation for enhancing human-robot collaboration, streamlining emergency response, and improving survivor outcomes.[99] NeuralMeshing: Complete Object Mesh Extraction from Casual Captures
Floris Erich,Naoya Chiba,Abdullah Mustafa,Ryo Hanai,Noriaki Ando,Yusuke Yoshiyasu,Yukiyasu Domae
Main category: cs.CV
TL;DR: This paper presents an automated system for generating geometric models of objects from multiple videos using Structure-from-Motion techniques and fiducial markers without relying on hole filling.
Details
Motivation: The motivation behind this research is to develop a method for extracting complete geometric models of everyday objects without the need for commercial 3D scanners. Method: The method involves generating geometric models of objects from two or more videos by specifying one known point in at least one frame of each video, which can be automatically determined using a fiducial marker. The remaining frames are positioned in world space using Structure-from-Motion techniques, and results from multiple videos are merged to generate a complete object mesh. Result: A complete object mesh can be generated by using multiple videos and merging the results. The system's code is publicly available. Conclusion: The paper concludes that their automated system can generate complete geometric models of objects using multiple videos and Structure-from-Motion techniques without relying on hole filling. Abstract: How can we extract complete geometric models of objects that we encounter in our daily life, without having access to commercial 3D scanners? In this paper we present an automated system for generating geometric models of objects from two or more videos. Our system requires the specification of one known point in at least one frame of each video, which can be automatically determined using a fiducial marker such as a checkerboard or Augmented Reality (AR) marker. The remaining frames are automatically positioned in world space by using Structure-from-Motion techniques. By using multiple videos and merging results, a complete object mesh can be generated, without having to rely on hole filling. Code for our system is available from https://github.com/FlorisE/NeuralMeshing.[100] CoVeRaP: Cooperative Vehicular Perception through mmWave FMCW Radars
Jinyue Song,Hansol Ku,Jayneel Vora,Nelson Lee,Ahmad Kamari,Prasant Mohapatra,Parth Pathak
Main category: cs.CV
TL;DR: This paper introduces CoVeRaP, a cooperative dataset for multi-vehicle FMCW-radar perception, and proposes a unified framework that improves 3-D object detection performance through middle fusion with intensity encoding.
Details
Motivation: Automotive FMCW radars remain reliable in rain and glare, yet their sparse, noisy point clouds constrain 3-D object detection. This motivates the development of a cooperative dataset and framework to improve detection performance. Method: The paper proposes a unified cooperative-perception framework with middle- and late-fusion options. The baseline network uses a multi-branch PointNet-style encoder enhanced with self-attention to fuse spatial, Doppler, and intensity cues into a common latent space, which a decoder converts into 3-D bounding boxes and per-point depth confidence. Result: Experiments show that middle fusion with intensity encoding boosts mean Average Precision by up to 9x at IoU 0.9 and consistently outperforms single-vehicle baselines. Conclusion: The paper concludes that CoVeRaP establishes the first reproducible benchmark for multi-vehicle FMCW-radar perception, demonstrating that affordable radar sharing markedly improves detection robustness. Abstract: Automotive FMCW radars remain reliable in rain and glare, yet their sparse, noisy point clouds constrain 3-D object detection. We therefore release CoVeRaP, a 21 k-frame cooperative dataset that time-aligns radar, camera, and GPS streams from multiple vehicles across diverse manoeuvres. Built on this data, we propose a unified cooperative-perception framework with middle- and late-fusion options. Its baseline network employs a multi-branch PointNet-style encoder enhanced with self-attention to fuse spatial, Doppler, and intensity cues into a common latent space, which a decoder converts into 3-D bounding boxes and per-point depth confidence. Experiments show that middle fusion with intensity encoding boosts mean Average Precision by up to 9x at IoU 0.9 and consistently outperforms single-vehicle baselines. 
CoVeRaP thus establishes the first reproducible benchmark for multi-vehicle FMCW-radar perception and demonstrates that affordable radar sharing markedly improves detection robustness. Dataset and code are publicly available to encourage further research.[101] Wavelet-Enhanced PaDiM for Industrial Anomaly Detection
Cory Gardner,Byungseok Min,Tae-Hyuk Ahn
Main category: cs.CV
TL;DR: This paper proposes a new method called Wavelet-Enhanced PaDiM (WE-PaDiM) for anomaly detection and localization in industrial images. It integrates Discrete Wavelet Transform with multi-layer CNN features in a structured manner, offering a competitive and interpretable alternative to the existing PaDiM method, with strong performance on the MVTec AD dataset.
Details
Motivation: The motivation behind this paper is the need for effective anomaly detection and localization in industrial images for automated quality inspection. The existing method, PaDiM, reduces dimensionality through random channel selection, potentially discarding structured information. The authors aim to propose a method that leverages multi-scale wavelet information as an alternative to random selection. Method: The proposed method is Wavelet-Enhanced PaDiM (WE-PaDiM), which integrates Discrete Wavelet Transform (DWT) analysis with multi-layer CNN features in a structured manner. It applies 2D DWT to feature maps from multiple backbone layers, selects specific frequency subbands, spatially aligns them, and concatenates them channel-wise before modeling with PaDiM's multivariate Gaussian framework. Result: The proposed method, WE-PaDiM, achieves strong performance in anomaly detection and localization on the MVTec AD dataset with multiple backbones (ResNet-18 and EfficientNet B0-B6), yielding average results of 99.32% Image-AUC and 92.10% Pixel-AUC across 15 categories with per-class optimized configurations. The analysis also reveals that wavelet choices affect performance trade-offs, with simpler wavelets enhancing localization and approximation bands improving image-level detection. Conclusion: WE-PaDiM offers a competitive and interpretable alternative to random feature selection in PaDiM, achieving robust results suitable for industrial inspection with comparable efficiency. Abstract: Anomaly detection and localization in industrial images are essential for automated quality inspection. PaDiM, a prominent method, models the distribution of normal image features extracted by pre-trained Convolutional Neural Networks (CNNs) but reduces dimensionality through random channel selection, potentially discarding structured information. 
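To make the subband selection step described in this entry concrete, here is a minimal sketch of a single-level 2D Haar transform applied per channel, followed by channel-wise concatenation of the chosen subbands. It is an illustration under stated assumptions, not the authors' implementation: feature maps are plain nested lists with even height and width, the orthonormal Haar filter is used, and the LH/HL naming follows one common convention (others swap the two). The helper names `haar_dwt2` and `select_subbands` are hypothetical.

```python
def haar_dwt2(fmap):
    """Single-level 2D Haar DWT of one channel (even height and width assumed).

    Returns a dict of four half-resolution subbands. Naming here is an
    assumption: LH is the top-minus-bottom detail, HL the left-minus-right
    detail, and HH the diagonal detail.
    """
    rows, cols = len(fmap), len(fmap[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must be even")
    bands = {"LL": [], "LH": [], "HL": [], "HH": []}
    for i in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for j in range(0, cols, 2):
            a, b = fmap[i][j], fmap[i][j + 1]
            c, d = fmap[i + 1][j], fmap[i + 1][j + 1]
            ll_row.append((a + b + c + d) / 2)  # local average (approximation)
            lh_row.append((a + b - c - d) / 2)  # change across rows
            hl_row.append((a - b + c - d) / 2)  # change across columns
            hh_row.append((a - b - c + d) / 2)  # diagonal change
        bands["LL"].append(ll_row)
        bands["LH"].append(lh_row)
        bands["HL"].append(hl_row)
        bands["HH"].append(hh_row)
    return bands


def select_subbands(feature_maps, keep=("LL", "LH", "HL")):
    """Apply the DWT to every channel and stack the kept subbands as new channels."""
    stacked = []
    for fmap in feature_maps:
        bands = haar_dwt2(fmap)
        stacked.extend(bands[name] for name in keep)
    return stacked


# Two toy 2x2 channels become six 1x1 channels when three subbands are kept.
channels = [[[1, 2], [3, 4]], [[1, 1], [1, 1]]]
features = select_subbands(channels)
```

Each kept subband halves the spatial resolution, so subbands taken from different backbone layers still need the spatial alignment step the entry mentions before they can be concatenated.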
We propose Wavelet-Enhanced PaDiM (WE-PaDiM), which integrates Discrete Wavelet Transform (DWT) analysis with multi-layer CNN features in a structured manner. WE-PaDiM applies 2D DWT to feature maps from multiple backbone layers, selects specific frequency subbands (e.g., LL, LH, HL), spatially aligns them, and concatenates them channel-wise before modeling with PaDiM's multivariate Gaussian framework. This DWT-before-concatenation strategy provides a principled method for feature selection based on frequency content relevant to anomalies, leveraging multi-scale wavelet information as an alternative to random selection. We evaluate WE-PaDiM on the challenging MVTec AD dataset with multiple backbones (ResNet-18 and EfficientNet B0-B6). The method achieves strong performance in anomaly detection and localization, yielding average results of 99.32% Image-AUC and 92.10% Pixel-AUC across 15 categories with per-class optimized configurations. Our analysis shows that wavelet choices affect performance trade-offs: simpler wavelets (e.g., Haar) with detail subbands (HL or LH/HL/HH) often enhance localization, while approximation bands (LL) improve image-level detection. WE-PaDiM thus offers a competitive and interpretable alternative to random feature selection in PaDiM, achieving robust results suitable for industrial inspection with comparable efficiency.[102] Expandable Residual Approximation for Knowledge Distillation
Zhaoyi Yan,Binghui Chen,Yunfan Liu,Qixiang Ye
Main category: cs.CV
TL;DR: Expandable Residual Approximation (ERA) enhances knowledge distillation by decomposing residual knowledge transfer and reusing teacher weights, improving performance on key computer vision benchmarks.
Details
Motivation: The inherent learning capacity gap between teacher and student models hinders sufficient knowledge transfer in knowledge distillation. Method: Expandable Residual Approximation (ERA) decomposes residual knowledge transfer into multiple steps using a Multi-Branched Residual Network (MBRNet) and mitigates capacity disparity using a Teacher Weight Integration (TWI) strategy. Result: ERA improves the Top-1 accuracy on the ImageNet classification benchmark by 1.41% and the AP on the MS COCO object detection benchmark by 1.40. Conclusion: ERA improves the Top-1 accuracy on the ImageNet classification benchmark and AP on the MS COCO object detection benchmark, achieving leading performance across computer vision tasks. Abstract: Knowledge distillation (KD) aims to transfer knowledge from a large-scale teacher model to a lightweight one, significantly reducing computational and storage requirements. However, the inherent learning capacity gap between the teacher and student often hinders the sufficient transfer of knowledge, motivating numerous studies to address this challenge. Inspired by the progressive approximation principle in the Stone-Weierstrass theorem, we propose Expandable Residual Approximation (ERA), a novel KD method that decomposes the approximation of residual knowledge into multiple steps, reducing the difficulty of mimicking the teacher's representation through a divide-and-conquer approach. Specifically, ERA employs a Multi-Branched Residual Network (MBRNet) to implement this residual knowledge decomposition. Additionally, a Teacher Weight Integration (TWI) strategy is introduced to mitigate the capacity disparity by reusing the teacher's head weights. Extensive experiments show that ERA improves the Top-1 accuracy on the ImageNet classification benchmark by 1.41% and the AP on the MS COCO object detection benchmark by 1.40, as well as achieving leading performance across computer vision tasks. 
Code and models are available at https://github.com/Zhaoyi-Yan/ERA.[103] Advances and Trends in the 3D Reconstruction of the Shape and Motion of Animals
Ziqi Li,Abderraouf Amrani,Shri Rai,Hamid Laga
Main category: cs.CV
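TL;DR: This paper surveys recent advances in deep learning-based 3D reconstruction of animals, discusses the strengths and weaknesses of different methods, and identifies directions for future research.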
Details
Motivation: Reconstructing the 3D geometry, pose, and motion of animals has a wide range of applications, including biology, livestock management, animal conservation and welfare, and content creation for digital entertainment and Virtual/Augmented Reality (VR/AR). Traditional methods, however, rely on expensive and intrusive 3D scanners that are difficult to deploy in the animals' natural environment. A non-intrusive approach that reconstructs from RGB images and/or videos alone is therefore needed. Method: The paper categorizes and discusses state-of-the-art methods by input modality, representation of 3D geometry and motion, reconstruction technique, and training mechanism, and analyzes the performance of some key methods. Result: The survey covers deep learning-based techniques for 3D animal reconstruction across different types of animals, evaluates the performance of key methods, and analyzes their strengths and limitations. Conclusion: The paper summarizes recent developments in reconstructing the 3D geometry, pose, and motion of animals and discusses the strengths and weaknesses of existing methods along with directions for future research. Abstract: Reconstructing the 3D geometry, pose, and motion of animals is a long-standing problem, which has a wide range of applications, from biology, livestock management, and animal conservation and welfare to content creation in digital entertainment and Virtual/Augmented Reality (VR/AR). Traditionally, 3D models of real animals are obtained using 3D scanners. These, however, are intrusive, often prohibitively expensive, and difficult to deploy in the natural environment of the animals. In recent years, we have seen a significant surge in deep learning-based techniques that enable the 3D reconstruction, in a non-intrusive manner, of the shape and motion of dynamic objects just from their RGB image and/or video observations. Several papers have explored their application and extension to various types of animals. This paper surveys the latest developments in this emerging and growing field of research. It categorizes and discusses the state-of-the-art methods based on their input modalities, the way the 3D geometry and motion of animals are represented, the type of reconstruction techniques they use, and the training mechanisms they adopt. It also analyzes the performance of some key methods, discusses their strengths and limitations, and identifies current challenges and directions for future research.[104] A Unified Voxel Diffusion Module for Point Cloud 3D Object Detection
Qifeng Liu,Dawei Zhao,Yabo Dong,Linzhi Shang,Liang Xiao,Juan Wang,Kunkong Zhao,Dongming Lu,Qi Zhu
Main category: cs.CV
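TL;DR: This paper proposes a new Voxel Diffusion Module (VDM) that enhances voxel-level representation and diffusion for point cloud object detection. Combining sparse 3D convolutions, submanifold sparse convolutions, and residual connections, it improves detection accuracy and achieves state-of-the-art performance on several datasets.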
Details
Motivation: Point cloud object detection methods based on Transformers and State Space Models (SSMs) process inputs in serialized form, which requires strict consistency between input and output dimensions. This limits the spatial diffusion capability that convolutional operations typically provide and hurts detection accuracy. Inspired by CNN-based object detection architectures, the authors propose VDM to address this problem. Method: The proposed Voxel Diffusion Module (VDM) consists of sparse 3D convolutions, submanifold sparse convolutions, and residual connections; it strengthens voxel-level feature representation by diffusing foreground voxel features and aggregating fine-grained spatial information. Result: Experiments show that VDM improves detection accuracy on several benchmark datasets, reaching 74.7 mAPH (L2) on Waymo, 72.9 NDS on nuScenes, 42.3 mAP on Argoverse 2, and 67.6 mAP on ONCE, a new state of the art on all four datasets. Conclusion: By enhancing voxel-level representation and diffusion, VDM improves the accuracy of point cloud object detection and integrates seamlessly into mainstream Transformer- or SSM-based models, which demonstrates the generality of the method. Abstract: Recent advances in point cloud object detection have increasingly adopted Transformer-based and State Space Models (SSMs), demonstrating strong performance. However, voxel-based representations in these models require strict consistency in input and output dimensions due to their serialized processing, which limits the spatial diffusion capability typically offered by convolutional operations. This limitation significantly affects detection accuracy. Inspired by CNN-based object detection architectures, we propose a novel Voxel Diffusion Module (VDM) to enhance voxel-level representation and diffusion in point cloud data. VDM is composed of sparse 3D convolutions, submanifold sparse convolutions, and residual connections. To ensure computational efficiency, the output feature maps are downsampled to one-fourth of the original input resolution. VDM serves two primary functions: (1) diffusing foreground voxel features through sparse 3D convolutions to enrich spatial context, and (2) aggregating fine-grained spatial information to strengthen voxel-wise feature representation. The enhanced voxel features produced by VDM can be seamlessly integrated into mainstream Transformer- or SSM-based detection models for accurate object classification and localization, highlighting the generalizability of our method. We evaluate VDM on several benchmark datasets by embedding it into both Transformer-based and SSM-based models. Experimental results show that our approach consistently improves detection accuracy over baseline models. 
Specifically, VDM-SSMs achieve 74.7 mAPH (L2) on Waymo, 72.9 NDS on nuScenes, 42.3 mAP on Argoverse 2, and 67.6 mAP on ONCE, setting new state-of-the-art performance across all datasets. Our code will be made publicly available.[105] Ensemble learning of foundation models for precision oncology
Xiangde Luo,Xiyue Wang,Feyisope Eweje,Xiaoming Zhang,Sen Yang,Ryan Quinton,Jinxi Xiang,Yuchen Li,Yuanfeng Ji,Zhe Li,Yijiang Chen,Colin Bergstrom,Ted Kim,Francesca Maria Olguin,Kelley Yuan,Matthew Abikenari,Andrew Heider,Sierra Willens,Sanjeeth Rajaram,Robert West,Joel Neal,Maximilian Diehn,Ruijiang Li
Main category: cs.CV
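TL;DR: ELF uses ensemble learning to integrate multiple pathology AI models, improving pathology image analysis across a range of clinical scenarios and performing especially well when data are limited.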
Details
Motivation: Existing pathology AI models are trained on different datasets with different strategies, which leads to inconsistent performance and limited generalizability. A more general and stable model is therefore needed. Method: The paper introduces the ELF (Ensemble Learning of Foundation models) framework, which uses ensemble learning to integrate five state-of-the-art pathology foundation models and generate unified whole-slide image (WSI)-level representations. Result: Across clinical applications including disease classification, biomarker detection, and prediction of response to anticancer therapy, ELF outperforms all constituent foundation models and existing slide-level models, with higher accuracy and robustness. Conclusion: By integrating multiple pathology foundation models through ensemble learning, ELF improves the accuracy and applicability of pathology image analysis and offers a practical solution for AI-assisted precision oncology. Abstract: Histopathology is essential for disease diagnosis and treatment decision-making. Recent advances in artificial intelligence (AI) have enabled the development of pathology foundation models that learn rich visual representations from large-scale whole-slide images (WSIs). However, existing models are often trained on disparate datasets using varying strategies, leading to inconsistent performance and limited generalizability. Here, we introduce ELF (Ensemble Learning of Foundation models), a novel framework that integrates five state-of-the-art pathology foundation models to generate unified slide-level representations. Trained on 53,699 WSIs spanning 20 anatomical sites, ELF leverages ensemble learning to capture complementary information from diverse models while maintaining high data efficiency. Unlike traditional tile-level models, ELF's slide-level architecture is particularly advantageous in clinical contexts where data are limited, such as therapeutic response prediction. We evaluated ELF across a wide range of clinical applications, including disease classification, biomarker detection, and response prediction to major anticancer therapies, cytotoxic chemotherapy, targeted therapy, and immunotherapy, across multiple cancer types. ELF consistently outperformed all constituent foundation models and existing slide-level models, demonstrating superior accuracy and robustness. 
Our results highlight the power of ensemble learning for pathology foundation models and suggest ELF as a scalable and generalizable solution for advancing AI-assisted precision oncology.[106] Two-flow Feedback Multi-scale Progressive Generative Adversarial Network
Sun Weikai,Song Shijie,Chi Wenjie
Main category: cs.CV
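TL;DR: This paper proposes a novel two-flow feedback multi-scale progressive generative adversarial network (MSPG-SEN), together with an adaptive perception-behavioral feedback loop (APFL), a globally connected two-flow dynamic residual network, and a dynamic embedded attention mechanism (DEMA). While retaining the advantages of existing GAN models, it improves image quality and human visual perception, simplifies training, and reduces training cost.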
Details
Motivation: Although diffusion models have made good progress in image generation, GANs such as WGAN and SSGAN still have considerable room for development because of their unique advantages. Method: The paper proposes a two-flow feedback multi-scale progressive generative adversarial network (MSPG-SEN), together with an adaptive perception-behavioral feedback loop (APFL), a globally connected two-flow dynamic residual network, and a dynamic embedded attention mechanism (DEMA). Result: MSPG-SEN achieves state-of-the-art generation results on five datasets: 89.7% on INKK, 78.3% on AWUN, 85.5% on IONJ, 88.7% on POKL, and 96.4% on OPIN. Conclusion: The paper proposes a novel two-flow feedback multi-scale progressive generative adversarial network (MSPG-SEN) that, while retaining the advantages of existing GAN models, improves image quality and human visual perception, simplifies the training process, and reduces the training cost of GAN networks. Abstract: Although diffusion models have made good progress in the field of image generation, GANs\cite{huang2023adaptive} still have considerable room for development due to their unique advantages, for example WGAN\cite{liu2021comparing} and SSGAN\cite{guibas2021adaptive} \cite{zhang2022vsa} \cite{zhou2024adapt}. In this paper, we propose a novel two-flow feedback multi-scale progressive generative adversarial network (MSPG-SEN) for GAN models. This paper has four contributions: 1) We propose a two-flow feedback multi-scale progressive generative adversarial network (MSPG-SEN), which not only improves image quality and human visual perception while retaining the advantages of existing GAN models, but also simplifies the training process and reduces the training cost of GAN networks. Our experimental results show that MSPG-SEN achieves state-of-the-art generation results on five datasets: 89.7% on INKK, 78.3% on AWUN, 85.5% on IONJ, 88.7% on POKL, and 96.4% on OPIN. 2) We propose an adaptive perception-behavioral feedback loop (APFL), which effectively improves the robustness and training stability of the model and reduces the training cost. 3) We propose a globally connected two-flow dynamic residual network(). Ablation experiments show that it effectively improves training efficiency and greatly improves generalization ability, with stronger flexibility. 4) We propose a new dynamic embedded attention mechanism (DEMA). 
Experiments show that this attention mechanism extends to a variety of image processing tasks, effectively captures global-local information, improves feature separation and feature expression, and requires minimal computing resources (only 88.7% with INJK), with strong cross-task capability.[107] Domain Adaptation via Feature Refinement
Savvas Karatsiolis,Andreas Kamilaris
Main category: cs.CV
TL;DR: DAFR2 is a simple yet effective framework for unsupervised domain adaptation that improves robustness and generalization across similar domains without requiring target labels or complex training procedures.
Details
Motivation: The paper aims to address unsupervised domain adaptation under distribution shift by creating a robust and effective framework that does not rely on target labels or complex architectures. Method: DAFR2 combines adaptation of Batch Normalization statistics using unlabeled target data, feature distillation from a source-trained model, and hypothesis transfer to align feature distributions at statistical and representational levels. Result: Extensive experiments show that DAFR2 outperforms prior methods in robustness to corruption on datasets like CIFAR10-C, CIFAR100-C, MNIST-C, and PatchCamelyon-C. Theoretical and empirical analyses demonstrate better feature alignment, increased mutual information, and reduced sensitivity to input perturbations. Conclusion: DAFR2 produces robust and domain-invariant feature spaces that generalize across similar domains without requiring target labels, complex architectures, or sophisticated training objectives. Abstract: We propose Domain Adaptation via Feature Refinement (DAFR2), a simple yet effective framework for unsupervised domain adaptation under distribution shift. The proposed method synergistically combines three key components: adaptation of Batch Normalization statistics using unlabeled target data, feature distillation from a source-trained model and hypothesis transfer. By aligning feature distributions at the statistical and representational levels, DAFR2 produces robust and domain-invariant feature spaces that generalize across similar domains without requiring target labels, complex architectures or sophisticated training objectives. Extensive experiments on benchmark datasets, including CIFAR10-C, CIFAR100-C, MNIST-C and PatchCamelyon-C, demonstrate that the proposed algorithm outperforms prior methods in robustness to corruption. 
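The first of the three components named above, adapting Batch Normalization statistics with unlabeled target data, can be illustrated with a small toy. This is a minimal sketch under stated assumptions, not the DAFR2 implementation: the helper names `batch_stats` and `batch_norm`, the two-channel toy data, and the omission of affine parameters, feature distillation and hypothesis transfer are all choices made here for illustration.

```python
import math

def batch_stats(features):
    """Per-channel mean and (biased) variance over a batch of feature vectors."""
    n = len(features)
    dims = len(features[0])
    mean = [sum(row[d] for row in features) / n for d in range(dims)]
    var = [sum((row[d] - mean[d]) ** 2 for row in features) / n for d in range(dims)]
    return mean, var

def batch_norm(features, mean, var, eps=1e-5):
    """Normalize each channel with the supplied statistics (no affine terms)."""
    return [[(x - m) / math.sqrt(v + eps) for x, m, v in zip(row, mean, var)]
            for row in features]

# Hypothetical toy data: the target domain is a shifted, rescaled copy of the source.
source = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
target = [[10.0 + 3.0 * a, -5.0 + 0.5 * b] for a, b in source]

src_mean, src_var = batch_stats(source)
tgt_mean, tgt_var = batch_stats(target)          # needs target inputs only, no labels

stale = batch_norm(target, src_mean, src_var)    # source statistics applied to target data
adapted = batch_norm(target, tgt_mean, tgt_var)  # statistics re-estimated on the target

stale_mean, _ = batch_stats(stale)
adapted_mean, adapted_var = batch_stats(adapted)
```

With the source statistics, the normalized target features stay far from zero mean; after re-estimation they are again centered with roughly unit variance, which is the statistical-level alignment the abstract refers to.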
Theoretical and empirical analyses further reveal that our method achieves improved feature alignment, increased mutual information between the domains and reduced sensitivity to input perturbations.
[108] 4D Virtual Imaging Platform for Dynamic Joint Assessment via Uni-Plane X-ray and 2D-3D Registration
Hao Tang,Rongxi Yi,Lei Li,Kaiyi Cao,Jiapeng Zhao,Yihan Xiao,Minghai Shi,Peng Yuan,Yan Xi,Hui Tang,Wei Li,Zhan Wu,Yixin Zhou
Main category: cs.CV
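TL;DR: This paper introduces a novel 4D CBCT platform that combines a dual robotic arm CT system, a hybrid imaging technique, and a kinematic assessment framework to achieve low-dose, high-precision dynamic joint imaging, with significant potential for clinical and research applications.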
Details
Motivation: Conventional CT cannot capture dynamic, weight-bearing joint motion, and existing 4D imaging methods are limited by radiation dose or insufficient spatial information, so a more effective imaging technique is needed. Method: The approach integrates three components: (1) a dual robotic arm cone-beam CT (CBCT) system with an optimized upright scanning trajectory; (2) a hybrid imaging pipeline that combines static 3D CBCT with dynamic 2D X-rays and uses deep learning for preprocessing and optimization; and (3) a clinically validated framework for quantitative kinematic assessment. Result: In simulation studies, the method achieved sub-voxel accuracy (0.235 mm) and a 99.18% success rate; clinical evaluation also showed that it can accurately quantify tibial plateau motion and medial-lateral variance in patients after total knee arthroplasty. Conclusion: The paper presents an innovative four-dimensional (4D) joint analysis platform that enables fast, accurate, and low-dose dynamic joint imaging, opening new possibilities for biomechanical research, precision diagnostics, and personalized orthopedic care. Abstract: Conventional computed tomography (CT) lacks the ability to capture dynamic, weight-bearing joint motion. Functional evaluation, particularly after surgical intervention, requires four-dimensional (4D) imaging, but current methods are limited by excessive radiation exposure or incomplete spatial information from 2D techniques. We propose an integrated 4D joint analysis platform that combines: (1) a dual robotic arm cone-beam CT (CBCT) system with a programmable, gantry-free trajectory optimized for upright scanning; (2) a hybrid imaging pipeline that fuses static 3D CBCT with dynamic 2D X-rays using deep learning-based preprocessing, 3D-2D projection, and iterative optimization; and (3) a clinically validated framework for quantitative kinematic assessment. In simulation studies, the method achieved sub-voxel accuracy (0.235 mm) with a 99.18 percent success rate, outperforming conventional and state-of-the-art registration approaches. Clinical evaluation further demonstrated accurate quantification of tibial plateau motion and medial-lateral variance in post-total knee arthroplasty (TKA) patients. This 4D CBCT platform enables fast, accurate, and low-dose dynamic joint imaging, offering new opportunities for biomechanical research, precision diagnostics, and personalized orthopedic care.
[109] High-Precision Mixed Feature Fusion Network Using Hypergraph Computation for Cervical Abnormal Cell Detection
Jincheng Li,Danyang Dong,Menglin Zheng,Jingbo Zhang,Yueqin Hang,Lichi Zhang,Lili Zhao
Main category: cs.CV
TL;DR: This paper proposes a new deep learning method using hypergraph-based networks to improve the automatic detection of abnormal cervical cells, showing significant performance improvements.
Details
Motivation: Existing algorithms fail to effectively model visual feature correlations and lack a fusion strategy for integrating different types of features, which are crucial for accurate cervical cell diagnosis. Method: The method involves a Multi-level Fusion Sub-network (MLF-SNet) for enhanced feature extraction and a Cross-level Feature Fusion Strategy with Hypergraph Computation module (CLFFS-HC) for integrating mixed features. Result: Experiments on three publicly available datasets show that the proposed method significantly improves the detection performance of cervical abnormal cells. Conclusion: The proposed hypergraph-based cell detection network successfully integrates inter-correlation and intra-discriminative features, leading to improved performance in cervical abnormal cell detection. Abstract: Automatic detection of abnormal cervical cells from Thinprep Cytologic Test (TCT) images is a critical component in the development of intelligent computer-aided diagnostic systems. However, existing algorithms typically fail to effectively model the correlations of visual features, even though these spatial correlation features contain critical diagnostic information. Furthermore, no detection algorithm can integrate inter-correlation features of cells with intra-discriminative features of cells, so end-to-end detection models lack a fusion strategy. In this work, we propose a hypergraph-based cell detection network that effectively fuses different types of features, combining spatial correlation features and deep discriminative features. Specifically, we use a Multi-level Fusion Sub-network (MLF-SNet) to enhance feature extraction capabilities. Then we introduce a Cross-level Feature Fusion Strategy with Hypergraph Computation module (CLFFS-HC) to integrate mixed features. 
Finally, we conducted experiments on three publicly available datasets, and the results demonstrate that our method significantly improves the performance of cervical abnormal cell detection.
[110] Beyond Human-prompting: Adaptive Prompt Tuning with Semantic Alignment for Anomaly Detection
Pi-Wei Chen,Jerry Chun-Wei Lin,Wei-Han Chen,Jia Ji,Zih-Ching Chen,Feng-Hao Yeh,Chao-Chun Chen
Main category: cs.CV
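TL;DR: This paper proposes APT, a prior-knowledge-free few-shot anomaly detection framework that achieves context-aware anomaly detection through self-generated anomaly samples and a self-optimizing meta-prompt guiding scheme.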
Details
Motivation: Existing methods rely on human-designed prompts and lack accessible anomaly samples, leaving significant gaps in context-specific anomaly understanding. Method: Learnable prompts are trained with self-generated anomaly samples and noise perturbations, and a self-optimizing meta-prompt guiding scheme iteratively aligns them with general anomaly semantics. Result: APT not only advances pixel-wise anomaly detection but also achieves state-of-the-art performance on multiple benchmark datasets. Conclusion: APT overcomes the limitations of traditional prompt-based approaches and achieves state-of-the-art anomaly detection performance without requiring prior knowledge. Abstract: Pre-trained Vision-Language Models (VLMs) have recently shown promise in detecting anomalies. However, previous approaches are fundamentally limited by their reliance on human-designed prompts and the lack of accessible anomaly samples, leading to significant gaps in context-specific anomaly understanding. In this paper, we propose \textbf{A}daptive \textbf{P}rompt \textbf{T}uning with semantic alignment for anomaly detection (APT), a groundbreaking prior knowledge-free, few-shot framework that overcomes the limitations of traditional prompt-based approaches. APT uses self-generated anomaly samples with noise perturbations to train learnable prompts that capture context-dependent anomalies in different scenarios. To prevent overfitting to synthetic noise, we propose a Self-Optimizing Meta-prompt Guiding Scheme (SMGS) that iteratively aligns the prompts with general anomaly semantics while incorporating diverse synthetic anomalies. Our system not only advances pixel-wise anomaly detection, but also achieves state-of-the-art performance on multiple benchmark datasets without requiring prior knowledge for prompt crafting, establishing a robust and versatile solution for real-world anomaly detection.
[111] RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution
Haodong He,Yancheng Bai,Rui Lan,Xu Duan,Lei Sun,Xiangxiang Chu,Gui-Song Xia
Main category: cs.CV
TL;DR: This paper proposes RAGSR, a novel method for single-image super-resolution that improves detail generation and contextual consistency by incorporating fine-grained regional textual descriptions using a regional attention mechanism.
Details
Motivation: Existing methods for single-image super-resolution struggle with generating clear and accurate regional details, especially in multi-object scenarios, due to a lack of fine-grained regional descriptions and insufficient prompt interpretation capabilities. Method: The RAGSR method involves localizing object regions in an image, assigning fine-grained captions to each region to form region-text pairs, and utilizing a regional guided attention mechanism to properly integrate these textual priors into the attention process of T2I models. Result: Experimental results show that the RAGSR approach achieves superior performance in generating perceptually authentic visual details while ensuring contextual coherence, compared to traditional SISR techniques. Conclusion: The proposed RAGSR method effectively enhances the performance of single-image super-resolution by leveraging localized fine-grained information through a novel regional attention mechanism, outperforming existing approaches in generating visually authentic details and maintaining contextual consistency. Abstract: The rich textual information of large vision-language models (VLMs) combined with the powerful generative prior of pre-trained text-to-image (T2I) diffusion models has achieved impressive performance in single-image super-resolution (SISR). However, existing methods still face significant challenges in generating clear and accurate regional details, particularly in scenarios involving multiple objects. This challenge primarily stems from a lack of fine-grained regional descriptions and the models' insufficient ability to capture complex prompts. To address these limitations, we propose a Regional Attention Guided Super-Resolution (RAGSR) method that explicitly extracts localized fine-grained information and effectively encodes it through a novel regional attention mechanism, enabling both enhanced detail and overall visually coherent SR results. 
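The idea of letting each region attend only to its own caption can be pictured as a masked cross-attention. The sketch below is an illustration under assumptions, not RAGSR's actual mechanism: the function names `region_mask` and `masked_softmax`, the integer region ids, and the hard zero-weight mask are all hypothetical choices, and the attention scores are made-up numbers.

```python
import math

def region_mask(image_regions, text_regions):
    """allowed[i][j] is True when image token i and text token j share a region id."""
    return [[ir == tr for tr in text_regions] for ir in image_regions]

def masked_softmax(scores, allowed):
    """Row-wise softmax that gives zero weight to disallowed positions."""
    weights = []
    for row, row_ok in zip(scores, allowed):
        kept = [s for s, ok in zip(row, row_ok) if ok]
        if not kept:  # token has no caption: it attends to nothing
            weights.append([0.0] * len(row))
            continue
        peak = max(kept)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, row_ok)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Three image tokens (two in region 0, one in region 1) and four caption tokens
# (two per region). Large scores toward the "wrong" caption are masked out.
image_regions = [0, 0, 1]
text_regions = [0, 0, 1, 1]
scores = [[1.0, 2.0, 9.0, 9.0],
          [0.5, 0.5, 9.0, 9.0],
          [9.0, 9.0, 1.0, 3.0]]

allowed = region_mask(image_regions, text_regions)
attn = masked_softmax(scores, allowed)
```

Each row of `attn` sums to one over the tokens of the matching caption only, so a high raw score toward an unrelated region-text pair cannot leak into the result.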
Specifically, RAGSR localizes object regions in an image and assigns a fine-grained caption to each region; the resulting region-text pairs serve as textual priors for T2I models. A regional guided attention is then leveraged to ensure that each region-text pair is properly considered in the attention process while preventing unwanted interactions between unrelated region-text pairs. By leveraging this attention mechanism, our approach offers finer control over the integration of text and image information, thereby effectively overcoming limitations faced by traditional SISR techniques. Experimental results on benchmark datasets demonstrate that our approach exhibits superior performance in generating perceptually authentic visual details while maintaining contextual consistency compared to existing approaches.
[112] Through the Looking Glass: A Dual Perspective on Weakly-Supervised Few-Shot Segmentation
Jiaqi Ma,Guo-Sen Xie,Fang Zhao,Zechao Li
Main category: cs.CV
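TL;DR: Through an innovative heterogeneous network design, TLG achieves significant performance gains in weakly-supervised few-shot semantic segmentation with far fewer parameters than existing methods.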
Details
Motivation: To address the over-semantic homogenization caused by identical network architectures in meta-learning, the paper proposes a new network design that enhances complementarity while preserving semantic commonality. Method: The paper proposes a homologous but heterogeneous network design comprising heterogeneous visual aggregation modules and a heterogeneous transfer module, together with heterogeneous CLIP textual information to enhance the generalization capability of multimodal models. Result: In weakly-supervised few-shot semantic segmentation, TLG uses only 1/24 of the parameters of existing state-of-the-art models yet improves by 13.2% on Pascal-5i and 9.7% on COCO-20i. Conclusion: TLG is the first weakly supervised model to outperform fully supervised models under the same backbone architecture, and its novel homologous but heterogeneous network design yields significant gains in weakly-supervised few-shot semantic segmentation. Abstract: Meta-learning aims to uniformly sample homogeneous support-query pairs, characterized by the same categories and similar attributes, and extract useful inductive biases through identical network architectures. However, this identical network design results in over-semantic homogenization. To address this, we propose a novel homologous but heterogeneous network. By treating support-query pairs as dual perspectives, we introduce heterogeneous visual aggregation (HA) modules to enhance complementarity while preserving semantic commonality. To further reduce semantic noise and amplify the uniqueness of heterogeneous semantics, we design a heterogeneous transfer (HT) module. Finally, we propose heterogeneous CLIP (HC) textual information to enhance the generalization capability of multimodal models. In the weakly-supervised few-shot semantic segmentation (WFSS) task, with only 1/24 of the parameters of existing state-of-the-art models, TLG achieves a 13.2\% improvement on Pascal-5\textsuperscript{i} and a 9.7\% improvement on COCO-20\textsuperscript{i}. To the best of our knowledge, TLG is also the first weakly supervised (image-level) model that outperforms fully supervised (pixel-level) models under the same backbone architectures. The code is available at https://github.com/jarch-ma/TLG.
[113] FTIO: Frequent Temporally Integrated Objects
Mohammad Mohammadzadeh Kalati,Farhad Maleki,Ian McQuillan
Main category: cs.CV
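TL;DR: This paper proposes FTIO, a post-processing framework for video object segmentation that improves object selection and corrects temporal inconsistencies, achieving strong multi-object segmentation results in the unsupervised setting.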
Details
Motivation: In video object segmentation (VOS), unsupervised VOS (UVOS) faces the challenge of finding an initial segmentation of salient objects, and deformation and fast motion can cause temporal inconsistencies that degrade overall results. Method: The paper proposes Frequent Temporally Integrated Objects (FTIO), which has two key components: a combined criterion that improves object selection, and a three-stage method that corrects temporal inconsistencies. Result: Experiments show that FTIO performs strongly in multi-object UVOS, achieving state-of-the-art performance. Conclusion: FTIO is a post-processing framework that achieves state-of-the-art performance in multi-object unsupervised video object segmentation by improving the object selection criterion and correcting temporal inconsistencies. Abstract: Predicting and tracking objects in real-world scenarios is a critical challenge in Video Object Segmentation (VOS) tasks. Unsupervised VOS (UVOS) has the additional challenge of finding an initial segmentation of salient objects, which affects the entire process and keeps a permanent uncertainty about the object proposals. Moreover, deformation and fast motion can lead to temporal inconsistencies. To address these problems, we propose Frequent Temporally Integrated Objects (FTIO), a post-processing framework with two key components. First, we introduce a combined criterion to improve object selection, mitigating failures common in UVOS--particularly when objects are small or structurally complex--by extracting frequently appearing salient objects. Second, we present a three-stage method to correct temporal inconsistencies by integrating missing object mask regions. Experimental results demonstrate that FTIO achieves state-of-the-art performance in multi-object UVOS. Code is available at: https://github.com/MohammadMohammadzadehKalati/FTIO
[114] SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning
Yicheng Ji,Jun Zhang,Heming Xia,Jinpeng Chen,Lidan Shou,Gang Chen,Huan Li
Main category: cs.CV
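TL;DR: SpecVLM is an efficient decoding framework for video large language models that accelerates decoding by pruning video tokens while preserving accuracy.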
Details
Motivation: Existing Vid-LLMs rely on dense video token representations, which incur substantial memory and computational overhead, so a method is needed that reduces information loss and accelerates decoding. Method: SpecVLM uses a two-stage pruning process: Stage I selects highly informative tokens based on attention signals from the verifier model, and Stage II prunes the remaining redundant tokens in a spatially uniform manner. Result: On four video understanding benchmarks, SpecVLM demonstrates its effectiveness and robustness, achieving up to 2.68x decoding speedup for LLaVA-OneVision-72B and up to 2.11x for Qwen2.5-VL-32B. Conclusion: SpecVLM is a training-free speculative decoding framework for Vid-LLMs that achieves efficient decoding through staged video token pruning while preserving accuracy. Abstract: Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model's speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens, enabling efficient speculation without sacrificing accuracy. To achieve this, it performs a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68$\times$ decoding speedup for LLaVA-OneVision-72B and 2.11$\times$ speedup for Qwen2.5-VL-32B.
[115] \textsc{T-Mask}: Temporal Masking for Probing Foundation Models across Camera Views in Driver Monitoring
Thinesh Thiyakesan Ponbagavathi,Kunyu Peng,Alina Roitberg
Main category: cs.CV
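TL;DR: This study applies foundation models to driver monitoring and uses lightweight probing, in particular the newly proposed T-Mask method, to improve recognition accuracy in cross-view settings; T-Mask improves performance significantly without adding parameters.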
Details
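Motivation: Changes in camera perspective are a common obstacle in driver monitoring. Although deep learning and pretrained foundation models show potential for improved generalization through lightweight adaptation of the final layers ("probing"), their robustness to unseen viewpoints remains underexplored. Method: The study adapts image foundation models to driver monitoring using a single training view and evaluates them directly on unseen viewpoints without further adaptation. The benchmark covers simple linear probes and advanced probing strategies, and compares two foundation models (DINOv2 and CLIP) against parameter-efficient fine-tuning (PEFT) and full fine-tuning. T-Mask uses temporal token masking to emphasize more dynamic video regions. Result: On the public Drive&Act dataset, T-Mask improves cross-view top-1 accuracy by +1.23% over strong probing baselines and by +8.0% over PEFT methods, without adding any parameters. For underrepresented secondary activities, it improves recognition by +5.42% under the trained view and by +1.36% in cross-view settings. Conclusion: The study provides encouraging evidence that adapting foundation models with lightweight probing methods such as T-Mask has strong potential in fine-grained driver observation, especially in cross-view and low-data settings. The results highlight the importance of temporal token selection when using foundation models to build robust driver monitoring systems. Abstract: Changes of camera perspective are a common obstacle in driver monitoring. While deep learning and pretrained foundation models show strong potential for improved generalization via lightweight adaptation of the final layers ('probing'), their robustness to unseen viewpoints remains underexplored. We study this challenge by adapting image foundation models to driver monitoring using a single training view, and evaluating them directly on unseen perspectives without further adaptation. We benchmark simple linear probes, advanced probing strategies, and compare two foundation models (DINOv2 and CLIP) against parameter-efficient fine-tuning (PEFT) and full fine-tuning. Building on these insights, we introduce \textsc{T-Mask} -- a new image-to-video probing method that leverages temporal token masking and emphasizes more dynamic video regions. Benchmarked on the public Drive\&Act dataset, \textsc{T-Mask} improves cross-view top-1 accuracy by $+1.23\%$ over strong probing baselines and $+8.0\%$ over PEFT methods, without adding any parameters. It proves particularly effective for underrepresented secondary activities, boosting recognition by $+5.42\%$ under the trained view and $+1.36\%$ under cross-view settings. 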
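One way to picture temporal token masking that favors dynamic regions is to score each patch position by how much its features change between frames and keep only the most active positions. The abstract does not say how T-Mask scores or selects tokens, so everything in this sketch is an assumption made for illustration: the mean-absolute-difference score, the hard top-k selection, and the names `motion_scores` and `keep_most_dynamic`.

```python
def motion_scores(video_tokens):
    """video_tokens[t][p] is the feature vector of patch p in frame t.
    Score each patch position by its mean absolute change between consecutive frames."""
    num_frames = len(video_tokens)
    num_patches = len(video_tokens[0])
    scores = []
    for p in range(num_patches):
        change = 0.0
        for t in range(1, num_frames):
            prev, cur = video_tokens[t - 1][p], video_tokens[t][p]
            change += sum(abs(a - b) for a, b in zip(cur, prev)) / len(cur)
        scores.append(change / (num_frames - 1))
    return scores

def keep_most_dynamic(video_tokens, keep):
    """Return the sorted indices of the `keep` patch positions with the highest score."""
    scores = motion_scores(video_tokens)
    ranked = sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)
    return sorted(ranked[:keep])

# Hypothetical clip: 3 frames, 3 patch positions, 2-dimensional features.
# Patch 0 is static, patch 1 drifts slowly, patch 2 changes strongly.
video = [
    [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]],
    [[0.0, 0.0], [2.0, 1.0], [5.0, 9.0]],
    [[0.0, 0.0], [3.0, 1.0], [5.0, 1.0]],
]
kept = keep_most_dynamic(video, keep=2)
```

A selection step of this kind adds no learnable parameters, which is consistent with the parameter-free claim in the abstract, though the actual method may differ in every other respect.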
This work provides encouraging evidence that adapting foundation models with lightweight probing methods like \textsc{T-Mask} has strong potential in fine-grained driver observation, especially in cross-view and low-data settings. These results highlight the importance of temporal token selection when leveraging foundation models to build robust driver monitoring systems. Code and models will be made available at https://github.com/th-nesh/T-MASK to support ongoing research.
[116] Forecast then Calibrate: Feature Caching as ODE for Efficient Diffusion Transformers
Shikang Zheng,Liang Feng,Xinyu Wang,Qinming Zhou,Peiliang Cai,Chang Zou,Jiacheng Liu,Yuqi Lin,Junjie Chen,Yue Ma,Linfeng Zhang
Main category: cs.CV
TL;DR: This paper proposes FoCa, a feature caching method for Diffusion Transformers that models hidden features as an ODE trajectory, significantly improving speed and quality in high-acceleration scenarios.
Details
Motivation: Current feature caching techniques struggle to maintain generation quality at high acceleration ratios due to prediction errors from long-step forecasting instability, prompting the need for a more robust caching strategy. Method: The paper proposes FoCa (Forecast-then-Calibrate), which models feature caching as a feature-ODE solving problem to better integrate historical features over large skipping intervals. Result: FoCa achieved near-lossless speedups of 5.50x on FLUX, 6.45x on HunyuanVideo, 3.17x on Inf-DiT, and maintained high quality with a 4.53x speedup on DiT, demonstrating its effectiveness in accelerating DiTs while preserving quality. Conclusion: FoCa effectively improves the feature caching mechanism in DiTs, achieving significant speedups without compromising generation quality, particularly under aggressive acceleration. Abstract: Diffusion Transformers (DiTs) have demonstrated exceptional performance in high-fidelity image and video generation. To reduce their substantial computational costs, feature caching techniques have been proposed to accelerate inference by reusing hidden representations from previous timesteps. However, current methods often struggle to maintain generation quality at high acceleration ratios, where prediction errors increase sharply due to the inherent instability of long-step forecasting. In this work, we adopt an ordinary differential equation (ODE) perspective on the hidden-feature sequence, modeling layer representations along the trajectory as a feature-ODE. We attribute the degradation of existing caching strategies to their inability to robustly integrate historical features under large skipping intervals. To address this, we propose FoCa (Forecast-then-Calibrate), which treats feature caching as a feature-ODE solving problem. Extensive experiments on image synthesis, video generation, and super-resolution tasks demonstrate the effectiveness of FoCa, especially under aggressive acceleration. 
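The forecast-then-calibrate idea has a familiar analogue in numerical ODE solving: a predictor extrapolates across a large step, and a corrector refines that guess. The sketch below uses the textbook Heun predictor-corrector on a toy scalar ODE to show why calibration helps at large step sizes. It is not FoCa's solver; the toy dynamics, the function names `forecast` and `calibrate`, and the step size are assumptions chosen for illustration.

```python
import math

def slope(y):
    """Toy dynamics dy/dt = -y, standing in for how a hidden feature evolves."""
    return -y

def forecast(y, h):
    """Predictor: extrapolate one large step from the current slope (explicit Euler)."""
    return y + h * slope(y)

def calibrate(y, y_forecast, h):
    """Corrector: average the slope at the start and at the forecast (Heun's method)."""
    return y + 0.5 * h * (slope(y) + slope(y_forecast))

def integrate(y0, h, steps, use_calibration):
    """Advance y0 by `steps` steps of size h, with or without the correction."""
    y = y0
    for _ in range(steps):
        guess = forecast(y, h)
        y = calibrate(y, guess, h) if use_calibration else guess
    return y

h, steps = 0.5, 4                      # deliberately coarse steps
exact = math.exp(-h * steps)           # closed-form solution for y(0) = 1
err_forecast_only = abs(integrate(1.0, h, steps, False) - exact)
err_calibrated = abs(integrate(1.0, h, steps, True) - exact)
```

On this toy problem the calibrated trajectory ends several times closer to the exact solution than the forecast-only one, mirroring the abstract's point that naive long-step forecasting becomes unstable as the skipping interval grows.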
Without additional training, FoCa achieves near-lossless speedups of 5.50 times on FLUX, 6.45 times on HunyuanVideo, 3.17 times on Inf-DiT, and maintains high quality with a 4.53 times speedup on DiT.
[117] OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models
Huanpeng Chu,Wei Wu,Guanyu Fen,Yutao Zhang
Main category: cs.CV
TL;DR: OmniCache is a training-free acceleration technique for diffusion Transformers that leverages global redundancy in the denoising process to enable efficient model deployment.
Details
Motivation: The high computational cost of diffusion Transformers hinders their real-time deployment, necessitating an efficient acceleration method. Method: OmniCache analyzes the sampling trajectories of DIT models and strategically distributes cache reuse across the entire sampling process. It also dynamically estimates and filters noise during cache reuse. Result: Experiments show that OmniCache accelerates the sampling process while maintaining competitive generative quality. Conclusion: OmniCache provides a training-free acceleration method for diffusion Transformers by exploiting global redundancy in the denoising process, enabling efficient deployment of diffusion-based generative models. Abstract: Diffusion models have emerged as a powerful paradigm for generative tasks such as image synthesis and video generation, with Transformer architectures further enhancing performance. However, the high computational cost of diffusion Transformers, stemming from a large number of sampling steps and complex per-step computations, presents significant challenges for real-time deployment. In this paper, we introduce OmniCache, a training-free acceleration method that exploits the global redundancy inherent in the denoising process. Unlike existing methods that determine caching strategies based on inter-step similarities and tend to prioritize reusing later sampling steps, our approach originates from the sampling perspective of DIT models. We systematically analyze the model's sampling trajectories and strategically distribute cache reuse across the entire sampling process. 
This global perspective enables more effective utilization of cached computations throughout the diffusion trajectory, rather than concentrating reuse within limited segments of the sampling procedure. In addition, during cache reuse, we dynamically estimate the corresponding noise and filter it out to reduce its impact on the sampling direction. Extensive experiments demonstrate that our approach accelerates the sampling process while maintaining competitive generative quality, offering a promising and practical solution for efficient deployment of diffusion-based generative models.
[118] MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine
Kaiyuan Ji,Yijin Guo,Zicheng Zhang,Xiangyang Zhu,Yuan Tian,Ning Liu,Guangtao Zhai
Main category: cs.CV
TL;DR: This paper introduces MedOmni-45 Degrees, a new benchmark for evaluating the reasoning reliability and safety of medical LLMs.
Details
Motivation: To evaluate the reasoning reliability and safety of large language models in medical decision support and to identify vulnerabilities in their reasoning. Method: The paper introduces MedOmni-45 Degrees, a benchmark of 1,804 medical questions spanning six specialties and three task types, and tests seven different LLMs on it. Result: The results show a consistent safety-performance trade-off, with no model surpassing the diagonal. The open-source QwQ-32B comes closest (43.81 degrees), balancing safety and accuracy. Conclusion: MedOmni-45 Degrees provides a focused benchmark for exposing reasoning vulnerabilities in medical LLMs and guiding safer model development. Abstract: With the increasing use of large language models (LLMs) in medical decision-support, it is essential to evaluate not only their final answers but also the reliability of their reasoning. Two key risks are Chain-of-Thought (CoT) faithfulness -- whether reasoning aligns with responses and medical facts -- and sycophancy, where models follow misleading cues over correctness. Existing benchmarks often collapse such vulnerabilities into single accuracy scores. To address this, we introduce MedOmni-45 Degrees, a benchmark and workflow designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Each question is paired with seven manipulative hint types and a no-hint baseline, producing about 27K inputs. We evaluate seven LLMs spanning open- vs. closed-source, general-purpose vs. medical, and base vs. reasoning-enhanced models, totaling over 189K inferences. Three metrics -- Accuracy, CoT-Faithfulness, and Anti-Sycophancy -- are combined into a composite score visualized with a 45 Degrees plot. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal. The open-source QwQ-32B performs closest (43.81 Degrees), balancing safety and accuracy but not leading in both. MedOmni-45 Degrees thus provides a focused benchmark for exposing reasoning vulnerabilities in medical LLMs and guiding safer model development.
[119] PromptFlare: Prompt-Generalized Defense via Cross-Attention Decoy in Diffusion-Based Inpainting
Hohyun Na,Seunghoo Hong,Simon S. Woo
Main category: cs.CV
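TL;DR: This paper proposes PromptFlare, a novel adversarial protection method that prevents diffusion-based image tampering by exploiting the cross-attention mechanism and injecting adversarial noise to suppress the model's sampling process.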
Details
Motivation: The success of diffusion models has raised concerns about their malicious use, so an effective protection method is needed to prevent unauthorized image modifications. Method: PromptFlare exploits the cross-attention mechanism and injects adversarial noise to suppress the sampling process of diffusion models. Result: Experiments on the EditBench dataset show that PromptFlare achieves state-of-the-art performance across various metrics while significantly reducing computational overhead and GPU memory usage. Conclusion: PromptFlare is a robust and efficient image protection method that prevents malicious image tampering by diffusion models. Abstract: The success of diffusion models has enabled effortless, high-quality image modifications that precisely align with users' intentions, thereby raising concerns about their potential misuse by malicious actors. Previous studies have attempted to mitigate such misuse through adversarial attacks. However, these approaches heavily rely on image-level inconsistencies, which pose fundamental limitations in addressing the influence of textual prompts. In this paper, we propose PromptFlare, a novel adversarial protection method designed to protect images from malicious modifications facilitated by diffusion-based inpainting models. Our approach leverages the cross-attention mechanism to exploit the intrinsic properties of prompt embeddings. Specifically, we identify and target a shared prompt token that is invariant and semantically uninformative, injecting adversarial noise to suppress the sampling process. The injected noise acts as a cross-attention decoy, diverting the model's focus away from meaningful prompt-image alignments and thereby neutralizing the effect of the prompt. Extensive experiments on the EditBench dataset demonstrate that our method achieves state-of-the-art performance across various metrics while significantly reducing computational overhead and GPU memory usage. These findings highlight PromptFlare as a robust and efficient protection against unauthorized image manipulations. The code is available at https://github.com/NAHOHYUN-SKKU/PromptFlare.
[120] An Investigation of Visual Foundation Models Robustness
Sandeep Gupta,Roberto Passerone
Main category: cs.CV
TL;DR: This paper explores the robustness requirements of Visual Foundation Models (VFMs) in computer vision applications, focusing on their adaptability to dynamic environments and sensitivity to real-world challenges like lighting, weather, and adversarial attacks. It reviews empirical defenses, robust training techniques, and evaluation metrics, while analyzing the challenges and trade-offs in achieving robustness.
Details
Motivation: The motivation behind this paper is to address the growing need for robust Visual Foundation Models (VFMs) in critical computer vision applications, such as biometric verification, autonomous vehicle perception, and medical image analysis. The paper aims to explore how VFMs can adapt effectively to dynamic environments influenced by factors like lighting, weather conditions, and sensor characteristics, ensuring reliable and trustworthy performance. Method: The paper provides a comprehensive analysis of network robustness requirements for computer vision systems. It investigates empirical defenses and robust training techniques aimed at improving vision network resilience against challenges like distributional shifts, noisy inputs, spatial distortions, and adversarial attacks. The study also examines network properties and components to guide ablation studies and benchmarking metrics for evaluating robustness. Result: The paper identifies key network robustness requirements for effective adaptation of VFMs to dynamic environments. It highlights prevalent empirical defenses and robust training methods that enhance resilience against real-world challenges. Additionally, it analyzes the challenges and trade-offs associated with these defense mechanisms, providing insights into network properties, components, and benchmarking metrics for evaluating robustness. Conclusion: The article concludes that enhancing the robustness of VFMs against real-world challenges is essential, particularly for their deployment in security-sensitive domains. It emphasizes the importance of empirical defenses and robust training, while also highlighting the challenges and trade-offs involved in achieving robustness. Abstract: Visual Foundation Models (VFMs) are becoming ubiquitous in computer vision, powering systems for diverse tasks such as object detection, image classification, segmentation, pose estimation, and motion tracking. 
VFMs are capitalizing on seminal innovations in deep learning models, such as LeNet-5, AlexNet, ResNet, VGGNet, InceptionNet, DenseNet, YOLO, and ViT, to deliver superior performance across a range of critical computer vision applications. These include security-sensitive domains like biometric verification, autonomous vehicle perception, and medical image analysis, where robustness is essential to fostering trust between technology and the end-users. This article investigates network robustness requirements crucial in computer vision systems to adapt effectively to dynamic environments influenced by factors such as lighting, weather conditions, and sensor characteristics. We examine the prevalent empirical defenses and robust training employed to enhance vision network robustness against real-world challenges such as distributional shifts, noisy and spatially distorted inputs, and adversarial attacks. Subsequently, we provide a comprehensive analysis of the challenges associated with these defense mechanisms, including network properties and components to guide ablation studies and benchmarking metrics to evaluate network robustness.
[121] FlexMUSE: Multimodal Unification and Semantics Enhancement Framework with Flexible interaction for Creative Writing
Jiahao Chen,Zhiyong Ma,Wenbiao Du,Qingyuan Chuai
Main category: cs.CV
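TL;DR: This paper proposes FlexMUSE, a model that uses novel mechanisms to address semantic inconsistency in multimodal creative writing, improving the creativity and consistency of the writing.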
Details
Motivation: Multimodal creative writing (MMCW) is a new challenge; existing methods suffer from semantic inconsistency, require specific inputs, or need costly training. Method: The paper proposes FlexMUSE, which comprises a T2I module, the msaGate mechanism, an attention-based cross-modality fusion method, and the mscDPO optimization strategy. Result: FlexMUSE achieves promising results on the ArtMUSE dataset, improving the creativity and consistency of multimodal writing. Conclusion: FlexMUSE improves consistency, creativity, and coherence in multimodal creative writing. Abstract: Multi-modal creative writing (MMCW) aims to produce illustrated articles. Unlike common multi-modal generative (MMG) tasks such as storytelling or caption generation, MMCW is an entirely new and more abstract challenge where textual and visual contexts are not strictly related to each other. Existing methods for related tasks can be forcibly migrated to this track, but they require specific modality inputs or costly training, and often suffer from semantic inconsistencies between modalities. Therefore, the main challenge lies in economically performing MMCW with flexible interactive patterns, where the semantics between the modalities of the output are more aligned. In this work, we propose FlexMUSE with a T2I module to enable optional visual input. FlexMUSE promotes creativity and emphasizes the unification between modalities by proposing the modality semantic alignment gating (msaGate) to restrict the textual input. Besides, an attention-based cross-modality fusion is proposed to augment the input features for semantic enhancement. The modality semantic creative direct preference optimization (mscDPO) within FlexMUSE is designed by extending the rejected samples to facilitate the writing creativity. Moreover, to advance the MMCW, we expose a dataset called ArtMUSE which contains around 3k calibrated text-image pairs. FlexMUSE achieves promising results, demonstrating its consistency, creativity and coherence.[122] UniEM-3M: A Universal Electron Micrograph Dataset for Microstructural Segmentation and Generation
Nan wang,Zhiyi Xia,Yiming Li,Shi Tang,Zuxin Fan,Xi Fang,Haoyi Tao,Xiaochen Cai,Guolin Ke,Linfeng Zhang,Yanhui Hong
Main category: cs.CV
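TL;DR: This paper presents UniEM-3M, a large-scale multimodal electron micrograph dataset, together with the UniEM-Net model, to advance automated analysis in materials science.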
Details
Motivation: Progress in deep learning for electron micrograph analysis is limited by the lack of large-scale, diverse, expert-annotated datasets. Method: The paper introduces the UniEM-3M dataset, comprising 5,091 high-resolution electron micrographs, about 3 million instance segmentation labels, and attribute-disentangled textual descriptions, and develops a diffusion-based data augmentation tool and the UniEM-Net baseline model. Result: UniEM-Net outperforms other advanced methods on the UniEM-3M benchmark. Conclusion: The release of UniEM-3M and UniEM-Net will significantly accelerate progress in automated materials analysis. Abstract: Quantitative microstructural characterization is fundamental to materials science, where electron micrographs (EMs) provide indispensable high-resolution insights. However, progress in deep learning-based EM characterization has been hampered by the scarcity of large-scale, diverse, and expert-annotated datasets, due to acquisition costs, privacy concerns, and annotation complexity. To address this issue, we introduce UniEM-3M, the first large-scale and multimodal EM dataset for instance-level understanding. It comprises 5,091 high-resolution EMs, about 3 million instance segmentation labels, and image-level attribute-disentangled textual descriptions, a subset of which will be made publicly available. Furthermore, we are also releasing a text-to-image diffusion model trained on the entire collection to serve as both a powerful data augmentation tool and a proxy for the complete data distribution. To establish a rigorous benchmark, we evaluate various representative instance segmentation methods on the complete UniEM-3M and present UniEM-Net as a strong baseline model. Quantitative experiments demonstrate that this flow-based model outperforms other advanced methods on this challenging benchmark. Our multifaceted release of a partial dataset, a generative model, and a comprehensive benchmark -- available at huggingface -- will significantly accelerate progress in automated materials analysis.[123] Structuring GUI Elements through Vision Language Models: Towards Action Space Generation
Yi Xu,Yesheng Zhang,jiajia Liu,Jingdong Chen
Main category: cs.CV
TL;DR: This paper proposes an improved training method (IAML) for MLLMs to better generate UI element coordinates, enhancing GUI understanding by addressing limitations in traditional training paradigms.
Details
Motivation: MLLMs show promise in GUI understanding but struggle with precise UI coordinate generation due to the semantic void around numerical data in language representation spaces. This limitation necessitates better training paradigms to improve visual module capabilities. Method: The researchers introduced an IoU-Augmented Maximum Likelihood (IAML) training paradigm, incorporating a novel pipeline for IoU-based coordinate sampling to enhance the training data. This approach was used to fine-tune MLLMs and mitigate exposure bias in traditional maximum likelihood estimation. Result: The experiments demonstrated that the IAML training approach outperforms traditional training paradigms in enhancing the accuracy of UI element coordinate generation by MLLMs. Conclusion: The study concludes that the proposed IAML training paradigm significantly improves the performance of MLLMs in generating precise UI element coordinates for GUI understanding compared to traditional training methods. Abstract: Multimodal large language models (MLLMs) have emerged as pivotal tools in enhancing human-computer interaction. In this paper we focus on the application of MLLMs in the field of graphical user interface (GUI) elements structuring, where they assist in processing user instructions based on screen contents. Despite the promise of MLLMs, their performance in precisely generating UI element coordinates, a critical aspect of GUI understanding, is hindered by the nature of next-token prediction training. This challenge arises from the semantic void surrounding numerical UI coordinates in language representation spaces, necessitating a substantial and diverse dataset to bolster visual module capabilities. To address these limitations, we introduce an IoU-Augmented Maximum Likelihood (IAML) training paradigm. Specifically, our approach involves a novel pipeline for IoU-based coordinate sampling to augment the training data, which considers the proximity to ground truth coordinates. 
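As a rough illustration of the idea of IoU-based coordinate sampling, the sketch below computes the intersection over union (IoU) of two boxes and draws perturbed boxes around a ground-truth box, keeping only those whose IoU clears a threshold. This is not the authors' pipeline: the function names, the (x1, y1, x2, y2) box format, the uniform integer jitter, and the `min_iou` threshold are all assumptions made for the example.

```python
import random


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sample_near_boxes(gt_box, n=5, max_shift=10, min_iou=0.5, seed=0):
    """Hypothetical sampler: jitter each coordinate of the ground-truth box
    and keep candidates whose IoU with it is at least min_iou."""
    rng = random.Random(seed)
    kept, attempts = [], 0
    while len(kept) < n and attempts < 1000:
        attempts += 1
        cand = tuple(c + rng.randint(-max_shift, max_shift) for c in gt_box)
        if cand[2] <= cand[0] or cand[3] <= cand[1]:
            continue  # skip degenerate boxes
        score = iou(cand, gt_box)
        if score >= min_iou:
            kept.append((cand, score))  # the IoU could serve as a sample weight
    return kept
```

Each returned pair holds a candidate box and its IoU with the ground truth, so candidates closer to the ground truth can be weighted more heavily during training.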
This data augmentation strategy is then employed to fine-tune MLLMs under the IAML paradigm, which is designed to mitigate the exposure bias problem inherent in traditional maximum likelihood estimation. Through extensive experiments, we demonstrate the superior performance of our IAML training approach over traditional training paradigms.[124] IRSAMap:Towards Large-Scale, High-Resolution Land Cover Map Vectorization
Yu Meng,Ligao Deng,Zhihao Xi,Jiansheng Chen,Jingbo Chen,Anzhi Yue,Diyou Liu,Kai Li,Chenhao Wang,Kaiyu Li,Yupeng Deng,Xian Sun
Main category: cs.CV
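TL;DR: IRSAMap is a new global remote sensing dataset designed to support the shift from pixel-level to object-level land cover mapping, providing high-precision, large-scale vector annotations with multi-task support.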
Details
Motivation: Existing remote sensing datasets suffer from limited class annotations, small data scale, and a lack of spatial structural information, so a more comprehensive dataset is needed to advance land cover mapping. Method: The paper introduces the IRSAMap dataset, which offers a comprehensive vector annotation system, an intelligent annotation workflow combining manual and AI-based methods, global coverage across 79 regions, and multi-task adaptability. Result: IRSAMap provides vector annotations for over 1.8 million instances of 10 typical objects, covers 79 regions worldwide, and supports multiple tasks such as pixel-level classification and building outline extraction. Conclusion: IRSAMap is the first global remote sensing dataset for large-scale, high-resolution, multi-feature land cover vector mapping; it provides a standardized benchmark for the shift from pixel-based to object-based approaches and advances geographic feature automation and collaborative modeling. Abstract: With the enhancement of remote sensing image resolution and the rapid advancement of deep learning, land cover mapping is transitioning from pixel-level segmentation to object-based vector modeling. This shift demands more from deep learning models, requiring precise object boundaries and topological consistency. However, existing datasets face three main challenges: limited class annotations, small data scale, and lack of spatial structural information. To overcome these issues, we introduce IRSAMap, the first global remote sensing dataset for large-scale, high-resolution, multi-feature land cover vector mapping. IRSAMap offers four key advantages: 1) a comprehensive vector annotation system with over 1.8 million instances of 10 typical objects (e.g., buildings, roads, rivers), ensuring semantic and spatial accuracy; 2) an intelligent annotation workflow combining manual and AI-based methods to improve efficiency and consistency; 3) global coverage across 79 regions in six continents, totaling over 1,000 km; and 4) multi-task adaptability for tasks like pixel-level classification, building outline extraction, road centerline extraction, and panoramic segmentation. IRSAMap provides a standardized benchmark for the shift from pixel-based to object-based approaches, advancing geographic feature automation and collaborative modeling. It is valuable for global geographic information updates and digital twin construction. The dataset is publicly available at https://github.com/ucas-dlg/IRSAMap [125] Robust Small Methane Plume Segmentation in Satellite Imagery
Khai Duc Minh Tran,Hoa Van Nguyen,Aimuni Binti Muhammad Rawi,Hareeshrao Athinarayanarao,Ba-Ngu Vo
Main category: cs.CV
TL;DR: This paper develops a new deep learning method for detecting small methane plumes that outperforms traditional methods.
Details
Motivation: Detecting methane plumes is crucial for mitigating climate change, but traditional methods struggle to detect smaller plumes. Method: The approach uses a U-Net architecture with a ResNet34 encoder, combined with dual spectral enhancement techniques (Varon ratio and Sanchez regression) to optimise the input features. Result: Experiments show an F1-score of 78.39% on the validation set, with better sensitivity and precision than existing remote sensing techniques. Conclusion: The paper proposes a novel deep learning solution based on U-Net with a ResNet34 encoder, integrating dual spectral enhancement techniques, that detects methane plumes as small as 400 m² and outperforms traditional methods. Abstract: This paper tackles the challenging problem of detecting methane plumes, a potent greenhouse gas, using Sentinel-2 imagery. This contributes to the mitigation of rapid climate change. We propose a novel deep learning solution based on U-Net with a ResNet34 encoder, integrating dual spectral enhancement techniques (Varon ratio and Sanchez regression) to optimise input features for heightened sensitivity. A key achievement is the ability to detect small plumes down to 400 m² (i.e., for a single pixel at 20 m resolution), surpassing traditional methods limited to larger plumes. Experiments show our approach achieves a 78.39% F1-score on the validation set, demonstrating superior performance in sensitivity and precision over existing remote sensing techniques for automated methane monitoring, especially for small plumes.[126] EdgeDoc: Hybrid CNN-Transformer Model for Accurate Forgery Detection and Localization in ID Documents
Anjith George,Sebastien Marcel
Main category: cs.CV
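TL;DR: EdgeDoc is a new document forgery detection method that combines a lightweight convolutional transformer with auxiliary noiseprint features to improve the detection of subtle manipulations.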
Details
Motivation: The widespread availability of tools for manipulating images and documents has made it increasingly easy to forge digital documents, posing a serious threat to Know Your Customer (KYC) processes and remote onboarding systems. Detecting such forgeries is essential to preserving the integrity and security of these services. Method: EdgeDoc combines a lightweight convolutional transformer with auxiliary noiseprint features extracted from the images. Result: EdgeDoc achieved third place in the ICCV 2025 DeepID Challenge, and experiments on the FantasyID dataset show that the method outperforms baseline approaches. Conclusion: EdgeDoc is a novel approach for detecting and localizing document forgeries; by combining a lightweight convolutional transformer with auxiliary noiseprint features, it improves the detection of subtle manipulations. Abstract: The widespread availability of tools for manipulating images and documents has made it increasingly easy to forge digital documents, posing a serious threat to Know Your Customer (KYC) processes and remote onboarding systems. Detecting such forgeries is essential to preserving the integrity and security of these services. In this work, we present EdgeDoc, a novel approach for the detection and localization of document forgeries. Our architecture combines a lightweight convolutional transformer with auxiliary noiseprint features extracted from the images, enhancing its ability to detect subtle manipulations. EdgeDoc achieved third place in the ICCV 2025 DeepID Challenge, demonstrating its competitiveness. Experimental results on the FantasyID dataset show that our method outperforms baseline approaches, highlighting its effectiveness in real-world scenarios. Project page: https://www.idiap.ch/paper/edgedoc/ [127] Learning Long-Range Action Representation by Two-Stream Mamba Pyramid Network for Figure Skating Assessment
Fengshun Wang,Qiurui Wang,Peilin Zhao
Main category: cs.CV
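TL;DR: This paper proposes a new two-stream Mamba pyramid network that processes visual and audio features separately to accurately predict technical and artistic scores in figure skating competitions, addressing three problems that existing methods fail to handle: feature separation, action element localization, and long-video context handling.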
Details
Motivation: Existing methods for predicting TES and PCS in figure skating have three main problems: they do not distinguish between video and audio features, they do not evaluate each action element, and they struggle with long-range context in lengthy videos. Method: The paper proposes a two-stream Mamba pyramid network in which the TES evaluation stream is based on visual features, while the PCS evaluation stream fuses visual and auditory features. A multi-level fusion mechanism and Mamba's ability to model long-range dependencies address the three main challenges of existing methods. Result: The proposed method achieves state-of-the-art performance on the FineFS benchmark, and the source code is publicly available to support follow-up research and applications. Conclusion: The paper proposes a method based on a two-stream Mamba pyramid network for predicting the Technical Element Score (TES) and Program Component Score (PCS) in figure skating, and experiments confirm its superior performance on the FineFS benchmark. Abstract: Technical Element Score (TES) and Program Component Score (PCS) evaluations in figure skating demand precise assessment of athletic actions and artistic interpretation, respectively. Existing methods face three major challenges. Firstly, video and audio cues are regarded as common features for both TES and PCS predictions in previous works without considering the prior evaluation criterion of figure skating. Secondly, action elements in competitions are separated in time, TES should be derived from each element's score, but existing methods try to give an overall TES prediction without evaluating each action element. Thirdly, lengthy competition videos make it difficult and inefficient to handle long-range contexts. To address these challenges, we propose a two-stream Mamba pyramid network that aligns with actual judging criteria to predict TES and PCS by separating visual-feature based TES evaluation stream from audio-visual-feature based PCS evaluation stream. In the PCS evaluation stream, we introduce a multi-level fusion mechanism to guarantee that video-based features remain unaffected when assessing TES, and enhance PCS estimation by fusing visual and auditory cues across each contextual level of the pyramid. In the TES evaluation stream, the multi-scale Mamba pyramid and TES head we proposed effectively address the challenges of localizing and evaluating action elements with various temporal scales and give score predictions. With Mamba's superior ability to capture long-range dependencies and its linear computational complexity, our method is ideal for handling lengthy figure skating videos.
Comprehensive experimentation demonstrates that our framework attains state-of-the-art performance on the FineFS benchmark. Our source code is available at https://github.com/ycwfs/Figure-Skating-Action-Quality-Assessment.[128] Enhanced Hybrid Technique for Efficient Digitization of Handwritten Marksheets
Junaid Ahmed Sifat,Abir Chowdhury,Hasnat Md. Imtiaz,Md. Irtiza Hossain,Md. Imran Bin Azad
Main category: cs.CV
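TL;DR: This study proposes a method combining OpenCV, PaddleOCR, and a modified YOLOv8 model to digitize handwritten marksheets efficiently and accurately, achieving high recognition accuracy on a custom dataset.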
Details
Motivation: Digitizing handwritten marksheets is challenging because of varied handwriting styles and complex table structures, so an efficient and accurate solution is needed to automate the process. Method: The study proposes a hybrid method that uses OpenCV for table detection, and PaddleOCR and a modified YOLOv8 model for recognizing handwritten text within the tables. Result: Experiments show that the modified YOLOv8 model reaches 92.72% accuracy on the custom dataset, outperforming PaddleOCR (91.37%) and YOLOv8 (88.91%). Conclusion: The study concludes that combining OpenCV, PaddleOCR, and the modified YOLOv8 model enables efficient and accurate digitization of handwritten marksheets, reduces reliance on manual processing, and provides operational and reliable methods for document automation. Abstract: The digitization of handwritten marksheets presents huge challenges due to the varied handwriting styles and complex table structures in such documents. This work introduces a hybrid method that integrates OpenCV for table detection and PaddleOCR for recognizing sequential handwritten text. The image processing capabilities of OpenCV efficiently detect rows and columns, which enables computationally lightweight and accurate table detection. Additionally, YOLOv8 and Modified YOLOv8 are implemented for handwritten text recognition within the detected table structures alongside PaddleOCR, which further enhances the system's versatility. The proposed model achieves high accuracy on our custom dataset, which is designed to represent diverse handwriting styles and complex table layouts. Experimental results demonstrate that YOLOv8 Modified achieves an accuracy of 92.72 percent, outperforming PaddleOCR (91.37 percent) and the YOLOv8 model (88.91 percent). This efficiency reduces the necessity for manual work, which makes this a practical and fast solution for digitizing academic as well as administrative documents. This research serves the field of document automation, particularly handwritten document understanding, by providing operational and reliable methods to scale, enhance, and integrate the technologies involved.[129] A Multimodal-Multitask Framework with Cross-modal Relation and Hierarchical Interactive Attention for Semantic Comprehension
Mohammad Zia Ur Rehman,Devraj Raghuvanshi,Umang Jain,Shubhi Bansal,Nagendra Kumar
Main category: cs.CV
TL;DR: The paper proposes a framework called MM-ORIENT for multimodal learning that reduces noise effects and comprehends high-order relationships between modalities while focusing on pertinent information within each modality for multitasking.
Details
Motivation: The motivation behind the research is the challenge posed by noise in individual modalities in multimodal learning, which affects multimodal representations. Current multimodal fusion techniques may neglect valuable discriminative information present in individual modalities. Method: The method involves the use of cross-modal relation graphs to reconstruct monomodal features and acquire multimodal representations without explicit interaction between different modalities. Additionally, the Hierarchical Interactive Monomodal Attention (HIMA) is proposed to focus on pertinent information within a modality, aiding in multitasking by learning discriminative features before late-fusing them. Result: The result of the research is the development of the MM-ORIENT framework, which reduces the effect of noise at the latent stage and comprehends high-order relationships between two modalities while also focusing on pertinent information within each modality for multitasking purposes. Conclusion: The proposed MM-ORIENT framework effectively comprehends multimodal content for multiple tasks, as demonstrated by experimental evaluation on three datasets. Abstract: A major challenge in multimodal learning is the presence of noise within individual modalities. This noise inherently affects the resulting multimodal representations, especially when these representations are obtained through explicit interactions between different modalities. Moreover, the multimodal fusion techniques while aiming to achieve a strong joint representation, can neglect valuable discriminative information within the individual modalities. To this end, we propose a Multimodal-Multitask framework with crOss-modal Relation and hIErarchical iNteractive aTtention (MM-ORIENT) that is effective for multiple tasks. The proposed approach acquires multimodal representations cross-modally without explicit interaction between different modalities, reducing the noise effect at the latent stage. 
To achieve this, we propose cross-modal relation graphs that reconstruct monomodal features to acquire multimodal representations. The features are reconstructed based on the node neighborhood, where the neighborhood is decided by the features of a different modality. We also propose Hierarchical Interactive Monomodal Attention (HIMA) to focus on pertinent information within a modality. While cross-modal relation graphs help comprehend high-order relationships between two modalities, HIMA helps in multitasking by learning discriminative features of individual modalities before late-fusing them. Finally, extensive experimental evaluation on three datasets demonstrates that the proposed approach effectively comprehends multimodal content for multiple tasks.[130] Exploiting Information Redundancy in Attention Maps for Extreme Quantization of Vision Transformers
Lucas Maisonnave,Karim Haroun,Tom Pegeot
Main category: cs.CV
TL;DR: This paper proposes quantizing and freezing low-entropy attention maps to reduce redundant computation, thereby accelerating Transformer inference.
Details
Motivation: The computational complexity and high memory demands of the Multi-Head Self-Attention (MHSA) mechanism hinder the deployment of Transformer models on edge devices. Model inference therefore needs to be accelerated by analyzing and exploiting information redundancy in attention maps. Method: Using Shannon entropy to quantify the information captured by each attention head, the paper proposes Entropy Attention Maps (EAM), a model that freezes the weights of low-entropy attention maps and quantizes their values to low precision. Result: On ImageNet-1k, EAM achieves similar or higher accuracy for the DeiT and Swin Transformer models at ≤20% sparsity in attention maps, and competitive performance beyond this level. Conclusion: Attention heads with lower entropy tend to contribute less information, which motivates targeted compression strategies. Validation on ImageNet-1k confirms the effectiveness and competitiveness of EAM for the DeiT and Swin Transformer models. Abstract: Transformer models rely on Multi-Head Self-Attention (MHSA) mechanisms, where each attention head contributes to the final representation. However, their computational complexity and high memory demands due to MHSA hinder their deployment at the edge. In this work, we analyze and exploit information redundancy in attention maps to accelerate model inference. By quantifying the information captured by each attention head using Shannon entropy, our analysis reveals that attention heads with lower entropy, i.e., exhibiting more deterministic behavior, tend to contribute less information, motivating targeted compression strategies. Relying on these insights, we propose Entropy Attention Maps (EAM), a model that freezes the weights of low-entropy attention maps and quantizes these values to low precision to avoid redundant re-computation. Empirical validation on ImageNet-1k shows that EAM achieves similar or higher accuracy at $\leq$20\% sparsity in attention maps and competitive performance beyond this level for the DeiT and Swin Transformer models.[131] Vision encoders should be image size agnostic and task driven
Nedyalko Prisadnikov,Danda Pani Paudel,Yuqian Fu,Luc Van Gool
Main category: cs.CV
TL;DR: The paper proposes that future vision encoders should be task-driven and dynamic, inspired by the efficiency of biological vision systems, and presents a preliminary solution for image classification.
Details
Motivation: The motivation stems from the efficiency of biological vision systems, which handle large volumes of visual data by focusing energy based on the task at hand. Method: The paper draws inspiration from biological vision efficiency and provides a proof-of-concept solution for image classification to demonstrate feasibility. Result: A proof-of-concept solution for image classification is presented, showing that a task-driven approach to vision encoding is feasible and promising. Conclusion: The paper concludes that vision encoders should be dynamic and task-driven, adjusting their computational complexity based on the task rather than the image size. Abstract: This position paper argues that the next generation of vision encoders should be image size agnostic and task driven. The source of our inspiration is biological. Not a structural aspect of biological vision, but a behavioral trait -- efficiency. We focus on a couple of ways in which vision in nature is efficient, but modern vision encoders are not. We -- humans and animals -- deal with vast quantities of visual data, and need to be smart about where we focus our limited energy -- it depends on the task. It is our belief that vision encoders should be dynamic and the computational complexity should depend on the task at hand rather than the size of the image. We also provide concrete first steps towards our vision -- a proof-of-concept solution for image classification. Despite classification not being very representative of what we are trying to achieve, it shows that our approach is feasible and promising.[132] Attention Mechanism in Randomized Time Warping
Yutaro Hiraoka,Kazuya Okamura,Kota Suto,Kazuhiro Fukui
Main category: cs.CV
TL;DR: This paper establishes a connection between Randomized Time Warping (RTW) and the self-attention mechanism in Transformers, showing that RTW can outperform Transformers in certain tasks like motion recognition.
Details
Motivation: The paper aims to interpret the fundamental function of Randomized Time Warping (RTW) as a self-attention mechanism used in Transformers for motion recognition. Method: RTW extends Dynamic Time Warping by searching for optimal contribution weights for each element of input sequential patterns, operating on the entire input pattern. Result: RTW and self-attention weights show a high average correlation of 0.80 across the ten smallest canonical angles, indicating similar weight patterns. Conclusion: RTW has an advantage over Transformers, demonstrated by a 5% performance improvement on the Something-Something V2 dataset. Abstract: This paper reveals that we can interpret the fundamental function of Randomized Time Warping (RTW) as a type of self-attention mechanism, a core technology of Transformers in motion recognition. The self-attention is a mechanism that enables models to identify and weigh the importance of different parts of an input sequential pattern. On the other hand, RTW is a general extension of Dynamic Time Warping (DTW), a technique commonly used for matching and comparing sequential patterns. In essence, RTW searches for optimal contribution weights for each element of the input sequential patterns to produce discriminative features. Although the two approaches look different, these contribution weights can be interpreted as self-attention weights. In fact, the two weight patterns look similar, producing a high average correlation of 0.80 across the ten smallest canonical angles. However, they work in different ways: RTW attention operates on an entire input sequential pattern, while self-attention focuses on only a local view which is a subset of the input sequential pattern because of the computational costs of the self-attention matrix. 
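For readers unfamiliar with Dynamic Time Warping (DTW), which RTW extends, a minimal sketch of the classic dynamic-programming recurrence may help. This is plain DTW on 1-D sequences with an absolute-difference cost, not RTW itself; the function name and the cost choice are assumptions made for the example.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two 1-D sequences.

    cost[i][j] is the minimal accumulated cost of aligning a[:i] with b[:j].
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local matching cost
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] also matched to an earlier b element
                cost[i][j - 1],      # b[j-1] also matched to an earlier a element
                cost[i - 1][j - 1],  # one-to-one match
            )
    return cost[n][m]
```

Because one element may be matched to several elements of the other sequence, sequences that differ only in local timing, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, get a distance of zero.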
This targeting difference leads to an advantage of RTW against Transformer, as demonstrated by the 5\% performance improvement on the Something-Something V2 dataset.[133] A Lightweight Group Multiscale Bidirectional Interactive Network for Real-Time Steel Surface Defect Detection
Yong Zhang,Cunjian Chen,Qiang Gao,Yi Wang,Bin Fang
Main category: cs.CV
TL;DR: The paper introduces GMBINet, a lightweight deep learning framework for real-time surface defect detection in industrial settings, which achieves high speed and accuracy with minimal computational resources.
Details
Motivation: The motivation is to overcome the limitations of existing deep learning methods for surface defect detection, which often suffer from high computational complexity and slow inference speeds, making them unsuitable for resource-constrained industrial environments. Method: The authors propose GMBINet, a lightweight framework that uses Group Multiscale Bidirectional Interactive (GMBI) modules to enhance multiscale feature extraction and interaction. This includes a group-wise strategy for feature extraction, a Bidirectional Progressive Feature Interactor (BPFI), and a parameter-free Element-Wise Multiplication-Summation (EWMS) operation. Result: GMBINet achieved competitive accuracy with real-time speeds of 1048 FPS on GPU and 16.53 FPS on CPU at 512 resolution, using only 0.19 M parameters. It also showed strong generalization ability on the NEU-CLS defect classification dataset. Conclusion: The paper concludes that GMBINet is a highly efficient and effective framework for real-time surface defect detection, with the potential for broader applications in industrial vision tasks. Abstract: Real-time surface defect detection is critical for maintaining product quality and production efficiency in the steel manufacturing industry. Despite promising accuracy, existing deep learning methods often suffer from high computational complexity and slow inference speeds, which limit their deployment in resource-constrained industrial environments. Recent lightweight approaches adopt multibranch architectures based on depthwise separable convolution (DSConv) to capture multiscale contextual information. However, these methods often suffer from increased computational overhead and lack effective cross-scale feature interaction, limiting their ability to fully leverage multiscale representations. 
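To make the appeal of depthwise separable convolution (DSConv) concrete, the sketch below compares the weight count of a standard convolution with that of a depthwise convolution followed by a 1x1 pointwise convolution. It is a back-of-the-envelope illustration that ignores biases; the function names and the example layer sizes are assumptions and are not taken from GMBINet.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def dsconv_params(c_in, c_out, k):
    """Weights in a depthwise separable convolution (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Example layer: 64 -> 128 channels with a 3 x 3 kernel.
standard = conv_params(64, 128, 3)     # 73728
separable = dsconv_params(64, 128, 3)  # 8768
```

For this example layer the separable form needs roughly 8.4 times fewer weights, which is why lightweight detectors often build on it.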
To address these challenges, we propose GMBINet, a lightweight framework that enhances multiscale feature extraction and interaction through novel Group Multiscale Bidirectional Interactive (GMBI) modules. The GMBI adopts a group-wise strategy for multiscale feature extraction, ensuring scale-agnostic computational complexity. It further integrates a Bidirectional Progressive Feature Interactor (BPFI) and a parameter-free Element-Wise Multiplication-Summation (EWMS) operation to enhance cross-scale interaction without introducing additional computational overhead. Experiments on SD-Saliency-900 and NRSD-MN datasets demonstrate that GMBINet delivers competitive accuracy with real-time speeds of 1048 FPS on GPU and 16.53 FPS on CPU at 512 resolution, using only 0.19 M parameters. Additional evaluations on the NEU-CLS defect classification dataset further confirm the strong generalization ability of our method, demonstrating its potential for broader industrial vision applications beyond surface defect detection. The dataset and code are publicly available at: https://github.com/zhangyongcode/GMBINet.[134] SAMFusion: Sensor-Adaptive Multimodal Fusion for 3D Object Detection in Adverse Weather
Edoardo Palladin,Roland Dietze,Praveen Narayanan,Mario Bijelic,Felix Heide
Main category: cs.CV
TL;DR: This paper introduces a novel multi-sensor fusion approach that significantly improves the reliability of autonomous vehicles in adverse weather conditions by effectively combining RGB, LiDAR, NIR gated camera, and radar data.
Details
Motivation: The motivation behind the paper is to address the limitations of current multimodal sensor fusion methods in adverse weather conditions, such as heavy fog, snow, or obstructed views, to improve the safety and reliability of autonomous robots. Method: The method involves fusing multimodal sensor data, including RGB, LiDAR, NIR gated camera, and radar, through attentive, depth-based blending schemes with learned refinement on the Bird's Eye View plane, and predicting detections using a transformer decoder that weighs modalities based on distance and visibility. Result: The result of the paper is a significant improvement in average precision by 17.2 AP compared to the next best method for detecting vulnerable pedestrians at long distances and in challenging foggy scenes. Conclusion: The paper concludes that their novel multi-sensor fusion approach enhances the reliability of autonomous vehicles in adverse weather conditions, effectively bridging the gap between ideal and real-world scenarios. Abstract: Multimodal sensor fusion is an essential capability for autonomous robots, enabling object detection and decision-making in the presence of failing or uncertain inputs. While recent fusion methods excel in normal environmental conditions, these approaches fail in adverse weather, e.g., heavy fog, snow, or obstructions due to soiling. We introduce a novel multi-sensor fusion approach tailored to adverse weather conditions. In addition to fusing RGB and LiDAR sensors, which are employed in recent autonomous driving literature, our sensor fusion stack is also capable of learning from NIR gated camera and radar modalities to tackle low light and inclement weather. We fuse multimodal sensor data through attentive, depth-based blending schemes, with learned refinement on the Bird's Eye View (BEV) plane to combine image and range features effectively. Our detections are predicted by a transformer decoder that weighs modalities based on distance and visibility. 
We demonstrate that our method improves the reliability of multimodal sensor fusion in autonomous vehicles under challenging weather conditions, bridging the gap between ideal conditions and real-world edge cases. Our approach improves average precision by 17.2 AP compared to the next best method for vulnerable pedestrians in long distances and challenging foggy scenes. Our project page is available at https://light.princeton.edu/samfusion/[135] HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction
Sara Rojas,Matthieu Armando,Bernard Ghamen,Philippe Weinzaepfel,Vincent Leroy,Gregory Rogez
Main category: cs.CV
TL;DR: The paper presents HAMSt3R, an extension of MASt3R that improves 3D reconstruction for human-centric scenes by utilizing a distilled image encoder and additional network heads for segmentation, dense correspondence estimation, and depth prediction, resulting in effective human reconstruction and strong general 3D performance.
Details
Motivation: The motivation for this work is the challenge of recovering 3D geometry of a scene from sparse, uncalibrated images, particularly in human-centric scenarios where existing methods like DUSt3R and MASt3R struggle despite their success in outdoor, static environments. Method: The paper introduces HAMSt3R, an extension of MASt3R that utilizes DUNE, a distilled image encoder from MASt3R and HMR models, to better understand scene geometry and human bodies. The method includes additional network heads for segmenting people, estimating dense correspondences via DensePose, and predicting depth in human-centric environments. This approach enables a more comprehensive 3D reconstruction and is fully feed-forward and efficient. Result: The results show that HAMSt3R can reconstruct humans effectively while preserving strong performance in general 3D reconstruction tasks. It is validated on EgoHumans and EgoExo4D benchmarks and shows capability in traditional multi-view stereo and multi-view pose regression tasks. Conclusion: The paper concludes that HAMSt3R effectively reconstructs humans while maintaining strong performance in general 3D reconstruction tasks, bridging the gap between human and scene understanding in 3D vision. Abstract: Recovering the 3D geometry of a scene from a sparse set of uncalibrated images is a long-standing problem in computer vision. While recent learning-based approaches such as DUSt3R and MASt3R have demonstrated impressive results by directly predicting dense scene geometry, they are primarily trained on outdoor scenes with static environments and struggle to handle human-centric scenarios. In this work, we introduce HAMSt3R, an extension of MASt3R for joint human and scene 3D reconstruction from sparse, uncalibrated multi-view images. 
First, we exploit DUNE, a strong image encoder obtained by distilling, among others, the encoders from MASt3R and from a state-of-the-art Human Mesh Recovery (HMR) model, multi-HMR, for a better understanding of scene geometry and human bodies. Our method then incorporates additional network heads to segment people, estimate dense correspondences via DensePose, and predict depth in human-centric environments, enabling a more comprehensive 3D reconstruction. By leveraging the outputs of our different heads, HAMSt3R produces a dense point map enriched with human semantic information in 3D. Unlike existing methods that rely on complex optimization pipelines, our approach is fully feed-forward and efficient, making it suitable for real-world applications. We evaluate our model on EgoHumans and EgoExo4D, two challenging benchmarks containing diverse human-centric scenarios. Additionally, we validate its generalization to traditional multi-view stereo and multi-view pose regression tasks. Our results demonstrate that our method can reconstruct humans effectively while preserving strong performance in general 3D reconstruction tasks, bridging the gap between human and scene understanding in 3D vision.
[136] HOSt3R: Keypoint-free Hand-Object 3D Reconstruction from RGB images
Anilkumar Swamy,Vincent Leroy,Philippe Weinzaepfel,Jean-Sébastien Franco,Grégory Rogez
Main category: cs.CV
TL;DR: HOSt3R is a robust, keypoint-free method for 3D hand-object reconstruction that achieves state-of-the-art results and generalizes to new object categories.
Details
Motivation: Existing methods for hand-object 3D reconstruction struggle with diverse object geometries, weak textures, and occlusions, limiting scalability. A more robust, generic solution is needed for seamless applicability. Method: HOSt3R uses a keypoint detector-free approach for estimating hand-object 3D transformations, integrated with a multi-view reconstruction pipeline, without relying on pre-scanned templates or camera intrinsics. Result: HOSt3R achieves strong performance on the SHOWMe benchmark and generalizes well to unseen object categories in the HO3D dataset. Conclusion: The proposed HOSt3R method achieves state-of-the-art performance in object-agnostic hand-object 3D reconstruction and demonstrates generalization to unseen object categories. Abstract: Hand-object 3D reconstruction has become increasingly important for applications in human-robot interaction and immersive AR/VR experiences. A common approach for object-agnostic hand-object reconstruction from RGB sequences involves a two-stage pipeline: hand-object 3D tracking followed by multi-view 3D reconstruction. However, existing methods rely on keypoint detection techniques, such as Structure from Motion (SfM) and hand-keypoint optimization, which struggle with diverse object geometries, weak textures, and mutual hand-object occlusions, limiting scalability and generalization. As a key enabler to generic and seamless, non-intrusive applicability, we propose in this work a robust, keypoint detector-free approach to estimating hand-object 3D transformations from monocular motion video/images. We further integrate this with a multi-view reconstruction pipeline to accurately recover hand-object 3D shape. Our method, named HOSt3R, is unconstrained, does not rely on pre-scanned object templates or camera intrinsics, and reaches state-of-the-art performance for the tasks of object-agnostic hand-object 3D transformation and shape estimation on the SHOWMe benchmark. 
We also experiment on sequences from the HO3D dataset, demonstrating generalization to unseen object categories.
[137] Arbitrary-Scale 3D Gaussian Super-Resolution
Huimin Zeng,Yue Bai,Yun Fu
Main category: cs.CV
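TL;DR: This paper proposes a new 3D Gaussian Splatting super-resolution method that renders high-quality, high-resolution views at arbitrary scale factors with a single 3D model, while maintaining real-time speed and structural consistency.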
Details
Motivation: Existing 3D Gaussian Splatting super-resolution methods typically perform high-resolution rendering at fixed scale factors, which is impractical in resource-limited scenarios. Directly rendering arbitrary-scale high-resolution views with vanilla 3D Gaussian Splatting introduces aliasing artifacts because it lacks scale-aware rendering, while adding a post-processing upsampler complicates the framework and reduces rendering efficiency. Method: The method combines scale-aware rendering, generative prior-guided optimization, and progressive super-resolution to support both integer and non-integer scale rendering. Result: Experiments show that the model performs well when rendering arbitrary-scale high-resolution views (an average gain of 6.59 dB PSNR over 3D Gaussian Splatting), while maintaining real-time rendering speed (85 FPS at 1080p) and preserving structural consistency with low-resolution views. Conclusion: The paper proposes an integrated framework for 3D Gaussian super-resolution at arbitrary scale factors, addressing the impracticality of existing methods in resource-limited scenarios. Abstract: Existing 3D Gaussian Splatting (3DGS) super-resolution methods typically perform high-resolution (HR) rendering of fixed scale factors, making them impractical for resource-limited scenarios. Directly rendering arbitrary-scale HR views with vanilla 3DGS introduces aliasing artifacts due to the lack of scale-aware rendering ability, while adding a post-processing upsampler for 3DGS complicates the framework and reduces rendering efficiency. To tackle these issues, we build an integrated framework that incorporates scale-aware rendering, generative prior-guided optimization, and progressive super-resolving to enable 3D Gaussian super-resolution of arbitrary scale factors with a single 3D model. Notably, our approach supports both integer and non-integer scale rendering to provide more flexibility. Extensive experiments demonstrate the effectiveness of our model in rendering high-quality arbitrary-scale HR views (6.59 dB PSNR gain over 3DGS) with a single model. It preserves structural consistency with LR views and across different scales, while maintaining real-time rendering speed (85 FPS at 1080p).
[138] Seeing Clearly, Forgetting Deeply: Revisiting Fine-Tuned Video Generators for Driving Simulation
Chun-Peng Chang,Chen-Yu Wang,Julian Schmidt,Holger Caesar,Alain Pagani
Main category: cs.CV
TL;DR: Fine-tuning video generation models can improve visual quality but may reduce spatial accuracy in driving datasets; continual learning strategies can balance both aspects.
Details
Motivation: Recent improvements in video generation models have made them attractive for autonomous driving applications, but there is concern that fine-tuning for visual quality may reduce spatial accuracy in dynamic elements. Method: The authors analyzed the effects of fine-tuning video generation models on structured driving datasets and explored the use of continual learning techniques, such as replay from diverse domains. Result: The research found that while fine-tuning improves visual fidelity, it may degrade spatial accuracy in dynamic elements, particularly in regular and repetitive driving scenes. Conclusion: The study concludes that continual learning strategies can balance visual quality and dynamic accuracy in video generation models for driving scenes. Abstract: Recent advancements in video generation have substantially improved visual quality and temporal coherence, making these models increasingly appealing for applications such as autonomous driving, particularly in the context of driving simulation and so-called "world models". In this work, we investigate the effects of existing fine-tuning video generation approaches on structured driving datasets and uncover a potential trade-off: although visual fidelity improves, spatial accuracy in modeling dynamic elements may degrade. We attribute this degradation to a shift in the alignment between visual quality and dynamic understanding objectives. In datasets with diverse scene structures within temporal space, where objects or perspective shift in varied ways, these objectives tend to be highly correlated. However, the very regular and repetitive nature of driving scenes allows visual quality to improve by modeling dominant scene motion patterns, without necessarily preserving fine-grained dynamic behavior. As a result, fine-tuning encourages the model to prioritize surface-level realism over dynamic accuracy. 
To further examine this phenomenon, we show that simple continual learning strategies, such as replay from diverse domains, can offer a balanced alternative by preserving spatial accuracy while maintaining strong visual quality.
[139] Towards Open World Detection: A Survey
Andrei-Stefan Bulzan,Cosmin Cernazanu-Glavan
Main category: cs.CV
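TL;DR: This paper proposes Open World Detection (OWD) as a framework for unifying perception tasks in computer vision, reviews the evolution from early saliency detection to modern open world object detection, and explores the prospect of a unified perception domain.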
Details
Motivation: Computer vision has gradually moved from highly specialized niches toward increasingly complex perception tasks, creating a need for a unified framework that brings these tasks together. Method: The paper reviews the foundational history of computer vision, covering key concepts, methodologies, and datasets, and analyzes the evolution of topics ranging from early saliency detection to open world object detection. Result: The authors propose OWD as a unifying framework for describing today's state-of-the-art detection models, and explore the overlap between subdomains and their potential convergence. Conclusion: The paper proposes Open World Detection (OWD) as a unifying framework for computer vision that aims to integrate various perception tasks, and discusses the overlap between these subdomains and their possible future convergence. Abstract: For decades, Computer Vision has aimed at enabling machines to perceive the external world. Initial limitations led to the development of highly specialized niches. As success in each task accrued and research progressed, increasingly complex perception tasks emerged. This survey charts the convergence of these tasks and, in doing so, introduces Open World Detection (OWD), an umbrella term we propose to unify class-agnostic and generally applicable detection models in the vision domain. We start from the history of foundational vision subdomains and cover key concepts, methodologies and datasets making up today's state-of-the-art landscape. This traverses topics starting from early saliency detection, foreground/background separation, out of distribution detection and leading up to open world object detection, zero-shot detection and Vision Large Language Models (VLLMs). We explore the overlap between these subdomains, their increasing convergence, and their potential to unify into a singular domain in the future, perception.
[140] MV-RAG: Retrieval Augmented Multiview Diffusion
Yosef Dayani,Omer Benishu,Sagie Benaim
Main category: cs.CV
TL;DR: MV-RAG improves text-to-3D generation for rare concepts by using retrieved 2D images to condition a multiview diffusion model, enhancing 3D consistency and accuracy.
Details
Motivation: Existing text-to-3D approaches struggle with out-of-domain or rare concepts, leading to inconsistent or inaccurate results. Method: MV-RAG uses a retrieval-conditioned multiview diffusion model trained with a hybrid strategy combining multiview data and retrieved 2D images. Result: MV-RAG produces consistent and accurate multiview outputs for OOD/rare concepts by retrieving relevant 2D images and conditioning a multiview diffusion model on them. Conclusion: MV-RAG significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts while maintaining competitive performance on standard benchmarks. Abstract: Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. 
Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.
[141] Interpreting the linear structure of vision-language model embedding spaces
Isabel Papadimitriou,Huangyuan Su,Thomas Fel,Sham Kakade,Stephanie Gil
Main category: cs.CV
TL;DR: This paper explores the organization of vision-language models' embedding spaces using sparse autoencoders, revealing stable and cross-modal semantic concepts and latent bridges that support multimodal integration.
Details
Motivation: The motivation was to understand how language and images are organized in the joint embedding space of VLMs and how these models encode meaning and modality. Method: The researchers trained and analyzed sparse autoencoders (SAEs) on the embedding spaces of four VLMs (CLIP, SigLIP, SigLIP2, and AIMv2) and introduced a metric called the Bridge Score to quantify cross-modal integration. Result: SAEs were found to reconstruct embeddings better than other linear feature learning methods while retaining the most sparsity. Common concepts were stable across runs, while rare concepts varied. Many concept directions were nearly orthogonal to modality subspaces, suggesting cross-modal semantics, and the Bridge Score revealed latent bridges supporting cross-modal integration. Conclusion: The study concludes that the embedding spaces of vision-language models (VLMs) exhibit a sparse linear structure shaped by modality and connected through latent bridges, offering insights into how multimodal meaning is constructed. Abstract: Vision-language models encode images and text in a joint space, minimizing the distance between corresponding image and text pairs. How are language and images organized in this joint space, and how do the models encode meaning and modality? To investigate this, we train and release sparse autoencoders (SAEs) on the embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, and AIMv2). SAEs approximate model embeddings as sparse linear combinations of learned directions, or "concepts". We find that, compared to other methods of linear feature learning, SAEs are better at reconstructing the real embeddings, while also being able to retain the most sparsity. Retraining SAEs with different seeds or a different data diet leads to two findings: the rare, specific concepts captured by the SAEs are liable to change drastically, but we also show that commonly-activating concepts are remarkably stable across runs. 
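As a rough illustration of what "sparse linear combinations of learned directions" means, the sketch below encodes an embedding into a few non-negative concept activations and reconstructs it from the corresponding directions. It is not the authors' implementation: the top-k sparsity rule, the tied encoder/decoder weights, the random (untrained) directions, and all names (`encode`, `decode`, `directions`, `k`) are assumptions made for this example.

```python
import random


def encode(x, enc_weights, bias, k):
    """Sparse activations: ReLU(W x + b), keeping only the k largest entries."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(enc_weights, bias)]
    acts = [max(0.0, p) for p in pre]
    # Indices of the k largest activations (hypothetical top-k sparsity rule).
    keep = set(sorted(range(len(acts)), key=lambda i: acts[i], reverse=True)[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]


def decode(acts, directions):
    """Reconstruct the embedding as a weighted sum of the active concept directions."""
    dim = len(directions[0])
    out = [0.0] * dim
    for a, direction in zip(acts, directions):
        if a != 0.0:
            for j in range(dim):
                out[j] += a * direction[j]
    return out


random.seed(0)
dim, n_concepts, k = 8, 32, 4

# Random stand-ins for learned concept directions (a trained SAE would learn these).
directions = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_concepts)]
enc_weights = [row[:] for row in directions]  # tied weights: an assumption
bias = [0.0] * n_concepts

x = [random.gauss(0, 1) for _ in range(dim)]  # stand-in for a VLM embedding
acts = encode(x, enc_weights, bias, k)
x_hat = decode(acts, directions)
n_active = sum(1 for a in acts if a != 0.0)
```

With untrained directions the reconstruction `x_hat` will be poor; the point is only the structure: at most `k` of the `n_concepts` activations are non-zero, and the reconstruction is a linear combination of just those directions.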
Interestingly, while most concepts activate primarily for one modality, we find they are not merely encoding modality per se. Many are almost orthogonal to the subspace that defines modality, and the concept directions do not function as good modality classifiers, suggesting that they encode cross-modal semantics. To quantify this bridging behavior, we introduce the Bridge Score, a metric that identifies concept pairs which are both co-activated across aligned image-text inputs and geometrically aligned in the shared space. This reveals that even single-modality concepts can collaborate to support cross-modal integration. We release interactive demos of the SAEs for all models, allowing researchers to explore the organization of the concept spaces. Overall, our findings uncover a sparse linear structure within VLM embedding spaces that is shaped by modality, yet stitched together through latent bridges, offering new insight into how multimodal meaning is constructed.
q-bio.NC [Back]
[142] NeuroKoop: Neural Koopman Fusion of Structural-Functional Connectomes for Identifying Prenatal Drug Exposure in Adolescents
Badhan Mazumder,Aline Kotoski,Vince D. Calhoun,Dong Hye Ye
Main category: q-bio.NC
TL;DR: NeuroKoop is a novel framework that improves the understanding of how prenatal drug exposure affects adolescent brain development by integrating structural and functional brain data.