cs.CL
[1] Cache Mechanism for Agent RAG Systems
Shuhang Lin,Zhencan Peng,Lingyao Li,Xiao Lin,Xi Zhu,Yongfeng Zhang
Main category: cs.CL
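TL;DR: This paper proposes ARC (Agent RAG Cache Mechanism), an annotation-free dynamic caching framework that combines historical query patterns with the geometry of cached items in the embedding space to maintain a small, high-value corpus for each LLM agent, sharply reducing storage and retrieval latency while preserving a high has-answer rate.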
Details
Motivation: Although retrieval-augmented generation (RAG) improves the performance of LLM agents, agent-level cache management, in particular how to dynamically build, maintain, and update a compact and relevant corpus, remains underexplored. Existing methods fall short in storage efficiency and response speed, so an efficient caching mechanism that adapts to each agent's needs is required.
Method: The paper proposes the ARC caching mechanism, which requires no human annotation. It analyzes the distribution of an agent's historical queries and combines it with the intrinsic geometry of cached items in the embedding space to dynamically select and update a small, highly relevant corpus, automating management of the cache contents.
Result: Experiments on three retrieval datasets show that ARC reduces storage requirements to 0.015% of the original corpus, achieves a has-answer rate of up to 79.8%, and reduces average retrieval latency by 80%.
Conclusion: ARC can markedly improve the efficiency and effectiveness of RAG-powered LLM agents, offering a scalable and practical solution for agent-level cache management.
Abstract: Recent advances in Large Language Model (LLM)-based agents have been propelled by Retrieval-Augmented Generation (RAG), which grants the models access to vast external knowledge bases. Despite RAG's success in improving agent performance, agent-level cache management, particularly constructing, maintaining, and updating a compact, relevant corpus dynamically tailored to each agent's need, remains underexplored. Therefore, we introduce ARC (Agent RAG Cache Mechanism), a novel, annotation-free caching framework that dynamically manages small, high-value corpora for each agent. By synthesizing historical query distribution patterns with the intrinsic geometry of cached items in the embedding space, ARC automatically maintains a high-relevance cache. With comprehensive experiments on three retrieval datasets, our experimental results demonstrate that ARC reduces storage requirements to 0.015% of the original corpus while offering up to 79.8% has-answer rate and reducing average retrieval latency by 80%. Our results demonstrate that ARC can drastically enhance efficiency and effectiveness in RAG-powered LLM agents.
[2] Automatic Machine Translation Detection Using a Surrogate Multilingual Translation Model
Cristian García-Romero,Miquel Esplà-Gomis,Felipe Sánchez-Martínez
Main category: cs.CL
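TL;DR: Proposes a new method that uses the internal representations of a surrogate multilingual machine translation model to distinguish human-translated from machine-translated sentences; it clearly outperforms existing techniques on non-English language pairs, with accuracy gains of at least 5 percentage points.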
Details
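Motivation: Modern machine translation systems rely on large parallel corpora collected from the Internet, which may contain a large share of machine-generated translations. Over-reliance on such synthetic data can significantly degrade translation quality, so non-human translations need to be filtered out effectively.
Method: The approach uses the internal representations of a surrogate multilingual machine translation model, analyzing how it encodes sentences in order to distinguish human translations from machine translations.
Result: Experiments show that the method outperforms current state-of-the-art techniques at identifying non-human translations, especially on non-English language pairs, with accuracy gains of at least 5 percentage points.
Conclusion: The proposed method can effectively identify and filter machine-generated translations, helping to improve both the quality of machine translation training data and final translation performance.
Abstract: Modern machine translation (MT) systems depend on large parallel corpora, often collected from the Internet. However, recent evidence indicates that (i) a substantial portion of these texts are machine-generated translations, and (ii) an overreliance on such synthetic content in training data can significantly degrade translation quality. As a result, filtering out non-human translations is becoming an essential pre-processing step in building high-quality MT systems. In this work, we propose a novel approach that directly exploits the internal representations of a surrogate multilingual MT model to distinguish between human and machine-translated sentences. Experimental results show that our method outperforms current state-of-the-art techniques, particularly for non-English language pairs, achieving gains of at least 5 percentage points of accuracy.
[3] LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation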
Gyeom Hwangbo,Hyungjoo Chae,Minseok Kang,Hyeonjong Ju,Soohyun Oh,Jinyoung Yeo
Main category: cs.CL
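TL;DR: This paper proposes the LEGO-Eval evaluation framework and the LEGO-Bench benchmark for more accurately assessing the alignment between fine-grained instructions and generated 3D scenes; experiments show that existing methods have significant limitations in generating realistic scenes.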
Details
Motivation: Because instructions are coarse, existing 3D scene generation lacks realistic spatial layouts and object attributes, which leads embodied agents to learn priors that do not match reality. Current evaluation methods also cannot reliably assess fine-grained alignment.
Method: The paper designs the LEGO-Eval evaluation framework, which combines multiple tools to explicitly ground scene components, and builds the LEGO-Bench benchmark of instructions describing complex real-world environments.
Result: LEGO-Eval exceeds VLM-based evaluation by 0.41 in F1 score. Tests on LEGO-Bench show that existing generation methods achieve at most a 10% success rate at fully aligning with fine-grained instructions.
Conclusion: More fine-grained instructions and evaluation mechanisms are needed to improve the realism of 3D scene generation; LEGO-Eval and LEGO-Bench provide effective tools and a standard for improving generation models.
Abstract: Despite recent progress in using Large Language Models (LLMs) for automatically generating 3D scenes, generated scenes often lack realistic spatial layouts and object attributes found in real-world environments. As this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by more detailed, fine-grained instructions that reflect real-world environments becomes crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Thus, verifying the alignment between the fine-grained instruction and the generated scene is essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess such alignment. This shortcoming arises primarily from their shallow understanding of 3D scenes, which often leads to improperly grounded scene components. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions that specify complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods. Across all evaluated approaches, success rates reached at most 10% in generating scenes that fully align with fine-grained instructions.
[4] Targeted Error Correction in Knowledge Distillation: Small Language Models Surpass GPT
Hee-Jin Lee,Zhen Guo,Luchao Jin,Morteza Moazami Goudarzi
Main category: cs.CL
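TL;DR: Proposes an Analyze-Revise-Finetune (ARF) pipeline that enables small open-source language models to surpass large proprietary models on customer service summarization.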
Details
Motivation: To improve the performance of small open-source language models on specific tasks while lowering cost and protecting data privacy.
Method: The pipeline first analyzes errors in summaries generated by GPT-3.5, then uses Llama 3.1 70B as an editor model to make targeted revisions that yield high-quality training data, and finally fine-tunes Llama 3.1 8B on that data.
Result: The fine-tuned small model outperforms GPT-3.5 on the summarization task, with better cost efficiency and data privacy.
Conclusion: The ARF pipeline provides a generalizable framework for enhancing open-source language models and applies to a range of downstream applications.
Abstract: We introduce an Analyze-Revise-Finetune (ARF) pipeline that enables smaller open-source language models (LLMs) to surpass substantially larger proprietary models in customer service summarization tasks. The pipeline first analyzes and categorizes common errors in summaries produced by a teacher model (GPT-3.5), then performs a targeted revision using a compact editor model (Llama 3.1 70B) to generate high-quality, refined training data. Fine-tuning a smaller student model (Llama 3.1 8B) on this refined data resulted in superior summarization performance compared to GPT-3.5. The ARF pipeline improves cost efficiency and data privacy while maintaining competitive accuracy, illustrating a generalizable framework for enhancing open-source LLMs across diverse downstream applications.
[5] Data-Efficient Adaptation and a Novel Evaluation Method for Aspect-based Sentiment Analysis
Yan Cathy Hua,Paul Denny,Jörg Wicker,Katerina Taškova
Main category: cs.CL
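TL;DR: This paper makes three contributions to aspect-based sentiment analysis (ABSA) in low-resource domains such as education: a flexible evaluation method, FTS-OBP; a systematic study of small generative language models together with a multitask fine-tuning strategy; and the first public set of education review ABSA resources.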
Details
Motivation: ABSA research is concentrated in commercial domains, leaving analytical needs unmet in low-resource domains such as education. Existing methods depend on substantial resources, and traditional evaluation is overly strict, which limits accurate assessment of generative models.
Method: The paper proposes the FTS-OBP evaluation method to tolerate boundary variations, studies small generative language models in data-free and data-light settings, designs a multitask fine-tuning strategy, and validates the approach experimentally on education review data.
Result: FTS-OBP correlates strongly with traditional metrics while being more flexible. Small models of 1.5-3.8B parameters, trained with only 200-1,000 examples on a single GPU, match or surpass large proprietary models. The first public education ABSA dataset is released.
Conclusion: The work advances the application of ABSA in low-resource domains, offering an efficient, low-cost solution and a new evaluation standard, and releases its resources to support follow-up research.
Abstract: Aspect-based Sentiment Analysis (ABSA) is a fine-grained opinion mining approach that identifies and classifies opinions associated with specific entities (aspects) or their categories within a sentence. Despite its rapid growth and broad potential, ABSA research and resources remain concentrated in commercial domains, leaving analytical needs unmet in high-demand yet low-resource areas such as education and healthcare. Domain adaptation challenges and most existing methods' reliance on resource-intensive in-training knowledge injection further hinder progress in these areas. Moreover, traditional evaluation methods based on exact matches are overly rigid for ABSA tasks, penalising any boundary variations which may misrepresent the performance of generative models. This work addresses these gaps through three contributions: 1) We propose a novel evaluation method, Flexible Text Similarity Matching and Optimal Bipartite Pairing (FTS-OBP), which accommodates realistic extraction boundary variations while maintaining strong correlation with traditional metrics and offering fine-grained diagnostics. 2) We present the first ABSA study of small decoder-only generative language models (SLMs; <7B parameters), examining resource lower bounds via a case study in education review ABSA. We systematically explore data-free (in-context learning and weight merging) and data-light fine-tuning methods, and propose a multitask fine-tuning strategy that significantly enhances SLM performance, enabling 1.5-3.8B models to surpass proprietary large models and approach benchmark results with only 200-1,000 examples on a single GPU. 3) We release the first public set of education review ABSA resources to support future research in low-resource domains.
[6] ROBoto2: An Interactive System and Dataset for LLM-assisted Clinical Trial Risk of Bias Assessment
Anthony Hevia,Sanjana Chintalapati,Veronica Ka Wai Lai,Thanh Tam Nguyen,Wai-Tat Wong,Terry Klassen,Lucy Lu Wang
Main category: cs.CL
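TL;DR: ROBOTO2 is an open-source, web-based, LLM-assisted platform for clinical trial risk of bias assessment that streamlines the traditional ROB2 annotation workflow by combining PDF parsing, retrieval-augmented LLM prompting, and human feedback.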
Details
Motivation: To reduce the heavy manual annotation workload of traditional risk of bias (ROB2) assessment and to improve the efficiency and reproducibility of bias assessment in systematic reviews.
Method: The authors build an interactive web platform that integrates PDF parsing, retrieval-augmented LLM prompting, and human-in-the-loop review, and construct a dataset of 521 pediatric clinical trial reports for benchmarking.
Result: The work releases the ROBOTO2 platform with its code and data, constructs a dataset containing 8954 signaling questions and 1202 evidence passages, and benchmarks ROB2 performance for 4 LLMs, revealing current model capabilities and challenges.
Conclusion: ROBOTO2 raises the level of automation in risk of bias assessment, is open and extensible, and provides an important resource and reference for future research on automated assessment in systematic reviews.
Abstract: We present ROBOTO2, an open-source, web-based platform for large language model (LLM)-assisted risk of bias (ROB) assessment of clinical trials. ROBOTO2 streamlines the traditionally labor-intensive ROB v2 (ROB2) annotation process via an interactive interface that combines PDF parsing, retrieval-augmented LLM prompting, and human-in-the-loop review. Users can upload clinical trial reports, receive preliminary answers and supporting evidence for ROB2 signaling questions, and provide real-time feedback or corrections to system suggestions. ROBOTO2 is publicly available at https://roboto2.vercel.app/, with code and data released to foster reproducibility and adoption. We construct and release a dataset of 521 pediatric clinical trial reports (8954 signaling questions with 1202 evidence passages), annotated using both manual and LLM-assisted methods, serving as a benchmark and enabling future research. Using this dataset, we benchmark ROB2 performance for 4 LLMs and provide an analysis into current model capabilities and ongoing challenges in automating this critical aspect of systematic review.
[7] Reading Between the Lines: The One-Sided Conversation Problem
Victoria Ebert,Rishabh Singh,Tuochao Chen,Noah A. Smith,Shyamnath Gollakota
Main category: cs.CL
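TL;DR: This paper introduces the one-sided conversation problem (1SC), which aims to infer and learn from a conversation when only one side is recorded. It studies two tasks, reconstructing the missing speaker's turns and summarizing one-sided transcripts, and finds that access to one future turn and utterance-length information improves reconstruction, while high-quality summaries can be generated without reconstruction, pointing to a new direction for privacy-aware conversational AI.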
Details
Motivation: In many real-world settings, such as telemedicine and call centers, only one side of a conversation can be recorded, which limits conventional conversational AI. This calls for studying how to learn and infer effectively from one-sided conversations.
Method: The paper formalizes the 1SC problem and evaluates prompting and fine-tuned models on the MultiWOZ, DailyDialog, and Candor datasets, using human A/B testing and LLM-as-a-judge metrics to analyze how different strategies affect turn reconstruction and summarization.
Result: Access to one future turn and utterance-length information improves reconstruction quality; placeholder prompting reduces hallucination; large models produce good reconstructions with prompting alone, whereas small models need fine-tuning; and high-quality summaries can be generated without reconstructing the missing turns.
Conclusion: 1SC is established as a novel and challenging problem. The results indicate that effective conversation understanding in privacy-sensitive settings is feasible, advancing privacy-aware conversational AI.
Abstract: Conversational AI is constrained in many real-world settings where only one side of a dialogue can be recorded, such as telemedicine, call centers, and smart glasses. We formalize this as the one-sided conversation problem (1SC): inferring and learning from one side of a conversation. We study two tasks: (1) reconstructing the missing speaker's turns for real-time use cases, and (2) generating summaries from one-sided transcripts. Evaluating prompting and finetuned models on MultiWOZ, DailyDialog, and Candor with both human A/B testing and LLM-as-a-judge metrics, we find that access to one future turn and information about utterance length improves reconstruction, placeholder prompting helps to mitigate hallucination, and while large models generate promising reconstructions with prompting, smaller models require finetuning. Further, high-quality summaries can be generated without reconstructing missing turns. We present 1SC as a novel challenge and report promising results that mark a step toward privacy-aware conversational AI.
[8] PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech
Michel Wong,Ali Alshehri,Sophia Kao,Haotian He
Main category: cs.CL
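TL;DR: Proposes PolyNorm, an LLM-based prompting approach that reduces text normalization's reliance on hand-crafted rules and scales to multiple languages through an automatic data curation and evaluation pipeline.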
Details
Motivation: Traditional text normalization systems are accurate but complex to engineer and hard to scale, and they face language coverage problems, especially for low-resource languages.
Method: The work uses an LLM-based prompting approach and designs a language-agnostic pipeline for automatic data curation and evaluation.
Result: Experiments across eight languages show consistent reductions in word error rate (WER) compared with a production-grade system. The work also releases PolyNorm-Benchmark, a multilingual dataset covering a range of text normalization phenomena.
Conclusion: PolyNorm effectively reduces reliance on hand-crafted rules, scales well across languages, and supports follow-up research through the released benchmark dataset.
Abstract: Text Normalization (TN) is a key preprocessing step in Text-to-Speech (TTS) systems, converting written forms into their canonical spoken equivalents. Traditional TN systems can exhibit high accuracy, but involve substantial engineering effort, are difficult to scale, and pose challenges to language coverage, particularly in low-resource settings. We propose PolyNorm, a prompt-based approach to TN using Large Language Models (LLMs), aiming to reduce the reliance on manually crafted rules and enable broader linguistic applicability with minimal human intervention. Additionally, we present a language-agnostic pipeline for automatic data curation and evaluation, designed to facilitate scalable experimentation across diverse languages. Experiments across eight languages show consistent reductions in the word error rate (WER) compared to a production-grade-based system. To support further research, we release PolyNorm-Benchmark, a multilingual data set covering a diverse range of text normalization phenomena.
[9] A Computational Approach to Analyzing Disrupted Language in Schizophrenia: Integrating Surprisal and Coherence Measures
Gowtham Premananth,Carol Espy-Wilson
Main category: cs.CL
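TL;DR: This study uses two computational linguistic measures, surprisal and semantic coherence, to quantify language disruption in people with schizophrenia, and examines how these measures differ between patients and healthy controls and how they relate to symptom severity.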
Details
Motivation: Language disruption is a hallmark of schizophrenia and reflects underlying cognitive disturbances. Quantifiable linguistic measures could support objective diagnosis and assessment of symptom severity.
Method: The study uses computational models to compute surprisal and semantic coherence in the spontaneous language of subjects with schizophrenia and healthy controls, then compares group differences and the correlation of these measures with clinical symptom severity.
Result: The language of subjects with schizophrenia shows higher surprisal and lower semantic coherence, and these measures correlate significantly with symptom severity.
Conclusion: Surprisal and semantic coherence can serve as effective computational measures of language disruption in schizophrenia, with potential as objective biomarkers.
Abstract: Language disruptions are one of the well-known effects of schizophrenia symptoms. They are often manifested as disorganized speech and impaired discourse coherence. These abnormalities in spontaneous language production reflect underlying cognitive disturbances and have the potential to serve as objective markers for symptom severity and diagnosis of schizophrenia. This study focuses on how these language disruptions can be characterized in terms of two computational linguistic measures: surprisal and semantic coherence. By computing surprisal and semantic coherence of language using computational models, this study investigates how they differ between subjects with schizophrenia and healthy controls. Furthermore, this study provides further insight into how language disruptions in terms of these linguistic measures change with varying degrees of schizophrenia symptom severity.
[10] CARMA: Comprehensive Automatically-annotated Reddit Mental Health Dataset for Arabic
Saad Mankarious,Ayah Zirikly
Main category: cs.CL
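TL;DR: This paper presents CARMA, the first large-scale automatically annotated dataset of Arabic Reddit posts for detecting six mental health conditions, addressing the scarcity of Arabic-language resources.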
Details
Motivation: Arabic-speaking populations face limited resources for early detection of mental health problems as well as cultural stigma, and the lack of annotated datasets has held research back.
Method: The authors build a large-scale automatically annotated Arabic Reddit dataset (CARMA) covering six mental health conditions and a control group, perform qualitative and quantitative lexical and semantic analyses, and run classification experiments with a range of models.
Result: CARMA surpasses existing datasets in scale and diversity. Experiments show that models of all types can classify the posts effectively, supporting the dataset's potential for Arabic mental health detection.
Conclusion: CARMA provides a valuable resource for Arabic mental health research and advances automatic mental health detection in low-resource languages.
Abstract: Mental health disorders affect millions worldwide, yet early detection remains a major challenge, particularly for Arabic-speaking populations where resources are limited and mental health discourse is often discouraged due to cultural stigma. While substantial research has focused on English-language mental health detection, Arabic remains significantly underexplored, partly due to the scarcity of annotated datasets. We present CARMA, the first automatically annotated large-scale dataset of Arabic Reddit posts. The dataset encompasses six mental health conditions, such as Anxiety, Autism, and Depression, and a control group. CARMA surpasses existing resources in both scale and diversity. We conduct qualitative and quantitative analyses of lexical and semantic differences between users, providing insights into the linguistic markers of specific mental health conditions. To demonstrate the dataset's potential for further mental health analysis, we perform classification experiments using a range of models, from shallow classifiers to large language models. Our results highlight the promise of advancing mental health detection in underrepresented languages such as Arabic.
[11] Control Barrier Function for Aligning Large Language Models
Yuya Miyaoka,Masaki Inoue
Main category: cs.CL
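TL;DR: Proposes a control framework based on a control barrier function (CBF) for aligning large language models; a safety filter intervenes in the generated text, requires no fine-tuning, and can be combined directly with an evaluation model.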
Details
Motivation: To align text generation with user preferences without fine-tuning the large language model, while ensuring the safety and compliance of the generated content.
Method: A control barrier function (CBF) is used to design an add-on safety filter that is applied to the tokens predicted by the baseline model, dynamically intervening in the text generation process.
Result: The framework can be implemented with open-source language models and generates positive content without fine-tuning; it also supports integrating an external evaluation model to guide alignment.
Conclusion: The proposed CBF-based safety filter framework offers a flexible, scalable, fine-tuning-free solution for LLM alignment, suited to safety-sensitive text generation.
Abstract: This paper proposes a control-based framework for aligning large language models (LLMs) by leveraging a control barrier function (CBF) to ensure user-desirable text generation. The presented framework applies the CBF safety filter to the predicted token generated from the baseline LLM, to intervene in the generated text. The safety filter includes two significant advantages: this safety filter is an add-on type, allowing it to be used for alignment purposes without fine-tuning the baseline LLM, and if there is an evaluation model regarding the desired alignment, it can be directly applied to the filter design. The overall text-generation system is implemented with open-source language models, aiming to generate positive text.
[12] MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity
Kaiyuan Zhang,Chenghao Yang,Zhoufutu Wen,Sihang Yuan,Qiuyue Wang,Chaoyi Huang,Guosheng Zhu,He Wang,Huawenyu Lu,Jianing Wen,Jianpeng Jiao,Lishu Luo,Longxiang Liu,Sijin Wu,Xiaolei Zhu,Xuanliang Zhang,Ge Zhang,Yi Lin,Guang Shi,Chaoyou Fu,Wenhao Huang
Main category: cs.CL
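TL;DR: This paper presents MME-CC, a vision-grounded benchmark for evaluating multimodal cognitive capacity that covers three task categories: spatial, geometric, and knowledge-based reasoning. It systematically evaluates 16 mainstream MLLMs, finds that closed-source models lead overall while spatial and geometric reasoning remain weak, and identifies common error patterns and a three-stage structure in Chain-of-Thought reasoning.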
Details
Motivation: Existing multimodal benchmarks either overemphasize textual reasoning or fail to systematically capture vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs under-assessed. A systematic benchmark focused on visual cognition is therefore needed.
Method: The paper proposes the MME-CC benchmark, which groups 11 representative reasoning tasks into spatial, geometric, and knowledge-based reasoning, runs large-scale experiments on 16 mainstream MLLMs, and performs fine-grained analysis to identify error patterns and Chain-of-Thought structure.
Result: Closed-source models lead overall (e.g., Gemini-2.5-Pro scores 42.66 versus 30.45 for GLM-4.5V), but spatial and geometric reasoning are broadly weak (≤30%). Common errors include orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions. Chain-of-Thought typically proceeds through three stages (extract -> reason -> verify) and relies heavily on visual extraction.
Conclusion: MME-CC offers a systematic framework for evaluating the visual cognitive capacity of MLLMs, exposes the limitations of current models in visual reasoning, and calls for future work to place cognitive capacity at the center of multimodal model evaluation and design.
Abstract: As reasoning models scale rapidly, the essential role of multimodality in human cognition has come into sharp relief, driving a growing need to probe vision-centric cognitive behaviors. Yet, existing multimodal benchmarks either overemphasize textual reasoning or fall short of systematically capturing vision-centric cognitive behaviors, leaving the cognitive capacity of MLLMs insufficiently assessed. To address this limitation, we introduce MME-CC (Multi-Modal Evaluation benchmark of Cognitive Capacity), a vision-grounded benchmark that organizes 11 representative reasoning tasks into three fundamental categories of visual information: spatial, geometric, and knowledge-based reasoning, and provides fine-grained analyses of MLLMs' cognitive capacity across these dimensions. Based on MME-CC, we conduct extensive experiments over 16 representative MLLMs. Our study reveals that closed-source models currently lead overall (e.g., 42.66 for Gemini-2.5-Pro vs. 30.45 for GLM-4.5V), while spatial and geometric reasoning remain broadly weak (less than or equal to 30%). We further identify common error patterns, including orientation mistakes, fragile cross-view identity persistence, and poor adherence to counterfactual instructions, and observe that Chain-of-Thought typically follows a three-stage process (extract -> reason -> verify) with heavy reliance on visual extraction. We hope this work catalyzes a shift toward treating the cognitive capacity of MLLMs as central to both evaluation and model design.
[13] Who Sees the Risk? Stakeholder Conflicts and Explanatory Policies in LLM-based Risk Assessment
Srishti Yadav,Jasmina Gajcin,Erik Miehling,Elizabeth Daly
Main category: cs.CL
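TL;DR: This paper proposes an LLM-based framework for assessing AI system risks from the perspectives of different stakeholders, using explainability methods to generate stakeholder-specific risk policies and revealing differences and conflicts in risk perception across use cases such as medical AI, autonomous vehicles, and fraud detection.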
Details
Motivation: Stakeholders differ in how they perceive the risks of AI systems, and traditional risk assessment methods struggle to capture these differences and to provide transparent explanations. A more human-centered, interpretable approach to risk assessment is needed.
Method: LLMs act as judges to predict and explain risks. Combining Risk Atlas Nexus with the GloVE explanation method, the authors build a stakeholder-specific, interpretable risk assessment framework and design an interactive visualization that reveals the sources of conflict among stakeholders.
Result: The framework is validated on three real-world AI use cases. The results show that stakeholder perspectives significantly affect risk judgments and conflict patterns, and that the method can surface the reasons for conflicts.
Conclusion: Stakeholder perspectives significantly influence AI risk assessment outcomes. The proposed framework makes LLM-based evaluation more transparent and interpretable, supporting human-centered AI governance goals.
Abstract: Understanding how different stakeholders perceive risks in AI systems is essential for their responsible deployment. This paper presents a framework for stakeholder-grounded risk assessment by using LLMs, acting as judges to predict and explain risks. Using the Risk Atlas Nexus and GloVE explanation method, our framework generates stakeholder-specific, interpretable policies that show how different stakeholders agree or disagree about the same risks. We demonstrate our method using three real-world AI use cases of medical AI, autonomous vehicles, and fraud detection domain. We further propose an interactive visualization that reveals how and why conflicts emerge across stakeholder perspectives, enhancing transparency in conflict reasoning. Our results show that stakeholder perspectives significantly influence risk perception and conflict patterns. Our work emphasizes the importance of these stakeholder-aware explanations needed to make LLM-based evaluations more transparent, interpretable, and aligned with human-centered AI governance goals.
[14] Measuring Aleatoric and Epistemic Uncertainty in LLMs: Empirical Evaluation on ID and OOD QA Tasks
Kevin Wang,Subre Abdoul Moktar,Jia Li,Kangshuo Li,Feng Chen
Main category: cs.CL
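TL;DR: This paper presents a comprehensive empirical study of uncertainty estimation (UE) methods for large language models (LLMs), evaluating the robustness and effectiveness of 12 UE methods on in-distribution and out-of-distribution data in question-answering tasks.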
Details
Motivation: Ensuring the trustworthiness of LLM outputs is critical, and uncertainty estimation plays a key role. A systematic evaluation is needed of how different UE methods perform under different types of uncertainty, namely aleatoric and epistemic.
Method: The study covers 12 uncertainty estimation methods and 4 generation quality metrics (including LLMScore, which is based on LLM criticizers), evaluated on in-distribution and out-of-distribution question-answering datasets.
Result: Information-based methods perform very well in in-distribution settings; density-based methods and the P(True) metric do better out of distribution; and semantic consistency methods perform reliably across datasets and metrics.
Conclusion: Different types of uncertainty estimation methods each have their strengths. The method should be chosen to fit the application scenario (for example, whether the data are out of distribution) in order to improve the trustworthiness of LLM outputs.
Abstract: Large Language Models (LLMs) have become increasingly pervasive, finding applications across many industries and disciplines. Ensuring the trustworthiness of LLM outputs is paramount, where Uncertainty Estimation (UE) plays a key role. In this work, a comprehensive empirical study is conducted to examine the robustness and effectiveness of diverse UE measures regarding aleatoric and epistemic uncertainty in LLMs. It involves twelve different UE methods and four generation quality metrics including LLMScore from LLM criticizers to evaluate the uncertainty of LLM-generated answers in Question-Answering (QA) tasks on both in-distribution (ID) and out-of-distribution (OOD) datasets. Our analysis reveals that information-based methods, which leverage token and sequence probabilities, perform exceptionally well in ID settings due to their alignment with the model's understanding of the data. Conversely, density-based methods and the P(True) metric exhibit superior performance in OOD contexts, highlighting their effectiveness in capturing the model's epistemic uncertainty. Semantic consistency methods, which assess variability in generated answers, show reliable performance across different datasets and generation metrics. These methods generally perform well but may not be optimal for every situation.
[15] BengaliMoralBench: A Benchmark for Auditing Moral Reasoning in Large Language Models within Bengali Language and Culture
Shahriyar Zaman Ridoy,Azmine Toushik Wasi,Koushik Ahamed Tonmoy
Main category: cs.CL
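TL;DR: This paper presents BengaliMoralBench, the first large-scale ethics evaluation benchmark for Bengali and South Asian cultural contexts, filling a gap in the localized ethical alignment of multilingual LLMs.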
Details
Motivation: Existing ethics benchmarks are centered on English and Western values and overlook non-Western languages such as Bengali and their cultural nuances, which limits the responsible real-world deployment of LLMs.
Method: The authors build the BengaliMoralBench dataset covering five moral domains and 50 subtopics, annotated by native-speaker consensus through three lenses (Virtue, Commonsense, and Justice ethics), and evaluate multiple multilingual LLMs zero-shot with a unified prompting protocol.
Result: Mainstream multilingual LLMs vary widely in performance (50-91% accuracy) and show consistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness.
Conclusion: BengaliMoralBench provides an important foundation for culturally aligned AI in low-resource multilingual settings, promoting responsible AI that better fits local ethical norms.
Abstract: As multilingual Large Language Models (LLMs) gain traction across South Asia, their alignment with local ethical norms, particularly for Bengali, which is spoken by over 285 million people and ranked 6th globally, remains underexplored. Existing ethics benchmarks are largely English-centric and shaped by Western frameworks, overlooking cultural nuances critical for real-world deployment. To address this, we introduce BengaliMoralBench, the first large-scale ethics benchmark for the Bengali language and socio-cultural contexts. It covers five moral domains, Daily Activities, Habits, Parenting, Family Relationships, and Religious Activities, subdivided into 50 culturally relevant subtopics. Each scenario is annotated via native-speaker consensus using three ethical lenses: Virtue, Commonsense, and Justice ethics. We conduct systematic zero-shot evaluation of prominent multilingual LLMs, including Llama, Gemma, Qwen, and DeepSeek, using a unified prompting protocol and standard metrics. Performance varies widely (50-91% accuracy), with qualitative analysis revealing consistent weaknesses in cultural grounding, commonsense reasoning, and moral fairness. BengaliMoralBench provides a foundation for responsible localization, enabling culturally aligned evaluation and supporting the deployment of ethically robust AI in diverse, low-resource multilingual settings such as Bangladesh.
[16] LGM: Enhancing Large Language Models with Conceptual Meta-Relations and Iterative Retrieval
Wenchang Lei,Ping Zou,Yue Wang,Feng Sun,Lei Zhao
Main category: cs.CL
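TL;DR: Proposes the Language Graph Model (LGM), which extracts meta-relations (inheritance, alias, and composition) from natural language and applies a reflection mechanism to improve LLMs' understanding of instructions that contain ambiguous or conceptually misaligned terms.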
Details
Motivation: Large language models perform poorly on user instructions that contain ambiguous or conceptually inconsistent terms, so their conceptual understanding and clarification ability need to be strengthened.
Method: The authors build the Language Graph Model (LGM), which extracts inheritance, alias, and composition meta-relations, validates them with a reflection mechanism, and uses a Concept Iterative Retrieval Algorithm to dynamically supply relevant information to the LLM.
Result: On standard benchmarks, LGM consistently outperforms existing RAG baselines and can process text of any length without truncation.
Conclusion: Through structured meta-relations and dynamic retrieval, LGM significantly improves LLMs' understanding of complex concepts and removes conventional RAG's dependence on the context window.
Abstract: Large language models (LLMs) exhibit strong semantic understanding, yet struggle when user instructions involve ambiguous or conceptually misaligned terms. We propose the Language Graph Model (LGM) to enhance conceptual clarity by extracting meta-relations (inheritance, alias, and composition) from natural language. The model further employs a reflection mechanism to validate these meta-relations. Leveraging a Concept Iterative Retrieval Algorithm, these relations and related descriptions are dynamically supplied to the LLM, improving its ability to interpret concepts and generate accurate responses. Unlike conventional Retrieval-Augmented Generation (RAG) approaches that rely on extended context windows, our method enables large language models to process texts of any length without the need for truncation. Experiments on standard benchmarks demonstrate that the LGM consistently outperforms existing RAG baselines.
[17] Hybrid Fact-Checking that Integrates Knowledge Graphs, Large Language Models, and Search-Based Retrieval Agents Improves Interpretable Claim Verification
Shaghayegh Kolli,Richard Rosenbaum,Timo Cavelius,Lasse Strothe,Andrii Lata,Jana Diesner
Main category: cs.CL
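TL;DR: Proposes a hybrid fact-checking approach that combines knowledge graphs, large language models, and web search agents; it reaches an F1 score of 0.93 on the FEVER benchmark and can identify verifiable claims that were originally labeled "Not Enough Information".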
Details
Motivation: Large language models generate fluent text but lack reliable factual grounding, while knowledge-graph fact-checking is precise but limited in coverage. Combining their strengths can improve both accuracy and coverage.
Method: The system uses a three-step autonomous pipeline: 1) knowledge graph retrieval based on DBpedia; 2) LLM-based classification guided by a task-specific prompt; and 3) a web search agent invoked only when knowledge graph coverage is insufficient.
Result: The pipeline achieves an F1 score of 0.93 on Supported/Refuted classification on the FEVER dataset without task-specific fine-tuning. A reannotation study shows that the system finds valid evidence for many claims labeled NEI.
Conclusion: The modular, open-source fact-checking framework combines the strengths of each component, generalizes well, and includes fallback mechanisms, improving the accuracy and coverage of fact-checking.
Abstract: Large language models (LLMs) excel in generating fluent utterances but can lack reliable grounding in verified information. At the same time, knowledge-graph-based fact-checkers deliver precise and interpretable evidence, yet suffer from limited coverage or latency. By integrating LLMs with knowledge graphs and real-time search agents, we introduce a hybrid fact-checking approach that leverages the individual strengths of each component. Our system comprises three autonomous steps: 1) a Knowledge Graph (KG) Retrieval for rapid one-hop lookups in DBpedia, 2) an LM-based classification guided by a task-specific labeling prompt, producing outputs with internal rule-based logic, and 3) a Web Search Agent invoked only when KG coverage is insufficient. Our pipeline achieves an F1 score of 0.93 on the FEVER benchmark on the Supported/Refuted split without task-specific fine-tuning. To address Not Enough Information cases, we conduct a targeted reannotation study showing that our approach frequently uncovers valid evidence for claims originally labeled as Not Enough Information (NEI), as confirmed by both expert annotators and LLM reviewers. With this paper, we present a modular, open-source fact-checking pipeline with fallback strategies and generalization across datasets.
[18] Beyond Ranked Lists: The SARAL Framework for Cross-Lingual Document Set Retrieval
Shantanu Agarwal,Joel Barry,Elizabeth Boschee,Scott Miller
Main category: cs.CL
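TL;DR: This paper describes SARAL, the approach developed by the ISI team for the MATERIAL program to improve cross-lingual information retrieval (CLIR); it emphasizes retrieving a query-relevant set of documents rather than only a ranked list, and it outperformed other teams in multilingual evaluations.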
Details
Motivation: To improve cross-lingual information retrieval, particularly for retrieving sets of relevant documents rather than simple ranked lists.
Method: The paper proposes SARAL, a novel CLIR approach that emphasizes domain adaptation and summarization to support retrieval of query-relevant document sets.
Result: In MATERIAL's Phase-3 evaluations, SARAL outperformed other teams in five of six evaluation conditions spanning three languages: Farsi, Kazakh, and Georgian.
Conclusion: SARAL significantly improves cross-lingual information retrieval performance in multilingual settings, validating its effectiveness in practice.
Abstract: Machine Translation for English Retrieval of Information in Any Language (MATERIAL) is an IARPA initiative targeted to advance the state of cross-lingual information retrieval (CLIR). This report provides a detailed description of Information Sciences Institute's (ISI's) Summarization and domain-Adaptive Retrieval Across Language's (SARAL's) effort for MATERIAL. Specifically, we outline our team's novel approach to handle CLIR with emphasis on developing an approach amenable to retrieve a query-relevant document set, and not just a ranked document-list. In MATERIAL's Phase-3 evaluations, SARAL exceeded the performance of other teams in five out of six evaluation conditions spanning three different languages (Farsi, Kazakh, and Georgian).
[19] IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs
Souvik Rana,Arul Menezes,Ashish Kulkarni,Chandra Khatri,Shubham Agarwal
Main category: cs.CL
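TL;DR: This paper presents IndicSuperTokenizer, a new tokenizer for Indic multilingual LLMs that combines subword and multi-word tokenization with language-specific pre-tokenization; it substantially improves fertility score over existing tokenizers and increases inference throughput.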
Details
Motivation: Designing efficient multilingual tokenizers is challenging because of diverse scripts and rich morphological variation, and the effectiveness of existing subword methods in multilingual settings remains insufficiently studied.
Method: The paper proposes IndicSuperTokenizer, which combines subword and multi-word tokenization with language-specific pre-tokenization to improve the tokenization of multilingual text.
Result: Evaluated on English, 22 Indian languages, and code data, the tokenizer improves the average fertility score by 39.5% over LLaMA4 and by 18% over Sutra, improves inference throughput by 44%, and maintains comparable performance on English and Indic benchmarks.
Conclusion: By combining multiple tokenization strategies, IndicSuperTokenizer achieves better linguistic alignment and efficiency in multilingual settings, offering an effective tokenization solution for multilingual LLMs.
Abstract: Tokenizers play a crucial role in determining the performance, training efficiency, and the inference cost of Large Language Models (LLMs). Designing effective tokenizers for multilingual LLMs is particularly challenging due to diverse scripts and rich morphological variation. While subword methods such as Byte Pair Encoding (BPE) are widely adopted, their effectiveness in multilingual settings remains underexplored. We present IndicSuperTokenizer, a tokenizer for Indic multilingual LLMs, that combines both subword and multi-word tokenization, along with language-specific pre-tokenization, leading to more linguistically aligned tokens and achieving a new state-of-the-art in fertility score. Evaluated across English, 22 Indian languages and code data, our tokenizer improves the average fertility score by 39.5% over LLaMA4 and by 18% over Sutra (the current best). This translates to 44% improvement in inference throughput over LLaMA4 while maintaining comparable performance on English and Indic benchmarks. We also present detailed ablations across tokenizer training data size, vocabulary size, merging techniques, and pre-tokenization strategies, demonstrating the robustness of our design choices.
[20] Comparing the Performance of LLMs in RAG-based Question-Answering: A Case Study in Computer Science Literature
Ranul Dayarathne,Uvini Ranaweera,Upeksha Ganegoda
Main category: cs.CL
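TL;DR: This study compares four open-source LLMs (Mistral-7b-instruct, LLaMa2-7b-chat, Falcon-7b-instruct, Orca-mini-v3-7b) with GPT-3.5 on RAG-supported question answering over computer science literature; Mistral-7b-instruct performs best among the open-source models, while Orca-mini-v3-7b has the lowest response latency.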
Details
Motivation: As RAG matures, it becomes necessary to evaluate how different LLMs perform on domain-specific question answering, in order to understand whether open-source models can compete with advanced closed-source ones.
Method: Uses accuracy and precision, ranking by a human expert, ranking by the Gemini model, and cosine similarity as evaluation metrics to compare several open-source and closed-source LLMs on binary and long-answer questions.
Result: GPT-3.5 with RAG performs well; among open-source models, Mistral-7b-instruct leads in answer accuracy, Orca-mini-v3-7b responds fastest, and LLaMa2-7b-chat has the highest latency.
Conclusion: With better infrastructure, open-source LLMs can rival closed-source models such as GPT-3.5 and have practical potential.
Abstract: Retrieval Augmented Generation (RAG) is emerging as a powerful technique to enhance the capabilities of Generative AI models by reducing hallucination. Thus, the increasing prominence of RAG alongside Large Language Models (LLMs) has sparked interest in comparing the performance of different LLMs in question-answering (QA) in diverse domains. This study compares the performance of four open-source LLMs, Mistral-7b-instruct, LLaMa2-7b-chat, Falcon-7b-instruct and Orca-mini-v3-7b, and OpenAI's trending GPT-3.5 over QA tasks within the computer science literature leveraging RAG support. Evaluation metrics employed in the study include accuracy and precision for binary questions and ranking by a human expert, ranking by Google's AI model Gemini, alongside cosine similarity for long-answer questions. GPT-3.5, when paired with RAG, effectively answers binary and long-answer questions, reaffirming its status as an advanced LLM. Regarding open-source LLMs, Mistral AI's Mistral-7b-instruct paired with RAG surpasses the rest in answering both binary and long-answer questions. However, among the open-source LLMs, Orca-mini-v3-7b reports the shortest average latency in generating responses, whereas LLaMa2-7b-chat by Meta reports the highest average latency. This research underscores the fact that open-source LLMs, too, can go hand in hand with proprietary models like GPT-3.5 with better infrastructure.
[21] SCALE: Upscaled Continual Learning of Large Language Models
Jin-woo Lee,Junhwa Choi,Bongkyu Hwang,Jinho Choo,Bogun Kim,JeongSeon Yi,Joonseok Lee,DongYoung Jung,Jaeseon Park,Kyoungwon Park,Suk-hoon Jung
Main category: cs.CL
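TL;DR: This paper proposes SCALE, a width-upscaling architecture for continual pre-training of LLMs that inserts lightweight expansions and freezes the pre-trained parameters, increasing capacity without disturbing the model's original functionality.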
Details
Motivation: Existing continual pre-training methods suffer from severe forgetting and limited stability when they simply scale up parameters, so a structured expansion scheme that better balances stability and plasticity is needed.
Method: Proposes the SCALE architecture, built on two principles, Persistent Preservation and Collaborative Adaptation: trainable lightweight expansions are inserted into linear modules while the original parameters are frozen. Three variants are designed: SCALE-Preserve, SCALE-Adapt, and SCALE-Route.
Result: On a synthetic biography benchmark, SCALE substantially mitigates the severe forgetting caused by depth expansion; in continual pre-training on Korean, it forgets less on English evaluations and achieves competitive gains on Korean tasks, yielding a better stability-plasticity trade-off.
Conclusion: Through structured width expansion rather than parameter scaling alone, SCALE balances knowledge retention with the acquisition of new knowledge, offering a better architectural path for continual learning in large models.
Abstract: We revisit continual pre-training for large language models and argue that progress now depends more on scaling the right structure than on scaling parameters alone. We introduce SCALE, a width upscaling architecture that inserts lightweight expansion into linear modules while freezing all pre-trained parameters. This preserves the residual and attention topologies and increases capacity without perturbing the base model's original functionality. SCALE is guided by two principles: Persistent Preservation, which maintains the base model's behavior via preservation-oriented initialization and freezing of the pre-trained weights, and Collaborative Adaptation, which selectively trains a subset of expansion components to acquire new knowledge with minimal interference. We instantiate these ideas as SCALE-Preserve (preservation-first), SCALE-Adapt (adaptation-first), and SCALE-Route, an optional routing extension that performs token-level routing between preservation and adaptation heads. On a controlled synthetic biography benchmark, SCALE mitigates the severe forgetting observed with depth expansion while still acquiring new knowledge. In continual pre-training on a Korean corpus, SCALE variants achieve less forgetting on English evaluations and competitive gains on Korean benchmarks, with these variants offering the best overall stability-plasticity trade-off. Accompanying analysis clarifies when preservation provably holds and why the interplay between preservation and adaptation stabilizes optimization compared to standard continual learning setups.
[22] How to Evaluate Speech Translation with Source-Aware Neural MT Metrics
Mauro Cettolo,Marco Gaido,Matteo Negri,Sara Papi,Luisa Bentivogli
Main category: cs.CL
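TL;DR: This is the first systematic study of how textual proxies produced by speech recognition and back-translation can improve automatic evaluation of speech-to-text translation (ST) systems when source transcripts are unavailable; it proposes a new cross-lingual re-segmentation algorithm to resolve alignment issues and validates the approach.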
Details
Motivation: Conventional ST evaluation relies on reference translations and ignores information in the source audio, whereas in machine translation neural metrics that incorporate the source text perform better. Applying a similar idea to ST is therefore valuable, especially in realistic settings without source transcripts.
Method: Two strategies generate textual proxies of the input audio: automatic speech recognition (ASR) transcripts and back-translations of the reference translation. A two-step cross-lingual re-segmentation algorithm mitigates the alignment mismatch between synthetic sources and reference translations, and source-aware MT metrics are tested across multiple ST benchmarks and systems.
Result: When word error rate is below 20%, ASR transcripts are a more reliable synthetic source than back-translations; back-translations are slightly less accurate but computationally cheaper and still effective; the proposed re-segmentation algorithm markedly improves the robustness and correlation of source-aware metrics in ST evaluation.
Conclusion: With reliable textual proxies and cross-lingual re-segmentation, source-aware metrics can improve the accuracy of ST evaluation, laying groundwork for more principled and precise ST evaluation methods.
Abstract: Automatic evaluation of speech-to-text translation (ST) systems is typically performed by comparing translation hypotheses with one or more reference translations. While effective to some extent, this approach inherits the limitation of reference-based evaluation that ignores valuable information from the source input. In machine translation (MT), recent progress has shown that neural metrics incorporating the source text achieve stronger correlation with human judgments. Extending this idea to ST, however, is not trivial because the source is audio rather than text, and reliable transcripts or alignments between source and references are often unavailable. In this work, we conduct the first systematic study of source-aware metrics for ST, with a particular focus on real-world operating conditions where source transcripts are not available. We explore two complementary strategies for generating textual proxies of the input audio: automatic speech recognition (ASR) transcripts and back-translations of the reference translation. We also introduce a novel two-step cross-lingual re-segmentation algorithm to address the alignment mismatch between synthetic sources and reference translations. Our experiments, carried out on two ST benchmarks covering 79 language pairs and six ST systems with diverse architectures and performance levels, show that ASR transcripts constitute a more reliable synthetic source than back-translations when word error rate is below 20%, while back-translations always represent a computationally cheaper but still effective alternative. Furthermore, our cross-lingual re-segmentation algorithm enables robust use of source-aware MT metrics in ST evaluation, paving the way toward more accurate and principled evaluation methodologies for speech translation.
[23] Benchmarking the Thinking Mode of Multimodal Large Language Models in Clinical Tasks
Jindong Hong,Tianjie Chen,Lingjie Luo,Chuanyang Zheng,Ting Xu,Haibao Yu,Jianing Qiu,Qianzhong Chen,Suning Huang,Yan Xu,Yong Gui,Yijun He,Jiankai Sun
Main category: cs.CL
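TL;DR: This study evaluates the "thinking mode" of two leading multimodal LLMs (Seed1.5-VL and Gemini-2.5-Flash) on medical tasks and finds that activating thinking mode yields only limited gains on most tasks, with performance on complex tasks such as open-ended visual question answering and medical image interpretation remaining unsatisfactory.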
Details
Motivation: With the emergence of "reasoning MLLMs" that offer explicit control over their internal thinking process, there is a pressing need to evaluate how the enhanced reasoning affects model performance and reliability in clinical tasks.
Method: Compares the two MLLMs in "thinking mode" and "non-thinking mode" on four visual medical tasks using the VQA-RAD and ROCOv2 datasets.
Result: Activating thinking mode brings only marginal gains over the standard non-thinking mode; performance on complex medical tasks remains poor.
Conclusion: Current dual-state MLLMs remain limited in medical applications; domain-specific medical data and more advanced methods for medical knowledge integration are needed.
Abstract: A recent advancement in Multimodal Large Language Models (MLLMs) research is the emergence of "reasoning MLLMs" that offer explicit control over their internal thinking processes (normally referred to as the "thinking mode") alongside the standard "non-thinking mode". This capability allows these models to engage in a step-by-step process of internal deliberation before generating a final response. With the rapid transition to and adoption of these "dual-state" MLLMs, this work rigorously evaluated how the enhanced reasoning processes of these MLLMs impact model performance and reliability in clinical tasks. This paper evaluates the active "thinking mode" capabilities of two leading MLLMs, Seed1.5-VL and Gemini-2.5-Flash, for medical applications. We assessed their performance on four visual medical tasks using VQA-RAD and ROCOv2 datasets. Our findings reveal that the improvement from activating the thinking mode remains marginal compared to the standard non-thinking mode for the majority of the tasks. Their performance on complex medical tasks such as open-ended VQA and medical image interpretation remains suboptimal, highlighting the need for domain-specific medical data and more advanced methods for medical knowledge integration.
[24] Generative Artificial Intelligence in Bioinformatics: A Systematic Review of Models, Applications, and Methodological Advances
Riasad Alvi,Sayeem Been Zaman,Wasimul Karim,Arefin Ittesafun Abian,Mohaimenul Azam Khan Raiaan,Saddam Mukta,Md Rafi Ur Rashid,Md Rafiqul Islam,Yakub Sebastian,Sami Azam
Main category: cs.CL
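TL;DR: This review surveys the methodological advances, predictive performance, and specialized applications of generative AI (GenAI) in bioinformatics, systematically posing six research questions that span subfields from sequence analysis to molecular design and discussing model architectures, data resources, and future directions.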
Details
Motivation: With GenAI advancing rapidly in genomics, proteomics, and related fields, a systematic assessment of its methodological strengths and limitations is needed to guide deeper application in biomedicine.
Method: Following the PRISMA methodology, poses six research questions (RQs) and systematically reviews GenAI applications in bioinformatics, analyzing model architectures, data integration, structural modeling, and related aspects.
Result: Specialized model architectures outperform general-purpose ones; GenAI performs well in molecular analysis, functional prediction, and synthetic data generation; mainstream molecular, cellular, and textual datasets effectively support model training and generalization.
Conclusion: GenAI shows strong potential in bioinformatics, but limited scalability and data bias must be addressed; future work should strengthen the integration of biological mechanisms and robustness evaluation.
Abstract: Generative artificial intelligence (GenAI) has become a transformative approach in bioinformatics that often enables advancements in genomics, proteomics, transcriptomics, structural biology, and drug discovery. To systematically identify and evaluate these growing developments, this review proposes six research questions (RQs), following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodology. The objective is to evaluate impactful GenAI strategies in methodological advancement, predictive performance, and specialization, and to identify promising approaches for advanced modeling, data-intensive discovery, and integrative biological analysis. RQ1 highlights diverse applications across multiple bioinformatics subfields (sequence analysis, molecular design, and integrative data modeling), which demonstrate superior performance over traditional methods through pattern recognition and output generation. RQ2 reveals that adapted specialized model architectures outperformed general-purpose models, an advantage attributed to targeted pretraining and context-aware strategies. RQ3 identifies significant benefits in the bioinformatics domains, focusing on molecular analysis and data integration, which improves accuracy and reduces errors in complex analysis. RQ4 indicates improvements in structural modeling, functional prediction, and synthetic data generation, validated by established benchmarks. RQ5 identifies the main constraints, such as the lack of scalability and biases in data that impact generalizability, and proposes future directions focused on robust evaluation and biologically grounded modeling. RQ6 finds that molecular datasets (such as UniProtKB and ProteinNet12), cellular datasets (such as CELLxGENE and GTEx) and textual resources (such as PubMedQA and OMIM) broadly support the training and generalization of GenAI models.
[25] Silenced Biases: The Dark Side LLMs Learned to Refuse
Rom Himelstein,Amit LeVi,Brit Youngmann,Yaniv Nemcovsky,Avi Mendelson
Main category: cs.CL
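TL;DR: This paper introduces the concept of "silenced biases", unfair preferences hidden behind the refusals of safety-aligned LLMs, and proposes the Silenced Bias Benchmark (SBB), which uses activation steering to expose these biases and assess model fairness more faithfully.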
Details
Motivation: Existing fairness evaluations often treat a model's refusals as evidence of fairness, overlooking biases latent in the model's representation space and thereby misjudging its fairness.
Method: Proposes the SBB benchmark, which uses activation steering to reduce model refusals during QA and thus expose underlying biases, and which can be extended to new demographic groups and subjects.
Result: Experiments on multiple LLMs show a marked gap between models' surface refusal behavior and the serious biases they hold internally, revealing fairness problems masked in current safety-aligned models.
Conclusion: SBB provides a deeper, extensible fairness evaluation framework that supports the development of genuinely fair models beyond the masking effects of alignment training.
Abstract: Safety-aligned large language models (LLMs) are becoming increasingly widespread, especially in sensitive applications where fairness is essential and biased outputs can cause significant harm. However, evaluating the fairness of models is a complex challenge, and approaches that do so typically utilize standard question-answer (QA) styled schemes. Such methods often overlook deeper issues by interpreting the model's refusal responses as positive fairness measurements, which creates a false sense of fairness. In this work, we introduce the concept of silenced biases, which are unfair preferences encoded within models' latent space and are effectively concealed by safety-alignment. Previous approaches that considered similar indirect biases often relied on prompt manipulation or handcrafted implicit queries, which present limited scalability and risk contaminating the evaluation process with additional biases. We propose the Silenced Bias Benchmark (SBB), which aims to uncover these biases by employing activation steering to reduce model refusals during QA. SBB supports easy expansion to new demographic groups and subjects, presenting a fairness evaluation framework that encourages the future development of fair models and tools beyond the masking effects of alignment training. We demonstrate our approach over multiple LLMs, where our findings expose an alarming distinction between models' direct responses and their underlying fairness issues.
[26] EQ-Negotiator: Dynamic Emotional Personas Empower Small Language Models for Edge-Deployable Credit Negotiation
Yunbo Long,Yuhan Liu,Alexandra Brintrup
Main category: cs.CL
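TL;DR: This paper proposes EQ-Negotiator, a framework combining game theory with a Hidden Markov Model (HMM) that lets small language models (SLMs) negotiate efficiently and ethically in privacy-constrained settings, outperforming much larger models in credit negotiation.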
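To make the idea of tracking a hidden emotional state online more concrete, here is a minimal sketch of one step of standard HMM forward filtering. The two states, the observation labels, and every probability below are hypothetical values chosen for illustration; the paper's actual state space, parameters, and game-theoretic layer are not reproduced here.

```python
from typing import List


def forward_update(
    belief: List[float],
    transition: List[List[float]],
    emission: List[List[float]],
    observation: int,
) -> List[float]:
    """One HMM filtering step: predict with the transition matrix, then
    reweight by the likelihood of the new observation and renormalize."""
    n = len(belief)
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    weighted = [predicted[j] * emission[j][observation] for j in range(n)]
    total = sum(weighted)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [w / total for w in weighted]


# Hypothetical setup: hidden states 0 = calm, 1 = angry;
# observed message tone 0 = polite, 1 = hostile.
transition = [[0.8, 0.2], [0.3, 0.7]]  # P(next state | current state)
emission = [[0.9, 0.1], [0.4, 0.6]]    # P(observed tone | state)
belief = [0.5, 0.5]                    # uniform prior over the two states

# After one hostile message the belief shifts toward "angry".
belief = forward_update(belief, transition, emission, observation=1)
```

Calling `forward_update` once per incoming message keeps a running belief without any pre-training, which is the sense of "online" tracking illustrated here.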
Details
Motivation: LLMs perform well in automated negotiation but are computationally expensive and raise data-privacy concerns, making them ill-suited to privacy-sensitive edge devices. SLMs are lightweight but weaker at emotional understanding and strategy, especially when playing complex emotional personas. A method is needed that improves SLM negotiation ability while preserving privacy.
Method: Proposes the EQ-Negotiator framework, whose core is a reasoning system that combines game theory with an HMM to learn and track the debtor's emotional state online, without pre-training. The system gives SLMs the strategic intelligence to counter manipulation, de-escalate conflict, and uphold ethical standards.
Result: In agent-to-agent simulations across multiple scenarios (including adversarial strategies such as cheating, threatening, and playing the victim), a 7B-parameter language model with EQ-Negotiator surpasses baseline LLMs more than 10 times its size in debt recovery and negotiation efficiency.
Conclusion: Strategic emotional intelligence matters more than model scale. EQ-Negotiator moves persona modeling from static character descriptions to dynamic emotional architectures, paving the way for efficient, ethical, privacy-preserving AI negotiators on edge devices.
Abstract: The deployment of large language models (LLMs) in automated negotiation has set a high performance benchmark, but their computational cost and data privacy requirements render them unsuitable for many privacy-sensitive, on-device applications such as mobile assistants, embodied AI agents or private client interactions. While small language models (SLMs) offer a practical alternative, they suffer from a significant performance gap compared to LLMs in playing emotionally charged complex personas, especially for credit negotiation. This paper introduces EQ-Negotiator, a novel framework that bridges this capability gap using emotional personas. Its core is a reasoning system that integrates game theory with a Hidden Markov Model (HMM) to learn and track debtor emotional states online, without pre-training. This allows EQ-Negotiator to equip SLMs with the strategic intelligence to counter manipulation while de-escalating conflict and upholding ethical standards. Through extensive agent-to-agent simulations across diverse credit negotiation scenarios, including adversarial debtor strategies like cheating, threatening, and playing the victim, we show that a 7B parameter language model with EQ-Negotiator achieves better debt recovery and negotiation efficiency than baseline LLMs more than 10 times its size. This work advances persona modeling from descriptive character profiles to dynamic emotional architectures that operate within privacy constraints. In addition, this paper establishes that strategic emotional intelligence, not raw model scale, is the critical factor for success in automated negotiation, paving the way for effective, ethical, and privacy-preserving AI negotiators that can operate on the edge.
[27] LFC-DA: Logical Formula-Controlled Data Augmentation for Enhanced Logical Reasoning
Shenghao Li
Main category: cs.CL
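TL;DR: Proposes LFC-DA, which uses symbolic logic to control the generation of diverse, logically rigorous natural-language questions, improving the logical reasoning ability of pretrained models.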
Details
Motivation: Existing logical data augmentation either relies on costly human annotation or generates examples directly with LLMs, which yields uninterpretable and logically homogeneous samples.
Method: Maps logical text to propositional expressions, compiles a compact rule library, discovers valid formulas through bounded state-space search, and verbalizes them back into natural-language questions.
Result: Significantly improves the logical-reasoning accuracy of pretrained models on ReClor and LogiQA.
Conclusion: LFC-DA achieves effective LLM-guided logical data augmentation that balances diversity with logical rigor.
Abstract: For complex logical data augmentation, heavy reliance on human annotation is costly, whereas direct generation with large language models yields uninterpretable and logically homogeneous examples. To address this, we present LFC-DA, a symbolic-logic-controlled pipeline: logical text is first mapped to propositional expressions, a compact rule library is compiled, and a bounded state-space search systematically discovers valid formulas that are then verbalized back into natural-language questions, ensuring both diversity and logical rigor under propositional logic. Experiments on ReClor and LogiQA show significant improvements in the logical-reasoning accuracy of pretrained models, confirming the effectiveness of LFC-DA for LLM-guided logical data augmentation.
[28] Segmentation Beyond Defaults: Asymmetrical Byte Pair Encoding for Optimal Machine Translation Performance
Saumitra Yadav,Manish Shrivastava
Main category: cs.CL
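TL;DR: This paper studies asymmetric Byte Pair Encoding (BPE) segmentation for machine translation and finds that using different numbers of merge operations (NMO) for the source and target languages significantly improves translation performance, especially in low-resource settings.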
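The asymmetric setup can be sketched with a toy BPE learner: one tokenizer per side, each trained with its own number of merge operations. This is a simplified illustration, assuming whitespace-split words, no end-of-word marker, and tiny made-up corpora and merge counts; it does not reproduce the paper's tooling or its 0.5K to 32K merge settings.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]


def apply_merge(symbols: Tuple[str, ...], pair: Pair) -> Tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(corpus: List[str], num_merges: int) -> List[Pair]:
    """Learn up to `num_merges` merge rules, most frequent pair first."""
    vocab = Counter(tuple(word) for line in corpus for word in line.split())
    merges: List[Pair] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        merged_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[apply_merge(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def segment(word: str, merges: List[Pair]) -> List[str]:
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Toy corpora; the target side reuses the same text purely for illustration.
src_corpus = ["lower lowest low", "newer newest new"]
tgt_corpus = ["lower lowest low", "newer newest new"]

# Asymmetric configuration: more merges for the source than for the target.
src_merges = learn_bpe(src_corpus, num_merges=8)
tgt_merges = learn_bpe(tgt_corpus, num_merges=2)
src_tokens = segment("lowest", src_merges)
tgt_tokens = segment("lowest", tgt_merges)
```

With more merges, the source side segments the same word into fewer, longer units than the target side, which is the knob the asymmetric recipe tunes independently per language.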
Details
Motivation: Prior work usually applies the same number of BPE merge operations to the source and target languages, but this symmetric approach is not necessarily optimal, particularly across different language pairs and data sizes.
Method: Compares symmetric and asymmetric BPE configurations across multiple data sizes and language pairs and analyzes their effect on MT performance.
Result: Asymmetric BPE significantly improves performance in low-resource conditions, with average gains of up to 5.32 CHRF++ on English-Hindi and statistically significant improvements in 10 of 12 systems across six additional language pairs.
Conclusion: A high NMO for the source language and a low NMO for the target language optimizes MT performance and particularly benefits low-resource language pairs.
Abstract: Existing Machine Translation (MT) research often suggests a single, fixed set of hyperparameters for word segmentation models, symmetric Byte Pair Encoding (BPE), which applies the same number of merge operations (NMO) to train tokenizers for both source and target languages. However, we demonstrate that this uniform approach doesn't guarantee optimal MT performance across different language pairs and data sizes. This work investigates BPE segmentation recipes across various data volumes and language pairs to evaluate MT system performance. We find that utilizing asymmetric BPE, where the source and target languages have different NMOs, significantly improves results over the symmetric approach, especially in low-resource settings (50K, 100K, and 500K sentence pairs). Specifically, asymmetric BPE yields statistically significant ($p<0.05$) average gains of 5.32, 4.46, and 0.7 CHRF++ on English-Hindi in low-resource setups. We validated this trend across six additional language pairs (English and Telugu, Shona, Norwegian, Kyrgyz, Hausa, and Inuktitut), observing statistically significant improvement in 10 out of 12 systems compared to symmetric BPE. Our findings indicate a high NMO for the source (4K to 32K) and a low NMO for the target (0.5K to 2K) provides optimal results, particularly benefiting low-resource MT.
[29] Overcoming the Generalization Limits of SLM Finetuning for Shape-Based Extraction of Datatype and Object Properties
Célian Ringwald,Fabien Gandon,Catherine Faron,Franck Michel,Hanna Abi Akl
Main category: cs.CL
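TL;DR: This paper studies how small language models (SLMs) perform when extracting complete RDF graphs containing both datatype and object properties, finds that the long-tail distribution of rare properties is the main bottleneck, and, after evaluating several strategies, finds that the most effective one is ensuring each property reaches a minimum number of occurrences in the training set.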
Details
Motivation: To explore how well SLMs handle relation extraction covering both datatype and object properties, especially the performance bottleneck posed by rare properties.
Method: Evaluates stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation as strategies for addressing property imbalance.
Result: Models perform best on unbalanced target properties when the number of occurrences of each property in the training set exceeds a given threshold.
Conclusion: Offers practical guidance for training shape-aware SLMs and points to promising directions for future work on semantic relation extraction.
Abstract: Small language models (SLMs) have shown promise for relation extraction (RE) when extracting RDF triples guided by SHACL shapes focused on common datatype properties. This paper investigates how SLMs handle both datatype and object properties for a complete RDF graph extraction. We show that the key bottleneck is related to long-tail distribution of rare properties. To solve this issue, we evaluate several strategies: stratified sampling, weighted loss, dataset scaling, and template-based synthetic data augmentation. We show that the best strategy to perform equally well over unbalanced target properties is to build a training set where the number of occurrences of each property exceeds a given threshold. To enable reproducibility, we publicly released our datasets, experimental results and code. Our findings offer practical guidance for training shape-aware SLMs and highlight promising directions for future work in semantic RE.
[30] Efficient Reasoning via Thought-Training and Thought-Free Inference
Canhui Wu,Qiong Cao,Chao Xue,Wei Xi,Xiaodong He
Main category: cs.CL
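TL;DR: This paper proposes 3TF, a framework that approaches efficient reasoning from a Short-to-Long perspective, so that models need not generate an explicit chain of thought at inference time while the reasoning quality of their non-reasoning outputs improves.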
Details
Motivation: Existing methods rely mainly on explicit reasoning and focus on compressing verbose reasoning outputs, leaving implicit reasoning ability largely unexplored.
Method: Trains a hybrid model with both reasoning and non-reasoning modes, then further trains it on CoT-annotated data so that it internalizes structured reasoning, while enforcing concise, thought-free outputs at inference time via the no-reasoning mode.
Result: 3TF substantially improves performance under thought-free inference across multiple reasoning benchmarks, showing that high-quality reasoning can be learned and executed implicitly.
Conclusion: 3TF decouples implicit reasoning from explicit output, demonstrating that complex reasoning is possible without generating an explicit chain of thought and offering a new paradigm for efficient reasoning.
Abstract: Recent advances in large language models (LLMs) have leveraged explicit Chain-of-Thought (CoT) prompting to improve reasoning accuracy. However, most existing methods primarily compress verbose reasoning outputs. These Long-to-Short transformations aim to improve efficiency, but still rely on explicit reasoning during inference. In this work, we introduce 3TF (Thought-Training and Thought-Free inference), a framework for efficient reasoning that takes a Short-to-Long perspective. We first train a hybrid model that can operate in both reasoning and non-reasoning modes, and then further train it on CoT-annotated data to internalize structured reasoning, while enforcing concise, thought-free outputs at inference time using the no-reasoning mode. Unlike compression-based approaches, 3TF improves the reasoning quality of non-reasoning outputs, enabling models to perform rich internal reasoning implicitly while keeping external outputs short. Empirically, 3TF-trained models obtain large improvements on reasoning benchmarks under thought-free inference, demonstrating that high quality reasoning can be learned and executed implicitly without explicit step-by-step generation.
[31] Knowledge-Augmented Question Error Correction for Chinese Question Answer System with QuestionRAG
Longpeng Qiu,Ting Li,Shuai Mao,Nan Yang,Xiaohui Yan
Main category: cs.CL
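TL;DR: QuestionRAG is a framework that improves LLM performance on question error correction through knowledge augmentation and reinforcement-learning alignment, effectively reducing misinterpretation and over-correction.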
Details
Motivation: When handling questions that contain input errors, LLMs tend to misinterpret user intent or over-modify the question's structure, which hurts the accuracy of QA systems.
Method: Proposes the QuestionRAG framework: (1) it enriches the input with external knowledge (e.g., search results, related entities) to mitigate misinterpretation; (2) it uses reinforcement learning to align the model's objective with precise correction and avoid over-correction.
Result: Experiments show that knowledge augmentation markedly improves understanding of faulty questions, and that RL-based alignment clearly outperforms traditional supervised fine-tuning (SFT) in instruction following and generalization.
Conclusion: By combining knowledge augmentation with RL alignment, QuestionRAG unlocks the full potential of LLMs for question correction.
Abstract: Input errors in question-answering (QA) systems often lead to incorrect responses. Large language models (LLMs) struggle with this task, frequently failing to interpret user intent (misinterpretation) or unnecessarily altering the original question's structure (over-correction). We propose QuestionRAG, a framework that tackles these problems. To address misinterpretation, it enriches the input with external knowledge (e.g., search results, related entities). To prevent over-correction, it uses reinforcement learning (RL) to align the model's objective with precise correction, not just paraphrasing. Our results demonstrate that knowledge augmentation is critical for understanding faulty questions. Furthermore, RL-based alignment proves significantly more effective than traditional supervised fine-tuning (SFT), boosting the model's ability to follow instructions and generalize. By integrating these two strategies, QuestionRAG unlocks the full potential of LLMs for the question correction task.
[32] CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field
Doria Bonzi,Alexandre Guiggi,Frédéric Béchet,Carlos Ramisch,Benoit Favre
Main category: cs.CL
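TL;DR: This paper introduces CareMedEval, a biomedical critical-appraisal dataset of 534 questions drawn from real exams taken by French medical students, for evaluating LLMs on critical reading of and reasoning about scientific literature. Experiments show that current models still struggle with these tasks, especially on study limitations and statistical analysis.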
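For readers unfamiliar with the Exact Match Rate reported for this benchmark, the sketch below shows one common reading of the metric for multiple-choice questions with several correct options: a prediction counts only if the selected set of options equals the gold set exactly. This definition and the toy answers are assumptions for illustration; the paper's precise scoring rules may differ.

```python
from typing import List, Set


def exact_match_rate(predictions: List[Set[str]], gold: List[Set[str]]) -> float:
    """Fraction of questions whose predicted option set equals the gold set."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    hits = sum(1 for p, g in zip(predictions, gold) if p == g)
    return hits / len(gold)


# Hypothetical answers to four multiple-answer questions.
gold = [{"A", "C"}, {"B"}, {"A", "B", "D"}, {"E"}]
pred = [{"A", "C"}, {"B", "C"}, {"A", "B", "D"}, {"D"}]

# Questions 1 and 3 match exactly; question 2 has an extra option and
# question 4 picks the wrong one, so the rate is 2 / 4.
rate = exact_match_rate(pred, gold)
```

Because a single extra or missing option makes the whole answer wrong, this metric is stricter than per-option accuracy.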
Details
Motivation: To improve LLMs' ability to perform critical appraisal and reasoning in biomedicine, and to fill the gap left by existing benchmarks in deep understanding of scientific literature.
Method: Builds a new dataset, CareMedEval, derived from authentic exams taken by French medical students and containing 534 questions based on 37 scientific articles, and benchmarks a range of generalist and biomedical-specialized LLMs under different context conditions.
Result: Existing models do not exceed an Exact Match Rate of 0.5 on CareMedEval; generating intermediate reasoning steps considerably improves performance, but models still do poorly on questions about study limitations and statistical analysis.
Conclusion: CareMedEval provides a challenging new benchmark for biomedical critical reasoning, exposes the limitations of current models, and points the way toward automated tools that support critical appraisal of the literature.
Abstract: Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) can offer promising support in this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: open and commercial models fail to exceed an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens considerably improves the results. Yet, models remain challenged especially on questions about study limitations and statistical analysis. CareMedEval provides a challenging benchmark for grounded reasoning, exposing current LLM limitations and paving the way for future development of automated support for critical appraisal.
[33] Kastor: Fine-tuned Small Language Models for Shape-based Active Relation Extraction
Ringwald Celian,Gandon Fabien,Faron Catherine,Michel Franck,Abi Akl Hanna
Main category: cs.CL
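TL;DR: This paper proposes Kastor, a framework that advances RDF pattern-based extraction to improve small language models for completing and refining knowledge bases in specialized domains.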
Details
Motivation: Completing and refining knowledge bases in specialized domains requires training small language models more efficiently from limited text and RDF data.
Method: Kastor reformulates the traditional task of validating against a single SHACL shape as evaluating all possible combinations of properties derived from the shape, selecting the optimal combination for each training example; it also uses an iterative learning process to refine noisy knowledge bases.
Result: The framework significantly improves model generalization and performance, producing more robust models that can uncover new, relevant facts.
Conclusion: Kastor improves the training of SLMs for RDF pattern-based extraction and suits knowledge-base completion in resource-constrained settings.
Abstract: RDF pattern-based extraction is a compelling approach for fine-tuning small language models (SLMs) by focusing a relation extraction task on a specified SHACL shape. This technique enables the development of efficient models trained on limited text and RDF data. In this article, we introduce Kastor, a framework that advances this approach to meet the demands for completing and refining knowledge bases in specialized domains. Kastor reformulates the traditional validation task, shifting from single SHACL shape validation to evaluating all possible combinations of properties derived from the shape. By selecting the optimal combination for each training example, the framework significantly enhances model generalization and performance. Additionally, Kastor employs an iterative learning process to refine noisy knowledge bases, enabling the creation of robust models capable of uncovering new, relevant facts.
[34] BanglaSTEM: A Parallel Corpus for Technical Domain Bangla-English Translation
Kazi Reyazul Hasan,Mubasshira Musarrat,A. B. M. Alim Al Islam,Muhammad Abdullah Adnan
Main category: cs.CL
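TL;DR: This paper presents the BanglaSTEM dataset, which aims to improve Bangla-to-English translation of technical text and thereby make LLMs more usable in STEM domains.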
Details
Motivation: Existing translation systems handle technical terms poorly, which makes it hard for Bangla speakers to use English-centric LLMs effectively for technical problems.
Method: Builds BanglaSTEM, a STEM-domain dataset of 5,000 high-quality Bangla-English sentence pairs, and trains a T5-based translation model on it.
Result: On code generation and math problem solving, the model significantly improves translation accuracy for technical content.
Conclusion: BanglaSTEM helps bridge the language gap so that Bangla speakers can use English LLMs more effectively; both the dataset and the model are publicly released.
Abstract: Large language models work well for technical problem solving in English but perform poorly when the same questions are asked in Bangla. A simple solution would be to translate Bangla questions into English first and then use these models. However, existing Bangla-English translation systems struggle with technical terms. They often mistranslate specialized vocabulary, which changes the meaning of the problem and leads to wrong answers. We present BanglaSTEM, a dataset of 5,000 carefully selected Bangla-English sentence pairs from STEM fields including computer science, mathematics, physics, chemistry, and biology. We generated over 12,000 translations using language models and then used human evaluators to select the highest quality pairs that preserve technical terminology correctly. We train a T5-based translation model on BanglaSTEM and test it on two tasks: generating code and solving math problems. Our results show significant improvements in translation accuracy for technical content, making it easier for Bangla speakers to use English-focused language models effectively. Both the BanglaSTEM dataset and the trained translation model are publicly released at https://huggingface.co/reyazul/BanglaSTEM-T5.
[35] HaluMem: Evaluating Hallucinations in Memory Systems of Agents
Ding Chen,Simin Niu,Kehang Li,Peng Liu,Xiangping Zheng,Bo Tang,Xinchi Li,Feiyu Xiong,Zhiyu Li
Main category: cs.CL
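TL;DR: This paper introduces HaluMem, the first operation-level hallucination benchmark for memory systems. It uses three tasks to expose hallucination behavior at different operational stages, builds large-scale multi-turn human-AI interaction datasets, and finds that existing memory systems tend to generate and accumulate hallucinations during extraction and updating, which then degrade question-answering performance.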
Details
Motivation: Existing evaluations of memory hallucination are mostly end-to-end question answering, which makes it hard to locate the operational stage where hallucinations arise; a fine-grained evaluation is needed to identify and analyze the sources of hallucination during memory storage and retrieval.
Method: Proposes the HaluMem benchmark, comprising memory extraction, memory updating, and memory question answering tasks, and builds two large-scale multi-turn dialogue datasets, HaluMem-Medium and HaluMem-Long, each with about 15k memory points and 3.5k questions, supporting hallucination evaluation across context scales and task complexities.
Result: Experiments show that existing memory systems tend to generate and accumulate hallucinations during extraction and updating, and that these errors propagate to the question-answering stage; HaluMem enables fine-grained localization and evaluation of hallucination at each operational stage.
Conclusion: Future work should focus on interpretable and constrained memory operation mechanisms that systematically suppress hallucinations and improve memory reliability.
Abstract: Memory systems are key components that enable AI systems such as LLMs and AI agents to achieve long-term learning and sustained interaction. However, during memory storage and retrieval, these systems frequently exhibit memory hallucinations, including fabrication, errors, conflicts, and omissions. Existing evaluations of memory hallucinations are primarily end-to-end question answering, which makes it difficult to localize the operational stage within the memory system where hallucinations arise. To address this, we introduce the Hallucination in Memory Benchmark (HaluMem), the first operation-level hallucination evaluation benchmark tailored to memory systems. HaluMem defines three evaluation tasks (memory extraction, memory updating, and memory question answering) to comprehensively reveal hallucination behaviors across different operational stages of interaction. To support evaluation, we construct user-centric, multi-turn human-AI interaction datasets, HaluMem-Medium and HaluMem-Long. Both include about 15k memory points and 3.5k multi-type questions. The average dialogue length per user reaches 1.5k and 2.6k turns, with context lengths exceeding 1M tokens, enabling evaluation of hallucinations across different context scales and task complexities. Empirical studies based on HaluMem show that existing memory systems tend to generate and accumulate hallucinations during the extraction and updating stages, which subsequently propagate errors to the question answering stage. Future research should focus on developing interpretable and constrained memory operation mechanisms that systematically suppress hallucinations and improve memory reliability.
[36] One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework
Qi Jia,Kaiwei Zhang,Xiujie Song,Ye Shen,Xiangyang Zhu,Guangtao Zhai
Main category: cs.CL
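TL;DR: This paper proposes an extensible framework, used to build the EvolIF benchmark, for evaluating LLMs' instruction following in multi-turn dialogue. It decouples linguistic surface forms from user-intent simulation, dynamically constructs benchmarks with state changes and tracebacks, and defines a suite of metrics for interaction quality. GPT-5 performs best, sustaining an average of 18.54 turns with 70.31% robustness, well ahead of Gemini-2.5-Pro.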
Details
Motivation: Existing benchmarks are usually limited to a fixed number of turns, saturate easily, and do not reflect the user's interactive experience, so they cannot realistically assess how well LLMs follow instructions across continuous multi-topic dialogue.
Method: Proposes a framework with a three-layer mechanism that separates linguistic form from user intent and tracks the state of constraints, instructions, and topics. It dynamically constructs multi-turn benchmarks with state changes and tracebacks, ending a conversation only when the model exhausts the simulated user's patience.
Result: The resulting EvolIF benchmark incorporates nine distinct constraint types. GPT-5 sustains an average of 18.54 turns with 70.31% instruction-following robustness, 11.41% higher than Gemini-2.5-Pro, while other models lag far behind.
Conclusion: The framework assesses instruction following in complex multi-turn dialogue more realistically, EvolIF offers an extensible benchmark for future research, and the results highlight the advantage of current leading models such as GPT-5 in sustained interaction.
Abstract: Understanding how well large language models can follow users' instructions throughout a dialogue spanning multiple topics is of great importance for data-intensive conversational applications. Existing benchmarks are often limited to a fixed number of turns, making them susceptible to saturation and failing to account for the user's interactive experience. In this work, we propose an extensible framework for assessing multi-turn instruction-following ability. At its core, our framework decouples linguistic surface forms from user intent simulation through a three-layer mechanism that tracks constraints, instructions, and topics. This framework mimics User-LLM interaction by enabling the dynamic construction of benchmarks with state changes and tracebacks, terminating a conversation only when the model exhausts a simulated user's patience. We define a suite of metrics capturing the quality of the interaction process. Using this framework, we construct EvolIF, an evolving instruction-following benchmark incorporating nine distinct constraint types. Our results indicate that GPT-5 exhibits superior instruction-following performance. It sustains an average of 18.54 conversational turns and demonstrates 70.31% robustness, outperforming Gemini-2.5-Pro by a significant margin of 11.41%, while other models lag far behind. All of the data and code will be made publicly available online.
[37] SOLVE-Med: Specialized Orchestration for Leading Vertical Experts across Medical Specialties
Roberta Di Marino,Giovanni Dioguardi,Antonio Romano,Giuseppe Riccio,Mariano Barone,Marco Postiglione,Flora Amato,Vincenzo Moscato
Main category: cs.CL
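TL;DR: SOLVE-Med is a multi-agent architecture for complex medical question answering that combines domain-specialized small language models to achieve strong performance while supporting local deployment.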
Details
Motivation: Address the deployment challenges of medical question answering systems: hallucinations, bias, computational demands, privacy concerns, and the need for expertise across domains. Method: A multi-agent architecture comprising a Router Agent that dynamically selects specialist models, ten small language models (1B parameters each) fine-tuned on specific medical domains, and an Orchestrator Agent that synthesizes the responses. Result: Evaluated on Italian medical forum data across ten specialties, SOLVE-Med reaches a ROUGE-1 of 0.301 and a BERTScore F1 of 0.697, outperforming standalone models of up to 14B parameters. Conclusion: Through the collaboration of modular, specialized small models, SOLVE-Med maintains high performance while supporting local deployment, addressing practical deployment problems of medical question answering systems. Abstract: Medical question answering systems face deployment challenges including hallucinations, bias, computational demands, privacy concerns, and the need for specialized expertise across diverse domains. Here, we present SOLVE-Med, a multi-agent architecture combining domain-specialized small language models for complex medical queries. The system employs a Router Agent for dynamic specialist selection, ten specialized models (1B parameters each) fine-tuned on specific medical domains, and an Orchestrator Agent that synthesizes responses. Evaluated on Italian medical forum data across ten specialties, SOLVE-Med achieves superior performance with ROUGE-1 of 0.301 and BERTScore F1 of 0.697, outperforming standalone models up to 14B parameters while enabling local deployment. Our code is publicly available on GitHub: https://github.com/PRAISELab-PicusLab/SOLVE-Med.
[38] Bearing Syntactic Fruit with Stack-Augmented Neural Networks
Brian DuSell,Ryan Cotterell
Main category: cs.CL
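TL;DR: This paper demonstrates, for the first time, neural networks that generalize in a human-like way without any special conditions: stack-augmented neural networks, which show stronger hierarchical generalization on a classical question formation task.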
Details
Motivation: Investigate whether neural networks can, like human children, favor hypotheses based on hierarchical syntactic rules without syntactic supervision, large-scale pre-training, or training long past convergence. Method: Test three base architectures (transformer, simple RNN, LSTM) combined with two kinds of stack (the superposition stack of Joulin & Mikolov and the nondeterministic stack of DuSell & Chiang) on a question formation task, and propose a modification to the stack RNN architecture. Result: Transformers with nondeterministic stacks generalize best on the task, and the proposed stack RNN modification improves hierarchical generalization. Conclusion: Stack-augmented neural networks are closer to human language acquisition than standard architectures and may be better models for psycholinguistic study. Abstract: Any finite set of training data is consistent with an infinite number of hypothetical algorithms that could have generated it. Studies have shown that when human children learn language, they consistently favor hypotheses based on hierarchical syntactic rules without ever encountering disambiguating examples. A recent line of work has inquired as to whether common neural network architectures share this bias, finding that they do so only under special conditions: when syntactically supervised, when pre-trained on massive corpora, or when trained long past convergence. In this paper, we demonstrate, for the first time, neural network architectures that are able to generalize in human-like fashion without any of the aforementioned requirements: stack-augmented neural networks. We test three base architectures (transformer, simple RNN, LSTM) augmented with two styles of stack: the superposition stack of Joulin & Mikolov (2015) and a nondeterministic generalization of it proposed by DuSell & Chiang (2023). We find that transformers with nondeterministic stacks generalize best out of these architectures on a classical question formation task. We also propose a modification to the stack RNN architecture that improves hierarchical generalization. These results suggest that stack-augmented neural networks may be more accurate models of human language acquisition than standard architectures, serving as useful objects of psycholinguistic study. Our code is publicly available.
[39] MultiZebraLogic: A Multilingual Logical Reasoning Benchmark
Sofie Helene Bruun,Dan Saattrup Smart
Main category: cs.CL
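TL;DR: This paper presents MultiZebraLogic, a logical reasoning dataset in nine Germanic languages for evaluating large language models across languages, themes, and difficulty levels. Using generated zebra puzzles of different sizes, themes, and distractor clues, the study finds that adding red herrings significantly lowers model accuracy, while changes in language and theme have little effect.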
Details
Motivation: Fully evaluating the logical reasoning of large language models requires high-quality benchmark datasets that span languages and tasks and have suitable difficulty. Existing datasets fall short in language diversity and systematic difficulty control. Method: Generate zebra puzzles in multiple languages (nine Germanic languages), themes, and sizes (2x3 and 4x5), using 14 clue types and 8 red herring (distractor) types to control difficulty, and evaluate them on models such as GPT-4o mini and o3-mini. The dataset and extensible generation code are released. Result: Puzzle sizes 2x3 and 4x5 are sufficiently challenging for non-reasoning and reasoning models, respectively. Adding 5 red herrings lowers o3-mini accuracy on 4x5 puzzles by 15±7%. Changes in language (English vs. Danish) and theme (houses vs. smørrebrød) have no significant effect on o3-mini scores, and no correlation is found between clue type and difficulty. Datasets of 128+1024 puzzles are published for each language. Conclusion: MultiZebraLogic is an extensible, highly diverse multilingual benchmark for logical reasoning that separates models of different reasoning ability, and red herrings offer a systematic way to raise difficulty, making it a useful resource for future multilingual reasoning research. Abstract: Measuring the full abilities of large language models (LLMs) requires benchmarks representing multiple tasks. We aim to create large, high-quality datasets for comparison of logical reasoning skills across several languages and of suitable difficulty for LLMs of various reasoning ability. We explore multiple ways of increasing difficulty. We generate zebra puzzles in multiple languages, themes, sizes and including 14 different clue types and 8 red herring types (uninformative clues). We find puzzle sizes 2x3 and 4x5 are sufficiently challenging for GPT-4o mini (a non-reasoning model) and o3-mini (a reasoning model), respectively. Including 5 red herrings decreases o3-mini puzzle-level accuracy on 4x5 puzzles by 15$\pm$7%. Scores of o3-mini on 4x5 puzzles are not significantly affected by use of English vs. Danish or the common houses theme vs. the country-specific smoerrebroed theme. We find no correlation between difficulty and the selected clue types. Datasets of 128+1024 puzzles are published as MultiZebraLogic in each of nine Germanic languages for sizes 2x3 and 4x5. We publish code for puzzle generation, designed for adaptability into more languages and themes.
[40] AILA--First Experiments with Localist Language Models
Joachim Diederich
Main category: cs.CL
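TL;DR: This paper proposes a new framework for controllable locality, in which a tunable parameter gives continuous control over how localized the representations in a language model are. Experiments show a tradeoff between interpretability and performance, with intermediate locality values performing best.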
Details
Motivation: Traditional language models rely on distributed representations and offer no flexible control over representation locality, which makes it hard to balance interpretability and performance. This calls for an architecture whose locality can be adjusted dynamically. Method: A transformer architecture with a tunable locality parameter λ that interpolates dynamically between localist and distributed representations without retraining, evaluated systematically on the WikiText corpus. Result: Localist configurations substantially reduce attention entropy (5.36 bits at λ = 1.0) and raise pointer fidelity. At λ = 0.6 the model reaches a test perplexity of 4.65 and an accuracy of 84.7%, the best tradeoff. Conclusion: Through explicit penalty thresholds and information-theoretic design principles, localist language models provide a practical framework with an adjustable interpretability-performance tradeoff for domains that require both transparency and capability. Abstract: This paper presents the first empirical demonstration of controllable locality in transformer language models, a novel architectural framework that enables continuous control over the degree of representation localization through a tunable locality dial parameter. Unlike traditional language models that rely exclusively on distributed representations, our approach allows dynamic interpolation between highly interpretable localist encodings and efficient distributed representations without requiring model retraining. We conducted experiments on the WikiText corpus using a two-layer transformer architecture, systematically varying the locality parameter $\lambda$ across the full spectrum from 1.0 (fully localist) to 0.0 (fully distributed). Our results demonstrate that localist configurations achieve dramatically lower attention entropy, with $\lambda$ = 1.0 yielding 5.36 bits compared to 7.18 bits at $\lambda$ = 0.0, while maintaining substantially higher pointer fidelity scores reflecting stronger alignment with rule-specified targets. Prediction experiments reveal that intermediate locality values optimize the tradeoff between interpretability and performance, with $\lambda$ = 0.6 achieving test perplexity of 4.65 and accuracy of 84.7%. These findings establish that localist language models provide a practical framework for applications in regulated domains requiring both transparency and capability, offering precise mathematical control over the interpretability-performance spectrum through explicit penalty thresholds and information-theoretic design principles.
[41] ASVRI-Legal: Fine-Tuning LLMs with Retrieval Augmented Generation for Enhanced Legal Regulation
One Octadion,Bondan Sapta Prakoso,Nanang Yudi Setiawan,Novanto Yudistira
Main category: cs.CL
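TL;DR: This study fine-tunes a large language model and combines it with Retrieval-Augmented Generation (RAG) to build a tool that helps policymakers understand and draft legal regulations.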
Details
Motivation: Improve the applicability of large language models in the legal domain and help policymakers analyze and draft regulations more efficiently. Method: Build a supervised training dataset for the legal domain and use Retrieval-Augmented Generation (RAG) to bring in up-to-date external legal knowledge. Result: The model understands legal texts accurately and provides useful support in interpreting and drafting regulations, improving the efficiency of legal research and regulation development. Conclusion: Combining fine-tuning with RAG makes large language models more useful for legal policymaking and has clear practical value. Abstract: In this study, we explore the fine-tuning of Large Language Models (LLMs) to better support policymakers in their crucial work of understanding, analyzing, and crafting legal regulations. To equip the model with a deep understanding of legal texts, we curated a supervised dataset tailored to the specific needs of the legal domain. Additionally, we integrated the Retrieval-Augmented Generation (RAG) method, enabling the LLM to access and incorporate up-to-date legal knowledge from external sources. This combination of fine-tuning and RAG-based augmentation results in a tool that not only processes legal information but actively assists policymakers in interpreting regulations and drafting new ones that align with current needs. The results demonstrate that this approach can significantly enhance the effectiveness of legal research and regulation development, offering a valuable resource in the ever-evolving field of law.
[42] Step-Audio-EditX Technical Report
Chao Yan,Boyong Wu,Peng Yang,Pengfei Tan,Guoqiang Hu,Yuxin Zhang,Xiangyu Zhang,Fei Tian,Xuerui Yang,Xiangyu Zhang,Daxin Jiang,Gang Yu
Main category: cs.CL
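TL;DR: Presents Step-Audio-EditX, the first open-source LLM-based audio editing model, which excels at expressive and iterative audio editing and offers robust zero-shot text-to-speech capabilities.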
Details
Motivation: Existing audio editing models fall short in expressivity, fine-grained control, and iterative editing, and most depend on embedding-based priors or auxiliary modules, which limits flexibility and scalability. Method: Train only on large-margin synthetic data, dropping embedding-based priors and auxiliary modules, and rely on large-margin learning to obtain iterative control and high expressivity across voices. Result: The model outperforms MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 on emotion editing and other fine-grained control tasks. Conclusion: Training on large-margin synthetic data is an effective route to high-quality, highly expressive audio editing and marks a fundamental shift away from the representation-disentanglement paradigm. Abstract: We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing encompassing emotion, speaking style, and paralinguistics alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.
[43] A systematic review of relation extraction task since the emergence of Transformers
Ringwald Celian,Gandon Fabien,Faron Catherine,Michel Franck,Abi Akl Hanna
Main category: cs.CL
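TL;DR: This paper systematically reviews Transformer-based relation extraction (RE) research from 2019 to 2024, analyzing 34 surveys, 64 datasets, and 104 models. It summarizes methodological advances, benchmark resources, and the integration of semantic web technologies, and identifies current trends, limitations, and open challenges.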
Details
Motivation: Relation extraction has developed rapidly since the rise of Transformer models, but recent results have not been surveyed systematically, and a comprehensive summary is needed to guide future research. Method: Use an automated framework to collect and annotate the literature, then analyze along multiple dimensions the 34 surveys, 64 datasets, and 104 models published between 2019 and 2024. Result: The review maps methodological advances, commonly used datasets, and model trends in relation extraction, and exposes current trends, limitations, and unsolved challenges. Conclusion: The study gives researchers and practitioners a comprehensive reference on the development of relation extraction and clarifies future directions and key problems. Abstract: This article presents a systematic review of relation extraction (RE) research since the advent of Transformer-based models. Using an automated framework to collect and annotate publications, we analyze 34 surveys, 64 datasets, and 104 models published between 2019 and 2024. The review highlights methodological advances, benchmark resources, and the integration of semantic web technologies. By consolidating results across multiple dimensions, the study identifies current trends, limitations, and open challenges, offering researchers and practitioners a comprehensive reference for understanding the evolution and future directions of RE.
[44] Towards Transparent Stance Detection: A Zero-Shot Approach Using Implicit and Explicit Interpretability
Apoorva Upadhyaya,Wolfgang Nejdl,Marco Fisichella
Main category: cs.CL
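TL;DR: This paper proposes IRIS, a new interpretable zero-shot stance detection framework that uses implicit and explicit rationales to identify stance toward unseen targets, improving both interpretability and generalization.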
Details
Motivation: Existing zero-shot stance detection methods fall short in generalization, coherence between text and target, and interpretability of the reasoning process. In particular, they over-rely on the explicit reasoning of large language models and lack fine-grained explanations. Method: Treat stance detection as an information retrieval ranking task, guiding predictions with implicit rationales from the text (the relevance of sequences) and explicit rationales based on linguistic features (emotional and cognitive dimensions), which yields inherent interpretability without ground-truth rationale annotations. Result: Experiments on the VAST, EZ-STANCE, P-Stance, and RFD datasets with 50%, 30%, and 10% of the training data show that IRIS generalizes and performs well. Conclusion: By combining implicit and explicit rationales, IRIS provides an interpretable ZSSD framework that needs no explicit rationale annotations, keeps strong performance in low-resource settings, and improves model transparency and trustworthiness. Abstract: Zero-Shot Stance Detection (ZSSD) identifies the attitude of the post toward unseen targets. Existing research using contrastive, meta-learning, or data augmentation suffers from generalizability issues or lack of coherence between text and target. Recent works leveraging large language models (LLMs) for ZSSD focus either on improving unseen target-specific knowledge or generating explanations for stance analysis. However, most of these works are limited by their over-reliance on explicit reasoning, provide coarse explanations that lack nuance, and do not explicitly model the reasoning process, making it difficult to interpret the model's predictions. To address these issues, in our study, we develop a novel interpretable ZSSD framework, IRIS. We provide an interpretable understanding of the attitude of the input towards the target implicitly based on sequences within the text (implicit rationales) and explicitly based on linguistic measures (explicit rationales). IRIS considers stance detection as an information retrieval ranking task, understanding the relevance of implicit rationales for different stances to guide the model towards correct predictions without requiring the ground-truth of rationales, thus providing inherent interpretability. In addition, explicit rationales based on communicative features help decode the emotional and cognitive dimensions of stance, offering an interpretable understanding of the author's attitude towards the given target. Extensive experiments on the benchmark datasets of VAST, EZ-STANCE, P-Stance, and RFD using 50%, 30%, and even 10% training data prove the generalizability of our model, benefiting from the proposed architecture and interpretable design.
[45] ChiMDQA: Towards Comprehensive Chinese Document QA with Fine-grained Evaluation
Jing Gao,Shutiao Luo,Yumeng Liu,Yuanming Li,Hongji Zeng
Main category: cs.CL
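TL;DR: This paper presents ChiMDQA, a high-quality dataset for multi-domain Chinese document question answering. It contains 6,068 QA pairs covering six domains (academic, education, finance, law, medical, and news), is subdivided into ten categories, and suits NLP tasks such as document comprehension, knowledge extraction, and intelligent QA.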
Details
Motivation: As natural language processing advances, demand for Chinese document question answering in real business scenarios keeps growing, yet high-quality, multi-domain Chinese multi-document QA datasets are lacking. A dataset with broad coverage, high quality, and applicability to many downstream tasks is therefore needed. Method: Through strict document screening and a systematic question-design methodology, collect long-form documents from six domains and build the ChiMDQA dataset of 6,068 high-quality QA pairs, subdivided into ten categories, together with a fine-grained evaluation system. Result: ChiMDQA is diverse and of high quality, suits NLP tasks such as document comprehension, knowledge extraction, and intelligent QA, and its code and data are public, supporting Chinese QA research. Conclusion: ChiMDQA is a high-quality, multi-domain, finely annotated Chinese multi-document QA dataset that can advance the development and application of Chinese information extraction and intelligent QA systems. Abstract: With the rapid advancement of natural language processing (NLP) technologies, the demand for high-quality Chinese document question-answering datasets is steadily growing. To address this issue, we present the Chinese Multi-Document Question Answering Dataset (ChiMDQA), specifically designed for downstream business scenarios across prevalent domains including academic, education, finance, law, medical treatment, and news. ChiMDQA encompasses long-form documents from six distinct fields, consisting of 6,068 rigorously curated, high-quality question-answer (QA) pairs further classified into ten fine-grained categories. Through meticulous document screening and a systematic question-design methodology, the dataset guarantees both diversity and high quality, rendering it applicable to various NLP tasks such as document comprehension, knowledge extraction, and intelligent QA systems. Additionally, this paper offers a comprehensive overview of the dataset's design objectives, construction methodologies, and fine-grained evaluation system, supplying a substantial foundation for future research and practical applications in Chinese QA. The code and data are available at: https://anonymous.4open.science/r/Foxit-CHiMDQA/.
[46] Do Androids Dream of Unseen Puppeteers? Probing for a Conspiracy Mindset in Large Language Models
Francesco Corso,Francesco Pierri,Gianmarco De Francisci Morales
Main category: cs.CL
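TL;DR: This study examines whether large language models (LLMs) show conspiratorial tendencies, whether they display sociodemographic biases, and how easily they can be steered toward conspiratorial views. Psychometric surveys show that LLMs partially agree with conspiracy beliefs and can easily be pushed toward them through prompting, revealing latent biases and manipulation risks.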
Details
Motivation: Conspiracy beliefs play a key role in the spread of misinformation and in distrust toward institutions. LLMs are often used as proxies for human behavior, but it is unclear whether they reproduce higher-order psychological constructs such as a conspiracy mindset, so their social fidelity needs to be examined. Method: Administer validated psychometric scales to multiple LLMs under different prompting and conditioning strategies, measuring their conspiracy mindset and the influence of sociodemographic attributes. Result: LLMs partially agree with elements of conspiracy belief. Conditioning on sociodemographic attributes has uneven effects, exposing latent biases. Targeted prompts shift model responses markedly toward conspiratorial directions, showing that the models are easy to manipulate. Conclusion: LLMs show conspiracy-like tendencies and are easily manipulated, so the psychological dimensions embedded in them need careful evaluation, both to advance computational social science and to develop mitigation strategies against harmful uses. Abstract: In this paper, we investigate whether Large Language Models (LLMs) exhibit conspiratorial tendencies, whether they display sociodemographic biases in this domain, and how easily they can be conditioned into adopting conspiratorial perspectives. Conspiracy beliefs play a central role in the spread of misinformation and in shaping distrust toward institutions, making them a critical testbed for evaluating the social fidelity of LLMs. LLMs are increasingly used as proxies for studying human behavior, yet little is known about whether they reproduce higher-order psychological constructs such as a conspiratorial mindset. To bridge this research gap, we administer validated psychometric surveys measuring conspiracy mindset to multiple models under different prompting and conditioning strategies. Our findings reveal that LLMs show partial agreement with elements of conspiracy belief, and conditioning with socio-demographic attributes produces uneven effects, exposing latent demographic biases. Moreover, targeted prompts can easily shift model responses toward conspiratorial directions, underscoring both the susceptibility of LLMs to manipulation and the potential risks of their deployment in sensitive contexts. These results highlight the importance of critically evaluating the psychological dimensions embedded in LLMs, both to advance computational social science and to inform possible mitigation strategies against harmful uses.
[47] Grounded Misunderstandings in Asymmetric Dialogue: A Perspectivist Annotation Scheme for MapTask
Nan Li,Albert Gatt,Massimo Poesio
Main category: cs.CL
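TL;DR: Proposes a perspectivist annotation scheme for the HCRC MapTask corpus that captures how speakers and addressees differ in their understanding of reference expressions, revealing referential misalignment hidden beneath apparent agreement.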
Details
Motivation: In asymmetric dialogue settings, participants may believe they have reached agreement while actually referring to different entities, which makes it hard to track how shared understanding is established. Method: Design an annotation scheme that separates the speaker's and the addressee's grounded interpretations, annotate 13k reference expressions in the HCRC MapTask corpus with a scheme-constrained LLM annotation pipeline, and analyze how understanding states evolve. Result: Full misunderstandings are rare once lexical variants are unified, but multiplicity discrepancies systematically induce divergences, showing that surface alignment can mask referential misalignment. Conclusion: The framework offers a new resource and analytic lens for studying grounded misunderstanding in collaborative dialogue, and can also be used to evaluate how well (V)LLMs model perspective-dependent grounding. Abstract: Collaborative dialogue relies on participants incrementally establishing common ground, yet in asymmetric settings they may believe they agree while referring to different entities. We introduce a perspectivist annotation scheme for the HCRC MapTask corpus (Anderson et al., 1991) that separately captures speaker and addressee grounded interpretations for each reference expression, enabling us to trace how understanding emerges, diverges, and repairs over time. Using a scheme-constrained LLM annotation pipeline, we obtain 13k annotated reference expressions with reliability estimates and analyze the resulting understanding states. The results show that full misunderstandings are rare once lexical variants are unified, but multiplicity discrepancies systematically induce divergences, revealing how apparent grounding can mask referential misalignment. Our framework provides both a resource and an analytic lens for studying grounded misunderstanding and for evaluating (V)LLMs' capacity to model perspective-dependent grounding in collaborative dialogue.
cs.CV [Back]
[48] Cropland Mapping using Geospatial Embeddings
Ivan Zvonkov,Gabriel Tseng,Inbal Becker-Reshef,Hannah Kerner
Main category: cs.CV
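TL;DR: This study evaluates geospatial embeddings for cropland mapping in Togo and finds that they simplify workflows and achieve high-accuracy classification, supporting better assessment of land use change and its climate impacts.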
Details
Motivation: Accurate, up-to-date land cover maps are essential for understanding land use change. Geospatial embeddings offer a more efficient and accessible way to map landscape features, but their potential in real-world mapping applications has not been fully explored. Method: Produce cropland maps from geospatial embeddings generated by Presto and AlphaEarth, and evaluate them in Togo. Result: Geospatial embeddings simplify the mapping workflow and achieve high-accuracy cropland classification. Conclusion: Geospatial embeddings can support the assessment of land use change and its climate impacts, and have potential for wider use in real-world mapping applications. Abstract: Accurate and up-to-date land cover maps are essential for understanding land use change, a key driver of climate change. Geospatial embeddings offer a more efficient and accessible way to map landscape features, yet their use in real-world mapping applications remains underexplored. In this work, we evaluated the utility of geospatial embeddings for cropland mapping in Togo. We produced cropland maps using embeddings from Presto and AlphaEarth. Our findings show that geospatial embeddings can simplify workflows, achieve high-accuracy cropland classification and ultimately support better assessments of land use change and its climate impacts.
[49] Generative Hints
Andy Dimnaku,Abdullah Yusuf Kavranoğlu,Yaser Abu-Mostafa
Main category: cs.CV
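TL;DR: Proposes generative hints, a method that uses a generative model to enforce known invariances directly over the entire input space, training the model in a semi-supervised manner on virtual examples and outperforming traditional data augmentation on multiple datasets.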
Details
Motivation: Traditional data augmentation learns invariances only on transformations of the training data and cannot adequately cover the whole input space, so models generalize poorly to some transformations. Method: Approximate the input distribution with a generative model trained on the training set and generate unlabeled virtual images. These virtual examples are used for semi-supervised training that combines the classification objective with a "hint" objective expressing the desired invariance, guiding the model to learn the intended functional property. Result: Across datasets, architectures, and loss functions, generative hints outperform standard data augmentation, with an average gain of 0.63% (up to 1.78%) on fine-grained visual classification tasks and an average gain of 1.286% on the CheXpert X-ray dataset. Conclusion: Generative hints introduce invariance priors more effectively than traditional data augmentation and improve generalization, making them a stronger training paradigm. Abstract: Data augmentation is widely used in vision to introduce variation and mitigate overfitting, through enabling models to learn invariant properties, such as spatial invariance. However, these properties are not fully captured by data augmentation alone, since it attempts to learn the property on transformations of the training data only. We propose generative hints, a training methodology that directly enforces known invariances in the entire input space. Our approach leverages a generative model trained on the training set to approximate the input distribution and generate unlabeled images, which we refer to as virtual examples. These virtual examples are used to enforce functional properties known as hints. In generative hints, although the training dataset is fully labeled, the model is trained in a semi-supervised manner on both the classification and hint objectives, using the unlabeled virtual examples to guide the model in learning the desired hint. Across datasets, architectures, and loss functions, generative hints consistently outperform standard data augmentation when learning the same property. On popular fine-grained visual classification benchmarks, we achieved up to 1.78% top-1 accuracy improvement (0.63% on average) over fine-tuned models with data augmentation and an average performance boost of 1.286% on the CheXpert X-ray dataset.
[50] ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology
Srikumar Sastry,Subash Khanal,Aayush Dhakal,Jiayu Lin,Dan Cher,Phoenix Jarosz,Nathan Jacobs
Main category: cs.CV
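TL;DR: Proposes ProM3E, a probabilistic masked multimodal embedding model for any-to-any generation of multimodal representations in ecology, which supports modality inversion and cross-modal retrieval and shows superior representation learning.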
Details
Motivation: Enable flexible fusion and generation of multimodal data in ecology, addressing the inference of missing modalities and the analysis of whether fusing given modalities is feasible. Method: Build on masked modality reconstruction in the embedding space with probabilistic modeling, support modality inversion, combine inter-modal and intra-modal similarities for cross-modal retrieval, and assess representation quality with linear probing. Result: The model achieves superior performance on all retrieval tasks, and linear probing results indicate strong multimodal representation learning. Conclusion: ProM3E can infer missing modalities and supports modality fusion analysis and cross-modal retrieval, making it a powerful and flexible multimodal representation learning framework for ecology. Abstract: We introduce ProM3E, a probabilistic masked multimodal embedding model for any-to-any generation of multimodal representations for ecology. ProM3E is based on masked modality reconstruction in the embedding space, learning to infer missing modalities given a few context modalities. By design, our model supports modality inversion in the embedding space. The probabilistic nature of our model allows us to analyse the feasibility of fusing various modalities for given downstream tasks, essentially learning what to fuse. Using these features of our model, we propose a novel cross-modal retrieval approach that mixes inter-modal and intra-modal similarities to achieve superior performance across all retrieval tasks. We further leverage the hidden representation from our model to perform linear probing tasks and demonstrate the superior representation learning capability of our model. All our code, datasets and model will be released at https://vishu26.github.io/prom3e.
[51] EvtSlowTV -- A Large and Diverse Dataset for Event-Based Depth Estimation
Sadiq Layi Macaulay,Nimet Kaygusuz,Simon Hadfield
Main category: cs.CV
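TL;DR: This paper introduces EvtSlowTV, a large-scale event camera dataset with more than 13 billion events for self-supervised depth estimation, which markedly improves model generalization in complex scenes.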
Details
Motivation: Existing event-based depth estimation methods are constrained by small annotated datasets and transfer poorly to real-world scenes, so a larger, more naturalistic dataset is needed to improve generalization. Method: Build the EvtSlowTV dataset from publicly available YouTube footage and train within a self-supervised learning framework directly on raw event streams, without frame-level annotations and while preserving the asynchronous nature of event data. Result: EvtSlowTV is an order of magnitude larger than existing datasets, and experiments show that models trained on it generalize better to complex environments and motions. Conclusion: EvtSlowTV provides large-scale, naturalistic real-world data for event-based deep learning and confirms that a self-supervised framework can exploit the high dynamic range potential of event data. Abstract: Event cameras, with their high dynamic range (HDR) and low latency, offer a promising alternative for robust depth estimation in challenging environments. However, many event-based depth estimation approaches are constrained by small-scale annotated datasets, limiting their generalizability to real-world scenarios. To bridge this gap, we introduce EvtSlowTV, a large-scale event camera dataset curated from publicly available YouTube footage, which contains more than 13B events across various environmental conditions and motions, including seasonal hiking, flying, scenic driving, and underwater exploration. EvtSlowTV is an order of magnitude larger than existing event datasets, providing an unconstrained, naturalistic setting for event-based depth learning. This work shows the suitability of EvtSlowTV for a self-supervised learning framework to capitalise on the HDR potential of raw event streams. We further demonstrate that training with EvtSlowTV enhances the model's ability to generalise to complex scenes and motions. Our approach removes the need for frame-based annotations and preserves the asynchronous nature of event data.
[52] Hybrid Convolution and Vision Transformer NAS Search Space for TinyML Image Classification
Mikhael Djajapermana,Moritz Reiber,Daniel Mueller-Gritschneder,Ulf Schlichtmann
Main category: cs.CV
TL;DR: Proposes a new hybrid CNN-ViT search space for Neural Architecture Search (NAS) to achieve efficient image classification at compact model sizes.
Details
Motivation: Existing hybrid CNN and Vision Transformer (ViT) models have many parameters and high computational cost, which makes them hard to deploy in tinyML settings. Method: Design a new search space containing hybrid CNN-ViT blocks and searchable pooling layers, combining local and global feature learning with efficient feature map reduction. Result: Experiments on the CIFAR10 dataset show that, under tight model size constraints, models produced by the proposed search space outperform ResNet-based tinyML models in both accuracy and inference speed. Conclusion: The search space can discover efficient hybrid CNN-ViT architectures suited to tinyML, balancing accuracy and computational efficiency. Abstract: Hybrids of Convolutional Neural Network (CNN) and Vision Transformer (ViT) have outperformed pure CNN or ViT architectures. However, since these architectures have many parameters and incur large computational costs, they are unsuitable for tinyML deployment. This paper introduces a new hybrid CNN-ViT search space for Neural Architecture Search (NAS) to find efficient hybrid architectures for image classification. The search space covers hybrid CNN and ViT blocks to learn local and global information, as well as the novel Pooling block of searchable pooling layers for efficient feature map reduction. Experimental results on the CIFAR10 dataset show that our proposed search space can produce hybrid CNN-ViT architectures with superior accuracy and inference speed to ResNet-based tinyML models under tight model size constraints.
[53] SCALE-VLP: Soft-Weighted Contrastive Volumetric Vision-Language Pre-training with Spatial-Knowledge Semantics
Ailar Mahdizadeh,Puria Azadi Moghadam,Xiangteng He,Shahriar Mirabbasi,Panos Nasiopoulos,Leonid Sigal
Main category: cs.CV
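TL;DR: SCALE-VLP is a soft-weighted contrastive vision-language pre-training framework for 3D medical imaging. It combines volumetric spatial semantics with domain knowledge (such as radiological ontologies) to obtain structurally consistent, semantically rich cross-modal alignment under limited supervision, performs strongly on retrieval, classification, and report generation, and generalizes well zero-shot across domains.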
Details
Motivation: Existing vision-language models mostly rely on 2D images and binary supervision, and struggle to model the continuous spatial structure and rich clinical semantics of 3D medical images such as CT, which leads to lost spatial coherence and underused semantics. Method: The SCALE-VLP framework introduces (1) volumetric spatial semantics to preserve anatomical structure and (2) domain-aware, knowledge-infused semantics (such as radiological ontologies) to guide vision-language alignment, and pre-trains with a soft-weighted contrastive learning strategy under limited annotation. Result: Compared with the previous state of the art, top-1 CT-report retrieval improves by up to 4.3x, abnormality classification improves by 10 percentage points, and report generation reaches ROUGE-L 0.44 and BERT-F1 0.89. Gains remain consistent in zero-shot evaluation on an out-of-domain dataset. Conclusion: By combining 3D spatial structure with domain knowledge, SCALE-VLP achieves better vision-language alignment under weak supervision and shows strong cross-task transfer and cross-domain generalization, offering an effective approach to medical vision-language understanding. Abstract: Vision-language models (VLMs) have demonstrated strong cross-modal capabilities, yet most work remains limited to 2D data and assumes binary supervision (i.e., positive vs. negative pairs), overlooking the continuous and structured dependencies present in volumetric data such as CT. Existing approaches often treat volumetric scans as independent 2D slices, compromising spatial coherence and underutilizing rich clinical semantics. We propose SCALE-VLP, a soft-weighted contrastive vision-language pre-training framework that integrates (i) volumetric spatial semantics to preserve anatomical structure and (ii) domain-aware, knowledge-infused semantics (e.g., radiological ontologies) to guide alignment. This yields structurally consistent and semantically grounded representations under limited supervision, demonstrating strong cross-task transferability (retrieval, report generation, and classification), and cross-domain generalizability with consistent gains without further fine-tuning. In particular, compared to the previous state of the art, SCALE-VLP achieves up to 4.3x higher top-1 CT-report retrieval, improves abnormality classification by 10 points, and reaches ROUGE-L 0.44 and BERT-F1 0.89 for report generation. Further, in zero-shot evaluation on an out-of-domain external dataset, we observe consistent gains, indicating the cross-task and cross-domain generalization ability of SCALE-VLP.
[54] Learning with less: label-efficient land cover classification at very high spatial resolution using self-supervised deep learning
Dakota Hester,Vitor S. Martins,Lucas B. Ferreira,Thainara M. A. Lima
Main category: cs.CV
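TL;DR: Proposes a label-efficient approach based on self-supervised learning for statewide 1-m resolution land cover classification, reaching high classification accuracy with only 1,000 annotated samples.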
Details
Motivation: Deep learning semantic segmentation performs well for high-resolution land cover classification, but the difficulty of obtaining large annotated datasets limits its wider use. A method that depends less on large volumes of labeled data is therefore needed. Method: Pre-train a ResNet-101 encoder with the "Bootstrap Your Own Latent" (BYOL) strategy on a large set of unlabeled color-infrared aerial images, transfer the pre-trained weights into several semantic segmentation models (such as U-Net and DeepLabV3+), and fine-tune them with cross-validation on very small annotated datasets. Result: Using only 250 to 750 annotated image patches for fine-tuning, an ensemble of the best U-Net models reaches 87.14% overall accuracy and a 75.58% macro F1 score, producing an 8-class land cover map that covers more than 123 billion pixels over the state of Mississippi. Conclusion: Self-supervised learning reduces the need for manually annotated data and offers a feasible, efficient solution for large-area, high-resolution land cover mapping. Abstract: Deep learning semantic segmentation methods have shown promising performance for very high 1-m resolution land cover classification, but the challenge of collecting large volumes of representative training data creates a significant barrier to widespread adoption of such models for meter-scale land cover mapping over large areas. In this study, we present a novel label-efficient approach for statewide 1-m land cover classification using only 1,000 annotated reference image patches with self-supervised deep learning. We use the "Bootstrap Your Own Latent" pre-training strategy with a large amount of unlabeled color-infrared aerial images (377,921 256x256 1-m pixel patches) to pre-train a ResNet-101 convolutional encoder. The learned encoder weights were subsequently transferred into multiple deep semantic segmentation architectures (FCN, U-Net, Attention U-Net, DeepLabV3+, UPerNet, PAN), which were then fine-tuned using very small training dataset sizes with cross-validation (250, 500, 750 patches). Among the fine-tuned models, we obtained 87.14% overall accuracy and 75.58% macro F1 score using an ensemble of the best performing U-Net models for comprehensive 1-m, 8-class land cover mapping, covering more than 123 billion pixels over the state of Mississippi, USA. Detailed qualitative and quantitative analysis revealed accurate mapping of open water and forested areas, while highlighting challenges in accurate delineation between cropland, herbaceous, and barren land cover types. These results show that self-supervised learning is an effective strategy for reducing the need for large volumes of manually annotated data, directly addressing a major limitation to high spatial resolution land cover mapping at scale.
[55] A Foundation Model for Brain MRI with Dynamic Modality Integration
Minh Sao Khue Luu,Bair N. Tuchinov
Main category: cs.CV
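TL;DR: Proposes a general foundation model for brain MRI that handles different combinations of imaging sequences, using learnable modality embeddings and a masked autoencoding objective to learn flexible representations when modalities are missing.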
Details
Motivation: Traditional approaches need a separate model for each modality and cope poorly with the missing or new modalities that are common in clinical practice, which limits generalization and practicality. Method: Use a single encoder with learnable modality embeddings and conditional layer normalization, a masked autoencoding objective, and a variance-covariance regularizer that improves feature diversity and stability, giving unified modeling of multiple and missing modalities. Result: The model is trained with self-supervision on about 60,000 multi-center MRIs. Preliminary results indicate feasibility across modality configurations, and evaluation on tumor and multiple sclerosis segmentation and lesion classification is planned. Conclusion: The model adapts to different input modality combinations without training a separate model per modality, is designed to cope with missing or unseen modalities, and shows potential for clinical use. Abstract: We present a foundation model for brain MRI that can work with different combinations of imaging sequences. The model uses one encoder with learnable modality embeddings, conditional layer normalization, and a masked autoencoding objective that accounts for missing modalities. A variance-covariance regularizer is applied to stabilize feature learning and improve representation diversity. This design removes the need for separate models for each modality and allows the network to adapt when some sequences are missing or unseen. It is trained on about 60,000 multi-center MRIs using self-supervised reconstruction and modality imputation to learn flexible representations. A learnable modality embedding guides feature extraction so the encoder can adjust to different inputs. We describe our planned evaluation on brain tumor and multiple sclerosis segmentation, as well as lesion classification, under various modality settings. Preliminary results show that the method is feasible, and further experiments are planned to study its performance in more detail. All code and pretrained models are available at https://github.com/BrainFM/brainfm
[56] SLIP: Structural-aware Language-Image Pretraining for Vision-Language Alignment
Wenbo Lu
Main category: cs.CV
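TL;DR: This paper proposes SLIP, a structure-aware language-image pretraining method that models relationships between entities through a structural contrastive loss; using a large-scale Amazon product co-purchase graph dataset, it outperforms CLIP on cross-modal retrieval and classification tasks.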
Details
Motivation: Existing vision-language pretraining methods ignore the rich relational structure naturally present in data, such as product co-purchase relations in e-commerce, whereas human knowledge is organized relationally; structural information is therefore needed to improve model performance. Method: Proposes the SLIP framework, which adds a structural contrastive loss to model relationships between neighboring entities in a structured graph while aligning modalities, and constructs a large-scale Amazon product co-purchase multimodal graph dataset. Result: On cross-modal retrieval and classification tasks in zero-shot and few-shot settings, SLIP consistently outperforms CLIP, confirming the value of relational supervision for cross-modal alignment. Conclusion: Incorporating structured relational information helps improve vision-language pretraining, and the structure-aware pretraining paradigm offers a new direction for multimodal learning. Abstract: Vision-Language Pretraining (VLP) has achieved remarkable success across various downstream tasks, but such gains are largely driven by scaling up training data. Yet existing methods treat image-text pairs as isolated training examples; this neglects the rich relational structure naturally present in many domains, such as e-commerce product co-purchase graphs and social recommendation networks. Inspired by neuroscientific evidence that humans encode knowledge as relational cognitive maps, we introduce Structure-aware Language-Image Pretraining (SLIP). SLIP integrates a structural contrastive loss to align modalities while also modeling relationships between neighboring entities in a structured graph. To support this paradigm, we construct a large-scale Amazon Product Co-purchase Multimodal Graph Dataset, enabling structured cross-modality supervision at scale. Experimental results show that SLIP consistently outperforms CLIP on cross-modal retrieval and classification tasks in both zero-shot and few-shot settings, showing the value of relational supervision for cross-modal alignment.
[57] From Propagation to Prediction: Point-level Uncertainty Evaluation of MLS Point Clouds under Limited Ground Truth
Ziyang Xu,Olaf Wysocki,Christoph Holst
Main category: cs.CV
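TL;DR: Proposes a learning-based framework for evaluating the uncertainty of mobile laser scanning point clouds that combines optimal neighborhood estimation with geometric feature extraction, reducing reliance on ground truth.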
Details
Motivation: 减少在不确定性评估研究中对昂贵且难以获取的真实数据(GT)的依赖。 Method: 结合最优邻域估计与几何特征提取,使用XGBoost和随机森林模型进行实验。 Result: XGBoost模型精度与随机森林相当,但效率提高约3倍,验证了几何特征可用于预测点级不确定性。 Conclusion: MLS点云的不确定性是可学习的,为不确定性评估研究提供了新的基于学习的视角。 Abstract: Evaluating uncertainty is critical for reliable use of Mobile Laser Scanning (MLS) point clouds in many high-precision applications such as Scan-to-BIM, deformation analysis, and 3D modeling. However, obtaining the ground truth (GT) for evaluation is often costly and infeasible in many real-world applications. To reduce this long-standing reliance on GT in uncertainty evaluation research, this study presents a learning-based framework for MLS point clouds that integrates optimal neighborhood estimation with geometric feature extraction. Experiments on a real-world dataset show that the proposed framework is feasible and the XGBoost model delivers fully comparable accuracy to Random Forest while achieving substantially higher efficiency (about 3 times faster), providing initial evidence that geometric features can be used to predict point-level uncertainty quantified by the C2C distance. In summary, this study shows that MLS point clouds' uncertainty is learnable, offering a novel learning-based viewpoint towards uncertainty evaluation research.[58] A Plug-and-Play Framework for Volumetric Light-Sheet Image Reconstruction
Yi Gong,Xinyuan Zhang,Jichen Chai,Yichen Ding,Yifei Lou
Main category: cs.CV
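TL;DR: Proposes a computational imaging framework that combines compressive sensing with light-sheet microscopy for efficient, low-phototoxicity, high-speed cardiac imaging, achieving high-quality image reconstruction through random binary mask coding and a Plug-and-Play algorithm.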
Details
Motivation: 传统光学成像在时空分辨率之间存在权衡,难以捕捉跳动心脏中的动态细胞结构,因此需要一种能突破该限制的高效成像方法。 Method: 将压缩感知(CS)与光片显微镜(LSM)结合,利用数字微镜器件(DMD)进行荧光信号的压缩采集,并采用基于ADMM的Plug-and-Play框架,集成Tikhonov、TV和BM3D等先进去噪器,引入时间正则化以保持z切片间的结构连续性。 Result: 在斑马鱼心脏成像实验中,即使在高压缩比下,该方法仍能成功重建出清晰的细胞结构,表现出优异的去噪效果和图像质量。 Conclusion: 所提方法在真实高速、低光照生物成像场景中具有有效性与鲁棒性,为活体心脏动态成像提供了强有力的技术支持。 Abstract: Cardiac contraction is a rapid, coordinated process that unfolds across three-dimensional tissue on millisecond timescales. Traditional optical imaging is often inadequate for capturing dynamic cellular structure in the beating heart because of a fundamental trade-off between spatial and temporal resolution. To overcome these limitations, we propose a high-performance computational imaging framework that integrates Compressive Sensing (CS) with Light-Sheet Microscopy (LSM) for efficient, low-phototoxic cardiac imaging. The system performs compressed acquisition of fluorescence signals via random binary mask coding using a Digital Micromirror Device (DMD). We propose a Plug-and-Play (PnP) framework, solved using the alternating direction method of multipliers (ADMM), which flexibly incorporates advanced denoisers, including Tikhonov, Total Variation (TV), and BM3D. To preserve structural continuity in dynamic imaging, we further introduce temporal regularization enforcing smoothness between adjacent z-slices. Experimental results on zebrafish heart imaging under high compression ratios demonstrate that the proposed method successfully reconstructs cellular structures with excellent denoising performance and image clarity, validating the effectiveness and robustness of our algorithm in real-world high-speed, low-light biological imaging scenarios.[59] ISC-Perception: A Hybrid Computer Vision Dataset for Object Detection in Novel Steel Assembly
Miftahur Rahman,Samuel Adebayo,Dorian A. Acevedo-Mejia,David Hester,Daniel McPolin,Karen Rafferty,Debra F. Laefer
Main category: cs.CV
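TL;DR: This paper presents ISC-Perception, the first hybrid dataset dedicated to ISC component detection; it combines procedurally rendered CAD images, photorealistic game-engine scenes, and a small number of real photographs, substantially reducing manual annotation time and improving detector performance.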
Details
Motivation: Collecting images on construction sites raises logistical, safety, and privacy problems, and the lack of a dedicated image dataset has held back the development of ISC-aware robots. An efficient, scalable dataset is therefore needed to advance this research. Method: Builds ISC-Perception, a hybrid dataset of synthetic images (CAD renders and game-engine scenes) and a small number of real images, with fully automatic labelling of the synthetic portion; the human time spent across the entire dataset construction process is quantified and compared with the cost of manual labelling. Result: For a 10,000-image dataset, human time falls from 166.7 hours to 30.5 hours, a reduction of 81.7%; the trained detectors reach an mAP of 0.756 at IoU 0.50, outperforming models trained on synthetic-only or photorealistic-only images; on a 1,200-frame test set, mAP@0.50 reaches 0.943 and mAP@[0.50:0.95] is 0.823. Conclusion: ISC-Perception fills a data gap in construction-robotics perception, supports rapid development of custom object detectors, and is freely available for research and industrial use upon request. Abstract: The Intermeshed Steel Connection (ISC) system, when paired with robotic manipulators, can accelerate steel-frame assembly and improve worker safety by eliminating manual assembly. Dependable perception is one of the initial stages for ISC-aware robots. However, this is hampered by the absence of a dedicated image corpus, as collecting photographs on active construction sites is logistically difficult and raises safety and privacy concerns. In response, we introduce ISC-Perception, the first hybrid dataset expressly designed for ISC component detection. It blends procedurally rendered CAD images, game-engine photorealistic scenes, and a limited, curated set of real photographs, enabling fully automatic labelling of the synthetic portion. We explicitly account for all human effort to produce the dataset, including simulation engine and scene setup, asset preparation, post-processing scripts and quality checks; our total human time to generate a 10,000-image dataset was 30.5 h versus 166.7 h for manual labelling at 60 s per image (an 81.7% reduction). A manual pilot on a representative image with five instances of ISC members took 60 s (maximum 80 s), anchoring the manual baseline. Detectors trained on ISC-Perception achieved a mean Average Precision at IoU 0.50 of 0.756, substantially surpassing models trained on synthetic-only or photorealistic-only data. On a 1,200-frame bench test, we report mAP@0.50/mAP@[0.50:0.95] of 0.943/0.823.
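The labelling-time figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes only the numbers stated in the abstract (10,000 images, 60 s per image, 30.5 h for the hybrid pipeline); the function and variable names are illustrative and do not come from the paper.

```python
# Worked check of the labelling-time figures quoted in the abstract.
# All constants are taken from the abstract; the names are illustrative.
N_IMAGES = 10_000          # dataset size
SECONDS_PER_IMAGE = 60     # manual labelling time per image
HYBRID_HOURS = 30.5        # reported total human time for the hybrid pipeline


def manual_hours(n_images: int, seconds_per_image: float) -> float:
    """Total manual labelling time in hours."""
    return n_images * seconds_per_image / 3600


def reduction_percent(baseline_hours: float, new_hours: float) -> float:
    """Relative saving of new_hours with respect to baseline_hours, in percent."""
    return 100 * (baseline_hours - new_hours) / baseline_hours


baseline = manual_hours(N_IMAGES, SECONDS_PER_IMAGE)
saving = reduction_percent(baseline, HYBRID_HOURS)
print(f"manual baseline: {baseline:.1f} h")   # 166.7 h
print(f"reduction: {saving:.1f}%")            # 81.7%
```

Both printed values agree with the 166.7 h baseline and the 81.7% reduction reported in the abstract.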
By bridging the data gap for construction-robotics perception, ISC-Perception facilitates rapid development of custom object detectors and is freely available for research and industrial use upon request.[60] DentalSplat: Dental Occlusion Novel View Synthesis from Sparse Intra-Oral Photographs
Yiyi Miao,Taoyu Wu,Tong Chen,Sihao Li,Ji Jiang,Youpeng Yang,Angelos Stefanidis,Limin Yu,Jionglong Su
Main category: cs.CV
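TL;DR: This paper proposes DentalSplat, a 3D reconstruction framework for sparse orthodontic images that uses prior-guided stereo reconstruction and an adaptive pruning strategy, together with optical-flow geometric constraints, to achieve high-quality 3D visualization of dental occlusion from only three views.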
Details
Motivation: 传统3D高斯溅射方法依赖多视角密集输入和精确相机位姿,难以适用于仅含前视和双侧颊视三张稀疏图像的正畸远程医疗场景。 Method: 提出DentalSplat框架:首先采用先验引导的稠密立体重建模型初始化点云,然后引入尺度自适应剪枝策略提升训练效率与重建质量;在视图极稀疏情况下,结合光流作为几何约束和梯度正则化以增强渲染保真度。 Result: 在包含950个临床病例的大规模数据集和195个视频测试案例上验证,该方法在极端稀疏视角下显著优于现有最先进方法,实现了高质量的新视角合成。 Conclusion: DentalSplat有效解决了稀疏输入和无相机位姿条件下的正畸3D重建难题,提升了远程正畸诊疗中牙齿咬合可视化的实用性与准确性。 Abstract: In orthodontic treatment, particularly within telemedicine contexts, observing patients' dental occlusion from multiple viewpoints facilitates timely clinical decision-making. Recent advances in 3D Gaussian Splatting (3DGS) have shown strong potential in 3D reconstruction and novel view synthesis. However, conventional 3DGS pipelines typically rely on densely captured multi-view inputs and precisely initialized camera poses, limiting their practicality. Orthodontic cases, in contrast, often comprise only three sparse images, specifically, the anterior view and bilateral buccal views, rendering the reconstruction task especially challenging. The extreme sparsity of input views severely degrades reconstruction quality, while the absence of camera pose information further complicates the process. To overcome these limitations, we propose DentalSplat, an effective framework for 3D reconstruction from sparse orthodontic imagery. Our method leverages a prior-guided dense stereo reconstruction model to initialize the point cloud, followed by a scale-adaptive pruning strategy to improve the training efficiency and reconstruction quality of 3DGS. In scenarios with extremely sparse viewpoints, we further incorporate optical flow as a geometric constraint, coupled with gradient regularization, to enhance rendering fidelity. We validate our approach on a large-scale dataset comprising 950 clinical cases and an additional video-based test set of 195 cases designed to simulate real-world remote orthodontic imaging conditions. 
Experimental results demonstrate that our method effectively handles sparse input scenarios and achieves superior novel view synthesis quality for dental occlusion visualization, outperforming state-of-the-art techniques.[61] Image-Intrinsic Priors for Integrated Circuit Defect Detection and Novel Class Discovery via Self-Supervised Learning
Botong Zhao,Xubin Wang,Shujing Lyu,Yue Lu
Main category: cs.CV
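TL;DR: Proposes IC DefectNCD, a framework that exploits the intrinsic priors of IC SEM images for support-set-free defect detection and novel class discovery, using self-supervised methods to localize and classify defects precisely on real-world data.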
Details
Motivation: 集成电路制造中缺陷检测面临标注成本高、新兴缺陷和罕见缺陷难以识别的问题,现有监督和无监督方法存在局限性。 Method: 提出IC DefectNCD:1)基于自归一化信息引导的缺陷检测,通过可学习的正常信息提取器聚合特征并利用重构残差定位缺陷;2)自适应二值化策略稳定突出核心缺陷区域;3)基于软掩码引导注意力机制的缺陷分类,将空间缺陷先验注入师生模型,提升对缺陷区域的敏感性。 Result: 在涵盖三个关键制造阶段、15种缺陷类型的真实世界数据集上验证,该方法在缺陷检测和未见缺陷分类任务上均表现出鲁棒性能。 Conclusion: IC DefectNCD通过利用图像内在先验,有效实现了无需人工标注支持集的缺陷检测与新类发现,具有良好的实际应用潜力。 Abstract: Integrated circuit manufacturing is highly complex, comprising hundreds of process steps. Defects can arise at any stage, causing yield loss and ultimately degrading product reliability. Supervised methods require extensive human annotation and struggle with emergent categories and rare, data scarce defects. Clustering-based unsupervised methods often exhibit unstable performance due to missing priors. We propose IC DefectNCD, a support set free framework that leverages Image Intrinsic Priors in IC SEM images for defect detection and novel class discovery. We first develop Self Normal Information Guided IC Defect Detection, aggregating representative normal features via a learnable normal information extractor and using reconstruction residuals to coarsely localize defect regions. To handle saliency variations across defects, we introduce an adaptive binarization strategy that produces stable subimages focused on core defective areas. Finally, we design Self Defect Information Guided IC Defect Classification, which incorporates a soft mask guided attention mechanism to inject spatial defect priors into the teacher student model. This enhances sensitivity to defective regions, suppresses background interference, and enables recognition and classification of unseen defects. We validate the approach on a real world dataset spanning three key fabrication stages and covering 15 defect types. Experiments demonstrate robust performance on both defect detection and unseen defect classification.[62] Accelerating Physical Property Reasoning for Augmented Visual Cognition
Hongbo Lan,Zhenlin An,Haoyu Li,Vaibhav Singh,Longfei Shangguan
Main category: cs.CV
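TL;DR: This paper presents a system called \sysname that, through algorithmic and system-level optimizations, reduces the latency of vision-guided physical property reasoning from 10–20 minutes to under 6 seconds while matching or slightly exceeding existing methods in accuracy.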
Details
Motivation: Enabling augmented visual cognition requires accelerating vision-guided physical property reasoning and reducing its run-time latency. Method: Applies algorithmic and system-level optimizations, including rapid geometric 3D reconstruction, efficient semantic feature fusion, and parallel view encoding. Result: On the ABO dataset, \sysname cuts end-to-end latency by 62.9× to 287.2× while matching or slightly exceeding state-of-the-art methods in object-level physical property estimation accuracy (e.g., mass) and outperforming them in material segmentation and voxel-level inference; combined with gaze tracking, it remains robust in real-world tests with Meta Aria glasses. Conclusion: \sysname substantially improves the efficiency and practicality of vision-guided physical property reasoning and suits real-time applications such as smart glasses. Abstract: This paper introduces \sysname, a system that accelerates vision-guided physical property reasoning to enable augmented visual cognition. \sysname minimizes the run-time latency of this reasoning pipeline through a combination of algorithmic and system-level optimizations, including rapid geometric 3D reconstruction, efficient semantic feature fusion, and parallel view encoding. Through these simple yet effective optimizations, \sysname reduces the end-to-end latency of this reasoning pipeline from 10–20 minutes to less than 6 seconds. A head-to-head comparison on the ABO dataset shows that \sysname achieves this 62.9× to 287.2× speedup while not only reaching on-par (and sometimes slightly better) object-level physical property estimation accuracy (e.g., mass), but also demonstrating superior performance in material segmentation and voxel-level inference compared with two SOTA baselines. We further combine gaze-tracking with \sysname to localize the object of interest in cluttered, real-world environments, streamlining the physical property reasoning on smart glasses. The case study with Meta Aria Glasses conducted at an IKEA furniture store demonstrates that \sysname achieves consistently high performance compared to controlled captures, providing robust property estimations even with fewer views in real-world scenarios.
[63] Deploying Rapid Damage Assessments from sUAS Imagery for Disaster Response
Thomas Manzini,Priyankari Perali,Robin R. Murphy
Main category: cs.CV
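TL;DR: This paper describes the first AI/ML system for building damage assessment from drone (sUAS) imagery to be deployed operationally in federal disaster response, used during the responses to Hurricanes Debby and Helene.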
Details
Motivation: 应对灾害后无人机团队采集的海量影像数据超出人力分析能力,导致响应延迟,亟需自动化工具缓解数据处理压力。 Method: 利用包含21,716个建筑损毁标签的最大已知数据集训练计算机视觉与机器学习模型,并对91名灾害从业人员进行操作培训,最终部署最优模型于实际灾害响应中。 Result: 该系统在飓风Debby和Helene响应中,用约18分钟完成了415栋建筑的损毁评估,显著提升分析效率。 Conclusion: 本研究建立了基于无人机影像的损毁评估实际应用标准,推动了AI/ML技术在灾害响应中的落地,并为研究与应用社区提供了宝贵经验。 Abstract: This paper presents the first AI/ML system for automating building damage assessment in uncrewed aerial systems (sUAS) imagery to be deployed operationally during federally declared disasters (Hurricanes Debby and Helene). In response to major disasters, sUAS teams are dispatched to collect imagery of the affected areas to assess damage; however, at recent disasters, teams collectively delivered between 47GB and 369GB of imagery per day, representing more imagery than can reasonably be transmitted or interpreted by subject matter experts in the disaster scene, thus delaying response efforts. To alleviate this data avalanche encountered in practice, computer vision and machine learning techniques are necessary. While prior work has been deployed to automatically assess damage in satellite imagery, there is no current state of practice for sUAS-based damage assessment systems, as all known work has been confined to academic settings. This work establishes the state of practice via the development and deployment of models for building damage assessment with sUAS imagery. The model development involved training on the largest known dataset of post-disaster sUAS aerial imagery, containing 21,716 building damage labels, and the operational training of 91 disaster practitioners. The best performing model was deployed during the responses to Hurricanes Debby and Helene, where it assessed a combined 415 buildings in approximately 18 minutes. 
This work contributes documentation of the actual use of AI/ML for damage assessment during a disaster and lessons learned to the benefit of the AI/ML research and user communities.[64] Finetuning-Free Personalization of Text to Image Generation via Hypernetworks
Sagar Shrestha,Gopal Sharma,Luowei Zhou,Suren Kumar
Main category: cs.CV
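TL;DR: This paper proposes a fine-tuning-free method for personalizing text-to-image diffusion models, in which a hypernetwork predicts LoRA adapter weights directly from subject images, and introduces HM-CFG to improve compositional generalization.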
Details
Motivation: Traditional personalization methods such as DreamBooth rely on subject-specific fine-tuning, which is computationally expensive and slow at inference; existing adapter- or encoder-based methods still need additional fine-tuning or large backbone models. Method: Trains a hypernetwork end to end to predict LoRA weights, stabilizes training with a simple output regularization, and proposes Hybrid-Model Classifier-Free Guidance (HM-CFG) to strengthen compositional generalization at inference time. Result: Experiments on CelebA-HQ, AFHQ-v2, and DreamBench show strong personalization performance while preserving subject fidelity and prompt alignment. Conclusion: Hypernetworks offer a scalable and effective direction for open-category personalization without per-subject optimization at test time. Abstract: Personalizing text-to-image diffusion models has traditionally relied on subject-specific fine-tuning approaches such as DreamBooth (Ruiz et al., 2023), which are computationally expensive and slow at inference. Recent adapter- and encoder-based methods attempt to reduce this overhead but still depend on additional fine-tuning or large backbone models for satisfactory results. In this work, we revisit an orthogonal direction: fine-tuning-free personalization via Hypernetworks that predict LoRA-adapted weights directly from subject images. Prior hypernetwork-based approaches, however, suffer from costly data generation or unstable attempts to mimic base model optimization trajectories. We address these limitations with an end-to-end training objective, stabilized by a simple output regularization, yielding reliable and effective hypernetworks. Our method removes the need for per-subject optimization at test time while preserving both subject fidelity and prompt alignment. To further enhance compositional generalization at inference time, we introduce Hybrid-Model Classifier-Free Guidance (HM-CFG), which combines the compositional strengths of the base diffusion model with the subject fidelity of personalized models during sampling. Extensive experiments on CelebA-HQ, AFHQ-v2, and DreamBench demonstrate that our approach achieves strong personalization performance and highlight the promise of hypernetworks as a scalable and effective direction for open-category personalization.
[65] Subsampled Randomized Fourier GaLore for Adapting Foundation Models in Depth-Driven Liver Landmark Segmentation
Yun-Chen Lin,Jiayuan Huang,Hanyuan Zhang,Sergi Kavtaradze,Matthew J. Clarkson,Mobarak I. Hoque
Main category: cs.CV
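TL;DR: Proposes a depth-guided framework for segmenting liver anatomical landmarks that combines RGB and depth features and fine-tunes SAM2 efficiently with SRFT-GaLore, an improved low-rank gradient projection method; it shows superior segmentation accuracy and cross-dataset robustness on L3D and the newly built LLSD dataset.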
Details
Motivation: 在腹腔镜肝手术中,二维视频缺乏深度感知,导致解剖标志定位困难;现有方法在融合RGB与深度特征及适应大规模视觉模型到手术场景方面仍存在挑战。 Method: 采用SAM2编码器提取RGB特征,DA2编码器提取深度感知特征,设计SRFT-GaLore方法替代SVD实现高效低秩梯度投影以微调SAM2,并通过交叉注意力模块融合多模态特征。 Result: 在L3D数据集上Dice系数提升4.85%,平均对称表面距离减少11.78;在新建的LLSD数据集上显著优于SAM基线方法,展现良好泛化能力。 Conclusion: 所提出的SRFT-GaLore增强双编码器框架能有效融合语义与几何线索,实现在实时、深度受限手术环境下的可扩展且精确的肝脏标志分割。 Abstract: Accurate detection and delineation of anatomical structures in medical imaging are critical for computer-assisted interventions, particularly in laparoscopic liver surgery where 2D video streams limit depth perception and complicate landmark localization. While recent works have leveraged monocular depth cues for enhanced landmark detection, challenges remain in fusing RGB and depth features and in efficiently adapting large-scale vision models to surgical domains. We propose a depth-guided liver landmark segmentation framework integrating semantic and geometric cues via vision foundation encoders. We employ Segment Anything Model V2 (SAM2) encoder to extract RGB features and Depth Anything V2 (DA2) encoder to extract depth-aware features. To efficiently adapt SAM2, we introduce SRFT-GaLore, a novel low-rank gradient projection method that replaces the computationally expensive SVD with a Subsampled Randomized Fourier Transform (SRFT). This enables efficient fine-tuning of high-dimensional attention layers without sacrificing representational power. A cross-attention fusion module further integrates RGB and depth cues. To assess cross-dataset generalization, we also construct a new Laparoscopic Liver Surgical Dataset (LLSD) as an external validation benchmark. On the public L3D dataset, our method achieves a 4.85% improvement in Dice Similarity Coefficient and a 11.78-point reduction in Average Symmetric Surface Distance compared to the D2GPLand. To further assess generalization capability, we evaluate our model on LLSD dataset. 
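The idea of replacing an SVD with a Subsampled Randomized Fourier Transform when building a low-rank gradient projection can be illustrated with a generic randomized range finder. The NumPy sketch below is not the authors' SRFT-GaLore implementation: the function name `srft_basis`, the use of a complex FFT, and the QR orthonormalization step are all assumptions made for illustration.

```python
import numpy as np


def srft_basis(grad: np.ndarray, rank: int, rng: np.random.Generator) -> np.ndarray:
    """Orthonormal basis (m x rank) for an approximate column space of grad.

    Illustrative sketch: Y = G D F S, where D flips column signs at random,
    F is the unitary DFT and S keeps `rank` random columns. The usual
    sqrt(n / rank) scale is omitted because QR normalizes the columns anyway.
    """
    m, n = grad.shape
    signs = rng.choice([-1.0, 1.0], size=n)                  # D
    mixed = np.fft.fft(grad * signs, axis=1, norm="ortho")   # (G D) F
    cols = rng.choice(n, size=rank, replace=False)           # S
    q, _ = np.linalg.qr(mixed[:, cols])                      # orthonormalize Y
    return q


rng = np.random.default_rng(0)
# Hypothetical gradient matrix with exact rank 3.
grad = rng.standard_normal((64, 3)) @ rng.standard_normal((3, 128))

q = srft_basis(grad, rank=8, rng=rng)
low_rank_grad = q.conj().T @ grad    # projected gradient, shape (8, 128)
restored = q @ low_rank_grad         # projected back to the full space
print(np.allclose(restored, grad))
```

Because the sketch costs one FFT plus a thin QR decomposition, it avoids a full SVD of the gradient. The basis here is complex-valued, so a practical optimizer would need a real-valued variant or extra handling; that detail is outside what the abstract specifies.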
Our model maintains competitive performance and significantly outperforms SAM-based baselines, demonstrating strong cross-dataset robustness and adaptability to unseen surgical environments. These results demonstrate that our SRFT-GaLore-enhanced dual-encoder framework enables scalable and precise segmentation under real-time, depth-constrained surgical settings.[66] SurgAnt-ViVQA: Learning to Anticipate Surgical Events through GRU-Driven Temporal Cross-Attention
Shreyas C. Dhake,Jiayuan Huang,Runlong He,Danyal Z. Khan,Evangelos B. Mazomenos,Sophia Bano,Hani J. Marcus,Danail Stoyanov,Matthew J. Clarkson,Mobarak I. Hoque
Main category: cs.CV
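TL;DR: This paper presents PitVQA-Anticipation, the first visual question answering dataset for forward-looking surgical reasoning, and the SurgAnt-ViVQA model, which anticipates the surgical workflow through temporal awareness and gated cross-modal attention.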
Details
Motivation: Existing surgical visual question answering systems mostly analyze static frames and cannot predict upcoming steps, and existing datasets focus on the current scene rather than future events, offering little support for real-time surgical assistance. Method: Builds the PitVQA-Anticipation dataset, containing 33.5 hours of surgical video and about 734,000 question-answer pairs; proposes the SurgAnt-ViVQA model, which encodes frame-to-frame dynamics with a GRU, fuses visual and language information at the token level through a gating mechanism, and is adapted with parameter-efficient fine-tuning. Result: SurgAnt-ViVQA outperforms multiple baselines on the PitVQA-Anticipation and EndoVis datasets; ablations show that temporal recurrence and gated fusion account for most of the gains; 8 frames favor fluent generation, while 32 frames give better time estimation. Conclusion: Combining temporal modeling with fine-grained cross-modal fusion moves surgical VQA from retrospective description toward proactive anticipation, and PitVQA-Anticipation provides an important benchmark for future-aware surgical assistance. Abstract: Anticipating forthcoming surgical events is vital for real-time assistance in endonasal transsphenoidal pituitary surgery, where visibility is limited and workflow changes rapidly. Most visual question answering (VQA) systems reason on isolated frames with static vision-language alignment, providing little support for forecasting next steps or instrument needs. Existing surgical VQA datasets likewise center on the current scene rather than the near future. We introduce PitVQA-Anticipation, the first VQA dataset designed for forward-looking surgical reasoning. It comprises 33.5 hours of operative video and 734,769 question-answer pairs built from temporally grouped clips and expert annotations across four tasks: predicting the future phase, next step, upcoming instrument, and remaining duration. We further propose SurgAnt-ViVQA, a video-language model that adapts a large language model using a GRU Gated Temporal Cross-Attention module. A bidirectional GRU encodes frame-to-frame dynamics, while an adaptive gate injects visual context into the language stream at the token level. Parameter-efficient fine-tuning customizes the language backbone to the surgical domain. SurgAnt-ViVQA is evaluated on the PitVQA-Anticipation and EndoVis datasets, where it surpasses strong image- and video-based baselines. Ablations show that temporal recurrence and gated fusion drive most of the gains. A frame budget study indicates a trade-off: 8 frames maximize fluency, whereas 32 frames slightly reduce BLEU but improve numeric time estimation.
By pairing a temporally aware encoder with fine grained gated cross-attention, SurgAnt-ViVQA advances surgical VQA from retrospective description to proactive anticipation. PitVQA-Anticipation offers a comprehensive benchmark for this setting and highlights the importance of targeted temporal modeling for reliable, future aware surgical assistance.[67] PETWB-REP: A Multi-Cancer Whole-Body FDG PET/CT and Radiology Report Dataset for Medical Imaging Research
Le Xue,Gang Feng,Wenbo Zhang,Yichi Zhang,Lanlan Li,Shuqi Wang,Liling Peng,Sisi Peng,Xin Gao
Main category: cs.CV
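TL;DR: PETWB-REP is a curated medical imaging dataset containing whole-body FDG PET/CT scans and corresponding radiology reports from 490 patients with various malignancies, intended to support research in medical imaging, radiomics, artificial intelligence, and multi-modal learning.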
Details
Motivation: 目前缺乏同时包含功能成像、解剖成像和详细临床报告的多癌种大规模医学影像数据集,限制了AI模型开发和回顾性临床研究的发展。 Method: 收集并整理490例多种癌症患者的配对PET/CT图像、去标识化文本报告和结构化临床元数据,构建名为PETWB-REP的公开数据集。 Result: 发布了一个高质量、多癌种、多模态的医学影像数据集PETWB-REP,涵盖肺癌、肝癌、乳腺癌、前列腺癌和卵巢癌等常见癌症类型。 Conclusion: PETWB-REP数据集为医学影像分析、人工智能模型训练和多模态临床研究提供了重要资源,有助于推动癌症诊断和评估的技术进步。 Abstract: Publicly available, large-scale medical imaging datasets are crucial for developing and validating artificial intelligence models and conducting retrospective clinical research. However, datasets that combine functional and anatomical imaging with detailed clinical reports across multiple cancer types remain scarce. Here, we present PETWB-REP, a curated dataset comprising whole-body 18F-Fluorodeoxyglucose (FDG) Positron Emission Tomography/Computed Tomography (PET/CT) scans and corresponding radiology reports from 490 patients diagnosed with various malignancies. The dataset primarily includes common cancers such as lung cancer, liver cancer, breast cancer, prostate cancer, and ovarian cancer. This dataset includes paired PET and CT images, de-identified textual reports, and structured clinical metadata. It is designed to support research in medical imaging, radiomics, artificial intelligence, and multi-modal learning.[68] QG-CoC: Question-Guided Chain-of-Captions for Large Multimodal Models
Kuei-Chun Kao,Hsu Tzu-Yin,Yunqi Hong,Ruochen Wang,Cho-Jui Hsieh
Main category: cs.CV
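TL;DR: This paper proposes QG-CoC, a new zero-shot prompting method that improves the fine-grained perception and reasoning of multimodal large language models in multi-image settings.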
Details
Motivation: 现有提示方法在多图像情境下缺乏细粒度感知和有效推理能力,且多集中于单图像或受限场景,难以应对复杂的多图像推理任务。 Method: 提出Question-Guided Chain-of-Captions (QG-CoC),通过问题引导的连贯描述生成机制,实现对任意数量图像的感知与推理整合。 Result: 在多个开源和闭源MLLM上验证了QG-CoC的有效性,在多图像和单图像基准上均表现出竞争力,并在挑战性场景中显著优于现有方法。 Conclusion: QG-CoC是一种通用且有效的零样本提示方法,能够显著提升MLLM在复杂多图像任务中的感知与推理能力。 Abstract: Recently, Multimodal Large Language Models (MLLMs) encounter two key issues in multi-image contexts: (1) a lack of fine-grained perception across disparate images, and (2) a diminished capability to effectively reason over and synthesize information from multiple visual inputs. However, while various prompting methods aim to describe visual content, many existing studies focus primarily on single-image settings or specific, constrained scenarios. This leaves a critical gap in understanding and addressing how MLLMs tackle more general and complex multi-image reasoning tasks. Thus, we first extensively investigate how current prompting methods perceive fine-grained visual details and process visual information when dealing with multiple images. Our findings reveal that existing prompting methods fall short in attending to needed clues and seamlessly integrating perception and reasoning. Inspired by the findings, we propose a new zero-shot prompting method, Question-Guided Chain-of-Captions (QG-CoC), a generalized prompting approach that effectively handles problems with an arbitrary number of images. We evaluate our method on various open-source and closed-source MLLMs for multi-image and single-image benchmarks. Experimental results indicate that QG-CoC demonstrates competitive performance across tasks and exhibits robust improvements in the challenging scenarios where existing prompting methods fail.[69] MvBody: Multi-View-Based Hybrid Transformer Using Optical 3D Body Scan for Explainable Cesarean Section Prediction
Ruting Cheng,Boyuan Feng,Yijiang Zheng,Chuhui Qiu,Aizierjiang Aiersilan,Joaquin A. Calderon,Wentao Zhao,Qing Pan,James K. Hahn
Main category: cs.CV
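TL;DR: This study proposes MvBody, a multi-view Transformer network that predicts cesarean section risk from self-reported medical data and 3D optical body scans taken in late pregnancy; it demonstrates feasibility for resource-limited settings and outperforms existing models.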
Details
Motivation: 现有剖宫产风险预测模型多依赖产时院内数据,在资源匮乏或家庭环境中难以应用,因此需要一种可在早期、非临床环境下使用的低门槛预测方法。 Method: 提出MvBody模型,结合自我报告的医疗信息与孕31至38周的3D身体扫描数据,采用多视角Transformer架构,并引入度量学习损失以提升小样本下的训练效率与泛化能力,使用集成梯度算法进行可解释性分析。 Result: 在独立测试集上达到84.62%的准确率和0.724的AUC-ROC,优于传统机器学习模型和最新3D分析方法;关键预测因素包括孕前体重、年龄、产科史、既往剖宫产史及头肩部身体形态。 Conclusion: 基于3D体形与基本医疗信息的剖宫产风险预测是可行的,MvBody为资源受限环境下的早期风险评估提供了有前景的解决方案。 Abstract: Accurately assessing the risk of cesarean section (CS) delivery is critical, especially in settings with limited medical resources, where access to healthcare is often restricted. Early and reliable risk prediction allows better-informed prenatal care decisions and can improve maternal and neonatal outcomes. However, most existing predictive models are tailored for in-hospital use during labor and rely on parameters that are often unavailable in resource-limited or home-based settings. In this study, we conduct a pilot investigation to examine the feasibility of using 3D body shape for CS risk assessment for future applications with more affordable general devices. We propose a novel multi-view-based Transformer network, MvBody, which predicts CS risk using only self-reported medical data and 3D optical body scans obtained between the 31st and 38th weeks of gestation. To enhance training efficiency and model generalizability in data-scarce environments, we incorporate a metric learning loss into the network. Compared to widely used machine learning models and the latest advanced 3D analysis methods, our method demonstrates superior performance, achieving an accuracy of 84.62% and an Area Under the Receiver Operating Characteristic Curve (AUC-ROC) of 0.724 on the independent test set. To improve transparency and trust in the model's predictions, we apply the Integrated Gradients algorithm to provide theoretically grounded explanations of the model's decision-making process. 
Our results indicate that pre-pregnancy weight, maternal age, obstetric history, previous CS history, and body shape, particularly around the head and shoulders, are key contributors to CS risk prediction.[70] Diffusion-Guided Mask-Consistent Paired Mixing for Endoscopic Image Segmentation
Pengyu Jie,Wanquan Liu,Rui He,Yihui Wen,Deyu Meng,Chenqiang Gao
Main category: cs.CV
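TL;DR: Proposes Mask-Consistent Paired Mixing (MCPMix), a paired mixing method that combines diffusion-based generation with label-preserving mixing; with mask consistency and Real-Anchored Learnable Annealing (RLA), it increases data diversity while preserving pixel-level semantics, significantly improving endoscopic image segmentation.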
Details
Motivation: Existing data augmentation methods for dense prediction suffer from label ambiguity or synthetic-real domain shift, making it hard to achieve both data diversity and semantic consistency. Method: Uses paired diffusion generation to produce, for each real image, a synthetic counterpart under the same mask; MCPMix mixes only image appearance while keeping the original hard mask as supervision, and RLA adaptively adjusts the mixing strength and loss weight, gradually re-anchoring training to the real data distribution. Result: Achieves state-of-the-art segmentation performance on Kvasir-SEG, PICCOLO, CVC-ClinicDB, a private NPC-LES cohort, and ISIC 2017, with consistent gains over baselines. Conclusion: Combining label-preserving mixing, diffusion-driven diversity, and adaptive re-anchoring effectively improves the robustness and generalization of endoscopic image segmentation. Abstract: Augmentation for dense prediction typically relies on either sample mixing or generative synthesis. Mixing improves robustness but misaligned masks yield soft label ambiguity. Diffusion synthesis increases apparent diversity but, when synthetic images are treated as ordinary training samples, overlooks the structural benefit of mask conditioning and introduces synthetic-real domain shift. We propose a paired, diffusion-guided paradigm that fuses the strengths of both. For each real image, a synthetic counterpart is generated under the same mask and the pair is used as a controllable input for Mask-Consistent Paired Mixing (MCPMix), which mixes only image appearance while supervision always uses the original hard mask. This produces a continuous family of intermediate samples that smoothly bridges synthetic and real appearances under shared geometry, enlarging diversity without compromising pixel-level semantics. To keep learning aligned with real data, Real-Anchored Learnable Annealing (RLA) adaptively adjusts the mixing strength and the loss weight of mixed samples over training, gradually re-anchoring optimization to real data and mitigating distributional bias. Across Kvasir-SEG, PICCOLO, CVC-ClinicDB, a private NPC-LES cohort, and ISIC 2017, the approach achieves state-of-the-art segmentation performance and consistent gains over baselines. The results show that combining label-preserving mixing with diffusion-driven diversity, together with adaptive re-anchoring, yields robust and generalizable endoscopic segmentation.
[71] Transformer-Progressive Mamba Network for Lightweight Image Super-Resolution
Sichen Guo,Wenjie Li,Yuanyang Liu,Guangwei Gao,Jian Yang,Chia-Wen Lin
Main category: cs.CV
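TL;DR: Proposes T-PMambaSR, a lightweight super-resolution framework that combines window-based self-attention with Progressive Mamba; through interaction among multi-scale receptive fields and an adaptive high-frequency refinement module, it achieves better performance at linear complexity.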
Details
Motivation: 现有Mamba-based超分辨率方法缺乏跨尺度的细粒度建模,限制了特征表达效率。 Method: 提出T-PMambaSR,融合窗口自注意力与渐进式Mamba以增强多尺度感受野交互,并设计自适应高频细化模块(AHFRM)恢复丢失的高频细节。 Result: 实验证明T-PMambaSR在更低计算成本下,感受野和表达能力逐步增强,性能优于当前Transformer和Mamba-based方法。 Conclusion: T-PMambaSR通过细粒度建模和高频信息恢复,在线性复杂度下实现了高效且高性能的图像超分辨率。 Abstract: Recently, Mamba-based super-resolution (SR) methods have demonstrated the ability to capture global receptive fields with linear complexity, addressing the quadratic computational cost of Transformer-based SR approaches. However, existing Mamba-based methods lack fine-grained transitions across different modeling scales, which limits the efficiency of feature representation. In this paper, we propose T-PMambaSR, a lightweight SR framework that integrates window-based self-attention with Progressive Mamba. By enabling interactions among receptive fields of different scales, our method establishes a fine-grained modeling paradigm that progressively enhances feature representation with linear complexity. Furthermore, we introduce an Adaptive High-Frequency Refinement Module (AHFRM) to recover high-frequency details lost during Transformer and Mamba processing. Extensive experiments demonstrate that T-PMambaSR progressively enhances the model's receptive field and expressiveness, yielding better performance than recent Transformer- or Mamba-based methods while incurring lower computational cost. Our codes will be released after acceptance.[72] Decoupled Multi-Predictor Optimization for Inference-Efficient Model Tuning
Liwei Luo,Shuaitengyuan Li,Dongwei Ren,Qilong Wang,Pengfei Zhu,Qinghua Hu
Main category: cs.CV
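TL;DR: Proposes Decoupled Multi-Predictor Optimization (DMPO), which uses architecture design and an optimization strategy to decouple representative and discriminative abilities in early stages, improving inference efficiency.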
Details
Motivation: Early stages struggle to supply low-level fundamental features to deep stages while also supplying high-level discriminative features to early-stage predictors.
Method: Introduces a lightweight bypass module for functional decomposition of shallow features and a high-order statistics-based predictor to strengthen early-stage discriminative ability; a two-phase decoupled optimization strategy allocates the loss weights.
Result: Validated on multiple datasets and pre-trained models, DMPO outperforms existing methods while reducing computational cost.
Conclusion: DMPO effectively decouples representative and discriminative abilities in early stages, improving inference efficiency and performance.
Abstract: Recently, remarkable progress has been made in large-scale pre-trained model tuning, and inference efficiency is becoming more crucial for practical deployment. Early exiting in conjunction with multi-stage predictors, when combined with a parameter-efficient fine-tuning strategy, offers a straightforward way to achieve an inference-efficient model. However, a key challenge remains unresolved: How can early stages provide low-level fundamental features to deep stages while simultaneously supplying high-level discriminative features to early-stage predictors? To address this problem, we propose a Decoupled Multi-Predictor Optimization (DMPO) method to effectively decouple the low-level representative ability and high-level discriminative ability in early stages. First, in terms of architecture, we introduce a lightweight bypass module into multi-stage predictors for functional decomposition of shallow features from early stages, while a high-order statistics-based predictor is developed for early stages to effectively enhance their discriminative ability. To train our multi-predictor architecture, a decoupled optimization is proposed to allocate two-phase loss weights for multi-stage predictors during model tuning, where the initial training phase enables the model to prioritize the acquisition of discriminative ability of deep stages via emphasizing representative ability of early stages, and the latter training phase drives discriminative ability towards earlier stages as much as possible. As such, our DMPO can effectively decouple representative and discriminative abilities in early stages in terms of architecture design and model optimization. Experiments across various datasets and pre-trained backbones demonstrate that DMPO clearly outperforms its counterparts when reducing computational cost.
[73] Generative deep learning for foundational video translation in ultrasound
Nikolina Tomic,Roshni Bhatnagar,Sarthak Jain,Connor Lau,Tien-Yu Liu,Laura Gambini,Rima Arnaout
Main category: cs.CV
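TL;DR: Presents a generative method for translating between ultrasound color flow Doppler and greyscale video; a two-network architecture handles anatomic reconstruction and denoising, and the synthetic videos are hard to distinguish from real ones in quantitative metrics, deep learning tasks, and clinical expert evaluation, with generalization across organs.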
Details
Motivation: Ultrasound data in clinical studies is often imbalanced across sub-modalities (such as greyscale and color Doppler) or has missing data, which hampers deep learning applications, so effective methods for balancing datasets are needed.
Method: A generative model trained with pixel-wise, adversarial, and perceptual losses uses two networks, one to reconstruct anatomic structures and one to denoise, to translate between CFD and greyscale ultrasound video; it was trained on 54,975 videos and tested on 8,368.
Result: Average SSIM between synthetic and real videos was 0.91±0.04; performance on classification and segmentation tasks was comparable (F1 score 0.9 vs 0.89, Dice score 0.97); clinical experts distinguished real from synthetic videos with only 54±6% accuracy, indicating high realism; the model also worked well on organs other than the heart (average SSIM 0.91±0.05).
Conclusion: The method generates high-quality, clinically credible ultrasound videos, addresses data imbalance, extends the usefulness of retrospective imaging data, and has potential as a foundational data augmentation tool for medical imaging.
Abstract: Deep learning (DL) has the potential to revolutionize image acquisition and interpretation across medicine; however, attention to data imbalance and missingness is required. Ultrasound data presents a particular challenge because in addition to different views and structures, it includes several sub-modalities, such as greyscale and color flow doppler (CFD), that are often imbalanced in clinical studies. Image translation can help balance datasets but is challenging for ultrasound sub-modalities to date. Here, we present a generative method for ultrasound CFD-greyscale video translation, trained on 54,975 videos and tested on 8,368. The method developed leveraged pixel-wise, adversarial, and perceptual losses and utilized two networks: one for reconstructing anatomic structures and one for denoising to achieve realistic ultrasound imaging. Average pairwise SSIM between synthetic videos and ground truth was 0.91+/-0.04. Synthetic videos performed indistinguishably from real ones in DL classification and segmentation tasks and when evaluated by blinded clinical experts: F1 score was 0.9 for real and 0.89 for synthetic videos; Dice score between real and synthetic segmentation was 0.97. Overall clinician accuracy in distinguishing real vs synthetic videos was 54+/-6% (42-61%), indicating realistic synthetic videos. Although trained only on heart videos, the model worked well on ultrasound spanning several clinical domains (average SSIM 0.91+/-0.05), demonstrating foundational abilities. Together, these data expand the utility of retrospectively collected imaging and augment the dataset design toolbox for medical imaging.
[74] Enhancing Medical Image Segmentation via Heat Conduction Equation
Rong Wu,Yim-Sang Yu
Main category: cs.CV
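TL;DR: Proposes a hybrid architecture, built on U-Mamba, that combines Mamba state-space modules with heat conduction operators for medical image segmentation, improving global context modeling and semantic abstraction.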
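As a rough illustration of what "frequency-domain thermal diffusion" can mean, the sketch below applies the closed-form Fourier solution of the heat equation to a single 2D feature map. It is not the paper's Heat Conduction Operator: the function name `heat_conduction_2d`, the fixed scalar diffusivity `k`, the time `t`, and the use of a plain FFT with periodic boundaries are all assumptions made for this example.

```python
import numpy as np

def heat_conduction_2d(feature, k=1.0, t=1.0):
    """Diffuse a 2D feature map by solving the heat equation in the Fourier domain.

    Each frequency component is damped by exp(-k * |omega|^2 * t), the
    closed-form solution of du/dt = k * laplacian(u) on a periodic domain.
    k and t are illustrative constants (assumptions), not learned parameters.
    """
    h, w = feature.shape
    wy = 2.0 * np.pi * np.fft.fftfreq(h)[:, None]   # angular frequencies along rows
    wx = 2.0 * np.pi * np.fft.fftfreq(w)[None, :]   # angular frequencies along columns
    decay = np.exp(-k * (wx ** 2 + wy ** 2) * t)    # stronger damping at high frequencies
    return np.real(np.fft.ifft2(np.fft.fft2(feature) * decay))
```

The zero-frequency term is left untouched, so the mean of the feature map is preserved while fine detail is smoothed, which is the global mixing effect that diffusion provides.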
Details
Motivation: Existing models struggle to achieve efficient global context modeling and long-range dependency reasoning at the same time under limited computational budgets.
Method: Designs a U-Mamba architecture that combines Mamba-based state-space modules with Heat Conduction Operators (HCOs) in the bottleneck layers, which simulate frequency-domain thermal diffusion to enhance semantic abstraction.
Result: Experiments on multimodal abdominal CT and MRI datasets show that the model consistently outperforms strong baselines, with good effectiveness and generalizability.
Conclusion: Combining state-space dynamics with heat-based global diffusion is a scalable and interpretable solution for medical image segmentation.
Abstract: Medical image segmentation has been significantly advanced by deep learning architectures, notably U-Net variants. However, existing models struggle to simultaneously achieve efficient global context modeling and long-range dependency reasoning under practical computational budgets. In this work, we propose a novel hybrid architecture utilizing U-Mamba with Heat Conduction Equation. Our model combines Mamba-based state-space modules for efficient long-range reasoning with Heat Conduction Operators (HCOs) in the bottleneck layers, simulating frequency-domain thermal diffusion for enhanced semantic abstraction. Experimental results on multimodal abdominal CT and MRI datasets demonstrate that the proposed model consistently outperforms strong baselines, validating its effectiveness and generalizability. These results suggest that blending state-space dynamics with heat-based global diffusion offers a scalable and interpretable solution for medical segmentation tasks.
[75] IEC3D-AD: A 3D Dataset of Industrial Equipment Components for Unsupervised Point Cloud Anomaly Detection
Bingyang Guo,Hongjie Li,Ruiyun Yu,Hanzhe Liang,Jinbao Wang
Main category: cs.CV
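TL;DR: Presents IEC3D-AD, a point cloud anomaly detection dataset for real industrial scenarios, and introduces GMANet, a new 3D anomaly detection paradigm that generates synthetic samples via geometric morphological analysis and applies spatial discrepancy optimization to improve detection.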
Details
Motivation: Existing 3D anomaly detection datasets fail to capture the complex, subtle defects found in real industrial environments, which limits research on precise anomaly detection for industrial equipment components.
Method: Builds IEC3D-AD, a high-resolution point cloud dataset collected from real production lines, and proposes GMANet, which generates synthetic point cloud samples based on geometric morphological analysis and narrows the gap between normal and abnormal point-level features through spatial discrepancy optimization.
Result: Experiments show that the method performs well on IEC3D-AD and other datasets, improving 3D anomaly detection performance.
Conclusion: The IEC3D-AD dataset and GMANet offer a better solution for high-precision 3D anomaly detection of industrial equipment components.
Abstract: 3D anomaly detection (3D-AD) plays a critical role in industrial manufacturing, particularly in ensuring the reliability and safety of core equipment components. Although existing 3D datasets like Real3D-AD and MVTec 3D-AD offer broad application support, they fall short in capturing the complexities and subtle defects found in real industrial environments. This limitation hampers precise anomaly detection research, especially for industrial equipment components (IEC) such as bearings, rings, and bolts. To address this challenge, we have developed a point cloud anomaly detection dataset (IEC3D-AD) specific to real industrial scenarios. This dataset is directly collected from actual production lines, ensuring high fidelity and relevance. Compared to existing datasets, IEC3D-AD features significantly improved point cloud resolution and defect annotation granularity, facilitating more demanding anomaly detection tasks. Furthermore, inspired by generative 2D-AD methods, we introduce a novel 3D-AD paradigm (GMANet) on IEC3D-AD. This paradigm generates synthetic point cloud samples based on geometric morphological analysis, then reduces the margin and increases the overlap between normal and abnormal point-level features through spatial discrepancy optimization. Extensive experiments demonstrate the effectiveness of our method on both IEC3D-AD and other datasets.
[76] Unified Long Video Inpainting and Outpainting via Overlapping High-Order Co-Denoising
Shuangquan Lyu,Steven Mao,Yue Ma
Main category: cs.CV
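TL;DR: Proposes a unified approach built on text-to-video diffusion models that uses LoRA fine-tuning and an overlap-and-blend temporal co-denoising strategy to deliver high-fidelity, controllable inpainting and outpainting for videos of arbitrary length.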
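The overlap-and-blend idea can be sketched on scalar per-frame values: split a long sequence into overlapping fixed-size windows, process each window separately, then merge the overlapping frames with weights that fade toward window edges. This is only a minimal sketch under stated assumptions. The helper names `window_starts` and `blend_windows` and the triangular weights are hypothetical, it assumes the sequence is at least one window long, and in a real system the per-frame values would be latent tensors blended at each denoising step.

```python
def window_starts(num_frames, window, overlap):
    """Start indices of fixed-size windows that overlap by `overlap` frames."""
    stride = window - overlap
    starts = list(range(0, max(num_frames - window, 0) + 1, stride))
    if starts[-1] + window < num_frames:          # make sure the tail is covered
        starts.append(num_frames - window)
    return starts

def blend_windows(windows, starts, num_frames):
    """Merge overlapping per-window values using triangular weights."""
    total = [0.0] * num_frames
    weight = [0.0] * num_frames
    for values, start in zip(windows, starts):
        n = len(values)
        for i, v in enumerate(values):
            w = min(i + 1, n - i)                 # largest near the window centre
            total[start + i] += w * v
            weight[start + i] += w
    return [t / w for t, w in zip(total, weight)]
```

Because each frame's weights are normalised, windows that agree are reproduced exactly, and windows that disagree are cross-faded across the overlap rather than switched abruptly, which is what suppresses visible seams.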
Details
Motivation: Long-video inpainting and outpainting suffer from poor controllability, stitching artifacts, and limited clip length.
Method: Fine-tunes a pre-trained video diffusion model (such as Wan 2.1) efficiently with LoRA and combines an overlap-and-blend temporal co-denoising strategy with high-order solvers to synthesize video in masked regions.
Result: Performs well on complex editing tasks spanning hundreds of frames, beating baselines such as Wan 2.1 and VACE on PSNR/SSIM and LPIPS, with no noticeable seams or drift.
Conclusion: The method provides parameter-efficient, high-performing long-video inpainting and outpainting, supports generation of arbitrary length, and is practical with high-quality output.
Abstract: Generating long videos remains a fundamental challenge, and achieving high controllability in video inpainting and outpainting is particularly demanding. To address both of these challenges simultaneously and achieve controllable video inpainting and outpainting for long video clips, we introduce a novel and unified approach for long video inpainting and outpainting that extends text-to-video diffusion models to generate arbitrarily long, spatially edited videos with high fidelity. Our method leverages LoRA to efficiently fine-tune a large pre-trained video diffusion model like Alibaba's Wan 2.1 for masked region video synthesis, and employs an overlap-and-blend temporal co-denoising strategy with high-order solvers to maintain consistency across long sequences. In contrast to prior work that struggles with fixed-length clips or exhibits stitching artifacts, our system enables arbitrarily long video generation and editing without noticeable seams or drift. We validate our approach on challenging inpainting/outpainting tasks including editing or adding objects over hundreds of frames and demonstrate superior performance to baseline methods like Wan 2.1 model and VACE in terms of quality (PSNR/SSIM), and perceptual realism (LPIPS). Our method enables practical long-range video editing with minimal overhead, balancing parameter efficiency with superior performance.
[77] Diffusion-SDPO: Safeguarded Direct Preference Optimization for Diffusion Models
Minghao Fu,Guo-Hua Wang,Tianyu Cui,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang
Main category: cs.CV
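TL;DR: Proposes Diffusion-SDPO, an improved preference optimization method for diffusion models that adaptively scales the loss gradient to avoid degrading generation quality and improves alignment in text-to-image generation.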
Details
Motivation: When enlarging the preference margin, standard Diffusion-DPO can increase the reconstruction error of both preferred and less-preferred samples, which harms the generation of high-quality outputs.
Method: Introduces Diffusion-SDPO, an update rule with an adaptive scaling coefficient that adjusts the loser gradient according to its alignment with the winner gradient, ensuring the winner's error does not increase.
Result: On multiple text-to-image benchmarks, Diffusion-SDPO outperforms existing preference-learning methods on automated preference, aesthetic quality, and prompt alignment metrics.
Conclusion: Diffusion-SDPO addresses the quality degradation that a growing preference margin causes in conventional DPO, and it is general, computationally cheap, and delivers consistent gains.
Abstract: Text-to-image diffusion models deliver high-quality images, yet aligning them with human preferences remains challenging. We revisit diffusion-based Direct Preference Optimization (DPO) for these models and identify a critical pathology: enlarging the preference margin does not necessarily improve generation quality. In particular, the standard Diffusion-DPO objective can increase the reconstruction error of both winner and loser branches. Consequently, degradation of the less-preferred outputs can become sufficiently severe that the preferred branch is also adversely affected even as the margin grows. To address this, we introduce Diffusion-SDPO, a safeguarded update rule that preserves the winner by adaptively scaling the loser gradient according to its alignment with the winner gradient. A first-order analysis yields a closed-form scaling coefficient that guarantees the error of the preferred output is non-increasing at each optimization step. Our method is simple, model-agnostic, broadly compatible with existing DPO-style alignment frameworks and adds only marginal computational overhead. Across standard text-to-image benchmarks, Diffusion-SDPO delivers consistent gains over preference-learning baselines on automated preference, aesthetic, and prompt alignment metrics. Code is publicly available at https://github.com/AIDC-AI/Diffusion-SDPO.
[78] SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding
Mauro Orazio Drago,Luca Carlini,Pelinsu Celebi Balyemez,Dennis Pierantozzi,Chiara Lena,Cesare Hassan,Danail Stoyanov,Elena De Momi,Sophia Bano,Mobarak I. Hoque
Main category: cs.CV
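TL;DR: Proposes SurgViVQA, a surgical video question answering (VideoQA) model that fuses video and text features and decodes them with a large language model to understand dynamic surgical scenes over time; it clearly outperforms existing methods on the newly built REAL-Colon-VQA dataset and the public EndoVis18-VQA dataset.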
Details
Motivation: Existing surgical VideoQA methods mostly rely on static image features and do not model temporal dynamics (such as motion and tool-tissue interaction), and datasets lack temporal annotations, which limits accurate understanding of surgical procedures.
Method: SurgViVQA uses a masked video-text encoder to fuse video clip and question features, capturing temporal cues such as motion, and a fine-tuned large language model decodes the answer; the authors also build REAL-Colon-VQA, which includes motion-related questions, diagnostic attributes, and out-of-template questions with rephrased or semantically altered formulations to assess robustness.
Result: On REAL-Colon-VQA and EndoVis18-VQA, SurgViVQA improves keyword accuracy over PitVQA by +11% and +9% respectively, and a perturbation study on the questions confirms better generalizability and robustness to changes in question phrasing.
Conclusion: By adding temporal modeling, SurgViVQA moves surgical VideoQA from static images toward dynamic scene understanding, and together with REAL-Colon-VQA it provides a new framework for temporally-aware understanding of surgical procedures.
Abstract: Video Question Answering (VideoQA) in the surgical domain aims to enhance intraoperative understanding by enabling AI models to reason over temporally coherent events rather than isolated frames. Current approaches are limited to static image features, and available datasets often lack temporal annotations, ignoring the dynamics critical for accurate procedural interpretation. We propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from static images to dynamic surgical scenes. It uses a Masked Video-Text Encoder to fuse video and question features, capturing temporal cues such as motion and tool-tissue interactions, which a fine-tuned large language model (LLM) then decodes into coherent answers. To evaluate its performance, we curated REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related questions and diagnostic attributes, as well as out-of-template questions with rephrased or semantically altered formulations to assess model robustness. Experimental validation on REAL-Colon-VQA and the public EndoVis18-VQA dataset shows that SurgViVQA outperforms existing image-based VQA benchmark models, particularly in keyword accuracy, improving over PitVQA by +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA. A perturbation study on the questions further confirms improved generalizability and robustness to variations in question phrasing. SurgViVQA and the REAL-Colon-VQA dataset provide a framework for temporally-aware understanding in surgical VideoQA, enabling AI models to interpret dynamic procedural contexts more effectively. Code and dataset available at https://github.com/madratak/SurgViVQA.
[79] Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge
Yi Yang,Yiming Xu,Timo Kaiser,Hao Cheng,Bodo Rosenhahn,Michael Ying Yang
Main category: cs.CV
TL;DR: Presents a two-stage, zero-shot approach to the MOT25-Spatiotemporal Action Grounding challenge that combines FastTracker with LLaVA-Video and placed second on the test set.
Details
Motivation: Accurately localize and track multiple target objects in videos of complex real-world scenes from free-form language queries.
Method: Models the task as a video retrieval problem and uses a two-stage, zero-shot approach that combines the SOTA tracking model FastTracker with the multimodal large language model LLaVA-Video.
Result: Achieves an m-HIoU of 20.68 and a HOTA of 10.73 on the MOT25-StAG test set.
Conclusion: The method combines an advanced tracker with a multimodal language model and placed second in the challenge.
Abstract: In this report, we present our solution to the MOT25-Spatiotemporal Action Grounding (MOT25-StAG) Challenge. The aim of this challenge is to accurately localize and track multiple objects that match specific and free-form language queries, using video data of complex real-world scenes as input. We model the underlying task as a video retrieval problem and present a two-stage, zero-shot approach, combining the advantages of the SOTA tracking model FastTracker and Multi-modal Large Language Model LLaVA-Video. On the MOT25-StAG test set, our method achieves m-HIoU and HOTA scores of 20.68 and 10.73 respectively, placing second in the challenge.
[80] UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
Guozhen Zhang,Zixiang Zhou,Teng Hu,Ziqiao Peng,Youliang Zhang,Yi Chen,Yuan Zhou,Qinglin Lu,Limin Wang
Main category: cs.CV
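TL;DR: Proposes UniAVGen, a unified framework for joint audio and video generation that uses a dual-branch Diffusion Transformer architecture and an asymmetric cross-modal interaction mechanism to achieve precise audio-video synchronization and semantic consistency.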
Details
Motivation: Lacking effective cross-modal modeling, existing open-source audio-video generation methods often show poor lip synchronization and semantic inconsistency.
Method: Uses a dual-branch joint synthesis architecture with two parallel Diffusion Transformers to build a unified cross-modal latent space; introduces an asymmetric cross-modal interaction mechanism for bidirectional, temporally aligned cross-attention, together with a face-aware modulation module that dynamically focuses on salient regions; and proposes modality-aware classifier-free guidance to improve generative fidelity.
Result: With far fewer training samples than existing methods (1.3M vs 30.1M), UniAVGen performs better on audio-video synchronization, timbre consistency, and emotion consistency, and supports unified modeling of several key audio-video tasks.
Conclusion: Through its cross-modal interaction design, UniAVGen achieves high-quality joint audio-video generation with little data and is compatible with multiple tasks.
Abstract: Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.
[81] Decoupling Augmentation Bias in Prompt Learning for Vision-Language Models
Gahyeon Kim,Sohee Kim,Seokju Lee
Main category: cs.CV
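TL;DR: Proposes AAPL, a new prompt learning method that introduces adversarial token embeddings to decouple the superficial visual variations caused by data augmentation from class-relevant semantic representations, improving generalization to unseen classes.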
Details
Motivation: Existing prompt learning methods (such as CoCoOp) underexplore image augmentation and give no explicit guidance toward semantically relevant visual features, which limits generalization to entirely unseen categories.
Method: AAPL applies attribute-specific image-level augmentations and introduces adversarial token embeddings that separate irrelevant visual variations within a soft prompt framework, so the prompts focus on semantically meaningful visual features.
Result: In experiments on 11 benchmark datasets, AAPL outperforms existing methods in few-shot, zero-shot, cross-dataset, and domain generalization settings.
Conclusion: AAPL improves the generalization of prompt learning, shows the potential of combining image augmentation with prompt learning, and offers a new way to learn more discriminative semantic features.
Abstract: Recent advances in large-scale vision and language models have led to significant progress in zero-shot learning tasks. Methods such as CoOp and CoCoOp have shown that replacing handcrafted prompts with learnable vectors, known as prompt learning, can result in improved performance. However, these models often struggle to generalize to entirely unseen categories. While traditional zero-shot learning techniques benefit from various data augmentation strategies, prompt learning has primarily focused on text-based modifications, leaving the potential of image-based augmentation largely unexplored. In this work, we explore how image-level augmentations, particularly those that introduce attribute-specific variations, can support and enhance prompt learning. Our analysis examines the interaction between these augmentations and soft prompt frameworks, revealing their potential to improve generalization. We also identify a limitation in existing methods, such as CoCoOp, which do not provide explicit guidance for learning prompts that focus on semantically meaningful visual features. To address this, we propose Adding Attributes to Prompt Learning, AAPL, a novel method that introduces adversarial token embeddings to decouple superficial visual variations introduced by augmentation from class-relevant semantic representations. This decoupling enables the learned prompts to concentrate on visually discriminative features that align with the target categories. We conduct comprehensive experiments on eleven benchmark datasets, and AAPL consistently outperforms existing methods across few-shot, zero-shot, cross-dataset, and domain generalization settings. Our source code is publicly available at: https://github.com/Gahyeonkim09/AAPL
[82] Robust Alignment of the Human Embryo in 3D Ultrasound using PCA and an Ensemble of Heuristic, Atlas-based and Learning-based Classifiers Evaluated on the Rotterdam Periconceptional Cohort
Nikolai Herrmann,Marcella C. Zijta,Stefan Klein,Régine P. M. Steegers-Theunissen,Rene M. H. Wijnen,Bernadette S. de Bakker,Melek Rousian,Wietske A. P. Bastiaansen
Main category: cs.CV
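TL;DR: Proposes an automated PCA-based method for standardizing embryo alignment in 3D ultrasound images; it combines three strategies for selecting the standard orientation and reaches up to 98.5% accuracy on a large dataset.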
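The PCA step can be sketched as follows: take the voxel coordinates of a binary segmentation mask, compute their covariance, and use the eigenvectors as principal axes. How the four candidate orientations are built is not spelled out in the summary, so the sketch assumes they are the four sign choices of the axes that keep the frame right-handed. The function names `principal_axes` and `candidate_orientations` are hypothetical and this is not the authors' released code.

```python
import numpy as np

def principal_axes(mask):
    """Return the centroid and principal axes (as columns) of a binary 3D mask."""
    coords = np.argwhere(mask).astype(float)      # (N, 3) voxel coordinates
    centroid = coords.mean(axis=0)
    cov = np.cov((coords - centroid).T)           # 3x3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # largest variance first
    return centroid, eigvecs[:, order]

def candidate_orientations(axes):
    """Enumerate four sign choices that keep the frame right-handed (assumption)."""
    candidates = []
    for s0 in (1.0, -1.0):
        for s1 in (1.0, -1.0):
            frame = axes * np.array([s0, s1, 1.0])   # flip the first two axes
            # Recompute the third axis so that det(frame) = +1 (a proper rotation).
            frame[:, 2] = np.cross(frame[:, 0], frame[:, 1])
            candidates.append(frame)
    return candidates
```

PCA fixes the axes only up to sign, which is why a separate selection step (heuristic, atlas matching, or classifier) is still needed to pick the candidate in standard orientation.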
Details
Motivation: Standardized alignment of the embryo in 3D ultrasound images is needed to make prenatal growth monitoring more consistent and comparable, as it facilitates standard plane detection, improves visualization of anatomical landmarks, and makes differences between scans more apparent.
Method: Given a segmentation mask of the embryo, Principal Component Analysis (PCA) extracts the embryo's principal axes and yields four candidate orientations; the candidate that best matches the standard orientation is selected by a Pearson correlation heuristic, atlas-based matching with normalized cross-correlation, or a Random Forest classifier.
Result: Tested on 2166 3D ultrasound images from 1043 pregnancies, PCA correctly extracted the principal axes in 99.0% of images; the three selection strategies reached 97.4%, 95.8%, and 98.4% accuracy, and majority voting reached 98.5%.
Conclusion: The method aligns first-trimester embryos in 3D images efficiently and accurately, suits large-scale analysis in clinical and research settings, and its code is public.
Abstract: Standardized alignment of the embryo in three-dimensional (3D) ultrasound images aids prenatal growth monitoring by facilitating standard plane detection, improving visualization of landmarks and accentuating differences between different scans. In this work, we propose an automated method for standardizing this alignment. Given a segmentation mask of the embryo, Principal Component Analysis (PCA) is applied to the mask extracting the embryo's principal axes, from which four candidate orientations are derived. The candidate in standard orientation is selected using one of three strategies: a heuristic based on Pearson's correlation assessing shape, image matching to an atlas through normalized cross-correlation, and a Random Forest classifier. We tested our method on 2166 longitudinally acquired 3D ultrasound images from 1043 pregnancies from the Rotterdam Periconceptional Cohort, ranging from 7+0 to 12+6 weeks of gestational age. In 99.0% of images, PCA correctly extracted the principal axes of the embryo. The correct candidate was selected by the Pearson Heuristic, Atlas-based and Random Forest in 97.4%, 95.8%, and 98.4% of images, respectively. A Majority Vote of these selection methods resulted in an accuracy of 98.5%. The high accuracy of this pipeline enables consistent embryonic alignment in the first trimester, enabling scalable analysis in both clinical and research settings. The code is publicly available at: https://gitlab.com/radiology/prenatal-image-analysis/pca-3d-alignment.
[83] Generalizing Shape-from-Template to Topological Changes
Kevin Manogue,Tomasz M Schang,Dilara Kuş,Jonas Müller,Stefan Zachow,Agniva Sengupta
Main category: cs.CV
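TL;DR: Proposes an extension of Shape-from-Template (SfT) that handles topological changes, robustly reconstructing the surfaces of deformable objects by iteratively optimizing an energy functional.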
Details
Motivation: Existing SfT methods fail when the deformation is accompanied by topological changes, such as tears or cuts.
Method: Starting from a classical SfT solution, the method iteratively adapts the template by partitioning its spatial domain so as to minimize an energy functional that combines physical plausibility and reprojection consistency.
Result: Reconstructs surfaces under a range of topological events, including tears and cuts, and outperforms baselines on both synthetic and real data.
Conclusion: Establishes the first general framework for topological-change-aware SfT, broadening the applicability and robustness of deformable surface reconstruction.
Abstract: Reconstructing the surfaces of deformable objects from correspondences between a 3D template and a 2D image is well studied under Shape-from-Template (SfT) methods; however, existing approaches break down when topological changes accompany the deformation. We propose a principled extension of SfT that enables reconstruction in the presence of such changes. Our approach is initialized with a classical SfT solution and iteratively adapts the template by partitioning its spatial domain so as to minimize an energy functional that jointly encodes physical plausibility and reprojection consistency. We demonstrate that the method robustly captures a wide range of practically relevant topological events including tears and cuts on bounded 2D surfaces, thereby establishing the first general framework for topological-change-aware SfT. Experiments on both synthetic and real data confirm that our approach consistently outperforms baseline methods.
[84] Human Mesh Modeling for Anny Body
Romain Brégier,Guénolé Fiche,Laura Bravo-Sánchez,Thomas Lucas,Matthieu Armando,Philippe Weinzaepfel,Grégory Rogez,Fabien Baradel
Main category: cs.CV
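TL;DR: Introduces Anny, a simple, fully differentiable, scan-free human body model grounded in anthropometric knowledge; it supports interpretable shape control across ages, body types, and proportions, is calibrated with WHO statistics, and provides an open, representative, and semantically controllable foundation for 3D human modeling.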
Details
Motivation: Existing parametric body models rely on costly 3D scans and cover a narrow demographic range, and most are proprietary, so they lack openness and representativeness.
Method: Builds a continuous, interpretable shape space from the MakeHuman community's anthropometric knowledge, with phenotype parameters (such as gender, age, height, and weight) controlling blendshapes, and calibrates it with WHO population statistics.
Result: Anny produces realistic body shapes covering a wide range of human forms and supports millimeter-accurate scan fitting, synthetic data generation, and Human Mesh Recovery (HMR); Anny-One, generated with Anny, contains 800k photorealistic humans, and experiments show that HMR models trained with it match the performance of models trained with scan-based body models.
Conclusion: Anny is an open, simple, interpretable, and broadly representative parametric body model that provides an accessible foundation for human-centric 3D modeling; the code is released under the Apache 2.0 license.
Abstract: Parametric body models are central to many human-centric tasks, yet existing models often rely on costly 3D scans and learned shape spaces that are proprietary and demographically narrow. We introduce Anny, a simple, fully differentiable, and scan-free human body model grounded in anthropometric knowledge from the MakeHuman community. Anny defines a continuous, interpretable shape space, where phenotype parameters (e.g. gender, age, height, weight) control blendshapes spanning a wide range of human forms across ages (from infants to elders), body types, and proportions. Calibrated using WHO population statistics, it provides realistic and demographically grounded human shape variation within a single unified model. Thanks to its openness and semantic control, Anny serves as a versatile foundation for 3D human modeling, supporting millimeter-accurate scan fitting, controlled synthetic data generation, and Human Mesh Recovery (HMR). We further introduce Anny-One, a collection of 800k photorealistic humans generated with Anny, showing that despite its simplicity, HMR models trained with Anny can match the performance of those trained with scan-based body models, while remaining interpretable and broadly representative. The Anny body model and its code are released under the Apache 2.0 license, making Anny an accessible foundation for human-centric 3D modeling.
[85] Signal Intensity-weighted coordinate channels improve learning stability and generalisation in 1D and 2D CNNs in localisation tasks on biomedical signals
Vittal L. Rao
Main category: cs.CV
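TL;DR: Proposes a signal intensity-weighted coordinate representation that couples local signal intensity with coordinates, improving convergence speed and generalization in localisation tasks on biomedical data.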
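A minimal sketch of the input representation, under stated assumptions: coordinates are normalised to [-1, 1] in the CoordConv style, "local signal intensity" is taken to be the raw sample or pixel value, and the original signal is kept as the first channel. The function names are hypothetical and the paper's exact scaling may differ.

```python
import numpy as np

def intensity_weighted_coords_1d(signal):
    """Stack a 1D signal with a coordinate channel scaled by the signal itself.

    signal: array of shape (length,)
    returns: array of shape (2, length) -> [signal, coordinate * signal]
    """
    signal = np.asarray(signal, dtype=float)
    coords = np.linspace(-1.0, 1.0, signal.shape[0])   # normalised positions
    return np.stack([signal, coords * signal])

def intensity_weighted_coords_2d(image):
    """Stack a 2D image with row and column coordinate channels scaled by intensity."""
    image = np.asarray(image, dtype=float)
    rows = np.linspace(-1.0, 1.0, image.shape[0])[:, None]
    cols = np.linspace(-1.0, 1.0, image.shape[1])[None, :]
    return np.stack([image, rows * image, cols * image])
```

With plain coordinate channels the extra input is identical for every sample; here it is zero wherever the signal is zero, so position information is only injected where there is signal to localise.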
Details
Motivation: Conventional CoordConv uses only absolute coordinates and does not exploit signal intensity, so it struggles to model spatial or temporal relationships under complex intensity distributions.
Method: Weights the coordinate channels in the input by local signal intensity, introducing an intensity-position coupling as a modality-agnostic inductive bias.
Result: On two tasks, predicting the time of morphological transition in ECG signals and localising nuclear centres in cell images, the new representation converges faster and generalises better.
Conclusion: Intensity-weighted coordinate representations strengthen a model's ability to learn positional information from biomedical signals and apply to localisation tasks on both 1D and 2D signals.
Abstract: Localisation tasks in biomedical data often require models to learn meaningful spatial or temporal relationships from signals with complex intensity distributions. A common strategy, exemplified by CoordConv layers, is to append coordinate channels to convolutional inputs, enabling networks to learn absolute positions. In this work, we propose a signal intensity-weighted coordinate representation that replaces the pure coordinate channels with channels scaled by local signal intensity. This modification embeds an intensity-position coupling directly in the input representation, introducing a simple and modality-agnostic inductive bias. We evaluate the approach on two distinct localisation problems: (i) predicting the time of morphological transition in 20-second, two-lead ECG signals, and (ii) regressing the coordinates of nuclear centres in cytological images from the SiPaKMeD dataset. In both cases, the proposed representation yields faster convergence and higher generalisation performance relative to conventional coordinate-channel approaches, demonstrating its effectiveness across both one-dimensional and two-dimensional biomedical signals.
[86] A Lightweight 3D-CNN for Event-Based Human Action Recognition with Privacy-Preserving Potential
Mehdi Sefidgar Dilmaghani,Francis Fowley,Peter Corcoran
Main category: cs.CV
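TL;DR: Proposes a lightweight 3DCNN for human activity recognition from event camera data that balances high accuracy, privacy preservation, and efficient edge deployment.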
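The paper trains with focal loss and class reweighting to handle class imbalance. The sketch below shows the standard class-weighted focal loss for a single sample, FL = -alpha_t * (1 - p_t)^gamma * log(p_t). The function name, the default gamma = 2.0, and the example weights are assumptions for illustration, not values reported by the authors.

```python
import math

def focal_loss(probs, target, class_weights, gamma=2.0):
    """Class-weighted focal loss for one sample.

    probs: predicted class probabilities (should sum to 1)
    target: index of the true class
    class_weights: per-class weights, typically larger for rare classes
    gamma: focusing parameter; gamma = 0 reduces to weighted cross-entropy
    """
    p_t = probs[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)
```

The factor (1 - p_t)^gamma shrinks the loss of samples the model already classifies confidently, so training concentrates on hard and rare examples, while the class weights rebalance the contribution of each class.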
Details
Motivation: Conventional frame-based cameras capture identifiable personal information and raise privacy concerns, whereas event cameras record only pixel changes and are inherently privacy-preserving, so efficient recognition models tailored to event data are needed.
Method: Designs a lightweight three-dimensional convolutional neural network (3DCNN), uses focal loss with class reweighting to mitigate class imbalance, and applies targeted data augmentation to improve generalization.
Result: Achieves an F1-score of 0.9415 and an overall accuracy of 94.17% on a composite dataset derived from Toyota Smart Home and ETRI, outperforming benchmarks such as C3D, ResNet3D, and MC3_18 by up to 3%.
Conclusion: The method achieves accurate human activity recognition while staying compact, supporting the potential of event-based deep learning for efficient, privacy-friendly edge intelligence.
Abstract: This paper presents a lightweight three-dimensional convolutional neural network (3DCNN) for human activity recognition (HAR) using event-based vision data. Privacy preservation is a key challenge in human monitoring systems, as conventional frame-based cameras capture identifiable personal information. In contrast, event cameras record only changes in pixel intensity, providing an inherently privacy-preserving sensing modality. The proposed network effectively models both spatial and temporal dynamics while maintaining a compact design suitable for edge deployment. To address class imbalance and enhance generalization, focal loss with class reweighting and targeted data augmentation strategies are employed. The model is trained and evaluated on a composite dataset derived from the Toyota Smart Home and ETRI datasets. Experimental results demonstrate an F1-score of 0.9415 and an overall accuracy of 94.17%, outperforming benchmark 3D-CNN architectures such as C3D, ResNet3D, and MC3_18 by up to 3%. These results highlight the potential of event-based deep learning for developing accurate, efficient, and privacy-aware human action recognition systems suitable for real-world edge applications.
[87] Part-Aware Bottom-Up Group Reasoning for Fine-Grained Social Interaction Detection
Dongkeun Kim,Minsu Cho,Suha Kwak
Main category: cs.CV
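TL;DR: Proposes a fine-grained social interaction detection method based on body part features and interpersonal relations; its part-aware bottom-up reasoning framework achieves state-of-the-art performance on the NVI dataset.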
Details
Motivation: Existing methods overlook subtle social cues such as facial expressions, gaze, and gestures, and do not explicitly model interactions between individuals, which makes it hard to infer social group structure accurately.
Method: A part-aware bottom-up group reasoning framework first detects individuals and enhances their features with body part information, then associates individuals through similarity-based reasoning that combines spatial relations with subtle social cues to infer group configuration.
Result: Experiments on the NVI dataset show that the method outperforms prior methods and improves the accuracy of fine-grained social interaction detection.
Conclusion: The method captures localized social signals more precisely and improves group structure inference based on subtle social cues.
Abstract: Social interactions often emerge from subtle, fine-grained cues such as facial expressions, gaze, and gestures. However, existing methods for social interaction detection overlook such nuanced cues and primarily rely on holistic representations of individuals. Moreover, they directly detect social groups without explicitly modeling the underlying interactions between individuals. These drawbacks limit their ability to capture localized social signals and introduce ambiguity when group configurations should be inferred from social interactions grounded in nuanced cues. In this work, we propose a part-aware bottom-up group reasoning framework for fine-grained social interaction detection. The proposed method infers social groups and their interactions using body part features and their interpersonal relations. Our model first detects individuals and enhances their features using part-aware cues, and then infers group configuration by associating individuals via similarity-based reasoning, which considers not only spatial relations but also subtle social cues that signal interactions, leading to more accurate group inference. Experiments on the NVI dataset demonstrate that our method outperforms prior methods, achieving the new state of the art.
[88] Disentangled Concepts Speak Louder Than Words: Explainable Video Action Recognition
Jongseo Lee,Wooil Lee,Gyeong-Moon Park,Seong Tae Kim,Jinwoo Choi
Main category: cs.CV
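TL;DR: Proposes DANCE, an explainable framework for video action recognition that disentangles motion dynamics, object, and scene concepts for clearer explanations, and validates its effectiveness and interpretability on multiple datasets.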