cs.CL
[1] A Women's Health Benchmark for Large Language Models
Victoria-Elisabeth Gruber,Razvan Marinescu,Diego Fajardo,Amin H. Nassar,Christopher Arkfeld,Alexandria Ludlow,Shama Patel,Mehrnoosh Samaei,Valerie Klug,Anna Huber,Marcel Gühner,Albert Botta i Orfila,Irene Lagoja,Kimya Tarr,Haleigh Larson,Mary Beth Howard
Main category: cs.CL
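TL;DR: This paper introduces the Women's Health Benchmark (WHB), the first benchmark dedicated to evaluating large language models in women's health. It spans 5 medical specialties, 3 query types, and 8 error types. An evaluation of 13 mainstream LLMs finds a failure rate of about 60%, with particularly poor performance on "missed urgency", indicating that current AI cannot yet reliably provide women's health advice.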
Details
Motivation: As large language models become a primary way people obtain health information, their accuracy in women's health has not been systematically evaluated, which poses potential risks.
Method: The authors build the Women's Health Benchmark (WHB), comprising 96 rigorously validated test items that cover five medical specialties, three query types, and eight error types, and evaluate 13 state-of-the-art LLMs on it.
Result: Current LLMs fail on roughly 60% of the benchmark, with performance varying significantly across specialties and error types. All models struggle with "missed urgency", while newer models such as GPT-5 improve at avoiding inappropriate recommendations.
Conclusion: Existing LLMs still have serious reliability problems in women's health and cannot yet serve as tools for reliable medical advice; further optimization and regulation are needed.
Abstract: As large language models (LLMs) become primary sources of health information for millions, their accuracy in women's health remains critically unexamined. We introduce the Women's Health Benchmark (WHB), the first benchmark evaluating LLM performance specifically in women's health. Our benchmark comprises 96 rigorously validated model stumps covering five medical specialties (obstetrics and gynecology, emergency medicine, primary care, oncology, and neurology), three query types (patient query, clinician query, and evidence/policy query), and eight error types (dosage/medication errors, missing critical information, outdated guidelines/treatment recommendations, incorrect treatment advice, incorrect factual information, missing/incorrect differential diagnosis, missed urgency, and inappropriate recommendations). We evaluated 13 state-of-the-art LLMs and revealed alarming gaps: current models show approximately 60% failure rates on the women's health benchmark, with performance varying dramatically across specialties and error types. Notably, models universally struggle with "missed urgency" indicators, while newer models like GPT-5 show significant improvements in avoiding inappropriate recommendations. Our findings underscore that AI chatbots are not yet fully capable of providing reliable advice in women's health.
[2] Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL
Khushboo Thaker,Yony Bresler
Main category: cs.CL
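TL;DR: This paper proposes Struct-SQL, a knowledge distillation framework based on structured reasoning representations that improves small language models on Text-to-SQL. Using query execution plans as the source of structured chain-of-thought, the method improves accuracy by 8.1% (absolute) over an unstructured distillation baseline and markedly reduces syntactic errors.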
Details
Motivation: Enterprise Text-to-SQL systems face a three-way trade-off among cost, security, and performance. Existing solutions either depend on expensive large models or use underperforming small models. In addition, current knowledge distillation methods mostly use unstructured chain-of-thought, which gives an ambiguous teaching signal. A clearer, more reliable training signal is therefore needed to improve small models.
Method: Proposes the Struct-SQL framework, which uses query execution plans generated by a large language model to build structured chain-of-thought as a reasoning blueprint, and trains a small language model via knowledge distillation to imitate this structured reasoning.
Result: On Text-to-SQL, the method achieves an absolute accuracy improvement of 8.1% over an unstructured chain-of-thought distillation baseline; error analysis shows a marked reduction in syntactic errors.
Conclusion: Structured reasoning representations give small models a more effective learning signal. This supports formal, structured chain-of-thought as a way to improve SQL generation in small models and offers a feasible path to efficient, secure enterprise deployment.
Abstract: Deploying accurate Text-to-SQL systems at the enterprise level faces a difficult trilemma involving cost, security and performance. Current solutions force enterprises to choose between expensive, proprietary Large Language Models (LLMs) and low-performing Small Language Models (SLMs). Efforts to improve SLMs often rely on distilling reasoning from large LLMs using unstructured Chain-of-Thought (CoT) traces, a process that remains inherently ambiguous. Instead, we hypothesize that a formal, structured reasoning representation provides a clearer, more reliable teaching signal, as the Text-to-SQL task requires explicit and precise logical steps. To evaluate this hypothesis, we propose Struct-SQL, a novel Knowledge Distillation (KD) framework that trains an SLM to emulate a powerful large LLM. Consequently, we adopt a query execution plan as a formal blueprint to derive this structured reasoning. Our SLM, distilled with structured CoT, achieves an absolute improvement of 8.1% over an unstructured CoT distillation baseline. A detailed error analysis reveals that a key factor in this gain is a marked reduction in syntactic errors. This demonstrates that teaching a model to reason using a structured logical blueprint is beneficial for reliable SQL generation in SLMs.
[3] XLM: A Python package for non-autoregressive language models
Dhruvesh Patel,Durga Prasad Maram,Sai Sreenivas Chintha,Benjamin Rozonoyer,Andrew McCallum
Main category: cs.CL
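TL;DR: Presents the XLM Python package, which simplifies the implementation of non-autoregressive language models and provides a suite of pre-trained models to facilitate research.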
Details
Motivation: Non-autoregressive text generation has recently attracted attention, but the lack of standardized tooling makes systematic comparison across methods and component reuse difficult.
Method: Develops the XLM package, which unifies data processing, loss computation, and prediction logic, supporting fast implementation of small non-autoregressive language models.
Result: Delivers a tool for quickly building non-autoregressive models, with reusable pre-trained models provided through the xlm-models package.
Conclusion: XLM lowers the barrier to research on non-autoregressive language modeling and promotes standardization and reproducibility in the field.
Abstract: In recent years, there has been a resurgence of interest in non-autoregressive text generation in the context of general language modeling. Unlike the well-established autoregressive language modeling paradigm, which has a plethora of standard training and inference libraries, implementations of non-autoregressive language modeling have largely been bespoke, making it difficult to perform systematic comparisons of different methods. Moreover, each non-autoregressive language model typically requires its own data collation, loss, and prediction logic, making it challenging to reuse common components. In this work, we present the XLM Python package, which is designed to make implementing small non-autoregressive language models faster, with a secondary goal of providing a suite of small pre-trained models (through a companion xlm-models package) that can be used by the research community. The code is available at https://github.com/dhruvdcoder/xlm-core.
[4] Perturb Your Data: Paraphrase-Guided Training Data Watermarking
Pranav Shetty,Mirazul Haque,Petr Babkin,Zhiqiang Ma,Xiaomo Liu,Manuela Veloso
Main category: cs.CL
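TL;DR: SPECTRA is a watermarking technique that paraphrases text with a large language model (LLM) and scores the paraphrases with a separate scoring model, enabling reliable detection of training data even when it makes up a very small share (less than 0.001%) of the training corpus.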
Details
Motivation: Because LLMs are typically trained on massive text corpora scraped from the internet, enforcing copyright and data licenses for training data has become critical. An effective method is needed to detect whether a model was trained on specific data.
Method: SPECTRA paraphrases the original text with an LLM and selects a paraphrase whose score is close to that of the original, avoiding distribution shift. It then compares the suspect model's token probabilities with those of the scoring model to determine whether the model was trained on the watermarked data.
Result: SPECTRA achieves a p-value gap of more than nine orders of magnitude between data used for training and data not used for training, outperforming all baselines tested.
Conclusion: SPECTRA gives data owners a scalable watermarking scheme that can be deployed before release and survives large-scale LLM training, effectively protecting data copyright and licensing.
Abstract: Training data detection is critical for enforcing copyright and data licensing, as Large Language Models (LLMs) are trained on massive text corpora scraped from the internet. We present SPECTRA, a watermarking approach that makes training data reliably detectable even when it comprises less than 0.001% of the training corpus. SPECTRA works by paraphrasing text using an LLM and assigning a score based on how likely each paraphrase is, according to a separate scoring model. A paraphrase is chosen so that its score closely matches that of the original text, to avoid introducing any distribution shifts. To test whether a suspect model has been trained on the watermarked data, we compare its token probabilities against those of the scoring model. We demonstrate that SPECTRA achieves a consistent p-value gap of over nine orders of magnitude when detecting data used for training versus data not used for training, which is greater than all baselines tested. SPECTRA equips data owners with a scalable, deploy-before-release watermark that survives even large-scale LLM training.
[5] When F1 Fails: Granularity-Aware Evaluation for Dialogue Topic Segmentation
Michael H. Coen
Main category: cs.CL
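TL;DR: This paper proposes a new evaluation objective for dialogue topic segmentation that emphasizes boundary density and segment coherence. It argues that current F1-based evaluation is biased and that observed performance differences stem mostly from inconsistent annotation granularity rather than model quality.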
Details
Motivation: Existing evaluation of dialogue topic segmentation relies too heavily on strict boundary matching and F1 metrics, which fail to reflect how LLM systems need to manage context in practice, distorting evaluation results.
Method: Introduces an evaluation framework centered on boundary density and segment coherence, combined with window-tolerant F1 (W-F1), and evaluates multiple segmentation strategies across eight datasets covering a range of dialogue types.
Result: Experiments show that performance differences between models stem mainly from annotation granularity mismatches and sparse boundary labels rather than model capability. High coherence combined with oversegmentation is widespread, which depresses conventional F1 scores.
Conclusion: Dialogue topic segmentation should be treated as choosing an appropriate granularity rather than predicting a single correct boundary set, so boundary scoring should be separated from boundary selection.
Abstract: Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of prior work, evaluation practice in dialogue topic segmentation remains dominated by strict boundary matching and F1-based metrics, even as modern LLM-based conversational systems increasingly rely on segmentation to manage conversation history beyond the model's fixed context window, where unstructured context accumulation degrades efficiency and coherence. This paper introduces an evaluation objective for dialogue topic segmentation that treats boundary density and segment coherence as primary criteria, alongside window-tolerant F1 (W-F1). Through extensive cross-dataset empirical evaluation, we show that reported performance differences across dialogue segmentation benchmarks are driven not by model quality, but by annotation granularity mismatches and sparse boundary labels. This indicates that many reported improvements arise from evaluation artifacts rather than improved boundary detection. We evaluated multiple, structurally distinct dialogue segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Across these settings, we observe high segment coherence combined with extreme oversegmentation relative to sparse labels, producing misleadingly low exact-match F1 scores. We show that topic segmentation is best understood as selecting an appropriate granularity rather than predicting a single correct boundary set. We operationalize this view by explicitly separating boundary scoring from boundary selection.
cs.CV
[6] V-Agent: An Interactive Video Search System Using Vision-Language Models
SunYoung Park,Jong-Hyeon Lee,Youngjune Kim,Daegyu Sung,Younghyun Yu,Young-rok Cha,Jeongho Ju
Main category: cs.CV
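TL;DR: V-Agent is a multi-agent video search and interaction system. By fine-tuning a vision-language model and combining it with retrieval vectors, it performs context-aware retrieval over video content and spoken transcripts, achieving state-of-the-art zero-shot performance on MultiVENT 2.0.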