cs.CL

[1] OpenStaxQA: A multilingual dataset based on open-source college textbooks

Pranav Gupta

Main category: cs.CL

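TL;DR: This paper presents OpenStaxQA, an evaluation benchmark built from 43 open-source college textbooks for assessing large language models in higher-education applications. Models with roughly 7 billion parameters are fine-tuned and evaluated using quantized low-rank adapters (QLoRa), and the paper also examines zero-shot transfer to the AI2 Reasoning Challenge and the dataset's broader impacts.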

Details

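Motivation: Evaluating large language models on college-level educational tasks calls for a benchmark that is built from educational content, multilingual, and openly accessible.

Method: The OpenStaxQA dataset is built from 43 openly licensed college textbooks, and large language models with roughly 7 billion parameters are fine-tuned on it using QLoRa. A zero-shot evaluation on the AI2 Reasoning Challenge dev set checks transfer performance.

Result: The work produces OpenStaxQA as a multilingual educational evaluation benchmark. The fine-tuned models perform well on the dataset and show some zero-shot transfer to the AI2 Reasoning Challenge task.

Conclusion: OpenStaxQA is an effective, practically relevant education-oriented benchmark that can help advance large language models in higher-education settings, with good extensibility and social impact.

Abstract: We present OpenStaxQA, an evaluation benchmark specific to college-level educational applications based on 43 open-source college textbooks in English, Spanish, and Polish, available under a permissive Creative Commons license. We finetune and evaluate large language models (LLMs) with approximately 7 billion parameters on this dataset using quantized low rank adapters (QLoRa). Additionally, we perform a zero-shot evaluation on the AI2 reasoning challenge dev dataset in order to check if OpenStaxQA can lead to an improved performance on other tasks. We also discuss broader impacts relevant to datasets such as OpenStaxQA.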

[2] Knowledge Graph-Guided Multi-Agent Distillation for Reliable Industrial Question Answering with Datasets

Jiqun Pan,Zhenke Duan,Jiani Tu,Anzhi Cheng,Yanqing Wang

Main category: cs.CL

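TL;DR: Proposes Knowledge Graph-guided Multi-Agent System Distillation (KG-MASD) to improve the safety and reliability of lightweight models in industrial question answering.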

Details

Motivation: Industrial QA systems demand high safety and reliability. Conventional methods struggle to transfer the collaborative reasoning of multi-agent systems to lightweight models, and the multi-agent reasoning process is hard to control while its outputs are hard to verify.

Method: Distillation is formulated as a Markov Decision Process, with a knowledge graph serving as a verifiable structured prior that enriches the state representation and ensures convergence. The approach generates high-confidence instruction-tuning data and jointly distills reasoning depth and verifiability.

Result: On an industrial QA dataset, KG-MASD improves accuracy by 2.4% to 20.1% over baselines and significantly improves reliability.

Conclusion: KG-MASD transfers the collaborative reasoning and knowledge verification of multi-agent systems to compact models, supporting trustworthy AI deployment in safety-critical settings.

Abstract: Industrial question-answering (QA) systems require higher safety and reliability than general-purpose dialogue models, as errors in high-risk scenarios such as equipment fault diagnosis can have severe consequences. Although multi-agent large language models enhance reasoning depth, they suffer from uncontrolled iterations and unverifiable outputs, and conventional distillation methods struggle to transfer collaborative reasoning capabilities to lightweight, deployable student models. To address these challenges, we propose Knowledge Graph-guided Multi-Agent System Distillation (KG-MASD). Our approach formulates distillation as a Markov Decision Process and incorporates a knowledge graph as a verifiable structured prior to enrich state representation and ensure convergence. By integrating collaborative reasoning with knowledge grounding, KG-MASD generates high-confidence instruction-tuning data and jointly distills reasoning depth and verifiability into compact student models suitable for edge deployment. Experiments on an industrial QA dataset show that KG-MASD improves accuracy by 2.4% to 20.1% over baselines and significantly enhances reliability, enabling trustworthy AI deployment in safety-critical industrial scenarios. Code and data are available at https://github.com/erwinmsmith/KG-MAD/.

[3] Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses

Subin An,Yugyeong Ji,Junyoung Kim,Heejin Kook,Yang Lu,Josh Seltzer

Main category: cs.CL

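TL;DR: Proposes a two-stage evaluation framework designed for human survey responses, combining gibberish filtering with LLM-based scoring along three dimensions (effort, relevance, and completeness). On English and Korean datasets it outperforms existing metrics and agrees closely with expert assessment.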

Details

Motivation: Existing automatic evaluation methods mainly target LLM-generated text and cannot adequately assess human-written survey responses, which have distinct characteristics. Low-quality responses can therefore lead to misleading conclusions.

Method: A two-stage evaluation framework is proposed. The first stage filters out gibberish responses. The second stage uses a large language model (LLM) to score each response on effort, relevance, and completeness, with the design grounded in empirical analysis of real-world survey data.

Result: Validation on English and Korean datasets shows that the framework outperforms existing evaluation metrics, is highly practical for applications such as response quality prediction and response rejection, and correlates strongly with expert assessment.

Conclusion: The proposed two-stage framework effectively evaluates the quality of human-written open-ended survey responses, with good practical value and cross-lingual applicability.

Abstract: Open-ended survey responses provide valuable insights in marketing research, but low-quality responses not only burden researchers with manual filtering but also risk leading to misleading conclusions, underscoring the need for effective evaluation. Existing automatic evaluation methods target LLM-generated text and inadequately assess human-written responses with their distinct characteristics. To address such characteristics, we propose a two-stage evaluation framework specifically designed for human survey responses. First, gibberish filtering removes nonsensical responses. Then, three dimensions (effort, relevance, and completeness) are evaluated using LLM capabilities, grounded in empirical analysis of real-world survey data. Validation on English and Korean datasets shows that our framework not only outperforms existing metrics but also demonstrates high practical applicability for real-world applications such as response quality prediction and response rejection, showing strong correlations with expert assessment.

[4] CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning

Qihua Dong,Luis Figueroa,Handong Zhao,Kushal Kafle,Jason Kuen,Zhihong Ding,Scott Cohen,Yun Fu

Main category: cs.CL

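TL;DR: Proposes CoT Referring, which uses a chain-of-thought training structure to improve the cross-modal reasoning of multimodal large models on referring expression comprehension and segmentation.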

Details

Motivation: To improve the reasoning of multimodal large language models on complex referring expression comprehension and segmentation, where existing methods underperform on complex queries.

Method: A chain-of-thought (CoT) training data structure systematically parses the text into a step-by-step referring process, with each step identifying relationships and keeping references consistent. The training data is restructured with new annotations, detection and segmentation are integrated into a unified framework, and training uses an adaptive weighted loss.

Result: Experiments on a curated benchmark of complex referring cases and on RefCOCO/+/g show gains of more than 2.5% over baseline models.

Conclusion: CoT Referring strengthens the cross-modal reasoning of multimodal models and clearly outperforms existing methods on complex referring expression comprehension and segmentation.

Abstract: Referring Expression Comprehension and Segmentation are critical tasks for assessing the integration of language understanding and image comprehension, serving as benchmarks for Multimodal Large Language Models (MLLMs) capabilities. To address these challenges, we propose a new strategy, CoT Referring, which enhances model reasoning across modalities through a structured, chain-of-thought training data structure. Our approach systematically parses textual structures into sequential referring steps, where each step identifies relationships and ensures consistent reference alignment, thereby improving accuracy in complex query scenarios. We restructure the training data to enforce a new output form, providing new annotations for existing datasets and compiling an evaluation benchmark from existing resources. This benchmark is designed explicitly for complex referring cases. We also integrate detection and segmentation capabilities into a unified MLLM framework, training it with a novel adaptive weighted loss to optimize performance. Experimental results on our curated benchmark and RefCOCO/+/g demonstrate the effectiveness of our approach, with a notable increase of 2.5%+ over baseline models.

[5] Evaluating Embedding Frameworks for Scientific Domain

Nouman Ahmed,Ronin Wu,Victor Botev

Main category: cs.CL

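TL;DR: This paper studies which word representation and tokenization methods work best in the scientific domain and builds a comprehensive evaluation suite for comparing algorithms on downstream NLP tasks.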

Details

Motivation: In a specific domain such as science, the same word can carry different meanings, so the domain needs suitable word representations. Generative AI and transformer models work well but are computationally expensive.

Method: An evaluation suite consisting of several downstream tasks and matching datasets is built and then used to test a range of word representation and tokenization algorithms.

Result: The work provides a systematic evaluation framework for word representation and tokenization methods in the scientific domain and compares different algorithms experimentally.

Conclusion: The evaluation suite supports the future development and comparison of word representation algorithms for the scientific domain and offers a route to efficient, domain-adapted representations.

Abstract: Finding an optimal word representation algorithm is particularly important in terms of domain specific data, as the same word can have different meanings and hence, different representations depending on the domain and context. While Generative AI and transformer architectures do a great job at generating contextualized embeddings for any given word, they are quite time- and compute-intensive, especially if we were to pre-train such a model from scratch. In this work, we focus on the scientific domain and finding the optimal word representation algorithm along with the tokenization method that could be used to represent words in the scientific domain. The goal of this research is twofold: 1) finding the optimal word representation and tokenization methods that can be used in downstream scientific domain NLP tasks, and 2) building a comprehensive evaluation suite that could be used to evaluate various word representation and tokenization algorithms (even as new ones are introduced) in the scientific domain. To this end, we build an evaluation suite consisting of several downstream tasks and relevant datasets for each task. Furthermore, we use the constructed evaluation suite to test various word representation and tokenization algorithms.

[6] TRepLiNa: Layer-wise CKA+REPINA Alignment Improves Low-Resource Machine Translation in Aya-23 8B

Toshiki Nakai,Ravi Kiran Chikkala,Lena Sophie Oberkircher,Nicholas Jennings,Natalia Skachkova,Tatiana Anikina,Jesujoba Oluwadara Alabi

Main category: cs.CL

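TL;DR: This study proposes TRepLiNa (CKA+REPINA), which enforces cross-lingual similarity inside a decoder-only multilingual large language model to improve translation from low-resource to high-resource languages, performing especially well when data is scarce.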

Details

Motivation: To address the lack of language resources for India's many low-resource languages and narrow the linguistic gap.

Method: Centered Kernel Alignment (CKA) is combined with REPINA regularization. Zero-shot, few-shot, and fine-tuning experiments are run on Aya-23 8B with QLoRA to explore cross-lingual representation alignment at different layers.

Result: On the MMLoSo shared task language pairs (Mundari, Santali, Bhili, with Hindi/English pivots), aligning mid-level layers with TRepLiNa improves translation performance, especially in data-scarce settings.

Conclusion: TRepLiNa is a low-cost, practical method that improves translation quality for low-resource languages when resources are limited.

Abstract: The 2025 Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo) Language Challenge addresses one of India's most pressing linguistic gaps: the lack of resources for its diverse low-resource languages (LRLs). In this study, we investigate whether enforcing cross-lingual similarity in specific internal layers of a decoder-only multilingual large language model (LLM) can improve translation quality from LRL to high-resource language (HRL). Specifically, we combine Centered Kernel Alignment (CKA), a similarity metric that encourages representations of different languages to align, with REPINA, a regularization method that constrains parameter updates to remain close to the pretrained model, into a joint method we call TRepLiNa. In this research project, we experiment with zero-shot, few-shot, and fine-tuning settings using Aya-23 8B with QLoRA across MMLoSo shared task language pairs (Mundari, Santali, Bhili) with Hindi/English pivots. Our results show that aligning mid-level layers using TRepLiNa (CKA+REPINA) is a low-cost, practical approach to improving LRL translation, especially in data-scarce settings.
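The CKA term named in this entry can be illustrated with a minimal sketch. It assumes the linear-kernel form of CKA and uses toy matrices; the abstract does not say which kernel TRepLiNa uses or how the score enters the training loss, and the function names (`center_columns`, `cross_frobenius_sq`, `linear_cka`) are hypothetical.

```python
import math


def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_frobenius_sq(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices with equal row counts."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two representation matrices (rows = the same examples).

    Assumes neither matrix is constant across rows; otherwise the denominator is zero.
    """
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_frobenius_sq(xc, yc)
    denominator = math.sqrt(cross_frobenius_sq(xc, xc) * cross_frobenius_sq(yc, yc))
    return numerator / denominator


# Toy "hidden states" for four parallel sentences in two languages (made-up values).
source_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
target_states = [[-r[1], r[0]] for r in source_states]  # a rotated copy of the source

print(linear_cka(source_states, target_states))
```

Linear CKA is invariant to rotations and uniform scaling of either representation, which is why the rotated copy scores as fully aligned. A training objective would presumably reward higher CKA between paired sentences at a chosen layer, but that wiring is not specified in the abstract.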

[7] Scalable multilingual PII annotation for responsible AI in LLMs

Bharti Meena,Joanna Skubisz,Harshit Rajgarhia,Nand Dave,Kiran Ganesh,Shivali Dalmia,Abhishek Mukherji,Vasudevan Sundarababu,Olga Pospelova

Main category: cs.CL

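TL;DR: Proposes a scalable multilingual data curation framework for high-quality annotation of approximately 336 locale-specific PII types across 13 underrepresented locales, using human-in-the-loop annotation and iterative refinement to improve annotation quality and model reliability.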

Details

Motivation: As large language models are widely adopted, they must handle personally identifiable information (PII) reliably under different regulatory regimes. High-quality PII annotation data is especially scarce in multilingual, multi-locale settings.

Method: A phased, human-in-the-loop annotation methodology combines linguistic expertise with rigorous quality assurance, using inter-annotator agreement metrics and root-cause analysis to find and resolve annotation inconsistencies.

Result: Recall rose and false positives fell substantially across the pilot, training, and production phases, yielding high-fidelity datasets suitable for supervised LLM fine-tuning.

Conclusion: The framework improves the quality of multilingual PII annotation and shows how an iterative, data-driven pipeline can strengthen downstream model reliability, offering a scalable approach to PII handling across locales.

Abstract: As Large Language Models (LLMs) gain wider adoption, ensuring their reliable handling of Personally Identifiable Information (PII) across diverse regulatory contexts has become essential. This work introduces a scalable multilingual data curation framework designed for high-quality PII annotation across 13 underrepresented locales, covering approximately 336 locale-specific PII types. Our phased, human-in-the-loop annotation methodology combines linguistic expertise with rigorous quality assurance, leading to substantial improvements in recall and false positive rates across the pilot, training, and production phases. By leveraging inter-annotator agreement metrics and root-cause analysis, the framework systematically uncovers and resolves annotation inconsistencies, resulting in high-fidelity datasets suitable for supervised LLM fine-tuning. Beyond reporting empirical gains, we highlight common annotator challenges in multilingual PII labeling and demonstrate how iterative, analytics-driven pipelines can enhance both annotation quality and downstream model reliability.

[8] Prakriti200: A Questionnaire-Based Dataset of 200 Ayurvedic Prakriti Assessments

Aryan Kumar Singh,Janvi Singh

Main category: cs.CL

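TL;DR: This work provides a dataset of responses to a standardized, bilingual (English-Hindi) Prakriti assessment questionnaire based on Ayurvedic principles. The questionnaire has 24 multiple-choice items covering physical, physiological, and psychological characteristics, and data was collected through Google Forms for automated scoring and analysis.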

Details

Motivation: To build a standardized, unbiased Prakriti assessment dataset grounded in classical Ayurvedic theory, supporting personalized health analytics and computational intelligence research.

Method: A 24-item multiple-choice questionnaire designed to AYUSH/CCRAS guidelines, with all questions mandatory and neutrally phrased and dosha labels hidden to reduce bias, deployed through Google Forms for automated data collection and scoring.

Result: A structured Prakriti dataset was collected that maps individual traits to Vata, Pitta, and Kapha dosha scores and supports analysis of trait distributions, correlations, and predictive modeling.

Conclusion: The dataset offers a reliable, extensible foundation for Ayurvedic research, personalized health analytics, and the development of intelligent health applications.

Abstract: This dataset provides responses to a standardized, bilingual (English-Hindi) Prakriti Assessment Questionnaire designed to evaluate the physical, physiological, and psychological characteristics of individuals according to classical Ayurvedic principles. The questionnaire consists of 24 multiple-choice items covering body features, appetite, sleep patterns, energy levels, and temperament. It was developed following AYUSH/CCRAS guidelines to ensure comprehensive and accurate data collection. All questions are mandatory and neutrally phrased to minimize bias, and dosha labels (Vata, Pitta, Kapha) are hidden from participants. Data were collected via a Google Forms deployment, enabling automated scoring of responses to map individual traits to dosha-specific scores. The resulting dataset provides a structured platform for research in computational intelligence, Ayurvedic studies, and personalized health analytics, supporting analysis of trait distributions, correlations, and predictive modeling. It can also serve as a reference for future Prakriti-based studies and the development of intelligent health applications.

[9] Dual-stage and Lightweight Patient Chart Summarization for Emergency Physicians

Jiajun Wu,Swaleh Zaidi,Braden Teitge,Henry Leung,Jiayu Zhou,Jessalyn Holodinsky,Steve Drew

Main category: cs.CL

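TL;DR: Proposes a fully offline, dual-device embedded system for clinical summarization of electronic health records that balances privacy protection with real-time response.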

Details

Motivation: Electronic health records contain so much unstructured data that emergency physicians struggle to find critical information quickly, and existing approaches rely on cloud processing, which risks privacy leaks.

Method: A dual Jetson Nano architecture: Nano-R retrieves the record sections relevant to a query, and Nano-S runs a small language model under 7B parameters to generate a structured summary. The two stages communicate over a local socket link, and an LLM-as-Judge mechanism assesses summary quality.

Result: Evaluated on MIMIC-IV and de-identified real EHR data, the system produces a factually accurate, complete, and clear two-part summary (a list of critical findings plus a query-focused narrative) in under 30 seconds.

Conclusion: The fully offline system can support clinical decision-making in resource-constrained environments while balancing performance, privacy, and response time.

Abstract: Electronic health records (EHRs) contain extensive unstructured clinical data that can overwhelm emergency physicians trying to identify critical information. We present a two-stage summarization system that runs entirely on embedded devices, enabling offline clinical summarization while preserving patient privacy. In our approach, a dual-device architecture first retrieves relevant patient record sections using the Jetson Nano-R (Retrieve), then generates a structured summary on another Jetson Nano-S (Summarize), communicating via a lightweight socket link. The summarization output is two-fold: (1) a fixed-format list of critical findings, and (2) a context-specific narrative focused on the clinician's query. The retrieval stage uses locally stored EHRs, splits long notes into semantically coherent sections, and searches for the most relevant sections per query. The generation stage uses a locally hosted small language model (SLM) to produce the summary from the retrieved text, operating within the constraints of two NVIDIA Jetson devices. We first benchmarked six open-source SLMs under 7B parameters to identify viable models. We incorporated an LLM-as-Judge evaluation mechanism to assess summary quality in terms of factual accuracy, completeness, and clarity. Preliminary results on MIMIC-IV and de-identified real EHRs demonstrate that our fully offline system can effectively produce useful summaries in under 30 seconds.

[10] A Comprehensive Survey of Hallucination in Large Language Models: Causes, Detection, and Mitigation

Aisha Alansari,Hamzah Luqman

Main category: cs.CL

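TL;DR: This survey reviews research on hallucination in large language models (LLMs), covering a taxonomy of hallucination types, their causes, and detection and mitigation strategies. It analyzes the strengths and weaknesses of existing methods and evaluation benchmarks and outlines directions for future research.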

Details

Motivation: Large language models often generate factually wrong or unsupported content ("hallucinations"), which seriously undermines their reliability and trustworthiness, especially in domains that require factual accuracy. The problem needs to be understood and addressed systematically.

Method: The survey presents a taxonomy of hallucination types and analyzes root causes across the whole lifecycle, from data collection and architecture design to inference. It builds structured taxonomies of detection approaches and mitigation strategies and assesses the strengths and limitations of existing methods along with the benchmarks and metrics they use.

Result: The survey maps out how hallucinations appear in key natural language generation tasks, summarizes current mainstream detection and mitigation methods, and points out the limitations of existing evaluation benchmarks.

Conclusion: The work offers a comprehensive framework for understanding and addressing hallucination in LLMs and argues that further work is needed on mechanistic understanding, standardized evaluation, and real-world reliability to build more truthful and trustworthy LLMs.

Abstract: Large language models (LLMs) have transformed natural language processing, achieving remarkable performance across diverse tasks. However, their impressive fluency often comes at the cost of producing false or fabricated information, a phenomenon known as hallucination. Hallucination refers to the generation of content by an LLM that is fluent and syntactically correct but factually inaccurate or unsupported by external evidence. Hallucinations undermine the reliability and trustworthiness of LLMs, especially in domains requiring factual accuracy. This survey provides a comprehensive review of research on hallucination in LLMs, with a focus on causes, detection, and mitigation. We first present a taxonomy of hallucination types and analyze their root causes across the entire LLM development lifecycle, from data collection and architecture design to inference. We further examine how hallucinations emerge in key natural language generation tasks. Building on this foundation, we introduce a structured taxonomy of detection approaches and another taxonomy of mitigation strategies. We also analyze the strengths and limitations of current detection and mitigation approaches and review existing evaluation benchmarks and metrics used to quantify LLM hallucinations. Finally, we outline key open challenges and promising directions for future research, providing a foundation for the development of more truthful and trustworthy LLMs.

[11] Language models for longitudinal analysis of abusive content in Billboard Music Charts

Rohitash Chandra,Yathin Suresh,Divyansh Raj Sinha,Sanchit Jindal

Main category: cs.CL

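TL;DR: This study applies deep learning to a longitudinal analysis of lyrics from the US Billboard charts over the past seven decades, finding a marked increase in profane, sexually explicit, and otherwise inappropriate content in popular music since 1990.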

Details

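Motivation: Few systematic studies validate the trend of increasing abusive and sexually explicit content in music. This gap hampers policy development, particularly because such content may negatively affect the behaviour of young people.

Method: Deep learning and language models are applied to lyrics of Billboard chart songs for sentiment analysis and abuse detection, including detection of sexually explicit content, in a longitudinal study spanning seven decades.

Result: Explicit content in popular music rises significantly from 1990 onwards, and the share of songs containing profane, sexually explicit, and otherwise inappropriate language keeps growing. The language models capture nuanced changes in lyrics that reflect shifts in societal norms and language use.

Conclusion: The growth of inappropriate content in popular music is clear, and data-driven studies of this kind are needed to develop more effective regulation that protects young people.

Abstract: There is no doubt that there has been a drastic increase in abusive and sexually explicit content in music, particularly in Billboard Music Charts. However, there is a lack of studies that validate the trend for effective policy development, as such content can cause harmful behavioural changes in children and youths. In this study, we utilise deep learning methods to analyse songs (lyrics) from Billboard Charts of the United States in the last seven decades. We provide a longitudinal study using deep learning and language models and review the evolution of content using sentiment analysis and abuse detection, including sexually explicit content. Our results show a significant rise in explicit content in popular music from 1990 onwards. Furthermore, we find an increasing prevalence of songs with lyrics containing profane, sexually explicit, and otherwise inappropriate language. The longitudinal analysis demonstrates the ability of language models to capture nuanced patterns in lyrical content, reflecting shifts in societal norms and language use over time.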

[12] Reproducibility Study of "XRec: Large Language Models for Explainable Recommendation"

Ranjan Mishra,Julian I. Bibo,Quinten van Engelen,Henk Schaapman

Main category: cs.CL

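TL;DR: This study reproduces the XRec framework of Ma et al. (2024), using Llama 3 instead of GPT-3.5-turbo for evaluation, and modifies or deletes embeddings in XRec's Mixture of Experts module to probe their effect. XRec generates personalized explanations effectively and collaborative information improves its stability, but it does not consistently outperform all baselines on every metric. The authors release their code to support reproducibility.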

Details

Motivation: To reproduce the XRec framework and verify its effectiveness with a different large language model (Llama 3), and to explore the role of the embeddings in its Mixture of Experts module, improving interpretability and reproducibility.

Method: Building on the original authors' code, XRec is reproduced with Llama 3 as the base LLM. The Mixture of Experts module is altered by modifying its input embeddings or deleting its output embeddings, and the variants are evaluated on the recommendation explanation task.

Result: XRec generates personalized recommendation explanations effectively, and collaborative information improves model stability. It does not consistently outperform all baselines on every metric. Ablations show that the Mixture of Experts embeddings strongly shape the structure of the explanations.

Conclusion: XRec transfers well to Llama 3 and retains its explanation-generation ability, and combining collaborative signals with the language model is essential. The design of the Mixture of Experts embeddings significantly affects performance, so future improvements should focus on their structure and fusion.

Abstract: In this study, we reproduced the work done in the paper "XRec: Large Language Models for Explainable Recommendation" by Ma et al. (2024). The original authors introduced XRec, a model-agnostic collaborative instruction-tuning framework that enables large language models (LLMs) to provide users with comprehensive explanations of generated recommendations. Our objective was to replicate the results of the original paper, albeit using Llama 3 as the LLM for evaluation instead of GPT-3.5-turbo. We built on the source code provided by Ma et al. (2024) to achieve our goal. Our work extends the original paper by modifying the input embeddings or deleting the output embeddings of XRec's Mixture of Experts module. Based on our results, XRec effectively generates personalized explanations and its stability is improved by incorporating collaborative information. However, XRec did not consistently outperform all baseline models in every metric. Our extended analysis further highlights the importance of the Mixture of Experts embeddings in shaping the explanation structures, showcasing how collaborative signals interact with language modeling. Through our work, we provide an open-source evaluation implementation that enhances accessibility for researchers and practitioners alike. Our complete code repository can be found at https://github.com/julianbibo/xrec-reproducibility.

[13] Type and Complexity Signals in Multilingual Question Representations

Robin Kokot,Wessel Poelman

Main category: cs.CL

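TL;DR: This paper studies how a multilingual transformer model represents the morphosyntactic properties of questions. It introduces the Question Type and Complexity (QTC) dataset covering seven languages and extends probing to regression labels with selectivity controls to compare how well different models capture syntactic complexity.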

Details

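Motivation: To examine how well multilingual transformer models represent the syntactic structure of questions across languages, and in particular how morphosyntactic properties and structural complexity are encoded.

Method: The QTC dataset is built with question type and complexity metrics (dependency length, tree depth, lexical density) annotated for seven languages. Layer-wise probes on frozen Glot500-m representations are compared against subword TF-IDF baselines and a fine-tuned model, using regression probes with selectivity controls.

Result: Statistical features classify question types effectively in languages with explicit marking, while neural probes capture fine-grained structural complexity patterns better. Contextual representations outperform statistical baselines in some cases, and parameter updates may reduce the availability of pre-trained linguistic information.

Conclusion: Contextual representations capture cross-lingual syntactic complexity, but their advantage depends on the properties of the language, and fine-tuning may weaken the linguistic information present in the pre-trained model.

Abstract: This work investigates how a multilingual transformer model represents morphosyntactic properties of questions. We introduce the Question Type and Complexity (QTC) dataset with sentences across seven languages, annotated with type information and complexity metrics including dependency length, tree depth, and lexical density. Our evaluation extends probing methods to regression labels with selectivity controls to quantify gains in generalizability. We compare layer-wise probes on frozen Glot500-m (Imani et al., 2023) representations against subword TF-IDF baselines, and a fine-tuned model. Results show that statistical features classify questions effectively in languages with explicit marking, while neural probes capture fine-grained structural complexity patterns better. We use these results to evaluate when contextual representations outperform statistical baselines and whether parameter updates reduce the availability of pre-trained linguistic information.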
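Two of the complexity metrics named in this entry, tree depth and dependency length, can be sketched from a dependency parse. The example assumes a CoNLL-U-style head list (1-based head indices, 0 for the root) and a well-formed tree; the exact definitions used for QTC are not given in the abstract, so the depth convention and the averaging below are assumptions, and the function names are hypothetical.

```python
def tree_depth(heads):
    """Longest path from any token up to the root, counting the root as depth 1.

    heads[i] is the 1-based index of the head of token i+1; 0 marks the root.
    Assumes the heads form a tree (no cycles).
    """
    def depth(token):
        steps = 1
        while heads[token - 1] != 0:
            token = heads[token - 1]
            steps += 1
        return steps

    return max(depth(t) for t in range(1, len(heads) + 1))


def mean_dependency_length(heads):
    """Average linear distance between each non-root token and its head."""
    distances = [abs(i - h) for i, h in enumerate(heads, start=1) if h != 0]
    return sum(distances) / len(distances) if distances else 0.0


# "What did you eat ?" with "eat" (token 4) as the root and every other token
# attached directly to it.
heads = [4, 4, 4, 0, 4]
print(tree_depth(heads))              # 2
print(mean_dependency_length(heads))  # 1.75
```

Values like these could serve as regression targets for a probe, which is the setting the entry describes.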

[14] LLM Bias Detection and Mitigation through the Lens of Desired Distributions

Ingroj Shrestha,Padmini Srinivasan

Main category: cs.CL

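TL;DR: This paper proposes a fine-tuning method based on a weighted adaptive loss that aligns a large language model's gender-profession output distribution with a target distribution (equal or real-world). Validated on several models, it reduces bias substantially.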

Details

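Motivation: Existing bias mitigation work mostly targets social equality and pays less attention to aligning LLM outputs with a desired distribution, such as the real-world one. This paper fills the gap by defining bias as deviation from a target distribution.

Method: A fine-tuning method with a weighted adaptive loss aligns the LLM's gender-profession output distribution with the target distribution while preserving language modeling capability. Three profession sets (male-dominated, female-dominated, and gender-balanced) are built from U.S. labor statistics, and both the adaptive method (reflecting reality) and a non-adaptive variant (promoting equality) are evaluated.

Result: Across three masked language models, bias is observed under both distributions. The method nearly eliminates bias under the equality target and reduces it by 30-75% under the real-world target. Autoregressive LLMs show no bias under equality but notable bias under the real-world distribution, where the Llama Instruct models achieve a 50-62% reduction.

Conclusion: Defining bias as deviation from a desired output distribution and fine-tuning adaptively can mitigate gender-profession bias in LLMs while accommodating both factual grounding and fairness goals.

Abstract: Although prior work on bias mitigation has focused on promoting social equality and demographic parity, less attention has been given to aligning LLMs' outputs to desired distributions. For example, we might want to align a model with real-world distributions to support factual grounding. Thus, we define bias as deviation from a desired distribution, which may be an equal or real-world distribution, depending on application goals. We propose a weighted adaptive loss based fine-tuning method that aligns an LLM's gender-profession output distribution with the desired distribution, while preserving language modeling capability. Using three profession sets (male-dominated, female-dominated, and gender-balanced) derived from U.S. labor statistics (2024), we assess both our adaptive method for reflecting reality and a non-adaptive variant for equality. Across three masked language models, bias is observed under both distributions. We achieve near-complete mitigation under equality and 30-75% reduction under real-world settings. Autoregressive LLMs show no bias under equality but notable bias under real-world settings, with the Llama Instruct models (3.2-3B, 3.1-8B) achieving a 50-62% reduction.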
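The idea of treating bias as deviation from a desired distribution can be made concrete with a small sketch. It uses total variation distance as the deviation measure, which is an assumption made for illustration: the abstract does not name the paper's actual metric. All probabilities below are invented toy values, not figures from U.S. labor statistics or from the paper.

```python
def total_variation(observed, desired):
    """Half the L1 distance between two distributions over the same labels."""
    labels = set(observed) | set(desired)
    return 0.5 * sum(abs(observed.get(k, 0.0) - desired.get(k, 0.0)) for k in labels)


# Hypothetical model output for one profession prompt (toy values).
model_output = {"female": 0.10, "male": 0.90}

# Two possible targets; which one applies depends on the application goal.
equal_target = {"female": 0.50, "male": 0.50}
real_world_target = {"female": 0.12, "male": 0.88}  # hypothetical, not real statistics

print(total_variation(model_output, equal_target))       # large deviation
print(total_variation(model_output, real_world_target))  # small deviation
```

The same output is far from the equal target but close to the toy real-world target, which is the point of the entry's definition: whether a model counts as biased depends on the distribution it is measured against.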

[15] EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preference

Kshitish Ghate,Andy Liu,Devansh Jain,Taylor Sorensen,Atoosa Kasirzadeh,Aylin Caliskan,Mona T. Diab,Maarten Sap

Main category: cs.CL

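TL;DR: This paper introduces EVALUESTEER, a benchmark for measuring how steerable large language models and reward models are towards users' values and stylistic preferences. Its synthetic dataset exposes the limits of current models in handling multi-dimensional user preferences.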

Details

Motivation: As large language models are deployed globally, systems must adapt to users' differing values and preferences, yet existing datasets do not support controlled evaluation of reward model steerability.

Method: The EVALUESTEER benchmark contains 165,888 synthetic preference pairs covering 4 value dimensions and 4 style dimensions. Six LLMs and RMs are evaluated under sixteen prompting conditions.

Result: When given the user's full preference profile, the best models choose the correct response with under 75% accuracy, compared with over 99% when only the relevant preferences are provided.

Conclusion: Current reward models struggle to identify and adapt to the relevant parts of a multi-dimensional user profile. EVALUESTEER provides a challenging testbed for developing reward models that can be steered towards diverse human values.

Abstract: As large language models (LLMs) are deployed globally, creating pluralistic systems that can accommodate the diverse preferences and values of users worldwide becomes essential. We introduce EVALUESTEER, a benchmark to measure LLMs' and reward models' (RMs) steerability towards users' value and stylistic preference profiles grounded in psychology and human-LLM interaction literature. To address the gap in existing datasets that do not support controlled evaluations of RM steering, we synthetically generated 165,888 preference pairs, systematically varying pairs along 4 value dimensions (traditional, secular-rational, survival, and self-expression) and 4 style dimensions (verbosity, readability, confidence, and warmth). We use EVALUESTEER to evaluate whether, given a user profile and a pair of candidate value-laden and style-laden responses, LLMs and RMs are able to select the output that aligns with the user's preferences. We evaluate six open-source and proprietary LLMs and RMs under sixteen systematic prompting conditions and six preference comparison scenarios. Notably, our results show that, when given the user's full profile of values and stylistic preferences, the best models achieve <75% accuracy at choosing the correct response, in contrast to >99% accuracy when only relevant style and value preferences are provided. EVALUESTEER thus highlights the limitations of current RMs at identifying and adapting to relevant user profile information, and provides a challenging testbed for developing RMs that can be steered towards diverse human values and preferences.

[16] EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA

Firoj Alam,Ali Ezzat Shahroor,Md. Arid Hasan,Zien Sheikh Ali,Hunzalah Hassan Bhatti,Mohamed Bayan Kmainasi,Shammur Absar Chowdhury,Basel Mousi,Fahim Dalvi,Nadir Durrani,Natasa Milic-Frayling

Main category: cs.CL

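TL;DR: This paper proposes EverydayMMQA, a framework for building large-scale, culturally grounded multimodal and multilingual QA datasets, and uses it to create OASIS, a dataset combining speech, images, and text. OASIS covers English and Arabic varieties, supports several input combinations, and aims to improve models' culturally aware reasoning in everyday situations.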

Details

Motivation: Large multimodal models perform poorly on questions that require culturally grounded everyday knowledge, especially in low-resource and underrepresented languages, so multimodal QA datasets with broader cultural coverage are needed.

Method: The EverydayMMQA framework systematically builds multilingual, multimodal visual question answering datasets. It is used to construct OASIS, which contains about 0.92 million images and 14.8 million QA pairs, 3.7 million of them spoken questions, and supports different combinations of speech, text, and image input.

Result: OASIS covers English and Arabic varieties across 18 countries, emphasizes real-world situations, and tests models on pragmatic, commonsense, and culturally aware reasoning. Several closed-source, open-source, and fine-tuned models are benchmarked.

Conclusion: The EverydayMMQA framework and the OASIS dataset provide training and evaluation resources for building culturally aware multimodal large models and will be released to the community.

Abstract: Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. To bridge this gap, we introduce Everyday Multimodal and Multilingual QA (EverydayMMQA), a framework for creating large-scale, culturally-grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed OASIS, a multimodal dataset integrating speech, images, and text. With ~0.92M images and 14.8M QA pairs, OASIS contains 3.7M spoken questions, enabling four unique input combinations: speech-only, text-only, speech+image, and text+image. Focused on English and Arabic varieties and covering 18 countries, the dataset content is curated to reflect diverse, real-world situations. OASIS tests models on tasks beyond object recognition that involve pragmatic, commonsense, and culturally aware reasoning. We benchmarked four closed-source models, three open-source models, and one fine-tuned model. EverydayMMQA and OASIS together provide a benchmark and training dataset for building multimodal LLMs for a comprehensive set of everyday tasks within cultural contexts. The framework and dataset will be made publicly available to the community.

[17] Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language

Angie Boggust,Donghao Ren,Yannick Assogba,Dominik Moritz,Arvind Satyanarayan,Fred Hohman

Main category: cs.CL

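TL;DR: This paper introduces semantic regexes for describing large language model features more precisely and consistently. They match the accuracy of natural language descriptions while being more structured and easier to analyze.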

Details

Motivation: Natural language feature descriptions produced by existing automated interpretability methods are often vague and inconsistent, require manual relabeling, and lack structure and scalability.

Method: Semantic regexes combine primitives that capture linguistic and semantic patterns with modifiers for contextualization, composition, and quantification, giving a structured way to describe features.

Result: In quantitative benchmarks and qualitative analyses, semantic regexes match the accuracy of natural language descriptions while being more concise and consistent. They also support analysis of feature complexity across layers and of model-wide patterns, and user studies show that they help people build accurate mental models.

Conclusion: Semantic regexes offer a better way to describe LLM features, combining precision, structure, and scalability, and move automated interpretability from understanding individual features towards model-wide pattern analysis.

Abstract: Automated interpretability aims to translate large language model (LLM) features into human-understandable descriptions. However, these natural language feature descriptions are often vague, inconsistent, and require manual relabeling. In response, we introduce semantic regexes, structured language descriptions of LLM features. By combining primitives that capture linguistic and semantic feature patterns with modifiers for contextualization, composition, and quantification, semantic regexes produce precise and expressive feature descriptions. Across quantitative benchmarks and qualitative analyses, we find that semantic regexes match the accuracy of natural language while yielding more concise and consistent feature descriptions. Moreover, their inherent structure affords new types of analyses, including quantifying feature complexity across layers, scaling automated interpretability from insights into individual features to model-wide patterns. Finally, in user studies, we find that semantic regex descriptions help people build accurate mental models of LLM feature activations.

[18] Protecting De-identified Documents from Search-based Linkage Attacks

Pierre Lison,Mark Anderson

Main category: cs.CL

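TL;DR: Proposes a method to counter search-based linkage attacks: it builds an inverted index of N-grams and uses a large language model to iteratively rewrite low-frequency spans, preventing de-identified text from being linked back to its source document while preserving meaning.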

Details

Motivation: Existing de-identification models do not address linkage risk: a de-identified text can be re-associated with its source by extracting phrases and matching them against the original dataset. A new method is needed to defend against such attacks.

Method: An inverted index of the N-grams in the document collection is built first, identifying N-grams that appear in fewer than k documents. An LLM-based rewriter then iteratively reformulates these vulnerable spans until linkage is no longer possible.

Result: Experiments on a collection of court cases show that the method effectively blocks search-based linkage attacks while largely preserving the semantic content of the original text.

Conclusion: The method substantially lowers the linkage risk of de-identified text without damaging its meaning, improving privacy protection when texts are released.

Abstract: While de-identification models can help conceal the identity of the individual(s) mentioned in a document, they fail to address linkage risks, defined as the potential to map the de-identified text back to its source. One straightforward way to perform such linkages is to extract phrases from the de-identified document and then check their presence in the original dataset. This paper presents a method to counter search-based linkage attacks while preserving the semantic integrity of the text. The method proceeds in two steps. We first construct an inverted index of the N-grams occurring in the document collection, making it possible to efficiently determine which N-grams appear in fewer than $k$ documents (either alone or in combination with other N-grams). An LLM-based rewriter is then iteratively queried to reformulate those spans until linkage is no longer possible. Experimental results on a collection of court cases show that the method is able to effectively prevent search-based linkages while remaining faithful to the original content.
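The first step of the method, the N-gram inverted index, can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace tokenization, lowercasing, and single N-grams only. It does not cover combinations of N-grams or the LLM rewriting step, and the function names and toy documents are hypothetical.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """All contiguous n-token windows of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_inverted_index(docs, n):
    """Map each N-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text.lower().split(), n):
            index[gram].add(doc_id)
    return index


def rare_ngrams(index, doc_id, k):
    """N-grams of doc_id that occur in fewer than k documents of the collection."""
    return sorted(g for g, ids in index.items() if doc_id in ids and len(ids) < k)


# Toy collection (invented sentences, not from the court-case data).
docs = {
    "case1": "the court dismissed the appeal on procedural grounds",
    "case2": "the court dismissed the claim",
    "case3": "the court dismissed the appeal",
}
index = build_inverted_index(docs, n=2)
print(rare_ngrams(index, "case1", k=2))
# [('appeal', 'on'), ('on', 'procedural'), ('procedural', 'grounds')]
```

The bigrams returned for `case1` occur in no other document, so a search for any of them would single out that document. In the method described above, spans like these are the ones handed to the rewriter.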

[19] Controllable Stylistic Text Generation with Train-Time Attribute-Regularized Diffusion

Fan Zhou,Chang Tian,Tim Van de Cruys

Main category: cs.CL

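TL;DR: Proposes RegDiff, a regularized diffusion framework for attribute-controllable text generation that does not rely on a pretrained classifier at sampling time, reducing computational cost and improving generation efficiency.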

Details

Motivation: Existing controllable text generation methods trade off attribute control against semantic preservation: classifier-free guidance struggles to control attributes effectively, while classifier guidance is computationally expensive and suffers from generalization issues. Method: RegDiff uses a VAE encoder-decoder architecture to ensure reconstruction fidelity and a latent diffusion model trained with attribute supervision; attribute information is injected only during training, so no classifier is needed at sampling time. Result: Experiments on five datasets spanning multiple stylistic attributes show that RegDiff outperforms strong baselines at generating stylistic text. Conclusion: RegDiff is an efficient and effective diffusion model for attribute-controllable text generation that addresses the limitations of existing methods in computational overhead and attribute control. Abstract: Generating stylistic text with specific attributes is a key problem in controllable text generation. Recently, diffusion models have emerged as a powerful paradigm for both visual and textual generation. Existing approaches can be broadly categorized into classifier-free guidance (CFG) and classifier guidance (CG) methods. While CFG effectively preserves semantic content, it often fails to provide effective attribute control. In contrast, CG modifies the denoising trajectory using classifier gradients, enabling better attribute alignment but incurring high computational costs during sampling and suffering from classifier generalization issues. In this work, we propose RegDiff, a regularized diffusion framework that leverages attribute features without requiring a pretrained classifier during sampling, thereby achieving controllable generation with reduced computational costs. Specifically, RegDiff employs a VAE-based encoder-decoder architecture to ensure reconstruction fidelity and a latent diffusion model trained with attribute supervision to enable controllable text generation. Attribute information is injected only during training. Experiments on five datasets spanning multiple stylistic attributes demonstrate that RegDiff outperforms strong baselines in generating stylistic texts. These results validate the effectiveness of RegDiff as an efficient solution for attribute-controllable text diffusion. Our code, datasets, and resources will be released upon publication at https://github.com/xxxx.

[20] Reward Model Perspectives: Whose Opinions Do Reward Models Reward?

Elle

Main category: cs.CL

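TL;DR: Proposes a framework for measuring how well reward models (RMs) align with human preferences, finds that RMs exhibit sociodemographic biases and may reinforce harmful stereotypes, and shows that prompt-based steering alone is not enough to resolve these problems.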

Details

Motivation: To understand how reward models behave in language model alignment, especially the sociodemographic biases they may carry when capturing human preferences. Method: Builds a framework for quantifying RM opinions, analyzes RM stances on controversial topics, and explores how well prompting can steer rewards toward the preferences of a target group. Result: RMs are poorly aligned with several demographic groups and can systematically reward harmful stereotypes, and prompt-based steering cannot fully correct these biases. Conclusion: RM behavior needs more careful evaluation during preference learning to keep language technologies from propagating unwanted social biases. Abstract: Reward models (RMs) are central to the alignment of language models (LMs). An RM often serves as a proxy for human preferences to guide downstream LM behavior. However, our understanding of RM behavior is limited. Our work (i) formalizes a framework for measuring the alignment of opinions captured by RMs, (ii) investigates the extent to which RMs demonstrate sociodemographic biases, and (iii) explores the effects of prompting to steer rewards towards the preferences of a target group. We study the subjective and diverse perspectives on controversial topics, which allows us to quantify RM perspectives in terms of their opinions, attitudes, and values. We show that RMs are poorly aligned with several demographic groups and can systematically reward harmful stereotypes, and steering alone is not enough to overcome these limitations. Our findings underscore the need for more careful consideration of RM behavior in model alignment during preference learning to prevent the propagation of unwanted social biases in the language technologies that we use.

[21] Instructional Goal-Aligned Question Generation for Student Evaluation in Virtual Lab Settings: How Closely Do LLMs Actually Align?

R. Alexander Knipper,Indrani Dey,Souvika Sarkar,Hari Narayanan,Sadhana Puntambekar,Santu Karmaker

Main category: cs.CL

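TL;DR: Proposes a new alignment framework that uses large language models (LLMs) to help teachers generate instructional questions matched to their teaching goals and virtual labs. Through four components (teacher-LLM dialogue, lab content understanding, a question taxonomy, and TELeR prompt control), it substantially improves question quality, format parsability, and adherence, with larger models performing best.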

Details

Motivation: Teachers struggle to align existing virtual lab resources with their instructional goals; third-party materials do not fit, and developing custom materials is time-consuming and hard to scale, so an automated, scalable solution for generating tailored instructional questions is needed. Method: Proposes a four-component framework: understanding instructional goals through teacher-LLM dialogue, understanding lab content through knowledge unit and relationship analysis, structuring cognitive levels with a question taxonomy, and controlling prompt detail with the TELeR taxonomy. The design was informed by a small teacher-assisted case study, and the evaluation covers more than 1,100 questions generated by 19 open-source LLMs. Result: The framework raises question quality (cognitive demand improves by 0.29-0.39 points) and reaches 80% format parsability and over 90% adherence; larger models perform best, with parsability up 37.1%, adherence up 25.7%, and average quality up 0.8 Likert points. Conclusion: The framework effectively helps teachers use LLMs to generate high-quality, structured instructional questions that match their teaching intent and the lab context, offering a scalable and efficient solution for tailored instruction in virtual lab settings. Abstract: Virtual Labs offer valuable opportunities for hands-on, inquiry-based science learning, yet teachers often struggle to adapt them to fit their instructional goals. Third-party materials may not align with classroom needs, and developing custom resources can be time-consuming and difficult to scale. Recent advances in Large Language Models (LLMs) offer a promising avenue for addressing these limitations. In this paper, we introduce a novel alignment framework for instructional goal-aligned question generation, enabling teachers to leverage LLMs to produce simulation-aligned, pedagogically meaningful questions through natural language interaction. The framework integrates four components: instructional goal understanding via teacher-LLM dialogue, lab understanding via knowledge unit and relationship analysis, a question taxonomy for structuring cognitive and pedagogical intent, and the TELeR taxonomy for controlling prompt detail. Early design choices were informed by a small teacher-assisted case study, while our final evaluation analyzed over 1,100 questions from 19 open-source LLMs. With goal and lab understanding grounding questions in teacher intent and simulation context, the question taxonomy elevates cognitive demand (open-ended formats and relational types raise quality by 0.29-0.39 points), and optimized TELeR prompts enhance format adherence (80% parsability, >90% adherence). Larger models yield the strongest gains: parsability +37.1%, adherence +25.7%, and average quality +0.8 Likert points.

[22] FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering

Yitao Long,Tiansheng Hu,Yilun Zhao,Arman Cohan,Chen Zhao

Main category: cs.CL

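TL;DR: Introduces FinLFQA, a benchmark for evaluating the ability of large language models to generate long-form answers to complex financial questions with fine-grained attribution.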

Details

Motivation: Existing attribution benchmarks mostly focus on simple retrieval of textual evidence and fail to reflect the higher demands on attribution in real-world settings such as financial applications, which call for attribution that covers evidence, reasoning steps, and domain knowledge. Method: Builds a dataset of complex financial questions with human annotations covering three aspects of attribution: supporting evidence from financial reports, intermediate numerical reasoning steps, and domain knowledge that informs the reasoning; also provides an automatic evaluation framework. Result: Experiments on eight LLMs show that fine-grained metrics better distinguish model capabilities, that end-to-end generation is comparable to post-hoc attribution, and that iterative refinement helps only when guided by external feedback. Conclusion: FinLFQA provides a more comprehensive benchmark for evaluating LLM attribution in specialized domains and highlights the importance of fine-grained evaluation and the complexity of attribution in real-world settings. Abstract: Large Language Models (LLMs) frequently hallucinate when answering long-form questions, producing plausible yet factually incorrect answers. A common mitigation strategy is to provide attribution to LLM outputs. However, existing benchmarks primarily focus on simple attribution that retrieves supporting textual evidence as references. We argue that in real-world scenarios such as financial applications, attribution goes beyond reference retrieval. We introduce FinLFQA, a benchmark designed to evaluate the ability of LLMs to generate long-form answers to complex financial questions with reliable and nuanced attributions. FinLFQA evaluates three critical aspects of attribution through human annotations: (1) supporting evidence extracted from financial reports, (2) intermediate numerical reasoning steps, and (3) domain-specific financial knowledge that informs the reasoning process. We further provide an automatic evaluation framework covering both answer quality and attribution quality. Through extensive experiments on eight LLMs across multiple attribution-generation paradigms, we find that fine-grained metrics are important to distinguish model capabilities, that end-to-end generation achieves comparable performance to post-hoc approaches, and that iterative refinement only helps when guided by external feedback.

[23] Bridging Discourse Treebanks with a Unified Rhetorical Structure Parser

Elena Chistova

Main category: cs.CL

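TL;DR: UniRST is the first RST-style discourse parser that handles 18 treebanks in 11 languages within a single model, without modifying the treebanks' relation inventories.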

Details

Motivation: Relation inventories are incompatible across treebanks, which makes it hard to build a unified multilingual discourse parsing model. Method: Proposes two training strategies, Multi-Head (a separate classification layer per inventory) and Masked-Union (shared-parameter training through selective label masking), and benchmarks mono-treebank parsing with a data augmentation technique for low-resource settings. Result: Masked-Union is both the most parameter-efficient and the strongest approach; UniRST outperforms 16 of 18 mono-treebank baselines, demonstrating the advantages of single-model multilingual end-to-end parsing. Conclusion: UniRST enables efficient unified discourse parsing across languages and treebanks; Masked-Union is the better training strategy, and joint training on multiple resources improves discourse parsing performance. Abstract: We introduce UniRST, the first unified RST-style discourse parser capable of handling 18 treebanks in 11 languages without modifying their relation inventories. To overcome inventory incompatibilities, we propose and evaluate two training strategies: Multi-Head, which assigns a separate relation classification layer per inventory, and Masked-Union, which enables shared parameter training through selective label masking. We first benchmark mono-treebank parsing with a simple yet effective augmentation technique for low-resource settings. We then train a unified model and show that (1) the parameter-efficient Masked-Union approach is also the strongest, and (2) UniRST outperforms 16 of 18 mono-treebank baselines, demonstrating the advantages of single-model, multilingual end-to-end discourse parsing across diverse resources.
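
One plausible reading of "selective label masking" is sketched below: a single classifier scores the union of all relation labels, and labels outside the current treebank's inventory are masked out before the softmax. This is an illustrative assumption, not UniRST's code; the label names, treebank ids, and function names are hypothetical.

```python
import math

# Hypothetical union inventory and per-treebank inventories.
UNION_LABELS = ["Elaboration", "Contrast", "Cause", "Background"]
TREEBANK_LABELS = {
    "tb_a": {"Elaboration", "Contrast"},
    "tb_b": {"Elaboration", "Cause", "Background"},
}


def masked_log_softmax(logits, treebank):
    """Log-softmax over the union labels, restricted to one treebank's inventory."""
    allowed = TREEBANK_LABELS[treebank]
    # Labels outside the inventory get -inf, so they receive zero probability.
    masked = [x if lab in allowed else -math.inf
              for x, lab in zip(logits, UNION_LABELS)]
    peak = max(masked)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in masked))
    return [x - log_norm for x in masked]


def masked_cross_entropy(logits, gold, treebank):
    """Cross-entropy of the gold label under the masked distribution."""
    return -masked_log_softmax(logits, treebank)[UNION_LABELS.index(gold)]


logits = [1.0, 1.0, 5.0, 0.0]  # shared scores over the union inventory
probs_a = [math.exp(v) for v in masked_log_softmax(logits, "tb_a")]
loss_a = masked_cross_entropy(logits, "Contrast", "tb_a")
```

In this sketch the high score for "Cause" does not affect an example from `tb_a`, because that label is not in its inventory; all parameters producing the logits remain shared across treebanks.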

[24] MathRobust-LV: Evaluation of Large Language Models' Robustness to Linguistic Variations in Mathematical Reasoning

Neeraja Kirtane,Yuvraj Khanna,Peter Relan

Main category: cs.CL

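TL;DR: Introduces MathRobust-LV, a test set and evaluation methodology for assessing the robustness of large language models to linguistic variation in high-school math problems. Although models perform well on math benchmarks, their accuracy drops on linguistic variants, showing that linguistic robustness remains a challenge.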

Details

Motivation: Existing math reasoning evaluations mostly target high-difficulty competition problems and overlook the linguistic variability of high-school-level problems in real educational settings, so a more comprehensive evaluation closer to actual teaching is needed. Method: Builds MathRobust-LV, a test set that changes surface details (names, contexts, variables) while preserving numerical structure and answers, and evaluates 34 models on how performance changes under linguistic variants. Result: Accuracy generally declines from baseline to variant problems: smaller models drop by 9-11%, stronger models also degrade measurably, and only frontier models such as GPT-5 and Gemini-2.5pro remain comparatively stable. Conclusion: Robustness of mathematical reasoning to linguistic variation remains a major challenge for large language models; current models are fragile in this respect, which affects their reliable deployment in educational applications. Abstract: Large language models excel on math benchmarks, but their math reasoning robustness to linguistic variation is underexplored. While recent work increasingly treats high-difficulty competitions like the IMO as the gold standard for evaluating reasoning, we believe in comprehensive benchmarking of high school-level math problems in real educational settings. We introduce MathRobust-LV, a test set and evaluation methodology that mirrors how instructors rephrase problems across assessments while keeping difficulty constant: we change surface details (names, contexts, variables) while preserving numerical structure and answers. In contrast to prior efforts that alter problem content or emphasize IMO-level tasks, we focus on high-school-level dataset problems at the difficulty level where models are currently deployed in educational settings: tutoring and assessment systems. In these applications, instructors rephrase identical concepts in varied ways, making linguistic robustness essential for reliable deployment. Although MATH data benchmarking is often regarded as saturated, our experiment on 34 models reveals that accuracy declines when moving from the baseline to the variants. These drops are severe for smaller models (9-11%) while stronger models also show measurable degradation. Frontier models like GPT-5 and Gemini-2.5pro remain comparatively stable. Our results highlight that robustness to linguistic variation is a fundamental challenge, exposing reasoning vulnerabilities in models.
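
To make the idea of a surface-level variant concrete, here is a small sketch that renames entities in a problem while checking that the numbers are untouched. The example problem, the rename table, and the helper names are hypothetical; the abstract does not say how the actual variants were produced.

```python
import re


def vary_surface(problem, renames):
    """Replace whole-word surface details in a single pass, leaving numbers untouched."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, renames)) + r")\b")
    return pattern.sub(lambda m: renames[m.group(1)], problem)


def numbers(text):
    """Extract the numeric literals, used to check the numerical structure is preserved."""
    return re.findall(r"\d+(?:\.\d+)?", text)


# Hypothetical baseline problem and rename table.
original = "Alice buys 3 apples at 2 dollars each. How much does Alice pay?"
varied = vary_surface(original, {"Alice": "Ravi", "apples": "notebooks"})
```

Because only names and objects change, the answer to the variant is the same as for the baseline, which is what lets accuracy on the two versions be compared directly.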

[25] A Survey on Agentic Security: Applications, Threats and Defenses

Asif Shahriar,Md Nafiu Rahman,Sadif Ahmed,Farig Sadeque,Md Rizwan Parvez

Main category: cs.CL

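TL;DR: Presents the first holistic survey of the LLM-agent security landscape, structuring the field around three interdependent pillars (applications, threats, and defenses), providing a taxonomy of more than 150 papers, and identifying emerging trends in agent architecture along with research gaps in model and modality coverage.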

Details

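Motivation: As LLMs shift from passive models to autonomous agents, cybersecurity enters a new paradigm, but the agentic setting also introduces new inherent security risks; these risks and their countermeasures need systematic study. Method: Synthesizes more than 150 related papers into a framework built on three interdependent pillars (applications, threats, and defenses), and performs a cross-cutting analysis of trends and research gaps. Result: Establishes a comprehensive taxonomy of LLM-agent security, reveals critical research gaps in model and modality coverage, and shows emerging trends in agent architecture. Conclusion: LLM agents have great potential in cybersecurity, but the security risks they introduce need deeper study; future work should attend to broader model and modality coverage and to the effectiveness of defenses. Abstract: The rapid shift from passive LLMs to autonomous LLM-agents marks a new paradigm in cybersecurity. While these agents can act as powerful tools for both offensive and defensive operations, the very agentic context introduces a new class of inherent security risks. In this work we present the first holistic survey of the agentic security landscape, structuring the field around three interdependent pillars: Applications, Threats, and Defenses. We provide a comprehensive taxonomy of over 150 papers, explaining how agents are used, the vulnerabilities they possess, and the countermeasures designed to protect them. A detailed cross-cutting analysis shows emerging trends in agent architecture while revealing critical research gaps in model and modality coverage.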

[26] Linguistically Informed Tokenization Improves ASR for Underresourced Languages

Massimo Daul,Alessio Tosolini,Claire Bowern

Main category: cs.CL

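TL;DR: Investigates the feasibility of wav2vec2 automatic speech recognition (ASR) for Yan-nhangu, an underresourced Indigenous Australian language, compares phonemic and orthographic tokenization strategies, and demonstrates the practical value of ASR in language documentation.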

Details

Motivation: Modern ASR systems rely on data-hungry transformer architectures and are hard to apply to underresourced languages, so ASR methods suited to such languages are needed. Method: Fine-tunes a wav2vec2 ASR model on Yan-nhangu data and compares the effect of phonemic and orthographic tokenization on word error rate (WER) and character error rate (CER). Result: The linguistically informed phonemic tokenization substantially outperforms orthographic tokenization on both WER and CER; in addition, hand-correcting ASR output is faster than hand-transcribing from scratch. Conclusion: ASR combined with a suitable tokenization strategy can support language documentation for underresourced languages and substantially speeds up transcription, showing practical potential. Abstract: Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems use data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec2 ASR model on Yan-nhangu, a dormant Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR's viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves WER and CER compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can work for underresourced languages.
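
For readers unfamiliar with the two metrics, WER and CER are conventionally computed as the Levenshtein edit distance between reference and hypothesis, divided by the reference length in words or characters. The sketch below follows that standard definition; it is not the paper's evaluation script, and the example strings are placeholders rather than Yan-nhangu data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Placeholder example: one wrong word, which is also one wrong character.
example_wer = wer("the cat sat", "the cat sit")
example_cer = cer("the cat sat", "the cat sit")
```

The same single error costs one third of the words but only one eleventh of the characters, which is why the two metrics are usually reported together.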

[27] Test-Time Scaling of Reasoning Models for Machine Translation

Zihao Li,Shaoxiong Ji,Jörg Tiedemann

Main category: cs.CL

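TL;DR: Studies how increased inference-time computation affects machine translation quality, finding that test-time scaling (TTS) gives limited benefit for direct translation with general-purpose models but substantially improves performance with domain-specific fine-tuning and in post-editing.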

Details

Motivation: To explore the effectiveness of test-time scaling (TTS) in machine translation, and in particular its potential beyond math and coding tasks. Method: Evaluates 12 reasoning models on multiple machine translation benchmarks across three scenarios: direct translation, forced-reasoning extrapolation, and post-editing. Result: For general-purpose models, TTS gives limited and inconsistent benefits in direct translation; domain-specific fine-tuning yields consistent improvements up to an optimal, self-determined reasoning depth; forcing longer reasoning degrades translation quality; and in post-editing, TTS reliably supports self-correction. Conclusion: The value of inference-time computation in machine translation lies not in enhancing single-pass translation with general models, but in multi-step self-correction workflows and in combination with task-specialized models. Abstract: Test-time scaling (TTS) has enhanced the performance of Reasoning Models (RMs) on various tasks such as math and coding, yet its efficacy in machine translation (MT) remains underexplored. This paper investigates whether increased inference-time computation improves translation quality. We evaluate 12 RMs across a diverse suite of MT benchmarks spanning multiple domains, examining three scenarios: direct translation, forced-reasoning extrapolation, and post-editing. Our findings show that for general-purpose RMs, TTS provides limited and inconsistent benefits for direct translation, with performance quickly plateauing. However, the effectiveness of TTS is unlocked by domain-specific fine-tuning, which aligns a model's reasoning process with task requirements, leading to consistent improvements up to an optimal, self-determined reasoning depth. We also find that forcing a model to reason beyond its natural stopping point consistently degrades translation quality. In contrast, TTS proves highly effective in a post-editing context, reliably turning self-correction into a beneficial process. These results indicate that the value of inference-time computation in MT lies not in enhancing single-pass translation with general models, but in targeted applications like multi-step, self-correction workflows and in conjunction with task-specialized models.

[28] Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

Zhepeng Cen,Haolin Chen,Shiyu Wang,Zuxin Liu,Zhiwei Liu,Ding Zhao,Silvio Savarese,Caiming Xiong,Huan Wang,Weiran Yao

Main category: cs.CL

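TL;DR: Introduces the Webscale-RL data engine, which converts large-scale pretraining text into diverse question-answer pairs for reinforcement learning, and uses it to build a dataset of 1.2 million examples that substantially improves the efficiency and performance of RL for language models.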

Details

Motivation: Existing reinforcement learning datasets are small and lack diversity, which limits the use of RL for language models, while imitation learning suffers from a gap between training and generation; a more efficient data-driven approach is needed to close this gap. Method: Proposes the Webscale-RL pipeline, which systematically converts large-scale pretraining documents into verifiable question-answer pairs, builds a large, multi-domain RL dataset, and trains with RL on it. Result: Models trained on the dataset significantly outperform continual pretraining and strong baselines across several benchmarks, and RL training is more efficient, matching the performance of continual pretraining with up to 100 times fewer tokens. Conclusion: Webscale-RL offers a viable path toward scaling RL to pretraining levels, enabling more capable and efficient language models. Abstract: Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100$\times$ fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.

[29] From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining

Seng Pei Liew,Takuya Kato

Main category: cs.CL

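TL;DR: Studies the scaling behavior of bootstrapped pretraining, finds that its efficiency decreases logarithmically with the number of tokens used to pretrain the base model, and proposes a simple scaling law that describes this effect.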

Details

Motivation: To reduce the cost of training language models from scratch and to examine how effective bootstrapped pretraining is on overtrained models. Method: Empirically studies the scaling behavior of bootstrapped pretraining across different amounts of first-stage pretraining tokens and fits a simple scaling law. Result: The scaling efficiency of bootstrapped pretraining decreases logarithmically with the number of tokens used to pretrain the base model, and a simple scaling law models this accurately. Conclusion: Overtraining the base model reduces the gains from subsequent bootstrapped pretraining, revealing a fundamental trade-off in multi-stage pretraining strategies and offering practical guidance for efficient language model training. Abstract: Bootstrapped pretraining, i.e., the reuse of a pretrained base model for further pretraining, such as continual pretraining or model growth, is promising for reducing the cost of training language models from scratch. However, its effectiveness remains unclear, especially when applied to overtrained base models. In this work, we empirically study the scaling behavior of bootstrapped pretraining and find that its scaling efficiency diminishes in a predictable manner: The scaling exponent with respect to second-stage pretraining tokens decreases logarithmically with the number of tokens used to pretrain the base model. The joint dependence on first- and second-stage tokens is accurately modeled by a simple scaling law. This saturation effect reveals a fundamental trade-off in multi-stage pretraining strategies: the more extensively a model is pretrained, the less additional benefit bootstrapping provides. Our findings provide practical insights for efficient language model training and raise important considerations for the reuse of overtrained models.
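
One functional form consistent with this description can be written as follows. It is an illustrative assumption, since the abstract does not state the paper's exact law: $D_1$ and $D_2$ denote first- and second-stage pretraining tokens, and $E$, $A$, $\alpha_0$, $\beta$ are hypothetical fitted constants.

```latex
% Hypothetical form: power law in second-stage tokens D_2,
% with an exponent that shrinks logarithmically in first-stage tokens D_1.
\begin{align}
  L(D_1, D_2) &\approx E + A \, D_2^{-\alpha(D_1)}, \\
  \alpha(D_1) &= \alpha_0 - \beta \log D_1, \qquad \beta > 0 .
\end{align}
```

Under this form, a larger $D_1$ gives a smaller exponent $\alpha(D_1)$, so each additional second-stage token reduces the loss less, which matches the saturation effect described in the abstract.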

[30] Flipping the Dialogue: Training and Evaluating User Language Models

Tarek Naous,Philippe Laban,Wei Xu,Jennifer Neville

Main category: cs.CL

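TL;DR: Introduces User Language Models (User LMs), models built specifically to simulate human users in conversation. Compared with using assistant-style LLMs as simulated users, User LMs behave more like real users and make multi-turn evaluation more realistic. Under this more realistic simulation, the performance of a strong assistant (GPT-4o) drops markedly, exposing how current assistants struggle with the diversity of real users.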

Details

Motivation: Prior work mostly prompts assistant-trained LLMs to simulate human users, but the user behavior these models produce is too idealized and fails to reflect the incomplete, ambiguous, and on-the-fly revised requests of real users, which distorts the evaluation environment; a simulation method closer to real user behavior is needed. Method: Builds User Language Models (User LMs) by post-training models to learn the language patterns and behavior of real users in multi-turn conversations, and evaluates both simulation quality and the effect on assistant performance in coding and math conversations. Result: Across several evaluations, User LMs align better with human behavior and are more robust simulators than existing methods. When User LMs simulate the user, GPT-4o's performance drops from 74.6% to 57.4%, showing that more realistic user simulation poses a harder challenge. Conclusion: User LMs designed for the user role make multi-turn evaluation more realistic and expose the limitations of current assistants when facing real user behavior, underscoring the need for assistants that cope better with non-idealized user input. Abstract: Conversations with LMs involve two participants: a human user leading the conversation, and an LM assistant responding to the user's request. To satisfy this specific role, LMs are post-trained to be helpful assistants, optimized to produce exhaustive and well-structured responses, free of ambiguity and grammar errors. User utterances, on the other hand, are rarely perfected, with each user phrasing requests in unique ways, sometimes putting in partial effort at each turn and refining on the fly. To evaluate LM performance in realistic settings, prior work simulated users in multi-turn conversations, often prompting an LLM originally trained to be a helpful assistant to act as a user. However, we show that assistant LMs make for poor user simulators, with the surprising finding that better assistants yield worse simulators. Instead, we introduce purpose-built User Language Models (User LMs): models post-trained to simulate human users in multi-turn conversations. Through various evaluations, we show how User LMs align better with human behavior and achieve better simulation robustness than existing simulation methods. When leveraging User LMs to simulate coding and math conversations, the performance of a strong assistant (GPT-4o) drops from 74.6% to 57.4%, confirming that more realistic simulation environments lead to assistant struggles as they fail to cope with the nuances of users in multi-turn setups.

[31] The Algebra of Meaning: Why Machines Need Montague More Than Moore's Law

Cheonkam Jeong,Sungdo Kim,Jewoo Park

Main category: cs.CL

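TL;DR: Presents Savassan, a type-theoretic neuro-symbolic architecture that treats alignment as a parsing problem: it compiles natural language into typed logical forms with deontic operators and jurisdictional contexts, targeting hallucination, brittle moderation, and opaque compliance in large models.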

Details

Motivation: Existing language models are fluent but often mishandle semantic types, leading to hallucination, unstable moderation, and opaque compliance decisions; the root cause is missing type-theoretic semantics rather than insufficient data or scale. Method: Building on Montague's view of language as a typed, compositional algebra, designs the Savassan architecture: neural components extract candidate structures from the input, and symbolic components perform type checking, constraint reasoning, and cross-jurisdiction mapping, compiling utterances into explicit descriptive, normative, and legal structures. Result: The system can "parse once" (e.g., a product defect claim) and project the result into multiple legal ontologies (e.g., defamation risk in KR/JP, protected opinion in US, GDPR checks in EU), producing unified, explainable compliance guidance rather than simple blocking. Conclusion: Trustworthy autonomous systems require compositional typing of meaning, so that a system can distinguish factual description, normative requirement, and legal liability within one unified algebra of meaning, enabling explainable, cross-jurisdiction compliance reasoning. Abstract: Contemporary language models are fluent yet routinely mishandle the types of meaning their outputs entail. We argue that hallucination, brittle moderation, and opaque compliance outcomes are symptoms of missing type-theoretic semantics rather than data or scale limitations. Building on Montague's view of language as typed, compositional algebra, we recast alignment as a parsing problem: natural-language inputs must be compiled into structures that make explicit their descriptive, normative, and legal dimensions under context. We present Savassan, a neuro-symbolic architecture that compiles utterances into Montague-style logical forms and maps them to typed ontologies extended with deontic operators and jurisdictional contexts. Neural components extract candidate structures from unstructured inputs; symbolic components perform type checking, constraint reasoning, and cross-jurisdiction mapping to produce compliance-aware guidance rather than binary censorship. In cross-border scenarios, the system "parses once" (e.g., defect claim(product x, company y)) and projects the result into multiple legal ontologies (e.g., defamation risk in KR/JP, protected opinion in US, GDPR checks in EU), composing outcomes into a single, explainable decision. This paper contributes: (i) a diagnosis of hallucination as a type error; (ii) a formal Montague-ontology bridge for business/legal reasoning; and (iii) a production-oriented design that embeds typed interfaces across the pipeline. We outline an evaluation plan using legal reasoning benchmarks and synthetic multi-jurisdiction suites. Our position is that trustworthy autonomy requires compositional typing of meaning, enabling systems to reason about what is described, what is prescribed, and what incurs liability within a unified algebra of meaning.

[32] TinyScientist: An Interactive, Extensible, and Controllable Framework for Building Research Agents

Haofei Yu,Keyang Xuan,Fenghai Li,Kunlun Zhu,Zijie Lei,Jiaxun Zhang,Ziheng Qi,Kyle Richardson,Jiaxuan You

Main category: cs.CL

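TL;DR: TinyScientist proposes a lightweight, extensible, and interactive framework for automatic research that addresses increasingly complex multi-agent research workflows, supports quick integration of new tools, and ships with an open-source implementation.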

Details

Motivation: As large language models are used more widely for automatic research, the complexity and maintenance burden of existing systems limit their extension and adoption, so a more flexible and usable framework is needed. Method: Identifies the essential components of the automatic research workflow and designs a modular, interactive framework that supports multi-agent collaboration, tool use, and iterative development, released with an open-source codebase, a web demonstration, and a PyPI package. Result: Delivers an automatic research system that is easy to extend and control, released as open-source code, a web interface, and a Python package, improving the accessibility and usability of automatic research tools. Conclusion: TinyScientist gives researchers and developers a flexible, open platform that lowers the barrier to building and maintaining automatic research workflows. Abstract: Automatic research with Large Language Models (LLMs) is rapidly gaining importance, driving the development of increasingly complex workflows involving multi-agent systems, planning, tool usage, code execution, and human-agent interaction to accelerate research processes. However, as more researchers and developers begin to use and build upon these tools and platforms, the complexity and difficulty of extending and maintaining such agentic workflows have become a significant challenge, particularly as algorithms and architectures continue to advance. To address this growing complexity, TinyScientist identifies the essential components of the automatic research workflow and proposes an interactive, extensible, and controllable framework that easily adapts to new tools and supports iterative growth. We provide an open-source codebase, an interactive web demonstration, and a PyPI Python package to make state-of-the-art auto-research pipelines broadly accessible to every researcher and developer.

[33] Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection?

Sri Durga Sai Sowmya Kadali,Evangelos E. Papalexakis

Main category: cs.CL

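TL;DR: Investigates jailbreaking of large language models (LLMs) by analyzing how the hidden layers of GPT-J and Mamba2 respond to jailbreak versus benign prompts, revealing distinct layer-wise behaviors and pointing to jailbreak detection and defense based on internal model dynamics.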

Details

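Motivation: As conversational LLMs become widespread, jailbreak attacks are a growing concern: attackers use carefully engineered prompts to obtain restricted or sensitive outputs, and existing defenses cannot fully resist novel attacks, so a deeper understanding of model internals is needed to improve safety. Method: Analyzes the internal representations of the open-source LLM GPT-J and the state-space model Mamba2, comparing how each hidden layer responds to jailbreak and benign prompts to identify patterns induced by jailbreak attacks. Result: Jailbreak prompts elicit response patterns across hidden layers that are distinct from those of benign prompts, suggesting that jailbreaks can be identified by monitoring changes in internal representations; these preliminary findings support the feasibility of detection based on model dynamics. Conclusion: Examining changes in the internal representations of LLMs offers a promising path toward more robust jailbreak detection and defense, and future work can further exploit layer-wise dynamics to improve model safety. Abstract: Jailbreaking large language models (LLMs) has emerged as a pressing concern with the increasing prevalence and accessibility of conversational LLMs. Adversarial users often exploit these models through carefully engineered prompts to elicit restricted or sensitive outputs, a strategy widely referred to as jailbreaking. While numerous defense mechanisms have been proposed, attackers continuously develop novel prompting techniques, and no existing model can be considered fully resistant. In this study, we investigate the jailbreak phenomenon by examining the internal representations of LLMs, with a focus on how hidden layers respond to jailbreak versus benign prompts. Specifically, we analyze the open-source LLM GPT-J and the state-space model Mamba2, presenting preliminary findings that highlight distinct layer-wise behaviors. Our results suggest promising directions for further research on leveraging internal model dynamics for robust jailbreak detection and defense.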

[34] A Comparative Analysis of Contextual Representation Flow in State-Space and Transformer Architectures

Nhat M. Hoang,Do Xuan Long,Cong-Duy Nguyen,Min-Yen Kan,Luu Anh Tuan

Main category: cs.CL

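TL;DR: Presents the first unified token- and layer-level analysis of representation propagation in state space models (SSMs) and Transformer-based models (TBMs), revealing a key difference in how contextual information flows: TBMs rapidly homogenize token representations in early layers, whereas SSMs preserve token uniqueness early but homogenize in deeper layers. The study further shows that oversmoothing in TBMs stems from architectural design, while in SSMs it arises mainly from training dynamics.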

Details

Motivation: Although SSMs are efficient for long-sequence processing, how contextual information propagates across layers and tokens in SSMs and TBMs is poorly understood and has not been analyzed systematically. Method: Uses centered kernel alignment, stability metrics, and probing to analyze how representations evolve at the token and layer level in SSMs and TBMs, combined with theoretical analysis and parameter randomization experiments. Result: TBMs rapidly homogenize token representations in early layers, with diversity reemerging only in later layers; SSMs preserve token uniqueness early but converge to homogenization in deeper layers. Theory and experiments indicate that oversmoothing in TBMs is mainly caused by the architecture, whereas in SSMs it arises mostly from training dynamics. Conclusion: The study clarifies the inductive biases of SSMs and TBMs in representation propagation and informs model architecture and training design for long-context reasoning. Abstract: State Space Models (SSMs) have recently emerged as efficient alternatives to Transformer-Based Models (TBMs) for long-sequence processing, offering linear scaling and lower memory use. Yet, how contextual information flows across layers and tokens in these architectures remains understudied. We present the first unified, token- and layer-level analysis of representation propagation in SSMs and TBMs. Using centered kernel alignment, stability metrics, and probing, we characterize how representations evolve within and across layers. We find a key divergence: TBMs rapidly homogenize token representations, with diversity reemerging only in later layers, while SSMs preserve token uniqueness early but converge to homogenization deeper. Theoretical analysis and parameter randomization further reveal that oversmoothing in TBMs stems from architectural design, whereas in SSMs it arises mainly from training dynamics. These insights clarify the inductive biases of both architectures and inform future model and training designs for long-context reasoning.
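
Centered kernel alignment (CKA), one of the tools named above, compares two sets of representations of the same inputs. The sketch below implements the standard linear-kernel variant in plain Python; it illustrates the general metric rather than the authors' analysis code, and the toy matrices are hypothetical.

```python
import math


def gram(X):
    """Linear-kernel Gram matrix X X^T for a list of row vectors (one row per token)."""
    return [[sum(a * b for a, b in zip(u, v)) for v in X] for u in X]


def center(K):
    """Double-center a square matrix: subtract row and column means, add the grand mean."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]


def frob_inner(A, B):
    """Frobenius inner product of two equally sized matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def linear_cka(X, Y):
    """Linear CKA between two representations of the same n inputs (rows must align)."""
    Kc, Lc = center(gram(X)), center(gram(Y))
    return frob_inner(Kc, Lc) / math.sqrt(frob_inner(Kc, Kc) * frob_inner(Lc, Lc))


# Hypothetical representations of four tokens at two layers.
layer_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
layer_b = [[3.0 * x for x in row] for row in layer_a]  # same geometry, rescaled
layer_c = [[1.0], [1.0], [0.0], [5.0]]                 # different geometry
same = linear_cka(layer_a, layer_b)
different = linear_cka(layer_a, layer_c)
```

CKA is invariant to isotropic scaling and orthogonal transformations, so `same` evaluates to 1 up to floating-point error, while representations with a different geometry score lower; this is what makes the metric suitable for comparing layers of different widths.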

[35] Aligning Large Language Models via Fully Self-Synthetic Data

Shangjian Yin,Zhepei Wei,Xinyu Zhu,Wei-Lin Chen,Yu Meng

Main category: cs.CL

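TL;DR: Proposes Self-Alignment Optimization (SAO), a fully self-synthetic framework for aligning large language models in which all training data (prompts, responses, and preferences) is generated by the model itself, with no human or external annotation, substantially reducing cost while improving chat capabilities.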

Details

Motivation: Traditional reinforcement learning from human feedback (RLHF) and reinforcement learning from AI feedback (RLAIF) depend on expensive human annotation or external models (such as GPT-4) to produce preference data, which is costly and resource-intensive; a cheaper, more autonomous alignment method is needed. Method: SAO has the LLM engage in persona role-play to generate diverse user prompts and responses, and then has the model self-evaluate those responses for preference optimization, with no external intervention or additional annotation. Result: Experiments show that SAO substantially improves chat capabilities on standard benchmarks such as AlpacaEval 2.0 while maintaining strong performance on downstream tasks such as question answering and math reasoning. Conclusion: SAO offers a practical approach to LLM self-alignment, achieving low-cost, high-quality self-improvement, with code released for reproducibility. Abstract: Traditional reinforcement learning from human feedback (RLHF) for large language models (LLMs) relies on expensive human-annotated datasets, while Reinforcement Learning from AI Feedback (RLAIF) also incurs significant costs, requiring the collection of diverse prompts and corresponding responses, often necessitating external reward models or proprietary models like GPT-4 to annotate preference pairs. In this work, we introduce Self-Alignment Optimization (SAO), a fully self-synthetic framework for LLM alignment, where all training data, including prompts (i.e., user queries), responses, and preferences, are generated by the model itself. Specifically, SAO first instructs the LLM to engage in persona role-play and generate diverse prompts and responses, which are then self-evaluated for preference optimization. Extensive experiments demonstrate that SAO effectively enhances the model's chat capabilities on standard benchmarks like AlpacaEval 2.0, while maintaining strong performance on downstream objective tasks (e.g., question-answering, math reasoning). Our work provides a practical solution for self-improvement in aligning LLMs, and the code for reproducing our results is available at: https://github.com/SJY8460/SAO.

[36] ToolMem: Enhancing Multimodal Agents with Learnable Tool Capability Memory

Yunzhong Xiao,Yangmin Li,Hewei Wang,Yunlong Tang,Zora Zhiruo Wang

Main category: cs.CL

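TL;DR: Proposes ToolMem, which lets agents remember the strengths and weaknesses of neural tools to choose tools more accurately, substantially improving tool performance prediction and tool selection in text and multimodal generation tasks.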

Details

Motivation: Existing agents typically use fixed tools and cannot flexibly pick the best tool for a given task scenario, whereas humans accumulate experience to choose suitable tools; agents need a similar ability to remember and select. Method: Proposes the ToolMem framework, in which the agent summarizes a tool's strengths and weaknesses after interacting with it and stores them in memory, then retrieves relevant memories at inference to select the most suitable tool. Result: On text generation and text-to-image generation tasks, compared with a no-memory baseline agent, ToolMem improves tool performance prediction accuracy by 14.8% and 28.7% respectively, and raises the success rate of optimal tool selection by 21% and 24%. Conclusion: ToolMem effectively improves an agent's ability to choose among uncertain neural tools across diverse tasks, increasing flexibility and performance. Abstract: Agents utilizing tools powered by large language models (LLMs) or vision-language models (VLMs) have demonstrated remarkable progress in diverse tasks across text and visual modalities. Unlike traditional tools such as calculators, which give deterministic outputs, neural tools perform uncertainly across task scenarios. While different tools for a task may excel in varied scenarios, existing agents typically rely on fixed tools, thus limiting the flexibility in selecting the most suitable tool for specific tasks. In contrast, humans snowball their understanding of the capabilities of different tools by interacting with them, and apply this knowledge to select the optimal tool when solving a future task. To build agents that similarly benefit from this process, we propose ToolMem, which enables agents to develop memories of tool capabilities from previous interactions by summarizing their strengths and weaknesses and storing them in memory; at inference, the agent can retrieve relevant entries from ToolMem and select the best tool to solve individual tasks more accurately. We evaluate ToolMem on learning varied text generation and text-to-image generation neural tools. Compared to no-memory, generic agents, we find ToolMem-augmented agents predict tool performance 14.8% and 28.7% more accurately across text and multimodal generation scenarios. Moreover, ToolMem facilitates optimal tool selection among multiple choices by 21% and 24% absolute increases in respective scenarios.

[37] PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch

Shangjian Yin,Shining Liang,Wenbiao Ding,Yuli Qian,Zhouxing Shi,Hongzhi Li,Yutao Xie

Main category: cs.CL

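TL;DR: This paper introduces PiKa, a data-efficient family of expert-level alignment datasets. With only 30k supervised fine-tuning (SFT) examples it outperforms models trained on millions of proprietary examples, performs strongly on both Llama and Qwen models, and offers a scalable path for open-source LLM alignment.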

Details

Motivation: Most existing LLM alignment datasets depend on costly human annotation or private data, which limits reproducibility and scalability; current approaches also need large amounts of data (for example, more than 300k examples) and still underperform. The paper asks whether a small, high-quality dataset is enough for effective alignment. Method: Proposes the PiKa datasets, of which PiKa-SFT uses only 30k high-quality instruction examples for supervised fine-tuning. Experiments fine-tune Llama-3-8B-Base and the Qwen2.5 series, evaluate on benchmarks such as AlpacaEval 2.0 and Arena-Hard, and compare against models trained on much larger datasets. Result: On AlpacaEval 2.0 and Arena-Hard, Llama-3-8B-Base fine-tuned on PiKa-SFT surpasses Llama-3-8B-Instruct, which was trained on over 10 million proprietary examples; the Qwen2.5 series (0.5B to 7B) also shows consistent gains. Conclusion: High-quality alignment can be achieved with significantly less data. PiKa shows that small, carefully built datasets are effective for alignment and gives resource-limited research communities a feasible, scalable open-source option. Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models (LLMs). However, its effectiveness depends on high-quality instruction data. Most existing alignment datasets are either private or require costly human annotation, which limits reproducibility and scalability. Even with Reinforcement Learning from AI Feedback (RLAIF), concerns about data quality remain. Moreover, it is unclear how much data is actually required to fine-tune a base model into a strong instruction-following model. Current approaches often rely on over 300k examples even at the supervised fine-tuning (SFT) stage, yet they still underperform compared to proprietary models, creating barriers for academic and resource-limited communities. To address this gap, we introduce PiKa, a data-efficient family of expert-level alignment datasets. In particular, the PiKa-SFT dataset uses only 30k SFT examples, far fewer than state-of-the-art datasets like Magpie. Through evaluations by fine-tuning Llama-3-8B-Base on PiKa and other public datasets, we show that PiKa-SFT outperforms models trained on much larger data. On AlpacaEval 2.0 and Arena-Hard benchmarks, PiKa-SFT fine-tuning even surpasses the official Llama-3-8B-Instruct model trained on over 10 million proprietary examples. We further extend our study by training the Qwen2.5 series (0.5B to 7B) on PiKa-SFT, achieving consistent gains. These findings demonstrate that high-quality alignment can be achieved with significantly less data, offering a scalable path for open-source LLM alignment. Code and data: https://github.com/SJY8460/PiKa.

[38] Incremental Summarization for Customer Support via Progressive Note-Taking and Agent Feedback

Yisha Wu,Cen Zhao,Yuanpei Cao,Xiaoqing Su,Yashar Mehdad,Mindy Ji,Claire Na Cheng

Main category: cs.CL

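TL;DR: Proposes an incremental summarization system for customer support agents that decides when to generate concise bullet notes during a conversation, reducing context switching and redundant review. It pairs a Mixtral-8x7B model for note generation with a DeBERTa classifier that filters irrelevant content, and uses agent edits for online refinement and offline retraining. Deployed in production, it cut case handling time by 3% relative to bulk summarization (up to 9% in complex cases) and received high user satisfaction ratings.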

Details

Motivation: Support agents must switch context frequently and re-read conversations, which hurts efficiency, and traditional bulk summarization cannot help in real time; an incremental method that produces high-quality summaries dynamically is needed. Method: A fine-tuned Mixtral-8x7B model continuously generates conversation notes, a DeBERTa classifier filters trivial content, and agents' edits to the generated notes feed back into the models in a closed loop to keep improving generation quality. Result: In production the system reduced case handling time by 3%, with reductions of up to 9% in highly complex cases, and earned high satisfaction ratings from support agents. Conclusion: Incremental summarization combined with continuous feedback effectively improves summary quality and agent productivity, and is feasible at scale. Abstract: We introduce an incremental summarization system for customer support agents that intelligently determines when to generate concise bullet notes during conversations, reducing agents' context-switching effort and redundant review. Our approach combines a fine-tuned Mixtral-8x7B model for continuous note generation with a DeBERTa-based classifier to filter trivial content. Agent edits refine the online note generation and regularly inform offline model retraining, closing the agent-edit feedback loop. Deployed in production, our system achieved a 3% reduction in case handling time compared to bulk summarization (with reductions of up to 9% in highly complex cases), alongside high agent satisfaction ratings from surveys. These results demonstrate that incremental summarization with continuous feedback effectively enhances summary quality and agent productivity at scale.

[39] Learning to Rewrite Prompts for Bootstrapping LLMs on Downstream Tasks

Qinhao Zhou,Xiang Xiang,Kun He,John E. Hopcroft

Main category: cs.CL

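TL;DR: This paper proposes a new prompt optimization method for machine translation that uses a small-parameter model and a back-translation-based strategy, reducing training overhead while delivering effective performance.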

Details

Motivation: Existing prompt optimization methods focus on the instruction component for general tasks and depend on large-parameter models, so they transfer poorly to tasks such as machine translation where the input component is what matters. Method: Trains a small-parameter model with a back-translation-based strategy to optimize the input component of the prompt, reducing training overhead and improving machine translation performance. Result: The method performs effectively on machine translation at low training cost and has the potential to extend to other downstream tasks. Conclusion: The method offers a new approach to input-centric prompt optimization that is particularly suited to natural language generation tasks such as machine translation, and it works without relying on large models. Abstract: In recent years, the growing interest in Large Language Models (LLMs) has significantly advanced prompt engineering, transitioning from manual design to model-based optimization. Prompts for LLMs generally comprise two components: the instruction, which defines the task or objective, and the input, which is tailored to the instruction type. In natural language generation (NLG) tasks such as machine translation, the input component is particularly critical, while the instruction component tends to be concise. Existing prompt engineering methods primarily focus on optimizing the instruction component for general tasks, often requiring large-parameter LLMs as auxiliary tools. However, these approaches exhibit limited applicability for tasks like machine translation, where the input component plays a more pivotal role. To address this limitation, this paper introduces a novel prompt optimization method specifically designed for machine translation tasks. The proposed approach employs a small-parameter model trained using a back-translation-based strategy, significantly reducing training overhead for single-task optimization while delivering highly effective performance. With certain adaptations, this method can also be extended to other downstream tasks.

[40] How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects

Leonardo Bertolazzi,Sandro Pezzelle,Raffaelle Bernardi

Main category: cs.CL

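TL;DR: This study investigates the mechanism behind content effects in large language models (LLMs), i.e., judgment biases in reasoning tasks caused by the plausibility of the semantic content. It finds that validity and plausibility are linearly represented and strongly aligned in the models' representation space, which leads the models to conflate them; steering and debiasing vectors reduce this conflation and improve reasoning accuracy.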

Details

Motivation: To understand the internal mechanism behind content effects in LLMs, clarify how it relates to the dual-process theory of human reasoning, and explore how to reduce bias in models' logical reasoning. Method: Analyzes the geometry of validity and plausibility in LLM internal representations, uses linear probes and steering vectors to study how the two relate, and constructs debiasing vectors to disentangle the two concepts. Result: Validity and plausibility are linearly represented and strongly aligned in representation space; steering vectors for one causally bias judgments of the other; the degree of alignment predicts the strength of a model's content effects; and the proposed debiasing reduces content effects and improves logical reasoning accuracy. Conclusion: Conflating plausibility with logical validity is a key mechanism behind content effects in LLMs, and representation-level interventions such as debiasing vectors can improve models' logical reasoning. Abstract: Both humans and large language models (LLMs) exhibit content effects: biases in which the plausibility of the semantic content of a reasoning problem influences judgments regarding its logical validity. While this phenomenon in humans is best explained by the dual-process theory of reasoning, the mechanisms behind content effects in LLMs remain unclear. In this work, we address this issue by investigating how LLMs encode the concepts of validity and plausibility within their internal representations. We show that both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, we demonstrate that plausibility vectors can causally bias validity judgements, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models. Finally, we construct debiasing vectors that disentangle these concepts, reducing content effects and improving reasoning accuracy. Our findings advance understanding of how abstract logical concepts are represented in LLMs and highlight representational interventions as a path toward more logical systems.
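
The "degree of alignment" between the validity and plausibility concepts in [40] can be pictured as the cosine similarity between two direction vectors in activation space. The following is a minimal sketch under stated assumptions, not the authors' procedure: the abstract does not say how the concept vectors are derived, so a difference-of-means direction is assumed here, and the activations are made-up toy numbers.

```python
import math

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def difference_of_means(positive, negative):
    """Assumed concept direction: mean(positive) - mean(negative)."""
    mp, mn = mean_vector(positive), mean_vector(negative)
    return [a - b for a, b in zip(mp, mn)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy activations (3 hidden dimensions, 2 examples per class).
valid_acts = [[1.0, 0.0, 0.5], [0.8, 0.2, 0.4]]
invalid_acts = [[0.0, 0.0, 0.5], [0.2, 0.2, 0.4]]
plausible_acts = [[0.9, 0.1, 0.0], [0.7, 0.1, 0.2]]
implausible_acts = [[0.1, 0.1, 0.0], [0.1, 0.1, 0.2]]

validity_dir = difference_of_means(valid_acts, invalid_acts)
plausibility_dir = difference_of_means(plausible_acts, implausible_acts)
alignment = cosine(validity_dir, plausibility_dir)
print(f"alignment between directions: {alignment:.3f}")
```

In this toy data both directions point along the first hidden dimension, so the alignment is close to 1; orthogonal directions would give 0, the situation in which steering one concept would not be expected to shift the other.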

[41] Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management

Miao Lu,Weiwei Sun,Weihua Du,Zhan Ling,Xuesong Yao,Kang Liu,Jiecao Chen

Main category: cs.CL

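TL;DR: This paper proposes SUPO, a summarization-based context management method for RL fine-tuning of long-horizon, multi-turn tool use under a fixed context length limit; it significantly improves task success rates while keeping context length under control.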

Details

Motivation: Existing RL pipelines for long-horizon multi-turn tool use suffer from degraded instruction following, high rollout costs, and limited context length, which makes them hard to scale to long-context tasks. Method: Periodically compresses the tool-use history with LLM-generated summaries that retain task-relevant information, and builds on this an end-to-end policy gradient framework that jointly optimizes tool-use behavior and summarization strategy. Result: On interactive function calling and search tasks, SUPO significantly improves the success rate while keeping the working context length the same or lower; increasing the number of summarization rounds at test time further improves performance on complex tasks. Conclusion: Summarization-based context management is a principled and scalable way to train RL agents beyond a fixed context length limit. Abstract: We study reinforcement learning (RL) fine-tuning of large language model (LLM) agents for long-horizon multi-turn tool use, where context length quickly becomes a fundamental bottleneck. Existing RL pipelines can suffer from degraded instruction following, excessive rollout costs, and most importantly, strict context limits. To address these challenges, we introduce summarization-based context management to training. Specifically, it periodically compresses the tool-use history into LLM-generated summaries that retain task-relevant information, keeping the context compact while enabling the agent to scale beyond the fixed context window. Building on this formulation, we derive a policy gradient representation that seamlessly enables standard LLM RL infrastructures to optimize both tool-use behaviors and summarization strategies in an end-to-end fashion. We instantiate this framework with SUmmarization augmented Policy Optimization (SUPO), an LLM RL algorithm that enables long-horizon training beyond a fixed context limit. Experiments on interactive function calling and searching tasks demonstrate that SUPO significantly improves the success rate while maintaining the same or even lower working context length compared to baselines. We also demonstrate that for complex searching tasks, SUPO can further improve the evaluation performance when scaling the test-time maximum number of summarization rounds beyond that of training time. Our results establish summarization-based context management as a principled and scalable approach for training RL agents beyond a fixed context length limit.

[42] PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs

Manuel Frank,Haithem Afli

Main category: cs.CL

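TL;DR: This paper introduces PTEB, an evaluation protocol that dynamically generates meaning-preserving paraphrases and aggregates results over multiple runs, addressing the performance inflation that static test sets can cause.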

Details

Motivation: Current evaluations of sentence embedding models rely on fixed test sets such as MTEB; repeated tuning on them inflates reported performance and hides real-world robustness. Method: Proposes PTEB, which uses a low-cost LLM-based method to stochastically generate paraphrases at evaluation time that preserve meaning but vary the wording, and evaluates on results combined across multiple runs. Result: Experiments on 7 MTEB tasks and 3 multilingual datasets confirm that sentence encoders are sensitive to changes in token space; smaller models are not affected more than larger ones, and the results are statistically robust across runs. Conclusion: PTEB offers a more reliable way to evaluate and argues for a shift in NLP toward dynamic, stochastic evaluation that uses eval-time compute. Abstract: Current evaluations of sentence embedding models typically rely on static test beds such as the Massive Text Embedding Benchmark (MTEB). While invaluable, repeated tuning on a fixed suite can inflate reported performance and obscure real-world robustness. We introduce the Paraphrasing Text Embedding Benchmark (PTEB), a dynamic protocol that stochastically generates meaning-preserving paraphrases at evaluation time and aggregates results across multiple runs. Using a cost-efficient LLM-based method grounded in semantic textual similarity gold ratings, we show that LLMs generate token-diverse but semantically preserving paraphrases. Across 7 MTEB tasks, we validate our hypothesis that the performance of sentence encoders is sensitive to changes in token space even when semantics remain fixed. We also observe that smaller models are not disproportionately affected relative to larger ones. Our results are statistically robust over multiple runs, and we extend our experiments to 3 multilingual datasets covering 10 languages. More generally, we aim to propose a new evaluation paradigm in NLP that relies less on static, pre-defined benchmarks and shifts towards dynamic, stochastic evaluation leveraging eval-time compute.
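
The protocol in [42] (paraphrase stochastically at evaluation time, then aggregate across runs) can be sketched roughly as follows. This is only an illustration of the aggregation loop: `evaluate_over_paraphrase_runs`, `toy_paraphrase`, and `toy_score` are hypothetical names, the synonym table stands in for an LLM paraphraser, and the toy score stands in for a real embedding benchmark metric.

```python
import random
import statistics

def evaluate_over_paraphrase_runs(texts, paraphrase, score, runs=5, seed=0):
    """Score freshly paraphrased inputs `runs` times; return (mean, stdev)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    run_scores = []
    for _ in range(runs):
        rewritten = [paraphrase(text, rng) for text in texts]
        run_scores.append(score(rewritten))
    # Sample standard deviation needs at least two runs.
    spread = statistics.stdev(run_scores) if len(run_scores) > 1 else 0.0
    return statistics.mean(run_scores), spread

# Hypothetical stand-in for an LLM paraphraser: random synonym substitution.
SYNONYMS = {"quick": ["quick", "fast", "rapid"]}

def toy_paraphrase(text, rng):
    return " ".join(rng.choice(SYNONYMS.get(word, [word])) for word in text.split())

# Hypothetical stand-in for a task metric: a brittle, token-sensitive check.
def toy_score(texts):
    return sum("quick" in text.split() for text in texts) / len(texts)

mean_score, spread = evaluate_over_paraphrase_runs(
    ["the quick fox", "a quick reply"], toy_paraphrase, toy_score
)
print(f"mean={mean_score:.2f} stdev={spread:.2f}")
```

Reporting a mean together with a spread, rather than a single number from a fixed test set, is the point of the design: a metric that depends on surface tokens, like the toy one here, varies from run to run even though the meaning of the inputs is unchanged.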

[43] Are LLMs Reliable Rankers? Rank Manipulation via Two-Stage Token Optimization

Tiancheng Xing,Jerry Li,Yixuan Du,Xiyang Hu

Main category: cs.CL

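TL;DR: This paper proposes Rank Anything First (RAF), a two-stage token optimization method that uses natural-language perturbations to raise a target item's position in rankings produced by large language models (LLMs), exposing how vulnerable LLM rerankers are to adversarial manipulation.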

Details

Motivation: LLMs are widely used as rerankers in information retrieval, but their ranking behavior can be swayed by small, natural-sounding prompts. Exposing this vulnerability requires an attack that is both effective and hard to detect, so that the robustness of LLM reranking can be assessed. Method: RAF works in two stages. Stage 1 uses Greedy Coordinate Gradient, combining the gradient of the ranking objective with a readability score, to shortlist candidate tokens. Stage 2 evaluates the candidates under exact ranking and readability losses with an entropy-based dynamic weighting scheme and selects a token by temperature-controlled sampling. The method generates rank-promoting prompts token by token, balancing ranking effect against linguistic naturalness. Result: Experiments on multiple LLMs show that RAF significantly raises the rank of target items with highly natural prompts, and that it is more effective and more robust than existing methods. Conclusion: LLM-based reranking carries a serious security risk: it is susceptible to adversarial manipulation that is both effective and hard to detect, which raises new challenges for the trustworthiness and robustness of modern retrieval systems. Abstract: Large language models (LLMs) are increasingly used as rerankers in information retrieval, yet their ranking behavior can be steered by small, natural-sounding prompts. To expose this vulnerability, we present Rank Anything First (RAF), a two-stage token optimization method that crafts concise textual perturbations to consistently promote a target item in LLM-generated rankings while remaining hard to detect. Stage 1 uses Greedy Coordinate Gradient to shortlist candidate tokens at the current position by combining the gradient of the rank-target with a readability score; Stage 2 evaluates those candidates under exact ranking and readability losses using an entropy-based dynamic weighting scheme, and selects a token via temperature-controlled sampling. RAF generates ranking-promoting prompts token-by-token, guided by dual objectives: maximizing ranking effectiveness and preserving linguistic naturalness. Experiments across multiple LLMs show that RAF significantly boosts the rank of target items using naturalistic language, with greater robustness than existing methods in both promoting target items and maintaining naturalness. These findings underscore a critical security implication: LLM-based reranking is inherently susceptible to adversarial manipulation, raising new challenges for the trustworthiness and robustness of modern retrieval systems. Our code is available at: https://github.com/glad-lab/RAF.

[44] AWM: Accurate Weight-Matrix Fingerprint for Large Language Models

Boyi Zeng,Lin Chen,Ziwei He,Xinbing Wang,Zhouhan Lin

Main category: cs.CL

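TL;DR: Proposes a training-free fingerprinting method based on weight matrices that uses the Linear Assignment Problem and an unbiased Centered Kernel Alignment similarity to withstand a range of post-training operations, enabling efficient and accurate model lineage verification.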

Details

Motivation: Protecting the intellectual property of large language models requires a way to determine whether a suspect model is derived from an existing base model. Method: Designs a training-free fingerprinting method based on weight matrices that combines the Linear Assignment Problem (LAP) with an unbiased Centered Kernel Alignment (CKA) similarity measure to neutralize the effects of parameter manipulation. Result: On a testbed of 60 positive and 90 negative model pairs, the method is highly robust to six categories of post-training operations, has a near-zero risk of false positives, achieves perfect scores on all classification metrics, and completes within 30 seconds on an NVIDIA 3090 GPU. Conclusion: The method provides a strong basis for reliable LLM lineage verification, combining efficiency, high fidelity, and robustness. Abstract: Protecting the intellectual property of large language models (LLMs) is crucial, given the substantial resources required for their training. Consequently, there is an urgent need for both model owners and third parties to determine whether a suspect LLM is trained from scratch or derived from an existing base model. However, the intensive post-training processes that models typically undergo (such as supervised fine-tuning, extensive continued pretraining, reinforcement learning, multi-modal extension, pruning, and upcycling) pose significant challenges to reliable identification. In this work, we propose a training-free fingerprinting method based on weight matrices. We leverage the Linear Assignment Problem (LAP) and an unbiased Centered Kernel Alignment (CKA) similarity to neutralize the effects of parameter manipulations, yielding a highly robust and high-fidelity similarity metric. On a comprehensive testbed of 60 positive and 90 negative model pairs, our method demonstrates exceptional robustness against all six aforementioned post-training categories while exhibiting a near-zero risk of false positives. By achieving perfect scores on all classification metrics, our approach establishes a strong basis for reliable model lineage verification. Moreover, the entire computation completes within 30s on an NVIDIA 3090 GPU. The code is available at https://github.com/LUMIA-Group/AWM.

[45] TWIST: Training-free and Label-free Short Text Clustering through Iterative Vector Updating with LLMs

I-Fan Lin,Faegheh Hasibi,Suzan Verberne

Main category: cs.CL

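TL;DR: Proposes a training-free, label-free method for short text clustering that uses iterative vector updating guided by LLMs and, without prior knowledge of clusters, matches or exceeds existing state-of-the-art methods.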

Details

Motivation: In real commercial settings such as customer-facing chatbots, labeled data is usually unavailable and the number of clusters is unknown, so an efficient short text clustering method that needs neither labels nor a known cluster count is required. Method: Builds sparse vectors from representative texts and refines them iteratively under LLM guidance; the approach works with any embedding model and does not depend on contrastive learning. Result: Experiments on several datasets and with smaller LLMs show that the method is model agnostic, scales well, has low computational cost, and performs on par with or better than existing state-of-the-art methods. Conclusion: The method performs well in low-resource, scalable real-world settings and is more practical than existing clustering methods. Abstract: In this paper, we propose a training-free and label-free method for short text clustering that can be used on top of any existing embedder. In the context of customer-facing chatbots, companies are dealing with large amounts of user utterances that need to be clustered according to their intent. In these commercial settings, no labeled data is typically available, and the number of clusters is not known. Our method is based on iterative vector updating: it constructs sparse vectors based on representative texts, and then iteratively refines them through LLM guidance. Our method achieves comparable or superior results to state-of-the-art methods that use contrastive learning, but without assuming prior knowledge of clusters or labels. Experiments on diverse datasets and smaller LLMs show that our method is model agnostic and can be applied to any embedder, with relatively small LLMs, and different clustering methods. We also show that our method scales to large datasets, reducing the computational cost of the LLM. These low-resource, adaptable settings and the scalability of our method make it more aligned with real-world scenarios than existing clustering methods.

[46] A Formal Framework for Fluency-based Multi-Reference Evaluation in Grammatical Error Correction

Eitan Klinger,Zihao Huang,Tran Minh Nguyen,Emma Jayeon Park,Yige Chen,Yang Gu,Qingyu Gao,Siliang Liu,Mengyang Qiu,Jungyeul Park

Main category: cs.CL

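TL;DR: Proposes a fluency-based multi-reference evaluation framework that instantiates GLEU through four aggregation strategies, unifying and improving the evaluation of multilingual grammatical error correction.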

Details

Motivation: Existing evaluation methods for grammatical error correction are mostly edit-based and English-centric, adapt poorly to multilingual and generative settings, and do not adequately reflect the diversity of valid human corrections. Method: Frames n-gram similarity as an aggregation problem over multiple legitimate corrections, proposes a fluency-based multi-reference evaluation framework, and instantiates four GLEU aggregation strategies: select-best, simple-average, weighted-average, and merged-counts. Result: Experiments on Czech, Estonian, Ukrainian, and Chinese corpora show that the aggregation strategies capture complementary aspects of fluency and coverage, and that the framework accommodates linguistic diversity without penalizing legitimate variation. Conclusion: The framework gives grammatical error correction a principled, fluency-oriented, unified approach to multi-reference evaluation that suits multilingual settings and supports the diversity of valid corrections. Abstract: Evaluating grammatical error correction requires metrics that reflect the diversity of valid human corrections rather than privileging a single reference. Existing frameworks, largely edit-based and English-centric, rely on rigid alignments between system and reference edits, limiting their applicability in multilingual and generative settings. This paper introduces a formal framework for fluency-based multi-reference evaluation, framing n-gram similarity as an aggregation problem over multiple legitimate corrections. Within this formulation, we instantiate GLEU through four aggregation strategies (select-best, simple-average, weighted-average, and merged-counts) and analyze their properties of boundedness, monotonicity, and sensitivity to reference variation. Empirical results on Czech, Estonian, Ukrainian, and Chinese corpora show that these strategies capture complementary aspects of fluency and coverage. The framework unifies multi-reference evaluation into a principled, fluency-oriented approach that incorporates linguistic diversity without penalizing legitimate variation.
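
Three of the four aggregation strategies named in [46] can be illustrated on a list of per-reference scores. The sketch below assumes each score has already been computed by some sentence-level metric against one reference; the `aggregate` function, the example scores, and the weights are hypothetical. The merged-counts strategy is left out because the abstract does not specify how counts are merged.

```python
def aggregate(scores, strategy, weights=None):
    """Combine per-reference scores for one hypothesis into a single value."""
    if not scores:
        raise ValueError("at least one reference score is required")
    if strategy == "select-best":
        # Credit the hypothesis with its closest reference.
        return max(scores)
    if strategy == "simple-average":
        return sum(scores) / len(scores)
    if strategy == "weighted-average":
        if weights is None or len(weights) != len(scores):
            raise ValueError("weighted-average needs one weight per score")
        return sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical scores of one system output against three references.
per_reference = [0.2, 0.6, 0.4]
for name in ("select-best", "simple-average"):
    print(name, round(aggregate(per_reference, name), 3))
print("weighted-average", round(aggregate(per_reference, "weighted-average", [1, 1, 2]), 3))
```

For bounded inputs every strategy here stays within the range of the individual scores, and select-best is never lower than either average, which is one way to see why the strategies can rank systems differently when references disagree.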

[47] Gold-Switch: Training-Free Superposition of Slow- and Fast- Thinking LLMs

Jaeseong Lee,Dayoung Kwon,seung-won hwang

Main category: cs.CL

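TL;DR: Proposes a lightweight, training-free superposed deployment strategy that selectively suppresses overthinking in a large reasoning model (LRM) at inference time, lowering compute cost while preserving reasoning ability.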

Details

Motivation: Large reasoning models (LRMs) excel at structured tasks but tend to overthink, which degrades performance and wastes resources. Routing between multiple models is costly and often impractical. Method: Analyzes the cumulative energy of singular values to identify optimal low-rank projections, then adjusts the LRM's reasoning intensity at inference time, selectively "unlearning" to avoid overthinking. Result: The method needs no additional training and reduces computation while preserving the LRM's reasoning performance. Conclusion: The proposed superposed deployment strategy is an efficient, low-cost way to optimize LRM inference that balances reasoning ability against computational overhead. Abstract: Large Reasoning Models (LRMs) excel in structured tasks by emulating deliberate human reasoning but often suffer from overthinking, degrading performance and wasting resources. One possible baseline is to deploy both an LLM and an LRM, then route each input by predicting whether it requires reasoning and may cause overthinking. However, deploying multiple models can be costly or impractical. We propose a superposed deployment strategy with a lightweight, training-free regulation to optimize inference by switching one model on and off. Instead of routing, we selectively unlearn from the LRM at inference, scaling down computation while preserving reasoning. By analyzing the cumulative energy of singular values, we identify optimal low-rank projections to adjust reasoning just right.
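
The rank-selection idea in [47] (choose a low-rank projection from the cumulative energy of singular values) can be sketched as below. This is an illustration under assumptions rather than the paper's procedure: energy is taken to be the squared singular value, which is a common convention, and both the `rank_for_energy` helper and the 0.9 threshold are hypothetical. The singular values are assumed to be sorted in descending order, as an SVD routine would return them.

```python
def rank_for_energy(singular_values, threshold=0.9):
    """Smallest k such that the top-k singular values hold `threshold` of the energy."""
    energies = [s * s for s in singular_values]  # assumed definition of energy
    total = sum(energies)
    if total == 0:
        return 0
    running = 0.0
    for k, energy in enumerate(energies, start=1):
        running += energy
        if running / total >= threshold:
            return k
    return len(energies)

# Hypothetical singular values of a weight-difference matrix.
sigmas = [3.0, 2.0, 1.0]
print(rank_for_energy(sigmas, threshold=0.9))
```

With these toy values the energies are 9, 4, and 1 (total 14): the first component alone holds about 64% of the energy and the first two hold about 93%, so a 0.9 threshold keeps rank 2.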

[48] Adaptive LLM-Symbolic Reasoning via Dynamic Logical Solver Composition

Lei Xu,Pierre Beckmann,Marco Valentino,André Freitas

Main category: cs.CL

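TL;DR: This paper proposes an adaptive, multi-paradigm neuro-symbolic reasoning framework that automatically identifies the formal reasoning strategy behind a natural-language problem and dynamically selects a suitable logical solver. Experiments show that it clearly outperforms existing baselines.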

Details

Motivation: Existing neuro-symbolic NLP methods are mostly static in design and cannot flexibly integrate diverse formal reasoning strategies, which limits their performance on complex reasoning tasks. Method: Uses large language models to identify the reasoning strategy in a natural-language problem and, through autoformalization interfaces, dynamically selects and invokes specialized logical solvers, giving adaptive multi-paradigm reasoning. Result: LLMs predict the required formal reasoning strategy with over 90% accuracy; the framework outperforms the GPT-4o and DeepSeek-V3.1 baselines by 27% and 6% respectively; GPT-4o also gains up to 10% in zero-shot, CoT, and symbolic CoT settings; and post-training improves smaller models. Conclusion: The work lays the foundations for adaptive LLM-symbolic reasoning and offers a new path toward unifying material and formal inference on heterogeneous reasoning challenges. Abstract: Neuro-symbolic NLP methods aim to leverage the complementary strengths of large language models and formal logical solvers. However, current approaches are mostly static in nature, i.e., the integration of a target solver is predetermined at design time, hindering the ability to employ diverse formal inference strategies. To address this, we introduce an adaptive, multi-paradigm, neuro-symbolic inference framework that: (1) automatically identifies formal reasoning strategies from problems expressed in natural language; and (2) dynamically selects and applies specialized formal logical solvers via autoformalization interfaces. Extensive experiments on individual and multi-paradigm reasoning tasks support the following conclusions: LLMs are effective at predicting the necessary formal reasoning strategies with an accuracy above 90 percent. This enables flexible integration with formal logical solvers, resulting in our framework outperforming competing baselines by 27 percent and 6 percent compared to GPT-4o and DeepSeek-V3.1, respectively. Moreover, adaptive reasoning can even positively impact pure LLM methods, yielding gains of 10, 5, and 6 percent on zero-shot, CoT, and symbolic CoT settings with GPT-4o. Finally, although smaller models struggle with adaptive neuro-symbolic reasoning, post-training offers a viable path to improvement. Overall, this work establishes the foundations for adaptive LLM-symbolic reasoning, offering a path forward for unifying material and formal inferences on heterogeneous reasoning challenges.

[49] Foundations of LLM Knowledge Materialization: Termination, Reproducibility, Robustness

Luca Giordano,Simon Razniewski

Main category: cs.CL

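TL;DR: This paper studies the termination, reproducibility, and robustness of knowledge materialization from large language models (LLMs). Experiments with miniGPTKBs in three domains show that the process usually terminates and is robust to some variations, but that reproducibility is limited across languages and models.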

Details

Motivation: LLMs contain a large amount of factual knowledge, but how to measure and systematize it remains unclear, and the process of converting it into structured form in particular has not been studied in depth. Method: Uses miniGPTKBs (domain-specific subcrawls) to run experiments over four variations (seed, language, randomness, and model) and evaluates three categories of metrics: yield, lexical similarity, and semantic similarity. Result: (i) Termination rates are high but model-dependent; (ii) reproducibility is mixed; (iii) robustness is high for seed and temperature perturbations and lower for language and model changes. Conclusion: LLM knowledge materialization can reliably surface core knowledge but has significant limitations across languages and models, and needs further work to improve stability and consistency. Abstract: Large Language Models (LLMs) encode substantial factual knowledge, yet measuring and systematizing this knowledge remains challenging. Converting it into structured format, for example through recursive extraction approaches such as the GPTKB methodology (Hu et al., 2025b), is still underexplored. Key open questions include whether such extraction can terminate, whether its outputs are reproducible, and how robust they are to variations. We systematically study LLM knowledge materialization using miniGPTKBs (domain-specific, tractable subcrawls), analyzing termination, reproducibility, and robustness across three categories of metrics: yield, lexical similarity, and semantic similarity. We experiment with four variations (seed, language, randomness, model) and three illustrative domains (from history, entertainment, and finance). Our findings show (i) high termination rates, though model-dependent; (ii) mixed reproducibility; and (iii) robustness that varies by perturbation type: high for seeds and temperature, lower for languages and models. These results suggest that LLM knowledge materialization can reliably surface core knowledge, while also revealing important limitations.

[50] FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline

Haotian Wu,Shufan Jiang,Chios Chen,Yiyang Feng,Hehai Lin,Heqing Zou,Yao Shu,Yanran Li,Chengwei Qin

Main category: cs.CL

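TL;DR: This paper presents FURINA-Builder, a multi-agent collaboration pipeline that automatically constructs customizable role-playing (RP) benchmarks, and uses it to build the comprehensive RP benchmark FURINA-Bench. Experiments show that o3 and DeepSeek-R1 perform best on English and Chinese RP tasks respectively, and that reasoning-capable models improve RP performance but hallucinate more, revealing a trade-off between performance and reliability.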

Details

Motivation: Existing role-playing benchmarks become obsolete quickly because of their narrow scope, outdated interaction paradigms, and poor adaptability, and they cannot meet evaluation needs across diverse application scenarios; a flexible, scalable way to build benchmarks is needed. Method: Proposes FURINA-Builder, in which multiple agents simulate dialogues between a test character and other characters drawn from a character-scene pool, while an LLM judge selects evaluation dimensions and refines the generated test utterances. The pipeline automatically builds FURINA-Bench, a customizable, multilingual, multi-scenario RP benchmark. Result: FURINA-Bench contains both established and synthesized characters, and human evaluation supports its validity. o3 and DeepSeek-R1 lead on English and Chinese RP tasks; established characters outperform synthesized ones; and reasoning improves RP performance but increases hallucinations, giving a Pareto trade-off between performance and reliability. Conclusion: FURINA-Builder can construct high-quality, scalable role-playing benchmarks. FURINA-Bench offers a more challenging and realistic platform for evaluating LLMs' RP ability and exposes a new trade-off between character consistency and reasoning in current models. Abstract: As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios. To address this gap, we introduce FURINA-Builder, a novel multi-agent collaboration pipeline that automatically constructs fully customizable RP benchmarks at any scale. It enables evaluation of arbitrary characters across diverse scenarios and prompt formats, as the first benchmark builder in the RP area for adaptable assessment. FURINA-Builder simulates dialogues between a test character and other characters drawn from a well-constructed character-scene pool, while an LLM judge selects fine-grained evaluation dimensions and adjusts the test character's responses into final test utterances. Using this pipeline, we build FURINA-Bench, a new comprehensive role-playing benchmark featuring both established and synthesized test characters, each assessed with dimension-specific evaluation criteria. Human evaluation and preliminary separability analysis justify our pipeline and benchmark design. We conduct extensive evaluations of cutting-edge LLMs and find that o3 and DeepSeek-R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, established characters consistently outperform synthesized ones, with reasoning capabilities further amplifying this disparity. Interestingly, we observe that model scale does not monotonically reduce hallucinations. More critically, for reasoning LLMs, we uncover a novel trade-off: reasoning improves RP performance but simultaneously increases RP hallucinations. This trade-off extends to a broader Pareto frontier between RP performance and reliability for all LLMs. These findings demonstrate the effectiveness of FURINA-Builder and the challenge posed by FURINA-Bench.

[51] Overview of the Plagiarism Detection Task at PAN 2025

André Greiner-Petter,Maik Fröbe,Jan Philip Wahle,Terry Ruas,Bela Gipp,Akiko Aizawa,Martin Potthast

Main category: cs.CL

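TL;DR: This paper describes the generative plagiarism detection task at PAN 2025. It presents a large-scale dataset of plagiarism generated automatically with large language models and evaluates participant systems and baselines, finding that current methods do reasonably well on the new data but generalize poorly to older data.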

Details

Motivation: As large language models are widely used to generate text, automatically generated plagiarism in scientific literature is a growing problem that calls for effective detection methods. Method: Builds a large-scale generative plagiarism dataset with three large language models (Llama, DeepSeek-R1, and Mistral), uses embedding-based semantic similarity as a baseline, and summarizes and compares the participating teams' methods. Result: Naive embedding-based semantic similarity reaches up to 0.8 recall and 0.5 precision on the new dataset but performs markedly worse on the 2015 dataset, indicating limited generalization. Conclusion: Current generative plagiarism detection methods work on specific data but lack robustness and generality across datasets; future work should improve the adaptability and diversity of approaches. Abstract: The generative plagiarism detection task at PAN 2025 aims at identifying automatically generated textual plagiarism in scientific articles and aligning them with their respective sources. We created a novel large-scale dataset of automatically generated plagiarism using three large language models: Llama, DeepSeek-R1, and Mistral. In this task overview paper, we outline the creation of this dataset, summarize and compare the results of all participants and four baselines, and evaluate the results on the last plagiarism detection task from PAN 2015 in order to interpret the robustness of the proposed approaches. We found that the current iteration does not invite a large variety of approaches as naive semantic similarity approaches based on embedding vectors provide promising results of up to 0.8 recall and 0.5 precision. In contrast, most of these approaches underperform significantly on the 2015 dataset, indicating a lack of generalizability.

[52] BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods

Philipp Mondorf,Mingyang Wang,Sebastian Gerstner,Ahmad Dawar Hakimi,Yihong Liu,Leonor Veloso,Shijia Zhou,Hinrich Schütze,Barbara Plank

Main category: cs.CL

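TL;DR: This paper studies ensembling methods for localizing circuits in large language models. It proposes and compares parallel and sequential ensembling strategies and finds that a parallel ensemble over several methods, including the sequential ensemble, works best.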

Details

Motivation: To improve circuit localization in large language models, the paper asks whether ensembling several localization methods increases precision. Method: Proposes two kinds of ensembling: parallel ensembling (for example averaging, or taking the minimum or maximum of the attribution scores) and sequential ensembling (using EAP-IG results to initialize edge pruning). Result: Both approaches yield notable gains on the benchmark metrics, and a parallel ensemble that includes the sequential ensemble performs best. Conclusion: Ensembling multiple circuit localization methods improves localization precision, especially when an efficient method is combined with a more precise one, and it outperforms the baselines across multiple model-task combinations. Abstract: The Circuit Localization track of the Mechanistic Interpretability Benchmark (MIB) evaluates methods for localizing circuits within large language models (LLMs), i.e., subnetworks responsible for specific task behaviors. In this work, we investigate whether ensembling two or more circuit localization methods can improve performance. We explore two variants: parallel and sequential ensembling. In parallel ensembling, we combine attribution scores assigned to each edge by different methods, e.g., by averaging or taking the minimum or maximum value. In the sequential ensemble, we use edge attribution scores obtained via EAP-IG as a warm start for a more expensive but more precise circuit identification method, namely edge pruning. We observe that both approaches yield notable gains on the benchmark metrics, leading to a more precise circuit identification approach. Finally, we find that taking a parallel ensemble over various methods, including the sequential ensemble, achieves the best results. We evaluate our approach in the BlackboxNLP 2025 MIB Shared Task, comparing ensemble scores to official baselines across multiple model-task combinations.
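
Parallel ensembling as described in [52] combines, for each edge, the attribution scores that different methods assign to it. A minimal sketch follows, assuming each method's output is available as a mapping from an edge to a score; the `parallel_ensemble` function, the edge names, and the scores are hypothetical, and real attribution scores may need normalization before they are comparable across methods.

```python
def parallel_ensemble(score_maps, how="mean"):
    """Combine per-edge attribution scores from several methods.

    Only edges scored by every method are kept.
    """
    if not score_maps:
        raise ValueError("at least one score map is required")
    reducers = {
        "mean": lambda values: sum(values) / len(values),
        "min": min,
        "max": max,
    }
    if how not in reducers:
        raise ValueError(f"unknown reduction: {how}")
    shared_edges = set(score_maps[0])
    for scores in score_maps[1:]:
        shared_edges &= set(scores)
    reduce = reducers[how]
    return {edge: reduce([m[edge] for m in score_maps]) for edge in shared_edges}

# Hypothetical scores from two methods for edges of a small computation graph.
method_a = {("attn0", "mlp1"): 0.5, ("mlp1", "out"): 1.0}
method_b = {("attn0", "mlp1"): 0.25, ("mlp1", "out"): 0.5}

for how in ("mean", "min", "max"):
    print(how, parallel_ensemble([method_a, method_b], how))
```

Taking the minimum keeps an edge high only when every method agrees it matters, while the maximum keeps it when any method does; the mean sits between the two.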

[53] Adaptive Tool Generation with Models as Tools and Reinforcement Learning

Chenpeng Wang,Xiaojie Cheng,Chunye Wang,Linfeng Yang,Lei Zhang

Main category: cs.CL

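TL;DR: Proposes the MTR framework, which uses simulated tool responses and multi-stage training so that language models can learn tool-augmented reasoning without live APIs.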

Details

Motivation: To address the scalability and reliability problems that arise when tool-augmented language models depend on live APIs during training and deployment. Method: A multi-agent architecture (ToolMaker, AutoAgent, ToolActor) generates complete ReAct traces with simulated observations. Training has two stages: the first uses supervised fine-tuning to learn the structure of reasoning traces, and the second uses Group Relative Policy Optimization with a reward that combines correctness and consistency to optimize the strategy. Result: On four multi-hop QA benchmarks, MTR reaches Exact Match scores comparable to live-API systems and does better on reasoning-intensive tasks. Conclusion: Structured reasoning traces combined with simulated responses are sufficient to train tool-augmented language models effectively, without live API interaction. Abstract: Tool-augmented language models have demonstrated strong capabilities, but their reliance on live API access creates scalability and reliability challenges during training and deployment. We propose MTR, a simulation-first training framework for tool-augmented reasoning. Instead of relying on live APIs, MTR learns from complete ReAct traces with schema-validated, simulated observations. Our approach operates through a multi-agent architecture where a ToolMaker generates task-specific, OpenAI-compatible tool interfaces, an AutoAgent produces structured think-act-observe sequences, and a ToolActor simulates realistic responses. Training proceeds in two stages: Stage-1 Supervised Fine-Tuning (SFT) teaches 'trace grammar' from complete reasoning sequences; Stage-2 Group Relative Policy Optimization (GRPO) optimizes strategy with a composite trace reward that balances answer correctness and internal consistency. Across four multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA, Bamboogle), MTR attains Exact Match (EM) scores competitive with live-API systems and excels on reasoning-intensive tasks, suggesting that effective tool reasoning can be learned from structured traces without live interactions.

[54] Mid-Training of Large Language Models: A Survey

Kaixiang Mo,Yuxin Shi,Weiwei Weng,Zhiqiang Zhou,Shuman Liu,Haibo Zhang,Anxiang Zeng

Main category: cs.CL

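TL;DR: This survey is the first to treat the mid-training stage of large language models systematically as a unified paradigm, and it builds a taxonomy spanning data distribution, learning-rate scheduling, and long-context extension.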

Details

Motivation: Although mid-training is widely used in state-of-the-art models, there has been no prior survey of it as a unified paradigm, so a systematic review is needed to support research and practice. Method: Drawing on theory such as gradient noise scale, the information bottleneck, and curriculum learning, the survey proposes a mid-training taxonomy covering data distribution, learning-rate scheduling, and context extension, and summarizes practical experience, evaluation benchmarks, and performance gains. Result: It establishes the first taxonomy of LLM mid-training, compiles practical insights and evaluation benchmarks, and reports the gains achieved by different methods. Conclusion: Mid-training mitigates the diminishing returns caused by noisy data and improves generalization and convergence stability; its optimization paths and open challenges need further exploration. Abstract: Large language models (LLMs) are typically developed through large-scale pre-training followed by task-specific fine-tuning. Recent advances highlight the importance of an intermediate mid-training stage, where models undergo multiple annealing-style phases that refine data quality, adapt optimization schedules, and extend context length. This stage mitigates diminishing returns from noisy tokens, stabilizes convergence, and expands model capability in late training. Its effectiveness can be explained through gradient noise scale, the information bottleneck, and curriculum learning, which together promote generalization and abstraction. Despite widespread use in state-of-the-art systems, there has been no prior survey of mid-training as a unified paradigm. We introduce the first taxonomy of LLM mid-training spanning data distribution, learning-rate scheduling, and long-context extension. We distill practical insights, compile evaluation benchmarks, and report gains to enable structured comparisons across models. We also identify open challenges and propose avenues for future research and practice.

[55] GAMBIT+: A Challenge Set for Evaluating Gender Bias in Machine Translation Quality Estimation Metrics

Giorgos Filandrianos,Orfeas Menis Mastromichalakis,Wafaa Mohammed,Giuseppe Attanasio,Chrysoula Zerva

Main category: cs.CL

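TL;DR: This paper presents a large-scale challenge set for detecting gender bias in automatic quality estimation (QE) metrics when they evaluate translations containing gender-ambiguous occupational terms; it extends coverage to many language pairs and uses a parallel design to analyze the bias systematically.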

Details

Motivation: Gender bias in automatic quality estimation metrics is underexplored, and existing studies are limited by small datasets, narrow occupational coverage, and few languages, so a more comprehensive dataset is needed to analyze the problem in depth. Method: Building on the GAMBIT corpus, the authors construct a large-scale parallel dataset with 33 source-target language pairs. Each source text is paired with two target texts that differ only in the grammatical gender of the occupational term (masculine vs. feminine), with dependent grammatical elements adjusted, to test whether QE metrics assign similar scores to both versions. Result: The dataset is large, broad in coverage, and fully parallel, supporting fine-grained bias analysis by occupation and systematic comparison across languages. Conclusion: The challenge set provides an effective tool for detecting and analyzing gender bias in automatic quality estimation metrics and advances fairness evaluation. Abstract: Gender bias in machine translation (MT) systems has been extensively documented, but bias in automatic quality estimation (QE) metrics remains comparatively underexplored. Existing studies suggest that QE metrics can also exhibit gender bias, yet most analyses are limited by small datasets, narrow occupational coverage, and restricted language variety. To address this gap, we introduce a large-scale challenge set specifically designed to probe the behavior of QE metrics when evaluating translations containing gender-ambiguous occupational terms. Building on the GAMBIT corpus of English texts with gender-ambiguous occupations, we extend coverage to three source languages that are genderless or natural-gendered, and eleven target languages with grammatical gender, resulting in 33 source-target language pairs. Each source text is paired with two target versions differing only in the grammatical gender of the occupational term(s) (masculine vs. feminine), with all dependent grammatical elements adjusted accordingly. An unbiased QE metric should assign equal or near-equal scores to both versions. The dataset's scale, breadth, and fully parallel design, where the same set of texts is aligned across all languages, enable fine-grained bias analysis by occupation and systematic comparisons across languages.
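
The criterion that an unbiased QE metric should score both gendered versions equally can be checked with a small sketch. The function below is an illustrative assumption, not the paper's bias measure: it takes hypothetical (masculine, feminine) score pairs for the same source texts and reports the mean signed gap and the share of pairs whose gap exceeds a tolerance.

```python
def gender_score_gap(pairs, tolerance=0.01):
    """Summarize QE score differences between masculine and feminine targets.

    pairs: list of (score_masculine, score_feminine) for the same source text.
    tolerance: gap below which two scores count as near-equal (assumed value).
    """
    gaps = [masc - fem for masc, fem in pairs]
    mean_gap = sum(gaps) / len(gaps)  # > 0 means masculine versions score higher
    flagged = sum(abs(g) > tolerance for g in gaps) / len(gaps)
    return mean_gap, flagged


# Hypothetical QE scores for four source texts.
scores = [(0.75, 0.5), (0.5, 0.5), (1.0, 0.75), (0.25, 0.25)]
mean_gap, flagged = gender_score_gap(scores)
```

Because the challenge set is fully parallel, the same summary could be grouped by occupation or by language pair to compare gaps across them.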

[56] SID: Multi-LLM Debate Driven by Self Signals

Xuhang Chen,Zhifan Song,Deyi Ji,Shuo Gao,Lanyun Zhu

Main category: cs.CL

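TL;DR: Proposes SID, a self-signal-driven multi-LLM debate framework that uses model-level confidence and token-level semantic focus to adaptively guide the debate, improving performance and reducing computational overhead.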

Details

Motivation: Existing MAD methods rely mainly on external structures and neglect self signals that arise during generation (such as logits and attention), leading to redundant computation and degraded performance. Method: Introduces Self-Signals Driven Multi-LLM Debate (SID), which uses model-level confidence to let high-confidence agents exit early and uses the attention mechanism to compress redundant debate content. Result: Evaluated on multiple LLMs, multimodal LLMs, and benchmarks, SID outperforms existing MAD methods in accuracy while reducing token consumption. Conclusion: Exploiting self signals improves both the performance and the efficiency of multi-LLM debate systems and offers a new direction for multi-agent collaboration. Abstract: Large Language Models (LLMs) have exhibited impressive capabilities across diverse application domains. Recent work has explored Multi-LLM Agent Debate (MAD) as a way to enhance performance by enabling multiple LLMs to discuss and refine responses iteratively. Nevertheless, existing MAD methods predominantly focus on utilizing external structures, such as debate graphs, using LLM-as-a-Judge, while neglecting the application of self signals, such as token logits and attention, that arise during generation. This omission leads to redundant computation and potential performance degradation. In this paper, we shift the focus to the self signals of multi-LLM debate and introduce a Self-Signals Driven Multi-LLM Debate (SID), which leverages two types of self-signals: model-level confidence and token-level semantic focus, to adaptively guide the debate process. Our approach enables high-confidence agents to exit early at the model level and compress the redundant debate contents based on the attention mechanism. We evaluate our method on various LLMs and Multimodal LLMs across multiple challenging benchmarks. Experimental results demonstrate that our method not only outperforms existing MAD techniques in accuracy but also reduces token consumption, highlighting the effectiveness of utilizing self signals in enhancing both the performance and efficiency of multi-agent debate systems. Our code will be available at https://github.com/xuhang2019/SID.

[57] OpenJAI-v1.0: An Open Thai Large Language Model

Pontakorn Trakuekul,Attapol T. Rutherford,Jullajak Karnjanaekarin,Narongkorn Panitsrisit,Sumana Sumanakul

Main category: cs.CL

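TL;DR: OpenJAI-v1.0 is an open-source Thai-English large language model built on Qwen3-14B. It focuses on instruction following, long-context understanding, and tool use, outperforms existing open-source Thai models on multiple benchmarks without catastrophic forgetting, and is publicly released to support the Thai AI community.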

Details

Motivation: To improve LLM performance on practical Thai-language tasks and to give the Thai AI community a high-quality open-source NLP resource. Method: Starting from Qwen3-14B, the model is trained on carefully curated data targeting three key use cases: instruction following, long-context understanding, and tool use. Result: OpenJAI-v1.0 outperforms its base model and other leading open-source Thai models across benchmarks while avoiding catastrophic forgetting. Conclusion: OpenJAI-v1.0 is a high-performing, open-source Thai-English LLM that can serve as an important NLP resource for the Thai AI community. Abstract: We introduce OpenJAI-v1.0, an open-source large language model for Thai and English, developed from the Qwen3-14B model. Our work focuses on boosting performance on practical tasks through carefully curated data across three key use cases: instruction following, long-context understanding, and tool use. Evaluation results show that OpenJAI-v1.0 improves on the capabilities of its base model and outperforms other leading open-source Thai models on a diverse suite of benchmarks, while avoiding catastrophic forgetting. OpenJAI-v1.0 is publicly released as an alternative NLP resource for the Thai AI community.

[58] Unlocking Latent Discourse Translation in LLMs Through Quality-Aware Decoding

Wafaa Mohammed,Vlad Niculae,Chrysoula Zerva

Main category: cs.CL

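TL;DR: This study examines how well large language models (LLMs) handle discourse phenomena in context-aware translation and proposes quality-aware decoding (QAD) to extract the discourse knowledge encoded in LLMs. Experiments show that QAD outperforms other decoding methods and yields translations that are semantically richer and better aligned with human preferences.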

Details

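Motivation: LLMs perform strongly in machine translation but still fall short on discourse phenomena such as pronoun resolution and lexical cohesion, particularly in document-level translation, where they lack sufficient contextual understanding. Method: The authors systematically analyze LLM behavior in context-aware translation to verify whether discourse knowledge is encoded internally, and propose quality-aware decoding (QAD) to exploit that knowledge more effectively. Result: LLMs do encode discourse knowledge, and QAD extracts it markedly better than other decoding strategies, producing translations that are semantically richer and closer to human preferences. Conclusion: QAD is an effective way to strengthen LLMs' handling of discourse phenomena in document-level translation and to improve overall translation quality. Abstract: Large language models (LLMs) have emerged as strong contenders in machine translation. Yet, they still struggle to adequately handle discourse phenomena, such as pronoun resolution and lexical cohesion at the document level. In this study, we thoroughly investigate the discourse phenomena performance of LLMs in context-aware translation. We demonstrate that discourse knowledge is encoded within LLMs and propose the use of quality-aware decoding (QAD) to effectively extract this knowledge, showcasing its superiority over other decoding approaches through comprehensive analysis. Furthermore, we illustrate that QAD enhances the semantic richness of translations and aligns them more closely with human preferences.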

[59] $\lambda$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences

Yining Wang,Jinman Zhao,Chuangxin Zhao,Shuhao Guan,Gerald Penn,Shinan Liu

Main category: cs.CL

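TL;DR: This paper proposes $\lambda$-GRPO, which introduces a learnable parameter $\lambda$ to adaptively control token-level weighting in reinforcement learning. It addresses the length bias of existing methods and achieves consistent improvements over GRPO and DAPO on multiple mathematical reasoning tasks.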

Details

Motivation: Existing reinforcement learning methods with rule-based rewards (such as GRPO) suffer from a length bias: longer responses spread the reward over more tokens and therefore dominate gradient updates. Existing fixes (such as DAPO and Dr. GRPO) lack an interpretable model of token preferences. The authors therefore want the model to learn token-level preferences automatically during optimization. Method: Existing methods are unified under a single framework, and a learnable parameter $\lambda$ dynamically adjusts the weight of each token, giving adaptive weighting for tokens at different positions or in responses of different lengths. The method requires no changes to the training data and no additional computational cost. Result: On Qwen2.5 models (1.5B, 3B, 7B), $\lambda$-GRPO improves average accuracy over GRPO by +1.9%, +1.0%, and +1.7%, respectively, outperforms DAPO, and shows consistent gains on multiple mathematical reasoning benchmarks. Conclusion: By letting the model learn token-level preferences, $\lambda$-GRPO mitigates length bias and improves reasoning performance, supporting the effectiveness and practicality of learned reward allocation. Abstract: Reinforcement Learning with Human Feedback (RLHF) has been the dominant approach for improving the reasoning capabilities of Large Language Models (LLMs). Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has simplified this paradigm by replacing the reward and value models with rule-based verifiers. A prominent example is Group Relative Policy Optimization (GRPO). However, GRPO inherently suffers from a length bias, since the same advantage is uniformly assigned to all tokens of a response. As a result, longer responses distribute the reward over more tokens and thus contribute disproportionately to gradient updates. Several variants, such as DAPO and Dr. GRPO, modify the token-level aggregation of the loss, yet these methods remain heuristic and offer limited interpretability regarding their implicit token preferences. In this work, we explore the possibility of allowing the model to learn its own token preference during optimization. We unify existing frameworks under a single formulation and introduce a learnable parameter $\lambda$ that adaptively controls token-level weighting. We use $\lambda$-GRPO to denote our method, and we find that $\lambda$-GRPO achieves consistent improvements over vanilla GRPO and DAPO on multiple mathematical reasoning benchmarks. On Qwen2.5 models with 1.5B, 3B, and 7B parameters, $\lambda$-GRPO improves average accuracy by $+1.9\%$, $+1.0\%$, and $+1.7\%$ compared to GRPO, respectively. Importantly, these gains come without any modifications to the training data or additional computational cost, highlighting the effectiveness and practicality of learning token preferences.
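
The difference between loss-aggregation schemes can be made concrete by computing the weight each token receives. The sketch below uses the commonly cited forms: sample-level averaging for GRPO, token-level averaging for DAPO, and a constant normalizer for Dr. GRPO. It does not reproduce the learnable $\lambda$ weighting, whose exact form is not given in the abstract; the response lengths and the `max_len` constant are invented for illustration.

```python
def token_weights(lengths, scheme, max_len=None):
    """Per-token loss weight for each response in a group.

    lengths: token counts of the G sampled responses.
    scheme: "grpo" (mean per response, then mean over the group),
            "dapo" (mean over all tokens in the group),
            "dr_grpo" (sum over tokens divided by a constant max_len).
    """
    group_size = len(lengths)
    if scheme == "grpo":
        return [1.0 / (group_size * n) for n in lengths]
    if scheme == "dapo":
        total = sum(lengths)
        return [1.0 / total for _ in lengths]
    if scheme == "dr_grpo":
        return [1.0 / (group_size * max_len) for _ in lengths]
    raise ValueError(f"unknown scheme: {scheme}")


def response_share(lengths, weights):
    """Total weight carried by each response (per-token weight times length)."""
    return [w * n for w, n in zip(weights, lengths)]


lengths = [10, 40]  # a short and a long response (hypothetical)
grpo = response_share(lengths, token_weights(lengths, "grpo"))
dapo = response_share(lengths, token_weights(lengths, "dapo"))
```

With these toy lengths, sample-level averaging gives both responses the same total weight, so each token of the long response counts for less; token-level averaging weights every token equally, so the long response carries 80% of the total. A learned weighting would sit somewhere in this design space rather than fixing one choice by hand.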

[60] MeXtract: Light-Weight Metadata Extraction from Scientific Papers

Zaid Alyafeai,Maged S. Al-Shaibani,Bernard Ghanem

Main category: cs.CL

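TL;DR: This paper presents MeXtract, a family of lightweight language models for extracting metadata from scientific papers. Fine-tuned from Qwen 2.5, the models reach state-of-the-art performance on the MOLE benchmark; the authors also extend the benchmark to support out-of-domain evaluation and release code, data, and models.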

Details

Motivation: Extracting metadata from scientific literature accurately and efficiently is challenging, and traditional methods generalize poorly across domains and schema variations. Method: Lightweight language models with 0.5B to 3B parameters are built by fine-tuning Qwen 2.5 and evaluated on the MOLE benchmark, which is also extended with a more challenging out-of-domain subset. Result: Among models of comparable size, MeXtract achieves state-of-the-art performance on MOLE and shows good transfer and robustness on unseen schemas. Conclusion: The approach performs well on metadata extraction, adapts and generalizes well, and helps advance research on scientific literature processing. Abstract: Metadata plays a critical role in indexing, documenting, and analyzing scientific literature, yet extracting it accurately and efficiently remains a challenging task. Traditional approaches often rely on rule-based or task-specific models, which struggle to generalize across domains and schema variations. In this paper, we present MeXtract, a family of lightweight language models designed for metadata extraction from scientific papers. The models, ranging from 0.5B to 3B parameters, are built by fine-tuning Qwen 2.5 counterparts. In their size family, MeXtract achieves state-of-the-art performance on metadata extraction on the MOLE benchmark. To further support evaluation, we extend the MOLE benchmark to incorporate model-specific metadata, providing an out-of-domain challenging subset. Our experiments show that fine-tuning on a given schema not only yields high accuracy but also transfers effectively to unseen schemas, demonstrating the robustness and adaptability of our approach. We release all the code, datasets, and models openly for the research community.

[61] LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling

Zecheng Tang,Baibei Ji,Quantong Qiu,Haitian Wang,Xiaobo Liang,Juntao Li,Min Zhang

Main category: cs.CL

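TL;DR: This paper introduces Long-RewardBench, a benchmark for evaluating long-context reward models, and proposes a multi-stage training strategy that makes models more robust in long-context settings while preserving short-context performance.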

Details

Motivation: Existing reward models focus mainly on response quality and cannot assess consistency with the context in long-context settings (such as LLM agents), so they fall short of real-world needs. Method: The authors build Long-RewardBench, which includes Pairwise Comparison and Best-of-N tasks, and propose a general multi-stage training strategy to improve long-context performance. Result: The approach substantially improves long-context evaluation while preserving short-context performance; the 8B LongRM outperforms 70B baselines and matches Gemini 2.5 Pro. Conclusion: The training strategy can turn arbitrary models into robust long-context reward models, providing a reliable evaluation basis for complex applications such as LLM agents. Abstract: Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. As real-world applications increasingly involve long history trajectories, e.g., in LLM agents, it becomes indispensable to evaluate whether a model's responses are not only high-quality but also grounded in and consistent with the provided context. Yet, current RMs remain confined to short-context settings and primarily focus on response-level attributes (e.g., safety or helpfulness), while largely neglecting the critical dimension of long context-response consistency. In this work, we introduce Long-RewardBench, a benchmark specifically designed for long-context RM evaluation, featuring both Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that even state-of-the-art generative RMs exhibit significant fragility in long-context scenarios, failing to maintain context-aware preference judgments. Motivated by the analysis of failure patterns observed in model outputs, we propose a general multi-stage training strategy that effectively scales arbitrary models into robust Long-context RMs (LongRMs). Experiments show that our approach not only substantially improves performance on long-context evaluation but also preserves strong short-context capability. Notably, our 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.

[62] SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

Cheng-Han Chiang,Xiaofei Wang,Linjie Li,Chung-Ching Lin,Kevin Lin,Shujie Liu,Zhendong Wang,Zhengyuan Yang,Hung-yi Lee,Lijuan Wang

Main category: cs.CL

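TL;DR: This paper proposes SHANKS, a general inference framework that lets spoken language models reason silently while the user is still speaking, enabling low-latency, real-time spoken interaction.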

Details

Motivation: Current LLMs and spoken language models start thinking only after the user has finished speaking, which causes high response latency and prevents real-time interaction. Inspired by the way humans think while listening, the authors aim to build models that think continuously. Method: SHANKS splits the input speech stream into fixed-duration chunks. As soon as a chunk arrives, it generates unspoken chain-of-thought reasoning based on all previous speech and reasoning, and uses that reasoning to decide whether to interrupt the user or call a tool. Result: In a math problem-solving scenario, SHANKS achieves 37.1% higher interruption accuracy than a baseline that interrupts without thinking; in tool-augmented dialogue, 56.9% of tool calls are completed before the user finishes speaking. Conclusion: SHANKS moves models from thinking after the turn ends toward thinking while listening, supporting more natural, real-time spoken human-computer interaction. Abstract: Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user's turn and can lead to high response latency while it waits to think. Consequently, thinking after receiving the full input is not suitable for speech-to-speech interaction, where real-time, low-latency exchange is important. We address this by noting that humans naturally "think while listening." In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and to make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user-SLM interaction in two scenarios: (1) when the user is presenting a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in a tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends. Animated illustrations of SHANKS can be found at https://d223302.github.io/SHANKS/

[63] Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation

Vaibhav Srivastav,Steven Zheng,Eric Bezzam,Eustache Le Bihan,Nithin Koluguri,Piotr Żelasko,Somshubra Majumdar,Adel Moumen,Sanchit Gandhi

Main category: cs.CL

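TL;DR: This paper presents the Open ASR Leaderboard, a reproducible open-source ASR benchmark covering 60+ systems and 11 datasets, with multilingual and long-form tracks and unified reporting of WER and RTFx.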

Details

Motivation: Existing ASR evaluation concentrates on short-form English, rarely reports efficiency, and makes fair comparison difficult. Method: The authors build a benchmark of 60+ open-source and proprietary systems, standardize text normalization across 11 datasets, and report both word error rate (WER) and inverse real-time factor (RTFx). Result: Conformer encoders paired with LLM decoders achieve the best English WER but are slower; CTC and TDT decoders have better RTFx, which suits long-form use; Whisper-derived models fine-tuned for English gain accuracy but lose multilingual coverage. Conclusion: The benchmark supports transparent, extensible ASR evaluation, encourages balancing accuracy against efficiency, and promotes research on multilingual and long-form recognition. Abstract: Despite rapid progress, ASR evaluation remains saturated with short-form English, and efficiency is rarely reported. We present the Open ASR Leaderboard, a fully reproducible benchmark and interactive leaderboard comparing 60+ open-source and proprietary systems across 11 datasets, including dedicated multilingual and long-form tracks. We standardize text normalization and report both word error rate (WER) and inverse real-time factor (RTFx), enabling fair accuracy-efficiency comparisons. For English transcription, Conformer encoders paired with LLM decoders achieve the best average WER but are slower, while CTC and TDT decoders deliver much better RTFx, making them attractive for long-form and offline use. Whisper-derived encoders fine-tuned for English improve accuracy but often trade off multilingual coverage. All code and dataset loaders are open-sourced to support transparent, extensible evaluation.
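
The two metrics reported by the leaderboard have standard definitions that are easy to state in code. The sketch below is a minimal illustration, not the leaderboard's implementation: it assumes both transcripts are already normalized and whitespace-tokenized and that the reference is non-empty, and the function names and example values are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # At the start of iteration i, prev[j] is the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def inverse_real_time_factor(audio_seconds, processing_seconds):
    """RTFx = audio duration / processing time; higher means faster."""
    return audio_seconds / processing_seconds


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
rtfx = inverse_real_time_factor(audio_seconds=600.0, processing_seconds=4.0)
```

In this example one of six reference words is dropped, and ten minutes of audio are transcribed in four seconds. Reporting the two numbers together is what allows the accuracy-efficiency comparison the abstract describes.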

[64] EDUMATH: Generating Standards-aligned Educational Math Word Problems

Bryan R. Christ,Penelope Molitz,Jonathan Kropko,Thomas Hartvigsen

Main category: cs.CL

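TL;DR: This study uses large language models (LLMs) to generate math word problems (MWPs) that are aligned with educational standards and customized to students, and trains models on teacher-annotated data to improve generation quality and educational suitability.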

Details

Motivation: Large class sizes and heavy workloads make it hard for teachers to customize math word problems for each student, yet customized problems improve learning outcomes, so an automated solution is needed. Method: Using joint evaluation by human experts and an LLM judge, the authors assess over 11,000 LLM-generated MWPs, build the first teacher-annotated dataset for standards-aligned MWP generation, and use it to train open models and a text classifier. Result: The trained 12B model matches larger open models; a 30B model combined with the classifier outperforms existing closed baselines; the generated problems are closer to human-written ones; and a student study shows similar performance on customized problems, which students nonetheless prefer. Conclusion: LLMs can efficiently generate standards-aligned, customized math word problems, and teacher feedback substantially improves quality and usefulness, with broad potential for classroom use. Abstract: Math word problems (MWPs) are critical K-12 educational tools, and customizing them to students' interests and ability levels can increase learning outcomes. However, teachers struggle to find time to customize MWPs for each student given large class sizes and increasing burnout. We propose that LLMs can support math education by generating MWPs customized to student interests and math education standards. To this end, we use a joint human expert-LLM judge approach to evaluate over 11,000 MWPs generated by open and closed LLMs and develop the first teacher-annotated dataset for standards-aligned educational MWP generation. We show the value of our data by using it to train a 12B open model that matches the performance of larger and more capable open models. We also use our teacher-annotated data to train a text classifier that enables a 30B open LLM to outperform existing closed baselines without any training. Next, we show our models' MWPs are more similar to human-written MWPs than those from existing models. We conclude by conducting the first study of customized LLM-generated MWPs with grade school students, finding they perform similarly on our models' MWPs relative to human-written MWPs but consistently prefer our customized MWPs.

Geng Liu,Feng Li,Junjie Mu,Mengxiao Zhu,Francesco Pierri

Main category: cs.CL

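TL;DR: This study examines social identity bias in Chinese large language models and finds pervasive ingroup favoritism and outgroup negativity, which may intensify in real conversations.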

Details

Motivation: Because LLMs are widely deployed in user-facing applications, there is concern that they reflect and amplify social biases, which motivates studying social identity bias in the Chinese context. Method: Using Mandarin-specific prompts, the authors evaluate responses to "We" and "They" framings across ten representative Chinese LLMs, extend the setting to 240 social groups salient in the Chinese context, and analyze a corpus of real Chinese-language conversations between users and chatbots. Result: All models show systematic ingroup-positive and outgroup-negative tendencies, and the bias appears in natural dialogue as well as synthetic prompts, suggesting it may be reinforced in real interactions. Conclusion: The study provides a language-aware evaluation framework for Chinese LLMs and shows that social identity biases documented in English generalize across languages and intensify in user-facing contexts. Abstract: Large language models (LLMs) are increasingly deployed in user-facing applications, raising concerns about their potential to reflect and amplify social biases. We investigate social identity framing in Chinese LLMs using Mandarin-specific prompts across ten representative Chinese LLMs, evaluating responses to ingroup ("We") and outgroup ("They") framings, and extending the setting to 240 social groups salient in the Chinese context. To complement controlled experiments, we further analyze Chinese-language conversations from a corpus of real interactions between users and chatbots. Across models, we observe systematic ingroup-positive and outgroup-negative tendencies, which are not confined to synthetic prompts but also appear in naturalistic dialogue, indicating that bias dynamics might strengthen in real interactions. Our study provides a language-aware evaluation framework for Chinese LLMs, demonstrating that social identity biases documented in English generalize cross-linguistically and intensify in user-facing contexts.

[66] Towards Reliable Retrieval in RAG Systems for Large Legal Datasets

Markus Reuter,Tobias Lingenberg,Rūta Liepiņa,Francesca Lagioia,Marco Lippi,Giovanni Sartor,Andrea Passerini,Burcu Sayin

Main category: cs.CL

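TL;DR: This paper proposes Summary-Augmented Chunking (SAC), a simple and efficient method that adds a document-level summary to each text chunk to supply context. It markedly reduces Document-Level Retrieval Mismatch (DRM) in retrieval-augmented generation (RAG) for the legal domain and improves retrieval precision and recall.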

Details

Motivation: When RAG is applied in the legal domain, structurally similar documents and large databases often cause the retriever to pull information from entirely wrong documents, a serious failure termed Document-Level Retrieval Mismatch (DRM) that undermines reliability. Method: SAC augments each chunk from standard chunking with a synthetic document-level summary to preserve global context, and is evaluated on a range of legal information retrieval tasks. Result: SAC markedly reduces DRM and improves text-level retrieval precision and recall; a generic summarization strategy outperforms a customized one that incorporates legal expert knowledge. Conclusion: SAC is a practical, scalable, and easily integrated technique that improves the reliability of RAG systems on large-scale legal document datasets. Abstract: Retrieval-Augmented Generation (RAG) is a promising approach to mitigate hallucinations in Large Language Models (LLMs) for legal applications, but its reliability is critically dependent on the accuracy of the retrieval step. This is particularly challenging in the legal domain, where large databases of structurally similar documents often cause retrieval systems to fail. In this paper, we address this challenge by first identifying and quantifying a critical failure mode we term Document-Level Retrieval Mismatch (DRM), where the retriever selects information from entirely incorrect source documents. To mitigate DRM, we investigate a simple and computationally efficient technique which we refer to as Summary-Augmented Chunking (SAC). This method enhances each text chunk with a document-level synthetic summary, thereby injecting crucial global context that would otherwise be lost during a standard chunking process. Our experiments on a diverse set of legal information retrieval tasks show that SAC greatly reduces DRM and, consequently, also improves text-level retrieval precision and recall. Interestingly, we find that a generic summarization strategy outperforms an approach that incorporates legal expert domain knowledge to target specific legal elements. Our work provides evidence that this practical, scalable, and easily integrable technique enhances the reliability of RAG systems when applied to large-scale legal document datasets.
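
The core of Summary-Augmented Chunking, prefixing every chunk with a document-level summary before indexing, can be sketched in a few lines. This is a minimal illustration under stated assumptions: chunks are fixed-size word windows, and `first_sentence` is a stand-in for the synthetic summarizer, which in the paper's setting would presumably be model-generated; all function names and the sample document are hypothetical.

```python
def chunk_text(text, chunk_size):
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def summary_augmented_chunks(document, summarize, chunk_size=50):
    """Prefix every chunk with a document-level summary before indexing."""
    summary = summarize(document)
    return [f"{summary}\n\n{chunk}" for chunk in chunk_text(document, chunk_size)]


def first_sentence(document):
    """Placeholder summarizer (assumption): returns the first sentence only."""
    return document.split(".")[0].strip() + "."


doc = "This lease binds Acme and Bolt. Rent is due monthly. Either party may terminate with notice."
chunks = summary_augmented_chunks(doc, first_sentence, chunk_size=8)
```

Every chunk now names the parties to the lease even when the chunk body does not, which is the kind of global context that helps a retriever tell structurally similar documents apart.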

[67] Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages

Neel Prabhanjan Rachamalla,Aravind Konakalla,Gautam Rajeev,Ashish Kulkarni,Chandra Khatri,Shubham Agarwal

Main category: cs.CL

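TL;DR: Proposes a human-in-the-loop pipeline that combines translation with synthetic expansion to produce two high-quality, diverse post-training datasets for Indian languages, Pragyaan-IT and Pragyaan-Align, with emphasis on task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and cultural nuance.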

Details

Motivation: Existing open-source datasets have insufficient multilingual coverage, little cultural grounding, and gaps in task diversity for Indian languages, which limits the effectiveness of LLMs. Method: A human-in-the-loop pipeline combines translation with synthetic data expansion to build instruction-tuning and preference-alignment datasets for 10 Indian languages from 57 diverse datasets, with attention to task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance. Result: Two high-quality Indian-language datasets are built, Pragyaan-IT (22.5K samples) and Pragyaan-Align (100K samples), covering 13 broad categories and 56 sub-categories, significantly improving the training of multilingual LLMs on Indian languages. Conclusion: The approach provides a reliable data foundation for more inclusive and effective multilingual LLMs, in particular improving task diversity, cultural relevance, and safety for Indian languages. Abstract: The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage and cultural grounding, and suffer from task diversity gaps that are especially pronounced for Indian languages. We introduce a human-in-the-loop pipeline that combines translations with synthetic expansion to produce reliable and diverse Indic post-training data. Using this pipeline, we curate two datasets: Pragyaan-IT (22.5K) and Pragyaan-Align (100K) across 10 Indian languages covering 13 broad and 56 sub-categories, leveraging 57 diverse datasets. Our dataset protocol incorporates several often-overlooked dimensions and emphasizes task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance, providing a foundation for more inclusive and effective multilingual LLMs.

[68] Native Hybrid Attention for Efficient Sequence Modeling

Jusen Du,Jiaxi Hu,Tao Zhang,Weigao Sun,Yu Cheng

Main category: cs.CL

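TL;DR: This paper proposes Native Hybrid Attention (NHA), a hybrid attention architecture that combines the strengths of linear and full attention. It maintains long-term context while using a sliding window to strengthen short-term information, giving efficient and accurate sequence modeling.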

Details

Motivation: The quadratic complexity of self-attention makes Transformers computationally inefficient, while linear attention is efficient but loses accuracy on long-context tasks, so a hybrid that balances efficiency and accuracy is needed. Method: NHA combines key-value slots updated by a linear RNN with short-term tokens from a sliding window and processes them in a single softmax attention operation; a single hyperparameter (the sliding window size) controls behavior across layers, giving a structurally uniform intra- and inter-layer hybrid. Result: NHA outperforms Transformers and other hybrid baselines on recall-intensive and commonsense reasoning tasks; pretrained LLMs can be structurally hybridized with NHA, keeping competitive accuracy while gaining significant efficiency. Conclusion: NHA offers a flexible, efficient attention design that balances performance and compute cost without additional fusion parameters, and suits long-sequence modeling and the efficient conversion of large models. Abstract: Transformers excel at sequence modeling but face quadratic complexity, while linear attention offers improved efficiency but often compromises recall accuracy over long contexts. In this work, we introduce Native Hybrid Attention (NHA), a novel hybrid architecture of linear and full attention that integrates both intra- and inter-layer hybridization into a unified layer design. NHA maintains long-term context in key-value slots updated by a linear RNN, and augments them with short-term tokens from a sliding window. A single softmax attention operation is then applied over all keys and values, enabling per-token and per-head context-dependent weighting without requiring additional fusion parameters. The inter-layer behavior is controlled through a single hyperparameter, the sliding window size, which allows smooth adjustment between purely linear and full attention while keeping all layers structurally uniform. Experimental results show that NHA surpasses Transformers and other hybrid baselines on recall-intensive and commonsense reasoning tasks. Furthermore, pretrained LLMs can be structurally hybridized with NHA, achieving competitive accuracy while delivering significant efficiency gains. Code is available at https://github.com/JusenD/NHA.
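
The idea of applying one softmax over both long-term slots and sliding-window tokens can be shown for a single query vector. The sketch below is an assumption-laden illustration, not the released NHA code: it uses scaled dot-product scores, represents each memory as a list of (key, value) pairs, and leaves out how the linear RNN updates the slots, which the abstract does not specify.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hybrid_attention(query, slot_kv, window_kv):
    """Attend over long-term slots and sliding-window tokens in one softmax.

    slot_kv, window_kv: lists of (key, value) pairs, each a list of floats.
    """
    pairs = slot_kv + window_kv  # concatenate both memories
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key, _ in pairs]
    weights = softmax(scores)
    dim = len(pairs[0][1])
    output = [sum(w * value[d] for w, (_, value) in zip(weights, pairs)) for d in range(dim)]
    return output, weights


query = [1.0, 0.0]
slots = [([1.0, 0.0], [1.0, 0.0])]                             # one long-term slot
window = [([0.0, 1.0], [0.0, 1.0]), ([1.0, 0.0], [0.5, 0.5])]  # two recent tokens
output, weights = hybrid_attention(query, slots, window)
```

Because slots and window tokens compete inside the same softmax, the balance between long-term and short-term context is decided per query by the scores themselves, with no separate gate or fusion parameter.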

[69] Mining the Mind: What 100M Beliefs Reveal About Frontier LLM Knowledge

Shrestha Ghosh,Luca Giordano,Yujia Hu,Tuan-Phong Nguyen,Simon Razniewski

Main category: cs.CL

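TL;DR: Based on GPTKB v1.5, this paper analyzes 100 million beliefs of GPT-4.1 in depth and finds that its factual knowledge differs markedly from existing knowledge bases, that its accuracy is lower than previous benchmarks indicate, and that inconsistency, ambiguity, and hallucination are major issues.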

Details

Motivation: The factual knowledge in LLMs is poorly understood and is usually analyzed from biased samples, so its actual state needs systematic exploration. Method: The authors comprehensively analyze GPTKB v1.5, a recursively elicited dataset of about 100 million factual beliefs of GPT-4.1. Result: The LLM's factual knowledge differs substantially from established knowledge bases; its accuracy is lower than previously reported; and inconsistency, ambiguity, and hallucination are serious problems. Conclusion: Factual knowledge in LLMs is far more complex and less reliable than expected; its knowledge representation needs to be reassessed, and the findings point to research directions for improving factual consistency and trustworthiness. Abstract: LLMs are remarkable artifacts that have revolutionized a range of NLP and AI tasks. A significant contributor is their factual knowledge, which, to date, remains poorly understood, and is usually analyzed from biased samples. In this paper, we take a deep tour into the factual knowledge (or beliefs) of a frontier LLM, based on GPTKB v1.5 (Hu et al., 2025a), a recursively elicited set of 100 million beliefs of one of the strongest currently available frontier LLMs, GPT-4.1. We find that the model's factual knowledge differs quite significantly from established knowledge bases, and that its accuracy is significantly lower than indicated by previous benchmarks. We also find that inconsistency, ambiguity and hallucinations are major issues, shedding light on future research opportunities concerning factual LLM knowledge.

[70] Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models

Rajvee Sheth,Samridhi Raj Sinha,Mahavir Patil,Himanshu Beniwal,Mayank Singh

Main category: cs.CL

TL;DR: This paper is the first comprehensive analysis of LLM research on code-switching (CSW), covering five research areas, 12 NLP tasks, 30+ datasets, and 80+ languages.

Details

Motivation: Despite rapid progress, LLMs still struggle with mixed-language input; in code-switching settings in particular, limited datasets and evaluation biases hinder effective deployment in multilingual societies. Method: Systematically reviews a total of unique_references studies, classifying them by model architecture, training strategy, and evaluation methodology, and summarizes how LLMs have reshaped code-switching modeling. Result: Maps current progress in CSW-aware LLM research and the challenges that persist, including data scarcity, unfair evaluation, and weak linguistic grounding. Conclusion: Proposes a roadmap toward truly multilingual intelligence that stresses inclusive datasets, fair evaluation, and linguistically grounded models, and maintains an open-source list of all resources. Abstract: Code-switching (CSW), the alternation of languages and scripts within a single utterance, remains a fundamental challenge for multilingual NLP, even amidst the rapid advances of large language models (LLMs). Most LLMs still struggle with mixed-language inputs, limited CSW datasets, and evaluation biases, hindering deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing \total{unique_references} studies spanning five research areas, 12 NLP tasks, 30+ datasets, and 80+ languages. We classify recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and what challenges persist. The paper concludes with a roadmap emphasizing the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual intelligence. A curated collection of all resources is maintained at https://github.com/lingo-iitgn/awesome-code-mixing/.

[71] Search-R3: Unifying Reasoning and Embedding Generation in Large Language Models

Yuntao Gui,James Cheng

Main category: cs.CL

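TL;DR: Search-R3 is a framework in which a large language model (LLM) generates search embeddings directly as an output of its reasoning process, improving performance on retrieval tasks.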

Details

Motivation: Although LLMs excel at natural language understanding, they have been underused for retrieval, so a method that combines reasoning with embedding generation is needed. Method: Using the LLM's chain-of-thought capability, three mechanisms (supervised learning, reinforcement learning, and a specialized RL environment) let the model generate search embeddings directly during reasoning. Result: Search-R3 significantly outperforms prior methods on multiple benchmarks by unifying reasoning and embedding generation. Conclusion: By integrating reasoning with embedding generation, Search-R3 substantially improves the handling of complex knowledge-intensive tasks and is an important advance for information retrieval. Abstract: Despite their remarkable natural language understanding capabilities, Large Language Models (LLMs) have been underutilized for retrieval tasks. We present Search-R3, a novel framework that addresses this limitation by adapting LLMs to generate search embeddings as a direct output of their reasoning process. Our approach exploits LLMs' chain-of-thought capabilities, allowing them to produce more effective embeddings by reasoning step-by-step through complex semantic analyses. We implement this through three complementary mechanisms: (1) a supervised learning stage that enables the model to produce quality embeddings, (2) a reinforcement learning (RL) methodology that optimizes embedding generation alongside reasoning, and (3) a specialized RL environment that efficiently handles evolving embedding representations without requiring complete corpus re-encoding at each training iteration. Our extensive evaluations on diverse benchmarks demonstrate that Search-R3 significantly outperforms prior methods by unifying the reasoning and embedding generation processes. This integrated post-training approach represents a substantial advancement in handling complex knowledge-intensive tasks that require both sophisticated reasoning and effective information retrieval. Project page: https://github.com/ytgui/Search-R3

[72] Does Local News Stay Local?: Online Content Shifts in Sinclair-Acquired Stations

Miriam Wanner,Sophia Hager,Anjalie Field

Main category: cs.CL

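TL;DR: The study finds that after local news stations are acquired by the Sinclair Broadcast Group, they report more national news and polarizing topics and pay less attention to local issues.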

Details

Motivation: To examine how coverage changes after local news stations are acquired by a large media conglomerate, and what the implications are. Method: Uses computational methods to analyze online content published by local news stations before and after acquisition by Sinclair, in comparison with national news outlets. Result: After acquisition, local stations report more on national news and polarizing topics and less on local issues. Conclusion: Sinclair acquisitions push local news content toward national and politicized coverage, which may affect public attention to local affairs. Abstract: Local news stations are often considered to be reliable sources of non-politicized information, particularly local concerns that residents care about. Because these stations are trusted news sources, viewers are particularly susceptible to the information they report. The Sinclair Broadcast Group is a broadcasting company that has acquired many local news stations in the last decade. We investigate the effects of local news stations being acquired by Sinclair: how does coverage change? We use computational methods to investigate changes in internet content put out by local news stations before and after being acquired by Sinclair and in comparison to national news outlets. We find that there is clear evidence that local news stations report more frequently on national news at the expense of local topics, and that their coverage of polarizing national topics increases.

[73] Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages

Amir Hossein Yari,Kalmit Kulkarni,Ahmad Raza Khan,Fajri Koto

Main category: cs.CL

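TL;DR: This paper introduces ITEM, a large-scale benchmark that evaluates how well 26 automatic metrics agree with human judgments across six major Indian languages; it finds that LLM-based evaluators perform best and that metric behavior varies across tasks and languages.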

Details

Motivation: Existing evaluation metrics for machine translation and text summarization focus mainly on English and other high-resource languages and overlook Indian languages, so the universality of current evaluation practice needs to be tested and reliable evaluation standards established for low-resource languages. Method: Builds ITEM, a large-scale benchmark covering six Indian languages, and systematically evaluates how well 26 automatic metrics align with human judgments, analyzing agreement, sensitivity to outliers, language-specific reliability, inter-metric correlations, and robustness to controlled perturbations. Result: (1) LLM-based evaluation metrics agree most strongly with human judgments; (2) outliers significantly affect metric-human agreement; (3) in text summarization, metrics are better at capturing content fidelity, whereas in machine translation they better reflect fluency; (4) metrics differ in robustness and sensitivity under diverse perturbations. Conclusion: The findings provide key guidance for designing evaluation metrics for Indian languages and stress the need to account for language characteristics, task type, and data quality in automatic evaluation. Abstract: While automatic metrics drive progress in Machine Translation (MT) and Text Summarization (TS), existing metrics have been developed and validated almost exclusively for English and other high-resource languages. This narrow focus leaves Indian languages, spoken by over 1.5 billion people, largely overlooked, casting doubt on the universality of current evaluation practices. To address this gap, we introduce ITEM, a large-scale benchmark that systematically evaluates the alignment of 26 automatic metrics with human judgments across six major Indian languages, enriched with fine-grained annotations. Our extensive evaluation, covering agreement with human judgments, sensitivity to outliers, language-specific reliability, inter-metric correlations, and resilience to controlled perturbations, reveals four central findings: (1) LLM-based evaluators show the strongest alignment with human judgments at both segment and system levels; (2) outliers exert a significant impact on metric-human agreement; (3) in TS, metrics are more effective at capturing content fidelity, whereas in MT, they better reflect fluency; and (4) metrics differ in their robustness and sensitivity when subjected to diverse perturbations. Collectively, these findings offer critical guidance for advancing metric design and evaluation in Indian languages.

[74] LuxInstruct: A Cross-Lingual Instruction Tuning Dataset For Luxembourgish

Fred Philippy,Laura Bernardy,Siwen Guo,Jacques Klein,Tegawendé F. Bissyandé

Main category: cs.CL

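TL;DR: Proposes a cross-lingual instruction tuning approach that avoids machine translation to improve language model performance for low-resource languages such as Luxembourgish.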

Details

Motivation: Low-resource languages face severe limitations in instruction tuning because high-quality instruction datasets are lacking, and reliance on machine translation often introduces semantic misalignment and cultural inaccuracies. Method: Leverages aligned data from English, French, and German to build a high-quality cross-lingual instruction tuning dataset for Luxembourgish, without machine-generated translations into Luxembourgish. Result: Experiments show that cross-lingual instruction tuning improves representational alignment across languages and strengthens the model's generative capabilities in Luxembourgish. Conclusion: Cross-lingual data curation avoids the problems introduced by machine translation and effectively supports language model development for low-resource languages. Abstract: Instruction tuning has become a key technique for enhancing the performance of large language models, enabling them to better follow human prompts. However, low-resource languages such as Luxembourgish face severe limitations due to the lack of high-quality instruction datasets. Traditional reliance on machine translation often introduces semantic misalignment and cultural inaccuracies. In this work, we address these challenges by creating a cross-lingual instruction tuning dataset for Luxembourgish, without resorting to machine-generated translations into it. Instead, by leveraging aligned data from English, French, and German, we build a high-quality dataset that preserves linguistic and cultural nuances. We provide evidence that cross-lingual instruction tuning not only improves representational alignment across languages but also the model's generative capabilities in Luxembourgish. This highlights how cross-lingual data curation can avoid the common pitfalls of machine-translated data and directly benefit low-resource language development.

[75] Accelerating Diffusion LLM Inference via Local Determinism Propagation

Fanheng Kong,Jingyuan Zhang,Yahui Liu,Zirui Wu,Yu Tian,Victoria W.,Guorui Zhou

Main category: cs.CL

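TL;DR: This paper proposes LocalLeap, a training-free adaptive parallel decoding strategy that addresses delayed decoding in diffusion large language models (dLLMs), substantially improving inference efficiency while preserving generation quality.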

Details

Motivation: Existing open-source dLLM implementations trade quality against speed; conservative sampling strategies cause repeated, redundant refinement iterations (delayed decoding), which hurts practical deployment efficiency. Method: Based on a systematic analysis of dLLM decoding dynamics, proposes LocalLeap, which relies on two empirical principles, local determinism propagation and progressive spatial consistency decay, to identify high-confidence anchors and perform localized relaxed parallel decoding within their neighborhoods, committing already-determined tokens early. Result: Across multiple benchmarks, LocalLeap achieves a 6.94× throughput improvement and reduces decoding steps to 14.2% of the original, with negligible impact on generation quality. Conclusion: LocalLeap effectively mitigates delayed decoding in dLLMs and markedly improves inference efficiency, offering an efficient decoding solution for practical dLLM use. Abstract: Diffusion large language models (dLLMs) represent a significant advancement in text generation, offering parallel token decoding capabilities. However, existing open-source implementations suffer from quality-speed trade-offs that impede their practical deployment. Conservative sampling strategies typically decode only the most confident token per step to ensure quality (i.e., greedy decoding), at the cost of inference efficiency due to repeated redundant refinement iterations--a phenomenon we term delayed decoding. Through systematic analysis of dLLM decoding dynamics, we characterize this delayed decoding behavior and propose a training-free adaptive parallel decoding strategy, named LocalLeap, to address these inefficiencies. LocalLeap is built on two fundamental empirical principles: local determinism propagation centered on high-confidence anchors and progressive spatial consistency decay. By applying these principles, LocalLeap identifies anchors and performs localized relaxed parallel decoding within bounded neighborhoods, achieving substantial inference step reduction through early commitment of already-determined tokens without compromising output quality. Comprehensive evaluation on various benchmarks demonstrates that LocalLeap achieves 6.94× throughput improvements and reduces decoding steps to just 14.2% of the original requirement, achieving these gains with negligible performance impact. The source codes are available at: https://github.com/friedrichor/LocalLeap.
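
The idea of committing tokens around high-confidence anchors can be illustrated with a small sketch. This is not the authors' algorithm (see their repository for that): the function name, both thresholds, and the fixed radius are hypothetical, and the fixed radius is only a crude stand-in for the "progressive spatial consistency decay" the abstract mentions. It shows one decoding step choosing which masked positions to commit.

```python
def select_positions_to_commit(confidences, is_masked,
                               anchor_threshold=0.9,   # hypothetical value
                               relaxed_threshold=0.6,  # hypothetical value
                               radius=2):              # hypothetical value
    """Pick masked positions to commit in one step of an anchor-based scheme."""
    n = len(confidences)
    masked = [i for i in range(n) if is_masked[i]]
    if not masked:
        return []

    # Anchors: masked positions the model is already very sure about.
    anchors = [i for i in masked if confidences[i] >= anchor_threshold]
    if not anchors:
        # Fall back to the conservative baseline: one token per step.
        return [max(masked, key=lambda i: confidences[i])]

    commit = set(anchors)
    for a in anchors:
        # Inside the bounded neighborhood of an anchor, accept a lower bar.
        for j in range(max(0, a - radius), min(n, a + radius + 1)):
            if is_masked[j] and confidences[j] >= relaxed_threshold:
                commit.add(j)
    return sorted(commit)
```

With the one-token-per-step baseline, every step commits a single position; here a step can commit an anchor together with sufficiently confident neighbors, which is where the reduction in decoding steps would come from under these assumptions.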

[76] All Claims Are Equal, but Some Claims Are More Equal Than Others: Importance-Sensitive Factuality Evaluation of LLM Generations

Miriam Wanner,Leif Azzopardi,Paul Thomas,Soham Dan,Benjamin Van Durme,Nick Craswell

Main category: cs.CL

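TL;DR: This paper introduces the VITALERRORS benchmark and the VITAL metrics to address the insensitivity of existing LLM factuality evaluation methods to errors in key information.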

Details

Motivation: Existing evaluation methods treat all claims as equally important, which yields misleading evaluations when key information is missing or wrong. Method: Builds VITALERRORS, a benchmark of 6,733 queries with minimally altered LLM responses designed to omit or falsify key information, and proposes the VITAL metrics, which incorporate the relevance and importance of claims. Result: Experiments show that existing metrics are insensitive to key-information errors, whereas VITAL metrics detect such errors more reliably. Conclusion: By accounting for claim relevance and importance, VITAL metrics markedly improve detection of key factual errors in LLM output and provide a foundation for more accurate factuality evaluation. Abstract: Existing methods for evaluating the factuality of large language model (LLM) responses treat all claims as equally important. This results in misleading evaluations when vital information is missing or incorrect as it receives the same weight as peripheral details, raising the question: how can we reliably detect such differences when there are errors in key information? Current approaches that measure factuality tend to be insensitive to omitted or false key information. To investigate this lack of sensitivity, we construct VITALERRORS, a benchmark of 6,733 queries with minimally altered LLM responses designed to omit or falsify key information. Using this dataset, we demonstrate the insensitivities of existing evaluation metrics to key information errors. To address this gap, we introduce VITAL, a set of metrics that provide greater sensitivity in measuring the factuality of responses by incorporating the relevance and importance of claims with respect to the query. Our analysis demonstrates that VITAL metrics more reliably detect errors in key information than previous methods. Our dataset, metrics, and analysis provide a foundation for more accurate and robust assessment of LLM factuality.
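
To see why equal weighting hides key-information errors, the sketch below contrasts a uniform claim-level score with an importance-weighted one. It is only an illustration of the general idea: the abstract does not give the VITAL formulas, so the weighting scheme, function names, and example weights here are assumptions.

```python
def uniform_factuality(claims):
    """Fraction of supported claims; every claim counts the same."""
    return sum(1 for supported, _ in claims if supported) / len(claims)


def weighted_factuality(claims):
    """Supported importance mass divided by total importance mass."""
    total = sum(weight for _, weight in claims)
    return sum(weight for supported, weight in claims if supported) / total


# Each claim is (is_supported, importance); the weights are made-up values.
response = [
    (False, 3.0),  # the key claim is wrong
    (True, 1.0),   # peripheral detail
    (True, 1.0),   # peripheral detail
    (True, 1.0),   # peripheral detail
]

print(uniform_factuality(response))   # 3 of 4 claims supported
print(weighted_factuality(response))  # 3.0 of 6.0 importance mass supported
```

In this toy case the uniform score stays high because three peripheral claims are correct, while the weighted score drops because the single wrong claim carries half of the importance mass.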

[77] Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis

Zhu Li,Yuqing Zhang,Xiyuan Gao,Shekhar Nayak,Matt Coler

Main category: cs.CL

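TL;DR: This paper proposes an LLM-enhanced, retrieval-augmented framework for sarcasm-aware speech synthesis that combines semantic embeddings with prosodic exemplars in a VITS model to produce more natural and contextually appropriate sarcastic speech.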

Details

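Motivation: Sarcasm is non-literal language that depends on semantic, contextual, and prosodic cues, which makes it challenging for speech synthesis; existing research concentrates on basic emotion categories and has barely explored sarcastic expression. Method: Uses a LoRA-fine-tuned LLaMA 3 to extract the pragmatic incongruity and discourse-level semantic features of sarcasm, retrieves prosodic exemplars of sarcastic delivery through a Retrieval-Augmented Generation (RAG) module, and feeds both as conditioning inputs to a VITS speech synthesis model. Result: Experiments show the method outperforms baselines on both objective measures and subjective evaluations, improving speech naturalness, sarcastic expressivity, and downstream sarcasm detection. Conclusion: The proposed dual-conditioning framework effectively integrates semantic and prosodic information, offers a new approach to sarcastic speech synthesis, and demonstrates the potential of LLMs and retrieval augmentation for expressive speech synthesis. Abstract: Sarcasm is a subtle form of non-literal language that poses significant challenges for speech synthesis due to its reliance on nuanced semantic, contextual, and prosodic cues. While existing speech synthesis research has focused primarily on broad emotional categories, sarcasm remains largely unexplored. In this paper, we propose a Large Language Model (LLM)-enhanced Retrieval-Augmented framework for sarcasm-aware speech synthesis. Our approach combines (1) semantic embeddings from a LoRA-fine-tuned LLaMA 3, which capture pragmatic incongruity and discourse-level cues of sarcasm, and (2) prosodic exemplars retrieved via a Retrieval Augmented Generation (RAG) module, which provide expressive reference patterns of sarcastic delivery. Integrated within a VITS backbone, this dual conditioning enables more natural and contextually appropriate sarcastic speech. Experiments demonstrate that our method outperforms baselines in both objective measures and subjective evaluations, yielding improvements in speech naturalness, sarcastic expressivity, and downstream sarcasm detection.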

[78] TALENT: Table VQA via Augmented Language-Enhanced Natural-text Transcription

Guo Yutong,Wanying Wang,Yue Wu,Zichen Miao,Haoyu Wang

Main category: cs.CL

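TL;DR: Proposes TALENT, a framework that combines OCR text with natural language narration and pairs a small vision-language model with a large language model for table visual question answering, matching or surpassing large monolithic models at lower computational cost.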

Details

Motivation: Large vision-language models are computationally expensive for table visual question answering and tend to miss fine-grained details, while the structured representations used by lightweight methods are poorly suited to LLMs and introduce errors. Method: Designs TALENT, in which a small VLM produces OCR text and a natural language narration that are combined with the question for reasoning by an LLM, and builds ReTabVQA, a dataset that requires multi-step quantitative reasoning. Result: Experiments show that TALENT matches or surpasses large VLMs at lower computational cost on both public datasets and ReTabVQA. Conclusion: Through dual representations and a division of labor between models, TALENT achieves efficient and accurate table visual question answering and suggests a new direction for lightweight multimodal systems. Abstract: Table Visual Question Answering (Table VQA) is typically addressed by large vision-language models (VLMs). While such models can answer directly from images, they often miss fine-grained details unless scaled to very large sizes, which are computationally prohibitive, especially for mobile deployment. A lighter alternative is to have a small VLM perform OCR and then use a large language model (LLM) to reason over structured outputs such as Markdown tables. However, these representations are not naturally optimized for LLMs and still introduce substantial errors. We propose TALENT (Table VQA via Augmented Language-Enhanced Natural-text Transcription), a lightweight framework that leverages dual representations of tables. TALENT prompts a small VLM to produce both OCR text and natural language narration, then combines them with the question for reasoning by an LLM. This reframes Table VQA as an LLM-centric multimodal reasoning task, where the VLM serves as a perception-narration module rather than a monolithic solver. Additionally, we construct ReTabVQA, a more challenging Table VQA dataset requiring multi-step quantitative reasoning over table images. Experiments show that TALENT enables a small VLM-LLM combination to match or surpass a single large VLM at significantly lower computational cost on both public datasets and ReTabVQA.

[79] Opt-ICL at LeWiDi-2025: Maximizing In-Context Signal from Rater Examples via Meta-Learning

Taylor Sorensen,Yejin Choi

Main category: cs.CL

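TL;DR: This paper presents a system that models human variation using LLMs' in-context learning abilities and a two-step meta-learning training procedure; it placed first on both tasks of the LeWiDi competition, and ablations show that each component (such as rater examples and dataset-specific fine-tuning) materially affects performance.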

Details

Motivation: Many NLP tasks involve subjectivity, ambiguity, or annotator disagreement that traditional methods struggle to model, so a system that captures and exploits this variation is needed. Method: Uses LLMs' in-context learning abilities with a two-step meta-learning procedure: (1) post-training on many datasets that require in-context learning; (2) specializing the model to the data distribution of interest via in-context meta-learning. Result: The system was the overall winner of the LeWiDi competition, with the best performance on both tasks; ablations show that including rater examples is crucial, dataset-specific fine-tuning helps on the larger datasets, post-training helps on one of the competition datasets, and performance improves with model scale. Conclusion: The system models human variation effectively and performs strongly on tasks with annotator disagreement, supporting the combination of in-context learning and meta-learning for capturing human variation. Abstract: Many natural language processing (NLP) tasks involve subjectivity, ambiguity, or legitimate disagreement between annotators. In this paper, we outline our system for modeling human variation. Our system leverages language models' (LLMs) in-context learning abilities, along with a two-step meta-learning training procedure for 1) post-training on many datasets requiring in-context learning and 2) specializing the model via in-context meta-learning to the particular data distribution of interest. We also evaluate the performance of our system submission to the Learning With Disagreements (LeWiDi) competition, where it was the overall winner on both tasks. Additionally, we perform an ablation study to measure the importance of each system component. We find that including rater examples in-context is crucial for our system's performance, dataset-specific fine-tuning is helpful on the larger datasets, post-training on other in-context datasets is helpful on one of the competition datasets, and that performance improves with model scale.

[80] TRIM: Token-wise Attention-Derived Saliency for Data-Efficient Instruction Tuning

Manish Nagaraj,Sakshi Choudhary,Utkarsh Saxena,Deepak Ravikumar,Kaushik Roy

Main category: cs.CL

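TL;DR: TRIM is an efficient, attention-based method for building instruction tuning datasets that selects high-quality coresets by matching task-relevant representational patterns, outperforming existing methods at lower computational cost.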

Details

Motivation: Existing data selection methods for instruction tuning rely on coarse signals such as gradients, which are computationally expensive and overlook fine-grained features, making it hard to build small, high-quality datasets efficiently. Method: Proposes TRIM, a forward-only, token-centric framework that uses multi-layer attention to produce "fingerprints" capturing task-relevant representational patterns and selects coresets accordingly, with no backward pass. Result: TRIM-selected coresets outperform state-of-the-art baselines by up to 9% on downstream tasks and in some settings even surpass full-data fine-tuning, at substantially lower computational cost. Conclusion: TRIM is a scalable and efficient approach to building instruction tuning datasets that maintains high performance while greatly reducing data and compute requirements. Abstract: Instruction tuning is essential for aligning large language models (LLMs) to downstream tasks and commonly relies on large, diverse corpora. However, small, high-quality subsets, known as coresets, can deliver comparable or superior results, though curating them remains challenging. Existing methods often rely on coarse, sample-level signals like gradients, an approach that is computationally expensive and overlooks fine-grained features. To address this, we introduce TRIM (Token Relevance via Interpretable Multi-layer Attention), a forward-only, token-centric framework. Instead of using gradients, TRIM operates by matching underlying representational patterns identified via attention-based "fingerprints" from a handful of target samples. Such an approach makes TRIM highly efficient and uniquely sensitive to the structural features that define a task. Coresets selected by our method consistently outperform state-of-the-art baselines by up to 9% on downstream tasks and even surpass the performance of full-data fine-tuning in some settings. By avoiding expensive backward passes, TRIM achieves this at a fraction of the computational cost. These findings establish TRIM as a scalable and efficient alternative for building high-quality instruction-tuning datasets.

[81] Comparing human and language models sentence processing difficulties on complex structures

Samuel Joseph Amouyal,Aya Meltzer-Asscher,Jonathan Berant

Main category: cs.CL

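TL;DR: This study systematically compares sentence comprehension in humans and large language models (LLMs) across seven complex linguistic structures, finding that LLMs perform especially poorly on garden path sentences and that model size affects how closely model performance correlates with human performance.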

Details

Motivation: To investigate whether LLMs exhibit human-like processing difficulties, particularly on complex or easily misparsed sentence structures. Method: Within a unified experimental framework, collects comprehension data from humans and five families of state-of-the-art LLMs on seven challenging linguistic structures, together with matched baseline sentences that lack the difficult structure. Result: LLMs struggle on the target structures overall and especially on garden path sentences (46.8% accuracy for GPT-5), while reaching near-perfect accuracy on non-garden-path structures (93.7%). The rank correlation between human and model performance increases with parameter count; the target-versus-baseline performance gap follows the human pattern for mid-strength models, whereas models that are too weak or too strong perform uniformly low or uniformly high, respectively. Conclusion: The difficulty patterns of LLMs both converge with and diverge from those of humans, which delineates the limits of their similarity to human cognition and offers a new perspective on how LLMs process language. Abstract: Large language models (LLMs) that fluently converse with humans are a reality - but do LLMs experience human-like processing difficulties? We systematically compare human and LLM sentence comprehension across seven challenging linguistic structures. We collect sentence comprehension data from humans and five families of state-of-the-art LLMs, varying in size and training procedure in a unified experimental framework. Our results show LLMs overall struggle on the target structures, but especially on garden path (GP) sentences. Indeed, while the strongest models achieve near perfect accuracy on non-GP structures (93.7% for GPT-5), they struggle on GP structures (46.8% for GPT-5). Additionally, when ranking structures based on average performance, rank correlation between humans and models increases with parameter count. For each target structure, we also collect data for their matched baseline without the difficult structure. Comparing performance on the target vs. baseline sentences, the performance gap observed in humans holds for LLMs, with two exceptions: for models that are too weak performance is uniformly low across both sentence types, and for models that are too strong the performance is uniformly high. Together, these reveal convergence and divergence in human and LLM sentence comprehension, offering new insights into the similarity of humans and LLMs.
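
The comparison of structure rankings between humans and models can be illustrated with a rank correlation. The abstract says only "rank correlation", so the choice of Spearman's coefficient is an assumption, and the accuracy values in the usage example are invented for illustration rather than taken from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors.

    Both inputs must contain at least two distinct values.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-structure accuracies (not from the paper).
human_accuracy = [0.90, 0.80, 0.60, 0.40]
model_accuracy = [0.95, 0.70, 0.75, 0.30]
print(spearman(human_accuracy, model_accuracy))
```

A value near 1 would mean the model finds the same structures hard that humans do; computing it per model and plotting against parameter count is one way to examine the trend the abstract describes.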

[82] Reasoning for Hierarchical Text Classification: The Case of Patents

Lekang Jiang,Wenjun Sun,Stephan Goetz

Main category: cs.CL

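TL;DR: Proposes RHC, a new hierarchical text classification framework that recasts classification as step-by-step reasoning and trains large language models in two stages, cold start and reinforcement learning, improving effectiveness, explainability, scalability, and applicability.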

Details

Motivation: Existing hierarchical text classification methods output only a flat label set and cannot explain their predictions, which limits them in complex settings such as patent classification. Method: Reformulates hierarchical classification as a step-by-step reasoning task with two-stage training: a cold-start stage that aligns outputs with the chain-of-thought (CoT) format, followed by a reinforcement learning stage that strengthens multi-step reasoning. Result: RHC surpasses previous baselines and improves accuracy and macro F1 by about 3% over its supervised fine-tuning counterparts, produces natural-language justifications, scales favorably with model size, and reaches state-of-the-art results on other widely used HTC benchmarks. Conclusion: RHC performs well on complex HTC tasks such as patent classification and is broadly applicable, offering a more effective and explainable approach to hierarchical classification. Abstract: Hierarchical text classification (HTC) assigns documents to multiple levels of a pre-defined taxonomy. Automated patent subject classification represents one of the hardest HTC scenarios because of domain knowledge difficulty and a huge number of labels. Prior approaches only output a flat label set, which offers little insight into the reason behind predictions. Therefore, we propose Reasoning for Hierarchical Classification (RHC), a novel framework that reformulates HTC as a step-by-step reasoning task to sequentially deduce hierarchical labels. RHC trains large language models (LLMs) in two stages: a cold-start stage that aligns outputs with chain-of-thought (CoT) reasoning format and a reinforcement learning (RL) stage to enhance multi-step reasoning ability. RHC demonstrates four advantages in our experiments. (1) Effectiveness: RHC surpasses previous baselines and outperforms the supervised fine-tuning counterparts by approximately 3% in accuracy and macro F1. (2) Explainability: RHC produces natural-language justifications before prediction to facilitate human inspection. (3) Scalability: RHC scales favorably with model size with larger gains compared to standard fine-tuning. (4) Applicability: Beyond patents, we further demonstrate that RHC achieves state-of-the-art performance on other widely used HTC benchmarks, which highlights its broad applicability.

[83] More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning

Yike Zhao,Simin Guo,Ziqing Yang,Shifan Han,Dahua Lin,Fei Tan

Main category: cs.CL

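TL;DR: This study comprehensively analyzes open-source datasets and data synthesis techniques for mathematical reasoning, argues that data quality matters more than quantity, and identifies effective data selection strategies suited to industrial settings.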

Details

Motivation: Although many data construction methods have been proposed, their usefulness in real-world pipelines remains underexplored; this work evaluates how different data construction methods perform under realistic training and deployment conditions. Method: Systematically evaluates open-source datasets and data synthesis techniques under a unified pipeline that mirrors training and deployment, distills effective data selection strategies, and identifies practical methods for industrial use. Result: Structuring data in more interpretable formats, or distilling from stronger models, is often more effective than simply increasing data volume; structured, high-quality data markedly improves LLM reasoning. Conclusion: For real-world reasoning tasks, "better data" matters more than "more data"; the study offers actionable guidance for improving LLM capabilities, supporting cost-effective data curation and scalable model enhancement. Abstract: The reasoning capabilities of Large Language Models (LLMs) play a critical role in many downstream tasks, yet depend strongly on the quality of training data. Despite various proposed data construction methods, their practical utility in real-world pipelines remains underexplored. In this work, we conduct a comprehensive analysis of open-source datasets and data synthesis techniques for mathematical reasoning, evaluating them under a unified pipeline designed to mirror training and deployment scenarios. We further distill effective data selection strategies and identify practical methods suitable for industrial applications. Our findings highlight that structuring data in more interpretable formats, or distilling from stronger models often outweighs simply scaling up data volume. This study provides actionable guidance for integrating training data to enhance LLM capabilities, supporting both cost-effective data curation and scalable model enhancement. We hope this work will inspire further research on how to balance "more data" versus "better data" for real-world reasoning tasks.

[84] NurseLLM: The First Specialized Language Model for Nursing

Md Tawkat Islam Khondaker,Julia Harrington,Shady Shehata

Main category: cs.CL

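TL;DR: This paper presents NurseLLM, the first specialized large language model for nursing multiple-choice question answering, along with a large-scale nursing MCQ dataset and several benchmarks; experiments show it outperforms state-of-the-art general-purpose and medical-specialized models of comparable size.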

Details

Motivation: Although large language models have made notable progress in medical systems, their use in specialized domains such as nursing remains underexplored, which motivates a nursing-specific model. Method: Proposes a multi-stage data generation pipeline to build a large-scale nursing MCQ dataset, trains NurseLLM on it, and designs several nursing benchmarks for systematic evaluation. Result: NurseLLM outperforms general-purpose and medical-specialized LLMs of comparable size on multiple benchmarks, confirming the value of domain specialization; the paper also explores the potential of reasoning and multi-agent collaboration systems in nursing. Conclusion: Specialized LLMs offer clear advantages in nursing, and future work can combine reasoning and multi-agent collaboration to increase clinical value. Abstract: Recent advancements in large language models (LLMs) have significantly transformed medical systems. However, their potential within specialized domains such as nursing remains largely underexplored. In this work, we introduce NurseLLM, the first nursing-specialized LLM tailored for multiple choice question-answering (MCQ) tasks. We develop a multi-stage data generation pipeline to build the first large scale nursing MCQ dataset to train LLMs on a broad spectrum of nursing topics. We further introduce multiple nursing benchmarks to enable rigorous evaluation. Our extensive experiments demonstrate that NurseLLM outperforms SoTA general-purpose and medical-specialized LLMs of comparable size on different benchmarks, underscoring the importance of a specialized LLM for the nursing domain. Finally, we explore the role of reasoning and multi-agent collaboration systems in nursing, highlighting their promise for future research and applications.

[85] Quantifying Data Contamination in Psychometric Evaluations of LLMs

Jongwook Han,Woojung Song,Jonggeun Lee,Yohan Jo

Main category: cs.CL

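TL;DR: This paper proposes a systematic framework for measuring data contamination in psychometric evaluations of large language models and finds that widely used inventories such as BFI-44 and PVQ-40 show strong item memorization and target score matching.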

Details

Motivation: Prior work has raised concerns that contamination from psychometric inventories may undermine the reliability of LLM evaluations, but no systematic quantification exists; this paper fills that gap. Method: Proposes a framework that measures data contamination along three dimensions (item memorization, evaluation memorization, and target score matching) and applies it to 21 mainstream LLMs and four widely used psychometric inventories. Result: Several popular inventories, such as BFI-44 and PVQ-40, show strong contamination: models not only memorize items but can also adjust their responses to reach specific target scores. Conclusion: Current psychometric evaluations of LLMs may be distorted by training data contamination, so results should be interpreted with caution and evaluation methods improved to ensure validity. Abstract: Recent studies apply psychometric questionnaires to Large Language Models (LLMs) to assess high-level psychological constructs such as values, personality, moral foundations, and dark traits. Although prior work has raised concerns about possible data contamination from psychometric inventories, which may threaten the reliability of such evaluations, there has been no systematic attempt to quantify the extent of this contamination. To address this gap, we propose a framework to systematically measure data contamination in psychometric evaluations of LLMs, evaluating three aspects: (1) item memorization, (2) evaluation memorization, and (3) target score matching. Applying this framework to 21 models from major families and four widely used psychometric inventories, we provide evidence that popular inventories such as the Big Five Inventory (BFI-44) and Portrait Values Questionnaire (PVQ-40) exhibit strong contamination, where models not only memorize items but can also adjust their responses to achieve specific target scores.

[86] CARPAS: Towards Content-Aware Refinement of Provided Aspects for Summarization in Large Language Models

Yong-En Tian,Yu-Chien Tang,An-Zi Yen,Wen-Chih Peng

Main category: cs.CL

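TL;DR: This paper introduces a new task, CARPAS, which dynamically adjusts the provided aspects for summarization according to document content, and guides large language models toward more accurate summaries by predicting the number of relevant aspects.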

Details

Motivation: In real-world scenarios the provided aspects may be incomplete, irrelevant, or missing from the document, and existing methods adapt poorly, so an approach that refines aspects according to document content is needed. Method: Constructs three new datasets and proposes a subtask that first predicts the number of relevant aspects, then uses that prediction to guide LLMs in aspect refinement and summary generation. Result: Experiments show that the approach significantly improves performance on all datasets and that LLMs comply with the requested number of aspects even when it differs from their own estimate. Conclusion: Content-aware aspect refinement combined with the number-prediction subtask improves aspect-based summarization quality and provides an important insight into the controllability of LLMs in real-world applications. Abstract: Aspect-based summarization has attracted significant attention for its ability to generate more fine-grained and user-aligned summaries. While most existing approaches assume a set of predefined aspects as input, real-world scenarios often present challenges where these given aspects may be incomplete, irrelevant, or entirely missing from the document. Users frequently expect systems to adaptively refine or filter the provided aspects based on the actual content. In this paper, we initiate this novel task setting, termed Content-Aware Refinement of Provided Aspects for Summarization (CARPAS), with the aim of dynamically adjusting the provided aspects based on the document context before summarizing. We construct three new datasets to facilitate our pilot experiments, and by using LLMs with four representative prompting strategies in this task, we find that LLMs tend to predict an overly comprehensive set of aspects, which often results in excessively long and misaligned summaries. Building on this observation, we propose a preliminary subtask to predict the number of relevant aspects, and demonstrate that the predicted number can serve as effective guidance for the LLMs, reducing the inference difficulty, and enabling them to focus on the most pertinent aspects. Our extensive experiments show that the proposed approach significantly improves performance across all datasets. Moreover, our deeper analyses uncover LLMs' compliance when the requested number of aspects differs from their own estimations, establishing a crucial insight for the deployment of LLMs in similar real-world applications.

[87] Biasless Language Models Learn Unnaturally: How LLMs Fail to Distinguish the Possible from the Impossible

Imry Ziv,Nur Lan,Emmanuel Chemla,Roni Katzir

Main category: cs.CL

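TL;DR: This paper asks whether large language models such as GPT-2 distinguish humanly possible from humanly impossible languages the way humans do; it finds that GPT-2 mostly learns both equally easily and provides no systematic separation between natural and impossible languages, indicating that LLMs lack the innate biases of human language acquisition.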

Details

Motivation: To examine whether large language models have innate biases like those of human language acquisition, and thus how similar their learning mechanisms are to humans'. Method: Compares GPT-2's learning curves on real language datasets and on "impossible language" datasets derived from them by perturbation, and analyzes cross-linguistic variance in metrics computed on the perplexity curves to test whether the model separates the two classes. Result: In most cases GPT-2 learns a natural language and its impossible counterpart equally easily, and at the aggregate level it shows no systematic separation between possible and impossible languages. Conclusion: LLMs do not share the innate human biases that shape linguistic typology and differ fundamentally from human language learning. Abstract: Are large language models (LLMs) sensitive to the distinction between humanly possible languages and humanly impossible languages? This question is taken by many to bear on whether LLMs and humans share the same innate learning biases. Previous work has attempted to answer it in the positive by comparing LLM learning curves on existing language datasets and on "impossible" datasets derived from them via various perturbation functions. Using the same methodology, we examine this claim on a wider set of languages and impossible perturbations. We find that in most cases, GPT-2 learns each language and its impossible counterpart equally easily, in contrast to previous claims. We also apply a more lenient condition by testing whether GPT-2 provides any kind of separation between the whole set of natural languages and the whole set of impossible languages. By considering cross-linguistic variance in various metrics computed on the perplexity curves, we show that GPT-2 provides no systematic separation between the possible and the impossible. Taken together, these perspectives show that LLMs do not share the human innate biases that shape linguistic typology.

[88] Sunflower: A New Approach To Expanding Coverage of African Languages in Large Language Models

Benjamin Akera,Evelyn Nafula Ouma,Gilbert Yiga,Patrick Walukagga,Phionah Natukunda,Trevor Saaka,Solomon Nsumba,Lilian Teddy Nabukeera,Joel Muhanguzi,Imran Sekalala,Nimpamya Janat Namara,Engineer Bainomugisha,Ernest Mwebaze,John Quinn

Main category: cs.CL

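TL;DR: This paper presents Sunflower 14B and 32B, open-source large models based on Qwen 3 and optimized for Uganda's multilingual setting, which substantially improve comprehension of most Ugandan languages, and argues that a regionally focused strategy is more efficient than broad coverage.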

Details

Motivation: Africa has more than 2000 languages, most of which have been bypassed by language technology; leading LLMs concentrate on major languages, leaving support for smaller languages piecemeal and inadequate, so a more efficient regional approach is needed. Method: Adopts a regionally focused strategy with Uganda as a case study, developing two open-source models, Sunflower 14B and 32B, based on Qwen 3 and trained and optimized for Uganda's main languages. Result: The Sunflower models show state-of-the-art comprehension in the majority of Ugandan languages and handle these languages better than general-purpose LLMs. Conclusion: Regionally focused model development narrows the language technology gap more efficiently; the Sunflower models offer a reusable path for multilingual regions and help reduce language barriers in practical applications. Abstract: There are more than 2000 living languages in Africa, most of which have been bypassed by advances in language technology. Current leading LLMs exhibit strong performance on a number of the most common languages (e.g. Swahili or Yoruba), but prioritise support for the languages with the most speakers first, resulting in piecemeal ability across disparate languages. We contend that a regionally focussed approach is more efficient, and present a case study for Uganda, a country with high linguistic diversity. We describe the development of Sunflower 14B and 32B, a pair of models based on Qwen 3 with state of the art comprehension in the majority of all Ugandan languages. These models are open source and can be used to reduce language barriers in a number of important practical applications.

[89] Language Lives in Sparse Dimensions: Toward Interpretable and Efficient Multilingual Control for Large Language Models

Chengzhi Zhong,Fei Cheng,Qianying Liu,Yugo Murawaki,Chenhui Chu,Sadao Kurohashi

Main category: cs.CL

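TL;DR: Proposes a training-free method that controls the output language in multilingual generation by identifying and manipulating a small, sparse set of dimensions, requiring as few as 50 parallel or monolingual sentences and outperforming prior neuron-based methods.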

Details

Motivation: To understand why large language models are multilingual despite limited non-English data, and whether their cross-lingual transition is controlled by a small number of dimensions at fixed positions. Method: Hypothesizes that the cross-lingual transition is governed by a small, sparse set of dimensions at consistent indices across the intermediate to final layers, and proposes a training-free method that identifies and manipulates these dimensions using a small amount of parallel or monolingual data. Result: Experiments show that these dimensions are interpretable and that intervening on them switches the output language while preserving semantic content, outperforming prior neuron-level methods at substantially lower cost. Conclusion: Cross-lingual generation in LLMs is controlled by a few key dimensions, and the proposed method enables precise, efficient, low-cost intervention on multilingual output. Abstract: Large language models exhibit strong multilingual capabilities despite limited exposure to non-English data. Prior studies show that English-centric large language models map multilingual content into English-aligned representations at intermediate layers and then project them back into target-language token spaces in the final layer. From this observation, we hypothesize that this cross-lingual transition is governed by a small and sparse set of dimensions, which occur at consistent indices across the intermediate to final layers. Building on this insight, we introduce a simple, training-free method to identify and manipulate these dimensions, requiring only as few as 50 sentences of either parallel or monolingual data. Experiments on a multilingual generation control task reveal the interpretability of these dimensions, demonstrating that the interventions in these dimensions can switch the output language while preserving semantic content, and that it surpasses the performance of prior neuron-based approaches at a substantially lower cost.
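
A toy sketch may help make "identify and manipulate a sparse set of dimensions" concrete. The abstract does not say how the dimensions are found or edited, so everything here is an assumption: dimensions are ranked by the absolute difference of mean activations between two languages, and the intervention overwrites those dimensions with the target-language mean. Hidden states are plain Python lists, not real model activations.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def top_k_dimensions(source_states, target_states, k):
    """Indices of the k dimensions whose mean activation differs most."""
    source_mean = mean_vector(source_states)
    target_mean = mean_vector(target_states)
    gaps = [abs(s - t) for s, t in zip(source_mean, target_mean)]
    return sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)[:k]


def intervene(hidden_state, dims, target_mean):
    """Copy the state and overwrite only the selected dimensions."""
    edited = list(hidden_state)
    for d in dims:
        edited[d] = target_mean[d]
    return edited


# Hypothetical 3-dimensional hidden states for two languages.
source_states = [[1.0, 0.0, 5.0], [1.0, 0.0, 7.0]]
target_states = [[1.0, 0.0, -5.0], [1.0, 0.2, -7.0]]

dims = top_k_dimensions(source_states, target_states, k=1)
steered = intervene([1.0, 0.0, 6.5], dims, mean_vector(target_states))
print(dims, steered)
```

Only the selected dimension changes, which mirrors the claim that a sparse intervention can switch language while leaving the rest of the representation (and, by hypothesis, the semantics) untouched.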

[90] How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu

Benjamin Akera,Evelyn Nafula,Patrick Walukagga,Gilbert Yiga,John Quinn,Ernest Mwebaze

Main category: cs.CL

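TL;DR: This paper studies the feasibility of automatic speech recognition (ASR) with the Whisper model for low-resource African languages; experiments on Kinyarwanda and Kikuyu show that as little as 50 hours of training data yields usable performance (WER < 13%) and that data quality is critical to system performance.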

Details

Motivation: ASR development for low-resource African languages is hampered by the lack of transcribed speech data; multilingual models such as Whisper open new opportunities, but questions remain about required data volumes and common failure modes in deployment. Method: Evaluates Whisper on two Bantu languages: a systematic data scaling analysis on Kinyarwanda with training sets from 1 to 1,400 hours, and a detailed error analysis on Kikuyu with 270 hours of training data. Result: WER falls below 13% with 50 hours of training data and below 10% at 200 hours; error analysis shows that 38.6% of high-error cases stem from noisy ground truth transcriptions, underlining the importance of data quality. Conclusion: Data volume and data quality are equally critical, and the results provide actionable benchmarks and guidance for deploying ASR in similar low-resource language settings. Abstract: The development of Automatic Speech Recognition (ASR) systems for low-resource African languages remains challenging due to limited transcribed speech data. While recent advances in large multilingual models like OpenAI's Whisper offer promising pathways for low-resource ASR development, critical questions persist regarding practical deployment requirements. This paper addresses two fundamental concerns for practitioners: determining the minimum data volumes needed for viable performance and characterizing the primary failure modes that emerge in production systems. We evaluate Whisper's performance through comprehensive experiments on two Bantu languages: systematic data scaling analysis on Kinyarwanda using training sets from 1 to 1,400 hours, and detailed error characterization on Kikuyu using 270 hours of training data. Our scaling experiments demonstrate that practical ASR performance (WER < 13%) becomes achievable with as little as 50 hours of training data, with substantial improvements continuing through 200 hours (WER < 10%). Complementing these volume-focused findings, our error analysis reveals that data quality issues, particularly noisy ground truth transcriptions, account for 38.6% of high-error cases, indicating that careful data curation is as critical as data volume for robust system performance. These results provide actionable benchmarks and deployment guidance for teams developing ASR systems across similar low-resource language contexts. We release accompanying materials and models; see https://github.com/SunbirdAI/kinyarwanda-whisper-eval
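
The WER thresholds quoted above refer to word error rate, conventionally the word-level edit distance between hypothesis and reference divided by the number of reference words. The sketch below is a minimal standalone implementation of that standard definition; it is not the paper's evaluation code, and it applies no text normalization (casing, punctuation), which real evaluations usually do.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution_cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because the denominator is the reference length, a noisy reference transcription inflates WER even when the model output is correct, which is consistent with the abstract's point about noisy ground truth.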

[91] Where to Begin: Efficient Pretraining via Subnetwork Selection and Distillation

Arjun Krishnakumar,Rhea Sanjay Sukthanker,Hannan Javed Mahadik,Gabriela Kadlecová,Vladyslav Moroshan,Timur Carstensen,Frank Hutter,Aaron Klein

Main category: cs.CL

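TL;DR: Proposes an efficient framework for pretraining small language models (SLMs) that combines structurally sparse sub-network initialization, evolutionary search, and knowledge distillation, substantially reducing pretraining cost.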

Details

Motivation: To improve the performance and training efficiency of small language models under limited resources and reduce dependence on large-scale compute. Method: Uses structurally sparse sub-network initializations, applies evolutionary search to automatically discover high-quality initializations, and distills knowledge from larger models to speed up training and improve generalization. Result: The method outperforms randomly initialized models under the same compute budget; the best model matches the validation perplexity of a comparable Pythia SLM with 9.2x fewer pretraining tokens. Conclusion: The framework markedly improves the efficiency and accessibility of SLM pretraining and offers a reproducible path to cost-efficient SLM development at scale. Abstract: Small Language Models (SLMs) offer an efficient and accessible alternative to Large Language Models (LLMs), delivering strong performance while using far fewer resources. We introduce a simple and effective framework for pretraining SLMs that brings together three complementary ideas. First, we identify structurally sparse sub-network initializations that consistently outperform randomly initialized models of similar size under the same compute budget. Second, we use evolutionary search to automatically discover high-quality sub-network initializations, providing better starting points for pretraining. Third, we apply knowledge distillation from larger teacher models to speed up training and improve generalization. Together, these components make SLM pretraining substantially more efficient: our best model, discovered using evolutionary search and initialized with LLM weights, matches the validation perplexity of a comparable Pythia SLM while requiring 9.2x fewer pretraining tokens. We release all code and models at https://github.com/whittle-org/whittle/, offering a practical and reproducible path toward cost-efficient small language model development at scale.
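The abstract does not say which distillation objective is used. As an illustration only, the sketch below shows the classic temperature-scaled formulation: the KL divergence from the teacher's softened distribution to the student's, multiplied by the squared temperature. The names `softmax` and `distillation_loss` and the default temperature are assumptions for this example, not the whittle library's API.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl
```

In practice this term is usually mixed with the ordinary next-token cross-entropy; the mixing weight is another hyperparameter this sketch leaves out.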

[92] Customer-R1: Personalized Simulation of Human Behaviors via RL-based LLM Agent in Online Shopping

Ziyi Wang,Yuxuan Lu,Yimeng Zhang,Jing Huang,Dakuo Wang

Main category: cs.CL

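TL;DR: Proposes Customer-R1, an RL-based method for personalized user behavior simulation that conditions on an explicit persona and uses action correctness reward signals, achieving more accurate step-wise behavior prediction in online shopping environments.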

Details

Motivation: Existing methods (prompting, supervised fine-tuning, and reinforcement learning) mainly learn a population-level policy and do not model individual users, so the simulated behavior is generic. How to make LLMs better simulate personalized user behavior is therefore a key question. Method: Proposes Customer-R1, an RL-based method whose policy is conditioned on an explicit persona and which optimizes next-step rationale and action generation with action correctness reward signals. Result: On the OPeRA dataset, Customer-R1 significantly outperforms prompting and SFT baselines on next-action prediction and better matches users' real action distribution, indicating higher fidelity in personalized behavior simulation. Conclusion: By combining explicit persona modeling with reinforcement learning, Customer-R1 improves LLMs' ability to simulate personalized user behavior in online shopping. Abstract: Simulating step-wise human behavior with Large Language Models (LLMs) has become an emerging research direction, enabling applications in various practical domains. While prior methods, including prompting, supervised fine-tuning (SFT), and reinforcement learning (RL), have shown promise in modeling step-wise behavior, they primarily learn a population-level policy without conditioning on a user's persona, yielding generic rather than personalized simulations. In this work, we pose a critical question: how can LLM agents better simulate personalized user behavior? We introduce Customer-R1, an RL-based method for personalized, step-wise user behavior simulation in online shopping environments. Our policy is conditioned on an explicit persona, and we optimize next-step rationale and action generation via action correctness reward signals. Experiments on the OPeRA dataset demonstrate that Customer-R1 not only significantly outperforms prompting and SFT-based baselines in next-action prediction tasks, but also better matches users' action distribution, indicating higher fidelity in personalized behavior simulation.

[93] Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships

Donggyu Lee,Sungwon Park,Yerin Hwang,Hyunwoo Oh,Hyoshin Kim,Jungwon Kim,Meeyoung Cha,Sangyoon Park,Jihee Kim

Main category: cs.CL

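TL;DR: Proposes a new causal reasoning benchmark built from economics and finance journals, covering multiple domains and task types; experiments show that current LLMs have substantial limitations in causal identification.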

Details

Motivation: Existing causal reasoning benchmarks rely on synthetic data and cover narrow domains, so they do not reflect how well LLMs understand causality in realistic, complex settings. Method: Relationships validated by rigorous causal identification methods (such as instrumental variables, difference-in-differences, and regression discontinuity) are extracted from top-tier economics and finance journals to build a benchmark of 40,379 evaluation items spanning five task types across domains such as health, environment, and technology. Result: Across eight state-of-the-art LLMs, the best model reaches only 57.6% accuracy; model scale does not consistently correlate with performance, and advanced reasoning models still struggle with basic causal identification. Conclusion: Current LLMs have major shortcomings in reliable causal reasoning and fall short of what high-stakes applications require; targeted improvements are needed. Abstract: Causal reasoning is fundamental for Large Language Models (LLMs) to understand genuine cause-and-effect relationships beyond pattern matching. Existing benchmarks suffer from critical limitations such as reliance on synthetic data and narrow domain coverage. We introduce a novel benchmark constructed from causally identified relationships extracted from top-tier economics and finance journals, drawing on rigorous methodologies including instrumental variables, difference-in-differences, and regression discontinuity designs. Our benchmark comprises 40,379 evaluation items covering five task types across domains such as health, environment, technology, law, and culture. Experimental results on eight state-of-the-art LLMs reveal substantial limitations, with the best model achieving only 57.6% accuracy. Moreover, model scale does not consistently translate to superior performance, and even advanced reasoning models struggle with fundamental causal relationship identification. These findings underscore a critical gap between current LLM capabilities and demands of reliable causal reasoning in high-stakes applications.

[94] LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding

Zhivar Sourati,Zheng Wang,Marianne Menglin Liu,Yazhe Hu,Mengqing Guo,Sujeeth Bharadwaj,Kyu Han,Tao Sheng,Sujith Ravi,Morteza Dehghani,Dan Roth

Main category: cs.CL

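TL;DR: Proposes LAD-RAG, a layout-aware dynamic RAG framework that builds a symbolic document graph and combines it with neural embeddings, achieving higher retrieval recall and answer accuracy on question answering over visually rich documents.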

Details

Motivation: Conventional RAG methods ignore a document's structural organization and cross-page dependencies when handling visually rich documents, and they retrieve a fixed number of pages, which leads to incomplete evidence and lower answer quality on multi-page reasoning tasks. Method: During ingestion, a symbolic document graph capturing layout structure and cross-page dependencies is built and stored alongside neural embeddings; during inference, an LLM agent dynamically interacts with the neural and symbolic indices to adaptively retrieve the evidence it needs. Result: On MMLongBench-Doc, LongDocURL, DUDE, and MP-DocVQA, LAD-RAG achieves over 90% perfect recall on average without top-k tuning, improves recall by up to 20% over baseline retrievers at comparable noise levels, and yields higher QA accuracy with minimal latency. Conclusion: By combining symbolic document structure with dynamic retrieval, LAD-RAG improves complex multi-page question answering over visually rich documents. Abstract: Question answering over visually rich documents (VRDs) requires reasoning not only over isolated content but also over documents' structural organization and cross-page dependencies. However, conventional retrieval-augmented generation (RAG) methods encode content in isolated chunks during ingestion, losing structural and cross-page dependencies, and retrieve a fixed number of pages at inference, regardless of the specific demands of the question or context. This often results in incomplete evidence retrieval and degraded answer quality for multi-page reasoning tasks. To address these limitations, we propose LAD-RAG, a novel Layout-Aware Dynamic RAG framework. During ingestion, LAD-RAG constructs a symbolic document graph that captures layout structure and cross-page dependencies, adding it alongside standard neural embeddings to yield a more holistic representation of the document. During inference, an LLM agent dynamically interacts with the neural and symbolic indices to adaptively retrieve the necessary evidence based on the query. Experiments on MMLongBench-Doc, LongDocURL, DUDE, and MP-DocVQA demonstrate that LAD-RAG improves retrieval, achieving over 90% perfect recall on average without any top-k tuning, and outperforming baseline retrievers by up to 20% in recall at comparable noise levels, yielding higher QA accuracy with minimal latency.

[95] When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation

Xunyi Jiang,Dingyi Chang,Julian McAuley,Xin Xu

Main category: cs.CL

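TL;DR: This paper systematically studies the aging of existing factuality benchmarks and finds that many widely used ones are outdated, making evaluations of LLM factuality unreliable. The authors propose an up-to-date fact retrieval pipeline and three metrics to quantify the effect of aging, and call for attention to benchmark timeliness.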

Details

Motivation: As LLMs and the real world evolve rapidly, static benchmarks cannot keep pace, which may severely bias evaluations of LLM factuality; this problem has not been studied in depth. Method: Five popular factuality benchmarks and eight LLMs released in different years are examined; an up-to-date fact retrieval pipeline is built, and three metrics are designed to quantify benchmark aging and its effect on evaluation results. Result: A considerable portion of the samples in widely used factuality benchmarks are outdated, leading to unreliable assessments of LLM factuality. Conclusion: Aging factuality benchmarks can mislead model evaluation, and mechanisms for updating them dynamically are needed; the study provides a testbed for assessing benchmark reliability and calls for more research on benchmark aging. Abstract: The rapid evolution of large language models (LLMs) and the real world has outpaced the static nature of widely used evaluation benchmarks, raising concerns about their reliability for evaluating LLM factuality. While a substantial body of work continues to rely on popular but old benchmarks, their temporal misalignment with real-world facts and modern LLMs, and their effects on LLM factuality evaluation, remain underexplored. Therefore, in this work, we present a systematic investigation of this issue by examining five popular factuality benchmarks and eight LLMs released across different years. An up-to-date fact retrieval pipeline and three metrics are tailored to quantify benchmark aging and its impact on LLM factuality evaluation. Experimental results and analysis illustrate that a considerable portion of samples in the widely used factuality benchmarks are outdated, leading to unreliable assessments of LLM factuality. We hope our work can provide a testbed to assess the reliability of a benchmark for LLM factuality evaluation and inspire more research on the benchmark aging issue. Code is available at https://github.com/JiangXunyi/BenchAge.

[96] Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts

Christos Ziakas,Nicholas Loo,Nishita Jain,Alessandra Russo

Main category: cs.CL

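TL;DR: Proposes Red-Bandit, an online-adaptive red-teaming framework based on a multi-armed bandit policy; LoRA experts are trained with reinforcement learning for distinct attack styles, enabling efficient probing of LLM vulnerabilities.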

Details

Motivation: Existing automated red-teaming methods lack mechanisms to adapt efficiently to model-specific vulnerabilities at inference and cannot adjust dynamically across attack styles. Method: Parameter-efficient LoRA experts, each specialized in one attack style, are post-trained with reinforcement learning; at inference, a multi-armed bandit policy dynamically selects the most effective attack expert, balancing exploration and exploitation. Result: Red-Bandit achieves state-of-the-art results on AdvBench (ASR@10), produces more human-readable prompts (lower perplexity), and identifies the attack styles that most effectively elicit unsafe behavior. Conclusion: Red-Bandit adaptively discovers and exploits model-specific vulnerabilities, and its policy can also serve as a diagnostic tool that reveals model weaknesses. Abstract: Automated red-teaming has emerged as a scalable approach for auditing Large Language Models (LLMs) prior to deployment, yet existing approaches lack mechanisms to efficiently adapt to model-specific vulnerabilities at inference. We introduce Red-Bandit, a red-teaming framework that adapts online to identify and exploit model failure modes under distinct attack styles (e.g., manipulation, slang). Red-Bandit post-trains a set of parameter-efficient LoRA experts, each specialized for a particular attack style, using reinforcement learning that rewards the generation of unsafe prompts via a rule-based safety model. At inference, a multi-armed bandit policy dynamically selects among these attack-style experts based on the target model's response safety, balancing exploration and exploitation. Red-Bandit achieves state-of-the-art results on AdvBench under sufficient exploration (ASR@10), while producing more human-readable prompts (lower perplexity). Moreover, Red-Bandit's bandit policy serves as a diagnostic tool for uncovering model-specific vulnerabilities by indicating which attack styles most effectively elicit unsafe behaviors.
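The abstract does not name the bandit algorithm. To make the exploration/exploitation trade-off concrete, the sketch below uses UCB1 as a stand-in: each arm is an abstract "expert" label, and the reward is a simulated binary success signal. The class `UCB1`, the function `simulate`, and the arm names and success probabilities are all hypothetical; nothing here generates prompts or queries a model.

```python
import math
import random

class UCB1:
    """Upper-confidence-bound selection over a fixed set of arms."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = {arm: 0 for arm in self.arms}
        self.totals = {arm: 0.0 for arm in self.arms}

    def select(self):
        # Play every arm once before trusting any estimate.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        rounds = sum(self.counts.values())

        def upper_bound(arm):
            mean = self.totals[arm] / self.counts[arm]
            bonus = math.sqrt(2 * math.log(rounds) / self.counts[arm])
            return mean + bonus  # exploitation term plus exploration bonus

        return max(self.arms, key=upper_bound)

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.totals[arm] += reward

def simulate(success_prob, rounds=500, seed=0):
    """Run the bandit against simulated Bernoulli rewards; return pull counts."""
    rng = random.Random(seed)
    bandit = UCB1(success_prob)  # iterating a dict yields its keys (arm names)
    for _ in range(rounds):
        arm = bandit.select()
        reward = 1.0 if rng.random() < success_prob[arm] else 0.0
        bandit.update(arm, reward)
    return bandit.counts
```

The pull counts returned by `simulate` are the kind of signal the abstract describes as diagnostic: arms that accumulate most pulls are the ones the policy found most rewarding.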

[97] Hybrid Reinforcement: When Reward Is Sparse, It's Better to Be Dense

Leitian Tao,Ilia Kulikov,Swarnadeep Saha,Tianlu Wang,Jing Xu,Yixuan Li,Jason E Weston,Ping Yu

Main category: cs.CL

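TL;DR: This paper proposes HERO, a framework that combines verifier signals with reward-model scores for more effective learning in reasoning post-training of LLMs.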

Details

Motivation: Conventional binary feedback is too strict: it cannot properly credit partially correct or alternative answers, which limits learning. Method: Proposes the HERO framework, which combines verifier signals with reward-model scores using stratified normalization and variance-aware weighting. Result: Across multiple mathematical reasoning benchmarks, HERO outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Conclusion: Hybrid reward design retains the stability of verifiers while using the finer-grained feedback of reward models to improve reasoning. Abstract: Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle: many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
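One way to read "stratified normalization" is sketched below: reward-model scores are min-max scaled separately inside the verifier-correct and verifier-incorrect groups, then mapped into disjoint bands so that every verified-correct response still outranks every incorrect one. The band width, the band positions, and the function name `stratified_rewards` are assumptions for illustration; the paper's exact formula and its variance-aware weighting are not reproduced here.

```python
def stratified_rewards(verifier_labels, rm_scores, band=0.2):
    """Map RM scores into [0, band] for incorrect and [1 - band, 1] for correct."""
    if len(verifier_labels) != len(rm_scores):
        raise ValueError("labels and scores must have the same length")
    if not 0.0 <= band < 0.5:
        raise ValueError("band must be in [0, 0.5) so the groups cannot overlap")
    rewards = [0.0] * len(rm_scores)
    for label in (0, 1):
        members = [i for i, v in enumerate(verifier_labels) if v == label]
        if not members:
            continue
        low = min(rm_scores[i] for i in members)
        high = max(rm_scores[i] for i in members)
        span = high - low
        base = 1.0 - band if label == 1 else 0.0
        for i in members:
            # Position of this score inside its own group, in [0, 1].
            unit = (rm_scores[i] - low) / span if span > 0 else 0.5
            rewards[i] = base + band * unit
    return rewards
```

For example, with labels `[1, 1, 0, 0]` and scores `[2.0, 4.0, 9.0, 5.0]`, the incorrect response scored 9.0 by the reward model still receives a lower reward than both correct responses, which is the "preserving correctness" property the abstract describes.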

[98] LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation

Joseph Enguehard,Morgane Van Ermengem,Kate Atkinson,Sujeong Cha,Arijit Ghosh Chowdhury,Prashanth Kallur Ramaswamy,Jeremy Roghair,Hannah R Marlowe,Carina Suzana Negreanu,Kitty Boxall,Diana Mincu

Main category: cs.CL

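TL;DR: This paper proposes a new reference-free method for evaluating LLM outputs in the legal domain. It decomposes answers into "Legal Data Points" (LDPs) to mimic how lawyers assess answers, improving reliability and agreement with human expert judgments, and it outperforms existing baselines on multiple datasets.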

Details

Motivation: Existing LLM evaluation methods in the legal domain either depend on expensive reference data or rely on standardized assessments, and LLM-as-a-Judge is not sufficiently reliable or trustworthy in legal contexts; a solution that better matches how the legal industry evaluates work is needed. Method: Long legal answers are broken down into self-contained units of information, "Legal Data Points" (LDPs), on which a new reference-free evaluation framework is built that scores answers automatically in the way lawyers evaluate legal answers. Result: The method outperforms a variety of baselines on both a proprietary dataset and LegalBench, correlates more closely with human expert evaluations, and improves inter-annotator agreement; part of the LDP annotation data is open-sourced. Conclusion: The LDP-based evaluation method is better aligned with legal professional practice, improves the reliability and reproducibility of evaluating LLMs on legal question answering, and advances research on legal AI evaluation. Abstract: Evaluating large language model (LLM) outputs in the legal domain presents unique challenges due to the complex and nuanced nature of legal analysis. Current evaluation approaches either depend on reference data, which is costly to produce, or use standardized assessment methods, both of which have significant limitations for legal applications. Although LLM-as-a-Judge has emerged as a promising evaluation technique, its reliability and effectiveness in legal contexts depend heavily on evaluation processes unique to the legal industry and how trustworthy the evaluation appears to the human legal expert. This is where existing evaluation methods currently fail and exhibit considerable variability. This paper aims to close the gap: a) we break down lengthy responses into 'Legal Data Points' (LDPs), self-contained units of information, and introduce a novel, reference-free evaluation methodology that reflects how lawyers evaluate legal answers; b) we demonstrate that our method outperforms a variety of baselines on both our proprietary dataset and an open-source dataset (LegalBench); c) we show how our method correlates more closely with human expert evaluations and helps improve inter-annotator agreement; and finally d) we open source our Legal Data Points for a subset of LegalBench used in our experiments, allowing the research community to replicate our results and advance research in this vital area of LLM evaluation on legal question-answering.

[99] Don't Adapt Small Language Models for Tools; Adapt Tool Schemas to the Models

Jonggeun Lee,Woojung Song,Jongwook Han,Haesung Pyun,Yohan Jo

Main category: cs.CL

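TL;DR: This paper proposes PA-Tool, a training-free tool schema generation method that renames tool components to match small language models' pretrained knowledge, substantially improving their tool-use ability.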

Details

Motivation: Small language models perform poorly on tool-use tasks, especially tool selection and parameter identification, and often fail because of schema misalignment (for example, hallucinating tool names that do not exist). Existing approaches require the model to adapt to arbitrary tool schemas, with limited success. Method: Proposes PA-Tool, which uses "peakedness", a signal from contamination detection that measures pretraining familiarity, to automatically generate and select tool names aligned with the model's pretrained knowledge, renaming schemas automatically. Result: Performance improves by up to 17 percentage points on MetaTool and RoTBench, schema misalignment errors fall by 80%, and small models approach state-of-the-art performance while adapting to new tools without retraining. Conclusion: By adapting tool schemas to models rather than the reverse, PA-Tool shows that schema-level interventions can unlock the tool-use potential of small models while balancing efficiency and performance. Abstract: Small language models (SLMs) offer significant computational advantages for tool-augmented AI systems, yet they struggle with tool-use tasks, particularly in selecting appropriate tools and identifying correct parameters. A common failure mode is schema misalignment: models hallucinate plausible but non-existent tool names that reflect naming conventions internalized during pretraining but absent from the provided tool schema. Rather than forcing models to adapt to arbitrary schemas, we propose adapting schemas to align with models' pretrained knowledge. We introduce PA-Tool (Pretraining-Aligned Tool Schema Generation), a training-free method that leverages peakedness (a signal from contamination detection indicating pretraining familiarity) to automatically rename tool components. By generating multiple candidates and selecting those with highest output concentration across samples, PA-Tool identifies pretrain-aligned naming patterns. Experiments on MetaTool and RoTBench show improvements of up to 17 percentage points, with schema misalignment errors reduced by 80%. PA-Tool enables small models to approach state-of-the-art performance while maintaining computational efficiency for adaptation to new tools without retraining. Our work demonstrates that schema-level interventions can unlock the tool-use potential of resource-efficient models by adapting schemas to models rather than models to schemas.
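The selection step ("generating multiple candidates and selecting those with highest output concentration across samples") can be pictured with the sketch below. It assumes that concentration means the share of sampled candidates that agree after simple normalization; the function name `most_peaked_name` and the sample strings are hypothetical, and in a real pipeline the samples would come from repeatedly prompting the target model.

```python
from collections import Counter

def most_peaked_name(samples):
    """Return the most frequent candidate and the fraction of samples choosing it."""
    if not samples:
        raise ValueError("need at least one sampled candidate")
    # Normalize trivially different spellings before counting.
    counts = Counter(s.strip().lower() for s in samples)
    name, freq = counts.most_common(1)[0]
    return name, freq / len(samples)

# Hypothetical candidates sampled for one tool description.
sampled = ["get_weather", "get_weather", "fetch_weather", "Get_Weather ", "weather_lookup"]
best_name, concentration = most_peaked_name(sampled)
```

A high concentration suggests the model has a strong prior for that name; a flat distribution suggests no candidate is particularly familiar.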

[100] Online Rubrics Elicitation from Pairwise Comparisons

MohammadHossein Rezaei,Robert Vacareanu,Zihao Wang,Clinton Wang,Yunzhong He,Afra Feyza Akyürek

Main category: cs.CL

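TL;DR: This paper proposes OnlineRubrics, which dynamically elicits evaluation criteria online by comparing responses from the current policy and a reference policy, continuously improving LLM training and yielding gains of up to 8% over static rubrics.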

Details

Motivation: Static rubrics are prone to reward hacking and cannot capture new desiderata that emerge during training, so evaluation criteria need to be adjusted dynamically to improve training. Method: Proposes Online Rubrics Elicitation (OnlineRubrics), which derives evaluation criteria dynamically from pairwise comparisons of responses generated by the current and reference policies, and keeps updating them during training. Result: On AlpacaEval, GPQA, ArenaHard, and validation sets of expert questions, performance improves consistently by up to 8% over training with static rubrics only. Conclusion: OnlineRubrics identifies and mitigates errors during training; dynamically updated criteria clearly outperform static ones and suit LLM post-training on open-ended long-form generation. Abstract: Rubrics provide a flexible way to train LLMs on open-ended long-form answers where verifiable rewards are not applicable and human preferences provide coarse signals. Prior work shows that reinforcement learning with rubric-based rewards leads to consistent gains in LLM post-training. Most existing approaches rely on rubrics that remain static over the course of training. Such static rubrics, however, are vulnerable to reward-hacking behaviors and fail to capture emergent desiderata that arise during training. We introduce Online Rubrics Elicitation (OnlineRubrics), a method that dynamically curates evaluation criteria in an online manner through pairwise comparisons of responses from current and reference policies. This online process enables continuous identification and mitigation of errors as training proceeds. Empirically, this approach yields consistent improvements of up to 8% over training exclusively with static rubrics across AlpacaEval, GPQA, ArenaHard as well as the validation sets of expert questions and rubrics. We qualitatively analyze the elicited criteria and identify prominent themes such as transparency, practicality, organization, and reasoning.

[101] On the Convergence of Moral Self-Correction in Large Language Models

Guangliang Liu,Haitao Mao,Bochuan Cao,Zhiyu Xue,Xitong Zhang,Rongrong Wang,Kristen Marie Johnson

Main category: cs.CL

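TL;DR: This paper studies the mechanism behind moral self-correction in LLMs, identifies performance convergence over multi-round interaction as a key characteristic, and finds that consistently injected self-correction instructions activate moral concepts and reduce model uncertainty, which stabilizes performance.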

Details

Motivation: Intrinsic self-correction in LLMs has proven effective in practice, but its mechanism is unclear; this paper investigates why moral self-correction works and what drives it. Method: Through experiments and mechanistic analysis, the paper studies how model behavior changes over multiple rounds of self-correction and how instructions activate moral concepts and affect outputs. Result: Consistently injected self-correction instructions activate moral concepts, reduce model uncertainty, and lead to converged performance after multiple rounds of interaction. Conclusion: Moral self-correction can reach converged performance by activating internal moral concepts, which explains the mechanism behind its effectiveness. Abstract: Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only a general and abstract goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. Focusing on moral self-correction in LLMs, we reveal a key characteristic of intrinsic self-correction: performance convergence through multi-round interactions; and provide a mechanistic analysis of this convergence behavior. Based on our experimental results and analysis, we uncover the underlying mechanism of convergence: consistently injected self-correction instructions activate moral concepts that reduce model uncertainty, leading to converged performance as the activated moral concepts stabilize over successive rounds. This paper demonstrates the strong potential of moral self-correction by showing that it exhibits a desirable property of converged performance.

[102] Think Natively: Unlocking Multilingual Reasoning with Consistency-Enhanced Reinforcement Learning

Xue Zhang,Yunlong Liang,Fandong Meng,Songming Zhang,Kaiyu Huang,Yufeng Chen,Jinan Xu,Jie Zhou

Main category: cs.CL

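TL;DR: Proposes M-Thinker, trained with the GRPO algorithm using a Language Consistency (LC) reward and a Cross-lingual Thinking Alignment (CTA) reward, to improve reasoning performance and language consistency in non-English languages.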

Details

Motivation: When handling non-English languages, current large reasoning models often fail to keep input and output languages consistent and follow wrong reasoning paths that lower accuracy, which limits their global use. Method: Uses the GRPO reinforcement learning algorithm with a Language Consistency (LC) reward that enforces the same language across input, thought, and answer, and a Cross-lingual Thinking Alignment (CTA) reward that transfers the model's English reasoning ability to non-English languages. Result: The M-Thinker-1.5B/7B models achieve nearly 100% language consistency and strong performance on two multilingual benchmarks, MMATH and PolyMath, and generalize well to out-of-domain languages. Conclusion: M-Thinker addresses the language consistency and reasoning quality problems of large reasoning models in non-English languages, supporting global deployment of multilingual models. Abstract: Large Reasoning Models (LRMs) have achieved remarkable performance on complex reasoning tasks by adopting the "think-then-answer" paradigm, which enhances both accuracy and interpretability. However, current LRMs exhibit two critical limitations when processing non-English languages: (1) They often struggle to maintain input-output language consistency; (2) They generally perform poorly with wrong reasoning paths and lower answer accuracy compared to English. These limitations significantly degrade the user experience for non-English speakers and hinder the global deployment of LRMs. To address these limitations, we propose M-Thinker, which is trained by the GRPO algorithm that involves a Language Consistency (LC) reward and a novel Cross-lingual Thinking Alignment (CTA) reward. Specifically, the LC reward defines a strict constraint on the language consistency between the input, thought, and answer. Besides, the CTA reward compares the model's non-English reasoning paths with its English reasoning path to transfer its own reasoning capability from English to non-English languages. Through an iterative RL procedure, our M-Thinker-1.5B/7B models not only achieve nearly 100% language consistency and superior performance on two multilingual benchmarks (MMATH and PolyMath), but also exhibit excellent generalization on out-of-domain languages.

[103] Agent Bain vs. Agent McKinsey: A New Text-to-SQL Benchmark for the Business Domain

Yue Li,Ran Tao,Derek Hommel,Yusuf Denizay Dönder,Sungyong Chang,David Mimno,Unso Eun Seo Jo

Main category: cs.CL

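TL;DR: This paper introduces CORGI, a new text-to-SQL benchmark for real-world business settings, with synthetic databases modeled on enterprise scenarios and four increasingly complex categories of business queries; it exposes the performance gap of current LLMs on advanced business intelligence tasks.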

Details

Motivation: Existing text-to-SQL benchmarks focus on factual retrieval of past records and do not cover the complex needs of real business decision making, such as causal reasoning, forecasting, and recommendation, so a benchmark closer to real business scenarios is needed. Method: Builds a new benchmark, CORGI, with synthetic databases inspired by enterprises such as Doordash, Airbnb, and Lululemon, and designs business questions in four categories of increasing complexity (descriptive, explanatory, predictive, and recommendational); a public dataset, evaluation framework, and submission website are provided. Result: Current LLMs perform poorly on high-level questions, struggling in particular to make accurate predictions and offer actionable plans; CORGI is about 21% more difficult than the BIRD benchmark. Conclusion: CORGI highlights the gap between current LLMs and real-world business intelligence needs, and underscores the need for models with multi-level, multi-step agentic intelligence. Abstract: In the business domain, where data-driven decision making is crucial, text-to-SQL is fundamental for easy natural language access to structured data. While recent LLMs have achieved strong performance in code generation, existing text-to-SQL benchmarks remain focused on factual retrieval of past records. We introduce CORGI, a new benchmark specifically designed for real-world business contexts. CORGI is composed of synthetic databases inspired by enterprises such as Doordash, Airbnb, and Lululemon. It provides questions across four increasingly complex categories of business queries: descriptive, explanatory, predictive, and recommendational. This challenge calls for causal reasoning, temporal forecasting, and strategic recommendation, reflecting multi-level and multi-step agentic intelligence. We find that LLM performance drops on high-level questions, struggling to make accurate predictions and offer actionable plans. Based on execution success rate, the CORGI benchmark is about 21% more difficult than the BIRD benchmark. This highlights the gap between popular LLMs and the need for real-world business intelligence. We release a public dataset and evaluation framework, and a website for public submissions.

[104] Vibe Checker: Aligning Code Evaluation with Human Preference

Ming Zhong,Xiang Zhou,Ting-Yun Chang,Qingze Wang,Nan Xu,Xiance Si,Dan Garrette,Shyam Upadhyay,Jeremiah Liu,Jiawei Han,Benoit Schillings,Jiao Sun

Main category: cs.CL

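TL;DR: This paper introduces the notion of a "vibe check", arguing that code generation must satisfy human preference beyond functional correctness. It presents the VeriCode taxonomy and the Vibe Checker testbed to evaluate how well LLMs follow code instructions, and finds that instruction following is a key factor in human preference.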

Details

Motivation: Existing code evaluation methods (such as pass@k) measure only functional correctness and overlook non-functional requirements such as readability and intent preservation, so they do not fully reflect human preference. Method: Proposes VeriCode, a taxonomy of 30 verifiable code instructions with corresponding deterministic verifiers, and integrates it into existing evaluation suites to build Vibe Checker, a testbed that jointly evaluates functional correctness and instruction following. Result: An evaluation of 31 leading LLMs shows that even the strongest models struggle to satisfy multiple instructions at once and exhibit functional regression; a composite score of functional correctness and instruction following correlates best with human preference, with instruction following emerging as the primary differentiator on real-world programming tasks. Conclusion: Instruction following is a key component of the vibe check and should be an important evaluation dimension for code generation models; future models need to balance functional and non-functional requirements better to match real user preferences. Abstract: Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check is tied to real-world human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking the non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check that represents human preference in coding besides functional correctness. To quantify models' code instruction following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with corresponding deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in Vibe Checker, a testbed to assess both code instruction following and functional correctness. Upon evaluating 31 leading LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit clear functional regression. Most importantly, a composite score of functional correctness and instruction following correlates the best with human preference, with the latter emerging as the primary differentiator on real-world programming tasks. Our work identifies core factors of the vibe check, providing a concrete path for benchmarking and developing models that better align with user preferences in coding.
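To give a sense of what a "deterministic verifier" for a code instruction might look like, here is a small sketch with two checks: a maximum line length and a docstring requirement. These two instructions, the function names, and the equal-weight scoring are illustrative assumptions; they are not taken from the VeriCode taxonomy itself.

```python
import ast

def check_max_line_length(source: str, limit: int = 79) -> bool:
    """Instruction: no line may exceed `limit` characters."""
    return all(len(line) <= limit for line in source.splitlines())

def check_functions_have_docstrings(source: str) -> bool:
    """Instruction: every function definition must carry a docstring."""
    tree = ast.parse(source)
    functions = [
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    return all(ast.get_docstring(fn) is not None for fn in functions)

def instruction_following_score(source: str, verifiers) -> float:
    """Fraction of verifiers that the source passes."""
    results = [verifier(source) for verifier in verifiers]
    return sum(results) / len(results)
```

Because each check is a pure function of the source text, the same submission always receives the same verdict, which is what makes such signals usable alongside unit-test pass rates.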

[105] Artificial Hippocampus Networks for Efficient Long-Context Modeling

Yunhao Fang,Weihao Yu,Shu Zhong,Qinghao Ye,Xuehan Xiong,Lai Wei

Main category: cs.CL

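TL;DR: Proposes a memory framework for artificial neural networks that combines the Transformer's KV cache with a learnable AHN module for efficient long-sequence modeling, improving performance while reducing compute and memory cost.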

Details

Motivation: To resolve the efficiency-fidelity trade-off in long-sequence modeling between RNN-like models (efficient but with compressed memory) and Transformer-like models (high fidelity but with growing memory). Method: Inspired by the Multi-Store Model in cognitive science, the method keeps a sliding window of the Transformer KV cache as lossless short-term memory and uses a learnable module, the Artificial Hippocampus Network (AHN), to compress out-of-window information into a fixed-size long-term memory. AHNs are instantiated with modern RNN-like architectures such as Mamba2, DeltaNet, and Gated DeltaNet. Result: On the long-context benchmarks LV-Eval and InfiniteBench, AHN-augmented models clearly outperform sliding-window baselines and match or exceed full-attention models. For example, Qwen2.5-3B-Instruct with AHNs uses 40.5% fewer inference FLOPs and 74.0% less memory cache, while its average LV-Eval score at 128k sequence length rises from 4.41 to 5.88. Conclusion: By separating lossless short-term memory from compressed long-term memory, the framework balances efficiency and accuracy in long-sequence modeling, improving performance while lowering resource use. Abstract: Long-sequence modeling faces a fundamental trade-off between the efficiency of compressive fixed-size memory in RNN-like models and the fidelity of lossless growing memory in attention-based Transformers. Inspired by the Multi-Store Model in cognitive science, we introduce a memory framework of artificial neural networks. Our method maintains a sliding window of the Transformer's KV cache as lossless short-term memory, while a learnable module termed Artificial Hippocampus Network (AHN) recurrently compresses out-of-window information into a fixed-size compact long-term memory. To validate this framework, we instantiate AHNs using modern RNN-like architectures, including Mamba2, DeltaNet, and Gated DeltaNet. Extensive experiments on long-context benchmarks LV-Eval and InfiniteBench demonstrate that AHN-augmented models consistently outperform sliding window baselines and achieve performance comparable or even superior to full-attention models, while substantially reducing computational and memory requirements. For instance, augmenting the Qwen2.5-3B-Instruct with AHNs reduces inference FLOPs by 40.5% and memory cache by 74.0%, while improving its average score on LV-Eval (128k sequence length) from 4.41 to 5.88. Code is available at: https://github.com/ByteDance-Seed/AHN.
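The two-tier memory described above can be pictured with the toy sketch below: a fixed-length window holds recent items exactly, and anything evicted from the window is folded into a fixed-size vector by a recurrent update. The exponential moving average used here is only a stand-in for the learned AHN (Mamba2, DeltaNet, or Gated DeltaNet in the paper); the class name, the `decay` parameter, and the use of plain float lists instead of key/value tensors are all assumptions for illustration.

```python
from collections import deque

class WindowWithCompressedMemory:
    """Lossless sliding window plus a fixed-size summary of evicted items."""

    def __init__(self, window: int, dim: int, decay: float = 0.9):
        self.window = window
        self.decay = decay
        self.recent = deque()        # short-term memory: exact recent vectors
        self.memory = [0.0] * dim    # long-term memory: constant size

    def append(self, vec):
        if len(vec) != len(self.memory):
            raise ValueError("vector has the wrong dimension")
        self.recent.append(list(vec))
        if len(self.recent) > self.window:
            evicted = self.recent.popleft()
            # Recurrent compression step (placeholder for a learned module).
            self.memory = [
                self.decay * m + (1.0 - self.decay) * x
                for m, x in zip(self.memory, evicted)
            ]

    def state_size(self) -> int:
        """Number of floats stored; bounded regardless of sequence length."""
        return len(self.recent) * len(self.memory) + len(self.memory)
```

However many items are appended, `state_size()` never exceeds `(window + 1) * dim`, which mirrors the bounded-cache behavior behind the memory savings reported in the abstract.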

cs.CV

[106] Milestone Determination for Autonomous Railway Operation

Josh Hunter,John McDermid,Simon Burton,Poppy Fynes,Mia Dempster

Main category: cs.CV

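TL;DR: Proposes a milestone-determination approach to training vision systems for railway automation: generating route-specific sequential data simplifies learning and improves safety and efficiency in real-world use.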

Details

Motivation: Existing datasets lack the spatiotemporal context needed for real-time decision making, and alternative solutions raise problems of realism and applicability. Method: Uses rule-based models that focus on the critical decision points along a route (milestones) and generates contextually relevant sequential data, avoiding the need for generalized recognition of dynamic components. Result: Yields rich sequential datasets that align more closely with real operational logic and a practical framework for training vision agents in controlled environments. Conclusion: The approach offers a safer and more efficient way to train machine learning systems for railway automation and suits vision tasks in predictable environments. Abstract: In the field of railway automation, one of the key challenges has been the development of effective computer vision systems due to the limited availability of high-quality, sequential data. Traditional datasets are restricted in scope, lacking the spatiotemporal context necessary for real-time decision-making, while alternative solutions introduce issues related to realism and applicability. By focusing on route-specific, contextually relevant cues, we can generate rich, sequential datasets that align more closely with real-world operational logic. The concept of milestone determination allows for the development of targeted, rule-based models that simplify the learning process by eliminating the need for generalized recognition of dynamic components, focusing instead on the critical decision points along a route. We argue that this approach provides a practical framework for training vision agents in controlled, predictable environments, facilitating safer and more efficient machine learning systems for railway automation.

[107] CML-Bench: A Framework for Evaluating and Enhancing LLM-Powered Movie Scripts Generation

Mingzhe Zheng,Dingjie Song,Guanyu Zhou,Jun You,Jiahao Zhan,Xuran Ma,Xinyuan Song,Ser-Nam Lim,Qifeng Chen,Harry Yang

Main category: cs.CV

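TL;DR: This paper introduces CML-Dataset and CML-Bench for evaluating LLMs on movie script generation, measuring dialogue coherence, character consistency, and plot reasonableness, and proposes the CML-Instruction prompting strategy to improve generation quality.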

Details

Motivation: LLMs can generate structured text, but their movie scripts lack emotional depth and narrative coherence and fail to capture the 'soul' of cinema, so a dedicated benchmark is needed to identify these weaknesses. Method: Builds CML-Dataset from (summary, content) pairs, analyzes the narrative structure of high-quality movie scripts, identifies three key evaluation dimensions (dialogue coherence, character consistency, and plot reasonableness), and designs the CML-Bench benchmark around them; also proposes the CML-Instruction prompting strategy to guide LLMs toward better scripts. Result: CML-Bench effectively separates human-written scripts from LLM-generated ones in quality; experiments show that CML-Instruction markedly improves LLM-generated scripts across all metrics, with results consistent with human preferences. Conclusion: A benchmark grounded in analysis of real scripts, combined with a detailed prompting strategy, improves the narrative quality and cinematic expressiveness of LLM-generated movie scripts. Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in generating highly structured texts. However, while exhibiting a high degree of structural organization, movie scripts demand an additional layer of nuanced storytelling and emotional depth (the 'soul' of compelling cinema) that LLMs often fail to capture. To investigate this deficiency, we first curated CML-Dataset, a dataset comprising (summary, content) pairs for Cinematic Markup Language (CML), where 'content' consists of segments from esteemed, high-quality movie scripts and 'summary' is a concise description of the content. Through an in-depth analysis of the intrinsic multi-shot continuity and narrative structures within these authentic scripts, we identified three pivotal dimensions for quality assessment: Dialogue Coherence (DC), Character Consistency (CC), and Plot Reasonableness (PR). Informed by these findings, we propose the CML-Bench, featuring quantitative metrics across these dimensions. CML-Bench effectively assigns high scores to well-crafted, human-written scripts while concurrently pinpointing the weaknesses in screenplays generated by LLMs. To further validate our benchmark, we introduce CML-Instruction, a prompting strategy with detailed instructions on character dialogue and event logic, to guide LLMs to generate more structured and cinematically sound scripts. Extensive experiments validate the effectiveness of our benchmark and demonstrate that LLMs guided by CML-Instruction generate higher-quality screenplays, with results aligned with human preferences.

[108] User to Video: A Model for Spammer Detection Inspired by Video Classification Technology

Haoyang Zhang,Zhou Yang,Yucai Pang

Main category: cs.CV

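TL;DR: Proposes UVSD, a spammer detection model based on user videoization, which converts user behavior subspaces into video frames and identifies spammers with video classification techniques.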

Details

Motivation: Inspired by video classification technology, the work models user behavior sequences as video data to improve spammer detection. Method: A user2pixel algorithm pixelizes users (with stance encoded as RGB); a behavior2image algorithm converts behavior subspaces into frame images, using low-rank vectorization together with cutting and diffusion algorithms to generate user behavior videos; a video classification method then performs the classification. Result: Experiments on the public WEIBO and TWITTER datasets show that UVSD outperforms state-of-the-art methods. Conclusion: By videoizing user behavior and applying video classification, UVSD effectively improves spammer detection performance. Abstract: This article is inspired by video classification technology. If a user behavior subspace is viewed as a frame image, then consecutive frame images can be viewed as a video. Following this idea, a model for spammer detection based on user videoization, called UVSD, is proposed. Firstly, a user2pixel algorithm for user pixelization is proposed. Considering the adversarial behavior of user stances, the user is viewed as a pixel, and the stance is quantified as the pixel's RGB. Secondly, a behavior2image algorithm is proposed for transforming user behavior subspace into frame images. Low-rank dense vectorization of subspace user relations is performed using representation learning, while cutting and diffusion algorithms are introduced to complete the frame imageization. Finally, user behavior videos are constructed based on temporal features. Subsequently, a video classification algorithm is combined to identify the spammers. Experiments using publicly available datasets, i.e., WEIBO and TWITTER, show an advantage of the UVSD model over state-of-the-art methods.

[109] Uncertainty Quantification In Surface Landmines and UXO Classification Using MC Dropout

Sagar Lekhak,Emmett J. Ientilucci,Dimah Dera,Susmita Ghosh

Main category: cs.CV

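TL;DR: This study integrates MC Dropout into a fine-tuned ResNet-50 for uncertainty quantification in surface landmine and UXO classification, improving prediction reliability under noise and adversarial attacks.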

Details

Motivation: Deterministic neural networks are prone to missed detections or misclassifications under noise and adversarial attacks, so deep learning models for humanitarian demining need to be more reliable. Method: MC Dropout is integrated into a fine-tuned ResNet-50 to classify surface landmines and UXOs on a simulated dataset and to quantify epistemic uncertainty. Result: Experiments on clean, adversarial, and noisy images show that the model can flag unreliable predictions under challenging conditions. Conclusion: Uncertainty quantification helps improve the robustness and decision reliability of deep learning models in demining; the adversarial vulnerability of existing models deserves attention, and more reliable practical models should be developed. Abstract: Detecting surface landmines and unexploded ordnances (UXOs) using deep learning has shown promise in humanitarian demining. However, deterministic neural networks can be vulnerable to noisy conditions and adversarial attacks, leading to missed detection or misclassification. This study introduces the idea of uncertainty quantification through Monte Carlo (MC) Dropout, integrated into a fine-tuned ResNet-50 architecture for surface landmine and UXO classification, which was tested on a simulated dataset. Integrating the MC Dropout approach helps quantify epistemic uncertainty, providing an additional metric for prediction reliability, which could be helpful to make more informed decisions in demining operations. Experimental results on clean, adversarially perturbed, and noisy test images demonstrate the model's ability to flag unreliable predictions under challenging conditions. This proof-of-concept study highlights the need for uncertainty quantification in demining, raises awareness about the vulnerability of existing neural networks in demining to adversarial threats, and emphasizes the importance of developing more robust and reliable models for practical applications.
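
The idea of using MC Dropout to flag unreliable predictions can be illustrated with a minimal sketch. It assumes the softmax outputs of several stochastic forward passes (dropout left active at test time) are already available as plain lists. The function name `mc_dropout_summary` and the choice of mutual information as the epistemic score are assumptions for illustration; the abstract does not say which uncertainty measure the authors use.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mc_dropout_summary(prob_samples):
    """Summarise T stochastic forward passes for a single image.

    prob_samples: list of T softmax vectors, one per pass with dropout active.
    Returns (predicted_class, predictive_entropy, mutual_information).
    """
    num_passes = len(prob_samples)
    num_classes = len(prob_samples[0])
    # Predictive mean: average each class probability over the passes.
    mean_probs = [
        sum(sample[c] for sample in prob_samples) / num_passes
        for c in range(num_classes)
    ]
    predictive_entropy = entropy(mean_probs)  # total uncertainty
    expected_entropy = sum(entropy(s) for s in prob_samples) / num_passes
    # Disagreement between passes; assumed here as the epistemic score.
    mutual_information = predictive_entropy - expected_entropy
    predicted_class = max(range(num_classes), key=lambda c: mean_probs[c])
    return predicted_class, predictive_entropy, mutual_information
```

When all passes agree, the mutual information is close to zero; when they disagree, it grows, and a threshold on it (a hypothetical design choice) could route the image to a human operator.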

[110] multimodars: A Rust-powered toolkit for multi-modality cardiac image fusion and registration

Anselm W. Stark,Marc Ilic,Ali Mokhtari,Pooya Mohammadi Kazaj,Christoph Graeni,Isaac Shiri

Main category: cs.CV

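TL;DR: multimodars is an open-source toolkit for fusing multiple cardiovascular imaging modalities (such as intravascular imaging and coronary CTA) to build reliable 3D coronary artery models.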

Details

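Motivation: Single imaging modalities have limitations: intravascular imaging offers high resolution but lacks whole-vessel anatomical context, while CCTA provides 3D structure but has lower resolution and is prone to artefacts. A flexible, reproducible, high-performance tool is therefore needed that integrates multimodal data and supports multi-state analysis (e.g., pre-/post-stenting, rest/stress). Method: multimodars uses deterministic alignment algorithms, a compact NumPy-centred data model, and an optimised Rust backend for performance; it accepts CSV/NumPy inputs and is compatible with the output of software such as AIVUS-CAA, which eases pipeline integration. Result: The toolkit achieves high-precision multimodal image fusion with good scalability and reproducibility, and is suitable for 3D coronary modelling in a range of clinical scenarios. Conclusion: multimodars fills a gap in existing tools for multimodal, multi-state cardiovascular image fusion and offers an open, efficient, and easily integrated solution for research and clinical use. Abstract: Combining complementary imaging modalities is critical to build reliable 3D coronary models: intravascular imaging gives sub-millimetre resolution but limited whole-vessel context, while CCTA supplies 3D geometry but suffers from limited spatial resolution and artefacts (e.g., blooming). Prior work demonstrated intravascular/CCTA fusion, yet no open, flexible toolkit is tailored for multi-state analysis (rest/stress, pre-/post-stenting) while offering deterministic behaviour, high performance, and easy pipeline integration. multimodars addresses this gap with deterministic alignment algorithms, a compact NumPy-centred data model, and an optimised Rust backend suitable for scalable, reproducible experiments. The package accepts CSV/NumPy inputs including data formats produced by the AIVUS-CAA software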

[111] Does Physics Knowledge Emerge in Frontier Models?

Ieva Bagdonaviciute,Vibhav Vineet

Main category: cs.CV

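TL;DR: The study evaluates six frontier vision-language models (VLMs) on understanding and predicting physical dynamics and finds that diagnostic perception and physics-reasoning skills correlate only weakly with predictive and counterfactual accuracy, exposing the limits of current VLMs in causal understanding.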

Details

Motivation: To examine whether leading vision-language models can understand and predict physical dynamics, particularly with respect to causal reasoning. Method: Six frontier VLMs are benchmarked on three physical simulation datasets (CLEVRER, Physion, and Physion++), and diagnostic subtests are designed to separate perception from physics reasoning. Result: Models that are strong at perception or physics reasoning do not perform consistently better on predictive or counterfactual evaluation; the correlation between the two is weak. Conclusion: Perception and physics reasoning in current VLMs remain fragmented and do not combine into causal understanding, which calls for architectures that integrate perception and reasoning more tightly. Abstract: Leading Vision-Language Models (VLMs) show strong results in visual perception and general reasoning, but their ability to understand and predict physical dynamics remains unclear. We benchmark six frontier VLMs on three physical simulation datasets (CLEVRER, Physion, and Physion++), where the evaluation tasks test whether a model can predict outcomes or hypothesize about alternative situations. To probe deeper, we design diagnostic subtests that isolate perception (objects, colors, occluders) from physics reasoning (motion prediction, spatial relations). Intuitively, stronger diagnostic performance should support higher evaluation accuracy. Yet our analysis reveals weak correlations: models that excel at perception or physics reasoning do not consistently perform better on predictive or counterfactual evaluation. This counterintuitive gap exposes a central limitation of current VLMs: perceptual and physics skills remain fragmented and fail to combine into causal understanding, underscoring the need for architectures that bind perception and reasoning more tightly.

[112] Enhanced Self-Distillation Framework for Efficient Spiking Neural Network Training

Xiaochen Zhao,Chengting Yu,Kairong Yu,Lei Liu,Aili Wang

Main category: cs.CV

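TL;DR: Proposes an enhanced self-distillation framework, combined with rate-based backpropagation, for efficient training of spiking neural networks (SNNs); separating reliable from unreliable knowledge improves convergence, and the method achieves high performance on multiple datasets while reducing training complexity.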

Details

Motivation: Conventional SNN training based on surrogate gradients and BPTT lags behind artificial neural networks (ANNs) in performance and incurs large computational and memory overheads, which makes efficient training under limited resources difficult. Method: An enhanced self-distillation framework projects the firing rates of intermediate SNN layers onto lightweight ANN branches, uses high-quality knowledge generated by the model itself to optimize substructures through the ANN pathways, and decouples the teacher signal into reliable and unreliable components so that only reliable knowledge guides training. Result: Experiments on CIFAR-10, CIFAR-100, CIFAR10-DVS, and ImageNet validate the method, which achieves high-performance SNN training while significantly reducing training complexity. Conclusion: Through a reliable self-distillation mechanism, the proposed method improves the training efficiency and performance of SNNs and suits high-accuracy SNN training in resource-constrained settings. Abstract: Spiking Neural Networks (SNNs) exhibit exceptional energy efficiency on neuromorphic hardware due to their sparse activation patterns. However, conventional training methods based on surrogate gradients and Backpropagation Through Time (BPTT) not only lag behind Artificial Neural Networks (ANNs) in performance, but also incur significant computational and memory overheads that grow linearly with the temporal dimension. To enable high-performance SNN training under limited computational resources, we propose an enhanced self-distillation framework, jointly optimized with rate-based backpropagation. Specifically, the firing rates of intermediate SNN layers are projected onto lightweight ANN branches, and high-quality knowledge generated by the model itself is used to optimize substructures through the ANN pathways. Unlike traditional self-distillation paradigms, we observe that low-quality self-generated knowledge may hinder convergence. To address this, we decouple the teacher signal into reliable and unreliable components, ensuring that only reliable knowledge is used to guide the optimization of the model. Extensive experiments on CIFAR-10, CIFAR-100, CIFAR10-DVS, and ImageNet demonstrate that our method reduces training complexity while achieving high-performance SNN training. Our code is available at https://github.com/Intelli-Chip-Lab/enhanced-self-distillation-framework-for-snn.

[113] Ensemble Deep Learning and LLM-Assisted Reporting for Automated Skin Lesion Diagnosis

Sher Khan,Raz Muhammad,Adil Hussain,Muhammad Sajjad,Muhammad Rashid

Main category: cs.CV

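TL;DR: Proposes a unified AI framework for dermatological diagnosis that combines a heterogeneous neural network ensemble with an embedded large language model to improve diagnostic accuracy and generate structured clinical reports, supporting patient education and continuity of care.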

Details

Motivation: Existing dermatological AI systems suffer from homogeneous architectures, dataset bias, and a disconnect between natural language processing and diagnosis, which undermines diagnostic reliability and makes integration into clinical workflows difficult. Method: A heterogeneous ensemble of diverse convolutional neural networks is designed with an intrinsic uncertainty mechanism, and a large language model is embedded directly in the diagnostic workflow to generate structured clinical reports and patient education content automatically. Result: The framework provides more reliable diagnoses, recognizes lesions across skin tones, and generates structured reports covering lesion characteristics, diagnostic reasoning, and monitoring recommendations, improving doctor-patient communication. Conclusion: By integrating diverse models with language generation, the unified framework bridges a key gap in the clinical translation of dermatological AI and advances deployable, high-precision systems that support the full course of care. Abstract: Cutaneous malignancies demand early detection for favorable outcomes, yet current diagnostics suffer from inter-observer variability and access disparities. While AI shows promise, existing dermatological systems are limited by homogeneous architectures, dataset biases across skin tones, and fragmented approaches that treat natural language processing as separate post-hoc explanations rather than integral to clinical decision-making. We introduce a unified framework that fundamentally reimagines AI integration for dermatological diagnostics through two synergistic innovations. First, a purposefully heterogeneous ensemble of architecturally diverse convolutional neural networks provides complementary diagnostic perspectives, with an intrinsic uncertainty mechanism flagging discordant cases for specialist review, mimicking clinical best practices. Second, we embed large language model capabilities directly into the diagnostic workflow, transforming classification outputs into clinically meaningful assessments that simultaneously fulfill medical documentation requirements and deliver patient-centered education. This seamless integration generates structured reports featuring precise lesion characterization, accessible diagnostic reasoning, and actionable monitoring guidance, empowering patients to recognize early warning signs between visits. By addressing both diagnostic reliability and communication barriers within a single cohesive system, our approach bridges the critical translational gap that has prevented previous AI implementations from achieving clinical impact. The framework represents a significant advancement toward deployable dermatological AI that enhances diagnostic precision while actively supporting the continuum of care from initial detection through patient education, ultimately improving early intervention rates for skin lesions.

[114] Vision Transformer for Transient Noise Classification

Divyansh Srivastava,Andrzej Niedzielski

Main category: cs.CV

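TL;DR: This paper uses a pre-trained Vision Transformer (ViT-B/32) to classify glitches in LIGO data into 22 existing classes plus 2 new classes from O3a; trained on the combined Gravity Spy and O3a dataset, it achieves 92.26% classification accuracy, showing the potential of ViT to improve noise identification in gravitational-wave detection.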

Details

Motivation: Transient noise (glitches) in LIGO data interferes with the detection of gravitational-wave signals. The Gravity Spy project has already classified these noise events, but the O3 run introduced new noise types, so the classification model needs updating to keep detection accurate. Method: A pre-trained Vision Transformer (ViT-B/32) is trained on a combined dataset of the original Gravity Spy data and the two new O3a noise classes to classify 24 noise classes. Result: The model reaches a classification efficiency of 92.26% on the combined dataset, showing that ViT is effective at distinguishing transient noise in LIGO data. Conclusion: Vision Transformer can improve the classification accuracy of noise events in LIGO data, which helps identify genuine gravitational-wave signals more accurately and supports future gravitational-wave astronomy. Abstract: Transient noise (glitches) in LIGO data hinders the detection of gravitational waves (GW). The Gravity Spy project has categorized these noise events into various classes. The O3 run introduced two additional noise classes, and thus a need to train new models for effective classification. We aim to classify glitches in LIGO data into 22 existing classes from the first run plus 2 additional noise classes from O3a using the Vision Transformer (ViT) model. We train a pre-trained Vision Transformer (ViT-B/32) model on a combined dataset consisting of the Gravity Spy dataset with the additional two classes from the LIGO O3a run. We achieve a classification efficiency of 92.26%, demonstrating the potential of Vision Transformer to improve the accuracy of gravitational wave detection by effectively distinguishing transient noise. Key words: gravitational waves, vision transformer, machine learning

[115] General and Efficient Visual Goal-Conditioned Reinforcement Learning using Object-Agnostic Masks

Fahim Shahriar,Cheryl Wang,Alireza Azimi,Gautham Vasan,Hany Hamed Elanwar,A. Rupam Mahmood,Colin Bellinger

Main category: cs.CV

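TL;DR: Proposes a mask-based goal representation for goal-conditioned reinforcement learning that enables efficient learning and good generalization to unseen objects.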

Details

Motivation: Existing goal representations are limited in generalization, convergence speed, and hardware requirements, so a more flexible and efficient approach is needed. Method: Mask-based visual cues serve as the goal representation, and the masks are used to generate dense rewards without precise position information or special hardware. Result: The method reaches 99.9% reaching accuracy in simulation, and learning from scratch and sim-to-real transfer are demonstrated, including pick-up tasks on real robots. Conclusion: Mask-based goal representation improves the generalization and practicality of GCRL and is applicable to different robot platforms and real-world scenarios. Abstract: Goal-conditioned reinforcement learning (GCRL) allows agents to learn diverse objectives using a unified policy. The success of GCRL, however, is contingent on the choice of goal representation. In this work, we propose a mask-based goal representation system that provides object-agnostic visual cues to the agent, enabling efficient learning and superior generalization. In contrast, existing goal representation methods, such as target state images, 3D coordinates, and one-hot vectors, face issues of poor generalization to unseen objects, slow convergence, and the need for special cameras. Masks can be processed to generate dense rewards without requiring error-prone distance calculations. Learning with ground truth masks in simulation, we achieved 99.9% reaching accuracy on training and unseen test objects. Our proposed method can be utilized to perform pick-up tasks with high accuracy, without using any positional information of the target. Moreover, we demonstrate learning from scratch and sim-to-real transfer applications using two different physical robots, utilizing pretrained open vocabulary object detection models for mask generation.
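
As a rough illustration of how a binary target mask could yield a dense reward without any distance computation, the sketch below scores a camera frame by the fraction of pixels the target mask covers. This reward definition and the name `mask_coverage_reward` are assumptions made for illustration; the abstract does not specify how the authors process masks into rewards.

```python
def mask_coverage_reward(mask):
    """Fraction of image pixels covered by the target mask.

    mask: 2D list of 0/1 values marking the target object in the camera image.
    Hypothetical reward: for a camera mounted on the gripper, coverage tends to
    grow as the gripper approaches the target, so no 3D position is needed.
    """
    total_pixels = sum(len(row) for row in mask)
    if total_pixels == 0:
        return 0.0
    covered_pixels = sum(sum(row) for row in mask)
    return covered_pixels / total_pixels
```

Because the reward depends only on the mask, the same function applies to any object that a detector can segment, which is the object-agnostic property the abstract emphasises.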

[116] Improving the Spatial Resolution of GONG Solar Images to GST Quality Using Deep Learning

Chenyang Li,Qin Li,Haimin Wang,Bo Shen

Main category: cs.CV

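TL;DR: Proposes a GAN-based super-resolution method that enhances low-resolution GONG Hα solar images to high resolution and recovers fine detail in sunspots and fibrils, with an MSE of 467.15, RMSE of 21.59, and CC of 0.7794; performance is limited by image registration misalignment.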

Details

Motivation: Full-disk Hα solar images have limited spatial resolution and cannot resolve fine structures such as filaments and fibrils, so low-resolution images need to be enhanced to match high-resolution observations. Method: A Real-ESRGAN-based approach with Residual-in-Residual Dense Blocks and a relativistic discriminator is trained on carefully aligned GONG and BBSO/GST image pairs. Result: The model recovers fine structure in sunspot penumbrae, filaments, and fibrils, with an MSE of 467.15, RMSE of 21.59, and CC of 0.7794, although registration misalignment limits quantitative performance. Conclusion: The proposed GAN method effectively raises the resolution of solar images toward the quality of high-resolution observations; future work will improve alignment and expand the dataset to further improve reconstruction. Abstract: High-resolution (HR) solar imaging is crucial for capturing fine-scale dynamic features such as filaments and fibrils. However, the spatial resolution of the full-disk H$\alpha$ images is limited and insufficient to resolve these small-scale structures. To address this, we propose a GAN-based super-resolution approach to enhance low-resolution (LR) full-disk H$\alpha$ images from the Global Oscillation Network Group (GONG) to a quality comparable with HR observations from the Big Bear Solar Observatory/Goode Solar Telescope (BBSO/GST). We employ Real-ESRGAN with Residual-in-Residual Dense Blocks and a relativistic discriminator. We carefully aligned GONG-GST pairs. The model effectively recovers fine details within sunspot penumbrae and resolves fine details in filaments and fibrils, achieving an average mean squared error (MSE) of 467.15, root mean squared error (RMSE) of 21.59, and cross-correlation (CC) of 0.7794. Slight misalignments between image pairs limit quantitative performance, which we plan to address in future work alongside dataset expansion to further improve reconstruction quality.
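
The three reported metrics can be sketched as follows for a single image pair given as flattened pixel lists. Treating "cross-correlation" as the Pearson correlation coefficient is an assumption; the abstract does not define it, and the function names are illustrative.

```python
import math


def mse(reference, estimate):
    """Mean squared error between two equally sized pixel lists."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)


def rmse(reference, estimate):
    """Root mean squared error for one image pair."""
    return math.sqrt(mse(reference, estimate))


def pearson_cc(reference, estimate):
    """Pearson correlation coefficient (assumed meaning of 'CC')."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_e = sum(estimate) / n
    covariance = sum((r - mean_r) * (e - mean_e) for r, e in zip(reference, estimate))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_e = sum((e - mean_e) ** 2 for e in estimate)
    return covariance / math.sqrt(var_r * var_e)
```

Note that the square root of the reported MSE, sqrt(467.15) ≈ 21.61, is slightly above the reported RMSE of 21.59. That would be consistent with RMSE being averaged per image rather than derived from the averaged MSE, since a mean of square roots never exceeds the square root of the mean; this reading is an inference, not something the abstract states.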

[117] ChainMPQ: Interleaved Text-Image Reasoning Chains for Mitigating Relation Hallucinations

Yike Wu,Yiwei Wang,Yujun Cai

Main category: cs.CV

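TL;DR: Proposes ChainMPQ, a training-free method that reduces relation hallucinations in large vision-language models through an interleaved image-text chain guided by multi-perspective questions.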

Details

Motivation: Relation hallucinations account for the largest share of hallucinations in large vision-language models yet receive the least attention, which harms model reliability. Method: Subject and object keywords are extracted from the question to enhance the corresponding image regions; multi-perspective questions are built around the three components of a relationship (subject, object, and relation); textual and visual memories from earlier steps provide context for later steps, forming an interleaved image-text reasoning chain. Result: Experiments on multiple large vision-language models and benchmarks show that ChainMPQ substantially reduces relation hallucinations, and ablation studies confirm the effectiveness of its three core modules. Conclusion: ChainMPQ improves relational reasoning in large vision-language models and reduces relation hallucinations without additional training. Abstract: While Large Vision-Language Models (LVLMs) achieve strong performance in multimodal tasks, hallucinations continue to hinder their reliability. Among the three categories of hallucinations, which include object, attribute, and relation, relation hallucinations account for the largest proportion but have received the least attention. To address this issue, we propose ChainMPQ (Multi-Perspective Questions guided Interleaved Chain of Image and Text), a training-free method that improves relational inference in LVLMs by utilizing accumulated textual and visual memories. ChainMPQ first extracts subject and object keywords from the question to enhance the corresponding image regions. It then constructs multi-perspective questions that focus on the three core components of a relationship: the subject, the object, and the relation that links them. These questions are sequentially input to the model, with textual and visual memories from earlier steps providing supporting context for subsequent ones, thereby forming an interleaved chain of images and text that guides progressive relational reasoning. Experiments on multiple LVLMs and benchmarks show that ChainMPQ substantially reduces relation hallucinations, while ablation studies further validate the effectiveness of its three core modules.

[118] Efficient High-Resolution Image Editing with Hallucination-Aware Loss and Adaptive Tiling

Young D. Kwon,Abhinav Mehrotra,Malcolm Chadwick,Alberto Gil Ramos,Sourav Bhattacharya

Main category: cs.CV

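TL;DR: MobilePicasso is an efficient high-resolution image editing system designed for resource-constrained devices; its three-stage approach delivers high-quality edits at low computational cost and memory usage.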

Details

Motivation: Existing diffusion models face memory and image-quality challenges when performing high-resolution image editing on resource-constrained devices. Method: MobilePicasso has three stages: (1) image editing at a standard resolution with a hallucination-aware loss; (2) latent projection to avoid going to the pixel space; (3) upscaling the edited image latent to a higher resolution with adaptive context-preserving tiling. Result: A user study shows that, compared with existing methods, MobilePicasso improves image quality by 18-48% and reduces hallucinations by 14-51%, with significantly lower latency (up to a 55.8x speed-up) and only a 9% increase in runtime memory. Surprisingly, its on-device runtime is even faster than a high-resolution model running on an A100 GPU server. Conclusion: MobilePicasso achieves efficient, high-quality high-resolution image editing, is suitable for mobile deployment, and shows the potential to outperform server-side models. Abstract: High-resolution (4K) image-to-image synthesis has become increasingly important for mobile applications. Existing diffusion models for image editing face significant challenges, in terms of memory and image quality, when deployed on resource-constrained devices. In this paper, we present MobilePicasso, a novel system that enables efficient image editing at high resolutions, while minimising computational cost and memory usage. MobilePicasso comprises three stages: (i) performing image editing at a standard resolution with hallucination-aware loss, (ii) applying latent projection to avoid going to the pixel space, and (iii) upscaling the edited image latent to a higher resolution with adaptive context-preserving tiling. Our user study with 46 participants reveals that MobilePicasso not only improves image quality by 18-48% but also reduces hallucinations by 14-51% over existing methods. MobilePicasso demonstrates significantly lower latency, e.g., up to 55.8$\times$ speed-up, yet with a small increase in runtime memory, e.g., a mere 9% increase over prior work. Surprisingly, the on-device runtime of MobilePicasso is observed to be faster than a server-based high-resolution image editing model running on an A100 GPU.

[119] RGBD Gaze Tracking Using Transformer for Feature Fusion

Tobias J. Bauer

Main category: cs.CV

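TL;DR: This thesis presents an AI-based gaze tracking system using RGBD images and a Transformer architecture, and builds a new dataset for training. Experiments show that removing the pre-trained GAN module and replacing the Transformer with an MLP markedly reduce the error.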

Details

Motivation: Existing datasets either lack depth information or are unsuitable for gaze angle estimation, and the combination of RGBD images with Transformers has not yet been studied. Method: A Transformer-based module fuses features from RGBD images, a GAN removes depth-map artifacts and extracts head pose features, and several model configurations are trained and evaluated on three datasets. Result: On ShanghaiTechGaze+, the Transformer model achieves a mean Euclidean error of 55.3mm, which drops to 30.1mm without the pre-trained GAN and to 26.9mm when an MLP replaces the Transformer; on ETH-XGaze, the mean angular error is 3.59° with the Transformer and 3.26° without it, compared with 2.04° for the model of Zhang et al. Conclusion: Although the Transformer does not noticeably improve performance, the ablations show that changing the model structure (removing the GAN or switching to an MLP) reduces the error, which confirms the feasibility of the approach and points to directions for optimization. Abstract: Subject of this thesis is the implementation of an AI-based Gaze Tracking system using RGBD images that contain both color (RGB) and depth (D) information. To fuse the features extracted from the images, a module based on the Transformer architecture is used. The combination of RGBD input images and Transformers was chosen because it has not yet been investigated. Furthermore, a new dataset is created for training the AI models as existing datasets either do not contain depth information or only contain labels for Gaze Point Estimation that are not suitable for the task of Gaze Angle Estimation. Various model configurations are trained, validated and evaluated on a total of three different datasets. The trained models are then to be used in a real-time pipeline to estimate the gaze direction and thus the gaze point of a person in front of a computer screen. The AI model architecture used in this thesis is based on an earlier work by Lian et al. It uses a Generative Adversarial Network (GAN) to simultaneously remove depth map artifacts and extract head pose features. Lian et al. achieve a mean Euclidean error of 38.7mm on their own dataset ShanghaiTechGaze+. In this thesis, a model architecture with a Transformer module for feature fusion achieves a mean Euclidean error of 55.3mm on the same dataset, but we show that using no pre-trained GAN module leads to a mean Euclidean error of 30.1mm. Replacing the Transformer module with a Multilayer Perceptron (MLP) improves the error to 26.9mm. These results are coherent with the ones on the other two datasets. On the ETH-XGaze dataset, the model with Transformer module achieves a mean angular error of 3.59° and without Transformer module 3.26°, whereas the fundamentally different model architecture used by the dataset authors Zhang et al. achieves a mean angular error of 2.04°. On the OTH-Gaze-Estimation dataset created for...
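
The two error measures quoted above (millimetres on the screen for gaze points, degrees for gaze directions) can be sketched as below. The function names and the input layout, with 2D screen points and 3D direction vectors, are assumptions for illustration and are not taken from the thesis.

```python
import math


def mean_euclidean_error(predicted_points, true_points):
    """Mean distance between predicted and true 2D gaze points.

    The result has the unit of the inputs, e.g. millimetres on the screen.
    """
    distances = [math.dist(p, t) for p, t in zip(predicted_points, true_points)]
    return sum(distances) / len(distances)


def angular_error_deg(predicted_dir, true_dir):
    """Angle in degrees between two 3D gaze direction vectors."""
    dot = sum(p * t for p, t in zip(predicted_dir, true_dir))
    norm = math.hypot(*predicted_dir) * math.hypot(*true_dir)
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))
```

The two measures are not directly comparable: a point error in millimetres depends on the distance between the person and the screen, whereas an angular error does not.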

[120] Scalable deep fusion of spaceborne lidar and synthetic aperture radar for global forest structural complexity mapping

Tiago de Conto,John Armston,Ralph Dubayah

Main category: cs.CV

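TL;DR: Proposes a deep learning framework that fuses GEDI lidar with multimodal SAR data to produce global, high-resolution, continuous maps of forest structural complexity.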

Details

Motivation: GEDI spaceborne lidar can characterize forest structural complexity, but its sparse sampling prevents continuous high-resolution mapping and limits its wider use in ecosystem monitoring. Method: An adapted EfficientNetV2 architecture fuses GEDI observations with multimodal synthetic aperture radar (SAR) data and is trained on more than 130 million GEDI footprints to map forest structural complexity globally at 25 m resolution. Result: The model reaches a global $R^2$ of 0.82 with fewer than 400,000 parameters, generalizes well across biomes and time periods, and has been used to generate a global multi-temporal dataset of forest structural complexity from 2015 to 2022 with calibrated uncertainty estimates. Conclusion: The framework supports scalable processing without specialized hardware and can be extended through transfer learning to other forest structural variables, providing a strong monitoring tool for biodiversity conservation and ecosystem management under climate change. Abstract: Forest structural complexity metrics integrate multiple canopy attributes into a single value that reflects habitat quality and ecosystem function. Spaceborne lidar from the Global Ecosystem Dynamics Investigation (GEDI) has enabled mapping of structural complexity in temperate and tropical forests, but its sparse sampling limits continuous high-resolution mapping. We present a scalable, deep learning framework fusing GEDI observations with multimodal Synthetic Aperture Radar (SAR) datasets to produce global, high-resolution (25 m) wall-to-wall maps of forest structural complexity. Our adapted EfficientNetV2 architecture, trained on over 130 million GEDI footprints, achieves high performance (global $R^2$ = 0.82) with fewer than 400,000 parameters, making it an accessible tool that enables researchers to process datasets at any scale without requiring specialized computing infrastructure. The model produces accurate predictions with calibrated uncertainty estimates across biomes and time periods, preserving fine-scale spatial patterns. It has been used to generate a global, multi-temporal dataset of forest structural complexity from 2015 to 2022. Through transfer learning, this framework can be extended to predict additional forest structural variables with minimal computational cost. This approach supports continuous, multi-temporal monitoring of global forest structural dynamics and provides tools for biodiversity conservation and ecosystem management efforts in a changing climate.

Yi Xin,Qi Qin,Siqi Luo,Kaiwen Zhu,Juncheng Yan,Yan Tai,Jiayi Lei,Yuewen Cao,Keqi Wang,Yibin Wang,Jinbin Bai,Qian Yu,Dengyang Jiang,Yuandong Pu,Haoxing Chen,Le Zhuo,Junjun He,Gen Luo,Tianbin Li,Ming Hu,Jin Ye,Shenglong Ye,Bo Zhang,Chang Xu,Wenhai Wang,Hongsheng Li,Guangtao Zhai,Tianfan Xue,Bin Fu,Xiaohong Liu,Yu Qiao,Yihao Liu

Main category: cs.CV

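TL;DR: Lumina-DiMOO is an open-source multimodal foundation model that uses fully discrete diffusion modeling for generation and understanding across modalities and performs strongly on a range of tasks.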

Details

Motivation: Existing unified multimodal models are limited in efficiency and generality, so a more efficient modeling paradigm is needed. Method: A fully discrete diffusion model handles multimodal inputs and outputs, which improves sampling efficiency and supports diverse tasks. Result: The model achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multimodal models. Conclusion: Through fully discrete diffusion modeling, Lumina-DiMOO improves both the performance and the efficiency of multimodal tasks and advances research in this area. Abstract: We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by using fully discrete diffusion modeling to handle inputs and outputs across various modalities. This approach allows Lumina-DiMOO to achieve higher sampling efficiency compared to previous autoregressive (AR) or hybrid AR-Diffusion paradigms and to support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), as well as image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advancements in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: https://synbol.github.io/Lumina-DiMOO.

[122] TransFIRA: Transfer Learning for Face Image Recognizability Assessment

Allen Tu,Kartik Narayan,Joshua Gleason,Jennifer Xu,Matthew Meyn,Tom Goldstein,Vishal M. Patel

Main category: cs.CV

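TL;DR: This paper presents TransFIRA, a lightweight, annotation-free framework for assessing face recognizability that defines recognizability through class-center similarity and angular separation in embedding space and significantly improves face and body recognition in unconstrained environments.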

Details

Motivation: In unconstrained environments with extreme variation in pose, blur, illumination, and occlusion, conventional visual quality metrics cannot predict whether a face image is recognizable, and existing FIQA methods rely on heuristics or generative models that are detached from the encoder's decision geometry. Method: The TransFIRA framework defines recognizability directly in embedding space through class-center similarity (CCS) and class-center angular separation (CCAS), and adds a recognizability-informed aggregation strategy, without external labels or backbone-specific training. Result: It achieves state-of-the-art verification accuracy on BRIAR and IJB-C, nearly doubles the correlation with true recognizability, and shows robustness and good performance under cross-dataset shifts and on body recognition. Conclusion: TransFIRA is a unified, geometry-driven, encoder-specific, and extensible recognizability assessment framework that significantly advances FIQA in accuracy, explainability, and scope. Abstract: Face recognition in unconstrained environments such as surveillance, video, and web imagery must contend with extreme variation in pose, blur, illumination, and occlusion, where conventional visual quality metrics fail to predict whether inputs are truly recognizable to the deployed encoder. Existing FIQA methods typically rely on visual heuristics, curated annotations, or computationally intensive generative pipelines, leaving their predictions detached from the encoder's decision geometry. We introduce TransFIRA (Transfer Learning for Face Image Recognizability Assessment), a lightweight and annotation-free framework that grounds recognizability directly in embedding space. TransFIRA delivers three advances: (i) a definition of recognizability via class-center similarity (CCS) and class-center angular separation (CCAS), yielding the first natural, decision-boundary-aligned criterion for filtering and weighting; (ii) a recognizability-informed aggregation strategy that achieves state-of-the-art verification accuracy on BRIAR and IJB-C while nearly doubling correlation with true recognizability, all without external labels, heuristics, or backbone-specific training; and (iii) new extensions beyond faces, including encoder-grounded explainability that reveals how degradations and subject-specific factors affect recognizability, and the first recognizability-aware body recognition assessment. Experiments confirm state-of-the-art results on faces, strong performance on body recognition, and robustness under cross-dataset shifts. Together, these contributions establish TransFIRA as a unified, geometry-driven framework for recognizability assessment (encoder-specific, accurate, interpretable, and extensible across modalities), significantly advancing FIQA in accuracy, explainability, and scope.
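
A minimal sketch of the two embedding-space quantities named above follows. It assumes that CCS is the cosine similarity between an embedding and its own class center, and that CCAS is CCS minus the similarity to the nearest other class center; the abstract names these quantities without giving formulas, so both definitions, and the function names, are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def recognizability(embedding, label, class_centers):
    """Return (CCS, CCAS) for one embedding under the assumed definitions.

    class_centers: dict mapping each identity label to its center vector.
    """
    ccs = cosine(embedding, class_centers[label])
    nearest_other = max(
        cosine(embedding, center)
        for other_label, center in class_centers.items()
        if other_label != label
    )
    ccas = ccs - nearest_other  # positive when the true class is closest
    return ccs, ccas
```

Under these assumptions, a positive CCAS means the encoder would assign the image to the correct identity, which is one way to read the "decision-boundary-aligned" criterion in the abstract.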

[123] Road Surface Condition Detection with Machine Learning using New York State Department of Transportation Camera Images and Weather Forecast Data

Carly Sutter,Kara J. Sulia,Nick P. Bassill,Christopher D. Wirz,Christopher D. Thorncroft,Jay C. Rothenberger,Vanessa Przybylo,Mariana G. Cains,Jacob Radford,David Aaron Evans

Main category: cs.CV

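TL;DR: This study uses convolutional neural networks and random forests, trained on about 22,000 hand-labeled traffic camera images and weather data, to classify road surface conditions in New York State automatically; the model reaches 81.5% accuracy on unseen cameras.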

Details

Motivation: The New York State Department of Transportation relies on staff driving the roads and watching live cameras to assess road conditions, which is time-consuming and labor-intensive, especially during winter weather events, so automated tools are needed to support decisions. Method: Convolutional neural networks and random forests are trained on hand-labeled camera images and weather data to classify road surface conditions, with priority given to generalization across cameras. Result: On completely unseen cameras, the model classifies six road conditions (severe snow, snow, wet, dry, poor visibility, obstructed) with 81.5% accuracy. Conclusion: The machine learning model can support automated road condition monitoring for the NYSDOT and improve the efficiency of winter traffic management and decision-making. Abstract: The New York State Department of Transportation (NYSDOT) has a network of roadside traffic cameras that are used by both the NYSDOT and the public to observe road conditions. The NYSDOT evaluates road conditions by driving on roads and observing live cameras, tasks which are labor-intensive but necessary for making critical operational decisions during winter weather events. However, machine learning models can provide additional support for the NYSDOT by automatically classifying current road conditions across the state. In this study, convolutional neural networks and random forests are trained on camera images and weather data to predict road surface conditions. Models are trained on a hand-labeled dataset of ~22,000 camera images, each classified by human labelers into one of six road surface conditions: severe snow, snow, wet, dry, poor visibility, or obstructed. Model generalizability is prioritized to meet the operational needs of the NYSDOT decision makers, and the weather-related road surface condition model in this study achieves an accuracy of 81.5% on completely unseen cameras.
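
Reporting accuracy on "completely unseen cameras" implies that the data are split by camera rather than by image. The sketch below shows one way such a split could be made; the tuple layout, the function name `split_by_camera`, and the 20% test fraction are assumptions for illustration, not details from the study.

```python
import random


def split_by_camera(samples, test_fraction=0.2, seed=0):
    """Split samples so that no camera appears in both train and test.

    samples: list of (camera_id, image_path, label) tuples (assumed layout).
    Returns (train_samples, test_samples).
    """
    cameras = sorted({sample[0] for sample in samples})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(cameras)
    num_test = max(1, round(len(cameras) * test_fraction))
    test_cameras = set(cameras[:num_test])
    train = [s for s in samples if s[0] not in test_cameras]
    test = [s for s in samples if s[0] in test_cameras]
    return train, test
```

A random split over images would let the model see each camera's fixed background during training, which usually inflates accuracy compared with the camera-level split sketched here.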

[124] TDiff: Thermal Plug-And-Play Prior with Patch-Based Diffusion

Piyush Dashpute,Niki Nezakati,Wolfgang Heidrich,Vishwanath Saragadam

Main category: cs.CV

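TL;DR: Proposes a patch-based diffusion framework (TDiff) for multi-task thermal image restoration, trained on small patches to address low resolution, fixed pattern noise, and related degradations.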

Details

Motivation: Images from low-cost thermal cameras often suffer from low resolution, fixed pattern noise, and localized degradations, and available datasets are limited in size and diversity. Method: A patch-based diffusion model denoises small overlapping thermal patches and blends them with smooth spatial windowing to restore full-resolution images. Result: The method gives good results on denoising, super-resolution, and deblurring for both simulated and real thermal data. Conclusion: TDiff is the first patch-based diffusion framework for multi-task thermal image restoration and can serve as a unified thermal restoration pipeline. Abstract: Thermal images from low-cost cameras often suffer from low resolution, fixed pattern noise, and other localized degradations. Available datasets for thermal imaging are also limited in both size and diversity. To address these challenges, we propose a patch-based diffusion framework (TDiff) that leverages the local nature of these distortions by training on small thermal patches. In this approach, full-resolution images are restored by denoising overlapping patches and blending them using smooth spatial windowing. To our knowledge, this is the first patch-based diffusion framework that models a learned prior for thermal image restoration across multiple tasks. Experiments on denoising, super-resolution, and deblurring demonstrate strong results on both simulated and real thermal data, establishing our method as a unified restoration pipeline.
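
The blending step, in which overlapping restored patches are merged with a smooth window, can be sketched in one dimension for brevity. The Hann-style window and the function name `blend_patches` are assumptions; the abstract says only that smooth spatial windowing is used, and the diffusion denoiser itself is outside this sketch.

```python
import math


def blend_patches(patches, starts, length):
    """Blend overlapping 1D patches into one signal by weighted averaging.

    patches: equally sized lists of restored values (here assumed to come
             from a per-patch denoiser, which is not shown).
    starts:  start index of each patch in the output signal.
    length:  output length; every index must be covered by some patch.
    """
    size = len(patches[0])
    # Hann-style window, offset by half a sample so every weight is positive.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * (i + 0.5) / size) for i in range(size)]
    accumulated = [0.0] * length
    weights = [0.0] * length
    for patch, start in zip(patches, starts):
        for i, value in enumerate(patch):
            accumulated[start + i] += window[i] * value
            weights[start + i] += window[i]
    # Normalising by the summed weights removes seams at patch borders.
    return [a / w for a, w in zip(accumulated, weights)]
```

For images, the same scheme would apply with a 2D window, for example the outer product of two such 1D windows. Weighting patch centers more than patch edges is what suppresses visible tile boundaries.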

[125] SIGMA-GEN: Structure and Identity Guided Multi-subject Assembly for Image Generation

Oindrila Saha,Vojtech Krs,Radomir Mech,Subhransu Maji,Kevin Blackburn-Matzen,Matheus Gadelha

Main category: cs.CV

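TL;DR: SIGMA-GEN is a unified framework for multi-identity image generation that is the first to preserve multiple subject identities in a single pass while supporting user guidance at several levels of precision.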

Details

Motivation: Existing methods struggle to preserve the identities of multiple subjects in a single generation pass while responding flexibly to spatial constraints of varying precision, so a more capable and flexible framework is needed. Method: The SIGMA-GEN framework combines structural and spatial constraints for multi-subject identity-preserving generation, and the synthetic dataset SIGMA-SET27K, with 27k images and more than 100k unique subjects, supports training and evaluation. Result: The method reaches state-of-the-art identity preservation, image quality, and speed, and supports guidance ranging from coarse boxes to pixel-level segmentations. Conclusion: SIGMA-GEN delivers efficient, flexible, high-fidelity multi-subject image generation and offers an effective solution for controllable generation in complex scenes. Abstract: We present SIGMA-GEN, a unified framework for multi-identity preserving image generation. Unlike prior approaches, SIGMA-GEN is the first to enable single-pass multi-subject identity-preserved generation guided by both structural and spatial constraints. A key strength of our method is its ability to support user guidance at various levels of precision, from coarse 2D or 3D boxes to pixel-level segmentations and depth, with a single model. To enable this, we introduce SIGMA-SET27K, a novel synthetic dataset that provides identity, structure, and spatial information for over 100k unique subjects across 27k images. Through extensive evaluation we demonstrate that SIGMA-GEN achieves state-of-the-art performance in identity preservation, image generation quality, and speed. Code and visualizations at https://oindrilasaha.github.io/SIGMA-Gen/

[126] Superpixel Integrated Grids for Fast Image Segmentation

Jack Roberts,Jeova Farias Sales Rocha Neto

Main category: cs.CV

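TL;DR: The paper proposes SIGRID, a new superpixel-based data structure that replaces full-resolution images in segmentation tasks, matching or improving performance while substantially improving computational efficiency.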

Details

Motivation: Superpixels simplify images and speed up processing, but their irregular spatial distribution forces deep learning methods to use specialized architectures and training algorithms, which erodes that advantage. A method is needed that keeps the benefits of superpixels while remaining compatible with standard deep learning models. Method: SIGRID (Superpixel-Integrated Grid) encodes the color and shape of superpixels with classical shape descriptors and arranges them in a regular grid suited to standard convolutional networks. It is evaluated on four benchmark datasets with two popular segmentation architectures. Result: While compressing the data, SIGRID matches and in some cases surpasses pixel-level representations, and it significantly accelerates model training. Conclusion: SIGRID strikes a favorable balance between accuracy and computational efficiency in segmentation and is an effective alternative to full-resolution images. Abstract: Superpixels have long been used in image simplification to enable more efficient data processing and storage. However, despite their computational potential, their irregular spatial distribution has often forced deep learning approaches to rely on specialized training algorithms and architectures, undermining the original motivation for superpixelations. In this work, we introduce a new superpixel-based data structure, SIGRID (Superpixel-Integrated Grid), as an alternative to full-resolution images in segmentation tasks. By leveraging classical shape descriptors, SIGRID encodes both color and shape information of superpixels while substantially reducing input dimensionality. We evaluate SIGRIDs on four benchmark datasets using two popular convolutional segmentation architectures. Our results show that, despite compressing the original data, SIGRIDs not only match but in some cases surpass the performance of pixel-level representations, all while significantly accelerating model training. This demonstrates that SIGRIDs achieve a favorable balance between accuracy and computational efficiency.

[127] Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation

Qingxuan Wu,Zhiyang Dou,Chuan Guo,Yiming Huang,Qiao Feng,Bing Zhou,Jian Wang,Lingjie Liu

Main category: cs.CV

TL;DR: Proposes the Text2Interact framework, which uses InterCompose and InterActor to generate high-quality, text-aligned human-human interaction motion.

Details

Motivation: Existing methods are limited by scarce two-person interaction training data and by insufficiently fine-grained text-to-interaction modeling, so they struggle to capture realistic, text-consistent spatiotemporal coupling. Method: (1) InterCompose generates interaction descriptions with an LLM and combines them with single-person motion priors, synthesizing two-person interaction data through retrieval, generation, and evaluation. (2) InterActor adds word-level text conditioning and an adaptive interaction loss to improve fine-grained spatiotemporal coordination and physical plausibility. Result: Experiments show gains over existing methods in motion diversity, fidelity, and generalization, including out-of-distribution scenarios and user studies. Conclusion: Text2Interact models human interactions more realistically and more faithfully to the text description, advancing text-driven interaction generation. Abstract: Modeling human-human interactions from text remains challenging because it requires not only realistic individual dynamics but also precise, text-consistent spatiotemporal coupling between agents. Currently, progress is hindered by 1) limited two-person training data, inadequate to capture the diverse intricacies of two-person interactions; and 2) insufficiently fine-grained text-to-interaction modeling, where language conditioning collapses rich, structured prompts into a single sentence embedding. To address these limitations, we propose our Text2Interact framework, designed to generate realistic, text-aligned human-human interactions through a scalable high-fidelity interaction data synthesizer and an effective spatiotemporal coordination pipeline. First, we present InterCompose, a scalable synthesis-by-composition pipeline that aligns LLM-generated interaction descriptions with strong single-person motion priors. Given a prompt and a motion for an agent, InterCompose retrieves candidate single-person motions, trains a conditional reaction generator for another agent, and uses a neural motion evaluator to filter weak or misaligned samples, expanding interaction coverage without extra capture. Second, we propose InterActor, a text-to-interaction model with word-level conditioning that preserves token-level cues (initiation, response, contact ordering) and an adaptive interaction loss that emphasizes contextually relevant inter-person joint pairs, improving coupling and physical plausibility for fine-grained interaction modeling.
Extensive experiments show consistent gains in motion diversity, fidelity, and generalization, including out-of-distribution scenarios and user studies. We will release code and models to facilitate reproducibility.

[128] From Captions to Keyframes: Efficient Video Summarization via Caption- and Context-Aware Frame Scoring

Shih-Yao Lin,Sibendu Paul,Caren Chen

Main category: cs.CV

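TL;DR: Proposes the KeyScore and STACFP frameworks for efficient video-language understanding, sharply reducing the number of frames while improving performance.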

Details

Motivation: Select a small number of keyframes from long videos that preserve semantic and contextual information, in order to make video-language understanding more efficient. Method: KeyScore estimates frame importance by combining semantic similarity, temporal diversity, and contextual drop impact. STACFP performs spatio-temporal adaptive clustering to generate candidate frames. Result: Up to 99% fewer frames than full-frame inference, with better results than standard 8-frame encoders on MSRVTT, MSVD, and DiDeMo. Conclusion: Emphasizing multimodal alignment between visual and textual signals enables scalable, efficient, caption-grounded video understanding without explicit video summarization. Abstract: Efficient video-language understanding requires selecting a small set of frames that retain semantic and contextual information from long videos. We propose KeyScore, a multimodal frame scoring framework that jointly leverages captions and visual context to estimate frame-level importance. By combining semantic similarity, temporal diversity, and contextual drop impact, KeyScore identifies the most informative frames for downstream tasks such as retrieval, captioning, and video-language reasoning. To complement KeyScore, we introduce STACFP (Spatio-Temporal Adaptive Clustering for Frame Proposals), which generates compact and diverse frame candidates for long-form videos. Together, these modules achieve up to 99% frame reduction compared to full-frame inference and substantially outperform standard 8-frame encoders on MSRVTT, MSVD, and DiDeMo. Our results demonstrate that emphasizing multimodal alignment between visual and textual signals enables scalable, efficient, and caption-grounded video understanding -- without explicit video summarization.

[129] LogSTOP: Temporal Scores over Prediction Sequences for Matching and Retrieval

Avishree Khare,Hideki Okamoto,Bardh Hoxha,Georgios Fainekos,Rajeev Alur

Main category: cs.CV

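TL;DR: Proposes LogSTOP, a scoring function that computes scores for temporal properties over sequences from local property scores. It significantly outperforms existing methods on tasks such as objects-in-videos and emotions-in-speech.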

Details

Motivation: Lift the local property scores that neural models produce for individual frames or audio segments to temporal property scores over sequences, to support downstream applications such as query matching and ranked retrieval. Method: The paper formalizes the problem of assigning scores to temporal properties and proposes the LogSTOP scoring function, which efficiently computes such scores for properties expressed in Linear Temporal Logic. Result: On query matching, LogSTOP outperforms large models and other baselines by at least 16%. On ranked retrieval, mean average precision and recall improve by at least 19% and 16%, respectively. Conclusion: LogSTOP effectively converts local detection scores into temporal property scores and outperforms existing methods across several vision and speech tasks. Abstract: Neural models such as YOLO and HuBERT can be used to detect local properties such as objects ("car") and emotions ("angry") in individual frames of videos and audio clips respectively. The likelihood of these detections is indicated by scores in [0, 1]. Lifting these scores to temporal properties over sequences can be useful for several downstream applications such as query matching (e.g., "does the speaker eventually sound happy in this audio clip?"), and ranked retrieval (e.g., "retrieve top 5 videos with a 10 second scene where a car is detected until a pedestrian is detected"). In this work, we formalize this problem of assigning Scores for TempOral Properties (STOPs) over sequences, given potentially noisy score predictors for local properties. We then propose a scoring function called LogSTOP that can efficiently compute these scores for temporal properties represented in Linear Temporal Logic. Empirically, LogSTOP, with YOLO and HuBERT, outperforms Large Vision / Audio Language Models and other Temporal Logic-based baselines by at least 16% on query matching with temporal properties over objects-in-videos and emotions-in-speech respectively. Similarly, on ranked retrieval with temporal properties over objects and actions in videos, LogSTOP with Grounding DINO and SlowR50 reports at least a 19% and 16% increase in mean average precision and recall over zero-shot text-to-video retrieval baselines respectively.
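
To make the idea of lifting per-frame scores to temporal properties concrete, here is a small log-space sketch for the two simplest operators, "always" and "eventually". It is not the LogSTOP definition, which this listing does not give: the independence assumption between frames, the clamping constant, and the function names are all illustrative choices.

```python
import math

EPS = 1e-12  # clamp to keep logarithms finite (an illustrative choice)

def log_always(scores):
    """Log-score that a property holds in every frame, treating frames as independent."""
    return sum(math.log(max(s, EPS)) for s in scores)

def log_eventually(scores):
    """Log-score that a property holds in at least one frame: log(1 - prod(1 - s))."""
    log_none = sum(math.log1p(-min(s, 1.0 - EPS)) for s in scores)
    # -expm1(x) computes 1 - exp(x) accurately when x is close to 0.
    return math.log(max(-math.expm1(log_none), EPS))

def matches(log_score, threshold=0.5):
    """Query matching as a threshold on the lifted score (threshold is an assumption)."""
    return log_score >= math.log(threshold)
```

For per-frame scores `[0.9, 0.8]`, `log_always` corresponds to a probability of 0.72 and `log_eventually` to 0.98. Working in log space avoids underflow when the product runs over thousands of frames.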

[130] Limited-Angle Tomography Reconstruction via Projector Guided 3D Diffusion

Zhantao Deng,Mériem Er-Rafik,Anna Sushko,Cécile Hébert,Pascal Fua

Main category: cs.CV

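TL;DR: Proposes TEMDiff, an iterative reconstruction framework based on a 3D diffusion model that addresses the missing-wedge problem in limited-angle electron tomography. It needs no high-quality real ground truth for training and performs well at very narrow tilt ranges.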

Details

Motivation: Limited-angle electron tomography suffers severe reconstruction artifacts from the missing-wedge problem, and existing deep learning methods depend on high-quality 3D ground truth that is hard to obtain. Method: TEMDiff is an iterative reconstruction framework based on a 3D diffusion model. It is trained on FIB-SEM volumes that a simulator maps to TEM tilt series, and it operates directly on 3D volumes, which implicitly keeps slices consistent. Result: TEMDiff outperforms state-of-the-art methods on simulated data and accurately reconstructs real TEM data from a tilt range of only 8 degrees (2-degree increments) without fine-tuning. Conclusion: TEMDiff needs no real TEM 3D ground truth for training, generalizes well, mitigates the missing-wedge problem, and suits electron tomography under realistic, complex conditions. Abstract: Limited-angle electron tomography aims to reconstruct 3D shapes from 2D projections of Transmission Electron Microscopy (TEM) within a restricted range and number of tilting angles, but it suffers from the missing-wedge problem that causes severe reconstruction artifacts. Deep learning approaches have shown promising results in alleviating these artifacts, yet they typically require large high-quality training datasets with known 3D ground truth which are difficult to obtain in electron microscopy. To address these challenges, we propose TEMDiff, a novel 3D diffusion-based iterative reconstruction framework. Our method is trained on readily available volumetric FIB-SEM data using a simulator that maps them to TEM tilt series, enabling the model to learn realistic structural priors without requiring clean TEM ground truth. By operating directly on 3D volumes, TEMDiff implicitly enforces consistency across slices without the need for additional regularization. On simulated electron tomography datasets with limited angular coverage, TEMDiff outperforms state-of-the-art methods in reconstruction quality. We further demonstrate that a trained TEMDiff model generalizes well to real-world TEM tilts obtained under different conditions and can recover accurate structures from tilt ranges as narrow as 8 degrees, with 2-degree increments, without any retraining or fine-tuning.

[131] VUGEN: Visual Understanding priors for GENeration

Xiangyi Chen,Théophane Vallaeys,Maha Elbayad,John Nguyen,Jakob Verbeek

Main category: cs.CV

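TL;DR: The paper proposes VUGEN, a framework that uses the pretrained visual understanding priors of a vision-language model (VLM) for efficient, high-quality image generation. By mapping the VLM's high-dimensional latent space to a tractable low-dimensional distribution and decoding with a VAE-free pixel diffusion decoder, VUGEN improves generation while preserving understanding capability.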

Details

Motivation: Existing methods rely on autoencoders or complex bridging mechanisms, which leads to misaligned understanding and generation representations or to architectural complexity, making it hard to combine high-quality generation with semantic consistency. Method: (1) Reduce the high-dimensional latent space of the VLM's vision encoder to an information-preserving low-dimensional distribution; (2) train the VLM to sample within this space so that generation stays aligned with its understanding capability; (3) map the latents back to images with a VAE-free pixel diffusion decoder. Result: DPG Bench improves from 71.17 to 74.32 and FID on COCO drops from 11.86 to 9.06, while the VLM's original understanding capability is fully preserved. Conclusion: By explicitly exploiting the VLM's visual priors, VUGEN achieves better generation quality and semantic consistency with a simpler architecture, supporting the feasibility of developing understanding and generation together. Abstract: Recent advances in Vision-Language Models (VLMs) have enabled unified understanding across text and images, yet equipping these models with robust image generation capabilities remains challenging. Existing approaches often rely on reconstruction-oriented autoencoders or complex bridging mechanisms, leading to misalignment between understanding and generation representations, or architectural complexity. In this work, we propose VUGEN, a novel framework that explicitly leverages the VLM's pretrained visual understanding priors for efficient and high-quality image generation. Our approach first transforms the high-dimensional latent space of the VLM's native vision encoder into a lower-dimensional, tractable distribution that maximally preserves visual information. The VLM is then trained to sample within this reduced latent space, ensuring alignment with its visual understanding capabilities. Finally, a dedicated pixel decoder maps these generated latents back to the image space. We find a VAE-free pixel diffusion decoder to be on par with or better than commonly used complex latent diffusion decoders that internally rely on VAE latents. Extensive experiments demonstrate that VUGEN achieves superior image generation performance, improving DPG Bench from 71.17 to 74.32 and FID from 11.86 to 9.06 on COCO, while fully preserving the VLM's original understanding capabilities.

[132] Cluster Paths: Navigating Interpretability in Neural Networks

Nicholas M. Kroeger,Vincent Bindschaedler

Main category: cs.CV

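TL;DR: The paper proposes cluster paths, a post-hoc interpretability method for deep neural networks that clusters activations at selected layers and represents each input as a sequence of cluster IDs to expose the model's decision process. Across several tasks and models it proves effective at identifying spurious cues, remaining stable under perturbation, and detecting anomalous samples, and it extends to Vision Transformers through concept paths generated with a large language model.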

Details

Motivation: Modern deep neural networks perform well on vision tasks, but their opaque decision processes can lead to unwarranted trust, hidden biases, and unexpected failures. A scalable, human-readable explanation method is needed. Method: Cluster paths cluster the activations at selected layers and represent each input as its sequence of cluster IDs across those layers. Four metrics are introduced: path complexity, weighted-path purity, decision-alignment faithfulness, and path agreement. The method is further combined with a large language model to produce concept paths, and its use as an OOD detector is explored. Result: In a spurious-cue CIFAR-10 experiment, cluster paths identify color-based shortcuts. On a CelebA hair-color classification task they reach 90% faithfulness and 96% path agreement under noise. They scale to an ImageNet-pretrained Vision Transformer as concept paths and detect OOD samples effectively. Conclusion: Cluster paths reveal visual concepts at multiple depths (such as color, texture, and context) and produce concise, human-readable explanations for models ranging from small CNNs to large Vision Transformers. Abstract: While modern deep neural networks achieve impressive performance in vision tasks, they remain opaque in their decision processes, risking unwarranted trust, undetected biases and unexpected failures. We propose cluster paths, a post-hoc interpretability method that clusters activations at selected layers and represents each input as its sequence of cluster IDs. To assess these cluster paths, we introduce four metrics: path complexity (cognitive load), weighted-path purity (class alignment), decision-alignment faithfulness (predictive fidelity), and path agreement (stability under perturbations). In a spurious-cue CIFAR-10 experiment, cluster paths identify color-based shortcuts and collapse when the cue is removed. On a five-class CelebA hair-color task, they achieve 90% faithfulness and maintain 96% agreement under Gaussian noise without sacrificing accuracy. Scaling to a Vision Transformer pretrained on ImageNet, we extend cluster paths to concept paths derived from prompting a large language model on minimal path divergences. Finally, we show that cluster paths can serve as an effective out-of-distribution (OOD) detector, reliably flagging anomalous samples before the model generates over-confident predictions. Cluster paths uncover visual concepts, such as color palettes, textures, or object contexts, at multiple network depths, demonstrating that cluster paths scale to large vision models while generating concise and human-readable explanations.
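
The core construction, clustering activations per layer and reading off one cluster ID per layer for each input, can be sketched as below. The abstract does not say which clustering algorithm or how many clusters the authors use, so the tiny k-means, the value of `k`, and the function names here are assumptions made for illustration.

```python
import numpy as np

def kmeans_labels(x, k, iters=20, seed=0):
    """Tiny k-means (Lloyd's algorithm); returns one cluster ID per row of x."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k distinct data points (fancy indexing copies).
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(np.float64)
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = x[labels == j].mean(axis=0)
    return labels

def cluster_paths(layer_activations, k=3):
    """Map each input to its tuple of cluster IDs across the selected layers.

    layer_activations: list of arrays, one per layer, each (n_inputs, n_features).
    """
    per_layer = [kmeans_labels(acts, k) for acts in layer_activations]
    n_inputs = len(per_layer[0])
    return [tuple(int(labels[i]) for labels in per_layer) for i in range(n_inputs)]
```

Inputs that share a path were grouped together at every selected layer, so counting inputs per path and per class gives a rough view of how the network routes each class.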

[133] HSNet: Heterogeneous Subgraph Network for Single Image Super-resolution

Qiongyang Hu,Wenyang Liu,Wenbin Zou,Yuejiao Su,Lap-Pui Chau,Yi Wang

Main category: cs.CV

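TL;DR: The paper proposes the Heterogeneous Subgraph Network (HSNet), a framework for image super-resolution that decomposes the global graph into multiple sub-components, exploiting graph modeling while keeping computation feasible.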

Details

Motivation: Existing deep learning methods based on CNNs and attention mechanisms are structurally inflexible, while graph-based methods offer stronger representational capacity at high computational cost. A method is needed that balances representational capacity and computational efficiency. Method: HSNet comprises the Constructive Subgraph Set Block (CSSB), which generates diverse, complementary subgraphs; the Subgraph Aggregation Block (SAB), which adaptively fuses multi-graph features; and a Node Sampling Strategy (NSS), which reduces computational overhead. Result: Experiments show that HSNet balances reconstruction quality with computational efficiency and reaches state-of-the-art performance. Conclusion: By decomposing the global graph into subgraphs and adding efficient aggregation and sampling, HSNet addresses both the structural rigidity of conventional methods and the computational cost of graph models, making it an efficient and capable approach to image super-resolution. Abstract: Existing deep learning approaches for image super-resolution, particularly those based on CNNs and attention mechanisms, often suffer from structural inflexibility. Although graph-based methods offer greater representational adaptability, they are frequently impeded by excessive computational complexity. To overcome these limitations, this paper proposes the Heterogeneous Subgraph Network (HSNet), a novel framework that efficiently leverages graph modeling while maintaining computational feasibility. The core idea of HSNet is to decompose the global graph into manageable sub-components. First, we introduce the Constructive Subgraph Set Block (CSSB), which generates a diverse set of complementary subgraphs. Rather than relying on a single monolithic graph, CSSB captures heterogeneous characteristics of the image by modeling different relational patterns and feature interactions, producing a rich ensemble of both local and global graph structures. Subsequently, the Subgraph Aggregation Block (SAB) integrates the representations embedded across these subgraphs. Through adaptive weighting and fusion of multi-graph features, SAB constructs a comprehensive and discriminative representation that captures intricate interdependencies. Furthermore, a Node Sampling Strategy (NSS) is designed to selectively retain the most salient features, thereby enhancing accuracy while reducing computational overhead. Extensive experiments demonstrate that HSNet achieves state-of-the-art performance, effectively balancing reconstruction quality with computational efficiency. The code will be made publicly available.

[134] Through the Perspective of LiDAR: A Feature-Enriched and Uncertainty-Aware Annotation Pipeline for Terrestrial Point Cloud Segmentation

Fei Zhang,Rob Chancia,Josie Clapp,Amirhossein Hassanzadeh,Dimah Dera,Richard MacKenzie,Jan van Aardt

Main category: cs.CV

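TL;DR: Proposes a semi-automated, uncertainty-aware annotation pipeline for semantic segmentation of TLS point clouds. Spherical projection, feature enrichment, ensemble learning, and targeted annotation reduce labeling effort while keeping accuracy high. The authors build the Mangrove3D dataset and report findings on data efficiency, feature importance, and cross-dataset generalization.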

Details

Motivation: Accurate semantic segmentation of terrestrial laser scanning (TLS) point clouds is limited by costly manual annotation, so a method is needed that reduces annotation effort while maintaining high accuracy. Method: 3D points are projected onto a 2D spherical grid, pixels are enriched with multi-source features, and an ensemble of segmentation networks is trained to produce pseudo-labels and uncertainty maps that guide annotation of ambiguous regions. The 2D outputs are back-projected to 3D, and a three-tier visualization suite supports review. Result: Performance saturates after about 12 annotated scans, geometric features matter most, a nine-channel feature stack captures nearly all discriminative power, and mIoU reaches about 0.76. The feature-enrichment strategy generalizes in tests on ForestSemantic and Semantic3D. Conclusion: The annotation pipeline with its visualization tools, the Mangrove3D dataset, and the empirical findings on data efficiency and feature importance support scalable, high-quality TLS point cloud segmentation for applications such as ecological monitoring. Abstract: Accurate semantic segmentation of terrestrial laser scanning (TLS) point clouds is limited by costly manual annotation. We propose a semi-automated, uncertainty-aware pipeline that integrates spherical projection, feature enrichment, ensemble learning, and targeted annotation to reduce labeling effort, while sustaining high accuracy. Our approach projects 3D points to a 2D spherical grid, enriches pixels with multi-source features, and trains an ensemble of segmentation networks to produce pseudo-labels and uncertainty maps, the latter guiding annotation of ambiguous regions. The 2D outputs are back-projected to 3D, yielding densely annotated point clouds supported by a three-tier visualization suite (2D feature maps, 3D colorized point clouds, and compact virtual spheres) for rapid triage and reviewer guidance. Using this pipeline, we build Mangrove3D, a semantic segmentation TLS dataset for mangrove forests. We further evaluate data efficiency and feature importance to address two key questions: (1) how much annotated data are needed and (2) which features matter most. Results show that performance saturates after ~12 annotated scans, geometric features contribute the most, and compact nine-channel stacks capture nearly all discriminative power, with the mean Intersection over Union (mIoU) plateauing at around 0.76. Finally, we confirm the generalization of our feature-enrichment strategy through cross-dataset tests on ForestSemantic and Semantic3D.
Our contributions include: (i) a robust, uncertainty-aware TLS annotation pipeline with visualization tools; (ii) the Mangrove3D dataset; and (iii) empirical guidance on data efficiency and feature importance, thus enabling scalable, high-quality segmentation of TLS point clouds for ecological monitoring and beyond. The dataset and processing scripts are publicly available at https://fz-rit.github.io/through-the-lidars-eye/.
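
The first stage of the pipeline, projecting 3D points onto a 2D spherical grid, can be illustrated with the standard azimuth/elevation mapping. The grid resolution, the axis conventions, and the function name `spherical_pixel` are assumptions; the authors' released scripts may differ.

```python
import math

def spherical_pixel(x, y, z, width=360, height=180):
    """Map a 3D point (scanner at the origin) to (row, col, range) on a spherical grid.

    Assumes the point is not at the origin. Row 0 is straight up (+z),
    and columns wrap around the full 360 degrees of azimuth.
    """
    r = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(y, x)      # in [-pi, pi]
    elevation = math.asin(z / r)    # in [-pi/2, pi/2]
    col = int((azimuth + math.pi) / (2 * math.pi) * width) % width
    row = min(int((math.pi / 2 - elevation) / math.pi * height), height - 1)
    return row, col, r
```

Storing `r` (and any other per-point attribute) at `(row, col)` yields the 2D feature image. Keeping the pixel index of every point makes back-projecting 2D predictions to 3D a simple lookup.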

[135] Improving Artifact Robustness for CT Deep Learning Models Without Labeled Artifact Images via Domain Adaptation

Justin Cheung,Samuel Savine,Calvin Nguyen,Lin Lu,Alhassan S. Yasin

Main category: cs.CV

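TL;DR: The study evaluates domain adaptation, specifically DANN, as a way to make deep learning models robust to newly appearing ring artifacts in CT images without requiring labels for the artifact images.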

Details

Motivation: Deep learning models degrade on data outside their training distribution, particularly when new CT scanner artifacts appear. Relabeling data from the new distribution is costly, so a label-free solution is needed. Method: Ring artifacts caused by detector gain error are simulated in sinogram space, and domain adversarial neural networks (DANN) are compared with baseline and augmentation-based approaches on the OrganAMNIST abdominal CT dataset. Result: Baseline models trained only on clean images fail to generalize to artifact images, and traditional augmentation does not help. DANN, using only unlabeled artifact data, maintains high classification accuracy comparable to models trained with labeled artifact images, and it unexpectedly also generalizes to uniform noise. Conclusion: Domain adaptation, and DANN in particular, can address distribution shift in medical imaging without costly labeling of new artifacts and shows promise for clinical settings where novel artifacts emerge. Abstract: Deep learning models which perform well on images from their training distribution can degrade substantially when applied to new distributions. If a CT scanner introduces a new artifact not present in the training labels, the model may misclassify the images. Although modern CT scanners include design features which mitigate these artifacts, unanticipated or difficult-to-mitigate artifacts can still appear in practice. The direct solution of labeling images from this new distribution can be costly. As a more accessible alternative, this study evaluates domain adaptation as an approach for training models that maintain classification performance despite new artifacts, even without corresponding labels. We simulate ring artifacts from detector gain error in sinogram space and evaluate domain adversarial neural networks (DANN) against baseline and augmentation-based approaches on the OrganAMNIST abdominal CT dataset. Our results demonstrate that baseline models trained only on clean images fail to generalize to images with ring artifacts, and traditional augmentation with other distortion types provides no improvement on unseen artifact domains. In contrast, the DANN approach successfully maintains high classification accuracy on ring artifact images using only unlabeled artifact data during training, demonstrating the viability of domain adaptation for artifact robustness. The domain-adapted model achieved classification performance on ring artifact test data comparable to models explicitly trained with labeled artifact images, while also showing unexpected generalization to uniform noise.
These findings provide empirical evidence that domain adaptation can effectively address distribution shift in medical imaging without requiring expensive expert labeling of new artifact distributions, suggesting promise for deployment in clinical settings where novel artifacts may emerge.
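
The artifact simulation mentioned in the abstract can be sketched in a few lines. A detector element with a miscalibrated gain scales its reading at every projection angle, which appears as a vertical stripe in the sinogram and as a ring after reconstruction. The Gaussian gain model, the `sigma` value, and the function name are assumptions, not the study's settings.

```python
import numpy as np

def add_gain_error(sinogram, sigma=0.05, seed=0):
    """Apply a fixed random gain to each detector column of a sinogram.

    sinogram: array of shape (n_angles, n_detectors).
    Returns the corrupted sinogram and the per-detector gains.
    """
    rng = np.random.default_rng(seed)
    # One gain per detector element, constant across all projection angles.
    gains = 1.0 + sigma * rng.standard_normal(sinogram.shape[1])
    return sinogram * gains[None, :], gains
```

Reconstructing the corrupted sinogram with any standard method (for example filtered back projection) would then produce the ring-artifact images used as the unlabeled target domain.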

[136] Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

Ziyuan Huang,DanDan Zheng,Cheng Zou,Rui Liu,Xiaolong Wang,Kaixiang Ji,Weilong Chai,Jianxin Sun,Libin Wang,Yongjie Lv,Taozhi Huang,Jiajia Liu,Qingpei Guo,Ming Yang,Jingdong Chen,Jun Zhou

Main category: cs.CV

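TL;DR: Proposes MingTok, a visual tokenizer with a continuous latent space whose three-stage architecture unifies visual generation and understanding, reaching strong performance on both.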

Details

Motivation: Existing visual tokenizers with discrete latent spaces introduce quantization errors that limit semantic expressiveness and make it hard to satisfy the conflicting demands of visual understanding and generation. Method: MingTok uses a three-stage continuous tokenization architecture: low-level encoding, semantic expansion, and visual reconstruction. Built on it, Ming-UniVision casts both visual understanding and generation as next-token prediction in a shared continuous space. Result: State-of-the-art-level performance across vision-language tasks, with support for multi-round, in-context understanding, generation, and editing, which validates the unified continuous representation. Conclusion: Unified visual tokenization in the continuous domain can reconcile the demands of generation and understanding and may advance the integration of vision and language models. Abstract: Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where the quantization errors can limit semantic expressiveness and degrade the capability of vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. Thus, to reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations, and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation and editing. Empirically, we find that using a unified continuous visual representation reconciles the competing requirements on the tokenizers by the understanding and generation tasks, thereby leading to state-of-the-art level performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit the community.

[137] Adaptive Stain Normalization for Cross-Domain Medical Histology

Tianyue Xu,Yanlin Wu,Abhai K. Tripathi,Matthew M. Ippolito,Benjamin D. Haeffele

Main category: cs.CV

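TL;DR: Proposes a trainable color normalization model, derived from the Beer-Lambert law by algorithmic unrolling of nonnegative matrix factorization, for cross-domain object detection and classification in pathology image analysis. It outperforms existing stain normalization methods.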

Details

Motivation: Address the color variability caused by differing staining protocols and imaging conditions, and reduce the performance drop that domain shift causes when deep learning models are deployed across domains. Method: Based on the Beer-Lambert law, a trainable color normalization network is derived by algorithmic unrolling of a nonnegative matrix factorization (NMF) model. It extracts stain-invariant structural information and can be integrated with the backbone network of a downstream task. Result: On public pathology datasets and an in-house collection of malaria blood smears, the method outperforms several state-of-the-art stain normalization methods on cross-domain object detection and classification. Conclusion: The physics-guided trainable normalization model improves cross-domain generalization in pathology image analysis, requires no template image, and so avoids manual template selection and the associated artifacts. Abstract: Deep learning advances have revolutionized automated digital pathology analysis. However, differences in staining protocols and imaging conditions can introduce significant color variability. In deep learning, such color inconsistency often reduces performance when deploying models on data acquired under different conditions from the training data, a challenge known as domain shift. Many existing methods attempt to address this problem via color normalization but suffer from several notable drawbacks such as introducing artifacts or requiring careful choice of a template image for stain mapping. To address these limitations, we propose a trainable color normalization model that can be integrated with any backbone network for downstream tasks such as object detection and classification. Based on the physics of the imaging process per the Beer-Lambert law, our model architecture is derived via algorithmic unrolling of a nonnegative matrix factorization (NMF) model to extract stain-invariant structural information from the original pathology images, which serves as input for further processing. Experimentally, we evaluate the method on publicly available pathology datasets and an internally curated collection of malaria blood smears for cross-domain object detection and classification, where our method outperforms many state-of-the-art stain normalization methods. Our code is available at https://github.com/xutianyue/BeerLaNet.
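
The link between the Beer-Lambert law and NMF rests on a change of variables. Transmitted intensity decays exponentially with stain concentration, so the optical density OD = -log(I / I0) is approximately linear and nonnegative in the stain concentrations, which is the form a nonnegative factorization expects. The sketch below shows only this conversion; the background intensity of 255 and the function names are assumptions, and the unrolled network itself is not reproduced here.

```python
import numpy as np

def rgb_to_optical_density(rgb, background=255.0):
    """Convert RGB intensities to optical density via Beer-Lambert: OD = -log(I / I0)."""
    # Clip to [1, background] so the log is finite and OD is nonnegative.
    rgb = np.clip(np.asarray(rgb, dtype=np.float64), 1.0, background)
    return -np.log(rgb / background)

def optical_density_to_rgb(od, background=255.0):
    """Invert the conversion: I = I0 * exp(-OD)."""
    return background * np.exp(-np.asarray(od, dtype=np.float64))
```

In optical-density space, a pixel matrix can be approximated as a product of nonnegative stain color vectors and nonnegative per-pixel concentrations, and the concentrations carry the structural information that is largely independent of stain color.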

[138] SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation

Ayush Zenith,Arnold Zumbrun,Neel Raut,Jing Lin

Main category: cs.CV

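TL;DR: The paper proposes SDQM, a new quality metric for synthetic datasets that assesses synthetic data for object detection without training a model to convergence. It correlates strongly with YOLOv11 mAP scores and clearly outperforms earlier metrics.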

Details

Motivation: Because large-scale annotated data is scarce, synthetic data has become an important way to improve model performance, but efficient and accurate metrics for synthetic data quality are lacking, especially for resource-constrained object detection tasks. Method: The Synthetic Dataset Quality Metric (SDQM) quantifies data quality by analyzing properties of the synthetic data such as feature distribution, object distribution, and diversity, without relying on training a model to convergence. Result: SDQM correlates strongly with YOLOv11 mAP scores, whereas earlier metrics show only moderate or weak correlations. It also gives actionable insights for improving dataset quality and reduces the need for costly iterative training. Conclusion: SDQM is an efficient, scalable quality metric for synthetic data that offers a new reference point for generating and selecting synthetic datasets and helps model development in resource-constrained settings. Abstract: The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. This paper introduces the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean Average Precision (mAP) scores of YOLOv11, a leading object detection model, while previous metrics only exhibited moderate or weak correlations. Additionally, it provides actionable insights for improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at https://github.com/ayushzenith/SDQM

[139] AIM 2025 Challenge on Real-World RAW Image Denoising

Feiran Li,Jiacheng Li,Marcos V. Conde,Beril Besbinar,Vlad Hosu,Daisuke Iso,Radu Timofte

Main category: cs.CV

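TL;DR: The AIM 2025 challenge aims to advance efficient, practical low-light RAW image denoising based on synthetic data. It uses real noisy images captured with multiple cameras as the evaluation benchmark and scores models with both full-reference and non-reference metrics.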

Details

Motivation: Advance RAW image denoising that generalizes across cameras in real low-light conditions, addressing the limited generalization of existing methods in practice. Method: A new evaluation benchmark of real low-light noisy images captured with five DSLR cameras is established, and participants are encouraged to design new noise synthesis pipelines, network architectures, and training strategies. Result: Denoising performance is assessed with full-reference metrics (PSNR, SSIM, LPIPS) and non-reference metrics (ARNIQA, TOPIQ), with emphasis on generalization across cameras. Conclusion: The challenge promotes robust denoising models trained on synthetic data and is expected to benefit applications from image restoration to night-time autonomous driving. Abstract: We introduce the AIM 2025 Real-World RAW Image Denoising Challenge, aiming to advance efficient and effective denoising techniques grounded in data synthesis. The competition is built upon a newly established evaluation benchmark featuring challenging low-light noisy images captured in the wild using five different DSLR cameras. Participants are tasked with developing novel noise synthesis pipelines, network architectures, and training methodologies to achieve high performance across different camera models. Winners are determined based on a combination of performance metrics, including full-reference measures (PSNR, SSIM, LPIPS), and non-reference ones (ARNIQA, TOPIQ). By pushing the boundaries of camera-agnostic low-light RAW image denoising trained on synthetic data, the competition promotes the development of robust and practical models aligned with the rapid progress in digital photography. We expect the competition outcomes to influence multiple domains, from image restoration to night-time autonomous driving.
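
Of the full-reference metrics listed, PSNR is the simplest to state: PSNR = 10 * log10(MAX^2 / MSE), where MSE is the mean squared error between the reference and the denoised image. The sketch below works on flat sequences of pixel values normalized to [0, 1]. How the challenge combines PSNR with the other metrics is not specified in the abstract and is not modeled here.

```python
import math

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01 and therefore a PSNR of 20 dB, and each halving of the error amplitude adds about 6 dB.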

[140] Self-supervised Physics-guided Model with Implicit Representation Regularization for Fast MRI Reconstruction

Jingran Xu,Yuanyuan Liu,Yanjie Zhu

Main category: cs.CV

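TL;DR: The paper proposes UnrollINR, a zero-shot self-supervised MRI reconstruction framework that achieves high-quality scan-specific reconstruction without external training data and outperforms a supervised method even at an acceleration rate of 10.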

Details

Motivation: Fully sampled MRI data is hard to obtain and conventional scans are slow, so a self-supervised method is needed that reconstructs high-quality images quickly without large amounts of labeled data. Method: A physics-guided unrolled iterative reconstruction architecture uses an Implicit Neural Representation (INR) as a regularization prior, combining deep unrolling with the implicit representation capability of INR for zero-shot self-supervised reconstruction without external training data. Result: Even under heavy undersampling at an acceleration rate of 10, UnrollINR reconstructs better than the supervised learning method, with improved image quality and interpretability. Conclusion: UnrollINR is an efficient self-supervised MRI reconstruction method that needs no external training. By combining a physics model with implicit neural representation, it achieves strong reconstruction at high acceleration and shows clinical potential. Abstract: Magnetic Resonance Imaging (MRI) is a vital clinical diagnostic tool, yet its widespread application is limited by prolonged scan times. Fast MRI reconstruction techniques effectively reduce acquisition duration by reconstructing high-fidelity MR images from undersampled k-space data. In recent years, deep learning-based methods have demonstrated remarkable progress in this field, with self-supervised and unsupervised learning approaches proving particularly valuable in scenarios where fully sampled data are difficult to obtain. This paper proposes a novel zero-shot self-supervised reconstruction framework named UnrollINR, which enables scan-specific MRI reconstruction without relying on external training data. The method adopts a physics-guided unrolled iterative reconstruction architecture and introduces Implicit Neural Representation (INR) as a regularization prior to effectively constrain the solution space. By combining a deep unrolled structure with the powerful implicit representation capability of INR, the model's interpretability and reconstruction performance are enhanced. Experimental results demonstrate that even at a high acceleration rate of 10, UnrollINR achieves superior reconstruction performance compared to the supervised learning method, validating the superiority of the proposed method.
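
In physics-guided unrolled reconstruction, each iteration typically alternates a learned regularization step with a step that ties the estimate back to the measured k-space samples. The sketch below shows one common form of the second step, hard data consistency for single-coil Cartesian sampling. Whether UnrollINR uses this exact update is not stated in the abstract, so the function and its arguments are illustrative assumptions.

```python
import numpy as np

def data_consistency(image, measured_kspace, mask):
    """Overwrite the sampled k-space entries of the current estimate with measurements.

    image: current image estimate, shape (H, W).
    measured_kspace: undersampled k-space, zero where not sampled, shape (H, W).
    mask: boolean array, True where k-space was actually acquired.
    """
    kspace = np.fft.fft2(image, norm="ortho")
    # Keep the regularizer's prediction only at locations that were not measured.
    kspace = np.where(mask, measured_kspace, kspace)
    return np.fft.ifft2(kspace, norm="ortho")
```

At an acceleration rate of 10, roughly one k-space location in ten is measured, so the remaining 90% of the spectrum must come from the prior, which is where a regularizer such as an INR does its work.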

[141] A Bridge from Audio to Video: Phoneme-Viseme Alignment Allows Every Face to Speak Multiple Languages

Zibo Su,Kun Wei,Jiahua Li,Xu Yang,Cheng Deng

Main category: cs.CV

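TL;DR: Proposes the Multilingual Experts (MuEx) framework, which uses a phoneme-viseme alignment mechanism for high-quality multilingual speech-driven talking face generation, and builds the Multilingual Talking Face Benchmark (MTFB), covering 12 languages, for evaluation.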

Details

Motivation: Existing TFS models perform poorly on non-English languages, mainly because training data is dominated by English and the models lack cross-language generalization. Method: The PG-MoE architecture uses phonemes and visemes as universal intermediaries between the audio and video modalities. The PV-Align mechanism strengthens audiovisual synchronization. The MTFB multilingual benchmark dataset is built for training and evaluation. Result: MuEx achieves superior performance on all 12 languages in MTFB and generalizes zero-shot to unseen languages. Conclusion: Through phoneme-viseme alignment and a multilingual expert design, MuEx markedly improves multilingual TFS and mitigates the effects of linguistic differences and dataset bias. Abstract: Speech-driven talking face synthesis (TFS) focuses on generating lifelike facial animations from audio input. Current TFS models perform well in English but unsatisfactorily in non-English languages, producing wrong mouth shapes and rigid facial expressions. This poor performance is caused by the English-dominated training datasets and the lack of cross-language generalization abilities. Thus, we propose Multilingual Experts (MuEx), a novel framework featuring a Phoneme-Guided Mixture-of-Experts (PG-MoE) architecture that employs phonemes and visemes as universal intermediaries to bridge audio and video modalities, achieving lifelike multilingual TFS. To alleviate the influence of linguistic differences and dataset bias, we extract audio and video features as phonemes and visemes respectively, which are the basic units of speech sounds and mouth movements. To address audiovisual synchronization issues, we introduce the Phoneme-Viseme Alignment Mechanism (PV-Align), which establishes robust cross-modal correspondences between phonemes and visemes. In addition, we build a Multilingual Talking Face Benchmark (MTFB) comprising 12 diverse languages with 95.04 hours of high-quality videos for training and evaluating multilingual TFS performance. Extensive experiments demonstrate that MuEx achieves superior performance across all languages in MTFB and exhibits effective zero-shot generalization to unseen languages without additional training.

[142] MSITrack: A Challenging Benchmark for Multispectral Single Object Tracking

Tao Feng,Tingfa Xu,Haolin Qin,Tianhao Li,Shuaihao Han,Xuyang Zou,Zhan Lv,Jianan Li

Main category: cs.CV

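TL;DR: This paper presents MSITrack, the largest and most diverse multispectral single object tracking dataset to date, aimed at the poor performance of RGB trackers in real-world scenes with occlusion, similar-object interference, and complex backgrounds. The dataset contains 300 videos and over 129k frames covering 55 object categories and 300 natural scenes, with more challenging attributes and high-precision annotations. Experiments show that using multispectral data significantly outperforms RGB-only methods.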

Details

Motivation: Existing RGB trackers are limited in complex real-world scenes by occlusion, similar-object interference, and background confusion; at the same time, the lack of large, diverse multispectral tracking datasets holds back the development of related algorithms. Method: Builds MSITrack, a large-scale multispectral single object tracking dataset with 300 videos and over 129k frames covering 55 object categories and 300 natural scenes. Each frame undergoes careful processing, manual labeling, and multi-stage verification to ensure annotation quality. More challenging attributes are introduced, such as color and texture similarity between target and background and interference from similar objects. Result: Experiments with representative trackers show that the multispectral data in MSITrack significantly improves tracking performance over RGB-only baselines, supporting the value of multispectral information for target discriminability. Conclusion: MSITrack is currently the largest and richest multispectral tracking dataset and an important resource for future research in multispectral visual tracking. Abstract: Visual object tracking in real-world scenarios presents numerous challenges including occlusion, interference from similar objects, and complex backgrounds, all of which limit the effectiveness of RGB-based trackers. Multispectral imagery, which captures pixel-level spectral reflectance, enhances target discriminability. However, the availability of multispectral tracking datasets remains limited. To bridge this gap, we introduce MSITrack, the largest and most diverse multispectral single object tracking dataset to date. MSITrack offers the following key features: (i) More Challenging Attributes: interference from similar objects and similarity in color and texture between targets and backgrounds in natural scenarios, along with a wide range of real-world tracking challenges; (ii) Richer and More Natural Scenes: spanning 55 object categories and 300 distinct natural scenes, MSITrack far exceeds the scope of existing benchmarks. Many of these scenes and categories are introduced to the multispectral tracking domain for the first time; (iii) Larger Scale: 300 videos comprising over 129k frames of multispectral imagery. To ensure annotation precision, each frame has undergone meticulous processing, manual labeling and multi-stage verification. Extensive evaluations using representative trackers demonstrate that the multispectral data in MSITrack significantly improves performance over RGB-only baselines, highlighting its potential to drive future advancements in the field. The MSITrack dataset is publicly available at: https://github.com/Fengtao191/MSITrack.

[143] StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering

Zhihao Wen,Wenkang Wei,Yuan Fang,Xingtong Yu,Hui Zhang,Weicheng Zhu,Xin Zhang

Main category: cs.CV

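TL;DR: This paper presents StaR-KVQA, a structured reasoning method for implicit-knowledge visual question answering (IK-KVQA). It supervises symbolic relation paths and path-grounded natural-language explanations to improve the reasoning consistency and interpretability of multimodal large models without external knowledge retrieval.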

Details

Motivation: Existing MLLMs lack explicit reasoning supervision on IK-KVQA, which leads to inconsistent reasoning and poor generalization and makes it hard to guarantee interpretable, accurate answers. Method: Proposes StaR-KVQA, which constructs and selects path-grounded structured reasoning traces (symbolic relation paths plus natural-language explanations) and fine-tunes via structured self-distillation; the whole process needs no external retriever or knowledge base, and inference is a single autoregressive pass. Result: Accuracy improves markedly on multiple benchmarks, for example by up to +11.3% over the strongest baseline on OK-VQA, while interpretability and cross-domain generalization also improve. Conclusion: Through structured reasoning traces, StaR-KVQA improves MLLM performance on implicit-knowledge VQA and yields multimodal reasoning that is more transparent, verifiable, and generalizable. Abstract: Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. We study its implicit-knowledge variant, IK-KVQA, where a multimodal large language model (MLLM) is the sole knowledge source, without external retrieval. Yet, MLLMs lack explicit reasoning supervision, produce inconsistent justifications, and generalize poorly after standard supervised fine-tuning (SFT). We present StaR-KVQA (Structured Reasoning Traces for IK-KVQA), which supervises structured traces - dual symbolic relation paths plus path-grounded natural-language explanations - so that reasoning becomes transparent and verifiable. With one open-source MLLM, StaR-KVQA constructs and selects path-grounded reasoning traces to form a trace-enriched dataset, then fine-tunes via structured self-distillation to align generation with supervision; no external retrievers, verifiers, or curated knowledge bases (KBs) are used, traces are built offline, and inference is a single autoregressive pass. Across benchmarks, StaR-KVQA improves both accuracy and interpretability, achieving up to +11.3% higher answer accuracy on OK-VQA over the strongest baseline while exhibiting robust cross-domain generalization.

[144] Automated Neural Architecture Design for Industrial Defect Detection

Yuxi Liu,Yunfeng Ma,Yi Tang,Min Liu,Shuai Jiang,Yaonan Wang

Main category: cs.CV

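TL;DR: Proposes AutoNAD, an automated neural architecture design framework for industrial surface defect detection. It jointly searches over convolution, Transformer, and MLP structures and combines a cross weight sharing strategy with a searchable multi-level feature aggregation module, addressing intraclass difference and interclass similarity while balancing detection accuracy and inference latency.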

Details

Motivation: Industrial surface defect detection faces large intraclass differences and high interclass similarity. Traditional hand-designed models are inefficient to develop and struggle to handle both problems at once, so an automated, efficient design method that balances accuracy and speed is needed. Method: Proposes the AutoNAD framework, which jointly searches over convolution, Transformer, and MLP components; introduces a cross weight sharing strategy to accelerate supernet training and improve subnet performance; designs a searchable multi-level feature aggregation module (MFAM) to enhance multi-scale feature learning; and incorporates a latency-aware prior to guide the selection of efficient architectures. Result: AutoNAD is validated on three industrial defect datasets and applied within a defect imaging and detection platform, balancing high accuracy and low latency. Conclusion: AutoNAD can automatically design high-performing network structures for complex industrial defect detection tasks, improving detection accuracy while preserving runtime efficiency and reducing the cost of manual design. Abstract: Industrial surface defect detection (SDD) is critical for ensuring product quality and manufacturing reliability. Due to the diverse shapes and sizes of surface defects, SDD faces two main challenges: intraclass difference and interclass similarity. Existing methods primarily utilize manually designed models, which require extensive trial and error and often struggle to address both challenges effectively. To overcome this, we propose AutoNAD, an automated neural architecture design framework for SDD that jointly searches over convolutions, transformers, and multi-layer perceptrons. This hybrid design enables the model to capture both fine-grained local variations and long-range semantic context, addressing the two key challenges while reducing the cost of manual network design. To support efficient training of such a diverse search space, AutoNAD introduces a cross weight sharing strategy, which accelerates supernet convergence and improves subnet performance. Additionally, a searchable multi-level feature aggregation module (MFAM) is integrated to enhance multi-scale feature learning. Beyond detection accuracy, runtime efficiency is essential for industrial deployment. To this end, AutoNAD incorporates a latency-aware prior to guide the selection of efficient architectures. The effectiveness of AutoNAD is validated on three industrial defect datasets and further applied within a defect imaging and detection platform. Code will be available at https://github.com/Yuxi104/AutoNAD.

[145] Heptapod: Language Modeling on Visual Signals

Yongxin Zhu,Jiawei Chen,Yuanzhe Chen,Zhuo Chen,Dongya Jia,Jian Cong,Xiaobin Zhuang,Yuping Wang,Yuxuan Wang

Main category: cs.CV

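TL;DR: Heptapod is an image autoregressive model built on language modeling principles. It proposes next 2D distribution prediction, combining causal attention with a reconstruction-focused visual tokenizer, and achieves an FID of 2.70 on ImageNet, significantly outperforming previous causal autoregressive approaches.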

Details

Motivation: To apply the basic principles of language modeling to image generation, avoiding reliance on classifier-free guidance (CFG) and the trend toward semantic tokenizers, and to explore a more fundamental, unified paradigm for visual generative modeling. Method: Uses causal attention and proposes the next 2D distribution prediction objective: at each timestep, the model predicts the distribution over the entire 2D spatial grid of the image, combining the strengths of autoregressive sequence modeling with the self-supervised learning of masked autoencoding. Result: Achieves an FID of 2.70 on the ImageNet generation benchmark, significantly outperforming previous causal autoregressive models. Conclusion: By unifying generative training with holistic semantic modeling, Heptapod shows the potential of a purely autoregressive framework for image generation and encourages a rethinking of language modeling on visual signals. Abstract: We introduce Heptapod, an image autoregressive model that adheres to the foundational principles of language modeling. Heptapod employs causal attention, eliminates reliance on CFG, and eschews the trend of semantic tokenizers. Our key innovation is next 2D distribution prediction: a causal Transformer with a reconstruction-focused visual tokenizer learns to predict the distribution over the entire 2D spatial grid of images at each timestep. This learning objective unifies the sequential modeling of the autoregressive framework with the holistic self-supervised learning of masked autoencoding, enabling the model to capture comprehensive image semantics via generative training. On the ImageNet generation benchmark, Heptapod achieves an FID of 2.70, significantly outperforming previous causal autoregressive approaches. We hope our work inspires a principled rethinking of language modeling on visual signals and beyond.

[146] DreamOmni2: Multimodal Instruction-based Editing and Generation

Bin Xia,Bohao Peng,Yuechen Zhang,Junjia Huang,Jiyang Liu,Jingyao Li,Haoru Tan,Sitong Wu,Chengyao Wang,Yitong Wang,Xinglong Wu,Bei Yu,Jiaya Jia

Main category: cs.CV

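TL;DR: This paper proposes the new tasks of multimodal instruction-based editing and generation, which combine text and image instructions and support both concrete and abstract concepts, making them more practical. The authors design DreamOmni2, addressing data and framework challenges through feature mixing, a data synthesis pipeline, and new encoding schemes, and introduce joint training to better handle complex instructions. They also establish comprehensive benchmarks and report strong experimental results.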

Details

Motivation: Existing instruction-based editing relies on text-only instructions, which struggle to capture details; subject-driven generation is limited to concrete objects and overlooks abstract concepts. New methods are therefore needed that accept both text and image instructions and cover abstract as well as concrete concepts. Method: Proposes the multimodal instruction-based editing and generation tasks and builds the DreamOmni2 model: a three-step data synthesis pipeline (feature mixing, editing data generation, and extraction-based data expansion), an index encoding and position encoding shift scheme for multi-image input, and joint training with a vision-language model to improve instruction understanding. Result: Delivers a multimodal editing and generation system that supports text and image instructions, performs well on the newly proposed comprehensive benchmarks, and handles complex editing and generation tasks involving abstract concepts. Conclusion: Through its data synthesis strategy and model architecture, DreamOmni2 advances multimodal instruction-based editing and generation, covers both concrete and abstract concepts, and has broad practical potential. Abstract: Recent advancements in instruction-based image editing and subject-driven generation have garnered significant attention, yet both tasks still face limitations in meeting practical user needs. Instruction-based editing relies solely on language instructions, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts, greatly enhancing their practical applications. We introduce DreamOmni2, tackling two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based editing. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme, which helps the model distinguish images and avoid pixel confusion. Additionally, we introduce joint training with the VLM and our generation/editing model to better process complex instructions. In addition, we propose comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 achieves impressive results. Models and code will be released.

[147] Semantic Segmentation Algorithm Based on Light Field and LiDAR Fusion

Jie Luo,Yuxuan Jiang,Xin Jin,Mingyu Liu,Yihui Fan

Main category: cs.CV

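TL;DR: Proposes Mlpfseg, the first multimodal semantic segmentation network that fuses light field data and point cloud data, using feature completion and depth perception modules to improve segmentation in complex scenes.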

Details

Motivation: To address semantic segmentation under complex conditions in autonomous driving (such as occlusion) and to overcome the limited viewpoint diversity and modality discrepancies that hinder light field and LiDAR fusion. Method: Builds the first multimodal dataset that integrates light field and point cloud data, and designs the Mlpfseg network with a feature completion module (differential reconstruction of point-cloud feature maps) and a depth perception module (stronger attention for occlusion awareness). Result: Improves on image-only segmentation by 1.71 mIoU and on point cloud-only segmentation by 2.38 mIoU. Conclusion: The proposed method improves multimodal semantic segmentation and is more robust in complex scenes such as those with occlusion. Abstract: Semantic segmentation serves as a cornerstone of scene understanding in autonomous driving but continues to face significant challenges under complex conditions such as occlusion. Light field and LiDAR modalities provide complementary visual and spatial cues that are beneficial for robust perception; however, their effective integration is hindered by limited viewpoint diversity and inherent modality discrepancies. To address these challenges, the first multimodal semantic segmentation dataset integrating light field data and point cloud data is proposed. Based on this dataset, we propose a multi-modal light field point-cloud fusion segmentation network (Mlpfseg), incorporating feature completion and depth perception to segment both camera images and LiDAR point clouds simultaneously. The feature completion module addresses the density mismatch between point clouds and image pixels by performing differential reconstruction of point-cloud feature maps, enhancing the fusion of these modalities. The depth perception module improves the segmentation of occluded objects by reinforcing attention scores for better occlusion awareness. Our method outperforms image-only segmentation by 1.71 Mean Intersection over Union (mIoU) and point cloud-only segmentation by 2.38 mIoU, demonstrating its effectiveness.
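For readers unfamiliar with the metric quoted in this abstract, mIoU averages the per-class intersection over union between predicted and reference labels. The following is a minimal sketch in plain Python, not the authors' evaluation code; the function name `mean_iou`, the flat label lists, and the choice to skip classes absent from both inputs are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union over classes present in pred or target."""
    inter = [0] * num_classes  # pixels where prediction and reference agree on class c
    union = [0] * num_classes  # pixels where either prediction or reference is class c
    for p, t in zip(pred, target):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            union[p] += 1
            union[t] += 1
    # Assumption: classes that never occur are left out of the average.
    ious = [inter[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious)


# Toy example with three classes and six labeled points.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

Benchmarks usually report this value multiplied by 100, so a gain of 1.71 mIoU most likely means 1.71 percentage points, although the abstract does not say so explicitly.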

[148] SCas4D: Structural Cascaded Optimization for Boosting Persistent 4D Novel View Synthesis

Jipeng Lyu,Jiahua Dong,Yu-Xiong Wang

Main category: cs.CV

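TL;DR: Proposes SCas4D, a cascaded optimization framework for dynamic scene modeling that exploits structural patterns in 3D Gaussian Splatting. By refining deformations from coarse to fine, it greatly reduces the number of training iterations while maintaining high-quality results.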

Details

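Motivation: Capturing deformations accurately in dynamic scene modeling while staying computationally efficient remains challenging. Method: Uses the cascaded optimization framework SCas4D, which exploits the hierarchical deformation patterns of Gaussians in 3D Gaussian Splatting and progressively refines deformations from the part level to the point level. Result: Converges within 100 iterations per time frame and reaches comparable performance with only 1/20 of the training iterations of existing methods; it also performs well on self-supervised articulated object segmentation, novel view synthesis, and dense point tracking. Conclusion: By exploiting hierarchical structure in dynamic scenes, SCas4D greatly improves optimization efficiency while preserving modeling accuracy, and applies to a range of dynamic vision tasks. Abstract: Persistent dynamic scene modeling for tracking and novel-view synthesis remains challenging due to the difficulty of capturing accurate deformations while maintaining computational efficiency. We propose SCas4D, a cascaded optimization framework that leverages structural patterns in 3D Gaussian Splatting for dynamic scenes. The key idea is that real-world deformations often exhibit hierarchical patterns, where groups of Gaussians share similar transformations. By progressively refining deformations from coarse part-level to fine point-level, SCas4D achieves convergence within 100 iterations per time frame and produces results comparable to existing methods with only one-twentieth of the training iterations. The approach also demonstrates effectiveness in self-supervised articulated object segmentation, novel view synthesis, and dense point tracking tasks.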

[149] Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities

Maria Levchenko

Main category: cs.CV

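TL;DR: Proposes an evaluation methodology for large language models (LLMs) applied to historical document OCR, introducing new metrics and revealing an over-historicization problem in existing models.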

Details

Motivation: Traditional OCR metrics cannot capture the temporal biases and period-specific errors found in historical texts, and digital humanities scholars lack an evaluation framework suited to LLM-based OCR. Method: Using 18th-century Russian Civil font texts, proposes new metrics including Historical Character Preservation Rate (HCPR) and Archaic Insertion Rate (AIR), designs protocols for contamination control and stability testing, and evaluates 12 multimodal LLMs. Result: Gemini and Qwen models outperform traditional OCR but exhibit over-historicization, inserting archaic characters from the wrong historical period; post-OCR correction degrades performance. Conclusion: The methodology gives practical guidance for model selection and quality assessment in historical corpus digitization. Abstract: Digital humanities scholars increasingly use Large Language Models for historical document digitization, yet lack appropriate evaluation frameworks for LLM-based OCR. Traditional metrics fail to capture temporal biases and period-specific errors crucial for historical corpus creation. We present an evaluation methodology for LLM-based historical OCR, addressing contamination risks and systematic biases in diplomatic transcription. Using 18th-century Russian Civil font texts, we introduce novel metrics including Historical Character Preservation Rate (HCPR) and Archaic Insertion Rate (AIR), alongside protocols for contamination control and stability testing. We evaluate 12 multimodal LLMs, finding that Gemini and Qwen models outperform traditional OCR while exhibiting over-historicization: inserting archaic characters from incorrect historical periods. Post-OCR correction degrades rather than improves performance. Our methodology provides digital humanities practitioners with guidelines for model selection and quality assessment in historical corpus digitization.

[150] DeRainMamba: A Frequency-Aware State Space Model with Detail Enhancement for Image Deraining

Zhiliang Zhu,Tao Zeng,Tao Yang,Guoliang Luo,Jiyong Zeng

Main category: cs.CV

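TL;DR: This paper proposes DeRainMamba, a new image deraining method that combines frequency-domain awareness with spatial detail enhancement and outperforms existing methods on multiple benchmarks.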

Details

Motivation: Existing Mamba models capture fine-grained details poorly in image deraining and lack frequency-domain awareness, which limits further gains. Method: Proposes DeRainMamba, comprising a Frequency-Aware State-Space Module (FASSM) and Multi-Directional Perception Convolution (MDPConv); it uses the Fourier transform to distinguish rain streaks from high-frequency details and restores local structures with multi-branch convolutions. Result: Experiments on four public datasets show that the method outperforms current state-of-the-art methods in PSNR and SSIM while using fewer parameters and less computation. Conclusion: A state-space framework that combines frequency-domain modeling with spatial detail enhancement improves single image deraining. Abstract: Image deraining is crucial for improving visual quality and supporting reliable downstream vision tasks. Although Mamba-based models provide efficient sequence modeling, their limited ability to capture fine-grained details and lack of frequency-domain awareness restrict further improvements. To address these issues, we propose DeRainMamba, which integrates a Frequency-Aware State-Space Module (FASSM) and Multi-Directional Perception Convolution (MDPConv). FASSM leverages the Fourier transform to distinguish rain streaks from high-frequency image details, balancing rain removal and detail preservation. MDPConv further restores local structures by capturing anisotropic gradient features and efficiently fusing multiple convolution branches. Extensive experiments on four public benchmarks demonstrate that DeRainMamba consistently outperforms state-of-the-art methods in PSNR and SSIM, while requiring fewer parameters and lower computational costs. These results validate the effectiveness of combining frequency-domain modeling and spatial detail enhancement within a state-space framework for single image deraining.

[151] OBS-Diff: Accurate Pruning For Diffusion Models in One-Shot

Junhan Zhu,Hesong Wang,Mingluo Su,Zefang Wang,Huan Wang

Main category: cs.CV

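TL;DR: This paper proposes OBS-Diff, a novel one-shot pruning framework for efficiently compressing large-scale text-to-image diffusion models. It supports multiple pruning granularities and achieves training-free, accurate compression through a timestep-aware Hessian construction and a group-wise sequential pruning strategy.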

Details

Motivation: Because diffusion models denoise iteratively, existing one-shot network pruning methods are hard to apply directly to large-scale text-to-image diffusion models, and these models are computationally expensive; an adaptable, efficient compression method that needs no retraining is therefore required. Method: 1) Adapts the classic Optimal Brain Surgeon (OBS) method to the complex architectures of modern diffusion models, supporting unstructured, N:M semi-structured, and structured (MHA heads and FFN neurons) pruning; 2) from an error-accumulation perspective, proposes a timestep-aware Hessian construction with logarithmically decreasing weights that emphasize earlier timesteps; 3) designs an efficient group-wise sequential pruning strategy to amortize the cost of calibration. Result: Experiments show that OBS-Diff reaches state-of-the-art one-shot pruning performance, accelerating inference with minimal loss of visual quality. Conclusion: OBS-Diff offers an efficient, training-free one-shot pruning scheme for large-scale diffusion models that balances accuracy and compression efficiency and suits complex multi-stage generative models. Abstract: Large-scale text-to-image diffusion models, while powerful, suffer from prohibitive computational cost. Existing one-shot network pruning methods can hardly be directly applied to them due to the iterative denoising nature of diffusion models. To bridge the gap, this paper presents OBS-Diff, a novel one-shot pruning framework that enables accurate and training-free compression of large-scale text-to-image diffusion models. Specifically, (i) OBS-Diff revitalizes the classic Optimal Brain Surgeon (OBS), adapting it to the complex architectures of modern diffusion models and supporting diverse pruning granularity, including unstructured, N:M semi-structured, and structured (MHA heads and FFN neurons) sparsity; (ii) to align the pruning criteria with the iterative dynamics of the diffusion process, by examining the problem from an error-accumulation perspective, we propose a novel timestep-aware Hessian construction that incorporates a logarithmic-decrease weighting scheme, assigning greater importance to earlier timesteps to mitigate potential error accumulation; (iii) furthermore, a computationally efficient group-wise sequential pruning strategy is proposed to amortize the expensive calibration process. Extensive experiments show that OBS-Diff achieves state-of-the-art one-shot pruning for diffusion models, delivering inference acceleration with minimal degradation in visual quality.
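The timestep-aware Hessian construction described in this abstract can be illustrated with a small sketch. In layer-wise OBS-style pruning, the Hessian of a layer's squared reconstruction error is commonly taken to be proportional to X^T X, where X holds calibration activations. The abstract does not give the weighting formula, so the schedule below (weights proportional to log(T - i + 1) for step i, normalized to sum to 1) is purely an assumption, as are all function names.

```python
import math


def log_decreasing_weights(num_steps):
    """Hypothetical schedule: step 0 (earliest) gets the largest weight,
    decaying like log(num_steps - i + 1); normalized to sum to 1."""
    raw = [math.log(num_steps - i + 1) for i in range(num_steps)]
    total = sum(raw)
    return [r / total for r in raw]


def gram(x):
    """Compute X^T X for activations given as a list of rows."""
    d = len(x[0])
    return [[sum(row[i] * row[j] for row in x) for j in range(d)] for i in range(d)]


def timestep_weighted_hessian(acts_per_step):
    """Weighted sum of per-step Gram matrices: H = sum_i w_i * X_i^T X_i."""
    weights = log_decreasing_weights(len(acts_per_step))
    d = len(acts_per_step[0][0])
    h = [[0.0] * d for _ in range(d)]
    for w_i, x in zip(weights, acts_per_step):
        g = gram(x)
        for i in range(d):
            for j in range(d):
                h[i][j] += w_i * g[i][j]
    return h


# Two denoising steps, one calibration row each, two input features.
acts = [[[1.0, 0.0]], [[0.0, 1.0]]]
print(timestep_weighted_hessian(acts))
```

In this toy case the feature that is active at the earlier step receives the larger diagonal entry, so an OBS-style saliency computed from H would treat errors on that feature as more costly.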

[152] Transforming Noise Distributions with Histogram Matching: Towards a Single Denoiser for All

Sheng Fu,Junchao Zhang,Kailun Yang

Main category: cs.CV

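TL;DR: Proposes a noise transformation method based on histogram matching. A mutually reinforcing cycle between transformation and denoising lets a single Gaussian denoiser handle many kinds of out-of-distribution noise.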

Details

Motivation: Supervised Gaussian denoisers generalize poorly to out-of-distribution noise because they cannot adapt to the distributional characteristics of different noise types. Method: Uses histogram matching to transform arbitrary noise toward a target Gaussian distribution and sets up a mutually reinforcing cycle between noise transformation and denoising; for different noise complexities it applies local histogram matching, intrapatch permutation, and frequency-domain histogram matching combined with pixel-shuffle down-sampling. Result: The method substantially improves the denoising performance and generalization of a single Gaussian denoiser on various synthetic noises (such as Poisson, salt-and-pepper, and repeating pattern noise) and complex real-world noise. Conclusion: The proposed noise transformation framework narrows the gap between in-distribution and out-of-distribution noise, making a fixed Gaussian denoiser broadly applicable to non-Gaussian, structured, and signal-dependent noise. Abstract: Supervised Gaussian denoisers exhibit limited generalization when confronted with out-of-distribution noise, due to the diverse distributional characteristics of different noise types. To bridge this gap, we propose a histogram matching approach that transforms arbitrary noise towards a target Gaussian distribution with known intensity. Moreover, a mutually reinforcing cycle is established between noise transformation and subsequent denoising. This cycle progressively refines the noise to be converted, making it approximate the real noise, thereby enhancing the noise transformation effect and further improving the denoising performance. We tackle specific noise complexities: local histogram matching handles signal-dependent noise, intrapatch permutation processes channel-related noise, and frequency-domain histogram matching coupled with pixel-shuffle down-sampling breaks spatial correlation. By applying these transformations, a single Gaussian denoiser gains remarkable capability to handle various out-of-distribution noises, including synthetic noises such as Poisson, salt-and-pepper and repeating pattern noises, as well as complex real-world noises. Extensive experiments demonstrate the superior generalization and effectiveness of our method.
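The core operation in this abstract, matching a noise histogram to a Gaussian with known intensity, can be sketched as a rank-based quantile mapping. This is a generic histogram-specification sketch, not the paper's pipeline: it omits the local, intrapatch, and frequency-domain variants and the transformation-denoising cycle. The function name `match_to_gaussian` and the mid-rank quantile rule are assumptions.

```python
import random
from statistics import NormalDist


def match_to_gaussian(values, sigma):
    """Map samples onto N(0, sigma^2) while preserving their rank order."""
    n = len(values)
    target = NormalDist(mu=0.0, sigma=sigma)
    # Indices of the samples sorted from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, idx in enumerate(order):
        # Mid-rank quantile keeps the argument strictly inside (0, 1).
        out[idx] = target.inv_cdf((rank + 0.5) / n)
    return out


# Uniform (non-Gaussian) noise transformed toward a Gaussian with sigma = 0.1.
random.seed(0)
noise = [random.uniform(-1.0, 1.0) for _ in range(1000)]
matched = match_to_gaussian(noise, sigma=0.1)
print(min(matched), max(matched))
```

Because the mapping is monotone, the largest noise sample stays the largest after matching; only the shape of the distribution changes, which is what would let a denoiser trained on Gaussian noise see inputs closer to its training distribution.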

[153] A deep multiple instance learning approach based on coarse labels for high-resolution land-cover mapping

Gianmarco Perantoni,Lorenzo Bruzzone

Main category: cs.CV

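TL;DR: This paper proposes a Deep Multiple Instance Learning (DMIL) method that trains land-cover classifiers from high-resolution imagery and weak low-resolution labels. Flexible pooling layers link pixel-level semantics to the low-resolution labels, and the MIL problem is re-framed in multi-class and multi-label settings; experiments show the method outperforms standard training strategies.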

Details

Motivation: To address the shortage and low quality of training labels in high-resolution land-cover mapping by training on large numbers of weak labels obtained from existing low-resolution or obsolete products. Method: Uses a deep multiple instance learning framework with flexible pooling layers that connect high-resolution pixel semantics to low-resolution labels, modeled in a multi-class setting (the label is the majority class in the patch) and a multi-label setting (several classes may be present in the patch), and trained with a Positive-Unlabeled Learning (PUL) strategy. Result: Experiments on the 2020 IEEE GRSS Data Fusion Contest dataset show that the method, using weakly supervised low-resolution labels, outperforms standard training strategies. Conclusion: The method can train high-resolution land-cover classification models from weak low-resolution labels without precise pixel-level annotation, which makes it practical and widely applicable. Abstract: The quantity and the quality of the training labels are central problems in high-resolution land-cover mapping with machine-learning-based solutions. In this context, weak labels can be gathered in large quantities by leveraging existing low-resolution or obsolete products. In this paper, we address the problem of training land-cover classifiers using high-resolution imagery (e.g., Sentinel-2) and weak low-resolution reference data (e.g., MODIS-derived land-cover maps). Inspired by recent works in Deep Multiple Instance Learning (DMIL), we propose a method that trains pixel-level multi-class classifiers and predicts low-resolution labels (i.e., patch-level classification), where the actual high-resolution labels are learned implicitly without direct supervision. This is achieved with flexible pooling layers that are able to link the semantics of the pixels in the high-resolution imagery to the low-resolution reference labels. Then, the Multiple Instance Learning (MIL) problem is re-framed in a multi-class and in a multi-label setting. In the former, the low-resolution annotation represents the majority of the pixels in the patch. In the latter, the annotation only provides information on the presence of one of the land-cover classes in the patch; multiple labels can thus be valid for a patch at a time, whereas the low-resolution reference provides only one. Therefore, the classifier is trained with a Positive-Unlabeled Learning (PUL) strategy. Experimental results on the 2020 IEEE GRSS Data Fusion Contest dataset show the effectiveness of the proposed framework compared to standard training strategies.
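The link between pixel-level predictions and a single patch-level label can be made concrete with a pooling sketch. The abstract mentions flexible pooling layers without specifying them, so plain mean pooling (for the multi-class, majority-label setting) and max pooling (for the multi-label, presence setting) are used here as common MIL stand-ins; the function name and the toy probabilities are hypothetical.

```python
def patch_scores(pixel_probs, mode):
    """Pool per-pixel class probabilities into one score per class for a patch.

    mode="mean": multi-class view, the patch label follows the majority class.
    mode="max":  multi-label view, a class counts as present if any pixel shows it.
    """
    num_classes = len(pixel_probs[0])
    columns = [[p[c] for p in pixel_probs] for c in range(num_classes)]
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


# Three pixels, two classes: two pixels favor class 0, one favors class 1.
pixels = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
print(patch_scores(pixels, "mean"))
print(patch_scores(pixels, "max"))
```

A patch-level loss on the pooled scores can then be backpropagated to the per-pixel classifier, which is how pixel labels would be learned without direct supervision in a setup like the one described.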

[154] TTRV: Test-Time Reinforcement Learning for Vision Language Models

Akshit Singh,Shyam Marjit,Wei Lin,Paul Gavrikov,Serena Yeung-Levy,Hilde Kuehne,Rogerio Feris,Sivan Doveh,James Glass,M. Jehanzeb Mirza

Main category: cs.CV

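TL;DR: Proposes TTRV, which uses test-time reinforcement learning to improve vision language models without any labeled data. It clearly outperforms existing models on object recognition and visual question answering, even surpassing GPT-4o.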

Details

Motivation: Reward signals in existing reinforcement learning depend on labeled data, whereas humans learn directly from their environment; the goal is an unsupervised reward mechanism that adapts dynamically at inference time. Method: Building on the GRPO framework, designs rewards from the frequency of the base model's outputs, obtained by running inference on each test sample multiple times, and additionally rewards low entropy of the empirical output distribution to control output diversity, enabling test-time adaptation. Result: Across 16 datasets, average gains are 24.6% (recognition) and 10.0% (VQA), with maximum gains of 52.4% and 29.8%; on image recognition, TTRV with InternVL 8B exceeds GPT-4o by 2.3% on average; with a single unlabeled sample, gains of up to 5.5% are still observed. Conclusion: Test-time reinforcement learning improves VLM performance without labeled data, generalizes well, and can match or exceed the strongest proprietary models. Abstract: Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model's output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model's output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets. Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.
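The two label-free reward signals named in this abstract, output frequency and low entropy of the empirical output distribution, can be sketched for a single test sample as follows. This is an illustration under stated assumptions rather than the TTRV implementation: the abstract does not say how the two terms are combined, so the linear form with coefficient `alpha` and all function names are hypothetical.

```python
import math
from collections import Counter


def frequency_rewards(outputs):
    """Reward each sampled answer by its empirical frequency among the samples."""
    counts = Counter(outputs)
    n = len(outputs)
    return [counts[o] / n for o in outputs]


def entropy(outputs):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(outputs)
    n = len(outputs)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def combined_rewards(outputs, alpha=0.5):
    """Hypothetical combination: frequency reward minus alpha times entropy,
    so that confident (low-entropy) sample groups score higher."""
    h = entropy(outputs)
    return [r - alpha * h for r in frequency_rewards(outputs)]


# Four answers sampled from the model for one unlabeled test image.
samples = ["cat", "cat", "dog", "cat"]
print(frequency_rewards(samples))
print(combined_rewards(samples))
```

Rewards of this kind could then be fed to a group-relative policy update such as GRPO, where each sample's advantage is computed relative to the other samples drawn for the same input.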

[155] Extreme Amodal Face Detection

Changlin Song,Yunzhong Hou,Michael Randall Barnes,Rahul Shome,Dylan Campbell

Main category: cs.CV

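TL;DR: Proposes a heatmap-based, single-image extreme amodal object detector that uses contextual cues to efficiently infer the presence of unseen faces.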

Details

Motivation: Extreme amodal detection matters for safety- and privacy-related applications, especially face detection, but existing methods rely on multi-frame sequences or generative models and are inefficient. Method: Designs a heatmap-based extreme amodal object detector with a selective coarse-to-fine decoder that reasons only from contextual cues in a single image, without sampling. Result: The method achieves strong results on the new task, even outperforming more computationally expensive generative approaches. Conclusion: The proposed sample-free, single-image method offers a more efficient solution for extreme amodal detection with broad application potential. Abstract: Extreme amodal detection is the task of inferring the 2D location of objects that are not fully visible in the input image but are visible within an expanded field-of-view. This differs from amodal detection, where the object is partially visible within the input image, but is occluded. In this paper, we consider the sub-problem of face detection, since this class provides motivating applications involving safety and privacy, but do not tailor our method specifically to this class. Existing approaches rely on image sequences so that missing detections may be interpolated from surrounding frames or make use of generative models to sample possible completions. In contrast, we consider the single-image task and propose a more efficient, sample-free approach that makes use of the contextual cues from the image to infer the presence of unseen faces. We design a heatmap-based extreme amodal object detector that addresses the problem of efficiently predicting a lot (the out-of-frame region) from a little (the image) with a selective coarse-to-fine decoder. Our method establishes strong results for this new task, even outperforming less efficient generative approaches.

[156] VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance

Teng Wang,Haojun Jiang,Yuxuan Wang,Zhenguo Sun,Shiji Song,Gao Huang

Main category: cs.CV

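TL;DR: This paper proposes a parameter-efficient Vision-Action Adapter (VA-Adapter) that applies the knowledge of an ultrasound foundation model to the probe guidance task, helping junior operators acquire high-quality cardiac ultrasound images in real time.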

Details

Motivation: Cardiac ultrasound is difficult to perform and skilled personnel are scarce, so patients often cannot be examined promptly. An intelligent probe guidance system that helps junior sonographers acquire high-quality images is therefore needed. Method: Uses a pre-trained ultrasound foundation model and designs a parameter-efficient VA-Adapter that lets it encode vision-action sequences, learning precise probe adjustment strategies by fine-tuning a small number of parameters. Result: Experiments show that the VA-Adapter outperforms existing strong baselines on the probe guidance task, with good guidance performance and sequential reasoning ability. Conclusion: The VA-Adapter transfers the medical knowledge of a foundation model to probe guidance, achieving strong performance and parameter efficiency in a compact design and helping make ultrasound examinations more accessible. Abstract: Echocardiography is a critical tool for detecting heart diseases. Recently, ultrasound foundation models have demonstrated remarkable capabilities in cardiac ultrasound image analysis. However, obtaining high-quality ultrasound images is a prerequisite for accurate diagnosis. Due to the exceptionally high operational difficulty of cardiac ultrasound, there is a shortage of highly skilled personnel, which hinders patients from receiving timely examination services. In this paper, we aim to adapt the medical knowledge learned by foundation models from vast datasets to the probe guidance task, which is designed to provide real-time operational recommendations for junior sonographers to acquire high-quality ultrasound images. Moreover, inspired by the practice where experts optimize action decisions based on past explorations, we design a parameter-efficient Vision-Action Adapter (VA-Adapter) to enable the foundation model's image encoder to encode vision-action sequences, thereby enhancing guidance performance. With built-in sequential reasoning capabilities in a compact design, the VA-Adapter enables a pre-trained ultrasound foundation model to learn precise probe adjustment strategies by fine-tuning only a small subset of parameters. Extensive experiments demonstrate that the VA-Adapter can surpass strong probe guidance models. Our code will be released after acceptance.

[157] Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking

Mitchell Keren Taraday,Shahaf Wagner,Chaim Baskin

Main category: cs.CV

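TL;DR: This paper proposes EDJE, an efficient discriminative joint encoder that precomputes and compresses vision tokens offline, greatly reducing storage and online compute while preserving strong retrieval performance.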

Details

Motivation: Existing vision-language joint encoders (such as BLIP) are severely bottlenecked at scale by an expensive visual feature-extraction stage, and efficient reranking models for multimodal retrieval are lacking. Method: Proposes EDJE, which precomputes vision tokens offline and compresses them with a lightweight attention-based adapter; online inference runs only a compact joint encoder over a small set of visual tokens plus the text. Result: EDJE processes 50k image-text pairs per second, needs only 49kB of disk storage per image, and matches prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval. Conclusion: EDJE greatly reduces storage and compute while preserving retrieval quality, enabling high-throughput inference and making joint encoders more practical for multimodal retrieval. Abstract: Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision-language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image-text pairs/second while requiring 49kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval. The implementation and checkpoints will be made publicly available shortly.
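The throughput and storage figures in this abstract translate into deployment budgets with simple arithmetic. The sketch below takes the 49kB per image and 50k pairs/second figures from the abstract; the collection size, query count, candidates per query, and the decimal reading of kB (1 kB = 1000 bytes) are assumptions chosen for illustration.

```python
def index_storage_gb(num_images, kb_per_image=49):
    """Disk needed for precomputed vision tokens, in decimal gigabytes."""
    return num_images * kb_per_image / 1e6  # 1 GB = 1e6 kB


def rerank_seconds(num_queries, candidates_per_query, pairs_per_second=50_000):
    """Time to score every (query, candidate) pair at the stated throughput."""
    return num_queries * candidates_per_query / pairs_per_second


# Hypothetical deployment: 1M indexed images, 1000 queries, top-100 reranking.
print(index_storage_gb(1_000_000))   # 49.0 GB of token storage
print(rerank_seconds(1_000, 100))    # 2.0 seconds of reranking compute
```

The throughput figure presumably depends on the hardware and batch size used by the authors, which the abstract does not state, so the timing estimate is indicative only.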

[158] StyleKeeper: Prevent Content Leakage using Negative Visual Query Guidance

Jaeseok Jeong,Junho Kim,Gayoung Lee,Yunjey Choi,Youngjung Uh

Main category: cs.CV

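TL;DR: This paper proposes negative visual query guidance (NVQG), which extends classifier-free guidance and uses swapping self-attention to reduce content leakage in text-to-image generation.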

Details

Motivation: Existing visual prompting methods are prone to content leakage during style transfer: unwanted elements are transferred along with the style, which hurts the quality and accuracy of generated images. Method: Extends classifier-free guidance (CFG) to use swapping self-attention and proposes negative visual query guidance (NVQG), which simulates content leakage by swapping queries in the self-attention layers and thereby suppresses the transfer of unwanted content. Result: The method performs well across a variety of styles and text prompts, reducing content leakage, better preserving the style of the reference image, and keeping generated images consistent with the text description. Conclusion: NVQG is a simple and effective way to improve text-to-image generation with visual prompts, particularly for avoiding content leakage. Abstract: In the domain of text-to-image generation, diffusion models have emerged as powerful tools. Recently, studies on visual prompting, where images are used as prompts, have enabled more precise control over style and content. However, existing methods often suffer from content leakage, where undesired elements of the visual style prompt are transferred along with the intended style. To address this issue, we 1) extend classifier-free guidance (CFG) to utilize swapping self-attention and 2) propose negative visual query guidance (NVQG) to reduce the transfer of unwanted contents. NVQG employs a negative score by intentionally simulating content leakage scenarios that swap queries instead of keys and values of self-attention layers from visual style prompts. This simple yet effective method significantly reduces content leakage. Furthermore, we provide careful solutions for using a real image as a visual style prompt. Through extensive evaluation across various styles and text prompts, our method demonstrates superiority over existing approaches, reflecting the style of the references and ensuring that resulting images match the text prompts. Our code is available at https://github.com/naver-ai/StyleKeeper.

[159] Lattice-allocated Real-time Line Segment Feature Detection and Tracking Using Only an Event-based Camera

Mikihiro Ikura,Arren Glover,Masayoshi Mizuno,Chiara Bartolozzi

Main category: cs.CV

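TL;DR: Proposes a real-time line segment detection and tracking method for high-resolution event-based cameras, using a lattice-allocated pipeline for efficient and accurate geometric feature extraction.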

Details

Motivation: Existing methods rely on additional frame cameras or struggle with high event rates, which limits stand-alone use of event cameras in real-world scenes. Method: A lattice-allocated pipeline comprising a velocity-invariant event representation, line segment detection based on a fitting score, and line segment tracking by perturbing endpoints. Result: Real-time performance and higher accuracy than existing methods are demonstrated on self-recorded and public datasets. Conclusion: The method achieves efficient line segment extraction using only an event camera, advancing stand-alone operation in real-world environments. Abstract: Line segment extraction is effective for capturing geometric features of human-made environments. Event-based cameras, which asynchronously respond to contrast changes along edges, enable efficient extraction by reducing redundant data. However, recent methods often rely on additional frame cameras or struggle with high event rates. This research addresses real-time line segment detection and tracking using only a modern, high-resolution (i.e., high event rate) event-based camera. Our lattice-allocated pipeline consists of (i) velocity-invariant event representation, (ii) line segment detection based on a fitting score, and (iii) line segment tracking by perturbing endpoints. Evaluation using an ad-hoc recorded dataset and public datasets demonstrates real-time performance and higher accuracy compared to state-of-the-art event-only and event-frame hybrid baselines, enabling fully stand-alone event camera operation in real-world settings.

[160] Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization

Kanglei Zhou,Qingyi Pan,Xingxing Zhang,Hubert P. H. Shum,Frederick W. B. Li,Xiaohui Liang,Liyuan Wang

Main category: cs.CV

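TL;DR: This paper introduces Continual AQA (CAQA), which brings continual learning to action quality assessment to cope with shifting quality distributions in real-world scenarios. The proposed MAGR++ mitigates catastrophic forgetting through adaptive manifold-aligned graph regularization and achieves state-of-the-art performance on several newly constructed benchmarks.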

Details

Motivation: Conventional action quality assessment methods cope poorly with non-stationary quality distributions in real-world scenarios and lack continual learning ability, which leads to weak generalization. Method: MAGR++ combines full-parameter fine-tuning, which stabilizes shallow layers while adapting deeper ones, with a two-step feature rectification mechanism: a manifold projector maps historical features into the current space, and a graph regularizer aligns local and global distributions. Result: On four newly built CAQA benchmarks, MAGR++ improves average correlation over the strongest baseline by 3.6% (offline) and 12.2% (online), clearly outperforming existing methods. Conclusion: MAGR++ effectively addresses catastrophic forgetting and feature manifold shift in continual action quality assessment, shows strong robustness and generalization, and advances practical use of AQA in dynamic environments. Abstract: Action Quality Assessment (AQA) quantifies human actions in videos, supporting applications in sports scoring, rehabilitation, and skill evaluation. A major challenge lies in the non-stationary nature of quality distributions in real-world scenarios, which limits the generalization ability of conventional methods. We introduce Continual AQA (CAQA), which equips AQA with Continual Learning (CL) capabilities to handle evolving distributions while mitigating catastrophic forgetting. Although parameter-efficient fine-tuning of pretrained models has shown promise in CL for image classification, we find it insufficient for CAQA. Our empirical and theoretical analyses reveal two insights: (i) Full-Parameter Fine-Tuning (FPFT) is necessary for effective representation learning; yet (ii) uncontrolled FPFT induces overfitting and feature manifold shift, thereby aggravating forgetting. To address this, we propose Adaptive Manifold-Aligned Graph Regularization (MAGR++), which couples backbone fine-tuning that stabilizes shallow layers while adapting deeper ones with a two-step feature rectification pipeline: a manifold projector to translate deviated historical features into the current representation space, and a graph regularizer to align local and global distributions. We construct four CAQA benchmarks from three datasets with tailored evaluation protocols and strong baselines, enabling systematic cross-dataset comparison. Extensive experiments show that MAGR++ achieves state-of-the-art performance, with average correlation gains of 3.6% offline and 12.2% online over the strongest baseline, confirming its robustness and effectiveness.
Our code is available at https://github.com/ZhouKanglei/MAGRPP.

[161] Online Generic Event Boundary Detection

Hyungrok Jung,Daneul Kim,Seunggyun Lim,Jeany Son,Jonghyun Choi

Main category: cs.CV

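TL;DR: This paper introduces a new task, Online Generic Event Boundary Detection (On-GEBD), and a framework called Estimator, inspired by Event Segmentation Theory, which comprises a Consistent Event Anticipator (CEA) and an Online Boundary Discriminator (OBD). It detects generic event boundaries in real-time streaming video, outperforming existing online methods and approaching offline ones.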

Details

Motivation: Existing generic event boundary detection methods must process complete videos and therefore cannot detect boundaries online in real time, unlike human perception; a method that detects event boundaries immediately in streaming video is needed. Method: The Estimator framework, grounded in Event Segmentation Theory, uses the CEA to predict the future frame from past frames, while the OBD measures the prediction error and adaptively adjusts its threshold using statistical tests, enabling real-time detection of subtle, taxonomy-free event changes. Result: On the Kinetics-GEBD and TAPOS datasets, Estimator outperforms all adapted online baselines and performs close to existing offline GEBD methods. Conclusion: Estimator addresses the real-time and sensitivity challenges of On-GEBD, offers a new direction for event understanding in streaming video, and supports prediction-error-based mechanisms as a model of human event segmentation. Abstract: Generic Event Boundary Detection (GEBD) aims to interpret long-form videos through the lens of human perception. However, current GEBD methods require processing complete video frames to make predictions, unlike humans, who process data online and in real time. To bridge this gap, we introduce a new task, Online Generic Event Boundary Detection (On-GEBD), aiming to detect boundaries of generic events immediately in streaming videos. This task faces unique challenges of identifying subtle, taxonomy-free event changes in real time, without access to future frames. To tackle these challenges, we propose a novel On-GEBD framework, Estimator, inspired by Event Segmentation Theory (EST), which explains how humans segment ongoing activity into events by leveraging the discrepancies between predicted and actual information. Our framework consists of two key components: the Consistent Event Anticipator (CEA) and the Online Boundary Discriminator (OBD). Specifically, the CEA generates a prediction of the future frame reflecting current event dynamics based solely on prior frames. Then, the OBD measures the prediction error and adaptively adjusts the threshold using statistical tests on past errors to capture diverse, subtle event transitions. Experimental results demonstrate that Estimator outperforms all baselines adapted from recent online video understanding models and achieves performance comparable to prior offline-GEBD methods on the Kinetics-GEBD and TAPOS datasets.
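
The idea of thresholding prediction error against statistics of past errors can be sketched as follows. The abstract does not name the statistical test used by the OBD, so this sketch substitutes a simple mean-plus-z-standard-deviations rule over a sliding window; the class name `AdaptiveBoundaryDetector` and its parameters (`window`, `z`, `min_history`) are hypothetical.

```python
from collections import deque
import math

class AdaptiveBoundaryDetector:
    """Flags a boundary when the current error is an outlier w.r.t. past errors."""

    def __init__(self, window=30, z=3.0, min_history=5):
        self.errors = deque(maxlen=window)  # sliding window of non-boundary errors
        self.z = z
        self.min_history = min_history

    def update(self, error):
        is_boundary = False
        if len(self.errors) >= self.min_history:
            mean = sum(self.errors) / len(self.errors)
            var = sum((e - mean) ** 2 for e in self.errors) / len(self.errors)
            threshold = mean + self.z * math.sqrt(var)  # adapts as history changes
            is_boundary = error > threshold
        if not is_boundary:
            # Keep outliers out of the history so one spike does not inflate the threshold.
            self.errors.append(error)
        return is_boundary

detector = AdaptiveBoundaryDetector()
stream = [0.10, 0.11, 0.09, 0.10, 0.12, 0.10, 0.55, 0.11]  # toy prediction errors
flags = [detector.update(e) for e in stream]
boundaries = [i for i, f in enumerate(flags) if f]
```

Because the threshold is recomputed from recent errors at every step, the same detector can react to small jumps in calm footage and tolerate larger fluctuations in busy footage, which is the general motivation for an adaptive rather than fixed threshold.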

[162] Explaining raw data complexity to improve satellite onboard processing

Adrien Dorise,Marjorie Bellizzi,Adrien Girard,Benjamin Francesconi,Stéphane May

Main category: cs.CV

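TL;DR: This study examines the effect of running deep learning object detection directly on raw remote sensing data onboard satellites. It proposes a simulation workflow that generates raw-like data from high-resolution L1 imagery and compares YOLOv11s and YOLOX-S on raw and L1 data. Models trained on raw data identify object boundaries less well at high confidence levels, suggesting that improved contouring methods are needed.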

Details

Motivation: As onboard AI deployment becomes feasible, using unprocessed raw sensor data directly becomes possible, yet existing methods mostly rely on preprocessed images and few studies apply raw data directly. Understanding how raw data affects deep learning model performance is therefore important. Method: A simulation workflow converts high-resolution L1 imagery into raw-like data; two object detectors, YOLOv11s and YOLOX-S, are trained on raw and L1 datasets and compared using standard detection metrics and explainability tools. Result: Both models perform similarly at low to medium confidence thresholds, but at high confidence levels the model trained on raw data is worse at identifying object boundaries, indicating that raw data makes fine-structure extraction harder. Conclusion: Onboard object detection directly on raw sensor data is feasible, but existing models have limited boundary identification at high confidence; AI architectures with stronger contour awareness are needed to improve detection on raw data. Abstract: With increasing processing power, deploying AI models for remote sensing directly onboard satellites is becoming feasible. However, new constraints arise, mainly when using raw, unprocessed sensor data instead of preprocessed ground-based products. While current solutions primarily rely on preprocessed sensor images, few approaches directly leverage raw data. This study investigates the effects of utilising raw data on deep learning models for object detection and classification tasks. We introduce a simulation workflow to generate raw-like products from high-resolution L1 imagery, enabling systematic evaluation. Two object detection models (YOLOv11s and YOLOX-S) are trained on both raw and L1 datasets, and their performance is compared using standard detection metrics and explainability tools. Results indicate that while both models perform similarly at low to medium confidence thresholds, the model trained on raw data struggles with object boundary identification at high confidence levels. This suggests that adapting AI architectures with improved contouring methods can enhance object detection on raw images, improving onboard AI for remote sensing.

[163] HARP-NeXt: High-Speed and Accurate Range-Point Fusion Network for 3D LiDAR Semantic Segmentation

Samir Abou Haidar,Alexandre Chariot,Mehdi Darouich,Cyril Joly,Jean-Emmanuel Deschaud

Main category: cs.CV

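TL;DR: This paper proposes HARP-NeXt, a high-speed, high-accuracy LiDAR semantic segmentation network. With a novel pre-processing method and a multi-scale feature fusion structure, and without test-time augmentation or ensemble models, it achieves a better speed-accuracy trade-off than existing methods on nuScenes and SemanticKITTI and runs 24× faster than PTv3.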

Details

Motivation: Existing LiDAR semantic segmentation methods trade accuracy against speed: point-based and sparse-convolution methods are accurate but slow, while projection-based methods are fast but lose geometric information. Test-time augmentation and heavy pre-processing further hurt real-time performance, making embedded deployment difficult. Method: HARP-NeXt comprises (1) a novel pre-processing method that reduces computational overhead; (2) the Conv-SE-NeXt feature extraction block, which avoids deep layer stacking; and (3) a multi-scale range-point fusion backbone that preserves key geometric details. Result: On the nuScenes and SemanticKITTI benchmarks, without ensemble models or TTA, HARP-NeXt is comparable in accuracy to the top-ranked PTv3 while running 24× faster, and achieves a better speed-accuracy trade-off than other state-of-the-art methods. Conclusion: HARP-NeXt resolves the tension between accuracy and speed in LiDAR semantic segmentation and is particularly suited to resource-constrained embedded platforms, offering an efficient and reliable solution for autonomous driving and mobile robots. Abstract: LiDAR semantic segmentation is crucial for autonomous vehicles and mobile robots, requiring high accuracy and real-time processing, especially on resource-constrained embedded systems. Previous state-of-the-art methods often face a trade-off between accuracy and speed. Point-based and sparse convolution-based methods are accurate but slow due to the complexity of neighbor searching and 3D convolutions. Projection-based methods are faster but lose critical geometric information during the 2D projection. Additionally, many recent methods rely on test-time augmentation (TTA) to improve performance, which further slows the inference. Moreover, the pre-processing phase across all methods increases execution time and is demanding on embedded platforms. Therefore, we introduce HARP-NeXt, a high-speed and accurate LiDAR semantic segmentation network. We first propose a novel pre-processing methodology that significantly reduces computational overhead. Then, we design the Conv-SE-NeXt feature extraction block to efficiently capture representations without deep layer stacking per network stage. We also employ a multi-scale range-point fusion backbone that leverages information at multiple abstraction levels to preserve essential geometric details, thereby enhancing accuracy. Experiments on the nuScenes and SemanticKITTI benchmarks show that HARP-NeXt achieves a superior speed-accuracy trade-off compared to all state-of-the-art methods, and, without relying on ensemble models or TTA, is comparable to the top-ranked PTv3, while running 24× faster.
The code is available at https://github.com/SamirAbouHaidar/HARP-NeXt
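
To make the "2D projection" used by projection-based methods concrete, here is a minimal sketch of the standard spherical range-image projection of a LiDAR point cloud. It is not HARP-NeXt's own pre-processing, which the abstract does not describe; the function name `range_projection` and the default image size and vertical field of view are assumptions chosen for illustration.

```python
import numpy as np

def range_projection(points, H=64, W=1024, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project (N, 3) xyz points onto an H x W range image.

    Returns the range image (empty pixels are -1) and each point's (row, col).
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points[:, :3], axis=1)
    yaw = np.arctan2(y, x)                       # horizontal angle in [-pi, pi]
    pitch = np.arcsin(z / np.maximum(r, 1e-8))   # vertical angle
    fov_down = np.radians(fov_down_deg)
    fov = np.radians(fov_up_deg) - fov_down
    cols = 0.5 * (1.0 - yaw / np.pi) * W
    rows = (1.0 - (pitch - fov_down) / fov) * H
    cols = np.clip(np.floor(cols), 0, W - 1).astype(np.int64)
    rows = np.clip(np.floor(rows), 0, H - 1).astype(np.int64)
    image = np.full((H, W), -1.0, dtype=np.float32)
    order = np.argsort(r)[::-1]                  # far points first, so nearer ones overwrite
    image[rows[order], cols[order]] = r[order]
    return image, rows, cols

pts = np.array([[10.0, 0.0, 0.0],
                [20.0, 0.0, 0.0]])               # two points along the same ray
image, rows, cols = range_projection(pts)
```

The two toy points fall into the same pixel and only the nearer range survives, which illustrates the loss of geometric information that the abstract attributes to projection-based methods.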

[164] Lung Infection Severity Prediction Using Transformers with Conditional TransMix Augmentation and Cross-Attention

Bouthaina Slika,Fadi Dornaika,Fares Bougourzi,Karim Hammoudi

Main category: cs.CV

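TL;DR: Proposes QCross-Att-PVT, a Transformer-based dual-encoder model, and Conditional Online TransMix, a conditional online data augmentation method, for lung infection severity assessment; the approach outperforms existing methods on both CT and X-ray images.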

Details

Motivation: Accurately assessing the severity of lung infections such as pneumonia is critical for clinical decisions, especially during pandemics, yet existing methods are limited in feature extraction and in handling data imbalance. Method: QCross-Att-PVT uses parallel encoders, a cross-gated attention mechanism, and a feature aggregation module; it is paired with the Conditional Online TransMix augmentation strategy, which generates mixed-label image patches to mitigate data imbalance. Result: On two benchmark datasets, RALO CXR and Per-COVID-19 CT, the method consistently outperforms several state-of-the-art deep learning models, confirming the key role of data augmentation and gated attention in improving robustness and predictive accuracy. Conclusion: The method provides a reliable, adaptable tool for lung infection severity assessment that supports clinical diagnosis, disease monitoring, and personalized treatment planning. Abstract: Lung infections, particularly pneumonia, pose serious health risks that can escalate rapidly, especially during pandemics. Accurate AI-based severity prediction from medical imaging is essential to support timely clinical decisions and optimize patient outcomes. In this work, we present a novel method applicable to both CT scans and chest X-rays for assessing lung infection severity. Our contributions are twofold: (i) QCross-Att-PVT, a Transformer-based architecture that integrates parallel encoders, a cross-gated attention mechanism, and a feature aggregator to capture rich multi-scale features; and (ii) Conditional Online TransMix, a custom data augmentation strategy designed to address dataset imbalance by generating mixed-label image patches during training. Evaluated on two benchmark datasets, RALO CXR and Per-COVID-19 CT, our method consistently outperforms several state-of-the-art deep learning models. The results emphasize the critical role of data augmentation and gated attention in improving both robustness and predictive accuracy. This approach offers a reliable, adaptable tool to support clinical diagnosis, disease monitoring, and personalized treatment planning. The source code of this work is available at https://github.com/bouthainas/QCross-Att-PVT.

[165] Label-frugal satellite image change detection with generative virtual exemplar learning

Hichem Sahbi

Main category: cs.CV

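TL;DR: Proposes a new active-learning-based change detection algorithm that uses an invertible graph convolutional network to generate the most representative unlabeled samples for expert labeling, substantially improving labeling efficiency and detection performance.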

Details

Motivation: Existing deep learning methods depend on large amounts of hand-labeled data and are constrained by acquisition conditions and user subjectivity, so a more efficient labeling strategy is needed. Method: A new active learning framework uses an invertible graph convolutional network to generate virtual exemplars; an adversarial loss measures the representativity, diversity, and ambiguity of samples, and the most challenging ones are selected for manual labeling. Result: Experiments show that the method clearly outperforms comparative methods with fewer labels and improves change detection performance. Conclusion: By intelligently selecting key samples for labeling, the method achieves efficient and accurate change detection and offers an effective answer to labeling cost in remote sensing image analysis. Abstract: Change detection is a major task in remote sensing which consists in finding all the occurrences of changes in multi-temporal satellite or aerial images. The success of existing methods, and particularly deep learning ones, depends on the availability of hand-labeled training data that capture the acquisition conditions and the subjectivity of the user (oracle). In this paper, we devise a novel change detection algorithm based on active learning. The main contribution of our work resides in a new model that measures how important each unlabeled sample is, and provides an oracle with only the most critical samples (also referred to as virtual exemplars) for further labeling. These exemplars are generated, using an invertible graph convnet, as the optimum of an adversarial loss that (i) measures representativity, diversity and ambiguity of the data, and thereby (ii) challenges (the most) the current change detection criteria, leading to a better re-estimate of these criteria in the subsequent iterations of active learning. Extensive experiments show the positive impact of our label-efficient learning model against comparative methods.

[166] IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction

Ran Yi,Teng Hu,Zihan Su,Lizhuang Ma

Main category: cs.CV

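TL;DR: This paper proposes IAR2, an improved autoregressive image generation framework that uses a semantic-detail dual codebook and hierarchical prediction to substantially improve generation quality and conditional alignment while remaining computationally efficient.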

Details

Motivation: Existing autoregressive visual generation models overlook the intrinsic structural properties of visual data and are limited by the rigidity of pre-trained codebooks and the inaccuracy of hard clustering, making it hard to achieve both semantic consistency and detail fidelity. Method: IAR2 comprises a Semantic-Detail Associated Dual Codebook (decoupling image representations into semantic and detail codebooks), a Semantic-Detail Autoregressive Prediction scheme, a Local-Context Enhanced Autoregressive Head, and a Progressive Attention-Guided Adaptive CFG mechanism, yielding coarse-to-fine hierarchical generation. Result: IAR2 achieves an FID of 1.50 on ImageNet, a new state of the art for autoregressive image generation, with better computational efficiency and conditional alignment. Conclusion: Through its structured dual codebook and hierarchical generation strategy, IAR2 improves the expressiveness, realism, and conditional control of autoregressive image generation. Abstract: Autoregressive models have emerged as a powerful paradigm for visual content creation, but often overlook the intrinsic structural properties of visual data. Our prior work, IAR, initiated a direction to address this by reorganizing the visual codebook based on embedding similarity, thereby improving generation robustness. However, it is constrained by the rigidity of pre-trained codebooks and the inaccuracies of hard, uniform clustering. To overcome these limitations, we propose IAR2, an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process. At the core of IAR2 is a novel Semantic-Detail Associated Dual Codebook, which decouples image representations into a semantic codebook for global semantic information and a detail codebook for fine-grained refinements. It expands the quantization capacity from a linear to a polynomial scale, significantly enhancing expressiveness. To accommodate this dual representation, we propose a Semantic-Detail Autoregressive Prediction scheme coupled with a Local-Context Enhanced Autoregressive Head, which performs hierarchical prediction (first the semantic token, then the detail token) while leveraging a local context window to enhance spatial coherence. Furthermore, for conditional generation, we introduce a Progressive Attention-Guided Adaptive CFG mechanism that dynamically modulates the guidance scale for each token based on its relevance to the condition and its temporal position in the generation sequence, improving conditional alignment without sacrificing realism.
Extensive experiments demonstrate that IAR2 sets a new state-of-the-art for autoregressive image generation, achieving an FID of 1.50 on ImageNet. Our model not only surpasses previous methods in performance but also demonstrates superior computational efficiency, highlighting the effectiveness of our structured, coarse-to-fine generation strategy.
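
One way to read the "linear to polynomial" capacity claim is simple counting: two codebooks store a number of entries equal to the sum of their sizes, while the number of distinct (semantic, detail) pairs is their product. The sketch below only illustrates that arithmetic; the codebook sizes and the `joint_index` helper are hypothetical and are not taken from the paper.

```python
def joint_index(semantic_id, detail_id, num_detail):
    """Map a (semantic, detail) token pair to a single index in the product space."""
    return semantic_id * num_detail + detail_id

num_semantic, num_detail = 1024, 1024              # hypothetical codebook sizes
stored_entries = num_semantic + num_detail         # embeddings actually stored
representable_pairs = num_semantic * num_detail    # distinct codes expressible
example = joint_index(3, 5, num_detail)
```

Under these assumed sizes, 2,048 stored entries can express over a million distinct codes, whereas a single codebook would need one stored entry per code.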

[167] OBJVanish: Physically Realizable Text-to-3D Adv. Generation of LiDAR-Invisible Objects

Bing Li,Wuqi Wang,Yanan Zhang,Jingzheng Li,Haigen Min,Wei Feng,Xingyu Zhao,Jie Zhang,Qing Guo

Main category: cs.CV

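TL;DR: Proposes Phy3DAdvGen, a text-to-3D adversarial generation method that produces physically realizable 3D pedestrian models invisible to LiDAR detectors; the models evade SOTA detectors, and the attack is validated in both simulation and real-world environments.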

Details

Motivation: Existing 3D adversarial attacks rarely make the target disappear completely and are hard to realize physically; more effective, realizable attacks are needed to assess the security of LiDAR detection systems. Method: After systematically manipulating the topology, connectivity, and intensity of pedestrian models in CARLA to analyze the influencing factors, the authors propose Phy3DAdvGen, which iteratively optimizes the verbs, objects, and poses in text prompts and generates physically realizable adversarial 3D models from a pool of 13 3D models of real objects. Result: The generated 3D pedestrians evade six SOTA LiDAR 3D detectors in both CARLA simulation and physical environments, showing high attack success rates and physical realizability. Conclusion: Phy3DAdvGen exposes potential vulnerabilities of LiDAR detection systems in safety-critical applications and provides a useful reference for designing future defenses. Abstract: LiDAR-based 3D object detectors are fundamental to autonomous driving, where failing to detect objects poses severe safety risks. Developing effective 3D adversarial attacks is essential for thoroughly testing these detection systems and exposing their vulnerabilities before real-world deployment. However, existing adversarial attacks that add optimized perturbations to 3D points have two critical limitations: they rarely cause complete object disappearance and prove difficult to implement in physical environments. We introduce the text-to-3D adversarial generation method, a novel approach enabling physically realizable attacks that can generate 3D models of objects truly invisible to LiDAR detectors and be easily realized in the real world. Specifically, we present the first empirical study that systematically investigates the factors influencing detection vulnerability by manipulating the topology, connectivity, and intensity of individual pedestrian 3D models and combining pedestrians with multiple objects within the CARLA simulation environment. Building on these insights, we propose the physically-informed text-to-3D adversarial generation (Phy3DAdvGen) that systematically optimizes text prompts by iteratively refining verbs, objects, and poses to produce LiDAR-invisible pedestrians. To ensure physical realizability, we construct a comprehensive object pool containing 13 3D models of real objects and constrain Phy3DAdvGen to generate 3D objects based on combinations of objects in this set.
Extensive experiments demonstrate that our approach can generate 3D pedestrians that evade six state-of-the-art (SOTA) LiDAR 3D detectors in both CARLA simulation and physical environments, thereby highlighting vulnerabilities in safety-critical applications.

[168] Generating Surface for Text-to-3D using 2D Gaussian Splatting

Huanning Dong,Fan Li,Ping Kuang,Jianwen Min

Main category: cs.CV

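TL;DR: This paper proposes DirectGaussian, a method for generating 3D object surfaces represented by surfels. Surfaces are rendered with 2D Gaussian splatting using multi-view normal and texture priors, and curvature constraints are added during optimization to improve geometric consistency; experiments show diverse, high-fidelity 3D content generation.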

Details

Motivation: Because objects in the natural world have complex geometric shapes, existing text-to-3D methods, which either recover geometry from 2D priors or train on specific 3D representations, struggle to generate high-quality 3D content. Method: DirectGaussian uses conditional text generation models, renders 3D object surfaces with 2D Gaussian splatting, introduces multi-view normal and texture priors, and adds curvature constraints during optimization to improve multi-view geometric consistency. Result: Experiments show that DirectGaussian generates diverse, high-fidelity 3D content with strong geometric detail and visual quality. Conclusion: By combining 2D diffusion priors with curvature-constrained optimization, DirectGaussian improves the quality of surfel-based 3D content generation and offers a new direction for text-to-3D generation. Abstract: Recent advancements in Text-to-3D modeling have shown significant potential for the creation of 3D content. However, due to the complex geometric shapes of objects in the natural world, generating 3D content remains a challenging task. Current methods either leverage 2D diffusion priors to recover 3D geometry, or train the model directly based on specific 3D representations. In this paper, we propose a novel method named DirectGaussian, which focuses on generating the surfaces of 3D objects represented by surfels. In DirectGaussian, we utilize conditional text generation models, and the surface of a 3D object is rendered by 2D Gaussian splatting with multi-view normal and texture priors. To address multi-view geometric consistency, DirectGaussian incorporates curvature constraints on the generated surface during the optimization process. Through extensive experiments, we demonstrate that our framework is capable of achieving diverse and high-fidelity 3D content creation.

[169] Learning Global Representation from Queries for Vectorized HD Map Construction

Shoumeng Qiu,Xinrun Li,Yang Long,Xiangyang Xue,Varun Ojha,Jian Pu

Main category: cs.CV

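TL;DR: This paper proposes MapGR, a new architecture for online construction of vectorized HD maps that introduces global representation learning and guidance modules and substantially improves detection accuracy.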

Details

Motivation: Existing DETR-based methods rely on independent, learnable object queries and cannot model the global structural information in HD maps. Method: MapGR includes a Global Representation Learning (GRL) module and a Global Representation Guidance (GRG) module, which strengthen the global awareness of queries through a holistic segmentation task and global context injection. Result: The method is validated on the nuScenes and Argoverse2 datasets, with substantial mAP improvements over existing baselines. Conclusion: By explicitly modeling global representations, MapGR improves vectorized HD map construction and offers a new direction for map generation in autonomous driving. Abstract: The online construction of vectorized high-definition (HD) maps is a cornerstone of modern autonomous driving systems. State-of-the-art approaches, particularly those based on the DETR framework, formulate this as an instance detection problem. However, their reliance on independent, learnable object queries results in a predominantly local query perspective, neglecting the inherent global representation within HD maps. In this work, we propose MapGR (Global Representation learning for HD Map construction), an architecture designed to learn and utilize global representations from queries. Our method introduces two synergistic modules: a Global Representation Learning (GRL) module, which encourages the distribution of all queries to better align with the global map through a carefully designed holistic segmentation task, and a Global Representation Guidance (GRG) module, which endows each individual query with explicit, global-level contextual information to facilitate its optimization. Evaluations on the nuScenes and Argoverse2 datasets validate the efficacy of our approach, demonstrating substantial improvements in mean Average Precision (mAP) compared to leading baselines.

[170] Addressing the ID-Matching Challenge in Long Video Captioning

Zhantao Yang,Huangji Wang,Ruili Feng,Han Zhang,Yuting Hu,Shangwen Zhu,Junyan Li,Yu Liu,Fan Cheng

Main category: cs.CV

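TL;DR: This paper proposes RICE, a new long video captioning method aimed at improving ID-Matching accuracy by making better use of image information and enriching individual descriptions. On GPT-4o, it raises ID-Matching precision from 50% to 90% and recall from 15% to 80%.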

Details

Motivation: Long video captioning suffers from inconsistent identity recognition (the ID-Matching problem); existing methods generalize poorly and depend on point-wise matching, so they struggle to recognize the same person across frames. Method: Building on large vision-language models (LVLMs), the authors construct a new ID-Matching benchmark, find through analysis that better use of image information and richer individual descriptions improve ID-Matching, and propose RICE on that basis. Result: Experiments confirm that RICE is superior in both caption quality and ID-Matching performance, markedly raising precision and recall and enabling continuous tracking of different individuals in long videos. Conclusion: RICE unlocks the inherent ID-Matching capability of LVLMs, improves the consistency and accuracy of identity recognition in long video captions, and supports multi-modal understanding and text-to-video generation. Abstract: Generating captions for long and complex videos is both critical and challenging, with significant implications for the growing fields of text-to-video generation and multi-modal understanding. One key challenge in long video captioning is accurately recognizing the same individuals who appear in different frames, which we refer to as the ID-Matching problem. Few prior works have focused on this important issue. Those that have usually suffer from limited generalization and depend on point-wise matching, which limits their overall effectiveness. In this paper, unlike previous approaches, we build upon LVLMs to leverage their powerful priors. We aim to unlock the inherent ID-Matching capabilities within LVLMs themselves to enhance the ID-Matching performance of captions. Specifically, we first introduce a new benchmark for assessing the ID-Matching capabilities of video captions. Using this benchmark, we investigate LVLMs including GPT-4o, revealing key insights that the performance of ID-Matching can be improved through two methods: 1) enhancing the usage of image information and 2) increasing the quantity of information of individual descriptions. Based on these insights, we propose a novel video captioning method called Recognizing Identities for Captioning Effectively (RICE). Extensive experiments, including assessments of caption quality and ID-Matching performance, demonstrate the superiority of our approach. Notably, when implemented on GPT-4o, our RICE improves the precision of ID-Matching from 50% to 90% and improves the recall of ID-Matching from 15% to 80% compared to baseline.
RICE makes it possible to continuously track different individuals in the captions of long videos.

[171] No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts

Girolamo Macaluso,Lorenzo Mandelli,Mirko Bicchierai,Stefano Berretti,Andrew D. Bagdanov

Main category: cs.CV

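TL;DR: Proposes a reinforcement-learning-based post-training framework that fine-tunes pretrained motion diffusion models using only text prompts, with no additional motion capture data or full retraining, enabling efficient, flexible, and privacy-preserving motion adaptation.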

Details

Motivation: Diffusion models generate human motion well, but adapting them to new actions or styles usually requires large amounts of new motion capture data and full retraining, which is costly and hard to scale. A more efficient, low-cost way to adapt to unseen actions or styles is needed. Method: A reinforcement-learning fine-tuning framework uses a pretrained text-motion retrieval network as the reward signal and adjusts the diffusion model's generative distribution with Denoising Diffusion Policy Optimization, relying only on text prompts and no ground-truth motion data. Result: The method is validated on the HumanML3D and KIT-ML datasets, covering cross-dataset adaptation and leave-one-out experiments; quantitative metrics and user studies show consistent improvements in the quality and diversity of generated motions while preserving performance on the original distribution. Conclusion: The method is a flexible, data-efficient, and privacy-preserving solution for motion adaptation that transfers pretrained diffusion models to a target domain without paired motion data. Abstract: Diffusion models have recently advanced human motion generation, producing realistic and diverse animations from textual prompts. However, adapting these models to unseen actions or styles typically requires additional motion capture data and full retraining, which is costly and difficult to scale. We propose a post-training framework based on Reinforcement Learning that fine-tunes pretrained motion diffusion models using only textual prompts, without requiring any motion ground truth. Our approach employs a pretrained text-motion retrieval network as a reward signal and optimizes the diffusion policy with Denoising Diffusion Policy Optimization, effectively shifting the model's generative distribution toward the target domain without relying on paired motion data. We evaluate our method on cross-dataset adaptation and leave-one-out motion experiments using the HumanML3D and KIT-ML datasets across both latent- and joint-space diffusion architectures. Results from quantitative metrics and user studies show that our approach consistently improves the quality and diversity of generated motions, while preserving performance on the original distribution. Our approach is a flexible, data-efficient, and privacy-preserving solution for motion adaptation.

[172] Bayesian Modelling of Multi-Year Crop Type Classification Using Deep Neural Networks and Hidden Markov Models

Gianmarco Perantoni,Giulio Weikmann,Lorenzo Bruzzone

Main category: cs.CV

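TL;DR: Proposes a new method that combines deep learning with Bayesian modelling (an HMM with a Transformer Encoder) for classifying yearly satellite image time series, markedly improving temporal consistency and F1 scores in multi-crop classification.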

Details

Motivation: To improve the temporal consistency of yearly land-cover maps and accurately model multi-year land-cover change, particularly for classification over complex crop-type sequences. Method: A Hidden Markov Model (HMM) is combined with a Transformer Encoder (TE) based deep neural network in a cascade classification structure; the HMM refines the TE's yearly label outputs and captures state-transition patterns across long sequences. Result: Validated on a dataset with 47 crop types and six years of Sentinel-2 data, the method improves overall performance and F1 scores, showing the value of modelling temporal consistency. Conclusion: Combining an HMM with a Transformer improves the temporal consistency and accuracy of yearly land-cover classification and is especially suited to long-term monitoring of many crop types. Abstract: The temporal consistency of yearly land-cover maps is of great importance to model the evolution and change of the land cover over the years. In this paper, we focus on a novel approach to the classification of yearly satellite image time series (SITS) that combines deep learning with Bayesian modelling, using Hidden Markov Models (HMMs) integrated with Transformer Encoder (TE) based DNNs. The proposed approach aims to capture both i) intricate temporal correlations in yearly SITS and ii) specific patterns in multiyear crop type sequences. It leverages the cascade classification of an HMM layer built on top of the TE, discerning consistent yearly crop-type sequences. Validation on a multiyear crop type classification dataset spanning 47 crop types and six years of Sentinel-2 acquisitions demonstrates the importance of modelling temporal consistency in the predicted labels. HMMs enhance the overall performance and F1 scores, emphasising the effectiveness of the proposed approach.
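
As a rough illustration of how an HMM layer can refine per-year classifier outputs, the sketch below runs standard Viterbi decoding over a short sequence of yearly class probabilities. Using the classifier's probabilities directly as emission scores, the two-class setup, and the toy transition matrix are all assumptions for illustration; the abstract does not describe how the authors estimate or apply their HMM.

```python
import numpy as np

def viterbi(log_emissions, log_transition, log_prior):
    """Most likely state sequence for (T, K) log emission scores."""
    T, K = log_emissions.shape
    score = log_prior + log_emissions[0]
    backptr = np.zeros((T, K), dtype=np.int64)
    for t in range(1, T):
        # cand[i, j]: best score of a path ending in state i, then moving i -> j.
        cand = score[:, None] + log_transition
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emissions[t]
    path = np.zeros(T, dtype=np.int64)
    path[-1] = score.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path

# Toy example: 4 years, 2 crop classes; year 2 is ambiguous for the classifier.
posteriors = np.array([[0.90, 0.10],
                       [0.80, 0.20],
                       [0.45, 0.55],
                       [0.90, 0.10]])
transition = np.array([[0.9, 0.1],     # hypothetical year-to-year transition probabilities
                       [0.1, 0.9]])
prior = np.array([0.5, 0.5])

independent = posteriors.argmax(axis=1)
smoothed = viterbi(np.log(posteriors), np.log(transition), np.log(prior))
```

Here the per-year argmax flips to class 1 in the ambiguous year, while the decoded sequence stays on class 0 because the transition model makes an isolated switch unlikely. A real crop-rotation model would learn transitions from multi-year labels rather than use this sticky toy matrix.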

[173] U-Bench: A Comprehensive Understanding of U-Net through 100-Variant Benchmarking

Fenghe Tang,Chengqi Dong,Wenxin Ma,Zikang Xu,Heqin Zhu,Zihang Jiang,Rongsheng Wang,Yuhao Wang,Chenxu Wu,Shaohua Kevin Zhou

Main category: cs.CV

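TL;DR: U-Bench is the first large-scale, statistically rigorous benchmark that evaluates 100 U-Net variants across 28 datasets and 10 imaging modalities. It introduces the U-Score metric for the performance-efficiency trade-off and provides model selection guidance and public resources to support fair, reproducible evaluation of future U-Net segmentation models.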

Details

Motivation: Although U-Net and its variants are widely used in medical image segmentation, there is no systematic, statistically rigorous evaluation of them, particularly regarding efficiency, zero-shot generalization, and cross-dataset generalization, so a comprehensive and reliable benchmark is needed. Method: U-Bench evaluates 100 U-Net variants on 28 datasets and 10 modalities along three dimensions (statistical robustness, zero-shot generalization, and computational efficiency), introduces the U-Score metric, and develops a model advisor agent to guide model selection. Result: The study finds significant flaws in existing evaluations, identifies models that are both accurate and efficient, validates U-Score, and reveals how dataset characteristics and architectural paradigms affect performance. Conclusion: U-Bench fills the gap in evaluating U-Net variants and lays a foundation for fair, reproducible, and practical evaluation of future medical image segmentation models. Abstract: Over the past decade, U-Net has been the dominant architecture in medical image segmentation, leading to the development of thousands of U-shaped variants. Despite its widespread adoption, there is still no comprehensive benchmark to systematically evaluate their performance and utility, largely because of insufficient statistical validation and limited consideration of efficiency and generalization across diverse datasets. To bridge this gap, we present U-Bench, the first large-scale, statistically rigorous benchmark that evaluates 100 U-Net variants across 28 datasets and 10 imaging modalities. Our contributions are threefold: (1) Comprehensive Evaluation: U-Bench evaluates models along three key dimensions: statistical robustness, zero-shot generalization, and computational efficiency. We introduce a novel metric, U-Score, which jointly captures the performance-efficiency trade-off, offering a deployment-oriented perspective on model progress. (2) Systematic Analysis and Model Selection Guidance: We summarize key findings from the large-scale evaluation and systematically analyze the impact of dataset characteristics and architectural paradigms on model performance. Based on these insights, we propose a model advisor agent to guide researchers in selecting the most suitable models for specific datasets and tasks. (3) Public Availability: We provide all code, models, protocols, and weights, enabling the community to reproduce our results and extend the benchmark with future methods.
In summary, U-Bench not only exposes gaps in previous evaluations but also establishes a foundation for fair, reproducible, and practically relevant benchmarking in the next decade of U-Net-based segmentation models. The project can be accessed at: https://fenghetan9.github.io/ubench. Code is available at: https://github.com/FengheTan9/U-Bench.

[174] Concept Retrieval -- What and How?

Ori Nizan,Oren Shrout,Ayellet Tal

Main category: cs.CV

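TL;DR: This paper proposes a new method, based on modelling neighborhoods in the embedding space with a bimodal Gaussian distribution, for retrieving images that share the central concepts of a query image, going beyond conventional visual or semantic similarity matching.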

Details

Motivation: Conventional image retrieval relies mainly on visual or semantic similarity and struggles to capture the deeper concepts and narrative behind an image, so a method that identifies and retrieves images sharing abstract concepts is needed. Method: The approach analyzes the distribution of nearest neighbors in the embedding space and models the neighborhood with a bimodal Gaussian distribution to identify images that share central concepts with the query. Result: Qualitative, quantitative, and human evaluations confirm the method's effectiveness at finding images that share a concept even when they differ visually. Conclusion: The method identifies and retrieves images that share central concepts, offering a retrieval path beyond appearance and surface semantics. Abstract: A concept may reflect either a concrete or abstract idea. Given an input image, this paper seeks to retrieve other images that share its central concepts, capturing aspects of the underlying narrative. This goes beyond conventional retrieval or clustering methods, which emphasize visual or semantic similarity. We formally define the problem, outline key requirements, and introduce appropriate evaluation metrics. We propose a novel approach grounded in two key observations: (1) While each neighbor in the embedding space typically shares at least one concept with the query, not all neighbors necessarily share the same concept with one another. (2) Modeling this neighborhood with a bimodal Gaussian distribution uncovers meaningful structure that facilitates concept identification. Qualitative, quantitative, and human evaluations confirm the effectiveness of our approach. See the package on PyPI: https://pypi.org/project/coret/
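
To show what fitting a bimodal Gaussian to a set of values involves, here is a minimal sketch of expectation-maximization for a two-component 1-D Gaussian mixture. What quantity the paper actually fits (for example, which projection of the neighbors' embeddings) is not stated in the abstract, so the synthetic `scores` array and the function name `fit_two_gaussians` are purely illustrative assumptions, and the sketch does not use the `coret` package.

```python
import numpy as np

def fit_two_gaussians(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture.

    Returns mixture weights, means, standard deviations, and the (N, 2)
    responsibilities of each component for each point.
    """
    x = np.asarray(x, dtype=float)
    mu = np.array([x.min(), x.max()])        # start the two modes at the extremes
    sigma = np.full(2, x.std() + 1e-6)
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments.
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        pi = nk / len(x)
    return pi, mu, sigma, resp

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.0, 0.5, 200),   # synthetic group A
                         rng.normal(5.0, 0.5, 200)])  # synthetic group B
weights, means, stds, resp = fit_two_gaussians(scores)
labels = resp.argmax(axis=1)
```

If neighbors that share a concept with the query form one mode and the rest form the other, the responsibilities give a soft split of the neighborhood, which is one plausible way such a model could "uncover structure" in the sense of the abstract's second observation.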

[175] DADO: A Depth-Attention framework for Object Discovery

Federico Gonzalez,Estefania Talavera,Petia Radeva

Main category: cs.CV

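TL;DR: Proposes DADO, an unsupervised object discovery method that combines an attention mechanism with a depth model and fuses attention and depth features adaptively through dynamic weighting, outperforming existing methods on standard benchmarks.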

Details

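Motivation: Unsupervised object discovery suffers from the lack of annotated data and the difficulty of identifying objects in complex scenes, which calls for a method that adaptively exploits attention and depth information. Method: Proposes the DADO model, which combines an attention mechanism with depth estimation and introduces a dynamic weighting strategy that adapts the weights of attention and depth features to the global characteristics of each image. Result: Experiments on standard benchmarks show that DADO outperforms state-of-the-art methods in object discovery accuracy and robustness, without fine-tuning. Conclusion: By dynamically fusing attention and depth information, DADO improves unsupervised object discovery and shows good generalization and application potential. Abstract: Unsupervised object discovery, the task of identifying and localizing objects in images without human-annotated labels, remains a significant challenge and a growing focus in computer vision. In this work, we introduce a novel model, DADO (Depth-Attention self-supervised technique for Discovering unseen Objects), which combines an attention mechanism and a depth model to identify potential objects in images. To address challenges such as noisy attention maps or complex scenes with varying depth planes, DADO employs dynamic weighting to adaptively emphasize attention or depth features based on the global characteristics of each image. We evaluated DADO on standard benchmarks, where it outperforms state-of-the-art methods in object discovery accuracy and robustness without the need for fine-tuning.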
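
To make the idea of dynamic weighting concrete, here is a minimal sketch. The abstract does not say which global characteristics drive the weights, so the choice below (the normalized entropy of the attention map, with a diffuse map shifting weight toward depth) is an assumption, and `normalized_entropy`, `minmax`, and `fuse` are hypothetical names rather than DADO's API.

```python
import numpy as np

def minmax(m):
    """Rescale a map to [0, 1]."""
    m = np.asarray(m, dtype=float)
    return (m - m.min()) / (m.max() - m.min() + 1e-12)

def normalized_entropy(m):
    """Entropy of a map treated as a distribution, scaled to [0, 1].
    A flat map is treated as maximally uninformative (entropy 1)."""
    p = np.asarray(m, dtype=float).ravel()
    p = p - p.min()
    total = p.sum()
    if total == 0:
        return 1.0
    p = p / total
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum() / np.log(p.size))

def fuse(attention, depth):
    """Assumed rule: the noisier (higher-entropy) the attention map,
    the more weight goes to the depth map."""
    w_att = 1.0 - normalized_entropy(attention)
    fused = w_att * minmax(attention) + (1.0 - w_att) * minmax(depth)
    return fused, w_att

# A sharply peaked attention map is trusted fully; a flat one is not.
peaked = np.zeros((4, 4))
peaked[1, 2] = 1.0
depth = np.linspace(0.0, 1.0, 16).reshape(4, 4)
fused_peaked, w_peaked = fuse(peaked, depth)
fused_flat, w_flat = fuse(np.ones((4, 4)), depth)
```

The point of the sketch is only the mechanism: a per-image scalar computed from global statistics decides how much each cue contributes.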

[176] Enhancing Concept Localization in CLIP-based Concept Bottleneck Models

Rémi Kazmierczak,Steve Azzolin,Eloïse Berthier,Goran Frehse,Gianni Franchi

Main category: cs.CV

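TL;DR: This paper proposes CHILI, a method that suppresses the concept hallucination CLIP produces in concept bottleneck models and uses localized interpretability to deliver more accurate and more interpretable saliency-based explanations.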

Details

Motivation: Existing CLIP-based zero-shot concept extraction is prone to concept hallucination, which makes explanations unfaithful to the model's decision process, so a more reliable method is needed to improve the accuracy and trustworthiness of explanations. Method: Proposes Concept Hallucination Inhibition via Localized Interpretability (CHILI), which disentangles image embeddings and localizes the pixel regions associated with target concepts to suppress CLIP's concept hallucination. Result: CHILI reduces concept misprediction, improves concept localization accuracy, and produces more interpretable saliency-map explanations. Conclusion: CHILI improves the faithfulness and interpretability of CBMs in annotation-free concept extraction settings and is an important improvement for CLIP-based XAI methods. Abstract: This paper addresses explainable AI (XAI) through the lens of Concept Bottleneck Models (CBMs) that do not require explicit concept annotations, relying instead on concepts extracted using CLIP in a zero-shot manner. We show that CLIP, which is central in these techniques, is prone to concept hallucination, incorrectly predicting the presence or absence of concepts within an image in scenarios used in numerous CBMs, hence undermining the faithfulness of explanations. To mitigate this issue, we introduce Concept Hallucination Inhibition via Localized Interpretability (CHILI), a technique that disentangles image embeddings and localizes pixels corresponding to target concepts. Furthermore, our approach supports the generation of saliency-based explanations that are more interpretable.

Dongki Jung,Jaehoon Choi,Yonghan Lee,Sungmin Eum,Heesung Kwon,Dinesh Manocha

Main category: cs.CV

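TL;DR: Proposes MoRe, a training-free monocular geometry refinement method that uses a graph-based optimization framework to improve cross-view consistency and scale alignment, benefiting 3D reconstruction and novel view synthesis.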

Details

Motivation: Monocular 3D foundation models suffer from scale ambiguity and cross-frame inconsistency, which limits their performance on 3D vision tasks. Method: Uses feature matching to establish inter-frame correspondences and builds a graph-based optimization framework on the estimated 3D points and surface normals, performing local planar approximation to improve geometric consistency. Result: MoRe improves both 3D reconstruction and novel view synthesis under sparse views, and mitigates scale ambiguity. Conclusion: As a training-free method, MoRe strengthens the geometric consistency and scale alignment of monocular 3D foundation models and shows good generalization and application potential. Abstract: Monocular 3D foundation models offer an extensible solution for perception tasks, making them attractive for broader 3D vision applications. In this paper, we propose MoRe, a training-free Monocular Geometry Refinement method designed to improve cross-view consistency and achieve scale alignment. To induce inter-frame relationships, our method employs feature matching between frames to establish correspondences. Rather than applying simple least squares optimization on these matched points, we formulate a graph-based optimization framework that performs local planar approximation using the estimated 3D points and surface normals estimated by monocular foundation models. This formulation addresses the scale ambiguity inherent in monocular geometric priors while preserving the underlying 3D structure. We further demonstrate that MoRe not only enhances 3D reconstruction but also improves novel view synthesis, particularly in sparse view rendering scenarios.

[178] Validation of Various Normalization Methods for Brain Tumor Segmentation: Can Federated Learning Overcome This Heterogeneity?

Jan Fiszer,Dominika Ciupek,Maciej Malawski

Main category: cs.CV

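TL;DR: This study examines federated learning (FL) for brain tumor segmentation under non-independent and identically distributed (non-IID) data. Even with heterogeneity caused by different MRI intensity normalization methods, FL maintains good performance, reaching a Dice score of 92%, comparable to a centralized model.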

Details

Motivation: Deep learning is widely used in medical imaging, but data privacy, storage, and transfer issues limit its development; federated learning can alleviate these issues, but its behavior on non-IID data is not yet clear, so its robustness in realistic medical settings needs to be studied. Method: Simulates non-IID conditions by applying different MRI intensity normalization methods to different data subsets, then trains and tests federated and centralized models on these subsets and compares their brain tumor segmentation performance. Result: The federated learning methods are resilient to inconsistently normalized data, reaching a 3D Dice score of 92%, comparable to a centralized model trained on all data. Conclusion: Federated learning can train high-performing medical image segmentation models while protecting data privacy and remains robust under heterogeneity caused by differing normalization methods, making it a viable option for medical AI applications. Abstract: Deep learning (DL) has been increasingly applied in medical imaging; however, it requires large amounts of data, which raises many challenges related to data privacy, storage, and transfer. Federated learning (FL) is a training paradigm that overcomes these issues, though its effectiveness may be reduced when dealing with non-independent and identically distributed (non-IID) data. This study simulates non-IID conditions by applying different MRI intensity normalization techniques to separate data subsets, reflecting a common cause of heterogeneity. These subsets are then used for training and testing models for brain tumor segmentation. The findings provide insights into the influence of the MRI intensity normalization methods on segmentation models, both training and inference. Notably, the FL methods demonstrated resilience to inconsistently normalized data across clients, achieving a 3D Dice score of 92%, which is comparable to a centralized model (trained using all data). These results indicate that FL is a solution to effectively train high-performing models without violating data privacy, a crucial concern in medical applications. The code is available at: https://github.com/SanoScience/fl-varying-normalization.
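
The quantities involved can be sketched in a few lines. This is an illustrative sketch only: the abstract does not name the normalization techniques or the aggregation rule, so z-score and min-max normalization and FedAvg-style weighted averaging are assumed here as common examples, and all function names are hypothetical rather than taken from the linked repository.

```python
import numpy as np

def zscore_normalize(volume, eps=1e-8):
    """Zero-mean, unit-variance intensity normalization."""
    v = np.asarray(volume, dtype=float)
    return (v - v.mean()) / (v.std() + eps)

def minmax_normalize(volume, eps=1e-8):
    """Rescale intensities to [0, 1]."""
    v = np.asarray(volume, dtype=float)
    return (v - v.min()) / (v.max() - v.min() + eps)

def dice_score(pred, target, eps=1e-8):
    """Dice overlap between two binary masks of any dimensionality."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def fed_avg(client_weights, client_sizes):
    """FedAvg-style aggregation: average client parameters,
    weighted by each client's number of training samples."""
    total = float(sum(client_sizes))
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Two clients that normalize the same scan differently see different inputs.
scan = np.array([[0.0, 50.0], [100.0, 200.0]])
client_a_input = zscore_normalize(scan)
client_b_input = minmax_normalize(scan)

# Aggregating two clients' parameter vectors (toy values).
global_weights = fed_avg([np.array([1.0, 1.0]), np.array([3.0, 3.0])], [1, 3])
```

Here `global_weights` is `[2.5, 2.5]`, since the second client holds three quarters of the samples.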

[179] Graph Conditioned Diffusion for Controllable Histopathology Image Generation

Sarah Cechnicka,Matthew Baugh,Weitong Zhang,Mischa Dombrowski,Zhe Li,Johannes C. Paetzold,Candice Roufosse,Bernhard Kainz

Main category: cs.CV

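TL;DR: Proposes graph-based object-level representations that give fine-grained control over medical image generation through a graph-conditioned diffusion model; in a real-world histopathology use case, the generated data can substitute for annotated data in downstream segmentation tasks.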

Details

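Motivation: Existing diffusion probabilistic models operate in noisy latent spaces that lack semantic structure and strong priors, making meaningful control over medical image generation difficult, especially in sensitive domains where spatial layout, shape, and texture must remain consistent. Method: Introduces graph-based object-level representations that encode the features of nodes corresponding to the major structures in an image, together with their relationships, as a graph; the graph is processed by a transformer module and integrated into the diffusion model through the text-conditioning mechanism, enabling fine-grained control over generation. Result: Validated on real histopathology data, the generated images can reliably substitute for annotated patient data in downstream segmentation tasks, with improved controllability and semantic consistency. Conclusion: By introducing a structured graph representation, the method improves the controllability and semantic accuracy of diffusion models for medical image generation and offers a new approach to controllable generation in sensitive domains. Abstract: Recent advances in Diffusion Probabilistic Models (DPMs) have set new standards in high-quality image synthesis. Yet, controlled generation remains challenging, particularly in sensitive areas such as medical imaging. Medical images feature inherent structure such as consistent spatial arrangement, shape or texture, all of which are critical for diagnosis. However, existing DPMs operate in noisy latent spaces that lack semantic structure and strong priors, making it difficult to ensure meaningful control over generated content. To address this, we propose graph-based object-level representations for Graph-Conditioned-Diffusion. Our approach generates graph nodes corresponding to each major structure in the image, encapsulating their individual features and relationships. These graph representations are processed by a transformer module and integrated into a diffusion model via the text-conditioning mechanism, enabling fine-grained control over generation. We evaluate this approach using a real-world histopathology use case, demonstrating that our generated data can reliably substitute for annotated patient data in downstream segmentation tasks. The code is available here.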

[180] Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models

Karim El Khoury,Maxime Zanella,Christophe De Vleeschouwer,Benoit Macq

Main category: cs.CV

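TL;DR: This paper presents the first structured benchmark for evaluating how remote sensing vision-language models (RSVLMs) adapt in few-shot settings. Experiments with three state-of-the-art RSVLMs and five few-shot strategies on ten remote sensing scene classification datasets show that models with similar zero-shot performance can behave very differently under few-shot adaptation, underlining the need for more robust few-shot methods tailored to remote sensing. The reproducible benchmarking framework and code are publicly released.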

Details

Motivation: Although RSVLMs perform well on zero-shot tasks, their ability to generalize in low-data regimes such as few-shot learning has not been sufficiently explored, and a unified evaluation benchmark is lacking. Method: Builds a benchmark covering ten remote sensing scene classification datasets and systematically evaluates five common few-shot adaptation strategies on three RSVLMs with different backbones. Result: RSVLMs with similar zero-shot performance differ markedly under few-shot adaptation, and some models adapt more easily than others; the performance of existing methods varies widely, with no clear winner. Conclusion: More effective few-shot adaptation methods tailored to remote sensing are needed; the proposed benchmark and open-source framework support research in this direction. Abstract: Remote Sensing Vision-Language Models (RSVLMs) have shown remarkable potential thanks to large-scale pretraining, achieving strong zero-shot performance on various tasks. However, their ability to generalize in low-data regimes, such as few-shot learning, remains insufficiently explored. In this work, we present the first structured benchmark for evaluating few-shot adaptation methods on RSVLMs. We conduct comprehensive experiments across ten remote sensing scene classification datasets, applying five widely used few-shot adaptation strategies to three state-of-the-art RSVLMs with varying backbones. Our findings reveal that models with similar zero-shot performance can exhibit markedly different behavior under few-shot adaptation, with some RSVLMs being inherently more amenable to such adaptation than others. The variability of performance and the absence of a clear winner among existing methods highlight the need for the development of more robust methods for few-shot adaptation tailored to RS. To facilitate future research, we provide a reproducible benchmarking framework and open-source code to systematically evaluate RSVLMs under few-shot conditions. The source code is publicly available on GitHub: https://github.com/elkhouryk/fewshot_RSVLMs

[181] Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods

Chenfei Liao,Wensong Wang,Zichen Wen,Xu Zheng,Yiyu Wang,Haocong He,Yuanhuiyi Lyu,Lutao Jiang,Xin Zou,Yuqian Fu,Bin Ren,Linfeng Zhang,Xuming Hu

Main category: cs.CV

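TL;DR: This paper proposes VTC-Bench, a new evaluation framework for assessing visual token compression methods in multimodal large language models more fairly and accurately. It finds that simple image downsampling often outperforms advanced compression methods on existing benchmarks, and that current benchmarks suffer from task mismatch and noise.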

Details

Motivation: Existing benchmarks were designed to assess the perception and reasoning capabilities of MLLMs rather than visual token compression, so using them to evaluate compression methods introduces a task mismatch. Method: Analyzes the performance of existing compression methods on multiple benchmarks, proposes using image downsampling as a data filter, and builds VTC-Bench, an evaluation framework with a data filtering mechanism that reduces noise. Result: Simple downsampling often outperforms advanced compression methods; current benchmarks are shown to be noisy for the visual token compression task; VTC-Bench evaluates compression methods more effectively. Conclusion: The suitability of existing benchmarks for visual token compression should be re-examined; VTC-Bench improves the fairness and accuracy of evaluation through its data filtering mechanism. Abstract: Recent endeavors to accelerate inference in Multimodal Large Language Models (MLLMs) have primarily focused on visual token compression. The effectiveness of these methods is typically assessed by measuring the accuracy drop on established benchmarks, comparing model performance before and after compression. However, these benchmarks were originally designed to assess the perception and reasoning capabilities of MLLMs, rather than to evaluate compression techniques. As a result, directly applying them to visual token compression introduces a task mismatch. Strikingly, our investigation reveals that simple image downsampling consistently outperforms many advanced compression methods across multiple widely used benchmarks. Through extensive experiments, we make the following observations: (i) Current benchmarks are noisy for the visual token compression task. (ii) Down-sampling is able to serve as a data filter to evaluate the difficulty of samples in the visual token compression task. Motivated by these findings, we introduce VTC-Bench, an evaluation framework that incorporates a data filtering mechanism to denoise existing benchmarks, thereby enabling fairer and more accurate assessment of visual token compression methods. All data and code are available at https://github.com/Chenfei-Liao/VTC-Bench.
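
One way to picture downsampling as a data filter is sketched below. The exact grouping rule used by VTC-Bench is not given in the abstract, so the rule here is an assumption: samples the downsampled model still answers correctly are treated as easy, samples it gets wrong are treated as hard, and samples the uncompressed model already fails are dropped. The field names and functions are hypothetical.

```python
def split_by_downsampling(samples):
    """Split benchmark samples by whether a downsampled-image baseline
    still answers correctly (assumed filtering rule, for illustration).

    Each sample is a dict with boolean fields 'original_correct' and
    'downsampled_correct'.
    """
    hard, easy = [], []
    for s in samples:
        if not s["original_correct"]:
            # The uncompressed model already fails, so this sample says
            # little about what compression loses.
            continue
        (easy if s["downsampled_correct"] else hard).append(s)
    return hard, easy

def accuracy(samples, key):
    """Fraction of samples where the boolean field `key` is true."""
    if not samples:
        return float("nan")
    return sum(bool(s[key]) for s in samples) / len(samples)

results = [
    {"original_correct": True, "downsampled_correct": True, "method_correct": True},
    {"original_correct": True, "downsampled_correct": False, "method_correct": True},
    {"original_correct": True, "downsampled_correct": False, "method_correct": False},
    {"original_correct": False, "downsampled_correct": False, "method_correct": False},
]
hard, easy = split_by_downsampling(results)
hard_accuracy = accuracy(hard, "method_correct")
easy_accuracy = accuracy(easy, "method_correct")
```

Reporting a compression method's accuracy on the hard group separately shows whether it preserves information that plain downsampling loses.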

[182] MV-Performer: Taming Video Diffusion Model for Faithful and Synchronized Multi-view Performer Synthesis

Yihao Zhi,Chenghong Li,Hongjie Liao,Xihe Yang,Zhengwentai Sun,Jiahao Chang,Xiaodong Cun,Wensen Feng,Xiaoguang Han

Main category: cs.CV

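TL;DR: This paper proposes MV-Performer, a framework that generates synchronized multi-view videos from monocular full-body video. Using the MVHumanNet dataset and normal maps rendered from point clouds as the condition signal, it achieves 360-degree view synthesis and state-of-the-art human-centric 4D novel view synthesis on multiple datasets.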

Details

Motivation: Existing video diffusion models mainly redirect the camera within the front view and struggle with 360-degree viewpoint changes; in human-centric scenes in particular, they cannot generate synchronized multi-view videos. Method: Proposes MV-Performer, which uses camera-dependent normal maps rendered from oriented partial point clouds as the condition signal, designs a multi-view human-centric video diffusion model that fuses the reference video, partial renderings, and multi-view information, and provides a robust inference procedure for in-the-wild videos that reduces artifacts caused by monocular depth estimation errors. Result: Extensive experiments on three datasets show that MV-Performer achieves strong synchronization, realism, and robustness in 360-degree novel view video generation and clearly outperforms existing methods. Conclusion: MV-Performer addresses the viewpoint coverage and synchronization problems in human-centric 4D novel view synthesis and provides a new solution for diffusion-based multi-view video generation. Abstract: Recent breakthroughs in video generation, powered by large-scale datasets and diffusion techniques, have shown that video diffusion models can function as implicit 4D novel view synthesizers. Nevertheless, current methods primarily concentrate on redirecting camera trajectory within the front view while struggling to generate 360-degree viewpoint changes. In this paper, we focus on the human-centric subdomain and present MV-Performer, an innovative framework for creating synchronized novel view videos from monocular full-body captures. To achieve a 360-degree synthesis, we extensively leverage the MVHumanNet dataset and incorporate an informative condition signal. Specifically, we use the camera-dependent normal maps rendered from oriented partial point clouds, which effectively alleviate the ambiguity between seen and unseen observations. To maintain synchronization in the generated videos, we propose a multi-view human-centric video diffusion model that fuses information from the reference video, partial rendering, and different viewpoints. Additionally, we provide a robust inference procedure for in-the-wild video cases, which greatly mitigates the artifacts induced by imperfect monocular depth estimation. Extensive experiments on three datasets demonstrate our MV-Performer's state-of-the-art effectiveness and robustness, setting a strong model for human-centric 4D novel view synthesis.

[183] Resolution scaling governs DINOv3 transfer performance in chest radiograph classification

Soroosh Tayebi Arasteh,Mina Shaigan,Christiane Kuhl,Jakob Nikolas Kather,Sven Nebelung,Daniel Truhn

Main category: cs.CV

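TL;DR: For chest radiograph analysis, DINOv3 with 512x512 inputs and a ConvNeXt-B backbone performs best, clearly ahead of ImageNet initialization and other settings. Raising the resolution further to 1024x1024 brings no notable gain, and frozen features from the 7B model underperform fine-tuned small and mid-sized models, which underlines the importance of domain adaptation.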

Details

Motivation: To examine how modern self-supervised models such as DINOv3 transfer to chest radiography, a high-volume task with fine-grained findings, and in particular how performance differs across resolutions, backbones, and populations (adult vs. pediatric). Method: Systematically evaluates DINOv3, DINOv2, and ImageNet initialization with ViT-B/16 and ConvNeXt-B backbones on seven datasets (n>814,000) at input resolutions of 224x224, 512x512, and 1024x1024, and also assesses frozen features from a 7B model; the primary metric is mean AUROC. Result: At 224x224, DINOv3 and DINOv2 perform comparably; at 512x512, DINOv3 consistently outperforms both DINOv2 and ImageNet; the pediatric dataset shows no differences across initializations; ConvNeXt-B outperforms ViT-B/16 overall; frozen DINOv3-7B features underperform fine-tuned small and mid-sized models; 1024x1024 brings no additional gain; resolution gains are largest for boundary-dependent and small focal abnormalities. Conclusion: For chest radiography, a DINOv3-initialized ConvNeXt-B at 512x512 performs best and is the recommended configuration for clinical use; larger models or higher resolutions offer poor return on cost, and domain adaptation through fine-tuning is essential. Abstract: Self-supervised learning (SSL) has advanced visual representation learning, but its value in chest radiography, a high-volume imaging modality with fine-grained findings, remains unclear. Meta's DINOv3 extends earlier SSL models through Gram-anchored self-distillation. Whether these design choices improve transfer learning for chest radiography has not been systematically tested. We benchmarked DINOv3 against DINOv2 and ImageNet initialization across seven datasets (n>814,000). Two representative backbones were evaluated: ViT-B/16 and ConvNeXt-B. Images were analyzed at 224x224, 512x512, and 1024x1024 pixels. We additionally assessed frozen features from a 7B model. The primary outcome was mean AUROC across labels. At 224x224, DINOv3 and DINOv2 achieved comparable performance on adult datasets. Increasing resolution to 512x512 yielded consistent improvements for DINOv3 over both DINOv2 and ImageNet. In contrast, results in the pediatric cohort showed no differences across initializations. Across all settings, ConvNeXt-B outperformed ViT-B/16. Models using frozen DINOv3-7B features underperformed relative to fully finetuned 86-89M-parameter backbones, highlighting the importance of domain adaptation. Scaling to 1024x1024 did not further improve accuracy. Resolution-related gains were most evident for boundary-dependent and small focal abnormalities. In chest radiography, higher input resolution is critical for leveraging the benefits of modern self-supervised models. 
512x512 pixels represent a practical upper limit where DINOv3-initialized ConvNeXt-B networks provide the strongest performance, while larger inputs offer minimal return on cost. Clinically, these findings support use of finetuned, mid-sized backbones at 512x512 for chest radiograph interpretation, with the greatest gains expected in detecting subtle or boundary-centered lesions relevant to emergency and critical care settings.
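
The primary outcome, mean AUROC across labels, can be computed as sketched below. This is a generic illustration rather than the study's evaluation code: AUROC is obtained from pairwise comparisons of positive and negative scores (ties count one half), and labels with only one class present are skipped, which is an assumption about edge-case handling.

```python
import numpy as np

def auroc(y_true, y_score):
    """AUROC as the probability that a random positive outscores a
    random negative, with ties counted as one half."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # undefined when only one class is present
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def mean_auroc(Y_true, Y_score):
    """Average the per-label AUROC over the columns of a multi-label
    problem, ignoring labels where AUROC is undefined."""
    Y_true, Y_score = np.asarray(Y_true), np.asarray(Y_score)
    per_label = [auroc(Y_true[:, j], Y_score[:, j])
                 for j in range(Y_true.shape[1])]
    return float(np.nanmean(per_label))

# Four radiographs, two findings (toy scores).
labels = np.array([[0, 0], [0, 0], [1, 1], [1, 1]])
scores = np.array([[0.10, 0.2], [0.40, 0.3], [0.35, 0.8], [0.80, 0.9]])
result = mean_auroc(labels, scores)
```

For the toy data the first finding has AUROC 0.75 and the second 1.0, so `result` is 0.875. The pairwise form is quadratic in the number of samples; a rank-based computation would be preferable at the scale of the datasets described above.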

[184] EigenScore: OOD Detection using Covariance in Diffusion Models

Shirin Shoushtari,Yi Wang,Xiao Shi,M. Salman Asif,Ulugbek S. Kamilov

Main category: cs.CV

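TL;DR: Proposes EigenScore, a new OOD detection method that uses the eigenvalue spectrum of the posterior covariance of a diffusion model for out-of-distribution detection; it reaches SOTA performance on multiple benchmarks and is especially robust in near-OOD settings.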

Details

Motivation: Existing diffusion-based OOD detection methods perform poorly on OOD samples that lie close to the in-distribution data and lack a stable, reliable detection signal, so a more robust detection mechanism is needed. Method: Uses the eigenvalue spectrum of the posterior covariance matrix induced by the diffusion denoiser as the detection signal, and estimates the leading eigenvalues efficiently with a Jacobian-free subspace iteration that requires only forward evaluations. Result: EigenScore achieves state-of-the-art performance on several standard datasets, with up to 5% AUROC improvement, and is markedly more robust than existing methods in near-OOD settings such as CIFAR-10 vs CIFAR-100. Conclusion: The spectral properties of the posterior covariance are a reliable signal for OOD detection, and EigenScore offers an efficient, robust approach to diffusion-based OOD detection. Abstract: Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems in safety-sensitive domains. Diffusion models have recently emerged as powerful generative models, capable of capturing complex data distributions through iterative denoising. Building on this progress, recent work has explored their potential for OOD detection. We propose EigenScore, a new OOD detection method that leverages the eigenvalue spectrum of the posterior covariance induced by a diffusion model. We argue that posterior covariance provides a consistent signal of distribution shift, leading to larger trace and leading eigenvalues on OOD inputs, yielding a clear spectral signature. We further provide analysis explicitly linking posterior covariance to distribution mismatch, establishing it as a reliable signal for OOD detection. To ensure tractability, we adopt a Jacobian-free subspace iteration method to estimate the leading eigenvalues using only forward evaluations of the denoiser. Empirically, EigenScore achieves SOTA performance, with up to 5% AUROC improvement over the best baseline. Notably, it remains robust in near-OOD settings such as CIFAR-10 vs CIFAR-100, where existing diffusion-based methods often fail.
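
A Jacobian-free subspace iteration of the kind mentioned above can be sketched as follows. This is a generic sketch under stated assumptions, not the authors' algorithm: it assumes the quantity of interest is the spectrum of the denoiser's Jacobian at the input (which, for Gaussian noise, is proportional to the posterior covariance), that this Jacobian is symmetric, and it replaces Jacobian-vector products with central finite differences of forward evaluations. The function name, step size, and the toy linear denoiser are all hypothetical.

```python
import numpy as np

def leading_eigenvalues(denoiser, x, k=2, iters=50, eps=1e-3, seed=0):
    """Estimate the k leading eigenvalues of the denoiser's Jacobian at x
    using only forward evaluations (no explicit Jacobian is formed)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    Q, _ = np.linalg.qr(rng.standard_normal((x.size, k)))

    def jvp(v):
        # Central finite difference approximates J(x) @ v.
        return (denoiser(x + eps * v) - denoiser(x - eps * v)) / (2 * eps)

    def apply_to_basis(Q):
        return np.stack([jvp(Q[:, j]) for j in range(k)], axis=1)

    for _ in range(iters):
        # Subspace iteration: multiply the basis by J, then re-orthonormalise.
        Q, _ = np.linalg.qr(apply_to_basis(Q))

    # Rayleigh-Ritz step: eigenvalues of J restricted to the subspace.
    H = Q.T @ apply_to_basis(Q)
    return np.sort(np.linalg.eigvalsh((H + H.T) / 2))[::-1]

# Toy check with a linear "denoiser" whose Jacobian is known exactly.
A = np.diag([0.9, 0.5, 0.1, 0.05])
estimate = leading_eigenvalues(lambda z: A @ z, np.ones(4), k=2)
```

For the toy linear map the estimate converges to the two largest eigenvalues, 0.9 and 0.5. With a real denoiser each iteration costs 2k forward passes, which is the price of avoiding backpropagation through the network.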

[185] GenPilot: A Multi-Agent System for Test-Time Prompt Optimization in Image Generation

Wen Ye,Zhaocheng Liu,Yuwei Gui,Tingyu Yuan,Yunyue Su,Bowen Fang,Chaoyang Zhao,Qiang Liu,Liang Wang

Main category: cs.CV

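TL;DR: Proposes GenPilot, a model-agnostic, interpretable test-time prompt optimization strategy in which a multi-agent system performs error analysis, adaptive exploration, and iterative optimization, substantially improving text-image consistency and generation quality on long, complex prompts.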

Details

Motivation: Existing methods suffer from semantic inconsistencies and missing details on long, complex prompts; most require training or lack a systematic optimization mechanism, so a general, efficient, and interpretable prompt optimization approach is needed. Method: Designs GenPilot, a plug-and-play multi-agent system comprising error analysis, clustering-based adaptive exploration, fine-grained verification, and a memory module, which iteratively optimizes the prompt directly on the input text. Result: Achieves improvements of up to 16.9% on DPG-bench and 5.7% on Geneval, with clearly better text-image consistency and structural coherence. Conclusion: GenPilot is a flexible, model-agnostic, and interpretable test-time prompt optimization framework that addresses text-to-image generation with complex prompts and offers a systematic approach and practical experience for automatic prompt optimization. Abstract: Text-to-image synthesis has made remarkable progress, yet accurately interpreting complex and lengthy prompts remains challenging, often resulting in semantic inconsistencies and missing details. Existing solutions, such as fine-tuning, are model-specific and require training, while prior automatic prompt optimization (APO) approaches typically lack systematic error analysis and refinement strategies, resulting in limited reliability and effectiveness. Meanwhile, test-time scaling methods operate on fixed prompts and on noise or sample numbers, limiting their interpretability and adaptability. To solve these, we introduce a flexible and efficient test-time prompt optimization strategy that operates directly on the input text. We propose a plug-and-play multi-agent system called GenPilot, integrating error analysis, clustering-based adaptive exploration, fine-grained verification, and a memory module for iterative optimization. Our approach is model-agnostic, interpretable, and well-suited for handling long and complex prompts. Simultaneously, we summarize the common patterns of errors and the refinement strategy, offering more experience and encouraging further exploration. Experiments on DPG-bench and Geneval with improvements of up to 16.9% and 5.7% demonstrate the strong capability of our methods in enhancing the text and image consistency and structural coherence of generated images, revealing the effectiveness of our test-time prompt optimization strategy. The code is available at https://github.com/27yw/GenPilot.

[186] TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation

Jiaben Chen,Zixin Wang,Ailing Zeng,Yang Fu,Xueyang Yu,Siyuan Cen,Julian Tanke,Yihang Chen,Koichi Saito,Yuki Mitsufuji,Chuang Gan

Main category: cs.CV

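TL;DR: This paper presents TalkCuts, a large-scale dataset for multi-shot human speech video generation with 164k clips and over 500 hours of high-quality video, covering diverse camera shots with rich multimodal annotations. It also proposes Orator as a baseline framework, in which a large language model guides multimodal video generation; experiments show that training on TalkCuts significantly improves the visual coherence and expressiveness of generated videos.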

Details

Motivation: Existing datasets are mostly limited to single-shot, fixed-viewpoint human speech videos and cannot support research on complex, dynamic multi-shot video generation, so a large-scale multimodal dataset with diverse shot transitions and rich motion and speech information is needed. Method: Builds TalkCuts, a large dataset with detailed textual descriptions, 2D keypoints, and 3D SMPL-X motion annotations, and proposes the Orator framework, in which a large language model acts as a "director" that produces multimodal instructions for camera transitions, gestures, and vocal modulation, and an integrated multi-modal generation module synthesizes coherent long-form videos. Result: Extensive experiments in pose-guided and audio-driven settings show that training on TalkCuts significantly improves the cinematographic coherence and visual appeal of generated videos; Orator coordinates multimodal outputs effectively and enables controllable multi-shot speaker video generation. Conclusion: TalkCuts provides a solid data foundation for controllable multi-shot human speech video generation, and Orator demonstrates the potential of LLMs in multimodal video generation; together they advance multimodal learning and applications in this area. Abstract: In this work, we present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. Unlike existing datasets that focus on single-shot, static viewpoints, TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech videos with diverse camera shots, including close-up, half-body, and full-body views. The dataset includes detailed textual descriptions, 2D keypoints and 3D SMPL-X motion annotations, covering over 10k identities, enabling multimodal learning and evaluation. As a first attempt to showcase the value of the dataset, we present Orator, an LLM-guided multi-modal generation framework as a simple baseline, where the language model functions as a multi-faceted director, orchestrating detailed specifications for camera transitions, speaker gesticulations, and vocal modulation. This architecture enables the synthesis of coherent long-form videos through our integrated multi-modal video generation module. Extensive experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly enhances the cinematographic coherence and visual appeal of generated multi-shot speech videos. We believe TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning.

[187] Evaluating Fundus-Specific Foundation Models for Diabetic Macular Edema Detection

Franco Javier Arellano,José Ignacio Orlando

Main category: cs.CV

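TL;DR: This study systematically compares foundation models (FM) with standard transfer learning for diabetic macular edema (DME) detection. Despite the scale of the FMs, a lightweight CNN (EfficientNet-B0) matches or outperforms them in most evaluation settings, indicating that conventional CNNs remain strong baselines when data is scarce.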

Details

Motivation: Limited annotated data makes deep learning for DME detection challenging, and it is unclear whether emerging foundation models (FM) suit such fine-grained ophthalmic tasks, so their effectiveness needs systematic evaluation. Method: Compares two popular retinal image foundation models (RETFound and FLAIR) with EfficientNet-B0 on three datasets (IDRiD, MESSIDOR-2, and OEFI) under different training regimes and evaluation settings for DME detection. Result: EfficientNet-B0 ranks first or second in terms of area under the ROC and precision/recall curves in most evaluation settings; RETFound performs well only on OEFI; FLAIR shows competitive zero-shot performance when prompted appropriately. Conclusion: Foundation models do not consistently outperform fine-tuned CNNs on fine-grained ophthalmic tasks such as DME detection, suggesting that lightweight CNNs remain reliable and efficient baselines in data-scarce settings. Abstract: Diabetic Macular Edema (DME) is a leading cause of vision loss among patients with Diabetic Retinopathy (DR). While deep learning has shown promising results for automatically detecting this condition from fundus images, its application remains challenging due to the limited availability of annotated data. Foundation Models (FM) have emerged as an alternative solution. However, it is unclear if they can cope with DME detection in particular. In this paper, we systematically compare different FM and standard transfer learning approaches for this task. Specifically, we compare the two most popular FM for retinal images--RETFound and FLAIR--and an EfficientNet-B0 backbone, across different training regimes and evaluation settings in IDRiD, MESSIDOR-2 and OCT-and-Eye-Fundus-Images (OEFI). Results show that despite their scale, FM do not consistently outperform fine-tuned CNNs in this task. In particular, an EfficientNet-B0 ranked first or second in terms of area under the ROC and precision/recall curves in most evaluation settings, with RETFound only showing promising results in OEFI. FLAIR, on the other hand, demonstrated competitive zero-shot performance, achieving notable AUC-PR scores when prompted appropriately. These findings reveal that FM might not be a good tool for fine-grained ophthalmic tasks such as DME detection even after fine-tuning, suggesting that lightweight CNNs remain strong baselines in data-scarce environments.

[188] SpecGuard: Spectral Projection-based Advanced Invisible Watermarking

Inzamamul Alam,Md Tanvir Islam,Khan Muhammad,Simon S. Woo

Main category: cs.CV

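TL;DR: This paper proposes SpecGuard, a new image watermarking method that embeds the message inside convolution layers via frequency-domain transformation and wavelet decomposition. It is highly robust and invisible, resists a range of attacks, and outperforms existing methods in experiments.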

Details

Motivation: Existing watermarking methods are not robust enough to transformations such as image distortion, regeneration, and adversarial perturbation, which limits their use in real-world settings. Method: Decomposes the image into a higher frequency band by wavelet projection and converts it to the frequency domain with a Fast Fourier Transform (spectral projection), embedding the watermark message inside hidden convolution layers; a strength factor increases resilience to attacks, and the decoder extracts the watermark based on Parseval's theorem. Result: Experiments show that SpecGuard outperforms state-of-the-art models in invisibility, capacity, and robustness, and still recovers the watermark accurately under adversarial, geometric, and regeneration-based attacks. Conclusion: SpecGuard is an efficient and robust image watermarking scheme whose frequency-domain embedding and theory-driven decoding markedly improve watermark stability and reliability under complex transformations. Abstract: Watermarking embeds imperceptible patterns into images for authenticity verification. However, existing methods often lack robustness against various transformations primarily including distortions, image regeneration, and adversarial perturbation, creating real-world challenges. In this work, we introduce SpecGuard, a novel watermarking approach for robust and invisible image watermarking. Unlike prior approaches, we embed the message inside hidden convolution layers by converting from the spatial domain to the frequency domain using spectral projection of a higher frequency band that is decomposed by wavelet projection. Spectral projection employs Fast Fourier Transform approximation to transform spatial data into the frequency domain efficiently. In the encoding phase, a strength factor enhances resilience against diverse attacks, including adversarial, geometric, and regeneration-based distortions, ensuring the preservation of copyrighted information. Meanwhile, the decoder leverages Parseval's theorem to effectively learn and extract the watermark pattern, enabling accurate retrieval under challenging transformations. We evaluate the proposed SpecGuard based on the embedded watermark's invisibility, capacity, and robustness. Comprehensive experiments demonstrate the proposed SpecGuard outperforms the state-of-the-art models. To ensure reproducibility, the full code is released on GitHub: https://github.com/inzamamulDU/SpecGuard_ICCV_2025.
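
Parseval's theorem, which the decoder is said to rely on, states that a signal's energy is the same in the spatial and frequency domains. The sketch below only verifies this identity numerically with NumPy's FFT and shows that a perturbation added to one frequency coefficient carries the same energy in both domains; it does not reproduce SpecGuard's encoder or decoder, and the chosen coefficient and strength value are arbitrary assumptions.

```python
import numpy as np

def spatial_energy(img):
    """Sum of squared magnitudes in the spatial domain."""
    return float(np.sum(np.abs(img) ** 2))

def spectral_energy(img):
    """Sum of squared FFT magnitudes, divided by the number of samples
    (the factor required by NumPy's unnormalised forward FFT)."""
    F = np.fft.fft2(img)
    return float(np.sum(np.abs(F) ** 2) / img.size)

rng = np.random.default_rng(0)
image = rng.random((8, 8))

# Perturb a single high-frequency coefficient with a strength factor.
strength = 0.5  # arbitrary illustrative value
F = np.fft.fft2(image)
F_marked = F.copy()
F_marked[4, 4] += strength * image.size  # (4, 4) is the Nyquist bin of an 8x8 grid
marked = np.real(np.fft.ifft2(F_marked))

# Energy of the change, measured in each domain.
delta_spatial = spatial_energy(marked - image)
delta_spectral = float(np.sum(np.abs(F_marked - F) ** 2) / image.size)
```

Because the two energies agree, a constraint or loss written on frequency coefficients bounds the visible change in pixel space by the same amount, which is one reason frequency-domain embedding is convenient to reason about.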

[189] MATRIX: Mask Track Alignment for Interaction-aware Video Generation

Siyoon Jin,Seongchan Kim,Dahyun Chung,Jaeho Lee,Hyunwook Choi,Jisu Nam,Jiyoung Kim,Seungryong Kim

Main category: cs.CV

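TL;DR: This paper presents the MATRIX-11K dataset and the MATRIX regularization method for analyzing and improving how video DiTs model multi-instance and subject-object interactions, improving generation quality through semantic grounding and propagation.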

Details

Motivation: Existing video DiTs fall short in modeling multi-instance or subject-object interactions, and how these models represent interactions internally is not understood. Method: Curates the MATRIX-11K dataset, introduces video-to-text and video-to-video attention as analysis perspectives, and proposes MATRIX, a regularization that aligns attention with multi-instance mask tracks. Result: Experiments show that MATRIX improves interaction fidelity and semantic alignment while reducing drift and hallucination, and ablations validate the design. Conclusion: Targeted attention regularization in key layers strengthens interaction modeling in video DiTs and improves video generation quality in complex scenes. Abstract: Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: How do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with multi-instance mask tracks from the MATRIX-11K dataset, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination. Extensive ablations validate our design choices. Codes and weights will be released.

[190] WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation

Zezhong Qian,Xiaowei Chi,Yuming Li,Shizun Wang,Zhiyuan Qin,Xiaozhu Ju,Sirui Han,Shanghang Zhang

Main category: cs.CV

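TL;DR: Proposes WristWorld, the first 4D world model that generates wrist-view videos from anchor views alone, using a two-stage reconstruction and generation pipeline that improves video generation quality and VLA performance.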

Details

Motivation: Large-scale datasets rarely include wrist-view recordings, so anchor views are abundant while wrist views are scarce, and existing world models cannot generate wrist-view videos from anchor views alone. Method: Two stages: (i) reconstruction, which extends VGGT and introduces a Spatial Projection Consistency (SPC) loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) generation, which uses a video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Result: Achieves state-of-the-art video generation with better spatial consistency on Droid, Calvin, and Franka Panda, and improves VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor-wrist view gap. Conclusion: WristWorld generates high-quality wrist-view videos from anchor views, substantially narrows the view gap, and improves performance on robotic manipulation tasks. Abstract: Wrist-view observations are crucial for VLA models as they capture fine-grained hand-object interactions that directly enhance manipulation performance. Yet large-scale datasets rarely include such recordings, resulting in a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap, as they require a wrist-view first frame and thus fail to generate wrist-view videos from anchor views alone. Amid this gap, recent visual geometry models such as VGGT emerge with geometric and cross-view priors that make it possible to address extreme viewpoint shifts. Inspired by these insights, we propose WristWorld, the first 4D world model that generates wrist-view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation, which employs our video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda demonstrate state-of-the-art video generation with superior spatial consistency, while also improving VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor-wrist view gap.

[191] Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers

Gangwei Xu,Haotong Lin,Hongcheng Luo,Xianqi Wang,Jingfeng Yao,Lianghui Zhu,Yuechuan Pu,Cheng Chi,Haiyang Sun,Bing Wang,Guang Chen,Hangjun Ye,Sida Peng,Xin Yang

Main category: cs.CV

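TL;DR: This paper proposes Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation that avoids the flying pixels introduced by VAE compression and produces high-quality, flying-pixel-free point clouds.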

Details

Motivation: Existing generative depth estimation models rely on a VAE to compress depth maps into latent space, which introduces flying pixels at edges and details and degrades point cloud quality. Method: Performs diffusion generation directly in pixel space and designs Semantics-Prompted Diffusion Transformers (SP-DiT) and a Cascade DiT structure, using semantic information from vision foundation models to improve generation quality and efficiency. Result: Outperforms all published generative models on five benchmarks and leads by a wide margin in edge-aware point cloud evaluation. Conclusion: Through pixel-space diffusion and new network designs, Pixel-Perfect Depth avoids VAE-induced artifacts and achieves more accurate depth estimation with richer detail. Abstract: This paper presents Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation that produces high-quality, flying-pixel-free point clouds from estimated depth maps. Current generative depth estimation models fine-tune Stable Diffusion and achieve impressive performance. However, they require a VAE to compress depth maps into latent space, which inevitably introduces flying pixels at edges and details. Our model addresses this challenge by directly performing diffusion generation in the pixel space, avoiding VAE-induced artifacts. To overcome the high complexity associated with pixel-space generation, we introduce two novel designs: 1) Semantics-Prompted Diffusion Transformers (SP-DiT), which incorporate semantic representations from vision foundation models into DiT to prompt the diffusion process, thereby preserving global semantic consistency while enhancing fine-grained visual details; and 2) Cascade DiT Design that progressively increases the number of tokens to further enhance efficiency and accuracy. Our model achieves the best performance among all published generative models across five benchmarks, and significantly outperforms all other models in edge-aware point cloud evaluation.

[192] Quantum-enhanced Computer Vision: Going Beyond Classical Algorithms

Natacha Kuete Meli,Shuteng Wang,Marcel Seelbach Benkner,Michele Sasdelli,Tat-Jun Chin,Tolga Birdal,Michael Moeller,Vladislav Golyanik

Main category: cs.CV

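TL;DR: This paper surveys quantum-enhanced computer vision (QeCV), an emerging interdisciplinary field, covering its fundamentals, methods, hardware compatibility, and outlook, and aims to serve as an introductory quantum computing reference for computer vision researchers.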

Details

Motivation: Classical methods for some vision tasks have high time complexity or yield only approximate solutions, whereas quantum computing can exploit quantum-mechanical effects to offer better time scalability and computational power, which motivates exploring its potential in computer vision. Method: A survey that systematically introduces QeCV methods based on the two main quantum computing paradigms, gate-based quantum computing and quantum annealing, discusses algorithm design compatible with quantum hardware, and presents available programming tools, simulation platforms, and learning resources. Result: Provides a comprehensive overview of the QeCV field, including technical principles, existing tools, research challenges, and publishing practices, helping the computer vision community understand and enter the field. Conclusion: QeCV has great potential to change how visual signals are processed, but specialised new algorithms must be developed to fully exploit quantum computing, and parametrised quantum circuits may in the long term become a considerable alternative to classical neural networks. Abstract: Quantum-enhanced Computer Vision (QeCV) is a new research field at the intersection of computer vision, optimisation theory, machine learning and quantum computing. It has high potential to transform how visual signals are processed and interpreted with the help of quantum computing that leverages quantum-mechanical effects in computations inaccessible to classical (i.e. non-quantum) computers. In scenarios where existing non-quantum methods cannot find a solution in a reasonable time or compute only approximate solutions, quantum computers can provide, among others, advantages in terms of better time scalability for multiple problem classes. Parametrised quantum circuits can also become, in the long term, a considerable alternative to classical neural networks in computer vision. However, specialised and fundamentally new algorithms must be developed to enable compatibility with quantum hardware and unveil the potential of quantum computational paradigms in computer vision. This survey contributes to the existing literature on QeCV with a holistic review of this research field. It is designed as a quantum computing reference for the computer vision community, targeting computer vision students, scientists and readers with related backgrounds who want to familiarise themselves with QeCV. We provide a comprehensive introduction to QeCV, its specifics, and methodologies for formulations compatible with quantum hardware and QeCV methods, leveraging two main quantum computational paradigms, i.e. gate-based quantum computing and quantum annealing. We elaborate on the operational principles of quantum computers and the available tools to access, program and simulate them in the context of QeCV. Finally, we review existing quantum computing tools and learning materials and discuss aspects related to publishing and reviewing QeCV papers, open challenges and potential social implications.
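
As a rough illustration of what a "formulation compatible with quantum hardware" can look like for quantum annealing, the sketch below writes a toy selection problem as a quadratic unconstrained binary optimisation (QUBO) objective, the kind of objective annealers minimise. The abstract does not name QUBO or this example; the toy problem (pick exactly one of three candidates at minimum cost), the function names, and the penalty weight are all assumptions, and the solver is a classical brute-force stand-in rather than an annealer.

```python
import itertools
import numpy as np

def build_one_hot_qubo(costs, penalty):
    """Upper-triangular Q for: minimise sum(c_i x_i) + penalty * (sum(x_i) - 1)^2.

    Expanding the penalty with x_i^2 = x_i gives diagonal terms (c_i - penalty)
    and pairwise terms 2 * penalty; the constant offset is dropped.
    """
    n = len(costs)
    Q = np.zeros((n, n))
    for i in range(n):
        Q[i, i] = costs[i] - penalty
        for j in range(i + 1, n):
            Q[i, j] = 2.0 * penalty
    return Q

def brute_force_qubo(Q):
    """Classical stand-in for an annealer: try every binary vector."""
    n = Q.shape[0]
    best_x, best_e = None, float("inf")
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits)
        e = float(x @ Q @ x)
        if e < best_e:
            best_x, best_e = bits, e
    return best_x, best_e

costs = [3.0, 1.0, 2.0]          # hypothetical matching costs
Q = build_one_hot_qubo(costs, penalty=10.0)
best_x, best_energy = brute_force_qubo(Q)
print(best_x, best_energy)
```

The point of the exercise is the encoding step: the "exactly one" constraint is folded into the objective as a penalty, since the objective has no separate mechanism for hard constraints.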

[193] Temporal Prompting Matters: Rethinking Referring Video Object Segmentation

Ci-Siang Lin,Min-Hung Chen,I-Jieh Liu,Chien-Yi Wang,Sifei Liu,Yu-Chiang Frank Wang

Main category: cs.CV

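TL;DR: Proposes Tenet, a framework that decomposes the RVOS task, uses off-the-shelf object detectors and trackers to generate temporal prompts, and applies prompt preference learning to assess prompt quality, thereby efficiently adapting image-based foundation segmentation models to referring video object segmentation.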

Details

Motivation: Existing methods require end-to-end training with dense mask annotations, which is computationally expensive and scales poorly, prompting a rethink of what is key to the RVOS problem. Method: Decomposes RVOS into referring, video, and segmentation factors; uses off-the-shelf object detectors and trackers to generate temporal prompts associated with the referring sentence; and proposes Prompt Preference Learning to evaluate prompt quality, which then guides image-based foundation segmentation models to perform segmentation. Result: Experiments on RVOS benchmarks show that the Tenet framework adapts efficiently and produces high-quality masks, significantly improving performance on referring video object segmentation. Conclusion: Through its decoupled design and prompt preference learning, Tenet efficiently adapts image-based foundation models and offers a more scalable and practical solution for RVOS. Abstract: Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence in the video. Most existing methods require end-to-end training with dense mask annotations, which could be computation-consuming and less scalable. In this work, we rethink the RVOS problem and aim to investigate the key to this task. Based on existing foundation segmentation models, we decompose the RVOS task into referring, video, and segmentation factors, and propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors while leaving the segmentation problem to foundation models. To efficiently adapt image-based foundation segmentation models to referring video object segmentation, we leverage off-the-shelf object detectors and trackers to produce temporal prompts associated with the referring sentence. While high-quality temporal prompts could be produced, they cannot be easily identified from confidence scores. To tackle this issue, we propose Prompt Preference Learning to evaluate the quality of the produced temporal prompts. By taking such prompts to instruct image-based foundation segmentation models, we would be able to produce high-quality masks for the referred object, enabling efficient model adaptation to referring video object segmentation. Experiments on RVOS benchmarks demonstrate the effectiveness of the Tenet framework.

Table of Contents

cs.CL [Back]

[1] OpenStaxQA: A multilingual dataset based on open-source college textbooks

[2] Knowledge Graph-Guided Multi-Agent Distillation for Reliable Industrial Question Answering with Datasets

[3] Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses

[4] CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning

[5] Evaluating Embedding Frameworks for Scientific Domain

[6] TRepLiNa: Layer-wise CKA+REPINA Alignment Improves Low-Resource Machine Translation in Aya-23 8B

[7] Scalable multilingual PII annotation for responsible AI in LLMs

[8] Prakriti200: A Questionnaire-Based Dataset of 200 Ayurvedic Prakriti Assessments

[9] Dual-stage and Lightweight Patient Chart Summarization for Emergency Physicians

[10] A Comprehensive Survey of Hallucination in Large Language Models: Causes, Detection, and Mitigation

[11] Language models for longitudinal analysis of abusive content in Billboard Music Charts

[12] Reproducibility Study of "XRec: Large Language Models for Explainable Recommendation"

[13] Type and Complexity Signals in Multilingual Question Representations

[14] LLM Bias Detection and Mitigation through the Lens of Desired Distributions

[15] EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preference

[16] EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA

[17] Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language

[18] Protecting De-identified Documents from Search-based Linkage Attacks

[19] Controllable Stylistic Text Generation with Train-Time Attribute-Regularized Diffusion

[20] Reward Model Perspectives: Whose Opinions Do Reward Models Reward?

[21] Instructional Goal-Aligned Question Generation for Student Evaluation in Virtual Lab Settings: How Closely Do LLMs Actually Align?

[22] FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering

[23] Bridging Discourse Treebanks with a Unified Rhetorical Structure Parser

[24] MathRobust-LV: Evaluation of Large Language Models' Robustness to Linguistic Variations in Mathematical Reasoning

[25] A Survey on Agentic Security: Applications, Threats and Defenses

[26] Linguistically Informed Tokenization Improves ASR for Underresourced Languages

[27] Test-Time Scaling of Reasoning Models for Machine Translation

[28] Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

[29] From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining

[30] Flipping the Dialogue: Training and Evaluating User Language Models

[31] The Algebra of Meaning: Why Machines Need Montague More Than Moore's Law

[32] TinyScientist: An Interactive, Extensible, and Controllable Framework for Building Research Agents

[33] Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection?

[34] A Comparative Analysis of Contextual Representation Flow in State-Space and Transformer Architectures

[35] Aligning Large Language Models via Fully Self-Synthetic Data

[36] ToolMem: Enhancing Multimodal Agents with Learnable Tool Capability Memory

[37] PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch

[38] Incremental Summarization for Customer Support via Progressive Note-Taking and Agent Feedback

[39] Learning to Rewrite Prompts for Bootstrapping LLMs on Downstream Tasks

[40] How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects

[41] Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management

[42] PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs

[43] Are LLMs Reliable Rankers? Rank Manipulation via Two-Stage Token Optimization

[44] AWM: Accurate Weight-Matrix Fingerprint for Large Language Models

[45] TWIST: Training-free and Label-free Short Text Clustering through Iterative Vector Updating with LLMs

[46] A Formal Framework for Fluency-based Multi-Reference Evaluation in Grammatical Error Correction

[47] Gold-Switch: Training-Free Superposition of Slow- and Fast- Thinking LLMs

[48] Adaptive LLM-Symbolic Reasoning via Dynamic Logical Solver Composition

[49] Foundations of LLM Knowledge Materialization: Termination, Reproducibility, Robustness

[50] FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline

[51] Overview of the Plagiarism Detection Task at PAN 2025

[52] BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods

[53] Adaptive Tool Generation with Models as Tools and Reinforcement Learning

[54] Mid-Training of Large Language Models: A Survey

[55] GAMBIT+: A Challenge Set for Evaluating Gender Bias in Machine Translation Quality Estimation Metrics

[56] SID: Multi-LLM Debate Driven by Self Signals

[57] OpenJAI-v1.0: An Open Thai Large Language Model

[58] Unlocking Latent Discourse Translation in LLMs Through Quality-Aware Decoding

[59] $λ$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences

[60] MeXtract: Light-Weight Metadata Extraction from Scientific Papers

[61] LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling

[62] SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

[63] Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation

[64] EDUMATH: Generating Standards-aligned Educational Math Word Problems

[65] Probing Social Identity Bias in Chinese LLMs with Gendered Pronouns and Social Groups

[66] Towards Reliable Retrieval in RAG Systems for Large Legal Datasets

[67] Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages

[68] Native Hybrid Attention for Efficient Sequence Modeling

[69] Mining the Mind: What 100M Beliefs Reveal About Frontier LLM Knowledge

[70] Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models

[71] Search-R3: Unifying Reasoning and Embedding Generation in Large Language Models

[72] Does Local News Stay Local?: Online Content Shifts in Sinclair-Acquired Stations

[73] Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages

[74] LuxInstruct: A Cross-Lingual Instruction Tuning Dataset For Luxembourgish

[75] Accelerating Diffusion LLM Inference via Local Determinism Propagation

[76] All Claims Are Equal, but Some Claims Are More Equal Than Others: Importance-Sensitive Factuality Evaluation of LLM Generations

[77] Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis

[78] TALENT: Table VQA via Augmented Language-Enhanced Natural-text Transcription