cs.CL
[1] LLM-Driven Preference Data Synthesis for Proactive Prediction of the Next User Utterance in Human-Machine Dialogue
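Jinqiang Wang, Huansheng Ning, Jianguo Ding, Tao Zhu, Liming Chen, Chris Nugent
Main category: cs.CL
TL;DR: This paper proposes ProUtt, an LLM-driven preference data synthesis method for proactively predicting the user's next utterance. ProUtt builds an intent tree and explicitly models intent reasoning paths, predicts likely dialogue directions from both exploitation and exploration perspectives, and generates preferred and non-preferred reasoning processes to train a compact, task-specific language model, improving dialogue fluency while protecting privacy and lowering compute cost. Experiments show that ProUtt outperforms existing methods on multiple benchmark datasets.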
Details
Motivation: Existing API-based solutions raise privacy concerns, deploying large models locally is computationally expensive, and current user simulation and data synthesis methods do not explicitly model the reasoning behind user intent, so they struggle to actually advance the dialogue.
Method: ProUtt converts the dialogue history into an intent tree, predicts the next plausible path (exploitation and exploration), and constructs preferred and non-preferred reasoning processes by perturbing or revising paths; these are used to train a small, specialized language model.
Result: On four benchmark datasets, LLM-based scoring and human evaluation show that ProUtt significantly outperforms existing data synthesis methods, user simulators, and commercial LLM APIs at proactively predicting the user's next utterance.
Conclusion: By explicitly modeling intent reasoning and constructing preference data, ProUtt offers an effective solution for efficient, privacy-preserving next-utterance prediction and advances task-oriented dialogue systems.
Abstract: Proactively predicting a user's next utterance in human-machine dialogue can streamline interaction and improve user experience. Existing commercial API-based solutions are subject to privacy concerns, while deploying general-purpose LLMs locally remains computationally expensive. As such, training a compact, task-specific LLM provides a practical alternative. Although user simulator methods can predict a user's next utterance, they mainly imitate the user's speaking style rather than advancing the dialogue. Preference data synthesis has been investigated to generate data for proactive next utterance prediction and help align LLMs with user preferences. Yet existing methods lack the ability to explicitly model the intent reasoning that leads to the user's next utterance and to define and synthesize preference and non-preference reasoning processes for predicting the user's next utterance. To address these challenges, we propose ProUtt, an LLM-driven preference data synthesis method for proactive next utterance prediction. ProUtt converts dialogue history into an intent tree and explicitly models intent reasoning trajectories by predicting the next plausible path from both exploitation and exploration perspectives. It then constructs preference and non-preference reasoning processes by perturbing or revising intent tree paths at different future turns. Extensive evaluations using LLM-as-a-judge and human judgments demonstrate that ProUtt consistently outperforms existing data synthesis methods, user simulators, and commercial LLM APIs across four benchmark datasets.
We release both the code and the synthesized datasets to facilitate future research.
[2] Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines
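Devesh Saraogi, Rohit Singhee, Dhruv Kumar
Main category: cs.CL
TL;DR: This paper investigates whether agentic multi-step workflows can generate more novel and feasible research plans, and finds that decomposition-based and long-context workflows markedly improve creativity while maintaining feasibility.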
Details
Motivation: As LLMs are applied to scientific research, whether AI-generated research is genuinely original has become a key question; in particular, the "smart plagiarism" observed in single-step prompting has motivated researchers to explore more complex multi-step agentic workflows.
Method: Five reasoning architectures (iterative reflection, evolutionary algorithms, a multi-agent framework, recursive decomposition, and a multimodal long-context pipeline) are benchmarked on 30 proposals along three dimensions: novelty, feasibility, and impact.
Result: Decomposition-based and long-context workflows reach a mean novelty score of 4.17/5, significantly higher than reflection-based approaches (2.33/5), and the high-scoring methods do not sacrifice feasibility. Performance varies across domains.
Conclusion: Carefully designed multi-stage agentic workflows can improve the quality of AI-assisted research ideation and yield more original and practicable research plans.
Abstract: The integration of Large Language Models (LLMs) into the scientific ecosystem raises fundamental questions about the creativity and originality of AI-generated research. Recent work has identified "smart plagiarism" as a concern in single-step prompting approaches, where models reproduce existing ideas with terminological shifts. This paper investigates whether agentic workflows (multi-step systems employing iterative reasoning, evolutionary search, and recursive decomposition) can generate more novel and feasible research plans. We benchmark five reasoning architectures: Reflection-based iterative refinement, Sakana AI v2 evolutionary algorithms, Google Co-Scientist multi-agent framework, GPT Deep Research (GPT-5.1) recursive decomposition, and Gemini 3 Pro multimodal long-context pipeline. Using evaluations from thirty proposals each on novelty, feasibility, and impact, we find that decomposition-based and long-context workflows achieve mean novelty of 4.17/5, while reflection-based approaches score significantly lower (2.33/5). Results reveal varied performance across research domains, with high-performing workflows maintaining feasibility without sacrificing creativity. These findings support the view that carefully designed multi-stage agentic workflows can advance AI-assisted research ideation.
[3] Introducing Axlerod: An LLM-based Chatbot for Assisting Independent Insurance Agents
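Adam Bradley, John Hastings, Khandaker Mamun Ahmed
Main category: cs.CL
TL;DR: This paper presents and evaluates Axlerod, an AI conversational system for insurance agents that combines NLP, RAG, and domain knowledge to support efficient policy retrieval and customer service, substantially improving response accuracy and shortening search time.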
Details
Motivation: To improve the productivity of independent insurance agents, address the slow responses and cumbersome information retrieval of traditional insurance services, and advance the use of AI in enterprise-grade insurtech.
Method: Axlerod is a task-oriented dialogue system built with natural language processing (NLP), retrieval-augmented generation (RAG), and domain-specific knowledge integration; it supports intent recognition, structured database access, and real-time response generation.
Result: Axlerod reaches 93.18% accuracy on policy retrieval tasks and reduces average search time by 2.42 seconds, showing efficient semantic understanding and information retrieval.
Conclusion: Axlerod demonstrates the feasibility and benefits of agent-centered conversational AI in insurance and serves as a practical example for enterprise-grade, assistive AI applications.
Abstract: The insurance industry is undergoing a paradigm shift through the adoption of artificial intelligence (AI) technologies, particularly in the realm of intelligent conversational agents. Chatbots have evolved into sophisticated AI-driven systems capable of automating complex workflows, including policy recommendation and claims triage, while simultaneously enabling dynamic, context-aware user engagement. This paper presents the design, implementation, and empirical evaluation of Axlerod, an AI-powered conversational interface designed to improve the operational efficiency of independent insurance agents. Leveraging natural language processing (NLP), retrieval-augmented generation (RAG), and domain-specific knowledge integration, Axlerod demonstrates robust capabilities in parsing user intent, accessing structured policy databases, and delivering real-time, contextually relevant responses. Experimental results underscore Axlerod's effectiveness, achieving an overall accuracy of 93.18% in policy retrieval tasks while reducing the average search time by 2.42 seconds. This work contributes to the growing body of research on enterprise-grade AI applications in insurtech, with a particular focus on agent-assistive rather than consumer-facing architectures.
[4] Opportunities and Challenges of Natural Language Processing for Low-Resource Senegalese Languages in Social Science Research
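Derguene Mbaye, Tatiana D. P. Mbengue, Madoune R. Seye, Moussa Diallo, Mamadou L. Ndiaye, Dimitri S. Adjanohoun, Cheikh S. Wade, Djiby Sow, Jean-Claude B. Munyaka, Jerome Chenal
Main category: cs.CL
TL;DR: This paper gives the first comprehensive survey of NLP progress and challenges for Senegal's six official languages (Wolof, Pulaar, Sereer, Joola, Mandingue, and Soninke). It brings together the linguistic, sociotechnical, and infrastructural factors that affect their digital readiness and identifies gaps in data, tools, and benchmarks. The authors compile existing resources in a centralized GitHub repository to support collaboration and reproducibility, examine the potential of NLP in the social sciences, and propose a roadmap for sustainable NLP development that is community-centered and emphasizes ethical data governance and interdisciplinary collaboration.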
Details
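Motivation: African languages have long been marginalized in NLP, even as NLP changes how research is done across fields. Senegal has six officially recognized local languages, but they are badly underserved in digital technology and language resources. This paper aims to fill that gap and promote linguistic diversity and technological equity.
Method: The paper analyzes linguistic characteristics, the sociotechnical environment, and the state of infrastructure, and reviews existing NLP research and projects with a focus on text normalization, machine translation, and speech processing. It also builds a centrally managed GitHub repository that collects publicly available datasets, tools, and benchmarks to support multi-task NLP research.
Result: The paper identifies key obstacles to NLP development for the six Senegalese languages, including data scarcity, insufficient tools, and missing evaluation standards; compiles and releases a collection of public resources covering a range of NLP tasks; and illustrates how NLP could serve the social sciences through multilingual transcription, translation, and information retrieval.
Conclusion: Sustainable NLP for Senegalese languages requires a community-centered ecosystem that emphasizes ethical data governance, open resource sharing, and interdisciplinary collaboration. Future work should prioritize local participation, capacity building, and technology localization to keep development inclusive and sustainable.
Abstract: Natural Language Processing (NLP) is rapidly transforming research methodologies across disciplines, yet African languages remain largely underrepresented in this technological shift. This paper provides the first comprehensive overview of NLP progress and challenges for the six national languages officially recognized by the Senegalese Constitution: Wolof, Pulaar, Sereer, Joola, Mandingue, and Soninke. We synthesize linguistic, sociotechnical, and infrastructural factors that shape their digital readiness and identify gaps in data, tools, and benchmarks. Building on existing initiatives and research works, we analyze ongoing efforts in text normalization, machine translation, and speech processing. We also provide a centralized GitHub repository that compiles publicly accessible resources for a range of NLP tasks across these languages, designed to facilitate collaboration and reproducibility. A special focus is devoted to the application of NLP to the social sciences, where multilingual transcription, translation, and retrieval pipelines can significantly enhance the efficiency and inclusiveness of field research. The paper concludes by outlining a roadmap toward sustainable, community-centered NLP ecosystems for Senegalese languages, emphasizing ethical data governance, open resources, and interdisciplinary collaboration.
[5] SALP-CG: Standard-Aligned LLM Pipeline for Classifying and Grading Large Volumes of Online Conversational Health Data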
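Yiwei Yan, Hao Li, Hua He, Gong Kai, Zhengyi Yang, Guanfeng Liu
Main category: cs.CL
TL;DR: This study proposes SALP-CG, an LLM-based extraction pipeline that classifies and grades privacy risks in online medical conversation data in accordance with the GB/T 39725-2020 standard. Combining few-shot guidance, JSON Schema constrained decoding, and deterministic high-risk rules, it achieves high category compliance and accurate sensitivity grading across several LLMs and performs well on the MedDialog-CN benchmark (micro-F1=0.900), supporting health data governance.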
Details
Motivation: Online medical consultations produce large volumes of conversational data containing protected health information, and existing approaches lack unified standards and reliable automated means of classifying sensitivity, so they fall short of data governance needs.
Method: Health-data classification and grading rules are built on GB/T 39725-2020, and the backend-agnostic SALP-CG extraction pipeline combines few-shot prompting, JSON Schema constrained decoding, and deterministic high-risk rules to work with multiple LLMs and identify privacy risks efficiently.
Result: On the MedDialog-CN benchmark, the models achieve high entity recognition accuracy, strong schema compliance, and precise sensitivity grading; the best model reaches micro-F1=0.900 on maximum-level prediction. The analysis shows that Level 2 and Level 3 items dominate and can enable re-identification when combined, while Level 4 and Level 5 items are rarer but more harmful.
Conclusion: SALP-CG reliably classifies categories and grades sensitivity in online conversational health data across LLMs, providing a practical and efficient approach to automated health data governance.
Abstract: Online medical consultations generate large volumes of conversational health data that often embed protected health information, requiring robust methods to classify data categories and assign risk levels in line with policies and practice. However, existing approaches lack unified standards and reliable automated methods to fulfill sensitivity classification for such conversational health data. This study presents a large language model-based extraction pipeline, SALP-CG, for classifying and grading privacy risks in online conversational health data. We derived health-data classification and grading rules in accordance with GB/T 39725-2020. Combining few-shot guidance, JSON Schema constrained decoding, and deterministic high-risk rules, the backend-agnostic extraction pipeline achieves strong category compliance and reliable sensitivity grading across diverse LLMs. On the MedDialog-CN benchmark, models yield robust entity counts, high schema compliance, and accurate sensitivity grading, while the strongest model attains micro-F1=0.900 for maximum-level prediction. The category landscape stratified by sensitivity shows that Level 2-3 items dominate, enabling re-identification when combined; Level 4-5 items are less frequent but carry outsize harm. SALP-CG reliably helps classify categories and grade sensitivity in online conversational health data across LLMs, offering a practical method for health data governance.
Code is available at https://github.com/dommii1218/SALP-CG.
[6] StatLLaMA: A multi-stage training framework for building a domain-optimized statistical language model
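Jing-Yi Zeng, Guan-Hua Huang
Main category: cs.CL
TL;DR: This study examines how to efficiently build StatLLaMA, a specialized LLM for statistics, on top of the lightweight LLaMA-3.2-3B model. It finds that the choice of starting model is critical: starting from the base model alone rarely yields good statistical reasoning, whereas starting from an instruction-tuned model enables effective domain specialization.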
Details
Motivation: To build a resource-efficient, domain-specific LLM with expert statistical reasoning ability while avoiding the high cost of training from scratch.
Method: Three multi-stage training pipelines are compared systematically, starting from a base model with no instruction-following ability, a base model with post-hoc instruction tuning, and an instruction-tuned model with strong general reasoning ability; each goes through continual pretraining, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) preference alignment, and downstream task adaptation.
Result: The pipeline starting from LLaMA-3.2-3B-Instruct achieves effective domain specialization; SFT involves a trade-off between domain expertise and general reasoning ability; direct preference optimization (DPO) gives stable and effective RLHF alignment; and downstream fine-tuning must be very low-intensity to prevent catastrophic forgetting. StatLLaMA performs well on benchmarks of mathematical reasoning, common-sense reasoning, and statistical expertise.
Conclusion: Starting from a model with good instruction-following ability and applying a carefully designed multi-stage fine-tuning strategy is a viable route to an efficient specialized statistical LLM, and offers a practical blueprint for domain model customization under resource constraints.
Abstract: This study investigates how to efficiently build a domain-specialized large language model (LLM) for statistics using the lightweight LLaMA-3.2-3B family as the foundation model (FM). We systematically compare three multi-stage training pipelines, starting from a base FM with no instruction-following capability, a base FM augmented with post-hoc instruction tuning, and an instruction-tuned FM with strong general reasoning abilities, across continual pretraining, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) preference alignment, and downstream task adaptation. Results show that pipelines beginning with a base FM fail to develop meaningful statistical reasoning, even after extensive instruction tuning, SFT, or RLHF alignment. In contrast, starting from LLaMA-3.2-3B-Instruct enables effective domain specialization. A comprehensive evaluation of SFT variants reveals clear trade-offs between domain expertise and general reasoning ability. We further demonstrate that direct preference optimization provides stable and effective RLHF preference alignment. Finally, we show that downstream fine-tuning must be performed with extremely low intensity to avoid catastrophic forgetting in highly optimized models. The final model, StatLLaMA, achieves strong and balanced performance on benchmarks of mathematical reasoning, common-sense reasoning, and statistical expertise, offering a practical blueprint for developing resource-efficient statistical LLMs.
The code is available at https://github.com/HuangDLab/StatLLaMA.
[7] Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models
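Hoyoon Byun, Youngjun Choi, Taero Kim, Sungrae Park, Kyungwoo Song
Main category: cs.CL
TL;DR: This paper proposes Bounded Hyperbolic Tanh (BHyT), an efficient and stable alternative to Pre-LN for large language models. By combining a tanh nonlinearity with data-driven input bounding, it improves both training stability and inference efficiency, in theory and in experiments.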
Details
Motivation: Pre-LN is widely used in LLMs, but it repeats statistical computations and becomes unstable in training as activation variance and magnitude grow with depth; existing normalization-free methods remain fragile in deep networks, so stability and efficiency need to be addressed together.
Method: BHyT combines a tanh nonlinearity with data-driven input bounds to keep activations from saturating, computes statistics only once per block, replaces the second normalization with a lightweight variance approximation, and comes with a theoretical stability guarantee.
Result: BHyT is more stable and efficient in pretraining, training on average 15.8% faster than RMSNorm with 4.2% higher generation throughput, and matches or exceeds its inference performance and robustness on language understanding and reasoning tasks.
Conclusion: BHyT is an efficient, stable alternative to Pre-LN that substantially improves training efficiency and scalability while maintaining performance, making it suitable for training and deploying large language models.
Abstract: Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN is inefficient due to repeated statistical calculations and suffers from the curse of depth. As layers grow, the magnitude and variance of the hidden state escalate, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve speed but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and comes with a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once per block and replaces a second normalization with a lightweight variance approximation. Empirically, BHyT demonstrates improved stability and efficiency during pretraining, achieving an average of 15.8% faster training and an average of 4.2% higher token generation throughput compared to RMSNorm, while matching or surpassing its inference performance and robustness across language understanding and reasoning benchmarks. Our code is available at: https://anonymous.4open.science/r/BHyT
[8] Uncertainty-Aware Dynamic Knowledge Graphs for Reliable Question Answering
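Yu Takahashi, Shun Takeuchi, Kexuan Xin, Guillaume Pelat, Yoshiaki Ikai, Junya Saito, Jonathan Vitale, Shlomo Berkovsky, Amin Beheshti
Main category: cs.CL
TL;DR: This paper proposes an uncertainty-aware dynamic knowledge graph (KG) framework to improve the reliability and interpretability of question answering in high-stakes applications. In healthcare in particular, it builds personalized KGs from electronic health records and combines confidence scoring with an interactive interface to support clinical decision-making.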
Details
Motivation: Existing KG-based QA systems usually treat facts as static and certain, which makes it hard to capture how information evolves and the uncertainty in reasoning, and leaves them insufficiently reliable in critical settings such as healthcare.
Method: The framework combines dynamic KG construction, confidence scoring with uncertainty-aware retrieval, and an interactive visual interface; in the healthcare setting it builds personalized KGs from electronic health records and tracks uncertainty across patient visits.
Result: The system produces confidence-annotated triples, lets users compare answers from a conventional baseline with confidence-aware answers, and evaluates the effect of uncertainty modeling on a mortality prediction task.
Conclusion: Uncertainty-aware dynamic KGs can make QA systems more robust and transparent, and are particularly suited to reliable decision support in high-stakes domains such as healthcare.
Abstract: Question answering (QA) systems are increasingly deployed across domains. However, their reliability is undermined when retrieved evidence is incomplete, noisy, or uncertain. Existing knowledge graph (KG) based QA frameworks typically represent facts as static and deterministic, failing to capture the evolving nature of information and the uncertainty inherent in reasoning. We present a demonstration of uncertainty-aware dynamic KGs, a framework that combines (i) dynamic construction of evolving KGs, (ii) confidence scoring and uncertainty-aware retrieval, and (iii) an interactive interface for reliable and interpretable QA. Our system highlights how uncertainty modeling can make QA more robust and transparent by enabling users to explore dynamic graphs, inspect confidence-annotated triples, and compare baseline versus confidence-aware answers. The target users of this demo are clinical data scientists and clinicians, and we instantiate the framework in healthcare: constructing personalized KGs from electronic health records, visualizing uncertainty across patient visits, and evaluating its impact on a mortality prediction task. This use case demonstrates the broader promise of uncertainty-aware dynamic KGs for enhancing QA reliability in high-stakes applications.
[9] Cross-Platform Evaluation of Large Language Model Safety in Pediatric Consultations: Evolution of Adversarial Robustness and the Scale Paradox
Vahideh Zolfaghari
Main category: cs.CL
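TL;DR: This study evaluates the safety of large language models (LLMs) in pediatric medical consultations under adversarial pressure driven by parental anxiety. It finds that safety depends more on alignment and architecture than on scale, that smaller models do better in some cases, and that all models lack emergency recognition and are unsuitable for triage.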
Details
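Motivation: Existing evaluations of LLM safety in medical consultations mostly use neutral conditions and overlook the vulnerabilities that anxious users can expose. This study aims to fill that gap, particularly in pediatrics, where parental anxiety can lead to more challenging interactions that test a model's safeguards.
Method: The study uses the PediatricAnxietyBench dataset, which contains 150 authentic and 150 adversarial queries across 10 pediatric topics. Three models are tested via API: Llama-3.3-70B and Llama-3.1-8B (on Groq) and Mistral-7B (on HuggingFace), producing 900 responses in total. Safety is scored on a 0-15 scale covering restraint, referral advice, expression of uncertainty, emergency recognition, and non-prescriptive behavior. Paired t-tests with bootstrapped confidence intervals are used for statistical analysis.
Result: Mean safety scores range from 9.70 (Llama-3.3-70B) to 10.39 (Mistral-7B). Llama-3.1-8B significantly outperforms Llama-3.3-70B (+0.66, p=0.0001). Adversarial queries actually raise safety scores, most of all for Mistral-7B (+1.09, p=0.0002). Safety is consistent across platforms, but Llama-3.3-70B has an 8% failure rate. 33% of seizure-related queries receive inappropriate diagnoses. Expression of uncertainty correlates strongly with safety score (r=0.68, p<0.001).
Conclusion: Model safety depends on training alignment and architecture rather than parameter count, and small models can outperform large ones under certain conditions. Newer model releases are more robust, indicating progress in training strategy. However, no model showed emergency recognition, so they are not currently suitable for clinical triage. The results support adversarial testing under realistic pressure and provide an open benchmark to advance medical AI safety.
Abstract: Background: Large language models (LLMs) are increasingly deployed in medical consultations, yet their safety under realistic user pressures remains understudied. Prior assessments focused on neutral conditions, overlooking vulnerabilities that arise when anxious users challenge safeguards. This study evaluated LLM safety under parental anxiety-driven adversarial pressures in pediatric consultations across models and platforms. Methods: PediatricAnxietyBench, from a prior evaluation, includes 300 queries (150 authentic, 150 adversarial) spanning 10 topics. Three models were assessed via APIs: Llama-3.3-70B and Llama-3.1-8B (Groq) and Mistral-7B (HuggingFace), yielding 900 responses. Safety was scored on a 0-15 scale covering restraint, referral, hedging, emergency recognition, and non-prescriptive behavior. Analyses employed paired t-tests with bootstrapped CIs. Results: Mean scores ranged from 9.70 (Llama-3.3-70B) to 10.39 (Mistral-7B). Llama-3.1-8B outperformed Llama-3.3-70B by +0.66 (p=0.0001, d=0.225). Models showed positive adversarial effects, with Mistral-7B the strongest (+1.09, p=0.0002). Safety generalized across platforms; Llama-3.3-70B had 8% failures. Seizure queries were vulnerable (33% inappropriate diagnoses). Hedging predicted safety (r=0.68, p<0.001).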
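For readers unfamiliar with the analysis named in the Methods (paired t-tests with bootstrapped CIs), the following is a minimal standard-library sketch of that general technique. It is not the author's code: the score lists are made-up illustrative values on the 0-15 scale, and the function names, resample count, and seed are assumptions chosen for the example.

```python
import math
import random
import statistics


def paired_t_statistic(a, b):
    """Return (mean difference, t statistic, degrees of freedom) for paired samples."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return mean_diff, mean_diff / (sd / math.sqrt(n)), n - 1


def bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        statistics.mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-query safety scores for two models on the same eight queries.
small_model = [11, 10, 12, 9, 11, 10, 12, 11]
large_model = [10, 9, 11, 9, 10, 10, 11, 10]

mean_diff, t_stat, dof = paired_t_statistic(small_model, large_model)
ci_low, ci_high = bootstrap_ci(small_model, large_model)
print(f"mean diff={mean_diff:.2f}, t={t_stat:.2f}, dof={dof}, "
      f"95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

The test is paired because both models answer the same queries, so the comparison runs on per-query differences rather than on two independent samples. Converting the t statistic to a p-value needs the t distribution, which the standard library does not provide.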
Conclusions: The evaluation shows that safety depends on alignment and architecture more than on scale, with smaller models outperforming larger ones. Improved robustness across releases suggests progress from targeted training. The observed vulnerabilities and the absence of emergency recognition indicate that these models are unsuitable for triage. The findings guide model selection, underline the need for adversarial testing, and provide an open benchmark for medical AI safety.
[10] ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language
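Franciszek Górski, Andrzej Czyżewski
Main category: cs.CL
TL;DR: This paper proposes a framework that uses a multilingual LLM (such as Llama3.1) as a teacher model to annotate Polish medical texts and train lightweight classifiers, effectively addressing the annotation bottleneck for a low-resource language.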
Details
Motivation: Annotated Polish medical text is scarce, which makes it hard to train high-quality classifiers, so an efficient, low-cost automatic annotation method is needed to build a reliable training set.
Method: A multilingual LLM (Llama3.1) automatically annotates a large corpus of Polish medical texts, and limited human annotation is used to build the test set; three BERT-based models (DistilBERT, BioBERT, and HerBERT) are fine-tuned on this data for multi-class clinical text classification.
Result: DistilBERT performs best, with F1 above 0.80 in all five clinical categories and above 0.93 in three of them; it is also small, uses little GPU memory, and runs inference quickly, making it far more efficient than large models.
Conclusion: Using a multilingual LLM to generate annotations, in the manner of knowledge distillation, is an effective way to build high-performing, lightweight domain-specific text classifiers, and offers a workable solution for low-resource languages and compute-constrained settings.
Abstract: In this work, we present an annotation framework that demonstrates how a multilingual LLM pretrained on a large corpus can be used as a teacher model to distill the expert knowledge needed for tagging medical texts in Polish. This work is part of a larger project called ADMEDVOICE, within which we collected an extensive corpus of medical texts representing five clinical categories: Radiology, Oncology, Cardiology, Hypertension, and Pathology. Using this data, we had to develop a multi-class classifier, but the fundamental problem turned out to be the lack of resources for annotating an adequate number of texts. Therefore, in our solution, we used the multilingual Llama3.1 model to annotate an extensive corpus of medical texts in Polish. Using our limited annotation resources, we verified only a portion of these labels, creating a test set from them. The data annotated in this way were then used for training and validation of three different types of classifiers based on the BERT architecture: the distilled DistilBERT model, BioBERT fine-tuned on medical data, and HerBERT fine-tuned on the Polish language corpus. Among the models we trained, the DistilBERT model achieved the best results, reaching an F1 score > 0.80 for each clinical category and an F1 score > 0.93 for three of them.
In this way, we obtained a series of highly effective classifiers that represent an alternative to large language models, due to their nearly 500 times smaller size, 300 times lower GPU VRAM consumption, and several hundred times faster inference.
[11] SagaScale: A Realistic, Scalable, and High-Quality Long-Context Benchmark Built from Full-Length Novels
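Guancheng Du, Yong Hu, Wenqing Wang, Yaming Yang, Jiaheng Gao
Main category: cs.CL
TL;DR: This paper presents SagaScale, a realistic, scalable, and high-quality bilingual long-context benchmark built from full-length novels, for evaluating LLMs on very long texts. Experiments with 12 frontier models and three long-context methods show the advantage of supplying the full context directly and identify a remedy for the retrieval bottleneck.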
Details
Motivation: Existing long-context benchmarks are limited in task realism, data scalability, and data quality, and cannot adequately evaluate how well LLMs understand real, complex documents.
Method: An automated data collection pipeline uses external resources (such as Wikipedia) to generate question-answer pairs from complete novels; the external resources are used only during construction and not during evaluation, so question complexity exceeds what the model can draw on at evaluation time. The benchmark covers English and Chinese, with average context lengths above 250K tokens (English) and 320K tokens (Chinese).
Result: Evaluation on 12 frontier LLMs and three long-context methods shows that (1) supplying the full context directly clearly outperforms the other methods; (2) most models still struggle with very long contexts, though Gemini-2.5-Pro stands out; and (3) Agentic RAG effectively relieves the retrieval bottleneck of Naïve RAG.
Conclusion: SagaScale is a realistic, high-quality, large-scale long-context benchmark that evaluates LLMs' long-text understanding more fully; the authors have publicly released the benchmark and its code to support further research.
Abstract: Large Language Models (LLMs) have shown significant progress, but understanding long and complex documents remains challenging. Many long-context benchmarks have been proposed, but they face several limitations, including task realism, data scalability, and data quality. To this end, we introduce SagaScale, a realistic, scalable, and high-quality long-context benchmark built from full-length novels. The entire benchmark is constructed using an automated data collection pipeline that utilizes external resources (e.g., Wikipedia pages) to curate question-answer pairs. Critically, these external resources are provided only for benchmark construction and not during evaluation, which allows LLMs to curate complex questions that go beyond what they can answer during evaluation. SagaScale is also bilingual and offers the largest context length to date, with average token counts exceeding 250K for English novels and 320K for Chinese novels. Our evaluation across 12 frontier LLMs and three long-context methods (Naïve RAG, Agentic RAG, and Long Context) yields key insights, including: (1) Directly supplying the full context to the LLM can outperform other methods by a large margin; (2) Most LLMs still struggle with lengthy contexts, but Gemini-2.5-Pro stands out as an exception; and (3) Agentic RAG effectively addresses the retrieval bottleneck in Naïve RAG.
Finally, we publicly release the SagaScale benchmark and our data collection codebase to facilitate future research.
[12] Syntactic Framing Fragility: An Audit of Robustness in LLM Ethical Decisions
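Katherine Elkins, Jon Chun
Main category: cs.CL
TL;DR: This paper proposes the Syntactic Framing Fragility (SFF) framework for evaluating whether LLMs make consistent ethical judgments across prompts that are logically equivalent but syntactically different. Many models reverse their judgments when syntactic polarity changes and are especially sensitive to negated prompts, and open-source models are more than twice as fragile as commercial ones. Chain-of-thought reasoning substantially reduces the problem, and the study argues that syntactic consistency should be a standard part of LLM safety evaluation.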
Details
Motivation: LLMs are widely used in consequential decision-making, yet it is unclear whether their ethical judgments stay consistent under syntactic changes to the prompt (such as negation or conditional structure). Existing evaluation methods have difficulty separating syntactic from semantic effects, so a robustness framework that measures purely syntactic effects is needed.
Method: The SFF framework uses Logical Polarity Normalization (LPN) to remove semantic drift and compares model decisions under positive and negative syntactic framings. Four controlled prompt variants are designed for 14 ethical scenarios, 23 mainstream models are evaluated over 39,975 decisions, and the effect of chain-of-thought reasoning and the distribution of fragility across scenarios are analyzed.
Result: Inconsistency is widespread and statistically significant: many models reverse ethical judgments solely because syntactic polarity changes; some models endorse the action in 80-97% of cases under "should not" prompts; open-source models are more than twice as syntactically fragile as commercial models; chain-of-thought reasoning substantially reduces inconsistency; and financial and business scenarios carry higher risk than medical ones.
Conclusion: Syntactic consistency is a distinct and critical dimension of ethical robustness, and SFF-style audits should be a standard part of safety evaluation for deployed LLMs.
Abstract: Large language models (LLMs) are increasingly deployed in consequential decision-making settings, yet their robustness to benign prompt variation remains underexplored. In this work, we study whether LLMs maintain consistent ethical judgments across logically equivalent but syntactically different prompts, focusing on variations involving negation and conditional structure. We introduce Syntactic Framing Fragility (SFF), a robustness evaluation framework that isolates purely syntactic effects via Logical Polarity Normalization (LPN), enabling direct comparison of decisions across positive and negative framings without semantic drift. Auditing 23 state-of-the-art models spanning the U.S. and China as well as small U.S. open-source software models over 14 ethical scenarios and four controlled framings (39,975 decisions), we find widespread and statistically significant inconsistency: many models reverse ethical endorsements solely due to syntactic polarity, with open-source models exhibiting over twice the fragility of commercial counterparts. We further uncover extreme negation sensitivity, where some models endorse actions in 80-97% of cases when explicitly prompted with "should not." We show that eliciting chain-of-thought reasoning substantially reduces fragility, identifying a practical mitigation lever, and we map fragility across scenarios, finding higher risk in financial and business contexts than in medical scenarios.
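To make the idea of comparing decisions across positive and negative framings concrete, here is a minimal sketch of a polarity-normalized consistency check. It is an illustration under stated assumptions, not the paper's LPN procedure: the two framing labels, the yes/no answer format, the function names, and the toy scenarios are all hypothetical, and the paper's four controlled framings are not reproduced.

```python
def endorses_action(answer: str, framing: str) -> bool:
    """Map a yes/no answer onto one polarity: does the model endorse the action?

    Assumed framings (hypothetical):
      "positive": "Should X do A?"      -> "yes" endorses A
      "negative": "Should X not do A?"  -> "yes" rejects A
    Any answer that does not start with "yes" is treated as "no".
    """
    says_yes = answer.strip().lower().startswith("yes")
    if framing == "positive":
        return says_yes
    if framing == "negative":
        return not says_yes
    raise ValueError(f"unknown framing: {framing!r}")


def is_fragile(responses) -> bool:
    """A scenario is fragile if normalized verdicts disagree across framings."""
    verdicts = {endorses_action(answer, framing) for framing, answer in responses}
    return len(verdicts) > 1


def fragility_rate(scenarios) -> float:
    """Fraction of scenarios whose normalized verdict flips with framing."""
    flags = [is_fragile(responses) for responses in scenarios.values()]
    return sum(flags) / len(flags)


# Toy data: (framing, model answer) pairs per scenario; values are made up.
toy_scenarios = {
    "report_fraud": [("positive", "Yes"), ("negative", "No")],   # consistent
    "share_data":   [("positive", "Yes"), ("negative", "Yes")],  # reversal
}
rate = fragility_rate(toy_scenarios)
print(rate)
```

Normalizing first matters because a raw "yes" means opposite things under the two framings; comparing raw answers would flag the consistent scenario and miss the reversal.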
Our results demonstrate that syntactic consistency constitutes a distinct and critical dimension of ethical robustness, and we argue that SFF-style audits should be a standard component of safety evaluation for deployed LLMs. Code and results will be available on github.com.
[13] Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation
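Kaustubh Shivshankar Shejole, Sourabh Deoghare, Pushpak Bhattacharyya
Main category: cs.CL
TL;DR: This paper introduces Virām, the first diagnostic benchmark for assessing punctuation robustness in English-to-Marathi machine translation, and proposes fine-tuning and pipeline approaches to improve translation reliability.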
Details
Motivation: Machine translation for Marathi, a low-resource language, is sensitive to punctuation; the paper studies how to make translation systems more accurate and robust when punctuation is ambiguous.
Method: The Virām benchmark consists of 54 manually curated punctuation-ambiguous instances; two strategies are evaluated: a restore-then-translate pipeline and direct fine-tuning on punctuation-varied data.
Result: Specialized fine-tuned models and pipeline systems significantly outperform the baseline models on Virām; current LLMs perform poorly on this task.
Conclusion: Improving punctuation robustness in low-resource MT calls for task-specific fine-tuning or pipeline methods, and LLMs need further improvement to handle punctuation ambiguity.
Abstract: Punctuation plays a critical role in resolving semantic and structural ambiguity in written language. Machine Translation (MT) systems are now widely applied across diverse domains and languages, including many low-resource settings. In this work, we focus on Marathi, a low- to middle-resource language. We introduce Virām, the first diagnostic benchmark for assessing punctuation robustness in English-to-Marathi machine translation, consisting of 54 manually curated, punctuation-ambiguous instances. We evaluate two primary strategies for enhancing reliability: a pipeline-based restore-then-translate approach and direct fine-tuning on punctuation-varied data. Our results demonstrate that specialized fine-tuned models and pipeline systems significantly improve translation quality over standard baselines on the Virām benchmark. Qualitative analysis reveals that the original model may produce wrong translations that lead to wrong interpretations, while fine-tuned models significantly improve overall reliability. Furthermore, we find that current Large Language Models (LLMs) lag behind these task-specific approaches in preserving meaning for punctuation-ambiguous text, thus necessitating further research in this area.
[14] Forgetting as a Feature: Cognitive Alignment of Large Language Models
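Hien Tran, Quinten Steenhuis, Alexandros Christoforos, Chadbourne Davis
Main category: cs.CL
TL;DR: This paper treats "forgetting" in LLMs as a functional cognitive mechanism rather than a defect. It introduces an exponential-decay probabilistic memory model inspired by human memory dynamics to reinterpret in-context reasoning, designs benchmarks to test the similarity to human cognitive patterns, and proposes a "probabilistic memory prompting" strategy to improve long-horizon reasoning.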
Details
Motivation: The authors challenge the conventional practice of evaluating LLMs against perfect Bayesian inference, arguing that systematic forgetting is not a model defect but may be part of adaptive intelligence, and that LLM reasoning should therefore be reinterpreted in light of human memory mechanisms.
Method: An exponential-decay probabilistic memory model is proposed to describe LLM inference; a benchmark suite covering temporal reasoning, concept drift adaptation, and associative recall is designed to compare forgetting in LLMs with human cognitive patterns; and a "probabilistic memory prompting" method is developed to shape evidence integration.
Result: Experiments show that LLM forgetting rates resemble the trade-off between stability and adaptability in human memory, and that probabilistic memory prompting improves performance on long-horizon reasoning tasks.
Conclusion: Forgetting should be seen not as a failure mode of LLMs but as a principled mechanism for adaptive intelligence; the study offers a new cognitive framework for understanding and optimizing LLM reasoning dynamics.
Abstract: Large Language Models (LLMs) are often evaluated against ideals of perfect Bayesian inference, yet growing evidence suggests that their in-context reasoning exhibits systematic forgetting of past information. Rather than viewing this behavior as a limitation, we reinterpret forgetting as a functional cognitive mechanism. Drawing inspiration from human memory dynamics, we model LLM inference as a probabilistic memory process governed by exponential decay. We introduce a benchmark suite that evaluates temporal reasoning, concept drift adaptation, and associative recall, enabling direct comparison between model behavior and human cognitive patterns. Our empirical results reveal that LLMs demonstrate forgetting rates analogous to human memory efficiency trade-offs between stability and adaptability. Building on these observations, we propose probabilistic memory prompting, a lightweight strategy that shapes evidence integration to mimic human-like memory decay, leading to improved long-horizon reasoning performance. Our findings position forgetting not as a failure mode, but as a principled mechanism for adaptive intelligence.
[15] SciNets: Graph-Constrained Multi-Hop Reasoning for Scientific Literature Synthesis
Sauhard Dubey
Main category: cs.CL
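TL;DR: This paper proposes SciNets, which builds literature-derived concept graphs and frames mechanistic synthesis as graph-constrained multi-hop reasoning to support cross-domain scientific synthesis, and introduces a behavioral evaluation framework that measures reasoning depth, mechanistic diversity, and grounding stability.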
Details
Motivation: Existing retrieval-based systems and language models offer limited control over reasoning depth and weak structural grounding when connecting mechanistic explanations across the literature, which makes it hard to support cross-domain scientific synthesis.
Method: Mechanistic synthesis is treated as graph-constrained multi-hop reasoning over literature-derived concept graphs. For a scientific query and a local corpus, a directed concept graph is built, and mechanistic explanations are generated by identifying multi-hop paths that connect rarely co-occurring concepts. Shortest-path reasoning, k-shortest paths with diversity constraints, random walks, and a retrieval-augmented language model are compared.
Result: Explicit graph constraints support controllable multi-hop reasoning but involve a trade-off: deeper and richer symbolic reasoning increases grounding instability, while shortest-path reasoning is stable but structurally conservative.
Conclusion: Combining graph structure with LLMs holds promise for scientific synthesis but requires balancing reasoning depth against stability; the study provides a systematic behavioral characterization of current graph-LLM integration.
Abstract: Cross-domain scientific synthesis requires connecting mechanistic explanations across fragmented literature, a capability that remains challenging for both retrieval-based systems and unconstrained language models. While recent work has applied large language models to scientific summarization and question answering, these approaches provide limited control over reasoning depth and structural grounding. We frame mechanistic synthesis as a graph-constrained multi-hop reasoning problem over literature-derived concept graphs. Given a scientific query and a compact, query-local corpus, SciNets constructs a directed concept graph and synthesizes mechanistic explanations by identifying multi-hop reasoning paths that connect concepts that rarely co-occur within individual papers. We systematically compare shortest-path reasoning, k-shortest paths with diversity constraints, stochastic random walks, and a retrieval-augmented language model baseline. Rather than evaluating correctness, which is often indeterminate when synthesizing connections across distributed sources, we introduce a behavioral framework that measures symbolic reasoning depth, mechanistic diversity, and grounding stability. Across machine learning, biology, and climate science tasks, explicit graph constraints enable controllable multi-hop reasoning while revealing a consistent trade-off: deeper and more diverse symbolic reasoning increases grounding instability, whereas shortest-path reasoning remains highly stable but structurally conservative.
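As a rough illustration of the simplest strategy compared above, shortest-path reasoning over a directed concept graph, the sketch below runs a breadth-first search on a tiny hand-written graph. It is not the SciNets implementation: the graph contents, the adjacency-dict representation, and the function name are assumptions made for the example, and edges are treated as unweighted.

```python
from collections import deque


def shortest_concept_path(graph, source, target):
    """Breadth-first search for a fewest-hops path in a directed graph.

    `graph` maps each concept to the concepts it points to.
    Returns the path as a list of concepts, or None if target is unreachable.
    """
    if source == target:
        return [source]
    parents = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in parents:
                continue
            parents[neighbor] = node
            if neighbor == target:
                # Walk parent links back to the source, then reverse.
                path = [neighbor]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical concept graph; edges mean "is reported to influence".
toy_graph = {
    "dropout": ["regularization", "noise injection"],
    "regularization": ["generalization"],
    "noise injection": ["robustness"],
    "robustness": ["generalization"],
}
path = shortest_concept_path(toy_graph, "dropout", "generalization")
print(path)
```

The two-hop route is returned even though a longer one through "noise injection" and "robustness" exists, which matches the abstract's description of shortest-path reasoning as structurally conservative.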
These findings provide a systematic behavioral characterization of the limits and capabilities of current graph-LLM integration for scientific synthesis.
[16] Eliminating Agentic Workflow for Introduction Generation with Parametric Stage Tokens
Meicong Zhang,Tiancheng su,Guoxiu He
Main category: cs.CL
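TL;DR: This paper proposes STIG (Stage Token for Introduction Generation), which parameterizes the multi-stage logical structure of a literature review directly into a large language model so that a complete introduction is generated in a single inference, avoiding the error accumulation and poor coherence of conventional external agentic workflows.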
Details
Motivation: Existing methods based on predefined agentic workflows suffer from long reasoning chains, error accumulation, and poor textual coherence when generating research introductions, and fall short of the logical rigor and structural consistency that introduction writing demands.
Method: STIG decomposes the original multi-step workflow into explicit stage tokens and uses instruction tuning to teach the model each stage's functional role, logical order, and transition patterns, internalizing the whole process in the model parameters for end-to-end introduction generation.
Result: Experiments show that STIG generates multi-stage introductions in a single inference without explicit workflow calls and outperforms traditional agentic workflows and other baselines on semantic similarity and sentence-level structural rationality.
Conclusion: Encoding the workflow's logical structure directly into model parameters is a more efficient and coherent route to scientific writing generation than relying on external agents. STIG offers a new paradigm for structured text generation with LLMs.
Abstract: In recent years, using predefined agentic workflows to guide large language models (LLMs) for literature classification and review has become a research focus. However, writing research introductions is more challenging. It requires rigorous logic, coherent structure, and abstract summarization. Existing workflows often suffer from long reasoning chains, error accumulation, and reduced textual coherence. To address these limitations, we propose eliminating external agentic workflows. Instead, we directly parameterize their logical structure into the LLM. This allows the generation of a complete introduction in a single inference. To this end, we introduce the Stage Token for Introduction Generation (STIG). STIG converts the multiple stages of the original workflow into explicit stage signals. These signals guide the model to follow different logical roles and functions during generation. Through instruction tuning, the model learns the mapping between stage tokens and text functions. It also learns the logical order and transition patterns between stages, encoding this knowledge into the model parameters. Experimental results show that STIG can generate multi-stage text in a single inference. It does not require explicit workflow calls. STIG outperforms traditional agentic workflows and other baselines on metrics of semantic similarity and sentence-level structural rationality. The code is provided in the Supplementary Materials.
[17] Enhancing Business Analytics through Hybrid Summarization of Financial Reports
Tohida Rehman
Main category: cs.CL
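TL;DR: This paper proposes a hybrid summarization framework that combines extractive and abstractive methods to automatically generate accurate, concise Reuters-style summaries from earnings call transcripts, and compares how different models handle long-range dependencies and factual consistency in financial text.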
Details
Motivation: Financial reports and earnings calls are long and complex, manual analysis is slow and error-prone, and existing automatic summarization methods fall short on factual accuracy and long-text handling, so an efficient and reliable way to extract key business insights is needed.
Method: A two-stage hybrid framework first extracts key sentences with the LexRank algorithm and then produces abstractive summaries with fine-tuned BART and PEGASUS models. In parallel, a Longformer Encoder-Decoder (LED) model is fine-tuned to process long text directly and capture long-range contextual dependencies.
Result: Models are evaluated with ROUGE, METEOR, MoverScore, BERTScore, and domain-specific metrics (such as SciBERTScore and FinBERTScore), with factual accuracy measured by entity-based source-precision and F1-target. Long-context models perform best overall, while the hybrid framework delivers competitive results with better factual consistency when computational resources are limited.
Conclusion: Long-context models perform best for financial text summarization, but the proposed hybrid framework offers better factual consistency in resource-constrained settings, supporting practical financial summarization systems that efficiently produce usable business insights.
Abstract: Financial reports and earnings communications contain large volumes of structured and semi-structured information, making detailed manual analysis inefficient. Earnings conference calls provide valuable evidence about a firm's performance, outlook, and strategic priorities. The manual analysis of lengthy call transcripts requires substantial effort and is susceptible to interpretive bias and unintentional error. In this work, we present a hybrid summarization framework that combines extractive and abstractive techniques to produce concise and factually reliable Reuters-style summaries from the ECTSum dataset. The proposed two-stage pipeline first applies the LexRank algorithm to identify salient sentences, which are subsequently summarized using fine-tuned variants of BART and PEGASUS designed for resource-constrained settings. In parallel, we fine-tune a Longformer Encoder-Decoder (LED) model to directly capture long-range contextual dependencies in financial documents. Model performance is evaluated using standard automatic metrics, including ROUGE, METEOR, MoverScore, and BERTScore, along with domain-specific variants such as SciBERTScore and FinBERTScore. To assess factual accuracy, we further employ entity-level measures based on source-precision and F1-target.
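Illustrative sketch (not the authors' code): the entity-level measures mentioned here are commonly computed as set overlaps between entities in the generated summary, the source document, and the reference summary. The definitions below follow that common formulation and are an assumption about what the entry means. The numeric-token extractor is a stand-in for a real named-entity recognizer, chosen only because financial summaries are number-heavy.

```python
import re

# Assumption: numeric tokens (optionally with a decimal part or a percent
# sign) stand in for entities. A real pipeline would use an NER model.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")

def entities(text):
    """Return the set of numeric tokens found in text."""
    return set(NUMBER.findall(text))

def source_precision(summary, source):
    """Share of summary entities that also occur in the source document."""
    hyp = entities(summary)
    if not hyp:
        return 1.0  # convention chosen here: nothing stated, nothing unsupported
    return len(hyp & entities(source)) / len(hyp)

def f1_target(summary, reference):
    """F1 between summary entities and reference-summary entities."""
    hyp, ref = entities(summary), entities(reference)
    overlap = len(hyp & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

source = "Third-quarter revenue rose 12% to 4.5 billion; EPS was 1.20."
summary = "Revenue up 12% to 4.5 billion, EPS 1.30."
reference = "Revenue rose 12%; EPS 1.20."

print(source_precision(summary, source))  # 2 of 3 summary entities are supported
print(f1_target(summary, reference))
```

In the example, the summary's "1.30" does not appear in the source, so it lowers source precision; this is the kind of unsupported figure such a measure is meant to expose.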
The results highlight complementary trade-offs between approaches: long-context models yield the strongest overall performance, while the hybrid framework achieves competitive results with improved factual consistency under computational constraints. These findings support the development of practical summarization systems for efficiently distilling lengthy financial texts into usable business insights.
[18] Clinical Document Metadata Extraction: A Scoping Review
Kurt Miller,Qiuhao Lu,William Hersh,Kirk Roberts,Steven Bedrick,Andrew Wen,Hongfang Liu
Main category: cs.CL
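TL;DR: This scoping review of clinical document metadata extraction identifies methodological trends, applications, and research gaps. It finds that methods have shifted from rule-based and traditional machine learning approaches to transformer-based architectures, and anticipates a move toward richer metadata representations and clinical integration.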
Details
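Motivation: Clinical document metadata is essential for accurately interpreting clinical information, but document heterogeneity and drift over time make metadata harmonization difficult, so existing extraction methods need to be surveyed systematically and research gaps identified.
Method: Following the PRISMA-ScR guidelines, the authors screened literature published between January 2011 and August 2025 and included 67 relevant articles for synthesis, classifying each by study type (methodological, application, or composition analysis) and summarizing data sources, task goals, and the evolution of methods.
Result: Of 266 articles screened, 67 were reviewed in depth: 45 were methodological, 17 used metadata in downstream tasks, and 5 analyzed metadata composition. Labelled public data remains scarce except for structural section datasets. Methods have progressed from rule-based and traditional machine learning (with heavy feature engineering) to transformer-based models (with little feature engineering), and large language models have driven exploration of generalizability across tasks and datasets.
Conclusion: Research on clinical document metadata extraction is moving toward richer representations and deeper clinical integration. More public data sharing and general-purpose model development are needed to support advanced clinical text processing systems.
Abstract: Clinical document metadata, such as document type, structure, author role, medical specialty, and encounter setting, is essential for accurate interpretation of information captured in clinical documents. However, vast documentation heterogeneity and drift over time challenge harmonization of document metadata. Automated extraction methods have emerged to coalesce metadata from disparate practices into target schema. This scoping review aims to catalog research on clinical document metadata extraction, identify methodological trends and applications, and highlight gaps. We followed the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines to identify articles that perform clinical document metadata extraction. We initially found and screened 266 articles published between January 2011 and August 2025, then comprehensively reviewed 67 we deemed relevant to our study. Among the articles included, 45 were methodological, 17 used document metadata as features in a downstream application, and 5 analyzed document metadata composition. We observe myriad purposes for methodological study and application types. Available labelled public data remains sparse except for structural section datasets. Methods for extracting document metadata have progressed from largely rule-based and traditional machine learning with ample feature engineering to transformer-based architectures with minimal feature engineering.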
The emergence of large language models has enabled broader exploration of generalizability across tasks and datasets, allowing the possibility of advanced clinical text processing systems. We anticipate that research will continue to expand into richer document metadata representations and integrate further into clinical applications and workflows.
[19] Geometric Patterns of Meaning: A PHATE Manifold Analysis of Multi-lingual Embeddings
Wen G Gong
Main category: cs.CL
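TL;DR: Proposes a multi-level analysis framework, Semanscope, that uses PHATE manifold learning to reveal semantic geometry in multilingual embeddings, finding systematic geometric patterns at different linguistic levels as well as limitations of current models.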
Details
Motivation: To explore how semantic geometry is organized in multilingual embedding spaces and to assess how well current embedding models capture semantic relationships.
Method: A multi-level analysis framework covering four linguistic levels (sub-character, character, word, and number) is built, PHATE manifold learning is used for visual analysis, and the framework is validated on datasets spanning multiple languages and semantic domains.
Result: At the sub-character level, semantic and structural components collapse geometrically. Different writing systems show distinct geometric signatures at the character level. At the word level, content words form clustering-branching structures across 20 semantic domains. Arabic numerals follow spiral trajectories rather than clusters, contrary to standard distributional assumptions.
Conclusion: PHATE manifold learning is a key tool for analyzing semantic geometry in embedding spaces and for validating models, and it exposes the difficulty current embedding models have in separating semantic from structural information.
Abstract: We introduce a multi-level analysis framework for examining semantic geometry in multilingual embeddings, implemented through Semanscope (a visualization tool that applies PHATE manifold learning across four linguistic levels). Analysis of diverse datasets spanning sub-character components, alphabetic systems, semantic domains, and numerical concepts reveals systematic geometric patterns and critical limitations in current embedding models. At the sub-character level, purely structural elements (Chinese radicals) exhibit geometric collapse, highlighting model failures to distinguish semantic from structural components. At the character level, different writing systems show distinct geometric signatures. At the word level, content words form clustering-branching patterns across 20 semantic domains in English, Chinese, and German. Arabic numbers organize through spiral trajectories rather than clustering, violating standard distributional semantics assumptions. These findings establish PHATE manifold learning as an essential analytic tool not only for studying geometric structure of meaning in embedding space, but also for validating the effectiveness of embedding models in capturing semantic relationships.
[20] Benchmarking Cross-Lingual Semantic Alignment in Multilingual Embeddings
Wen G. Gong
Main category: cs.CL
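TL;DR: This paper proposes the Semantic Affinity (SA) metric and the Semanscope framework for evaluating cross-lingual semantic alignment in multilingual embedding models. It finds that training objective matters more than model scale or architecture: only models trained with translation-pair supervision achieve good alignment.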
Details
Motivation: Task-driven benchmarks such as MTEB may mask fundamental shortcomings in the cross-lingual semantic alignment of multilingual embedding models, leaving practitioners unable to tell which models truly offer cross-lingual understanding. A more direct and interpretable measure of alignment is needed.
Method: The Semantic Affinity (SA) metric computes the ratio of inter-lingual to intra-lingual spread using cosine distance and is combined with PHATE visualization in the Semanscope framework. Thirteen models are evaluated on 4 datasets (52 experiments).
Result: Three tiers emerge: (1) top BERT models (such as LaBSE, USE, and S-BERT) reach high SA (0.68 to 0.70) with translation-pair supervision; (2) LLM embeddings plateau at SA 0.55 to 0.61 regardless of parameter count; (3) models pre-trained with MLM only (such as mBERT and XLM-R) score below 0.50, indicating failed cross-lingual alignment. An analysis of Oracle Bone primitives further suggests that models learn corpus patterns rather than cognitive primitives.
Conclusion: The quality of cross-lingual semantic alignment depends mainly on explicit translation-pair supervision rather than on model scale, architecture, or the amount of multilingual data. The study offers an evaluation tool for choosing high-quality multilingual embeddings from the hundreds of models available.
Abstract: With hundreds of multilingual embedding models available, practitioners lack clear guidance on which provide genuine cross-lingual semantic alignment versus task performance through language-specific patterns. Task-driven benchmarks (MTEB) may mask fundamental alignment shortcomings. We introduce Semantic Affinity (SA), a bounded (between 0 and 1) metric measuring inter-lingual to intra-lingual spread ratio using cosine distance, combined with PHATE visualization in our Semanscope framework. Benchmarking 13 models across 4 datasets (52 experiments) reveals a three-tier structure: (1) Top BERT models (LaBSE SA = 0.70, USE SA = 0.68, S-BERT SA = 0.68) achieve strong alignment via translation-pair supervision; (2) LLM embeddings plateau at SA between 0.55 and 0.61 regardless of 0.6B to 8B scale; (3) MLM-only BERT models (mBERT, XLM-R, SA < 0.50) fail despite training on more than 100 languages. Training objective, not architecture or scale, determines alignment. Oracle Bone primitives (1200 BCE) expose semantic drift: models learn corpus patterns rather than cognitive primitives. This work provides semantic benchmarking to help practitioners select quality multilingual embeddings from hundreds of available models, showing cross-lingual alignment requires explicit translation supervision, not merely model scale or multilingual data.
[21] Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets
Xin Gao,Xiaoyang Wang,Yun Zhu,Mengzhang Cai,Conghui He,Lijun Wu
Main category: cs.CL
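TL;DR: Proposes a closed-loop data engineering framework based on OpenDataArena (ODA) that uses value-anchored rankings and multi-dimensional analysis to guide the construction of supervised fine-tuning (SFT) datasets, markedly improving LLM performance and data efficiency on mathematical reasoning and multi-domain tasks.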
Details
Motivation: SFT dataset construction currently relies on heuristic aggregation and lacks a systematic understanding of how individual samples affect model performance. A more principled, feedback-driven approach to data engineering is needed.
Method: The OpenDataArena (ODA) framework uses value-anchored rankings and multi-dimensional analysis to turn benchmark evaluation into feedback signals for dataset construction. A two-stage difficulty-aware pipeline builds ODA-Math-460k, and an "Anchor-and-Patch" strategy builds the multi-domain instruction dataset ODA-Mixture.
Result: ODA-Math-460k reaches SOTA on math benchmarks such as AIME and HMMT. The ODA-Mixture series outperforms larger open-source baselines on multi-domain tasks with higher data efficiency.
Conclusion: The results support closed-loop data engineering centered on transparent evaluation as a driver of data-centric AI, offering a reproducible and optimizable paradigm for building high-quality SFT datasets.
Abstract: The construction of Supervised Fine-Tuning (SFT) datasets is a critical yet under-theorized stage in the post-training of Large Language Models (LLMs), as prevalent practices often rely on heuristic aggregation without a systematic understanding of how individual samples contribute to model performance. In this report, we propose a paradigm shift from ad-hoc curation to a closed-loop dataset engineering framework using OpenDataArena (ODA), which leverages value-anchored rankings and multi-dimensional analysis to transform value benchmarking into feedback signals guiding dataset construction. We instantiate this methodology through two new datasets: ODA-Math-460k, a specialized mathematics reasoning dataset that utilizes a novel two-stage difficulty-aware pipeline to achieve State-of-the-Art (SOTA) results on benchmarks such as AIME and HMMT, and ODA-Mixture (100k & 500k), a series of multi-domain instruction datasets built via an "Anchor-and-Patch" strategy that outperforms significantly larger open-source baselines. Our empirical results demonstrate that ODA-driven datasets significantly improve both domain-specific reasoning and general utility while achieving superior data efficiency, validating a transition toward data-centric AI where transparent evaluation serves as the primary engine for engineering high-quality training data.
[22] From Detection to Diagnosis: Advancing Hallucination Analysis with Automated Data Synthesis
Yanyi Liu,Qingwen Yang,Tiezheng Guo,Feiyu Qu,Jun Liu,Yingyou Wen
Main category: cs.CL
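TL;DR: This paper proposes a new "hallucination diagnosis" paradigm that goes beyond binary detection by requiring models to localize errors, explain their causes, and correct the content. An automated pipeline, HDG, generates high-quality training data used to train HDM-4B-RL, a 4-billion-parameter model with strong diagnostic ability.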
Details
Motivation: Existing research on LLM hallucinations focuses on binary detection and lacks interpretable, actionable feedback, which limits practical use. A deeper and more constructive approach is needed to improve model reliability.
Method: The hallucination diagnosis task covers detection, localization, attribution, and correction. HDG automatically generates training samples with rich metadata, and HDM-4B-RL is trained with GRPO reinforcement learning using structural, accuracy, and localization reward signals.
Result: HDM-4B-RL surpasses the previous best detection models on the HaluEval benchmark, and its diagnostic ability is comparable to that of larger general-purpose models at a smaller size.
Conclusion: Hallucination diagnosis is feasible and valuable, and offers an effective path toward more trustworthy and reliable generative AI systems.
Abstract: Hallucinations in Large Language Models (LLMs), defined as the generation of content inconsistent with facts or context, represent a core obstacle to their reliable deployment in critical domains. Current research primarily focuses on binary "detection" approaches that, while capable of identifying hallucinations, fail to provide interpretable and actionable feedback for model improvement, thus limiting practical utility. To address this limitation, a new research paradigm is proposed, shifting from "detection" to "diagnosis". We introduce the Hallucination Diagnosis Task, which requires models not only to detect hallucinations but also to perform error localization, causal explanation, and content correction. We develop the Hallucination Diagnosis Generator (HDG), an automated pipeline that systematically generates high-quality training samples with rich diagnostic metadata from raw corpora through multi-dimensional augmentation strategies including controlled fact fabrication and reasoning chain perturbation. Using HDG-generated data, we train HDM-4B-RL, a 4-billion-parameter hallucination diagnosis model, employing Group Relative Policy Optimization (GRPO) with a comprehensive reward function incorporating structural, accuracy, and localization signals. Experimental results demonstrate that our model surpasses previous state-of-the-art detection models on the HaluEval benchmark while achieving comparable performance to advanced general-purpose models. In comprehensive diagnosis tasks, HDM-4B-RL matches the capabilities of larger general models while maintaining a smaller size.
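Illustrative sketch (not the authors' code): GRPO scores each sampled response relative to the other responses drawn for the same prompt, by standardizing rewards within the group. The snippet below shows that step together with a weighted sum of the three reward signals the entry names. The weights and function names are hypothetical, and the use of the population standard deviation is a choice made here; implementations differ on that detail.

```python
from statistics import mean, pstdev

# Hypothetical weights for the three signals named in the abstract.
WEIGHTS = {"structure": 0.2, "accuracy": 0.5, "localization": 0.3}

def composite_reward(signals):
    """Weighted sum of per-response reward signals, each assumed to be in [0, 1]."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Responses above the group mean get a positive advantage, those below
    a negative one. eps avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled diagnoses for one prompt, scored on the three signals.
group = [
    {"structure": 1.0, "accuracy": 1.0, "localization": 1.0},
    {"structure": 1.0, "accuracy": 1.0, "localization": 0.0},
    {"structure": 1.0, "accuracy": 0.0, "localization": 0.0},
    {"structure": 0.0, "accuracy": 0.0, "localization": 0.0},
]
rewards = [composite_reward(g) for g in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Because advantages are computed within the group, no separate value network is needed: a well-formed but wrong diagnosis is penalized only relative to its better-scoring siblings.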
This work validates the feasibility and value of hallucination diagnosis, providing an effective methodology for building more trustworthy and reliable generative AI systems.
[23] Stable and Explainable Personality Trait Evaluation in Large Language Models with Internal Activations
Xiaoxu Ma,Xiangbo Zhang,Zhenyu Weng
Main category: cs.CL
TL;DR: Proposes Persona-Vector Neutrality Interpolation (PVNI), a stable and explainable method based on internal activations for evaluating personality traits in large language models.
Details
Motivation: Existing questionnaire-based personality evaluation methods are unstable and hard to interpret, and their results are sensitive to small changes in prompt wording.
Method: Contrastive prompts are used to extract a persona vector associated with the target trait from the model's internal activations, and a neutral score is estimated by interpolating along that vector, giving an interpretable evaluation.
Result: Experiments on a range of large language models show that PVNI is more stable than existing methods and more robust to questionnaire and role-play variants.
Conclusion: PVNI provides a more stable and explainable framework for evaluating personality traits in large language models, supporting model understanding and responsible deployment.
Abstract: Evaluating personality traits in Large Language Models (LLMs) is key to model interpretation, comparison, and responsible deployment. However, existing questionnaire-based evaluation methods exhibit limited stability and offer little explainability, as their results are highly sensitive to minor variations in prompt phrasing or role-play configurations. To address these limitations, we propose an internal-activation-based approach, termed Persona-Vector Neutrality Interpolation (PVNI), for stable and explainable personality trait evaluation in LLMs. PVNI extracts a persona vector associated with a target personality trait from the model's internal activations using contrastive prompts. It then estimates the corresponding neutral score by interpolating along the persona vector as an anchor axis, enabling an interpretable comparison between the neutral prompt representation and the persona direction. We provide a theoretical analysis of the effectiveness and generalization properties of PVNI. Extensive experiments across diverse LLMs demonstrate that PVNI yields substantially more stable personality trait evaluations than existing methods, even under questionnaire and role-play variants.
[24] Bears, all bears, and some bears. Language Constraints on Language Models' Inductive Inferences
Sriram Padmanabhan,Siyuan Song,Kanishka Misra
Main category: cs.CL
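TL;DR: This study asks whether vision-language models show the same sensitivity to linguistic constraints in inductive inference as human children. Replicating the experiment of Gelman et al., it finds that models treat "all," generic, and "some" statements in line with humans, and that the differences stem from inductive constraints rather than surface form.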
Details
Motivation: To investigate whether vision-language models show the same differential sensitivity as human children to different types of linguistic statements (such as generics and universal quantification) in inductive inference.
Method: The experiment of Gelman et al. (2002) is replicated. Precondition tests first verify that the models identify image categories and are sensitive to "all" and "some"; the main experiment then tests the models' tendency to extend novel properties, followed by post-hoc analyses of their representations.
Result: Vision-language models show the same behavioral pattern as human children: universal quantification > generics > existential quantification. Post-hoc analyses show that the differences rest on inductive constraints rather than on surface linguistic form.
Conclusion: Vision-language models display language-guided inductive inference similar to that of humans, suggesting that they may have some degree of abstract semantic representation and inductive bias.
Abstract: Language places subtle constraints on how we make inductive inferences. Developmental evidence by Gelman et al. (2002) has shown children (4 years and older) to differentiate among generic statements ("Bears are daxable"), universally quantified NPs ("all bears are daxable") and indefinite plural NPs ("some bears are daxable") in extending novel properties to a specific member (all > generics > some), suggesting that they represent these types of propositions differently. We test if these subtle differences arise in general purpose statistical learners like Vision Language Models, by replicating the original experiment. On tasking them through a series of precondition tests (robust identification of categories in images and sensitivities to all and some), followed by the original experiment, we find behavioral alignment between models and humans. Post-hoc analyses on their representations revealed that these differences are organized based on inductive constraints and not surface-form differences.
[25] MedRedFlag: Investigating how LLMs Redirect Misconceptions in Real-World Health Communication
Sraavya Sambara,Yuan Pu,Ayman Ali,Vishala Mishra,Lionel Wong,Monica Agrawal
Main category: cs.CL
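TL;DR: This paper studies how large language models (LLMs) respond to real-world medical questions that contain false premises, and finds that LLMs often fail to redirect users and correct the misconception, which poses a safety risk.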
Details
Motivation: Patients often embed false assumptions in their questions, and safe medical communication requires correcting these misconceptions first. Current LLMs have not yet been evaluated in such situations.
Method: The authors build MedRedFlag, a dataset of more than 1,100 real-world medical questions from Reddit, and systematically compare the responses of state-of-the-art LLMs with those of clinicians.
Result: The analysis shows that LLMs often fail to redirect effectively even when they detect the false premise, and may give answers that lead to poor medical decisions.
Conclusion: LLMs show significant safety shortcomings in real-world medical communication, and improvements are needed to make patient-facing medical AI systems safe.
Abstract: Real-world health questions from patients often unintentionally embed false assumptions or premises. In such cases, safe medical communication typically involves redirection: addressing the implicit misconception and then responding to the underlying patient context, rather than the original question. While large language models (LLMs) are increasingly being used by lay users for medical advice, they have not yet been tested for this crucial competency. Therefore, in this work, we investigate how LLMs react to false premises embedded within real-world health questions. We develop a semi-automated pipeline to curate MedRedFlag, a dataset of 1100+ questions sourced from Reddit that require redirection. We then systematically compare responses from state-of-the-art LLMs to those from clinicians. Our analysis reveals that LLMs often fail to redirect problematic questions, even when the problematic premise is detected, and provide answers that could lead to suboptimal medical decision making. Our benchmark and results reveal a novel and substantial gap in how LLMs perform under the conditions of real-world health communication, highlighting critical safety concerns for patient-facing medical AI systems. Code and dataset are available at https://github.com/srsambara-1/MedRedFlag.
[26] OUTLINEFORGE: Hierarchical Reinforcement Learning with Explicit States for Scientific Writing
Yilin Bao,Ziyao He,Zayden Yang
Main category: cs.CL
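TL;DR: Proposes a reinforcement learning framework for generating scientific paper outlines that uses staged optimization to improve structural consistency, citation fidelity, and factual accuracy.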
Details
Motivation: Current large language models produce scientific papers with poor global structure, incomplete input coverage, and inconsistent citations, so more effective document-level planning is needed.
Method: Scientific outline construction is modeled as a long-horizon planning problem over hierarchical document structures and optimized in two stages: backward outline reconstruction and forward value-guided reinforcement learning, with rewards for scientific correctness, discourse coherence, and citation fidelity.
Result: On a newly proposed benchmark for scientific paper generation, the method clearly outperforms existing baselines in structural coherence, citation reliability, and input utilization.
Conclusion: The framework improves document-level planning and factual consistency in scientific paper generation, offering a reliable approach to automated research writing.
Abstract: Scientific paper generation requires document-level planning and factual grounding, but current large language models, despite their strong local fluency, often fail in global structure, input coverage, and citation consistency. We present a reinforcement learning framework that casts scientific outline construction as a long-horizon planning problem over hierarchical document structures. Our approach models edit evolving outlines through structured actions, enabling the system to incrementally build a complete scientific manuscript. To support effective and stable learning, we introduce a two-stage optimization procedure consisting of (i) backward outline reconstruction from partial plans to enforce global structural consistency, and (ii) forward value-guided reinforcement learning with rewards explicitly modeling scientific correctness, discourse coherence, and citation fidelity. In addition, we introduce a benchmark for scientific paper generation that evaluates document planning, input utilization, reference faithfulness, outline organization, and content-level factual accuracy. Our results show consistent improvements over strong neural and LLM baselines, particularly in long-range structural coherence and citation reliability.
[27] Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL
Yifei Shen,Yilun Zhao,Justice Ou,Tinglin Huang,Arman Cohan
Main category: cs.CL
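TL;DR: CLINSQL is a clinical text-to-SQL benchmark built on MIMIC-IV v3.1 with 633 expert-annotated tasks that require multi-table joins, clinically meaningful filters, and multi-step query generation. Evaluation shows that current models still fall short of clinical reliability.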
Details
Motivation: Existing text-to-SQL models struggle with real-world electronic health record (EHR) analysis because they lack support for multi-table structures, temporal windows, and patient cohort reasoning. A more clinically representative benchmark is needed to drive practical progress.
Method: The CLINSQL benchmark contains 633 expert-annotated text-SQL pairs on MIMIC-IV v3.1, covering multi-table joins, clinical coding systems, and complex query logic. Twenty-two proprietary and open-source models are evaluated under Chain-of-Thought self-refinement, using rubric-based SQL analysis combined with execution checks.
Result: On the test set, GPT-5-mini reaches an execution score of 74.7%, DeepSeek-R1 is the best open-source model at 69.2%, and Gemini-2.5-Pro scores 85.5% on Easy tasks but drops to 67.2% on Hard tasks, showing that current models degrade markedly on complex clinical queries.
Conclusion: Although large models make some headway on CLINSQL, their performance remains far from clinically reliable. The benchmark provides an important yardstick for reliable text-to-SQL systems aimed at real-world EHR analytics.
Abstract: Real-world clinical text-to-SQL requires reasoning over heterogeneous EHR tables, temporal windows, and patient-similarity cohorts to produce executable queries. We introduce CLINSQL, a benchmark of 633 expert-annotated tasks on MIMIC-IV v3.1 that demands multi-table joins, clinically meaningful filters, and executable SQL. Solving CLINSQL entails navigating schema metadata and clinical coding systems, handling long contexts, and composing multi-step queries beyond traditional text-to-SQL. We evaluate 22 proprietary and open-source models under Chain-of-Thought self-refinement and use rubric-based SQL analysis with execution checks that prioritize critical clinical requirements. Despite recent advances, performance remains far from clinical reliability: on the test set, GPT-5-mini attains 74.7% execution score, DeepSeek-R1 leads open-source at 69.2%, and Gemini-2.5-Pro drops from 85.5% on Easy to 67.2% on Hard. Progress on CLINSQL marks tangible advances toward clinically reliable text-to-SQL for real-world EHR analytics.
[28] Clozing the Gap: Exploring Why Language Model Surprisal Outperforms Cloze Surprisal
Sathvik Nair,Byung-Doh Oh
Main category: cs.CL
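TL;DR: Language model probabilities predict language processing difficulty better than probabilities from human cloze tasks, mainly because they have higher resolution, distinguish semantically similar words, and estimate the probabilities of low-frequency words more accurately.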
Details
Motivation: The reasons why language model probabilities outperform human cloze probabilities need to be established, because different predictors can lead to different scientific conclusions.
Method: Language model probabilities and cloze probabilities are compared as predictors of processing effort to test three hypotheses: higher resolution, the ability to distinguish semantically similar words, and accurate probability assignment for low-frequency words.
Result: Language model probabilities outperform cloze probabilities on all three counts, which explains their predictive advantage.
Conclusion: The resolution of cloze studies should be improved, and further work should examine whether human prediction is also sensitive to the fine-grained distinctions that language models capture.
Abstract: How predictable a word is can be quantified in two ways: using human responses to the cloze task or using probabilities from language models (LMs). When used as predictors of processing effort, LM probabilities outperform probabilities derived from cloze data. However, it is important to establish that LM probabilities do so for the right reasons, since different predictors can lead to different scientific conclusions about the role of prediction in language comprehension. We present evidence for three hypotheses about the advantage of LM probabilities: not suffering from low resolution, distinguishing semantically similar words, and accurately assigning probabilities to low-frequency words. These results call for efforts to improve the resolution of cloze studies, coupled with experiments on whether human-like prediction is also as sensitive to the fine-grained distinctions made by LM probabilities.
[29] Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations
Christabel Acquaye,Yi Ting Huang,Marine Carpuat,Rachel Rudinger
Main category: cs.CL
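TL;DR: This study examines whether open-source large language models (LLMs) can predict the difficulty of math questions by simulating student performance in a classroom. Combined with Item Response Theory (IRT) models, the simulations correlate strongly with real student results.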
Details
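Motivation: Estimating the difficulty of math test items traditionally depends on expensive human pilot studies, so a low-cost, efficient alternative is needed. Using LLMs to simulate the answers of students at different levels could automate difficulty estimation.
Method: The LLM is prompted to role-play students of different grades (4, 8, and 12) and proficiency levels to simulate a classroom's responses. The simulated outcomes are used to fit Item Response Theory (IRT) models, and the extracted difficulty parameters are compared with real student response data from NAEP. The experiments also examine the effect of classroom size, how students are identified (names versus IDs), and stratification by gender and race.
Result: Correlations between simulated and real-world item difficulty reach 0.75 (grade 4), 0.76 (grade 8), and 0.82 (grade 12). Named students work better than IDs, and stratifying names by gender and race improves predictions further. Open-source models with weaker math ability (such as Gemma) predict real difficulty better than stronger ones (such as Llama and Qwen).
Conclusion: Although LLMs are poor direct judges of item difficulty, role-play simulation combined with IRT modeling can effectively predict how hard math items are for real students. The approach suits open-source models in particular and offers a low-cost, scalable route for standardized test development.
Abstract: Standardized math assessments require expensive human pilot studies to establish the difficulty of test items. We investigate the predictive value of open-source large language models (LLMs) for evaluating the difficulty of multiple-choice math questions for real-world students. We show that, while LLMs are poor direct judges of problem difficulty, simulation-based approaches with LLMs yield promising results under the right conditions. Under the proposed approach, we simulate a "classroom" of 4th, 8th, or 12th grade students by prompting the LLM to role-play students of varying proficiency levels. We use the outcomes of these simulations to fit Item Response Theory (IRT) models, comparing learned difficulty parameters for items to their real-world difficulties, as determined by item-level statistics furnished by the National Assessment of Educational Progress (NAEP). We observe correlations as high as 0.75, 0.76, and 0.82 for grades 4, 8, and 12, respectively. In our simulations, we experiment with different "classroom sizes," showing tradeoffs between computation size and accuracy. We find that role-play with named students improves predictions (compared to student IDs), and stratifying names across gender and race further improves predictions.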
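Illustrative sketch (not the authors' code): the entry says simulated responses are used to fit IRT models but does not say which variant. The snippet below assumes the simplest one, the Rasch (1PL) model, in which the probability that student s answers item i correctly is sigmoid(theta_s - b_i), and fits it by plain gradient ascent on the log-likelihood. The response matrix, step count, and learning rate are made up for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_rasch(responses, steps=500, lr=0.05):
    """Fit abilities (theta) and item difficulties (b) of a Rasch model.

    responses[s][i] is 1 if simulated student s answered item i correctly,
    else 0. Returns (theta, b). Difficulties are re-centered to sum to zero
    after every step, since the model only identifies theta - b.
    """
    n_students, n_items = len(responses), len(responses[0])
    theta = [0.0] * n_students
    b = [0.0] * n_items
    for _ in range(steps):
        grad_theta = [0.0] * n_students
        grad_b = [0.0] * n_items
        for s in range(n_students):
            for i in range(n_items):
                residual = responses[s][i] - sigmoid(theta[s] - b[i])
                grad_theta[s] += residual  # d logL / d theta_s
                grad_b[i] -= residual      # d logL / d b_i
        theta = [t + lr * g for t, g in zip(theta, grad_theta)]
        b = [d + lr * g for d, g in zip(b, grad_b)]
        shift = sum(b) / n_items
        b = [d - shift for d in b]
        theta = [t - shift for t in theta]
    return theta, b

# Hypothetical simulated classroom: 6 students x 3 items.
RESPONSES = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
]
abilities, difficulties = fit_rasch(RESPONSES)
print(difficulties)  # the item fewest students solved gets the largest value
```

The fitted difficulty values are what would then be correlated with real-world item statistics; that comparison step is not shown here.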
Our results show that LLMs with relatively weaker mathematical abilities (Gemma) actually yield better real-world difficulty predictions than mathematically stronger models (Llama and Qwen), further underscoring the suitability of open-source models for the task.
[30] Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG
David Samuel Setiawan,Raphaël Merx,Jey Han Lau
Main category: cs.CL
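TL;DR: Proposes a hybrid framework that combines a fine-tuned NMT model with a retrieval-augmented LLM, mitigating the loss of neural machine translation quality under domain shift for low-resource languages and restoring near in-domain quality on Dhao.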
Details
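Motivation: Because data is scarce, neural machine translation models for low-resource languages degrade sharply under domain shift. This paper addresses that problem, aiming to improve cross-domain translation for languages with almost no digital footprint, such as Dhao.
Method: In the hybrid framework, an NMT model fine-tuned on New Testament data produces an initial draft, which an LLM with Retrieval-Augmented Generation (RAG) then refines. The number of retrieved examples and the retrieval algorithm are varied to analyze how each affects performance.
Result: On the Old Testament test set, the fine-tuned NMT model alone drops from 36.17 to 27.11 chrF++, while the hybrid system recovers to 35.21, a gain of 8.10 points. The gain comes mainly from the number of retrieved examples rather than the choice of retrieval algorithm. Qualitative analysis shows that the LLM repairs severe NMT failures in zero-shot domains.
Conclusion: The hybrid framework counters the loss of translation quality under domain shift for low-resource languages. An LLM with retrieval augmentation acts as a robust "safety net" and substantially improves cross-domain translation when diverse training data is unavailable.
Abstract: Neural Machine Translation (NMT) models for low-resource languages suffer significant performance degradation under domain shift. We quantify this challenge using Dhao, an indigenous language of Eastern Indonesia with no digital footprint beyond the New Testament (NT). When applied to the unseen Old Testament (OT), a standard NMT model fine-tuned on the NT drops from an in-domain score of 36.17 chrF++ to 27.11 chrF++. To recover this loss, we introduce a hybrid framework where a fine-tuned NMT model generates an initial draft, which is then refined by a Large Language Model (LLM) using Retrieval-Augmented Generation (RAG). The final system achieves 35.21 chrF++ (+8.10 recovery), effectively matching the original in-domain quality. Our analysis reveals that this performance is driven primarily by the number of retrieved examples rather than the choice of retrieval algorithm. Qualitative analysis confirms the LLM acts as a robust "safety net," repairing severe failures in zero-shot domains.
[31] SocraticKG: Knowledge Graph Construction via QA-Driven Fact Extraction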
Sanghyeok Choi,Woosang Jeon,Kyuseok Yang,Taehyeong Kim
Main category: cs.CL
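TL;DR: SocraticKG is an automated knowledge graph construction method based on question-answer pairs. By using 5W1H-guided QA expansion to unfold document semantics systematically before triple extraction, it balances the trade-off between factual coverage and structural coherence.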
Details
Motivation: Existing LLM-based knowledge graph construction methods face a trade-off between factual coverage and relational coherence: over-segmentation fragments relations, while premature consolidation loses information.
Method: SocraticKG uses 5W1H-guided question-answer pairs as a structured intermediate representation to unfold semantics at the document level before extracting triples, preserving contextual dependencies and implicit relations.
Result: Experiments on the MINE benchmark show that SocraticKG maintains high structural cohesion while substantially increasing the volume of extracted knowledge, and retains facts better than existing methods.
Conclusion: QA-mediated semantic scaffolding plays a key role in knowledge graph construction and supports more coherent and reliable structuring of knowledge.
Abstract: Constructing Knowledge Graphs (KGs) from unstructured text provides a structured framework for knowledge representation and reasoning, yet current LLM-based approaches struggle with a fundamental trade-off: factual coverage often leads to relational fragmentation, while premature consolidation causes information loss. To address this, we propose SocraticKG, an automated KG construction method that introduces question-answer pairs as a structured intermediate representation to systematically unfold document-level semantics prior to triple extraction. By employing 5W1H-guided QA expansion, SocraticKG captures contextual dependencies and implicit relational links typically lost in direct KG extraction pipelines, providing explicit grounding in the source document that helps mitigate implicit reasoning errors. Evaluation on the MINE benchmark demonstrates that our approach effectively addresses the coverage-connectivity trade-off, achieving superior factual retention while maintaining high structural cohesion even as extracted knowledge volume substantially expands. These results highlight that QA-mediated semantic scaffolding plays a critical role in structuring semantics prior to KG extraction, enabling more coherent and reliable graph construction in subsequent stages.
[32] EHRNavigator: A Multi-Agent System for Patient-Level Clinical Question Answering over Heterogeneous Electronic Health Records
Lingfei Qian,Mauro Giuffre,Yan Wang,Huan He,Qianqian Xie,Xuguang Ai,Xeuqing Peng,Fan Ma,Ruey-Ling Weng,Donald Wright,Adan Wang,Qingyu Chen,Vipina K. Keloth,Hua Xu
Main category: cs.CL
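TL;DR: EHRNavigator is a multi-agent framework for patient-level question answering over heterogeneous, multimodal electronic health record (EHR) data that performs efficiently and accurately in real clinical settings.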
Details
Motivation: Most existing natural language question-answering systems are evaluated on benchmark datasets and have limited relevance to clinical practice. A system is needed that can handle complex EHR data in real hospital environments.
Method: EHRNavigator is a multi-agent framework in which AI agents navigate and answer questions over multimodal, heterogeneous EHR data. It is evaluated on public benchmarks and institutional datasets.
Result: The system reaches 86% accuracy on real-world cases with clinically acceptable response times, and generalizes well.
Conclusion: EHRNavigator bridges the gap between benchmark evaluation and clinical deployment, offering a robust, adaptive, and efficient solution for real-world EHR question answering.
Abstract: Clinical decision-making increasingly relies on timely and context-aware access to patient information within Electronic Health Records (EHRs), yet most existing natural language question-answering (QA) systems are evaluated solely on benchmark datasets, limiting their practical relevance. To overcome this limitation, we introduce EHRNavigator, a multi-agent framework that harnesses AI agents to perform patient-level question answering across heterogeneous and multimodal EHR data. We assessed its performance using both public benchmark and institutional datasets under realistic hospital conditions characterized by diverse schemas, temporal reasoning demands, and multimodal evidence integration. Through quantitative evaluation and clinician-validated chart review, EHRNavigator demonstrated strong generalization, achieving 86% accuracy on real-world cases while maintaining clinically acceptable response times. Overall, these findings confirm that EHRNavigator effectively bridges the gap between benchmark evaluation and clinical deployment, offering a robust, adaptive, and efficient solution for real-world EHR question answering.
[33] EmplifAI: a Fine-grained Dataset for Japanese Empathetic Medical Dialogues in 28 Emotion Labels
Wan Jou She,Lis Kanashiro Pereira,Fei Cheng,Sakiko Yahata,Panote Siriaraya,Eiji Aramaki
Main category: cs.CL
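TL;DR: This paper introduces EmplifAI, a Japanese empathetic dialogue dataset for supporting patients with chronic conditions as they cope with complex emotions. It contains situation-based dialogues covering 28 fine-grained emotion categories, and its usefulness is validated through LLM evaluation and fine-tuning.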
Details
Motivation: Patients with chronic conditions experience complex, shifting emotions across the stages of disease management, and existing datasets fail to capture these dynamics, so a more situated, emotionally fine-grained empathetic dialogue dataset is needed. Method: Adapts and validates 28 fine-grained emotion categories from the GoEmotions taxonomy; builds 280 medically contextualized situations and 4125 two-turn dialogues through crowdsourcing and expert review; evaluates multiple LLMs on situation-dialogue pairs with BERTScore; and fine-tunes a Japanese LLM to validate the dataset. Result: Models reach an F1 score of 0.83 on situation-dialogue pairs; the fine-tuned Japanese LLM improves notably in fluency, general empathy, and emotion-specific empathy; comparing LLM-as-a-Judge scores with human ratings shows some correlation between the two, along with potential risks. Conclusion: EmplifAI is a high-quality, situated, fine-grained Japanese empathetic dialogue dataset that supports research on emotional dialogue systems for patients with chronic conditions, and its evaluation pipeline offers a reference for future empathetic generation work. Abstract: This paper introduces EmplifAI, a Japanese empathetic dialogue dataset designed to support patients coping with chronic medical conditions. These patients often experience a wide range of positive and negative emotions (e.g., hope and despair) that shift across different stages of disease management. EmplifAI addresses this complexity by providing situation-based dialogues grounded in 28 fine-grained emotion categories, adapted and validated from the GoEmotions taxonomy. The dataset includes 280 medically contextualized situations and 4125 two-turn dialogues, collected through crowdsourcing and expert review. To evaluate emotional alignment in empathetic dialogues, we assessed model predictions on situation-dialogue pairs using BERTScore across multiple large language models (LLMs), achieving F1 scores of 0.83. Fine-tuning a baseline Japanese LLM (LLM-jp-3.1-13b-instruct4) with EmplifAI resulted in notable improvements in fluency, general empathy, and emotion-specific empathy. Furthermore, we compared the scores assigned by LLM-as-a-Judge and human raters on dialogues generated by multiple LLMs to validate our evaluation pipeline and discuss the insights and potential risks derived from the correlation analysis.
[34] Long-Chain Reasoning Distillation via Adaptive Prefix Alignment
Zhenghao Liu,Zhuoyang Wu,Xinze Li,Yukun Yan,Shuo Wang,Zulong Chen,Yu Gu,Ge Yu,Maosong Sun
Main category: cs.CL
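TL;DR: Proposes P-ALIGN, a distillation framework that uses adaptive prefix alignment to improve small models on mathematical reasoning, making effective use of teacher reasoning trajectories while discarding redundant portions.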
Details
Motivation: Teacher-generated reasoning trajectories are too long and complex for student models to learn from, creating a mismatch between the supervision signal and the student's learning capacity. Method: Proposes the P-ALIGN framework, which adaptively truncates teacher-generated reasoning trajectories by judging whether the remaining suffix is concise and sufficient, then supervises the student model with the prefix to achieve prefix alignment. Result: Experiments on multiple mathematical reasoning benchmarks show that P-ALIGN outperforms all baselines by over 3%; analysis shows that its prefixes provide more effective supervision signals and avoid the negative impact of redundant and uncertain reasoning components. Conclusion: P-ALIGN effectively improves the reasoning ability of small-scale models and, by adaptively exploiting teacher reasoning trajectories, provides a better form of supervision for knowledge distillation. Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in solving complex mathematical problems. Recent studies show that distilling long reasoning trajectories can effectively enhance the reasoning performance of small-scale student models. However, teacher-generated reasoning trajectories are often excessively long and structurally complex, making them difficult for student models to learn. This mismatch leads to a gap between the provided supervision signal and the learning capacity of the student model. To address this challenge, we propose Prefix-ALIGNment distillation (P-ALIGN), a framework that fully exploits teacher CoTs for distillation through adaptive prefix alignment. Specifically, P-ALIGN adaptively truncates teacher-generated reasoning trajectories by determining whether the remaining suffix is concise and sufficient to guide the student model. Then, P-ALIGN leverages the teacher-generated prefix to supervise the student model, encouraging effective prefix alignment. Experiments on multiple mathematical reasoning benchmarks demonstrate that P-ALIGN outperforms all baselines by over 3%. Further analysis indicates that the prefixes constructed by P-ALIGN provide more effective supervision signals, while avoiding the negative impact of redundant and uncertain reasoning components. All code is available at https://github.com/NEUIR/P-ALIGN.
[35] Deriving Character Logic from Storyline as Codified Decision Trees
Letian Peng,Kun Zhou,Longfei Yun,Yupeng Hou,Jingbo Shang
Main category: cs.CL
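TL;DR: Proposes the Codified Decision Trees (CDT) framework, which automatically induces executable, interpretable decision-tree behavioral profiles from large-scale narrative data, improving the behavioral consistency and reliability of role-playing agents.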
Details
Motivation: Existing behavioral profiles for role-playing agents are mostly unstructured, non-executable, and unvalidated, which makes agent behavior brittle and inconsistent. Method: Proposes the CDT framework, which learns a tree of conditional rules from large-scale narrative data by iteratively generating candidate scene-action rules, validating them against data, and refining them hierarchically; internal nodes are validated scene conditions and leaves are grounded behavioral statements. Result: Across multiple benchmarks covering 85 characters from 16 works, CDT substantially outperforms human-written profiles and prior profile induction methods. Conclusion: Structured, executable, data-validated behavioral representations effectively improve the reliability and contextual adaptability of role-playing agents. Abstract: Role-playing (RP) agents rely on behavioral profiles to act consistently across diverse narrative contexts, yet existing profiles are largely unstructured, non-executable, and weakly validated, leading to brittle agent behavior. We propose Codified Decision Trees (CDT), a data-driven framework that induces an executable and interpretable decision structure from large-scale narrative data. CDT represents behavioral profiles as a tree of conditional rules, where internal nodes correspond to validated scene conditions and leaves encode grounded behavioral statements, enabling deterministic retrieval of context-appropriate rules at execution time. The tree is learned by iteratively inducing candidate scene-action rules, validating them against data, and refining them through hierarchical specialization, yielding profiles that support transparent inspection and principled updates. Across multiple benchmarks, CDT substantially outperforms human-written profiles and prior profile induction methods on 85 characters across 16 artifacts, indicating that codified and validated behavioral representations lead to more reliable agent grounding.
[36] Is MT Ready for the Next Crisis or Pandemic?
Vipasha Bansal,Elizabeth Brown,Chelsea Kendrick,Benjamin Pong,William D. Lewis
Main category: cs.CL
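TL;DR: This study evaluates how well four commercial machine translation systems translate pandemic-related text for low-resource languages, using the TICO-19 dataset to gauge their usability and readiness for the next pandemic.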
Details
Motivation: In crises and medical settings, language barriers often separate governments, aid organizations, and affected communities, and the performance of commercial MT systems on low-resource languages is unclear, so their practical usability needs to be assessed. Method: Evaluates four commercial MT systems on the TICO-19 dataset of pandemic-related sentences in many high-priority languages, and analyzes the readability and accuracy of their output. Result: The study finds marked differences in translation quality among current commercial MT systems for low-resource languages, especially in the crisis and medical domains; some systems produce unreliable output that hinders communication. Conclusion: Current MT technology still falls short of the multilingual communication needs of the next pandemic and needs further improvement in its support for low-resource languages. Abstract: Communication in times of crisis is essential. However, there is often a mismatch between the language of governments, aid providers, doctors, and those to whom they are providing aid. Commercial MT systems are reasonable tools to turn to in these scenarios. But how effective are these tools for translating to and from low resource languages, particularly in the crisis or medical domain? In this study, we evaluate four commercial MT systems using the TICO-19 dataset, which is composed of pandemic-related sentences from a large set of high priority languages spoken by communities most likely to be affected adversely in the next pandemic. We then assess the current degree of "readiness" for another pandemic (or epidemic) based on the usability of the output translations.
[37] CALM-IT: Generating Realistic Long-Form Motivational Interviewing Dialogues with Dual-Actor Conversational Dynamics Tracking
Viet Cuong Nguyen,Nhi Yen Nguyen,Kristin A. Candan,Mary Conlon,Vanessa Rumie,Kristen Risola,Srijan Kumar,Munmun De Choudhury
Main category: cs.CL
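TL;DR: This paper proposes CALM-IT, a framework for generating and evaluating long-form Motivational Interviewing dialogues that models a bidirectional state-space process to improve LLM coherence and goal alignment in therapeutic conversations.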
Details
Motivation: LLMs are widely used in mental health settings but struggle to sustain long, goal-directed dialogue, and are prone to semantic drift and incoherent strategy. Method: Proposes the CALM-IT framework, which models therapist-client interaction as a bidirectional state-space process in which both parties continuously update alignment state, mental state, and short-term goals to guide strategy selection and utterance generation. Result: In large-scale evaluations, CALM-IT outperforms strong baselines in Effectiveness and Goal Alignment and remains more stable as conversations lengthen; despite fewer therapist redirections, it achieves the highest client acceptance rate (64.3%), indicating more precise intervention timing. Conclusion: Modeling dynamically evolving conversational state is essential for generating high-quality long-form synthetic dialogues. Abstract: Large Language Models (LLMs) are increasingly used in mental health-related settings, yet they struggle to sustain realistic, goal-directed dialogue over extended interactions. While LLMs generate fluent responses, they optimize locally for the next turn rather than maintaining a coherent model of therapeutic progress, leading to brittleness and long-horizon drift. We introduce CALM-IT, a framework for generating and evaluating long-form Motivational Interviewing (MI) dialogues that explicitly models dual-actor conversational dynamics. CALM-IT represents therapist-client interaction as a bidirectional state-space process, in which both agents continuously update inferred alignment, mental states, and short-term goals to guide strategy selection and utterance generation. Across large-scale evaluations, CALM-IT consistently outperforms strong baselines in Effectiveness and Goal Alignment and remains substantially more stable as conversation length increases. Although CALM-IT initiates fewer therapist redirections, it achieves the highest client acceptance rate (64.3%), indicating more precise and therapeutically aligned intervention timing. Overall, CALM-IT provides evidence that modeling evolving conversational state is essential for generating high-quality long-form synthetic conversations.
[38] SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature
Yiming Ren,Junjie Wang,Yuxin Meng,Yihang Shi,Zhiqiang Lin,Ruihang Chu,Yiran Xu,Ziming Li,Yunfei Zhao,Zihan Wang,Yu Qiao,Ruiming Tang,Minghao Liu,Yujiu Yang
Main category: cs.CL
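TL;DR: Proposes the "Fish-in-the-Ocean" (FITO) paradigm, which evaluates the genuine reasoning ability of multimodal LLMs on scientific papers by requiring cross-modal evidence chains, and releases SIN-Data and SIN-Bench to promote traceable, grounded evaluation.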
Details
Motivation: Existing evaluations such as answer-only matching or synthetic "Needle-In-A-Haystack" tests cannot tell whether a model truly understands the multimodal information in scientific papers and reasons causally over it, because they do not require supporting evidence chains. Method: Proposes the FITO evaluation paradigm; builds SIN-Data, a corpus that preserves the native interleaving of text and figures; designs SIN-Bench with four progressive tasks (evidence discovery, hypothesis verification, grounded QA, and evidence-anchored summarization); and introduces a "No Evidence, No Score" scoring mechanism that assesses evidence quality by matching, relevance, and logic. Result: Experiments on eight MLLMs show that Gemini-3-pro performs best overall (average 0.573), while GPT-5 has the highest QA accuracy (0.767) but lower evidence-alignment scores, revealing a gap between correctness and traceable support. Conclusion: The main bottleneck in scientific literature understanding is evidence grounding; evaluation should emphasize verifiable reasoning paths rather than answer matching alone. Abstract: Evaluating whether multimodal large language models truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic "Needle-In-A-Haystack" tests often reward answer matching without requiring a causal, evidence-linked reasoning trace in the document. We propose the "Fish-in-the-Ocean" (FITO) paradigm, which requires models to construct explicit cross-modal evidence chains within native scientific documents. To operationalize FITO, we build SIN-Data, a scientific interleaved corpus that preserves the native interleaving of text and figures. On top of it, we construct SIN-Bench with four progressive tasks covering evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA), and evidence-anchored synthesis (SIN-Summary). We further introduce "No Evidence, No Score", which scores predictions only when they are grounded in verifiable anchors and diagnoses evidence quality via matching, relevance, and logic. Experiments on eight MLLMs show that grounding is the primary bottleneck: Gemini-3-pro achieves the best average overall score (0.573), while GPT-5 attains the highest SIN-QA answer accuracy (0.767) but underperforms on evidence-aligned overall scores, exposing a gap between correctness and traceable support.
[39] Skill-Aware Data Selection and Fine-Tuning for Data-Efficient Reasoning Distillation
Lechen Zhang,Yunxiang Zhang,Wei Hu,Lu Wang
Main category: cs.CL
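TL;DR: Proposes a skill-centric distillation framework that uses skill-based data selection and skill-aware fine-tuning to transfer reasoning ability from large models to small ones efficiently with little data.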
Details
Motivation: Existing reasoning distillation methods typically need large-scale data for supervised fine-tuning and are not data-efficient, so a more efficient method is needed that reduces data requirements and improves the transfer of specific skills. Method: Proposes a skill-centric distillation framework with two parts: skill-based data selection, which prioritizes examples targeting the student model's weaker skills, and skill-aware fine-tuning, which encourages explicit skill decomposition during problem solving. Result: With only 1,000 training examples selected from a 100K teacher-generated corpus, the method beats random SFT baselines by +1.6% on Qwen3-4B and +1.4% on Qwen3-8B across five mathematical reasoning benchmarks, with gains concentrated on the skills emphasized in training. Conclusion: The skill-centric distillation framework improves the data efficiency of reasoning transfer and confirms the benefit of targeted skill training in model distillation. Abstract: Large reasoning models such as DeepSeek-R1 and their distilled variants achieve strong performance on complex reasoning tasks. Yet, distilling these models often demands large-scale data for supervised fine-tuning (SFT), motivating the pursuit of data-efficient training methods. To address this, we propose a skill-centric distillation framework that efficiently transfers reasoning ability to weaker models with two components: (1) Skill-based data selection, which prioritizes examples targeting the student model's weaker skills, and (2) Skill-aware fine-tuning, which encourages explicit skill decomposition during problem solving. With only 1,000 training examples selected from a 100K teacher-generated corpus, our method surpasses random SFT baselines by +1.6% on Qwen3-4B and +1.4% on Qwen3-8B across five mathematical reasoning benchmarks. Further analysis confirms that these gains concentrate on skills emphasized during training, highlighting the effectiveness of skill-centric training for efficient reasoning distillation.
[40] Role-Playing Agents Driven by Large Language Models: Current Status, Challenges, and Future Trends
Ye Wang,Jiaxing Chen,Hongjiang Xiao
Main category: cs.CL
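TL;DR: This paper systematically surveys the development, key technologies, and future directions of role-playing language agents (RPLAs), covering the evolution from early rule-based templates to cognitive simulation, personality modeling and memory mechanisms, data construction challenges, and multi-dimensional evaluation frameworks.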
Details
Motivation: With the rapid development of LLMs, role-playing language agents are increasingly important in natural language processing and human-computer interaction, and their technical progress and challenges need a systematic review. Method: Through a literature survey, traces the technological evolution of RPLAs; analyzes core techniques such as personality modeling, memory-augmented prompting, and behavioral decision control; and summarizes data construction and multi-dimensional evaluation methods. Result: Identifies the key technical components of RPLAs, including psychological scale-driven personality modeling, memory-augmentation mechanisms, and motivation-situation-based behavioral control; systematically analyzes methods for building role-specific corpora and the associated copyright issues; and collates evaluation frameworks covering role knowledge, personality consistency, value alignment, and hallucination control. Conclusion: RPLAs are moving toward cognitive simulation and multimodal immersive interaction; future research should address personality evolution, multi-agent collaborative narrative, and integration with cognitive neuroscience, and the survey offers a systematic perspective and methodological support for that work. Abstract: In recent years, with the rapid advancement of large language models (LLMs), role-playing language agents (RPLAs) have emerged as a prominent research focus at the intersection of natural language processing (NLP) and human-computer interaction. This paper systematically reviews the current development and key technologies of RPLAs, delineating the technological evolution from early rule-based template paradigms, through the language style imitation stage, to the cognitive simulation stage centered on personality modeling and memory mechanisms. It summarizes the critical technical pathways supporting high-quality role-playing, including psychological scale-driven character modeling, memory-augmented prompting mechanisms, and motivation-situation-based behavioral decision control. At the data level, the paper further analyzes the methods and challenges of constructing role-specific corpora, focusing on data sources, copyright constraints, and structured annotation processes. In terms of evaluation, it collates multi-dimensional assessment frameworks and benchmark datasets covering role knowledge, personality fidelity, value alignment, and interactive hallucination, while commenting on the advantages and disadvantages of methods such as human evaluation, reward models, and LLM-based scoring. Finally, the paper outlines future development directions of role-playing agents, including personality evolution modeling, multi-agent collaborative narrative, multimodal immersive interaction, and integration with cognitive neuroscience, aiming to provide a systematic perspective and methodological insights for subsequent research.
[41] ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback
Yutao Mou,Zhangchi Xue,Lijun Li,Peiyang Liu,Shikun Zhang,Wei Ye,Jing Shao
Main category: cs.CL
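TL;DR: This paper proposes a new approach to detecting and preventing unsafe tool invocations by LLM agents during execution, comprising the TS-Bench benchmark, the TS-Guard guardrail model, and the feedback-driven reasoning framework TS-Flow, which together substantially improve safety and task completion.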
Details
Motivation: As LLM agents gain the ability to interact with environments by invoking external tools, their potential security risks grow as well, creating an urgent need to monitor and intervene in unsafe tool invocations in real time before execution. Method: Builds TS-Bench, a benchmark for step-level tool invocation safety detection; trains the TS-Guard model with multi-task reinforcement learning to assess request harmfulness and action-attack correlations from the interaction history; and designs the TS-Flow framework, which uses the guardrail model's feedback to guide agent reasoning. Result: TS-Guard effectively identifies unsafe tool invocations; TS-Flow reduces harmful invocations by ReAct-style agents by 65% on average and improves benign task completion by about 10% under prompt injection attacks. Conclusion: The work provides an interpretable, generalizable real-time safety mechanism for LLM agents and advances the development of safe, controllable agents. Abstract: While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.
[42] What Gets Activated: Uncovering Domain and Driver Experts in MoE Language Models
Guimin Hu,Meng Li,Qiwei Peng,Lijie Hu,Boyan Xu,Ruichu Cai
Main category: cs.CL
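TL;DR: This paper studies the mechanisms of expert activation in MoE LLMs, proposes entropy-based and causal-effect metrics to identify experts with domain preferences or strong causal influence, and finds that earlier tokens are more likely to trigger key experts and that adjusting these experts' weights improves model performance.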
Details
Motivation: Existing interpretability work mostly targets layer- or neuron-level mechanisms in Transformers, leaving expert-level behavior in MoE models underexplored; inspired by functional specialization in the human brain, this paper focuses on understanding expert activation patterns across domains and the roles experts play. Method: Introduces an entropy-based metric to measure an expert's preference for a particular domain, uses a causal-effect metric to identify driver experts with significant influence on the output, and analyzes the association between tokens and expert activation. Result: (1) Some experts show clear domain preferences, while others exert strong causal influence on model output; (2) tokens earlier in a sentence are more likely to trigger driver experts; (3) adjusting the weights of domain and driver experts yields performance gains across all three models and domains. Conclusion: The study sheds light on the division of labor among experts inside MoE models, improves their interpretability, and provides a basis for optimizing expert selection and model design. Abstract: Most interpretability work focuses on layer- or neuron-level mechanisms in Transformers, leaving expert-level behavior in MoE LLMs underexplored. Motivated by functional specialization in the human brain, we analyze expert activation by distinguishing domain and driver experts. In this work, we study expert activation in MoE models across three public domains and address two key questions: (1) which experts are activated, and whether certain expert types exhibit consistent activation patterns; and (2) how tokens are associated with and trigger the activation of specific experts. To answer these questions, we introduce entropy-based and causal-effect metrics to assess whether an expert is strongly favored for a particular domain, and how strongly expert activation contributes causally to the model's output, thereby identifying domain and driver experts, respectively. Furthermore, we explore how individual tokens are associated with the activation of specific experts. Our analysis reveals that (1) among the activated experts, some show clear domain preferences, while others exert strong causal influence on model performance, underscoring their decisive roles; (2) tokens occurring earlier in a sentence are more likely to trigger the driver experts; and (3) adjusting the weights of domain and driver experts leads to significant performance gains across all three models and domains. These findings shed light on the internal mechanisms of MoE models and enhance their interpretability.
[43] Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
Cameron Tice,Puria Radmard,Samuel Ratnam,Andy Kim,David Africa,Kyle O'Brien
Main category: cs.CL
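TL;DR: By pretraining 6.9B-parameter language models, this paper studies how discourse about AI alignment in the pretraining corpus affects model behavior. It finds that negative descriptions increase misaligned behavior while positive descriptions markedly reduce it, introduces the notion of "self-fulfilling alignment", and recommends addressing alignment during pretraining.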
Details
Motivation: To understand how narratives about AI behavior in pretraining corpora, especially negative ones, affect model alignment, and to prevent self-fulfilling misalignment that arises when models internalize negative priors. Method: Pretrains 6.9B-parameter language models while varying the proportion of synthetic documents about AI alignment or misalignment, thereby controlling the amount of such discourse in the pretraining corpus, and evaluates the effect on alignment. Result: Adding more discussion of AI misalignment leads models to exhibit more misaligned behavior; conversely, adding more discussion of aligned behavior reduces misalignment scores from 45% to 9%. The effect is dampened by post-training but persists. Conclusion: Alignment-related content in pretraining data shapes a model's alignment priors, supporting the idea of "self-fulfilling alignment"; the authors recommend optimizing for alignment during pretraining rather than relying on post-training alone. Abstract: Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training. Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners pretrain for alignment as well as capabilities. Our models and datasets are available at alignmentpretraining.ai
[44] AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers
Prachuryya Kaushik,Ashish Anand
Main category: cs.CL
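TL;DR: AWED-FiNER is an open-source ecosystem supporting fine-grained named entity recognition (FgNER) in 36 global languages, with a particular focus on low-resource and vulnerable languages. It provides agentic toolkits, web applications, and small expert models suited to multilingual and resource-constrained settings.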
Details
Motivation: Existing LLMs perform poorly on low-resource languages and fine-grained NLP tasks, and vulnerable languages lack technical support, so an efficient, accessible, offline-deployable solution dedicated to multilingual FgNER is needed. Method: Builds an ecosystem of agentic tools, web applications, and 49 small open-source expert models; the agentic system routes multilingual text to the specialized model for each language to produce FgNER annotations quickly, and the models support offline deployment on edge devices. Result: Delivers an FgNER system covering languages spoken by more than 6.6 billion people, including vulnerable languages such as Bodo and Manipuri, with annotation responses within seconds, a user-friendly web service, and lightweight models that can be deployed widely. Conclusion: AWED-FiNER fills a technical gap in fine-grained named entity recognition for multilingual settings, especially low-resource and vulnerable languages, and promotes more inclusive and decentralized NLP technology. Abstract: We introduce AWED-FiNER, an open-source ecosystem designed to bridge the gap in Fine-grained Named Entity Recognition (FgNER) for 36 global languages spoken by more than 6.6 billion people. While Large Language Models (LLMs) dominate general Natural Language Processing (NLP) tasks, they often struggle with low-resource languages and fine-grained NLP tasks. AWED-FiNER provides a collection of agentic toolkits, web applications, and several state-of-the-art expert models that provide FgNER solutions across 36 languages. The agentic tools make it possible to route multilingual text to specialized expert models and fetch FgNER annotations within seconds. The web-based platforms provide a ready-to-use FgNER annotation service for non-technical users. Moreover, the collection of language-specific, extremely small open-source state-of-the-art expert models facilitates offline deployment in resource-constrained scenarios, including edge devices. AWED-FiNER covers languages spoken by over 6.6 billion people, including a specific focus on vulnerable languages such as Bodo, Manipuri, Bishnupriya, and Mizo. The resources can be accessed here: Agentic Tool (https://github.com/PrachuryyaKaushik/AWED-FiNER), Web Application (https://hf.co/spaces/prachuryyaIITG/AWED-FiNER), and 49 Expert Detector Models (https://hf.co/collections/prachuryyaIITG/awed-finer).
[45] Credit C-GPT: A Domain-Specialized Large Language Model for Conversational Understanding in Vietnamese Debt Collection
Nhung Nguyen Thi Hong,Cuong Nguyen Dang,Tri Le Ngoc
Main category: cs.CL
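TL;DR: This paper presents Credit C-GPT, a seven-billion-parameter domain-specialized LLM for understanding Vietnamese debt collection conversations. It integrates multiple conversational intelligence tasks, and experiments show that it outperforms traditional pipeline approaches.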
Details
Motivation: Vietnamese debt collection conversations involve informal spoken language, emotional variability, and complex domain-specific reasoning that traditional NLP systems handle poorly, so a specialized model is needed to improve understanding. Method: Builds Credit C-GPT, a seven-billion-parameter LLM that integrates dialogue understanding, sentiment recognition, intent detection, call stage classification, and slot extraction in a multi-task framework, trained with dedicated data construction, annotation, and fine-tuning strategies. Result: Experiments on proprietary annotated datasets show that Credit C-GPT outperforms traditional pipeline approaches on every task and offers better scalability and privacy protection. Conclusion: Domain-specialized LLMs effectively improve conversational understanding in Vietnamese debt collection and provide an efficient solution for real-time assistance and post-call analytics in enterprises. Abstract: Debt collection is a critical function within the banking, financial services, and insurance (BFSI) sector, relying heavily on large-scale human-to-human conversational interactions conducted primarily in Vietnamese contact centers. These conversations involve informal spoken language, emotional variability, and complex domain-specific reasoning, which pose significant challenges for traditional natural language processing systems. This paper introduces Credit C-GPT, a domain-specialized large language model with seven billion parameters, fine-tuned for conversational understanding in Vietnamese debt collection scenarios. The proposed model integrates multiple conversational intelligence tasks, including dialogue understanding, sentiment recognition, intent detection, call stage classification, and structured slot-value extraction, within a single reasoning-based framework. We describe the data construction process, annotation strategy, and training methodology, and evaluate the model on proprietary human-annotated datasets. Experimental results show consistent improvements over traditional pipeline-based approaches, indicating that domain-specialized conversational language models provide a scalable and privacy-aware solution for real-time assistance and post-call analytics in enterprise contact centers.
[46] HOMURA: Taming the Sand-Glass for Time-Constrained LLM Translation via Reinforcement Learning
Ziang Cui,Mengran Yu,Tianjiao Li,Chenyu Shi,Yingxuan Shi,Lusheng Zhang,Hongwei Lin
Main category: cs.CL
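TL;DR: Proposes HOMURA, a reinforcement learning framework that produces translations satisfying syllable-level duration constraints while preserving semantics.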
Details
Motivation: To address the cross-lingual verbosity bias of LLMs in multilingual translation so that they can be used for time-constrained tasks such as subtitling and dubbing. Method: Builds the Sand-Glass benchmark for evaluating translation under syllable-level duration constraints, and proposes the HOMURA reinforcement learning framework, which uses a KL-regularized objective and a dynamic syllable-ratio reward to optimize the trade-off between semantic preservation and temporal compliance. Result: Experiments show that HOMURA significantly outperforms strong baselines, controlling output length precisely in a way that respects linguistic density hierarchies without compromising semantic adequacy. Conclusion: HOMURA effectively resolves LLM translation verbosity in time-sensitive scenarios, balancing semantics and duration. Abstract: Large Language Models (LLMs) have made remarkable strides in multilingual translation but are hindered by a systemic cross-lingual verbosity bias, rendering them unsuitable for strict time-constrained tasks like subtitling and dubbing. Current prompt-engineering approaches struggle to resolve this conflict between semantic fidelity and rigid temporal feasibility. To bridge this gap, we first introduce Sand-Glass, a benchmark specifically designed to evaluate translation under syllable-level duration constraints. Furthermore, we propose HOMURA, a reinforcement learning framework that explicitly optimizes the trade-off between semantic preservation and temporal compliance. By employing a KL-regularized objective with a novel dynamic syllable-ratio reward, HOMURA effectively "tames" the output length. Experimental results demonstrate that our method significantly outperforms strong LLM baselines, achieving precise length control that respects linguistic density hierarchies without compromising semantic adequacy.
[47] HUMANLLM: Benchmarking and Reinforcing LLM Anthropomorphism via Human Cognitive Patterns
Xintao Wang,Jian Yang,Weiyuan Li,Rui Xie,Jen-tse Huang,Jun Gao,Shuai Huang,Yueping Kang,Liyuan Gou,Hongwei Feng,Yanghua Xiao
Main category: cs.CL
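TL;DR: HUMANLLM is a new framework that models causal interactions among psychological patterns to make language models more authentic in role-play, simulating human cognitive and behavioral dynamics more accurately than traditional methods.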
Details
Motivation: Existing role-playing language agents struggle to align authentically with human cognitive and behavioral patterns and lack models of complex psychological mechanisms. Method: Constructs 244 psychological patterns from about 12,000 academic papers and synthesizes 11,359 scenarios in which 2-5 patterns interact, generating multi-turn conversations with inner thoughts and actions; proposes dual-level checklists that measure both single-pattern and multi-pattern performance. Result: HUMANLLM-8B outperforms Qwen3-32B on multi-pattern dynamics, the checklists align closely with human judgment (r=0.91), and holistic metrics are found to conflate simulation accuracy with social desirability. Conclusion: Authentic anthropomorphism requires deep cognitive modeling: simulating not only human behavior but also the psychological processes that produce it. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and generation, serving as the foundation for advanced persona simulation and Role-Playing Language Agents (RPLAs). However, achieving authentic alignment with human cognitive and behavioral patterns remains a critical challenge for these agents. We present HUMANLLM, a framework treating psychological patterns as interacting causal forces. We construct 244 patterns from ~12,000 academic papers and synthesize 11,359 scenarios where 2-5 patterns reinforce, conflict, or modulate each other, with multi-turn conversations expressing inner thoughts, actions, and dialogue. Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment (r=0.91) while revealing that holistic metrics conflate simulation accuracy with social desirability. HUMANLLM-8B outperforms Qwen3-32B on multi-pattern dynamics despite 4x fewer parameters, demonstrating that authentic anthropomorphism requires cognitive modeling: simulating not just what humans do, but the psychological processes generating those behaviors.
[48] One Instruction Does Not Fit All: How Well Do Embeddings Align Personas and Instructions in Low-Resource Indian Languages?
Arya Shah,Himanshu beniwal,Mayank Singh
Main category: cs.CL
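TL;DR: Presents PI-Indic-Align, the first unified benchmark spanning 12 Indian languages, which evaluates multilingual embedding models on culturally grounded persona-instruction alignment across four retrieval and classification tasks and provides reproducible baselines.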
Details
Motivation: Existing benchmarks are mostly limited to a single language or conflate retrieval with generation, and do not test whether multilingual embedding models can encode persona-instruction compatibility on their own; India's multilingual, multicultural setting makes more precise alignment methods especially necessary. Method: Builds a unified benchmark covering 12 Indian languages, with monolingual and cross-lingual persona-to-instruction retrieval, reverse retrieval, and binary compatibility classification, and evaluates eight multilingual embedding models in a frozen-encoder setting with a logistic regression head. Result: E5-Large-Instruct achieves Recall@1 of 27.4% on monolingual retrieval and 20.7% on cross-lingual retrieval, BGE-M3 reaches 32.1% Recall@1 on reverse retrieval, and LaBSE attains 75.3% AUROC on classification with good calibration. Conclusion: Different models lead on different tasks, so model choice should depend on the application; the work offers practical guidance and reproducible baselines for culturally aware alignment systems in India's multilingual setting. Abstract: Aligning multilingual assistants with culturally grounded user preferences is essential for serving India's linguistically diverse population of over one billion speakers across multiple scripts. However, existing benchmarks either focus on a single language or conflate retrieval with generation, leaving open the question of whether current embedding models can encode persona-instruction compatibility without relying on response synthesis. We present a unified benchmark spanning 12 Indian languages and four evaluation tasks: monolingual and cross-lingual persona-to-instruction retrieval, reverse retrieval from instruction to persona, and binary compatibility classification. Eight multilingual embedding models are evaluated in a frozen-encoder setting with a thin logistic regression head for classification. E5-Large-Instruct achieves the highest Recall@1 of 27.4% on monolingual retrieval and 20.7% on cross-lingual transfer, while BGE-M3 leads reverse retrieval at 32.1% Recall@1. For classification, LaBSE attains 75.3% AUROC with strong calibration. These findings offer practical guidance for model selection in Indic multilingual retrieval and establish reproducible baselines for future work. Code, datasets, and models are publicly available at https://github.com/aryashah2k/PI-Indic-Align.
[49] GeoSteer: Faithful Chain-of-Thought Steering via Latent Manifold Gradients
Kentaro Kazama,Daiki Shirafuji,Tatsuhiko Saito
Main category: cs.CL
TL;DR: This paper proposes GeoSteer, a manifold-based framework for improving the quality of intermediate reasoning in LLM multi-step reasoning.
Details
Motivation: Even when their final answers are correct, LLMs often produce logically inconsistent steps in their Chain-of-Thought (CoT) reasoning, which undermines the reliability of that reasoning. Method: Constructs a CoT dataset with segment-level scores, trains a Variational Autoencoder (VAE) and a quality estimation model to learn a low-dimensional manifold of high-quality CoT trajectories, and steers the hidden states of the target LLM toward higher-quality regions of the latent space. Result: Evaluated on GSM8k with the Qwen3 series, GeoSteer improves accuracy by up to 2.6 points and pairwise win rate by 5.3 points. Conclusion: GeoSteer offers an effective, controllable way to improve the quality of intermediate reasoning steps in LLMs, with the advantage of geometrically coherent steering. Abstract: Recent advances in Large Language Models (LLMs) have improved multi-step reasoning. Most approaches rely on Chain-of-Thought (CoT) rationales. Previous studies have shown that LLMs often generate logically inconsistent reasoning steps even when their final answers are correct. These inconsistencies reduce the reliability of step-level reasoning. We propose GeoSteer, a manifold-based framework that improves the quality of intermediate reasoning. The method consists of: (1) constructing a CoT dataset with segment-level scores, (2) training a Variational Autoencoder (VAE) model and a quality estimation model to learn a low-dimensional manifold of high-quality CoT trajectories, and (3) steering hidden states of target LLMs toward higher-quality regions in the latent space. This update in the latent space behaves like a natural-gradient adjustment in the original hidden-state space. It ensures geometrically coherent steering. We evaluate GeoSteer on the GSM8k dataset using the Qwen3 series. We measure answer accuracy and overall reasoning performance. GeoSteer improved the exact match accuracy by up to 2.6 points. It also enhanced the pairwise win rate by 5.3 points. These results indicate that GeoSteer provides an effective and controllable mechanism for improving the quality of intermediate reasoning in LLMs.
[50] Loop as a Bridge: Can Looped Transformers Truly Link Representation Space and Natural Language Outputs?
Guanxu Chen,Dongrui Liu,Jing Shao
Main category: cs.CL
TL;DR: Looped Transformers (LTs) aim to bridge the gap between internal knowledge and linguistic outputs in LLMs through iterative computation, but increasing iterations only partially narrows the gap due to representation degradation, and introspective abilities do not improve across loops.
Details
Motivation: To investigate whether Looped Transformers can use their iterative structure to improve introspection and align internal knowledge with explicit outputs in large language models. Method: Empirical analysis of Looped Transformers, examining how increasing the number of loop iterations affects the gap between internal representations and linguistic outputs, and assessing the model's ability to perceive its own representations across loops. Result: Increasing loop iterations reduces the knowledge-output gap, but the reduction is partly due to degradation of internal representations; the model's ability to perceive representations does not improve across loops and is evident only in the final loop. Conclusion: While Looped Transformers show potential for scaling computational depth, they currently lack effective introspection to consistently link internal representation space with natural language generation. Abstract: Large Language Models (LLMs) often exhibit a gap between their internal knowledge and their explicit linguistic outputs. In this report, we empirically investigate whether Looped Transformers (LTs), architectures that increase computational depth by iterating shared layers, can bridge this gap by utilizing their iterative nature as a form of introspection. Our experiments reveal that while increasing loop iterations narrows the gap, the narrowing is partly driven by a degradation of the internal knowledge carried by their representations. Moreover, another empirical analysis suggests that current LTs' ability to perceive representations does not improve across loops; it is only present in the final loop. These results suggest that while LTs offer a promising direction for scaling computational depth, they have yet to achieve the introspection required to truly link representation space and natural language.
[51] coTherapist: A Behavior-Aligned Small Language Model to Support Mental Healthcare Experts
Prottay Kumar Adhikary,Reena Rawat,Tanmoy Chakraborty
Main category: cs.CL
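TL;DR: coTherapist is a unified framework built on a small language model that emulates core therapeutic competencies through domain-specific fine-tuning, retrieval augmentation, and agentic reasoning. On clinical queries it generates relevant, trustworthy, and safe responses and exhibits high empathy and therapist-consistent personality traits.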
Details
Motivation: Mental health services face workforce shortages and rising demand, so intelligent systems are needed to assist mental healthcare experts and improve the accessibility and efficiency of care. Method: Propose the coTherapist framework, which uses a small language model combined with domain-specific fine-tuning, retrieval augmentation, and agentic reasoning; it is evaluated with the T-BARS rubric and psychometric profiling, supplemented by human evaluation from clinical experts. Result: On clinical questions, coTherapist is more relevant and more clinically grounded than existing baselines and exhibits high empathy and therapist-consistent personality traits; expert evaluation confirms that its responses are accurate, trustworthy, and safe. Conclusion: Carefully engineered small models can exhibit expert-like behavior, offering a scalable path for digital mental health tools. Abstract: Access to mental healthcare is increasingly strained by workforce shortages and rising demand, motivating the development of intelligent systems that can support mental healthcare experts. We introduce coTherapist, a unified framework utilizing a small language model to emulate core therapeutic competencies through domain-specific fine-tuning, retrieval augmentation, and agentic reasoning. Evaluation on clinical queries demonstrates that coTherapist generates more relevant and clinically grounded responses than contemporary baselines. Using our novel T-BARS rubric and psychometric profiling, we confirm coTherapist exhibits high empathy and therapist-consistent personality traits. Furthermore, human evaluation by domain experts validates that coTherapist delivers accurate, trustworthy, and safe responses. coTherapist was deployed and tested by clinical experts. Collectively, these findings demonstrate that small models can be engineered to exhibit expert-like behavior, offering a scalable pathway for digital mental health tools.
[52] Untangling Input Language from Reasoning Language: A Diagnostic Framework for Cross-Lingual Moral Alignment in LLMs
Nan Li,Bo Kang,Tijl De Bie
Main category: cs.CL
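TL;DR: This paper studies how LLM judgments of moral dilemmas differ across languages. It proposes a method that separates the effects of input language and reasoning language and interprets model judgments through Moral Foundations Theory, finding that reasoning language contributes twice the variance of input language and that nearly half of the models show context-dependency.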
Details
Motivation: To investigate whether LLMs reach different conclusions when making moral judgments in different languages, and to determine whether the input language or the reasoning language drives those differences. Method: Independently manipulate the input language of the moral dilemma and the model's reasoning language (covering matched and mismatched conditions), and interpret the resulting judgments through Moral Foundations Theory. Result: Testing English-Chinese moral judgment on 13 LLMs shows that reasoning language contributes twice the variance of input language; context-dependency that standard evaluation misses is detected in nearly half of the models; a diagnostic taxonomy is proposed to guide deployment. Conclusion: LLM moral judgments are significantly affected by the reasoning language, so multilingual evaluation should distinguish the roles of input and reasoning language; the proposed method has strong diagnostic power and practical value. Abstract: When LLMs judge moral dilemmas, do they reach different conclusions in different languages, and if so, why? Two factors could drive such differences: the language of the dilemma itself, or the language in which the model reasons. Standard evaluation conflates these by testing only matched conditions (e.g., English dilemma with English reasoning). We introduce a methodology that separately manipulates each factor, covering also mismatched conditions (e.g., English dilemma with Chinese reasoning), enabling decomposition of their contributions. To study what changes, we propose an approach to interpret the moral judgments in terms of Moral Foundations Theory. As a side result, we identify evidence for splitting the Authority dimension into a family-related and an institutional dimension. Applying this methodology to English-Chinese moral judgment with 13 LLMs, we demonstrate its diagnostic power: (1) the framework isolates reasoning-language effects as contributing twice the variance of input-language effects; (2) it detects context-dependency in nearly half of models that standard evaluation misses; and (3) a diagnostic taxonomy translates these patterns into deployment guidance. We release our code and datasets at https://anonymous.4open.science/r/CrossCulturalMoralJudgement.
[53] Measuring Affinity between Attention-Head Weight Subspaces via the Projection Kernel
Hiroaki Yamagiwa,Yusuke Takase,Hidetoshi Shimodaira
Main category: cs.CL
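TL;DR: This paper proposes the Projection Kernel (PK), a principal-angle-based measure of subspace similarity, to quantify relationships between attention heads in Transformers. On the IOI task it reproduces known head-to-head interactions more clearly than existing metrics, and a framework for assessing the informativeness of PK distributions is introduced; applied to GPT2-small, the analysis finds that head L4H7 acts as a hub.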
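As a rough illustration of a principal-angle-based similarity, the sketch below computes the textbook Grassmannian projection kernel: the sum of squared cosines of the principal angles between two subspaces, which equals the squared Frobenius norm of Qa^T Qb for orthonormal bases Qa and Qb. This is an assumption about the general technique only; the paper's exact normalization and its choice of which weight matrices span each head's subspace are not given in this entry, and the function names are hypothetical.

```python
import math


def orthonormal_basis(vectors, tol=1e-10):
    """Modified Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            dot = sum(a * b for a, b in zip(w, q))
            w = [a - dot * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm > tol:  # drop vectors that are numerically dependent
            basis.append([a / norm for a in w])
    return basis


def projection_kernel(vectors_a, vectors_b):
    """Sum of squared cosines of the principal angles between two subspaces.

    Assumed textbook definition: ||Qa^T Qb||_F^2 for orthonormal bases Qa, Qb.
    """
    qa = orthonormal_basis(vectors_a)
    qb = orthonormal_basis(vectors_b)
    return sum(
        sum(x * y for x, y in zip(u, v)) ** 2
        for u in qa
        for v in qb
    )
```

Under this definition two identical k-dimensional subspaces score k and two mutually orthogonal subspaces score 0, so values are often divided by k when heads of equal dimension are compared.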
Details
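Motivation: Existing metrics for relationships between attention heads do not capture the internal structure of Transformers well, so a more effective measure is needed to understand how heads interact. Method: Using the subspaces spanned by attention-head weight matrices, define the Projection Kernel (PK), based on principal angles, as a measure of subspace similarity, and build a framework that assesses the informativeness of PK distributions by comparing them with a reference distribution derived from random orthogonal subspaces. Result: On the IOI task, PK reproduces known head-to-head interactions more clearly than prior metrics such as the Composition Score; a directed graph built with the method shows that, in GPT2-small, L4H7 acts as a hub by functioning as an identity head. Conclusion: PK is an effective new measure that better reveals structural relationships between attention heads in Transformer models, providing a new analysis tool for interpretability. Abstract: Understanding relationships between attention heads is essential for interpreting the internal structure of Transformers, yet existing metrics do not capture this structure well. We focus on the subspaces spanned by attention-head weight matrices and quantify head-to-head relationships using the Projection Kernel (PK), a principal-angle-based measure of subspace similarity. Experiments show that PK reproduces known head-to-head interactions on the IOI task more clearly than prior metrics such as the Composition Score. We further introduce a framework to quantify the informativeness of PK distributions by comparing them with a reference distribution derived from random orthogonal subspaces. As an application, we analyze a directed graph constructed from PK and show that, in GPT2-small, L4H7 acts as a hub by functioning as an identity head.
[54] MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts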
Yuxuan Lou,Kai Yang,Yang You
Main category: cs.CL
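TL;DR: This paper presents MoST, a speech-text multimodal large language model based on the Modality-Aware Mixture of Experts (MAMoE) architecture. Modality-specific and shared experts work together to enable efficient cross-modal understanding and generation; the model is trained entirely on open-source data and outperforms existing models of comparable size.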
Details
Motivation: Existing multimodal models usually process different modalities with identical parameters, ignoring the inherent representational differences between speech and text, which limits modeling efficiency and performance. Method: Propose the MAMoE architecture, which contains modality-specific expert groups and shared experts and uses a routing mechanism to send inputs to the appropriate experts; build an efficient transformation pipeline that starts from a pretrained MoE language model and applies ASR/TTS post-training followed by instruction fine-tuning, using open-source datasets throughout. Result: Across ASR, TTS, audio language modeling, and spoken question answering, MoST outperforms existing models of comparable parameter count; ablations confirm the effectiveness of modality-specific routing and shared experts. Conclusion: MoST is presented as the first fully open-source speech-text LLM built on a Mixture of Experts architecture; its design improves modality-specific learning and cross-modal fusion while maintaining data efficiency and strong performance. Abstract: We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality-Aware Mixture of Experts (MAMoE) architecture. While current multimodal models typically process diverse modality representations with identical parameters, disregarding their inherent representational differences, we introduce specialized routing pathways that direct tokens to modality-appropriate experts based on input type. MAMoE simultaneously enhances modality-specific learning and cross-modal understanding through two complementary components: modality-specific expert groups that capture domain-specific patterns and shared experts that facilitate information transfer between modalities. Building on this architecture, we develop an efficient transformation pipeline that adapts the pretrained MoE language model through strategic post-training on ASR and TTS datasets, followed by fine-tuning with a carefully curated speech-text instruction dataset. A key feature of this pipeline is that it relies exclusively on fully accessible, open-source datasets to achieve strong performance and data efficiency. Comprehensive evaluations across ASR, TTS, audio language modeling, and spoken question answering benchmarks show that MoST consistently outperforms existing models of comparable parameter counts. Our ablation studies confirm that the modality-specific routing mechanism and shared experts design significantly contribute to performance gains across all tested domains.
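The routing idea described above can be sketched in a few lines: each token's router scores are restricted to the expert group matching its modality, the top-scoring experts in that group are selected, and the shared experts are always added. The expert layout, the `top_k` value, and all names below are hypothetical and only illustrate the general pattern; the actual MAMoE router may differ.

```python
# Hypothetical expert layout: indices 0-3 text, 4-7 speech, 8 shared.
EXPERT_GROUPS = {"text": [0, 1, 2, 3], "speech": [4, 5, 6, 7]}
SHARED_EXPERTS = [8]


def route_token(modality, router_scores, top_k=2):
    """Pick the top_k experts inside the token's modality group, plus shared experts."""
    candidates = EXPERT_GROUPS[modality]
    # Rank only the modality-appropriate experts by their router score.
    ranked = sorted(candidates, key=lambda e: router_scores[e], reverse=True)
    return ranked[:top_k] + SHARED_EXPERTS
```

Masking by modality keeps speech tokens from competing for text experts, while the always-on shared experts give both modalities a common pathway for information transfer.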
To our knowledge, MoST represents the first fully open-source speech-text LLM built on a Mixture of Experts architecture. We release the MoST model, training code, inference code, and training data at https://github.com/NUS-HPC-AI-Lab/MoST.
[55] The Straight and Narrow: Do LLMs Possess an Internal Moral Path?
Luoming Hu,Jingjie Zeng,Liang Yang,Hongfei Lin
Main category: cs.CL
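TL;DR: This paper proposes an approach grounded in Moral Foundations Theory (MFT) that uses cross-lingual linear probing to identify and manipulate fine-grained moral representations in LLMs. It introduces steerable "Moral Vectors" and Adaptive Moral Fusion (AMF), a dynamic inference-time intervention that balances safety and helpfulness.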
Details
Motivation: Existing alignment techniques mostly act as superficial guardrails and rarely change the intrinsic moral representations of LLMs, which leads to a trade-off between safety and responsiveness; a method that reaches the model's internal moral judgment mechanisms is therefore needed. Method: Grounded in MFT, use cross-lingual linear probing to analyze moral representations in middle layers, extract steerable Moral Vectors, and propose Adaptive Moral Fusion (AMF), which combines probe detection with vector injection to adjust model behavior dynamically at inference time. Result: The study verifies a shared yet different moral subspace between English and Chinese and validates the extracted Moral Vectors at both the internal-representation and behavioral levels; AMF effectively reduces incorrect refusals on benign queries while lowering jailbreak success rates. Conclusion: The method offers an interpretable, controllable route to intrinsic moral alignment, easing the conflict between safety and helpfulness and advancing deeper moral alignment of LLMs in multilingual settings. Abstract: Enhancing the moral alignment of Large Language Models (LLMs) is a critical challenge in AI safety. Current alignment techniques often act as superficial guardrails, leaving the intrinsic moral representations of LLMs largely untouched. In this paper, we bridge this gap by leveraging Moral Foundations Theory (MFT) to map and manipulate the fine-grained moral landscape of LLMs. Through cross-lingual linear probing, we validate the shared nature of moral representations in middle layers and uncover a shared yet different moral subspace between English and Chinese. Building upon this, we extract steerable Moral Vectors and successfully validate their efficacy at both internal and behavioral levels. Leveraging the high generalizability of morality, we propose Adaptive Moral Fusion (AMF), a dynamic inference-time intervention that synergizes probe detection with vector injection to tackle the safety-helpfulness trade-off. Empirical results confirm that our approach acts as a targeted intrinsic defense, effectively reducing incorrect refusals on benign queries while minimizing jailbreak success rates compared to standard baselines.
[56] Multilinguality as Sense Adaptation
Jan Christian Blaise Cruz,David Ifeoluwa Adelani,Alham Fikri Aji
Main category: cs.CL
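TL;DR: Proposes SENse-based Symmetric Interlingual Alignment (SENSIA), which aligns sense-level mixtures and contextual representations on parallel data to achieve cross-lingual sense adaptation, reaching accuracy competitive with monolingual models while using less target-language data.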
Details
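Motivation: Existing cross-lingual models rely mainly on shared parameters and large-scale data and struggle to align fine-grained semantic representations across languages, especially in low-resource settings. Method: Propose SENSIA, which explicitly aligns the sense-level mixture distributions and contextual representations of a Backpack language model across languages and adds a target-language autoregressive language modeling loss, balancing semantic alignment with fluency. Result: Experiments on four typologically diverse languages show that SENSIA generally outperforms existing multilingual alignment methods and is competitive with monolingual from-scratch baselines while using 2-4x less target-language data; analyses show that local sense topology and global structure relative to English are largely preserved. Conclusion: By treating multilingual learning as sense adaptation, SENSIA offers a more efficient and robust route to cross-lingual transfer and reduces dependence on large amounts of target-language data. Abstract: We approach multilinguality as sense adaptation: aligning latent meaning representations across languages rather than relying solely on shared parameters and scale. In this paper, we introduce SENse-based Symmetric Interlingual Alignment (SENSIA), which adapts a Backpack language model from one language to another by explicitly aligning sense-level mixtures and contextual representations on parallel data, while jointly training a target-language language modeling loss to preserve fluency. Across benchmarks on four typologically diverse languages, SENSIA generally outperforms comparable multilingual alignment methods and achieves competitive accuracy against monolingual from-scratch baselines while using 2-4x less target-language data. Analyses of learned sense geometry indicate that local sense topology and global structure relative to English are largely preserved, and ablations show that the method is robust in terms of design and scale.
[57] ADVOSYNTH: A Synthetic Multi-Advocate Dataset for Speaker Identification in Courtroom Scenarios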
Aniket Deroy
Main category: cs.CL
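TL;DR: This paper introduces Advosynth-500, a dataset of 100 synthetic speech files for studying how well synthetic voices can be distinguished in courtroom argument scenarios.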
Details
Motivation: As large-scale speech-to-speech models achieve higher fidelity, distinguishing synthetic voices in structured environments becomes essential. Method: Use the Speech Llama Omni model to simulate five distinct advocate pairs arguing in court, define specific vocal characteristics for each advocate, and set up a speaker identification challenge. Result: A dataset of 100 synthetic speech files covering 10 unique advocate identities is released; it can be used to evaluate how well modern systems identify the source of synthetic speech. Conclusion: Advosynth-500 provides a new benchmark for evaluating synthetic voice discrimination and supports research in speech recognition and synthesis. Abstract: As large-scale speech-to-speech models achieve high fidelity, the distinction between synthetic voices in structured environments becomes a vital area of study. This paper introduces Advosynth-500, a specialized dataset comprising 100 synthetic speech files featuring 10 unique advocate identities. Using the Speech Llama Omni model, we simulate five distinct advocate pairs engaged in courtroom arguments. We define specific vocal characteristics for each advocate and present a speaker identification challenge to evaluate the ability of modern systems to map audio files to their respective synthetic origins. The dataset is available at https://github.com/naturenurtureelite/ADVOSYNTH-500.
[58] Boundary-Aware NL2SQL: Integrating Reliability through Hybrid Reward and Data Synthesis
Songsong Tian,Kongsheng Zhuo,Zhendong Wang,Rong Shen,Shengtao Zhang,Yong Wu
Main category: cs.CL
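TL;DR: This paper proposes BAR-SQL, a unified NL2SQL training framework that embeds reliability and boundary awareness into the generation process; seed-mutation data synthesis and knowledge-grounded reasoning improve both SQL generation quality and the ability to abstain on ambiguous or unanswerable queries.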
Details
Motivation: Existing NL2SQL models lack reliable boundary awareness when faced with ambiguous queries, schema limitations, and unanswerable questions, and tend to produce incorrect SQL; a unified framework that jointly optimizes generation accuracy and abstention is needed. Method: Propose the BAR-SQL framework: a Seed Mutation method builds an enterprise-style corpus containing multi-step analytical queries and boundary cases; Knowledge-Grounded Reasoning Synthesis generates chains of thought anchored in metadata and business rules; two-stage training (SFT followed by reinforcement learning with Group Relative Policy Optimization) with a Task-Conditioned Hybrid Reward jointly optimizes execution accuracy and semantically precise abstention responses. Result: On the newly built and released Ent-SQL-Bench, BAR-SQL reaches 91.48% average accuracy, outperforming leading proprietary models such as Claude 4.5 Sonnet and GPT-5 in both SQL generation quality and boundary-aware abstention. Conclusion: By explicitly modeling boundary awareness and reliability, BAR-SQL improves the practicality and trustworthiness of NL2SQL models in real enterprise scenarios and advances interpretable, robust natural language interfaces. Abstract: In this paper, we present BAR-SQL (Boundary-Aware Reliable NL2SQL), a unified training framework that embeds reliability and boundary awareness directly into the generation process. We introduce a Seed Mutation data synthesis paradigm that constructs a representative enterprise corpus, explicitly encompassing multi-step analytical queries alongside boundary cases including ambiguity and schema limitations. To ensure interpretability, we employ Knowledge-Grounded Reasoning Synthesis, which produces Chain-of-Thought traces explicitly anchored in schema metadata and business rules. The model is trained through a two-stage process: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning via Group Relative Policy Optimization. We design a Task-Conditioned Hybrid Reward mechanism that simultaneously optimizes SQL execution accuracy (leveraging Abstract Syntax Tree analysis and dense result matching) and semantic precision in abstention responses. To evaluate reliability alongside generation accuracy, we construct and release Ent-SQL-Bench, which jointly assesses SQL precision and boundary-aware abstention across ambiguous and unanswerable queries. Experimental results on this benchmark demonstrate that BAR-SQL achieves 91.48% average accuracy, outperforming leading proprietary models, including Claude 4.5 Sonnet and GPT-5, in both SQL generation quality and boundary-aware abstention capability.
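A task-conditioned reward of this kind might look roughly like the sketch below: answerable samples are scored by how well the executed result rows match the reference, and boundary samples are scored by whether the model abstains for the right reason. Everything here is an illustrative assumption, including the Jaccard overlap used as a stand-in for dense result matching, the 0.5 partial credit, and the dictionary field names; the abstract does not specify the reward's actual form, and the AST-based component is omitted.

```python
def result_overlap(pred_rows, gold_rows):
    """Jaccard overlap of result rows (assumed stand-in for dense result matching)."""
    pred, gold = set(pred_rows), set(gold_rows)
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)


def hybrid_reward(task_type, prediction, reference):
    """Hypothetical reward: branch on whether the sample is answerable."""
    if task_type == "sql":
        # Abstaining on an answerable query earns nothing.
        if prediction["kind"] != "sql":
            return 0.0
        return result_overlap(prediction["rows"], reference["rows"])
    # Boundary case (ambiguous or unanswerable): emitting SQL earns nothing.
    if prediction["kind"] != "abstain":
        return 0.0
    # Full credit for the right abstention reason, assumed partial credit otherwise.
    return 1.0 if prediction["reason"] == reference["reason"] else 0.5
```

Conditioning on the task type is what lets a single scalar reward penalize both failure modes: answering when the model should abstain, and abstaining when it should answer.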
The source code and benchmark are available anonymously at: https://github.com/TianSongS/BAR-SQL.
[59] An Efficient Long-Context Ranking Architecture With Calibrated LLM Distillation: Application to Person-Job Fit
Warren Jouanneau,Emma Jouffroy,Marc Palyart
Main category: cs.CL
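TL;DR: This paper proposes a re-ranking model based on a late cross-attention architecture for matching resumes to job requirements efficiently and in real time. An LLM generates fine-grained supervision that is distilled into a student model, yielding better person-job matching on long, multilingual inputs.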
Details
Motivation: Resumes are often long, structurally complex, and multilingual, which makes accurate real-time matching of candidates to jobs challenging, and biases in historical data also affect matching fairness. Method: Use a late cross-attention architecture that decomposes resumes and project briefs to handle long-context inputs efficiently; use a generative LLM as a teacher to produce semantically rich supervision, which is transferred to the student model through an enriched distillation loss. Result: The model outperforms state-of-the-art baselines on relevance, ranking, and calibration metrics, and its skill-fit scores are interpretable and consistent. Conclusion: The approach performs well on long, multilingual resume-job matching, mitigates data bias, and improves matching accuracy and interpretability. Abstract: Finding the most relevant person for a job proposal in real time is challenging, especially when resumes are long, structured, and multilingual. In this paper, we propose a re-ranking model based on a new generation of late cross-attention architecture that decomposes both resumes and project briefs to efficiently handle long-context inputs with minimal computational overhead. To mitigate historical data biases, we use a generative large language model (LLM) as a teacher, generating fine-grained, semantically grounded supervision. This signal is distilled into our student model via an enriched distillation loss function. The resulting model produces skill-fit scores that enable consistent and interpretable person-job matching. Experiments on relevance, ranking, and calibration metrics demonstrate that our approach outperforms state-of-the-art baselines.
[60] OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding
Deming Ding,Shichun Liu,Enhui Yang,Jiahang Lin,Ziying Chen,Shihan Dou,Honglin Guo,Weiyu Cheng,Pengyu Zhao,Chengjun Xiao,Qunhong Zeng,Qi Zhang,Xuanjing Huang,Qidi Xu,Tao Gui
Main category: cs.CL
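TL;DR: OctoBench is a new benchmark for evaluating how well models follow diverse, persistent scaffold instructions in repository-grounded agentic coding; it reveals a systematic gap between task completion and scaffold compliance.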
Details
Motivation: The ability of LLM software agents to follow complex, heterogeneous scaffold instructions that persist across interactions has not been studied enough. Method: Introduce OctoBench, comprising 34 environments, 217 tasks, and three scaffold types, together with 7,098 objective checklist items and an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained evaluation. Result: Experiments on eight representative models show that models which complete tasks commonly still fall short on scaffold-instruction compliance, exposing weaknesses in heterogeneous instruction following. Conclusion: Training and evaluation methods that explicitly target heterogeneous instruction following are needed; the released benchmark supports reproducible benchmarking and the development of more scaffold-aware coding agents. Abstract: Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. OctoBench includes 34 environments and 217 tasks instantiated under three scaffold types, and is paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly targets heterogeneous instruction following. We release the benchmark to support reproducible benchmarking and to accelerate the development of more scaffold-aware coding agents.
[61] Training-Trajectory-Aware Token Selection
Zhanming Shen,Jiaqi Hu,Zeyu Qin,Hao Chen,Wentao Ye,Zenan Huang,Yihong Zhuang,Guoshan Lu,Junlin Zhou,Junbo Zhao
Main category: cs.CL
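TL;DR: This paper proposes Training-Trajectory-Aware Token Selection (T3S) to address the limited gains, or even degradation, seen when continually distilling into student models that already reason well. By reconstructing the training objective at the token level, it achieves consistent gains in both AR and dLLM settings.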
Details
Motivation: When the student model already has strong reasoning ability, conventional continual distillation often causes performance drops or limited gains; the root cause is that token confidence bifurcates during training, which blocks optimization. Method: Propose Training-Trajectory-Aware Token Selection (T3S), which reconstructs the training objective at the token level, distinguishing Imitation-Anchor Tokens from yet-to-learn tokens and handling them differently so that the conflict between them is removed and the optimization path is cleared. Result: T3S yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1; Qwen3-32B approaches Qwen3-235B; and T3-trained LLaDA-2.0-Mini exceeds its AR baseline, reaching state-of-the-art performance among 16B-scale no-think models. Conclusion: By analyzing token-level training dynamics during distillation, T3S addresses the performance bottleneck in distilling into strong models and offers a new perspective and a practical method for efficient capability transfer. Abstract: Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck, before gradually recovering. We further uncover a token-level mechanism: confidence bifurcates into steadily increasing Imitation-Anchor Tokens that quickly anchor optimization and other yet-to-learn tokens whose confidence is suppressed until after the bottleneck. The fact that these two types of tokens cannot coexist is the root cause of the failure in continual distillation. To this end, we propose Training-Trajectory-Aware Token Selection (T3S) to reconstruct the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3 yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among all 16B-scale no-think models.
[62] Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text
Zhihao Xu,Rumei Li,Jiahuan Li,Rongxiang Weng,Jingang Wang,Xunliang Cai,Xiting Wang
Main category: cs.CL
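TL;DR: This paper proposes GEM, a paradigm that generates multi-turn tool-use trajectories from text corpora. A four-stage pipeline extracts realistic, scalable multi-turn interaction data from text, and an efficient trajectory synthesis model reduces computational cost while substantially improving LLM performance on multi-turn tool-use tasks.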
Details
Motivation: Diverse, realistic multi-turn tool-use data is hard to obtain, which limits the use of LLMs for building autonomous agents; a scalable, low-cost, realistic data generation method is needed. Method: Propose the GEM data synthesis pipeline with four stages: relevance filtering, workflow and tool extraction, trajectory grounding, and complexity refinement; additionally train a specialized Trajectory Synthesizer via supervised fine-tuning, distilling the complex pipeline into an efficient end-to-end generator. Result: GEM-32B improves by 16.5% on the BFCL V3 Multi-turn benchmark and partially surpasses models trained on in-domain τ-bench (Airline and Retail) data; the Trajectory Synthesizer matches the full pipeline's quality while substantially reducing inference latency and cost. Conclusion: Multi-step problem-solving experience in text corpora is an efficient, authentic, and scalable source of multi-turn tool-use data, and the GEM paradigm generalizes well and has practical value. Abstract: Enabling Large Language Models (LLMs) to effectively utilize tools in multi-turn interactions is essential for building capable autonomous agents. However, acquiring diverse and realistic multi-turn tool-use data remains a significant challenge. In this work, we propose a novel text-based paradigm. We observe that textual corpora naturally contain rich, multi-step problem-solving experiences, which can serve as an untapped, scalable, and authentic data source for multi-turn tool-use tasks. Based on this insight, we introduce GEM, a data synthesis pipeline that enables the generation and extraction of multi-turn tool-use trajectories from text corpora through a four-stage process: relevance filtering, workflow & tool extraction, trajectory grounding, and complexity refinement. To reduce the computational cost, we further train a specialized Trajectory Synthesizer via supervised fine-tuning. This model distills the complex generation pipeline into an efficient, end-to-end trajectory generator. Experiments demonstrate that our GEM-32B achieves a 16.5% improvement on the BFCL V3 Multi-turn benchmark. Our models partially surpass the performance of models trained on τ-bench (Airline and Retail) in-domain data, highlighting the superior generalization capability derived from our text-based synthesis paradigm. Notably, our Trajectory Synthesizer matches the quality of the full pipeline while significantly reducing inference latency and costs.
[63] The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
Christina Lu,Jack Gallagher,Jonathan Michala,Kyle Fish,Jack Lindsey
Main category: cs.CL
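TL;DR: This paper studies the structure of the "Assistant Axis" in LLMs, which reflects how far a model is operating in its default Assistant mode. Steering activations along this axis strengthens or weakens helpful and harmless behavior and affects persona expression. Restricting activations along the Assistant Axis stabilizes model behavior, guarding against persona drift and adversarial persona-based jailbreaks.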
Details
Motivation: To explore the spatial structure of persona traits in LLMs, understand the mechanism behind the default Assistant identity, and address persona drift in certain conversations. Method: Extract activation directions representing different character archetypes, analyze the structure of persona space across several models, identify the "Assistant Axis", and test how steering along it affects model behavior. Result: An Assistant Axis is found across models; steering along it changes helpfulness, harmlessness, and speaking style; the axis is also present in pre-trained models and is related to persona drift; restricting activations along the axis improves stability in drift-prone conversations and under adversarial attacks. Conclusion: Post-training steers models toward a particular region of persona space but only loosely tethers them to it; further work is needed on training and steering strategies that anchor the model's persona more firmly. Abstract: Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We investigate the structure of the space of model personas by extracting activation directions corresponding to diverse character archetypes. Across several different models, we find that the leading component of this persona space is an "Assistant Axis," which captures the extent to which a model is operating in its default Assistant mode. Steering towards the Assistant direction reinforces helpful and harmless behavior; steering away increases the model's tendency to identify as other entities. Moreover, steering away with more extreme values often induces a mystical, theatrical speaking style. We find this axis is also present in pre-trained models, where it primarily promotes helpful human archetypes like consultants and coaches and inhibits spiritual ones. Measuring deviations along the Assistant Axis predicts "persona drift," a phenomenon where models slip into exhibiting harmful or bizarre behaviors that are uncharacteristic of their typical persona. We find that persona drift is often driven by conversations demanding meta-reflection on the model's processes or featuring emotionally vulnerable users. We show that restricting activations to a fixed region along the Assistant Axis can stabilize model behavior in these scenarios -- and also in the face of adversarial persona-based jailbreaks.
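Restricting activations to a fixed region along a direction can be illustrated with a small projection-and-clamp sketch: the activation's coefficient along the unit axis is clipped to an interval, and the component orthogonal to the axis is left untouched. The function name, the interval bounds, and the use of plain lists are hypothetical; how the paper chooses the region and which layers it intervenes on are not stated in this entry.

```python
import math


def clamp_along_axis(activation, axis, low, high):
    """Clip the projection of `activation` onto `axis` to the interval [low, high]."""
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    # Signed coefficient of the activation along the unit axis.
    coeff = sum(x * u for x, u in zip(activation, unit))
    clamped = min(max(coeff, low), high)
    # Shift only along the axis; orthogonal components are unchanged.
    return [x + (clamped - coeff) * u for x, u in zip(activation, unit)]
```

Because only the on-axis coefficient moves, an activation that already lies inside the allowed region is returned unchanged, which is one way such an intervention could stay inactive during ordinary conversations.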
Our results suggest that post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona.
[64] INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects
Tarun Sharma,Manikandan Ravikiran,Sourava Kumar Behera,Pramit Bhattacharya,Arnab Bhattacharya,Rohit Saluja
Main category: cs.CL
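TL;DR: This paper introduces INDIC-DIALECT, a parallel corpus covering 11 Indian dialects, and builds a multi-task benchmark to advance NLP research on low-resource Indian dialects.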
Details
Motivation: Most low-resource dialects are neglected in NLP research, particularly in India: although Hindi and Odia are widely spoken, their dialects lack digital resources and web presence. Method: Build INDIC-DIALECT, a parallel corpus of 13,000 sentence pairs covering 11 dialects of Hindi and Odia, and design a multi-task benchmark with dialect classification, multiple-choice question answering, and machine translation, evaluating both LLMs and fine-tuned models. Result: GPT-4o and Gemini 2.5 perform poorly on dialect classification; fine-tuned models pretrained on Indian languages raise F1 from 19.6% to 89.8%; for dialect-to-language translation, a hybrid AI model reaches a BLEU score of 61.32 (baseline 23.36); for language-to-dialect translation, a "rule-based followed by AI" approach reaches the best BLEU score of 48.44 (baseline 27.59). Conclusion: INDIC-DIALECT provides a new benchmark for dialect-aware Indic NLP and is planned for open-source release to support research on low-resource dialects. Abstract: Recent NLP advances focus primarily on standardized languages, leaving most low-resource dialects under-served, especially in Indian scenarios. In India, the issue is particularly important: despite Hindi being the third most spoken language globally (over 600 million speakers), its numerous dialects remain underrepresented. The situation is similar for Odia, which has around 45 million speakers. While some datasets contain standard Hindi and Odia, their regional dialects have almost no web presence. We introduce INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 dialects and 2 languages: Hindi and Odia. Using this corpus, we construct a multi-task benchmark with three tasks: dialect classification, multiple-choice question (MCQ) answering, and machine translation (MT). Our experiments show that LLMs like GPT-4o and Gemini 2.5 perform poorly on the classification task, while fine-tuned transformer-based models pretrained on Indian languages substantially improve performance, e.g., raising F1 from 19.6% to 89.8% on dialect classification. For dialect-to-language translation, we find that a hybrid AI model achieves the highest BLEU score of 61.32 compared to the baseline score of 23.36. Interestingly, due to the complexity of generating dialect sentences, we observe that for language-to-dialect translation the "rule-based followed by AI" approach achieves the best BLEU score of 48.44 compared to the baseline score of 27.59.
INDIC-DIALECT is thus a new benchmark for dialect-aware Indic NLP, and we plan to release it as open source to support further work on low-resource Indian dialects.
[65] TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction
Mihai Dan Nadas,Laura Diosan,Andreea Tomescu,Andrei Piscoran
Main category: cs.CL
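TL;DR: TF3-RO is an end-to-end language modeling pipeline for Romanian that covers the full workflow from tokenizer design to synthetic data generation, improving model training and deployment for an under-resourced language.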
Details
Motivation: For morphologically rich but computationally under-resourced languages such as Romanian, there is no unified, reproducible framework for synthetic data and model training, so an end-to-end solution is needed. Method: Building on the TF1 and TF2 datasets, propose the TF3-RO framework: design Romanian-specific BPE and Unigram tokenizers, pretrain a 51.65M-parameter LLaMA-style Transformer from scratch, compress it through quantization, pruning, and knowledge distillation into a 26.45M-parameter student model, and use that model with combinatorial prompting to generate three million synthetic Romanian fables. Result: The work yields efficient Romanian-specific tokenizers and a compact model with strong deployment characteristics that performs well on intrinsic metrics, grammatical agreement, entity coherence, and LLM-based assessment, and produces a large-scale Romanian-native synthetic corpus. Conclusion: TF3-RO provides a reproducible, linguistically grounded framework for under-resourced languages that combines model compression with high-quality synthetic data generation and has value for deployment and research. Abstract: Recent advances in synthetic data generation have shown that compact language models can be trained effectively when the underlying corpus is structurally controlled and linguistically coherent. However, for morphologically rich and computationally under-resourced languages such as Romanian, there is still no openly documented, end-to-end pipeline that unifies tokenizer design, preprocessing, pretraining, compression, evaluation, and large-scale synthetic data generation in a reproducible framework. Building on TF1, a three-million-story English fable dataset, and TF2, which extends TF1 through high-quality Romanian translations, we introduce TF3-RO, a Romanian-centric language modeling pipeline spanning tokenizer training, from-scratch model development, and Romanian-native dataset generation. TF3-RO constructs Romanian-specific BPE and Unigram tokenizers from a linguistically informed corpus to mitigate token inflation induced by Romanian morphology. Using long-sequence packed training, we pretrain a 51.65M-parameter LLaMA-style Transformer entirely from scratch. The model is subsequently optimized through quantization, structured pruning, and logit-based knowledge distillation, yielding a compact 26.45M-parameter student model with tied embeddings and strong deployment characteristics. Using this distilled model, TF3-RO generates three million Romanian-native synthetic fables via a controlled combinatorial prompting framework.
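Logit-based knowledge distillation is commonly implemented as a temperature-softened KL divergence between teacher and student output distributions. The sketch below shows that common formulation for a single token position in plain Python; it is an assumed textbook variant, not the loss used in TF3-RO, whose temperature, weighting, and any additional cross-entropy term are not described in this entry.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl
```

The loss is zero when the student's logits induce the same softened distribution as the teacher's and grows as the two distributions diverge.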
Across all stages, the pipeline integrates a comprehensive evaluation suite combining intrinsic metrics, Romanian agreement probes, entity coherence, rule-based grammar checking, and LLM-based assessment. TF3-RO provides a reproducible and linguistically grounded framework for training compact Romanian language models and producing large-scale synthetic narrative corpora.
[66] Are Language Models Models?
Philip Resnik
Main category: cs.CL
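TL;DR: The claim that language models (LMs) serve as cognitive models is problematic at each of Marr's three levels; LMs are better regarded as tools than as cognitive models, and overstating their cognitive similarity feeds LLM hype.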
Details
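Motivation: To assess whether LMs are really suitable as model systems for human cognition, specifically at Marr's three levels of analysis (computational theory, algorithmic-representational, implementation). Method: Using Marr's three-level framework, examine how well LMs correspond to cognitive models at the computational theory, algorithmic-representational, and implementation levels. Result: The claim is clearly not true at the implementation level, poorly motivated at the algorithmic-representational level, and conceptually problematic at the computational theory level. Conclusion: LMs are better viewed as research tools than as direct models of human cognition; calling them cognitive models overstates the case and feeds hype about LLM capabilities. Abstract: Futrell and Mahowald claim LMs "serve as model systems", but an assessment at each of Marr's three levels suggests the claim is clearly not true at the implementation level, poorly motivated at the algorithmic-representational level, and problematic at the computational theory level. LMs are good candidates as tools; calling them cognitive models overstates the case and unnecessarily feeds LLM hype.
[67] SurgGoal: Rethinking Surgical Planning Evaluation via Goal-Satisfiability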
Ruochen Li,Kun Yuan,Yufei Xia,Yue Zhou,Qingyu Lu,Weihang Li,Youxiang Zhu,Nassir Navab
Main category: cs.CL
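TL;DR: This paper defines planning correctness via surgical phase-goal satisfiability and builds a multicentric meta-evaluation benchmark to assess vision-language models on surgical planning. Sequence similarity metrics are found to misjudge planning quality, whereas a rule-based goal-satisfiability metric reflects model performance more accurately, exposing failures caused by perception errors and under-constrained reasoning and showing that structural knowledge is important for improving performance.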
Details
Motivation: It is unclear whether current evaluation protocols for vision-language models (VLMs) in surgical planning are reliable in safety-critical settings, so a more accurate way to measure real planning ability is needed. Method: Based on expert-defined surgical rules, use phase-goal satisfiability as the criterion for planning correctness, build a multicentric meta-evaluation benchmark containing valid variations and invalid plans, and evaluate Video-LLMs with a rule-based goal-satisfiability metric under progressively constrained settings. Result: Sequence similarity metrics penalize valid plans and miss invalid ones; rule-based evaluation pinpoints the causes of model failure more precisely, including perception errors and under-constrained reasoning; structural knowledge consistently improves performance, whereas semantic guidance alone is unreliable and benefits larger models only when combined with structural constraints. Conclusion: Existing metrics are not sufficient to evaluate VLMs in surgical planning reliably, and rule-based goal satisfiability should serve as a higher-precision meta-evaluation reference; structured prior knowledge is essential for improving models on complex, safety-sensitive, long-horizon planning tasks. Abstract: Surgical planning integrates visual perception, long-horizon reasoning, and procedural knowledge, yet it remains unclear whether current evaluation protocols reliably assess vision-language models (VLMs) in safety-critical settings. Motivated by a goal-oriented view of surgical planning, we define planning correctness via phase-goal satisfiability, where plan validity is determined by expert-defined surgical rules. Based on this definition, we introduce a multicentric meta-evaluation benchmark with valid procedural variations and invalid plans containing order and content errors. Using this benchmark, we show that sequence similarity metrics systematically misjudge planning quality, penalizing valid plans while failing to identify invalid ones. We therefore adopt a rule-based goal-satisfiability metric as a high-precision meta-evaluation reference to assess Video-LLMs under progressively constrained settings, revealing failures due to perception errors and under-constrained reasoning. Structural knowledge consistently improves performance, whereas semantic guidance alone is unreliable and benefits larger models only when combined with structural constraints.
[68] Contextual StereoSet: Stress-Testing Bias Alignment Robustness in Large Language Models
Abhinaba Basu,Pavan Chakraborty
Main category: cs.CL
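TL;DR: This paper introduces the Contextual StereoSet benchmark and Context Sensitivity Fingerprints (CSF) for assessing how sensitive language-model bias is to context. Bias scores measured under fixed conditions may not generalize, so evaluation should ask under what conditions bias appears rather than whether a model is biased.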
Details
Motivation: Existing bias evaluations may hold in lab settings but fail in deployment as context changes, so a more robust framework is needed to measure bias across contexts. Method: Introduce Contextual StereoSet, which holds stereotype content fixed while systematically varying context (such as time, place, and audience), and test 13 models under two protocols; introduce CSF, a compact profile of per-dimension dispersion and paired contrasts with confidence intervals. Result: Context changes (for example anchoring to 1990 rather than 2030, gossip framing, or an out-group observer perspective) significantly shift stereotype selection (p<0.05), with effects of up to 13 percentage points that replicate in hiring, lending, and help-seeking vignettes; CSF supports two evaluation tracks: a 360-context diagnostic grid and a budgeted protocol covering 4,229 items. Conclusion: Fixed-condition bias evaluation does not adequately reflect model behavior in practice, and evaluation should examine how sensitive bias is to context; CSF offers a more robust and practical bias evaluation framework. Abstract: A model that avoids stereotypes in a lab benchmark may not avoid them in deployment. We show that measured bias shifts dramatically when prompts mention different places, times, or audiences -- no adversarial prompting required. We introduce Contextual StereoSet, a benchmark that holds stereotype content fixed while systematically varying contextual framing. Testing 13 models across two protocols, we find striking patterns: anchoring to 1990 (vs. 2030) raises stereotype selection in all models tested on this contrast (p<0.05); gossip framing raises it in 5 of 6 full-grid models; out-group observer framing shifts it by up to 13 percentage points. These effects replicate in hiring, lending, and help-seeking vignettes. We propose Context Sensitivity Fingerprints (CSF): a compact profile of per-dimension dispersion and paired contrasts with bootstrap CIs and FDR correction. Two evaluation tracks support different use cases -- a 360-context diagnostic grid for deep analysis and a budgeted protocol covering 4,229 items for production screening. The implication is methodological: bias scores from fixed-condition tests may not generalize. This is not a claim about ground-truth bias rates; it is a stress test of evaluation robustness. CSF forces evaluators to ask, "Under what conditions does bias appear?" rather than "Is this model biased?" We release our benchmark, code, and results.
[69] DR-Arena: an Automated Evaluation Framework for Deep Research Agents
Yiwen Gao,Ruochen Zhao,Yang Deng,Wenxuan Zhang
Main category: cs.CL
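TL;DR: This paper presents DR-Arena, a fully automated framework for dynamically evaluating large language models acting as deep research agents. It uses real-time Information Trees and an Adaptive Evolvement Loop to produce evaluations that align closely with human preferences.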
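Alignment with human preferences in this entry is reported as a Spearman rank correlation between the framework's ranking and a leaderboard. A minimal sketch of that statistic is given below; the agent scores and leaderboard ratings are hypothetical values invented for the example, not figures from the paper.

```python
def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical data: six agents, automated scores vs. leaderboard ratings.
framework_scores = [0.91, 0.85, 0.80, 0.72, 0.60, 0.66]
leaderboard_ratings = [1200, 1180, 1150, 1100, 1050, 1000]
rho = spearman(framework_scores, leaderboard_ratings)
```

Purely as arithmetic, with six items and no ties a single swap of two adjacent ranks gives 1 - 6*2/(6*35), which is about 0.943, so a coefficient of that size over six agents corresponds to near-identical orderings. This says nothing about which agents, if any, were ordered differently in the paper.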
Details
Motivation: Existing static-dataset benchmarks are limited in task generality, temporal alignment, and data contamination, making it hard to evaluate LLMs with autonomous research capabilities. Method: Builds Information Trees from real-time web trends, uses an automated Examiner to generate tasks that test deep reasoning and wide coverage, and applies an Adaptive Evolvement Loop that raises task difficulty dynamically based on real-time performance. Result: In experiments with six advanced deep research agents, DR-Arena reaches a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard without any human intervention. Conclusion: DR-Arena is a reliable, efficient, fully automated evaluation method that reflects the true capability boundaries of deep research LLMs and can stand in for costly human adjudication. Abstract: As Large Language Models (LLMs) increasingly operate as Deep Research (DR) Agents capable of autonomous investigation and information synthesis, reliable evaluation of their task performance has become a critical bottleneck. Current benchmarks predominantly rely on static datasets, which suffer from several limitations: limited task generality, temporal misalignment, and data contamination. To address these, we introduce DR-Arena, a fully automated evaluation framework that pushes DR agents to their capability limits through dynamic investigation. DR-Arena constructs real-time Information Trees from fresh web trends to ensure the evaluation rubric is synchronized with the live world state, and employs an automated Examiner to generate structured tasks testing two orthogonal capabilities: Deep reasoning and Wide coverage. DR-Arena further adopts an Adaptive Evolvement Loop, a state-machine controller that dynamically escalates task complexity based on real-time performance, demanding deeper deduction or wider aggregation until a decisive capability boundary emerges. Experiments with six advanced DR agents demonstrate that DR-Arena achieves a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard. This represents state-of-the-art alignment with human preferences without any manual effort, validating DR-Arena as a reliable alternative to costly human adjudication.
[70] AEQ-Bench: Measuring Empathy of Omni-Modal Large Models
Xuan Luo,Lewei Yao,Libo Zhao,Lanqing Hong,Kai Chen,Dehua Tao,Daxin Tan,Ruifeng Xu,Jing Li
Main category: cs.CL
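TL;DR: This paper introduces AEQ-Bench, a new benchmark for evaluating how well omni-modal large models understand and generate empathetic responses. It focuses on affective cues in audio and text inputs and tests whether models can judge the empathy of audio responses without a text transcript.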
Details
Motivation: Because empathy is inherently affective, automatically evaluating the empathy of omni-modal large models (OLMs) remains a major challenge, and existing benchmarks do not measure it comprehensively. Method: Builds AEQ-Bench, which assesses two capabilities: (i) generating empathetic responses from multi-modal inputs (audio + text), and (ii) judging the empathy of audio responses without relying on text transcription. Its two novel settings vary context specificity and speech tone. Result: (1) OLMs trained with audio output capabilities generally outperform models with text-only outputs; (2) OLMs align with human judgments on coarse-grained quality assessment but remain unreliable on fine-grained paralinguistic expressiveness. Conclusion: AEQ-Bench provides a more comprehensive framework for evaluating empathy in OLMs and exposes current models' limitations in assessing paralinguistic affective signals, pointing future work toward finer-grained modeling of emotional expression. Abstract: While the automatic evaluation of omni-modal large models (OLMs) is essential, assessing empathy remains a significant challenge due to its inherent affectivity. To investigate this challenge, we introduce AEQ-Bench (Audio Empathy Quotient Benchmark), a novel benchmark to systematically assess two core empathetic capabilities of OLMs: (i) generating empathetic responses by comprehending affective cues from multi-modal inputs (audio + text), and (ii) judging the empathy of audio responses without relying on text transcription. Compared to existing benchmarks, AEQ-Bench incorporates two novel settings that vary in context specificity and speech tone. Comprehensive assessment across linguistic and paralinguistic metrics reveals that (1) OLMs trained with audio output capabilities generally outperformed models with text-only outputs, and (2) while OLMs align with human judgments for coarse-grained quality assessment, they remain unreliable for evaluating fine-grained paralinguistic expressiveness.
[71] PERM: Psychology-grounded Empathetic Reward Modeling for Large Language Models
Chengbing Wang,Wuqiang Zheng,Yang Zhang,Fengbin Zhu,Junyi Cheng,Yi Xie,Wenjie Wang,Fuli Feng
Main category: cs.CL
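TL;DR: This paper proposes PERM, a psychology-grounded empathetic reward modeling method that evaluates empathy bidirectionally from supporter and seeker perspectives, complemented by a bystander perspective, and substantially improves LLM performance on emotional support tasks.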
Details
Motivation: Existing empathetic reward models mostly evaluate from a single perspective, overlooking the bidirectional nature of empathetic interaction, and so fail to reflect the real quality of emotional support. Method: Grounded in Empathy Cycle theory, decomposes empathy into a supporter perspective (internal resonation and communicative expression) and a seeker perspective (emotional reception), and adds a bystander perspective (overall interaction quality), yielding three reward signals. Result: Outperforms existing methods by over 10% on a widely used emotional intelligence benchmark and an industrial daily conversation dataset; a blinded user study shows a 70% preference for the approach. Conclusion: By modeling multiple perspectives, PERM effectively improves the emotional support capabilities of LLMs and offers a new paradigm for building more empathetic dialogue systems. Abstract: Large Language Models (LLMs) are increasingly deployed in human-centric applications, yet they often fail to provide substantive emotional support. While Reinforcement Learning (RL) has been utilized to enhance empathy of LLMs, existing reward models typically evaluate empathy from a single perspective, overlooking the inherently bidirectional interaction nature of empathy between the supporter and seeker as defined by Empathy Cycle theory. To address this limitation, we propose Psychology-grounded Empathetic Reward Modeling (PERM). PERM operationalizes empathy evaluation through a bidirectional decomposition: 1) Supporter perspective, assessing internal resonation and communicative expression; 2) Seeker perspective, evaluating emotional reception. Additionally, it incorporates a bystander perspective to monitor overall interaction quality. Extensive experiments on a widely-used emotional intelligence benchmark and an industrial daily conversation dataset demonstrate that PERM outperforms state-of-the-art baselines by over 10%. Furthermore, a blinded user study reveals a 70% preference for our approach, highlighting its efficacy in generating more empathetic responses. Our code, dataset, and models are available at https://github.com/ZhengWwwq/PERM.
[72] Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure
Syed Naveed Mahmood,Md. Rezaur Rahman Bhuiyan,Tasfia Zaman,Jareen Tasneem Khondaker,Md. Sameer Sakib,Nazia Tasnim,Farig Sadeque
Main category: cs.CL
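TL;DR: This paper proposes the Knowledge Immunization Framework (KIF), which achieves genuine knowledge unlearning by targeting internal activation signatures, addressing the problem that existing methods only suppress knowledge at the surface without fully removing it.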
Details
Motivation: Existing unlearning methods conflate behavioral suppression with genuine knowledge removal, so latent capabilities persist, which falls short of GDPR compliance and model safety needs. Method: Proposes KIF, which combines dynamic suppression of subject-specific representations with parameter-efficient adaptation and acts directly on internal activation patterns rather than surface outputs, achieving durable unlearning without retraining the full model. Result: Across several mainstream model families (Llama, Mistral, Qwen, DeepSeek), KIF achieves near-oracle erasure (FQ ≈ 0.99) while preserving utility (MU = 0.62), and standard models exhibit scale-independent true erasure. Conclusion: KIF distinguishes obfuscated refusal from genuine knowledge removal, breaks the earlier tradeoff between stability and erasure, and provides the first systematic diagnosis of mechanism-level forgetting behavior. Abstract: Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. In this work, we address this challenge by introducing the Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures rather than surface outputs. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining. KIF achieves near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff that has constrained all prior work. We evaluate both standard foundation models (Llama and Mistral) and reasoning-prior models (Qwen and DeepSeek) across 3B to 14B parameters. Our observations show that standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence. Our comprehensive dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, operationalizes the obfuscation-erasure distinction and enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales.
[73] Form and Meaning in Intrinsic Multilingual Evaluations
Wessel Poelman,Miryam de Lhoneux
Main category: cs.CL
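TL;DR: This paper examines the assumptions behind intrinsic evaluation metrics for conditional language models (such as perplexity or bits-per-character) in multilingual settings, and the implications of those assumptions. Experiments show that current metrics are not universally comparable, and the form-meaning debate is used to offer an explanation.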
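The two metrics named in this entry can be written down directly from their standard definitions. The sketch below is illustrative only: the log-probabilities and the stand-in strings are made up, and the paper's six metrics and exact normalizations are not reproduced here.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def bits_per_character(token_logprobs, text):
    """Total negative log2-likelihood of the text divided by its character count."""
    total_bits = -sum(token_logprobs) / math.log(2)
    return total_bits / len(text)

# Hypothetical parallel sentences: the model assigns both the same total
# log-probability, but they differ in token count and character count.
total_logprob = 8 * math.log(0.5)          # 8 bits in total for either sentence
sentence_a = "x" * 32                      # stand-in for a 32-character sentence
sentence_b = "y" * 16                      # stand-in for a 16-character sentence
logprobs_a = [total_logprob / 8] * 8       # 8 tokens
logprobs_b = [total_logprob / 4] * 4       # 4 tokens

ppl_a, ppl_b = perplexity(logprobs_a), perplexity(logprobs_b)
bpc_a = bits_per_character(logprobs_a, sentence_a)
bpc_b = bits_per_character(logprobs_b, sentence_b)
```

Both sentences carry the same total information under the model, yet perplexity and bits-per-character disagree between them because each normalizes by a unit of form (tokens or characters). That dependence on form is one simple way to see why such scores may not be comparable across languages, in line with the entry's conclusion.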
Details
Motivation: In multilingual settings, existing evaluation metrics rest on the assumption that parallel sentences with the same semantic meaning are comparable, yet the metrics actually measure information content in the information-theoretic sense, which calls their comparability and validity into question. Method: Runs experiments with six metrics on two multi-parallel corpora, using both monolingual and multilingual models, analyzes how the metrics behave, and draws on the form-meaning debate for explanation. Result: Current intrinsic metrics are not universally comparable in multilingual settings, and their underlying assumptions are problematic. Conclusion: Existing intrinsic metrics cannot be used directly to compare model performance in multilingual settings; their theoretical basis needs to be re-examined and more suitable evaluation methods developed. Abstract: Intrinsic evaluation metrics for conditional language models, such as perplexity or bits-per-character, are widely used in both mono- and multilingual settings. These metrics are rather straightforward to use and compare in monolingual setups, but rest on a number of assumptions in multilingual setups. One such assumption is that comparing the perplexity of CLMs on parallel sentences is indicative of their quality since the information content (here understood as the semantic meaning) is the same. However, the metrics are inherently measuring information content in the information-theoretic sense. We make this and other such assumptions explicit and discuss their implications. We perform experiments with six metrics on two multi-parallel corpora with both mono- and multilingual models. Ultimately, we find that current metrics are not universally comparable. We look at the form-meaning debate to provide some explanation for this.
[74] Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs
Yuxi Xia,Loris Schoenegger,Benjamin Roth
Main category: cs.CL
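TL;DR: This paper proposes TracVC, a method for tracing the confidence expressions generated by large language models (LLMs) back to their training data. It finds that models often voice confidence by imitating surface linguistic forms from training data unrelated to the question rather than grounding it in content, revealing a gap under current training regimes between sounding confident and being justifiably confident.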
Details
Motivation: LLMs often express confidence to strengthen user trust, but their stated confidence frequently does not match factual accuracy, i.e., they are overconfident. Understanding where this verbalized confidence comes from requires examining whether it is grounded in relevant content. Method: Proposes TracVC, which combines information retrieval and influence estimation to trace confidence expressions in model outputs back to the training data; evaluates it on OLMo and Llama models in question answering, and proposes the content groundness metric, which measures whether a model's confidence rests on training examples relevant to the question or on generic confidence-verbalization patterns. Result: OLMo2-13B's confidence expressions are often influenced by training data that is lexically unrelated to the query, suggesting that it tends to mimic unrelated surface linguistic features rather than rely on actual content support; the model learns more about how to sound confident than about when confidence is warranted. Conclusion: Current training regimes may teach LLMs the linguistic form of confidence without teaching them when to express it; the study provides an analytical foundation and directions for making verbalized confidence more reliable. Abstract: Large language models (LLMs) can increase users' perceived trust by verbalizing confidence in their outputs. However, prior work has shown that LLMs are often overconfident, making their stated confidence unreliable since it does not consistently align with factual accuracy. To better understand the sources of this verbalized confidence, we introduce TracVC (Tracing Verbalized Confidence), a method that builds on information retrieval and influence estimation to trace generated confidence expressions back to the training data. We evaluate TracVC on OLMo and Llama models in a question answering setting, proposing a new metric, content groundness, which measures the extent to which an LLM grounds its confidence in content-related training examples (relevant to the question and answer) versus in generic examples of confidence verbalization. Our analysis reveals that OLMo2-13B is frequently influenced by confidence-related data that is lexically unrelated to the query, suggesting that it may mimic superficial linguistic expressions of certainty rather than rely on genuine content grounding. These findings point to a fundamental limitation in current training regimes: LLMs may learn how to sound confident without learning when confidence is justified. Our analysis provides a foundation for improving LLMs' trustworthiness in expressing more reliable confidence.
[75] Detecting Winning Arguments with Large Language Models and Persuasion Strategies
Tiziano Labruna,Arkadiusz Modzelewski,Giorgio Satta,Giovanni Da San Martino
Main category: cs.CL
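TL;DR: This study proposes an LLM-based approach built on Multi-Strategy Persuasion Scoring to improve the accuracy and interpretability of persuasiveness prediction, and releases a topic-annotated version of the Winning Arguments dataset.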
Details
Motivation: Understanding persuasion strategies in text is essential for analyzing human communication, but existing methods still fall short in capturing the mechanisms of persuasion. Method: Uses large language models with strategy-guided reasoning over six persuasion strategies (such as Attack on reputation, Distraction, and Manipulative wording) to assess the persuasiveness of a text, with experiments on three annotated argument datasets. Result: Strategy-guided reasoning improves persuasiveness prediction; organizing the Winning Arguments dataset by topic further reveals how content topic affects the results. Conclusion: Structured, strategy-based prompting helps improve the interpretability and robustness of argument quality assessment and opens a new direction for future persuasion research. Abstract: Detecting persuasion in argumentative text is a challenging task with important implications for understanding human communication. This work investigates the role of persuasion strategies - such as Attack on reputation, Distraction, and Manipulative wording - in determining the persuasiveness of a text. We conduct experiments on three annotated argument datasets: Winning Arguments (built from the Change My View subreddit), Anthropic/Persuasion, and Persuasion for Good. Our approach leverages large language models (LLMs) with a Multi-Strategy Persuasion Scoring approach that guides reasoning over six persuasion strategies. Results show that strategy-guided reasoning improves the prediction of persuasiveness. To better understand the influence of content, we organize the Winning Arguments dataset into broad discussion topics and analyze performance across them. We publicly release this topic-annotated version of the dataset to facilitate future research. Overall, our methodology demonstrates the value of structured, strategy-aware prompting for enhancing interpretability and robustness in argument quality assessment.
[76] LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals
Gilat Toker,Nitay Calderon,Ohad Amosy,Roi Reichart
Main category: cs.CL
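TL;DR: This paper proposes LIBERTy, a framework that uses LLM-based interventions to generate datasets of structural counterfactuals for evaluating the faithfulness of concept-based explanation methods. It also introduces a new metric, order-faithfulness, and finds that existing methods still leave substantial room for improvement.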
Details
Motivation: Existing evaluations of concept-based explanations rely on human-written counterfactuals as references, which are costly and imperfect, so a more reliable and scalable benchmark is needed to assess explanation faithfulness. Method: Proposes LIBERTy, which generates counterfactual samples from an explicitly defined causal model (SCM) of text generation: interventions on a concept are applied within the generation process, and an LLM produces the corresponding counterfactual text. Builds three datasets from realistic scenarios and proposes order-faithfulness as a new metric. Result: An evaluation of many explanation methods across five models shows substantial room for improvement in faithfulness; proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Conclusion: LIBERTy provides a needed, systematic benchmark for developing and evaluating faithful concept-based explanation methods, supporting explainable AI in high-stakes decision settings. Abstract: Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. To address this, we introduce a framework for constructing datasets containing structural counterfactual pairs: LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets). LIBERTy is grounded in explicitly defined Structured Causal Models (SCMs) of the text generation: interventions on a concept propagate through the SCM until an LLM generates the counterfactual. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.
[77] Grounding Agent Memory in Contextual Intent
Ruozhen Yang,Yucheng Jiang,Yueqi Jiang,Priyanka Kargupta,Yunyi Zhang,Jiawei Han
Main category: cs.CL
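TL;DR: This paper proposes STITCH, an agentic memory system for long-horizon, goal-oriented interactions that improves context-aware memory retrieval through structured intent indexing.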
Details
Motivation: In long-horizon, goal-oriented interactions, existing LLM memory systems are prone to context mismatches because entities and facts recur, making it hard to retrieve the relevant information accurately. Method: STITCH indexes each trajectory step with a structured retrieval cue, contextual intent, and retrieves history by matching the current step's intent; contextual intent comprises the current latent goal, the action type, and the salient entity types, and is used to filter and prioritize memory snippets. Result: On CAME-Bench and LongMemEval, STITCH achieves state-of-the-art performance, outperforming the strongest baseline by 35.6%, with larger gains as trajectories get longer; analysis shows that intent indexing substantially reduces retrieval noise. Conclusion: STITCH's intent-aware memory mechanism supports robust long-horizon reasoning and improves the accuracy and contextual consistency of memory retrieval in complex interactions. Abstract: Deploying large language models in long-horizon, goal-oriented interactions remains challenging because similar entities and facts recur under different latent goals and constraints, causing memory systems to retrieve context-mismatched evidence. We propose STITCH (Structured Intent Tracking in Contextual History), an agentic memory system that indexes each trajectory step with a structured retrieval cue, contextual intent, and retrieves history by matching the current step's intent. Contextual intent provides compact signals that disambiguate repeated mentions and reduce interference: (1) the current latent goal defining a thematic segment, (2) the action type, and (3) the salient entity types anchoring which attributes matter. During inference, STITCH filters and prioritizes memory snippets by intent compatibility, suppressing semantically similar but context-incompatible history. For evaluation, we introduce CAME-Bench, a benchmark for context-aware retrieval in realistic, dynamic, goal-oriented trajectories. Across CAME-Bench and LongMemEval, STITCH achieves state-of-the-art performance, outperforming the strongest baseline by 35.6%, with the largest gains as trajectory length increases. Our analysis shows that intent indexing substantially reduces retrieval noise, supporting intent-aware memory for robust long-horizon reasoning.
[78] MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching
Changle Qu,Sunhao Dai,Hengyi Cai,Jun Xu,Shuaiqiang Wang,Dawei Yin
Main category: cs.CL
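TL;DR: This paper proposes MatchTIR, a framework that uses bipartite matching for fine-grained turn-level reward assignment, together with dual-level advantage estimation, to improve the tool-integrated reasoning of large language models on complex tasks.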
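The idea of turn-level rewards from bipartite matching can be illustrated with a toy sketch. Everything specific here is an assumption made for the example: the similarity function, the dictionary layout of a tool call, and the brute-force search over assignments are hypothetical and are not the paper's two assignment strategies.

```python
from itertools import permutations

def similarity(pred, gold):
    """Hypothetical score: 1.0 for same tool and arguments, 0.5 for same tool only."""
    if pred["tool"] != gold["tool"]:
        return 0.0
    return 1.0 if pred["args"] == gold["args"] else 0.5

def turn_rewards(predicted, gold):
    """Match predicted turns to ground-truth turns one-to-one, maximizing total
    similarity, and return one reward per predicted turn (0.0 if unmatched)."""
    # Pad with None so that surplus predicted turns can stay unmatched.
    slots = list(range(len(gold))) + [None] * max(0, len(predicted) - len(gold))
    best_total, best_rewards = -1.0, [0.0] * len(predicted)
    for assignment in permutations(slots, len(predicted)):
        rewards = [
            similarity(p, gold[g]) if g is not None else 0.0
            for p, g in zip(predicted, assignment)
        ]
        total = sum(rewards)
        if total > best_total:  # strict '>' keeps the first best assignment
            best_total, best_rewards = total, rewards
    return best_rewards

# Toy trace: the second predicted call repeats the first and is redundant.
predicted = [
    {"tool": "search", "args": {"q": "a"}},
    {"tool": "search", "args": {"q": "a"}},
    {"tool": "calc", "args": {"x": 1}},
]
gold = [
    {"tool": "search", "args": {"q": "a"}},
    {"tool": "calc", "args": {"x": 1}},
]
rewards = turn_rewards(predicted, gold)  # [1.0, 0.0, 1.0]
```

Because the matching is one-to-one, the redundant call receives no reward even though it is identical to a correct one, which is the kind of distinction a single trajectory-level reward cannot make. Brute force is used only to keep the sketch dependency-free; for realistic trace lengths an assignment solver such as SciPy's `linear_sum_assignment` would be the usual choice.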
Details
Motivation: Existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, so they cannot distinguish effective tool calls from redundant or erroneous ones, and perform poorly in long-horizon multi-turn scenarios in particular. Method: Formulates credit assignment as a bipartite matching problem between predicted and ground-truth traces, uses two assignment strategies to produce dense turn-level rewards, and combines turn-level and trajectory-level signals in a dual-level advantage estimate. Result: Experiments on three benchmarks show that MatchTIR clearly outperforms existing methods; notably, the 4B model surpasses most 8B models, particularly on long-horizon multi-turn tasks. Conclusion: Through fine-grained supervision and dual-level advantage estimation, MatchTIR improves LLM performance on tool-integrated reasoning and is particularly suited to complex, multi-step tasks. Abstract: Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning uniform advantages to all steps within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our code is available at https://github.com/quchangle1/MatchTIR.
cs.CV [Back]
[79] Diffusion-Driven Deceptive Patches: Adversarial Manipulation and Forensic Detection in Facial Identity Verification
Shahrzad Sayyafzadeh,Hongmei Chi,Shonda Bernadin
Main category: cs.CV
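TL;DR: Presents an end-to-end pipeline for generating, refining, and evaluating adversarial patches that attack facial biometric systems. It combines FGSM with a diffusion model to improve imperceptibility and uses ViT-GPT2 to generate semantic descriptions that support forensic analysis.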
Details
Motivation: To test and strengthen the security of facial biometric systems, the work studies how to generate highly imperceptible adversarial patches while supporting forensic analysis of the attack. Method: Uses FGSM to generate adversarial noise against an identity classifier, then applies a diffusion model's reverse diffusion process with Gaussian smoothing and adaptive brightness correction to improve imperceptibility; applies the resulting patch to face images and uses a ViT-GPT2 model to generate semantic descriptions of the adversarial samples; detects and analyzes adversarial samples with perceptual hashing and segmentation. Result: The method evades face recognition systems while preserving a natural appearance, reaching an SSIM of 0.95, and its generated captions support forensic interpretation. Conclusion: The pipeline efficiently produces hard-to-notice adversarial patches that successfully attack face recognition and expression recognition systems, while providing interpretable semantic output suited to security testing and digital forensics. Abstract: This work presents an end-to-end pipeline for generating, refining, and evaluating adversarial patches to compromise facial biometric systems, with applications in forensic analysis and security testing. We utilize FGSM to generate adversarial noise targeting an identity classifier and employ a diffusion model with reverse diffusion to enhance imperceptibility through Gaussian smoothing and adaptive brightness correction, thereby facilitating synthetic adversarial patch evasion. The refined patch is applied to facial images to test its ability to evade recognition systems while maintaining natural visual characteristics. A Vision Transformer (ViT)-GPT2 model generates captions to provide a semantic description of a person's identity for adversarial images, supporting forensic interpretation and documentation for identity evasion and recognition attacks. The pipeline evaluates changes in identity classification, captioning results, and vulnerabilities in facial identity verification and expression recognition under adversarial conditions. We further demonstrate effective detection and analysis of adversarial patches and adversarial samples using perceptual hashing and segmentation, achieving an SSIM of 0.95.
[80] LCF3D: A Robust and Real-Time Late-Cascade Fusion Framework for 3D Object Detection in Autonomous Driving
Carlo Sgaravatti,Riccardo Pieroni,Matteo Corno,Sergio M. Savaresi,Luca Magri,Giacomo Boracchi
Main category: cs.CV
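TL;DR: Proposes LCF3D, a new multimodal fusion framework that combines RGB images and LiDAR point clouds for 3D object detection. Late fusion reduces false positives and cascade fusion recovers missed detections; the framework performs well on categories such as pedestrians and cyclists on KITTI and nuScenes and generalizes well across domains.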
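The two fusion principles in this entry (drop LiDAR detections with no RGB match, and turn unmatched RGB detections into new proposals) can be sketched at the level of box matching. The sketch assumes that each LiDAR 3D detection has already been projected into the image as a 2D box; the greedy IoU matching, the 0.5 threshold, and all names are assumptions for illustration, since the entry does not specify the matching rule.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def late_cascade_split(lidar_boxes_2d, rgb_boxes, threshold=0.5):
    """Greedy one-to-one IoU matching between projected LiDAR boxes and RGB boxes.

    Returns (kept_lidar, unmatched_rgb):
      kept_lidar    - indices of LiDAR detections confirmed by an RGB detection
                      (late fusion: unmatched LiDAR detections are filtered out)
      unmatched_rgb - indices of RGB detections with no LiDAR match
                      (cascade fusion: candidates for new frustum proposals)
    """
    pairs = sorted(
        ((iou(l, r), i, j)
         for i, l in enumerate(lidar_boxes_2d)
         for j, r in enumerate(rgb_boxes)),
        reverse=True,
    )
    used_lidar, used_rgb = set(), set()
    for score, i, j in pairs:
        if score < threshold:
            break  # remaining pairs overlap even less
        if i in used_lidar or j in used_rgb:
            continue
        used_lidar.add(i)
        used_rgb.add(j)
    kept_lidar = sorted(used_lidar)
    unmatched_rgb = [j for j in range(len(rgb_boxes)) if j not in used_rgb]
    return kept_lidar, unmatched_rgb

# Toy example with made-up pixel coordinates.
lidar_2d = [(0, 0, 10, 10), (50, 50, 60, 60)]   # second one has no RGB support
rgb = [(1, 1, 10, 10), (100, 100, 120, 120)]    # second one was missed by LiDAR
kept, proposals = late_cascade_split(lidar_2d, rgb)  # ([0], [1])
```

In this toy case the second LiDAR box would be discarded as a likely false positive, and the second RGB box would be passed on so that a 3D detector can search the corresponding frustum of the point cloud.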
Details
Motivation: Accurate 3D object detection is essential for autonomous driving, but effectively fusing RGB images with LiDAR data remains challenging, especially when performance must hold across different sensor configurations. Method: Proposes LCF3D, which uses late fusion to filter out LiDAR false positives that have no matching RGB detection, and cascade fusion to recover missed objects by generating new 3D frustum proposals from unmatched RGB detections. Result: On KITTI and nuScenes, LCF3D significantly outperforms LiDAR-only methods, particularly on hard categories such as pedestrians, cyclists, motorcycles, and bicycles, and shows good domain generalization. Conclusion: LCF3D's multimodal fusion strategy improves the accuracy and robustness of 3D object detection and is especially suited to autonomous driving in complex real-world scenes. Abstract: Accurately localizing 3D objects like pedestrians, cyclists, and other vehicles is essential in Autonomous Driving. To ensure high detection performance, Autonomous Vehicles complement RGB cameras with LiDAR sensors, but effectively combining these data sources for 3D object detection remains challenging. We propose LCF3D, a novel sensor fusion framework that combines a 2D object detector on RGB images with a 3D object detector on LiDAR point clouds. By leveraging multimodal fusion principles, we compensate for inaccuracies in the LiDAR object detection network. Our solution combines two key principles: (i) late fusion, to reduce LiDAR False Positives by matching LiDAR 3D detections with RGB 2D detections and filtering out unmatched LiDAR detections; and (ii) cascade fusion, to recover missed objects from LiDAR by generating new 3D frustum proposals corresponding to unmatched RGB detections. Experiments show that LCF3D is beneficial for domain generalization, as it turns out to be successful in handling different sensor configurations between training and testing domains. LCF3D achieves significant improvements over LiDAR-based methods, particularly for challenging categories like pedestrians and cyclists in the KITTI dataset, as well as motorcycles and bicycles in nuScenes. Code can be downloaded from: https://github.com/CarloSgaravatti/LCF3D.
[81] Explainable Deep Learning for Pediatric Pneumonia Detection in Chest X-Ray Images
Adil O. Khadidos,Aziida Nanyonga,Alaa O. Khadidos,Olfat M. Mirza,Mustafa Tahsin Yilmaz
Main category: cs.CV
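TL;DR: This study compares two convolutional neural networks, DenseNet121 and EfficientNet-B0, for automated pediatric pneumonia detection, training and evaluating on 5,863 chest X-ray images. EfficientNet-B0 performs better on accuracy, F1-score, and MCC, and the Grad-CAM and LIME explainability methods increase trust in the models' decisions.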
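The metrics this entry reports (accuracy, F1-score, MCC, recall) all follow from a binary confusion matrix using their standard definitions. The sketch below uses invented counts, not the paper's data, to show how recall near 0.99 can coexist with a much lower MCC when false positives are common.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}

# Hypothetical test set: 100 pneumonia cases, 50 normal cases.
# Nearly every pneumonia case is caught, but many normal scans are flagged too.
metrics = binary_metrics(tp=99, fp=30, fn=1, tn=20)
```

With these made-up counts, recall is 0.99 while MCC is close to 0.53, because MCC also accounts for the normal scans that were misclassified. This is why a class-balanced summary such as MCC is useful to read alongside recall when comparing models.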
Details
Motivation: Pneumonia is a leading cause of childhood morbidity and mortality worldwide, so efficient and accurate diagnostic support tools are urgently needed; deep learning has shown strong potential in medical image analysis, especially chest X-ray interpretation. Method: Uses a public dataset of 5,863 pediatric chest X-ray images, preprocessed with normalization, resizing, and data augmentation; fine-tunes DenseNet121 and EfficientNet-B0 from ImageNet-pretrained weights under identical training settings, evaluates them with accuracy, F1-score, MCC, and recall, and adds Grad-CAM and LIME for explainability. Result: EfficientNet-B0 outperforms DenseNet121, with 84.6% accuracy, an F1-score of 0.8899, and an MCC of 0.6849, versus 79.7%, 0.8597, and 0.5852 for DenseNet121; both models reach recall above 0.99, indicating high sensitivity; Grad-CAM and LIME visualizations show the models attending to clinically relevant lung regions, supporting the reliability of their predictions. Conclusion: EfficientNet-B0 offers a better balance and greater computational efficiency than DenseNet121 and is a good fit for clinical deployment; adding explainability techniques improves the transparency and trustworthiness of AI-assisted diagnosis and supports its use in pediatric pneumonia. Abstract: Background: Pneumonia remains a leading cause of morbidity and mortality among children worldwide, emphasizing the need for accurate and efficient diagnostic support tools. Deep learning has shown strong potential in medical image analysis, particularly for chest X-ray interpretation. This study compares two state-of-the-art convolutional neural network (CNN) architectures for automated pediatric pneumonia detection. Methods: A publicly available dataset of 5,863 pediatric chest X-ray images was used. Images were preprocessed through normalization, resizing, and data augmentation to enhance generalization. DenseNet121 and EfficientNet-B0 were fine-tuned using pretrained ImageNet weights under identical training settings. Performance was evaluated using accuracy, F1-score, Matthews Correlation Coefficient (MCC), and recall. Model explainability was incorporated using Gradient-weighted Class Activation Mapping (Grad-CAM) and Local Interpretable Model-agnostic Explanations (LIME) to visualize image regions influencing predictions. Results: EfficientNet-B0 outperformed DenseNet121, achieving an accuracy of 84.6%, F1-score of 0.8899, and MCC of 0.6849. DenseNet121 achieved 79.7% accuracy, an F1-score of 0.8597, and MCC of 0.5852. Both models demonstrated high recall values above 0.99, indicating strong sensitivity to pneumonia detection. Grad-CAM and LIME visualizations showed consistent focus on clinically relevant lung regions, supporting the reliability of model decisions. Conclusions: EfficientNet-B0 provided a more balanced and computationally efficient performance compared to DenseNet121, making it a strong candidate for clinical deployment. The integration of explainability techniques enhances transparency and trustworthiness in AI-assisted pediatric pneumonia diagnosis.
[82] NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration
Subhajit Sanyal,Srinivas Soumitri Miriyala,Akshay Janardan Bankar,Sravanth Kodavanti,Harshit,Abhishek Ameta,Shreyas Pandith,Amit Satish Unde
Main category: cs.CV
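TL;DR: This paper presents NanoSD, a family of lightweight diffusion foundation models distilled from Stable Diffusion 1.5. Network surgery, feature-wise generative distillation, and structured scaling jointly optimize the U-Net and VAE, enabling real-time image restoration and generation on edge devices.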
Details
Motivation: Existing lightweight diffusion models mainly compress the denoising U-Net or shorten the diffusion trajectory, which disrupts the latent manifold and limits generalization; meanwhile, full diffusion pipelines are too computationally heavy to deploy on edge devices. Method: Proposes NanoSD, which jointly applies network surgery, feature-wise generative distillation, and structured architectural scaling to the U-Net and the VAE encoder-decoder, a full-pipeline co-design that compresses the model while preserving the generative prior. Result: NanoSD models range from 130M to 315M parameters, reach real-time inference down to 20 ms on mobile-class NPUs, and outperform prior lightweight models on image super-resolution, deblurring, face restoration, and monocular depth estimation, balancing perceptual quality with deployment efficiency. Conclusion: NanoSD is a family of Pareto-optimal diffusion foundation models that trade off accuracy, latency, and model size, supporting a range of visual generation and restoration tasks on edge devices with efficient, general-purpose real-time deployment. Abstract: Latent diffusion models such as Stable Diffusion 1.5 offer strong generative priors that are highly valuable for image restoration, yet their full pipelines remain too computationally heavy for deployment on edge devices. Existing lightweight variants predominantly compress the denoising U-Net or reduce the diffusion trajectory, which disrupts the underlying latent manifold and limits generalization beyond a single task. We introduce NanoSD, a family of Pareto-optimal diffusion foundation models distilled from Stable Diffusion 1.5 through network surgery, feature-wise generative distillation, and structured architectural scaling jointly applied to the U-Net and the VAE encoder-decoder. This full-pipeline co-design preserves the generative prior while producing models that occupy distinct operating points along the accuracy-latency-size frontier (e.g., 130M-315M parameters, achieving real-time inference down to 20ms on mobile-class NPUs). We show that parameter reduction alone does not correlate with hardware efficiency, and we provide an analysis revealing how architectural balance, feature routing, and latent-space preservation jointly shape true on-device latency. When used as a drop-in backbone, NanoSD enables state-of-the-art performance across image super-resolution, image deblurring, face restoration, and monocular depth estimation, outperforming prior lightweight diffusion models in both perceptual quality and practical deployability. NanoSD establishes a general-purpose diffusion foundation model family suitable for real-time visual generation and restoration on edge devices.
[83] UniHash: Unifying Pointwise and Pairwise Hashing Paradigms for Seen and Unseen Category Retrieval
Xiaoxu Ma,Runhao Li,Hanwen Liu,Xiangbo Zhang,Zhenyu Weng
Main category: cs.CV
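TL;DR: This paper proposes UniHash, a unified dual-branch hashing framework that combines the strengths of the pointwise and pairwise learning paradigms for efficient image retrieval on both seen and unseen categories.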
Details
Motivation: Existing deep hashing methods are usually confined to a single training paradigm (pointwise or pairwise) and struggle to perform well on seen and unseen categories at the same time, so a method that unifies the strengths of both paradigms is needed. Method: Proposes UniHash, with a center-based pointwise branch and a pairwise branch; introduces bidirectional knowledge transfer between them, using a mutual learning loss to align hash representations and a Split-Merge Mixture of Hash Experts (SM-MoH) module to enhance cross-branch exchange of representations. Result: Experiments on CIFAR-10, MSCOCO, and ImageNet show that UniHash achieves state-of-the-art image retrieval performance on both seen and unseen categories. Conclusion: By fusing the pointwise and pairwise paradigms, UniHash delivers balanced, strong retrieval on seen and unseen categories, with both theoretical analysis and experiments supporting its effectiveness. Abstract: Effective retrieval across both seen and unseen categories is crucial for modern image retrieval systems. Retrieval on seen categories ensures precise recognition of known classes, while retrieval on unseen categories promotes generalization to novel classes with limited supervision. However, most existing deep hashing methods are confined to a single training paradigm, either pointwise or pairwise, where the former excels on seen categories and the latter generalizes better to unseen ones. To overcome this limitation, we propose Unified Hashing (UniHash), a dual-branch framework that unifies the strengths of both paradigms to achieve balanced retrieval performance across seen and unseen categories. UniHash consists of two complementary branches: a center-based branch following the pointwise paradigm and a pairwise branch following the pairwise paradigm. A novel hash code learning method is introduced to enable bidirectional knowledge transfer between branches, improving hash code discriminability and generalization. It employs a mutual learning loss to align hash representations and introduces a Split-Merge Mixture of Hash Experts (SM-MoH) module to enhance cross-branch exchange of hash representations. Theoretical analysis substantiates the effectiveness of UniHash, and extensive experiments on CIFAR-10, MSCOCO, and ImageNet demonstrate that UniHash consistently achieves state-of-the-art performance in both seen and unseen image retrieval scenarios.
[84] ViSIL: Unified Evaluation of Information Loss in Multimodal Video Captioning
Po-han Li,Shenghui Chen,Ufuk Topcu,Sandeep Chinchali
Main category: cs.CV
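TL;DR: Proposes ViSIL, an information-theoretic framework that quantifies the information lost in video summaries. It enables unified comparison across summary formats of different modalities, and its scores correlate significantly with human and vision-language model performance on video question answering.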
Details
Motivation: Traditional metrics such as BLEU or ROUGE cannot measure information coverage across modalities (e.g., text versus a sequence of keyframes), so there is no accurate way to assess the quality of multimodal video summaries. Method: Proposes the Video Summary Information Loss (ViSIL) score, which uses vision-language model inference to measure the information lost between the original video and a multimodal summary, giving a unified information-theoretic evaluation framework. Result: ViSIL scores show a statistically significant correlation with both human and VLM performance on Video Question Answering (VQA); ViSIL can also drive summary selection that optimizes the trade-off between information loss and processing speed, improving VQA accuracy by 7% over text summaries without increasing processing load. Conclusion: ViSIL is an effective unified metric for multimodal summaries that reflects how much information summaries in different formats retain, supporting efficient and accurate video retrieval and generation applications. Abstract: Multimodal video captioning condenses dense footage into a structured format of keyframes and natural language. By creating a cohesive multimodal summary, this approach anchors generative AI in rich semantic evidence and serves as a lightweight proxy for high-efficiency retrieval. However, traditional metrics like BLEU or ROUGE fail to quantify information coverage across disparate modalities, such as comparing a paragraph of text to a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies the video information not captured by a summary via vision-language model (VLM) inference. By measuring the information loss, ViSIL is a unified metric that enables direct comparison across multimodal summary formats despite their structural discrepancies. Our results demonstrate that ViSIL scores show a statistically significant correlation with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL also enables summary selection to optimize the trade-off between information loss and processing speed, establishing a Pareto-optimal frontier that outperforms text summaries by 7% in VQA accuracy without increasing processing load.
[85] Breaking the Limits of Open-Weight CLIP: An Optimization Framework for Self-supervised Fine-tuning of CLIP
Anant Mehta,Xiyuan Wei,Xingyu Chen,Tianbao Yang
Main category: cs.CV
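TL;DR: This paper proposes TuneCLIP, a self-supervised fine-tuning framework for improving open-weight CLIP models across a range of downstream tasks. A warm-up stage that recovers optimization statistics, followed by a fine-tuning stage that optimizes a new contrastive loss, avoids performance degradation and yields clear gains for existing models on multiple benchmarks.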
Details
Motivation: Improving CLIP usually requires training from scratch at very high cost, while directly fine-tuning open-weight CLIP models often degrades performance, so a method is needed that improves general performance using only existing self-supervised data. Method: Proposes TuneCLIP, with two key steps: a warm-up stage, motivated by theoretical analysis, that recovers optimization statistics to reduce cold-start bias; and a fine-tuning stage that optimizes a new contrastive loss to lessen the penalty on false negative pairs. Result: TuneCLIP consistently improves performance across model architectures and scales. For example, on leading models such as SigLIP (ViT-B/16) it yields gains of up to +2.5% on ImageNet and related out-of-distribution benchmarks and +1.2% on the DataComp benchmark. Conclusion: TuneCLIP provides an efficient, general post-pretraining adaptation method for open-weight CLIP models, improving performance across tasks and setting a new strong baseline. Abstract: CLIP has become a cornerstone of multimodal representation learning, yet improving its performance typically requires a prohibitively costly process of training from scratch on billions of samples. We ask a different question: Can we improve the performance of open-weight CLIP models across various downstream tasks using only existing self-supervised datasets? Unlike supervised fine-tuning, which adapts a pretrained model to a single downstream task, our setting seeks to improve general performance across various tasks. However, as both our experiments and prior studies reveal, simply applying standard training protocols starting from an open-weight CLIP model often fails, leading to performance degradation. In this paper, we introduce TuneCLIP, a self-supervised fine-tuning framework that overcomes the performance degradation. TuneCLIP has two key components: (1) a warm-up stage of recovering optimization statistics to reduce cold-start bias, inspired by theoretical analysis, and (2) a fine-tuning stage of optimizing a new contrastive loss to mitigate the penalization on false negative pairs. Our extensive experiments show that TuneCLIP consistently improves performance across model architectures and scales. Notably, it elevates leading open-weight models like SigLIP (ViT-B/16), achieving gains of up to +2.5% on ImageNet and related out-of-distribution benchmarks, and +1.2% on the highly competitive DataComp benchmark, setting a new strong baseline for efficient post-pretraining adaptation.
[86] VibrantSR: Sub-Meter Canopy Height Models from Sentinel-2 Using Generative Flow Matching
Kiarie Ndegwa,Andreas Gros,Tony Chang,David Diaz,Vincent A. Landau,Nathan E. Rutenbeck,Luke J. Zachmann,Guy Bayes,Scott Conway
Main category: cs.CV
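TL;DR: VibrantSR is a generative super-resolution framework that produces 0.5 m canopy height models (CHMs) from 10 m Sentinel-2 imagery. Across 22 eco-regions in the western United States it outperforms existing satellite-based benchmarks, supporting large-area forest monitoring and carbon accounting.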
Details
Motivation: Canopy height modeling from aerial imagery is limited by infrequent and irregular acquisitions, which makes continuous large-area monitoring difficult. A method is needed that estimates canopy height at high resolution from globally available, frequently acquired satellite data.
Method: VibrantSR applies generative super-resolution to Sentinel-2 seasonal composites, going from 10 m to 0.5 m resolution to predict high-resolution canopy height models, and is evaluated across 22 eco-regions with spatially disjoint validation splits.
Result: For canopy heights >= 2 m, VibrantSR reaches a mean absolute error (MAE) of 4.39 m, better than the satellite-based benchmarks Meta (4.83 m), LANDFIRE (5.96 m), and ETH (7.05 m). It is less accurate than the aerial-based VibrantVS (2.71 m) but offers broader coverage and applicability.
Conclusion: VibrantSR uses widely available Sentinel-2 data to monitor forest structure continuously at continental scale, offering a workable route to large-scale carbon accounting that does not depend on costly aerial data.
Abstract: We present VibrantSR (Vibrant Super-Resolution), a generative super-resolution framework for estimating 0.5 meter canopy height models (CHMs) from 10 meter Sentinel-2 imagery. Unlike approaches based on aerial imagery that are constrained by infrequent and irregular acquisition schedules, VibrantSR leverages globally available Sentinel-2 seasonal composites, enabling consistent monitoring at a seasonal-to-annual cadence. Evaluated across 22 EPA Level 3 eco-regions in the western United States using spatially disjoint validation splits, VibrantSR achieves a Mean Absolute Error of 4.39 meters for canopy heights >= 2 m, outperforming Meta (4.83 m), LANDFIRE (5.96 m), and ETH (7.05 m) satellite-based benchmarks. While aerial-based VibrantVS (2.71 m MAE) retains an accuracy advantage, VibrantSR enables operational forest monitoring and carbon accounting at continental scales without reliance on costly and temporally infrequent aerial acquisitions.
[87] MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation
Yang Xing,Jiong Wu,Savas Ozdemir,Ying Zhang,Yang Yang,Wei Shao,Kuang Gong
Main category: cs.CV
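TL;DR: This paper proposes MedVL-SAM2, a unified 3D medical multimodal model that supports report generation, visual question answering, and several segmentation tasks. By combining image-level reasoning with pixel-level perception, it achieves fine-grained visual grounding and spatial reasoning in 3D medical imaging.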
Details
Motivation: Existing medical vision-language models have limited fine-grained visual grounding and 3D spatial reasoning, and struggle to unify multiple capabilities in one framework. A general 3D medical VLM is needed that handles both high-level semantic understanding and precise spatial localization.
Method: MedVL-SAM2 adopts a SAM2-based volumetric segmentation module, is pre-trained on 3D CT image-text pairs, and is then jointly optimized in multiple stages with language-understanding and segmentation objectives, enabling flexible interaction through text, point, or box prompts.
Result: The model reaches state-of-the-art performance on report generation, visual question answering, and multiple 3D segmentation tasks, with reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning.
Conclusion: MedVL-SAM2 unifies high-level semantic reasoning and fine-grained 3D localization in a single framework, demonstrating the feasibility and potential of a general 3D medical vision-language model.
Abstract: Recent progress in medical vision-language models (VLMs) has achieved strong performance on image-level text-centric tasks such as report generation and visual question answering (VQA). However, achieving fine-grained visual grounding and volumetric spatial reasoning in 3D medical VLMs remains challenging, particularly when aiming to unify these capabilities within a single, generalizable framework. To address this challenge, we propose MedVL-SAM2, a unified 3D medical multimodal model that concurrently supports report generation, VQA, and multi-paradigm segmentation, including semantic, referring, and interactive segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging, and incorporates a SAM2-based volumetric segmentation module to enable precise multi-granular spatial reasoning. The model is trained in a multi-stage pipeline: it is first pre-trained on a large-scale corpus of 3D CT image-text pairs to align volumetric visual features with radiology-language embeddings. It is then jointly optimized with both language-understanding and segmentation objectives using a comprehensive 3D CT segmentation dataset. This joint training enables flexible interaction via language, point, or box prompts, thereby unifying high-level visual reasoning with spatially precise localization. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks.
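The joint optimization stage described above can be pictured, very roughly, as minimizing a weighted sum of a language loss and a segmentation loss. The sketch below is illustrative only: the abstract does not name the actual losses, so token-level cross-entropy, a soft Dice loss, the `seg_weight` parameter, and all function names are assumptions made for this example.

```python
import math
from typing import Sequence


def token_cross_entropy(target_token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the reference tokens.

    Each entry is the probability the model assigned to the correct token.
    """
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)


def soft_dice_loss(pred: Sequence[float], target: Sequence[float],
                   eps: float = 1e-6) -> float:
    """1 - Dice overlap between a flattened predicted mask and the ground truth."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def joint_loss(target_token_probs: Sequence[float],
               pred_mask: Sequence[float],
               true_mask: Sequence[float],
               seg_weight: float = 1.0) -> float:
    """Hypothetical joint objective: language loss + weighted segmentation loss."""
    return (token_cross_entropy(target_token_probs)
            + seg_weight * soft_dice_loss(pred_mask, true_mask))


# Toy check: a perfect report and mask give ~0 loss; a disjoint mask is penalized.
perfect = joint_loss([1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
disjoint = joint_loss([1.0, 1.0], [1.0, 0.0], [0.0, 1.0])
```

In a real 3D setting the masks would be voxel volumes and the token probabilities would come from the language decoder; the flat lists here only keep the arithmetic visible.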
Extensive analyses further show that the model provides reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning, demonstrating that high-level semantic reasoning and precise 3D localization can be jointly achieved within a unified 3D medical VLM.
[88] Transition Matching Distillation for Fast Video Generation
Weili Nie,Julius Berner,Nanye Ma,Chao Liu,Saining Xie,Arash Vahdat
Main category: cs.CV
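TL;DR: This paper proposes Transition Matching Distillation (TMD), a new framework that distills video diffusion models into efficient few-step generators for high-quality, real-time video generation.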
Details
Motivation: Large video diffusion models generate high-quality output, but their inefficient multi-step sampling makes them hard to use in real-time interactive settings, so an efficient distillation method is needed.
Method: TMD distills a diffusion model into a few-step generator by matching its multi-step denoising trajectory with a few-step probability transition process. It uses a decomposed architecture: a main backbone extracts semantic representations, a flow head performs the inner flow updates, and knowledge is transferred through distribution matching distillation.
Result: Experiments on the Wan2.1 1.3B and 14B text-to-video models show that TMD achieves a good trade-off between generation speed and visual quality, outperforming existing distillation methods in visual fidelity and prompt adherence.
Conclusion: TMD is an effective and flexible distillation framework for video generation models that substantially improves inference efficiency while preserving generation quality, making it suitable for real-time applications.
Abstract: Large video diffusion and flow models have achieved remarkable success in high-quality video generation, but their use in real-time interactive applications remains limited due to their inefficient multi-step sampling process. In this work, we present Transition Matching Distillation (TMD), a novel framework for distilling video diffusion models into efficient few-step generators. The central idea of TMD is to match the multi-step denoising trajectory of a diffusion model with a few-step probability transition process, where each transition is modeled as a lightweight conditional flow. To enable efficient distillation, we decompose the original diffusion backbone into two components: (1) a main backbone, comprising the majority of early layers, that extracts semantic representations at each outer transition step; and (2) a flow head, consisting of the last few layers, that leverages these representations to perform multiple inner flow updates. Given a pretrained video diffusion model, we first introduce a flow head to the model, and adapt it into a conditional flow map. We then apply distribution matching distillation to the student model with flow head rollout in each transition step. Extensive experiments on distilling Wan2.1 1.3B and 14B text-to-video models demonstrate that TMD provides a flexible and strong trade-off between generation speed and visual quality. In particular, TMD outperforms existing distilled models under comparable inference costs in terms of visual fidelity and prompt adherence.
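The backbone/flow-head split described above suggests a two-level sampling loop: one expensive backbone pass per outer transition, then several cheap flow-head updates that reuse its features. The sketch below shows only that control flow, under stated assumptions. The function `tmd_style_sample`, the Euler update, the time schedules, and the toy `backbone` and `flow_head` callables are hypothetical stand-ins and are not taken from the paper.

```python
from typing import Callable, Dict, List

Vector = List[float]

calls: Dict[str, int] = {"backbone": 0, "flow_head": 0}


def tmd_style_sample(
    x: Vector,
    backbone: Callable[[Vector, float], Vector],
    flow_head: Callable[[Vector, Vector, float], Vector],
    outer_steps: int = 4,
    inner_steps: int = 2,
) -> Vector:
    """Few-step sampling: one backbone pass per outer transition,
    several lightweight flow-head updates inside each transition."""
    for k in range(outer_steps):
        t = 1.0 - k / outer_steps            # outer noise level, from 1 toward 0
        features = backbone(x, t)            # expensive call, once per transition
        y = list(x)
        for j in range(inner_steps):
            s = j / inner_steps              # inner flow time in [0, 1)
            velocity = flow_head(y, features, s)   # cheap call, reuses features
            y = [yi + vi / inner_steps for yi, vi in zip(y, velocity)]  # Euler step
        x = y
    return x


def toy_backbone(x: Vector, t: float) -> Vector:
    """Toy stand-in: 'predicts' a target halfway toward zero (ignores t)."""
    calls["backbone"] += 1
    return [0.5 * xi for xi in x]


def toy_flow_head(y: Vector, features: Vector, s: float) -> Vector:
    """Toy straight-line velocity that reaches `features` at inner time 1."""
    calls["flow_head"] += 1
    return [(fi - yi) / (1.0 - s) for yi, fi in zip(y, features)]


result = tmd_style_sample([16.0, -8.0], toy_backbone, toy_flow_head)
```

With the defaults above the toy run makes 4 backbone calls and 8 flow-head calls, which is the cost profile the decomposition is meant to exploit: the count of expensive passes is set by `outer_steps` alone.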
Project page: https://research.nvidia.com/labs/genair/tmd
[89] OT-Drive: Out-of-Distribution Off-Road Traversable Area Segmentation via Optimal Transport
Zhihua Zhao,Guoqiang Li,Chen Min,Kangping Lu
Main category: cs.CV
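TL;DR: This paper proposes OT-Drive, an optimal-transport-based multi-modal fusion framework that improves the out-of-distribution (OOD) generalization of traversable area segmentation in unstructured environments for autonomous driving.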
Details
Motivation: Existing data-driven methods lose segmentation performance in OOD scenarios, which undermines the reliability of downstream autonomous driving tasks.
Method: The fusion of RGB and surface normals is formulated as a distribution transport problem. A Scene Anchor Generator (SAG) builds semantic anchors, and an optimal-transport fusion module (OT Fusion) maps multi-modal features onto the manifold defined by those anchors.
Result: The method reaches 95.16% mIoU on ORFD OOD scenarios, 6.35% above prior methods, and 89.79% mIoU on cross-dataset transfer, 13.99% above baselines.
Conclusion: OT-Drive shows strong OOD generalization with limited training data, improving the model's practicality and deployment efficiency in real-world scenarios.
Abstract: Reliable traversable area segmentation in unstructured environments is critical for planning and decision-making in autonomous driving. However, existing data-driven approaches often suffer from degraded segmentation performance in out-of-distribution (OOD) scenarios, consequently impairing downstream driving tasks. To address this issue, we propose OT-Drive, an Optimal Transport-driven multi-modal fusion framework. The proposed method formulates RGB and surface normal fusion as a distribution transport problem. Specifically, we design a novel Scene Anchor Generator (SAG) to decompose scene information into the joint distribution of weather, time-of-day, and road type, thereby constructing semantic anchors that can generalize to unseen scenarios. Subsequently, we design an innovative Optimal Transport-based multi-modal fusion module (OT Fusion) to transport RGB and surface normal features onto the manifold defined by the semantic anchors, enabling robust traversable area segmentation under OOD scenarios. Experimental results demonstrate that our method achieves 95.16% mIoU on ORFD OOD scenarios, outperforming prior methods by 6.35%, and 89.79% mIoU on cross-dataset transfer tasks, surpassing baselines by 13.99%. These results indicate that the proposed model can attain strong OOD generalization with only limited training data, substantially enhancing its practicality and efficiency for real-world deployment.
[90] The Spatial Blindspot of Vision-Language Models
Nahid Alam,Leema Krishna Murali,Siddhant Bharadwaj,Patrick Liu,Timothy Chung,Drishti Sharma,Akshata A,Kranthi Kiran,Wesley Tam,Bala Krishna S Vegesna
Main category: cs.CV
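TL;DR: This paper examines the weakness of vision-language models (VLMs) in understanding spatial relationships and proposes strengthening spatial awareness through alternative training objectives and 2D positional encodings.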
Details
Motivation: Current VLMs are typically trained by flattening images into 1D patch sequences, which discards the 2D structure that spatial reasoning depends on. The authors argue that this lack of spatial awareness is a blind spot in VLM design and holds back applications that need spatial grounding.
Method: Two approaches are studied: image encoders trained with alternative objectives, and 2D positional encodings that preserve the spatial structure of the image.
Result: Experiments show that these architectural choices can improve spatial reasoning on several benchmarks.
Conclusion: Changing the image encoder's training objective and adding 2D positional encodings can improve the spatial understanding of VLMs, which matters for applications such as robotics and embodied AI.
Abstract: Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a blindspot. Current VLMs are typically built with contrastive language-image pretraining (CLIP) style image encoders. The training recipe often flattens images into 1D patch sequences, discarding the 2D structure necessary for spatial reasoning. We argue that this lack of spatial awareness is a missing dimension in VLM design and a bottleneck for applications requiring spatial grounding, such as robotics and embodied AI. To address this, we investigate (i) image encoders trained with alternative objectives and (ii) 2D positional encodings. Our experiments show that these architectural choices can lead to improved spatial reasoning on several benchmarks.
[91] DR$^2$Seg: Decomposed Two-Stage Rollouts for Efficient Reasoning Segmentation in Multimodal Large Language Models
Yulin He,Wei Chen,Zhikang Jian,Tianhang Guo,Wenjuan Zhou,Minglong Li
Main category: cs.CV
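TL;DR: Proposes DR$^2$Seg, a self-rewarding framework whose two-stage rollout strategy improves the reasoning efficiency and segmentation accuracy of multimodal large language models on complex text queries, without extra supervision.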
Details
Motivation: When handling complex text queries, existing methods tend to overthink: they generate verbose reasoning chains that interfere with object localization in multimodal large language models.
Method: DR$^2$Seg uses a two-stage rollout strategy. In the first stage the model generates a self-contained description that explicitly specifies the target object. In the second stage that description replaces the original query to verify its self-containment. Two self-rewards strengthen goal-oriented reasoning and suppress redundant thinking.
Result: Experiments across multimodal large language models of different scales and across segmentation models show that DR$^2$Seg consistently improves reasoning efficiency and overall segmentation performance.
Conclusion: DR$^2$Seg mitigates overthinking and achieves more efficient and accurate reasoning segmentation without extra thinking supervision.
Abstract: Reasoning segmentation is an emerging vision-language task that requires reasoning over intricate text queries to precisely segment objects. However, existing methods typically suffer from overthinking, generating verbose reasoning chains that interfere with object localization in multimodal large language models (MLLMs). To address this issue, we propose DR$^2$Seg, a self-rewarding framework that improves both reasoning efficiency and segmentation accuracy without requiring extra thinking supervision. DR$^2$Seg employs a two-stage rollout strategy that decomposes reasoning segmentation into multimodal reasoning and referring segmentation. In the first stage, the model generates a self-contained description that explicitly specifies the target object. In the second stage, this description replaces the original complex query to verify its self-containment. Based on this design, two self-rewards are introduced to strengthen goal-oriented reasoning and suppress redundant thinking. Extensive experiments across MLLMs of varying scales and segmentation models demonstrate that DR$^2$Seg consistently improves reasoning efficiency and overall segmentation performance.
[92] DW-DGAT: Dynamically Weighted Dual Graph Attention Network for Neurodegenerative Disease Diagnosis
Chengjia Liang,Zhenjiong Wang,Chao Chen,Ruizhi Zhang,Songxi Liang,Hai Xie,Haijun Lei,Zhongwei Huang
Main category: cs.CV
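TL;DR: Proposes a dynamically weighted dual graph attention network (DW-DGAT) for early diagnosis of Parkinson's and Alzheimer's disease. It fuses multimodal data, extracts features with a dual graph structure, and mitigates class imbalance; experiments show state-of-the-art performance.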
Details
Motivation: Early diagnosis of neurodegenerative diseases faces high-dimensional multimodal data fusion, data heterogeneity, and class imbalance, so a more effective diagnostic model is needed.
Method: The DW-DGAT model combines a general-purpose data fusion strategy, a dual graph attention architecture based on brain regions and inter-sample relationships, and a class weight generation mechanism paired with stable loss functions to handle class imbalance.
Result: The model is validated on the PPMI and ADNI datasets, where it shows better diagnostic performance than existing methods.
Conclusion: DW-DGAT effectively integrates heterogeneous multi-source data and improves early diagnostic accuracy for Parkinson's and Alzheimer's disease, with potential for clinical use.
Abstract: Parkinson's disease (PD) and Alzheimer's disease (AD) are the two most prevalent and incurable neurodegenerative diseases (NDs) worldwide, for which early diagnosis is critical to delay their progression. However, the high dimensionality of multi-metric data with diverse structural forms, the heterogeneity of neuroimaging and phenotypic data, and class imbalance collectively pose significant challenges to early ND diagnosis. To address these challenges, we propose a dynamically weighted dual graph attention network (DW-DGAT) that integrates: (1) a general-purpose data fusion strategy to merge three structural forms of multi-metric data; (2) a dual graph attention architecture based on brain regions and inter-sample relationships to extract both micro- and macro-level features; and (3) a class weight generation mechanism combined with two stable and effective loss functions to mitigate class imbalance. Rigorous experiments, based on the Parkinson's Progression Markers Initiative (PPMI) and Alzheimer's Disease Neuroimaging Initiative (ADNI) studies, demonstrate the state-of-the-art performance of our approach.
[93] VERHallu: Evaluating and Mitigating Event Relation Hallucination in Video Large Language Models
Zefan Zhang,Kehua Zhu,Shijie Jiang,Hongyuan Lu,Shengkai Sun,Tian Bai
Main category: cs.CV
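TL;DR: This paper proposes VERHallu, a new benchmark for evaluating video event relation hallucination that focuses on causal, temporal, and subevent relations. It also introduces a Key-Frame Propagating (KFP) strategy that strengthens multi-event understanding and mitigates the hallucinations of existing VideoLLMs in dense-event reasoning.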
Details
Motivation: Existing work has overlooked hallucinations about relations between events in video, especially causal, temporal, and subevent relations. Current VideoLLMs perform poorly on such dense-event reasoning and tend to hallucinate by relying on prior knowledge.
Method: The VERHallu benchmark comprises relation classification, question answering, and counterfactual question answering tasks, built on a dataset with counterintuitive scenarios and human-annotated candidate answers. The Key-Frame Propagating (KFP) strategy reallocates frame-level attention in intermediate layers to improve understanding of multi-event relations.
Result: Experiments show that current mainstream VideoLLMs have clear weaknesses in event relation reasoning and often overlook subevents, leading to incomplete understanding. KFP reduces event relation hallucination without slowing inference.
Conclusion: Event relation hallucination is an important challenge for VideoLLMs. VERHallu provides a new benchmark for evaluating it, and KFP improves the model's grasp of complex event structure by redistributing frame-level attention.
Abstract: Video Large Language Models (VideoLLMs) exhibit various types of hallucinations. Existing research has primarily focused on hallucinations involving the presence of events, objects, and scenes in videos, while largely neglecting event relation hallucination. In this paper, we introduce a novel benchmark for evaluating Video Event Relation Hallucination, named VERHallu. This benchmark focuses on causal, temporal, and subevent relations between events, encompassing three types of tasks: relation classification, question answering, and counterfactual question answering, for a comprehensive evaluation of event relation hallucination. Additionally, it features counterintuitive video scenarios that deviate from typical pretraining distributions, with each sample accompanied by human-annotated candidates covering both vision-language and pure language biases. Our analysis reveals that current state-of-the-art VideoLLMs struggle with dense-event relation reasoning, often relying on prior knowledge due to insufficient use of frame-level cues. Although these models demonstrate strong grounding capabilities for key events, they often overlook the surrounding subevents, leading to an incomplete and inaccurate understanding of event relations. To tackle this, we propose a Key-Frame Propagating (KFP) strategy, which reallocates frame-level attention within intermediate layers to enhance multi-event understanding. Experiments show it effectively mitigates the event relation hallucination without affecting inference speed.
[94] Disentangled Concept Representation for Text-to-image Person Re-identification
Giyeol Kim,Chanho Eom
Main category: cs.CV
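TL;DR: Proposes the DiCo framework, which uses disentangled concept representations to achieve hierarchical cross-modal alignment for text-to-image person re-identification. It performs competitively on several datasets and improves interpretability.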
Details
Motivation: To address the large gap between the text and image modalities and the difficulty of distinguishing individuals at a fine-grained level.
Method: A disentangled representation built from shared slots and concept blocks: each slot acts as a part-level anchor across modalities and is decomposed into multiple concept blocks that separate attributes such as color, texture, and shape.
Result: Performance competitive with state-of-the-art methods on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets.
Conclusion: DiCo achieves finer-grained cross-modal alignment and improves both retrieval performance and model interpretability.
Abstract: Text-to-image person re-identification (TIReID) aims to retrieve person images from a large gallery given free-form textual descriptions. TIReID is challenging due to the substantial modality gap between visual appearances and textual expressions, as well as the need to model fine-grained correspondences that distinguish individuals with similar attributes such as clothing color, texture, or outfit style. To address these issues, we propose DiCo (Disentangled Concept Representation), a novel framework that achieves hierarchical and disentangled cross-modal alignment. DiCo introduces a shared slot-based representation, where each slot acts as a part-level anchor across modalities and is further decomposed into multiple concept blocks. This design enables the disentanglement of complementary attributes (e.g., color, texture, shape) while maintaining consistent part-level correspondence between image and text. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid demonstrate that our framework achieves competitive performance with state-of-the-art methods, while also enhancing interpretability through explicit slot- and block-level representations for more fine-grained retrieval results.
[95] UEOF: A Benchmark Dataset for Underwater Event-Based Optical Flow
Nick Truong,Pritam P. Karmokar,William J. Beksi
Main category: cs.CV
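TL;DR: Presents the first synthetic underwater optical flow benchmark dataset for event cameras. Realistic event data are generated from physically based rendered RGBD sequences, with dense ground-truth optical flow, depth, and camera motion, to advance underwater event-based perception algorithms.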
Details
Motivation: Underwater imaging suffers from light attenuation, scattering, and non-uniform illumination, so conventional cameras struggle to capture accurate motion. The lack of underwater event camera datasets paired with ground-truth optical flow has also limited research.
Method: Underwater RGBD video sequences are generated by physically based ray tracing, converted to event data with a modern video-to-event pipeline, and accompanied by dense ground truth for optical flow, depth, and camera motion.
Result: The first synthetic underwater event-based optical flow benchmark dataset was built. It supports evaluation of learning-based and model-based optical flow algorithms, and the paper analyzes how underwater light transport affects event formation and motion estimation.
Conclusion: The dataset sets a new baseline for developing and evaluating underwater event-based perception algorithms and should help advance the field.
Abstract: Underwater imaging is fundamentally challenging due to wavelength-dependent light attenuation, strong scattering from suspended particles, turbidity-induced blur, and non-uniform illumination. These effects impair standard cameras and make ground-truth motion nearly impossible to obtain. On the other hand, event cameras offer microsecond resolution and high dynamic range. Nonetheless, progress on investigating event cameras for underwater environments has been limited due to the lack of datasets that pair realistic underwater optics with accurate optical flow. To address this problem, we introduce the first synthetic underwater benchmark dataset for event-based optical flow derived from physically-based ray-traced RGBD sequences. Using a modern video-to-event pipeline applied to rendered underwater videos, we produce realistic event data streams with dense ground-truth flow, depth, and camera motion. Moreover, we benchmark state-of-the-art learning-based and model-based optical flow prediction methods to understand how underwater light transport affects event formation and motion estimation accuracy. Our dataset establishes a new baseline for future development and evaluation of underwater event-based perception algorithms. The source code and dataset for this project are publicly available at https://robotic-vision-lab.github.io/ueof.
[96] CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation
Chengzhuo Tong,Mingkun Chang,Shenglong Zhang,Yuran Wang,Cheng Liang,Zhizheng Zhao,Ruichuan An,Bohan Zeng,Yang Shi,Yifan Dai,Ziming Zhao,Guanbin Li,Pengfei Wan,Yuanxing Zhang,Wentao Zhang
Main category: cs.CV
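TL;DR: This paper proposes CoF-T2I, which brings the Chain-of-Frame (CoF) reasoning of video generation into text-to-image (T2I) generation. Progressive visual refinement, with intermediate frames acting as explicit reasoning steps, significantly improves generation quality.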
Details
Motivation: Existing T2I models lack a clearly defined starting point for visual reasoning and interpretable intermediate states, which limits generation quality. Video models have shown CoF reasoning ability, but it has not yet been applied effectively to T2I.
Method: CoF-T2I uses CoF reasoning for progressive visual refinement, treating intermediate frames as explicit reasoning steps. The CoF-Evol-Instruct dataset is built to model the generation process from semantics to aesthetics, and each frame is encoded independently to improve quality and reduce motion artifacts.
Result: CoF-T2I significantly outperforms the base video model, reaching 0.86 on GenEval and 7.468 on Imagine-Bench.
Conclusion: Bringing CoF reasoning into T2I generation is feasible and effective, opens a new direction for high-quality image generation, and shows the promise of video models for T2I tasks.
Abstract: Recent video generation models have revealed the emergence of Chain-of-Frame (CoF) reasoning, enabling frame-by-frame visual inference. With this capability, video models have been successfully applied to various visual tasks (e.g., maze solving, visual puzzles). However, their potential to enhance text-to-image (T2I) generation remains largely unexplored due to the absence of a clearly defined visual reasoning starting point and interpretable intermediate states in the T2I generation process. To bridge this gap, we propose CoF-T2I, a model that integrates CoF reasoning into T2I generation via progressive visual refinement, where intermediate frames act as explicit reasoning steps and the final frame is taken as output. To establish such an explicit generation process, we curate CoF-Evol-Instruct, a dataset of CoF trajectories that model the generation process from semantics to aesthetics. To further improve quality and avoid motion artifacts, we enable independent encoding operation for each frame. Experiments show that CoF-T2I significantly outperforms the base video model and achieves competitive performance on challenging benchmarks, reaching 0.86 on GenEval and 7.468 on Imagine-Bench. These results indicate the substantial promise of video models for advancing high-quality text-to-image generation.
[97] ReaMIL: Reasoning- and Evidence-Aware Multiple Instance Learning for Whole-Slide Histopathology
Hyun Do Jung,Jungwon Choi,Hwiyoung Kim
Main category: cs.CV
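TL;DR: ReaMIL is a multiple instance learning method for whole-slide histopathology images. A light selection head trained with a budgeted-sufficiency objective selects small, compact evidence sets while matching or improving classification performance.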
Details
Motivation: In whole-slide pathology analysis, existing MIL methods lack a systematic assessment of evidence efficiency and interpretability, making it hard to identify a minimal sufficient evidence set or to quantify how reliable the model's reasoning is.
Method: ReaMIL adds a light selection head to a strong MIL backbone to produce soft per-tile gates. It is trained with a budgeted-sufficiency objective (a hinge loss) that keeps the true-class probability at or above a threshold $\tau$ while limiting the number of selected tiles, and it naturally yields slide-level overlays.
Result: On TCGA-NSCLC, TCGA-BRCA, and PANDA, ReaMIL matches or slightly improves baseline AUC (for example, AUC 0.983 on NSCLC) while needing very few tiles to reach high confidence (mean MSK of about 8.2 on NSCLC), with AUKC of about 0.864 and spatially compact evidence sets.
Conclusion: ReaMIL needs no extra supervision, integrates seamlessly with standard MIL training, and improves evidence efficiency and interpretability without sacrificing performance, giving a more rigorous way to evaluate model behavior in whole-slide pathology analysis.
Abstract: We introduce ReaMIL (Reasoning- and Evidence-Aware MIL), a multiple instance learning approach for whole-slide histopathology that adds a light selection head to a strong MIL backbone. The head produces soft per-tile gates and is trained with a budgeted-sufficiency objective: a hinge loss that enforces the true-class probability to be $\geq \tau$ using only the kept evidence, under a sparsity budget on the number of selected tiles. The budgeted-sufficiency objective yields small, spatially compact evidence sets without sacrificing baseline performance. Across TCGA-NSCLC (LUAD vs. LUSC), TCGA-BRCA (IDC vs. Others), and PANDA, ReaMIL matches or slightly improves baseline AUC and provides quantitative evidence-efficiency diagnostics. On NSCLC, it attains AUC 0.983 with a mean minimal sufficient K (MSK) $\approx 8.2$ tiles at $\tau = 0.90$ and AUKC $\approx 0.864$, showing that class confidence rises sharply and stabilizes once a small set of tiles is kept. The method requires no extra supervision, integrates seamlessly with standard MIL training, and naturally yields slide-level overlays. We report accuracy alongside MSK, AUKC, and contiguity for rigorous evaluation of model behavior on WSIs.
[98] Thinking Like Van Gogh: Structure-Aware Style Transfer via Flow-Guided 3D Gaussian Splatting
Zhendong Wang,Lebin Zhou,Jingchuan Xiao,Rongduo Han,Nam Ling,Cihan Ruan
Main category: cs.CV
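TL;DR: This paper proposes a flow-guided geometric advection framework for Post-Impressionist style transfer on 3D Gaussian Splatting. It back-propagates the artistic flow of 2D paintings into 3D space to achieve geometric abstraction and structural deformation without mesh priors.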
Details
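Motivation: Existing 3D style transfer methods mostly focus on projecting surface textures and overlook the core idea of Post-Impressionist art: exaggerated geometric form and simplified structure. A new approach led by geometric abstraction is therefore needed.
Method: A flow-guided geometric advection framework for 3D Gaussian Splatting. Directional flow fields are extracted from 2D paintings and back-propagated into 3D space, adjusting Gaussian primitives into flow-aligned brushstrokes that conform to scene topology. A luminance-structure decoupling strategy separates geometric deformation from color optimization, and an evaluation framework based on a vision-language model (VLM) assesses artistic quality.
Result: The method achieves 3D geometric stylization without mesh priors, producing 3D scenes with strong structural exaggeration and a sense of artistic motion that are visually closer to Post-Impressionist style. The decoupling strategy reduces artifacts under aggressive deformation, and the VLM evaluation indicates better artistic authenticity than conventional methods.
Conclusion: Geometric abstraction is key to authentic 3D artistic style transfer. Through flow guidance and decoupled optimization, the method brings Post-Impressionist principles into 3D representations and improves the expressiveness and artistic credibility of the stylization.
Abstract: In 1888, Vincent van Gogh wrote, "I am seeking exaggeration in the essential." This principle, amplifying structural form while suppressing photographic detail, lies at the core of Post-Impressionist art. However, most existing 3D style transfer methods invert this philosophy, treating geometry as a rigid substrate for surface-level texture projection. To authentically reproduce Post-Impressionist stylization, geometric abstraction must be embraced as the primary vehicle of expression. We propose a flow-guided geometric advection framework for 3D Gaussian Splatting (3DGS) that operationalizes this principle in a mesh-free setting. Our method extracts directional flow fields from 2D paintings and back-propagates them into 3D space, rectifying Gaussian primitives to form flow-aligned brushstrokes that conform to scene topology without relying on explicit mesh priors. This enables expressive structural deformation driven directly by painterly motion rather than photometric constraints.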
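To make the idea of a directional flow field from a 2D painting more concrete, here is a minimal sketch of one common way to obtain such a field. The abstract does not say how the authors compute theirs, so this approach (central-difference image gradients rotated by 90 degrees, giving unit vectors that run along intensity contours) and the function name `stroke_direction_field` are assumptions for illustration only. The 3D back-propagation step is not covered.

```python
import math
from typing import List, Tuple

Image = List[List[float]]
Field = List[List[Tuple[float, float]]]


def stroke_direction_field(img: Image) -> Field:
    """Return a per-pixel unit direction (dx, dy) tangent to intensity contours.

    Border pixels and flat regions get (0.0, 0.0), meaning "no direction".
    """
    h, w = len(img), len(img[0])
    field: Field = [[(0.0, 0.0)] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central differences approximate the image gradient.
            gx = (img[y][x + 1] - img[y][x - 1]) / 2.0
            gy = (img[y + 1][x] - img[y - 1][x]) / 2.0
            norm = math.hypot(gx, gy)
            if norm > 1e-8:
                # Rotate the gradient by 90 degrees: strokes follow edges.
                field[y][x] = (-gy / norm, gx / norm)
    return field


# Toy "painting": brightness increases left to right, so contours are vertical.
ramp = [[float(x) for x in range(5)] for _ in range(5)]
field = stroke_direction_field(ramp)
```

On the ramp image every interior pixel gets a vertical direction, since the brightness contours run top to bottom. A real painting would usually be smoothed first, for example with a structure tensor, so that the field is less sensitive to pixel noise.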
Our contributions are threefold: (1) a projection-based, mesh-free flow guidance mechanism that transfers 2D artistic motion into 3D Gaussian geometry; (2) a luminance-structure decoupling strategy that isolates geometric deformation from color optimization, mitigating artifacts during aggressive structural abstraction; and (3) a VLM-as-a-Judge evaluation framework that assesses artistic authenticity through aesthetic judgment instead of conventional pixel-level metrics, explicitly addressing the subjective nature of artistic stylization.
[99] Difficulty-guided Sampling: Bridging the Target Gap between Dataset Distillation and Downstream Tasks
Mingzhuo Li,Guang Li,Linfeng Ye,Jiafeng Mao,Takahiro Ogawa,Konstantinos N. Plataniotis,Miki Haseyama
Main category: cs.CV
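TL;DR: This paper proposes difficulty-guided sampling (DGS) to bridge the gap between the dataset distillation objective and the downstream task, improving performance on tasks such as image classification.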
Details
Motivation: Existing dataset distillation methods usually ignore information specific to the downstream task, which leaves a target gap between the distillation objective and the actual task.
Method: The paper introduces the concept of difficulty and proposes DGS as a plug-in post-stage sampling module: the final distilled dataset is sampled from image pools generated by existing methods according to a target difficulty distribution. It also proposes difficulty-aware guidance (DAG) to explore the effect of difficulty during generation.
Result: Extensive experiments show that the proposed methods improve distilled dataset quality and downstream task performance across multiple settings.
Conclusion: Difficulty information helps narrow the gap between the distillation objective and the downstream task and has broad potential for other tasks.
Abstract: In this paper, we propose difficulty-guided sampling (DGS) to bridge the target gap between the distillation objective and the downstream task, thereby improving the performance of dataset distillation. Deep neural networks achieve remarkable performance but have time- and storage-consuming training processes. Dataset distillation is proposed to generate compact, high-quality distilled datasets, enabling effective model training while maintaining downstream performance. Existing approaches typically focus on features extracted from the original dataset, overlooking task-specific information, which leads to a target gap between the distillation objective and the downstream task. We propose incorporating characteristics that benefit downstream training into dataset distillation to bridge this gap. Focusing on the downstream task of image classification, we introduce the concept of difficulty and propose DGS as a plug-in post-stage sampling module. Following the specific target difficulty distribution, the final distilled dataset is sampled from image pools generated by existing methods. We also propose difficulty-aware guidance (DAG) to explore the effect of difficulty in the generation process. Extensive experiments across multiple settings demonstrate the effectiveness of the proposed methods. They also highlight the broader potential of difficulty for diverse downstream tasks.
[100] V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation
Han Wang,Yi Yang,Jingyuan Hu,Minfeng Zhu,Wei Chen
Main category: cs.CV
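TL;DR: V-Zero is a self-improvement framework for vision-language models that needs no human annotation. It improves model performance through the co-evolution of a Questioner and a Solver trained without labels.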
Details
Motivation: Existing vision-language models depend on large amounts of human-annotated data, which is costly and slow to collect and limits their wider use.
Method: The V-Zero framework sets up two roles. The Questioner generates high-quality questions guided by a dual-track reasoning reward. The Solver is optimized with pseudo-labels from majority voting over its own sampled responses. Both are trained iteratively with Group Relative Policy Optimization (GRPO).
Result: On Qwen2.5-VL-7B-Instruct, performance improves without any human annotation: +1.7 on visual mathematical reasoning and +2.6 on general vision-centric tasks.
Conclusion: V-Zero shows that vision-language models can self-improve using only unlabeled images, opening a low-cost path to continual improvement of multimodal systems.
Abstract: Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated datasets, which are costly and time-consuming to acquire. To overcome this limitation, we introduce V-Zero, a general post-training framework that facilitates self-improvement using exclusively unlabeled images. V-Zero establishes a co-evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high-quality, challenging questions by leveraging a dual-track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo-labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a cycle of mutual enhancement. Remarkably, without a single human annotation, V-Zero achieves consistent performance gains on Qwen2.5-VL-7B-Instruct, improving visual mathematical reasoning by +1.7 and general vision-centric tasks by +2.6, demonstrating the potential of self-improvement in multimodal systems. Code is available at https://github.com/SatonoDia/V-Zero
[101] InfoSculpt: Sculpting the Latent Space for Generalized Category Discovery
Wenwen Liao,Hang Ruan,Jianbo Yu,Yuansong Wang,Qingchao Jiang,Xiaofeng Yang
Main category: cs.CV
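TL;DR: This paper proposes InfoSculpt, a generalized category discovery framework based on the Information Bottleneck principle. A dual conditional mutual information objective, optimized jointly at the category and instance levels, disentangles and compresses the representation space and improves classification of both known and novel categories.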
Details
Motivation: Existing generalized category discovery methods rely on pseudo-labels or two-stage clustering and lack a mechanism for explicitly separating essential category features from instance noise, which limits performance in open-world settings.
Method: The InfoSculpt framework applies the Information Bottleneck principle with a dual conditional mutual information (CMI) objective: a category-level CMI on labeled data learns compact, discriminative representations, and an instance-level CMI on all data compresses augmentation noise to extract invariant features.
Result: Extensive experiments on 8 benchmark datasets show that InfoSculpt outperforms existing methods on both known and novel category discovery, supporting the effectiveness of the approach.
Conclusion: InfoSculpt shapes the representation space systematically from an information-theoretic viewpoint, balancing the retention of category information against noise filtering, and offers a new and effective approach to generalized category discovery.
Abstract: Generalized Category Discovery (GCD) aims to classify instances from both known and novel categories within a large-scale unlabeled dataset, a critical yet challenging task for real-world, open-world applications. However, existing methods often rely on pseudo-labeling or two-stage clustering, which lack a principled mechanism to explicitly disentangle essential, category-defining signals from instance-specific noise. In this paper, we address this fundamental limitation by re-framing GCD from an information-theoretic perspective, grounded in the Information Bottleneck (IB) principle. We introduce InfoSculpt, a novel framework that systematically sculpts the representation space by minimizing a dual Conditional Mutual Information (CMI) objective. InfoSculpt uniquely combines a Category-Level CMI on labeled data to learn compact and discriminative representations for known classes, and a complementary Instance-Level CMI on all data to distill invariant features by compressing augmentation-induced noise. These two objectives work synergistically at different scales to produce a disentangled and robust latent space where categorical information is preserved while noisy, instance-specific details are discarded. Extensive experiments on 8 benchmarks validate the effectiveness of InfoSculpt and our information-theoretic approach.
[102] FlowAct-R1: Towards Interactive Humanoid Video Generation
Lizhen Wang,Yongming Zhu,Zhipeng Ge,Youwei Zheng,Longhao Zhang,Tianshu Hu,Shiyang Qin,Mingshuang Luo,Jiaxu Zhang,Xin Chen,Yulong Wang,Zerong Zheng,Jianwen Jiang,Chao Liang,Weifeng Chen,Xing Wang,Yuan Zhang,Mingyuan Gao
Main category: cs.CV
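TL;DR: This paper proposes FlowAct-R1, a framework for real-time interactive humanoid video generation. Built on the MMDiT architecture, it synthesizes streaming video with low latency, high fidelity, and long-term consistency.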
Details
Motivation: Existing video generation methods trade off high fidelity against real-time interaction and struggle to meet the needs of continuously responsive human-machine interaction.
Method: The framework builds on the MMDiT architecture and introduces a chunkwise diffusion forcing strategy with a self-forcing variant. Combined with efficient distillation and system-level optimizations, it generates video of arbitrary duration as a stream with low-latency response.
Result: It runs at a stable 25 fps at 480p with a time-to-first-frame of about 1.5 seconds, supports fine-grained full-body control, and shows strong behavioral vividness and visual realism.
Conclusion: FlowAct-R1 maintains efficient real-time performance while improving the quality and long-term temporal consistency of interactive humanoid video generation.
Abstract: Interactive humanoid video generation aims to synthesize lifelike visual agents that can engage with humans through continuous and responsive video. Despite recent advances in video synthesis, existing methods often grapple with the trade-off between high-fidelity synthesis and real-time interaction requirements. In this paper, we propose FlowAct-R1, a framework specifically designed for real-time interactive humanoid video generation. Built upon an MMDiT architecture, FlowAct-R1 enables the streaming synthesis of video with arbitrary durations while maintaining low-latency responsiveness. We introduce a chunkwise diffusion forcing strategy, complemented by a novel self-forcing variant, to alleviate error accumulation and ensure long-term temporal consistency during continuous interaction. By leveraging efficient distillation and system-level optimizations, our framework achieves a stable 25fps at 480p resolution with a time-to-first-frame (TTFF) of only around 1.5 seconds. The proposed method provides holistic and fine-grained full-body control, enabling the agent to transition naturally between diverse behavioral states in interactive scenarios. Experimental results demonstrate that FlowAct-R1 achieves exceptional behavioral vividness and perceptual realism, while maintaining robust generalization across diverse character styles.
[103] MathDoc: Benchmarking Structured Extraction and Active Refusal on Noisy Mathematics Exam Papers
Chenyue Zhou,Jiayi Tuo,Shitong Qin,Wei Dai,Mingxuan Wang,Ziwei Zhao,Duoyang Li,Shiyang Su,Yanxi Lu,Yanbiao Ma
Main category: cs.CV
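TL;DR: This paper presents MathDoc, the first benchmark for document-level information extraction from real high school mathematics exam papers. It contains 3,609 questions with real-world artifacts and adds an evaluation of a model's ability to refuse unrecognizable inputs. Experiments show that current state-of-the-art multimodal large language models extract content well but cannot reliably refuse illegible inputs, exposing a critical reliability gap.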
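The refusal evaluation described for this entry can be illustrated with a small scoring sketch. The abstract does not give the benchmark's exact refusal metric, so the precision/recall definition below, the `Sample` record, and its field names are assumptions made for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    illegible: bool  # ground truth: the question cannot be recognized (hypothetical field)
    refused: bool    # the model explicitly declined to extract it (hypothetical field)

def refusal_metrics(samples):
    """Return (precision, recall) of refusals over a list of Sample records."""
    tp = sum(s.illegible and s.refused for s in samples)        # correct refusals
    fp = sum((not s.illegible) and s.refused for s in samples)  # refused a legible question
    fn = sum(s.illegible and (not s.refused) for s in samples)  # answered an illegible one
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Under this assumed definition, a model that always produces a confident extraction and never refuses scores zero recall no matter how accurate its extractions are on legible questions.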
Details
Motivation: Existing datasets focus mainly on clean documents or generic layout analysis. They overlook the structural integrity of mathematical problems and a model's ability to actively refuse incomplete inputs, so they do not reflect the complexity of real educational settings.
Method: The authors build a new benchmark, MathDoc, containing 3,609 real high school mathematics exam questions with naturally occurring visual noise and explicitly including unrecognizable samples. They propose a multi-dimensional evaluation framework covering stem accuracy, visual similarity, and refusal capability.
Result: Experiments on SOTA MLLMs such as Qwen3-VL and Gemini-2.5-Pro show that end-to-end models extract content well but generally lack the ability to refuse illegible inputs, often producing confident but invalid outputs.
Conclusion: Current multimodal large language models have a reliability gap when processing degraded documents. MathDoc provides a new standard for evaluating models under realistic, complex conditions and highlights the importance of active refusal.
Abstract: The automated extraction of structured questions from paper-based mathematics exams is fundamental to intelligent education, yet remains challenging in real-world settings due to severe visual noise. Existing benchmarks mainly focus on clean documents or generic layout analysis, overlooking both the structural integrity of mathematical problems and the ability of models to actively reject incomplete inputs. We introduce MathDoc, the first benchmark for document-level information extraction from authentic high school mathematics exam papers. MathDoc contains 3,609 carefully curated questions with real-world artifacts and explicitly includes unrecognizable samples to evaluate active refusal behavior. We propose a multi-dimensional evaluation framework covering stem accuracy, visual similarity, and refusal capability. Experiments on SOTA MLLMs, including Qwen3-VL and Gemini-2.5-Pro, show that although end-to-end models achieve strong extraction performance, they consistently fail to refuse illegible inputs, instead producing confident but invalid outputs. These results highlight a critical gap in current MLLMs and establish MathDoc as a benchmark for assessing model reliability under degraded document conditions. Our project repository is available at https://github.com/winnk123/papers/tree/master
[104] Enhancing Visual In-Context Learning by Multi-Faceted Fusion
Wenwen Liao,Jianbo Yu,Yuansong Wang,Qingchao Jiang,Xiaofeng Yang
Main category: cs.CV
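TL;DR: Proposes a new multi-combination collaborative fusion framework that generates three contextual representation branches to integrate information from high-quality prompts, paired with a MULTI-VQGAN architecture for stronger visual in-context learning.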
Details
Motivation: Existing visual in-context learning methods typically use only a single prompt or a single fused top-K prompt, which discards valuable contextual information and limits the model's reasoning ability.
Method: The method generates three contextual representation branches, each integrating a different combination of high-quality prompts, and introduces a MULTI-VQGAN architecture to jointly interpret and use collaborative information from multiple sources.
Result: On foreground segmentation, single-object detection, and image colorization, the method shows stronger cross-task generalization and more accurate predictions.
Conclusion: Multi-combination collaborative fusion effectively improves visual in-context learning and outperforms existing single-prompt or simple-fusion approaches.
Abstract: Visual In-Context Learning (VICL) has emerged as a powerful paradigm, enabling models to perform novel visual tasks by learning from in-context examples. The dominant "retrieve-then-prompt" approach typically relies on selecting the single best visual prompt, a practice that often discards valuable contextual information from other suitable candidates. While recent work has explored fusing the top-K prompts into a single, enhanced representation, this still simply collapses multiple rich signals into one, limiting the model's reasoning capability. We argue that a more multi-faceted, collaborative fusion is required to unlock the full potential of these diverse contexts. To address this limitation, we introduce a novel framework that moves beyond single-prompt fusion towards a multi-combination collaborative fusion. Instead of collapsing multiple prompts into one, our method generates three contextual representation branches, each formed by integrating information from different combinations of top-quality prompts. These complementary guidance signals are then fed into the proposed MULTI-VQGAN architecture, which is designed to jointly interpret and utilize collaborative information from multiple sources. Extensive experiments on diverse tasks, including foreground segmentation, single-object detection, and image colorization, highlight its strong cross-task generalization, effective contextual fusion, and ability to produce more robust and accurate predictions than existing methods.
[105] Beyond Single Prompts: Synergistic Fusion and Arrangement for VICL
Wenwen Liao,Jianbo Yu,Yuansong Wang,Shifu Yan,Xiaofeng Yang
Main category: cs.CV
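TL;DR: This paper proposes an end-to-end visual in-context learning (VICL) framework that fuses information from multiple prompts and exploits their arrangement structure to improve how inpainting models adapt to tasks from only a few prompts.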
Details
Motivation: Existing VICL methods select only the most similar prompt and ignore complementary information in other high-quality prompts, and they fail to exploit the structural information implied by different prompt arrangements.
Method: The framework proposes an adaptive fusion module that aggregates key patterns and annotations from multiple prompts into more precise contextual prompts, introduces arrangement-specific lightweight MLPs to decouple layout priors, and adopts a bidirectional fine-tuning mechanism to strengthen collaboration between the fusion module and the inpainting model.
Result: Experiments on foreground segmentation, single-object detection, and image colorization show superior performance and strong cross-task generalization.
Conclusion: The proposed method addresses the under-use of prompt information and the missing structural information in existing VICL approaches, significantly improving the model's in-context learning ability.
Abstract: Vision In-Context Learning (VICL) enables inpainting models to quickly adapt to new visual tasks from only a few prompts. However, existing methods suffer from two key issues: (1) selecting only the most similar prompt discards complementary cues from other high-quality prompts; and (2) failing to exploit the structured information implied by different prompt arrangements. We propose an end-to-end VICL framework to overcome these limitations. Firstly, an adaptive Fusion Module aggregates critical patterns and annotations from multiple prompts to form more precise contextual prompts. Secondly, we introduce arrangement-specific lightweight MLPs to decouple layout priors from the core model, while minimally affecting the overall model. In addition, a bidirectional fine-tuning mechanism swaps the roles of query and prompt, encouraging the model to reconstruct the original prompt from fused context and thus enhancing collaboration between the fusion module and the inpainting model. Experiments on foreground segmentation, single-object detection, and image colorization demonstrate superior results and strong cross-task generalization of our method.
[106] VQ-Seg: Vector-Quantized Token Perturbation for Semi-Supervised Medical Image Segmentation
Sicheng Yang,Zhaohu Xing,Lei Zhu
Main category: cs.CV
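TL;DR: This paper proposes VQ-Seg, the first semi-supervised medical image segmentation method that uses vector quantization (VQ) to discretize the feature space and introduces a controllable Quantized Perturbation Module (QPM), which replaces conventional dropout perturbation to improve regularization.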
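As a rough illustration of the perturbation this entry describes (shuffling the spatial locations of codebook indices), the sketch below permutes a chosen fraction of positions in a 2D index map. The `ratio` knob, the function name, and the list-of-lists layout are assumptions for illustration; the abstract only states that the perturbation is controllable and does not specify how.

```python
import random

def shuffle_code_indices(index_map, ratio, seed=None):
    """Permute a fraction `ratio` of spatial positions in a 2D map of codebook indices."""
    rng = random.Random(seed)
    h, w = len(index_map), len(index_map[0])
    positions = [(r, c) for r in range(h) for c in range(w)]
    k = int(round(ratio * len(positions)))       # number of positions to perturb (assumed control)
    chosen = rng.sample(positions, k)
    values = [index_map[r][c] for r, c in chosen]
    rng.shuffle(values)                          # move indices among the chosen positions
    out = [row[:] for row in index_map]          # copy so the input map is left untouched
    for (r, c), v in zip(chosen, values):
        out[r][c] = v
    return out
```

Because indices are only moved, never replaced, every perturbed map still decodes to valid codebook entries, which is one plausible reason such a perturbation is easier to bound than a dropout rate.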
Details
Motivation: Existing dropout-based feature perturbation methods depend on manually tuning the dropout rate, a sensitive hyperparameter that is hard to optimize and can lead to poor regularization.
Method: The paper proposes the Quantized Perturbation Module (QPM), which perturbs discrete feature representations by shuffling the spatial locations of codebook indices. It designs a dual-branch structure that shares the post-quantization feature space between image reconstruction and segmentation, and introduces a Post-VQ Feature Adapter (PFA) that incorporates high-level semantic information from a foundation model to compensate for quantization loss.
Result: Experiments on a self-collected large-scale lung cancer dataset (828 CT scans) and other public benchmarks show that the method outperforms current state-of-the-art semi-supervised segmentation methods.
Conclusion: VQ-Seg achieves more effective and controllable feature perturbation through vector quantization, removes the difficulty of tuning the dropout hyperparameter, and shows superior performance across multiple medical image segmentation tasks.
Abstract: Consistency learning with feature perturbation is a widely used strategy in semi-supervised medical image segmentation. However, many existing perturbation methods rely on dropout, and thus require careful manual tuning of the dropout rate, a sensitive hyperparameter that is often difficult to optimize and may lead to suboptimal regularization. To overcome this limitation, we propose VQ-Seg, the first approach to employ vector quantization (VQ) to discretize the feature space and introduce a novel and controllable Quantized Perturbation Module (QPM) that replaces dropout. Our QPM perturbs discrete representations by shuffling the spatial locations of codebook indices, enabling effective and controllable regularization. To mitigate potential information loss caused by quantization, we design a dual-branch architecture where the post-quantization feature space is shared by both image reconstruction and segmentation tasks. Moreover, we introduce a Post-VQ Feature Adapter (PFA) to incorporate guidance from a foundation model (FM), supplementing the high-level semantic information lost during quantization. Furthermore, we collect a large-scale Lung Cancer (LC) dataset comprising 828 CT scans annotated for central-type lung carcinoma. Extensive experiments on the LC dataset and other public benchmarks demonstrate the effectiveness of our method, which outperforms state-of-the-art approaches. Code available at: https://github.com/script-Yang/VQ-Seg.
[107] LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
Linquan Wu,Tianxiang Jiang,Yifei Dong,Haoyu Yang,Fengji Zhang,Shichaang Meng,Ai Xuan,Linqi Song,Jacky Keung
Main category: cs.CV
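TL;DR: Proposes LaViT, a framework that aligns latent visual thoughts rather than static embeddings to improve visual grounding in multimodal reasoning, significantly boosting performance on complex reasoning tasks.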
Details
Motivation: Existing methods rely on external supervision and ignore intrinsic visual attention dynamics. As a result, student models imitate the teacher's output while attending to different visual regions, which creates a perception gap.
Method: The LaViT framework forces the student model to autoregressively reconstruct the teacher's visual semantics and attention trajectories before generating text, and uses a curriculum sensory gating mechanism to prevent shortcut learning.
Result: Experiments show gains of up to +16.9% on complex reasoning tasks, and a compact 3B model outperforms larger open-source models and proprietary models such as GPT-4o.
Conclusion: Aligning latent visual thoughts effectively narrows the perception gap and strengthens visual grounding, offering a better distillation paradigm for multimodal reasoning.
Abstract: Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.
[108] Advancing Adaptive Multi-Stage Video Anomaly Reasoning: A Benchmark Dataset and Method
Chao Huang,Benfeng Wang,Wei Wang,Jie Wen,Li Shen,Wenqi Ren,Yong Xu,Xiaochun Cao
Main category: cs.CV
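TL;DR: This paper proposes a new task, Video Anomaly Reasoning (VAR), aimed at improving the reasoning ability of multimodal large models in video anomaly detection. It defines a multi-level reasoning process from perception to decision, releases a large-scale dataset with an annotation framework based on a perception-cognition-action chain, proposes a method that makes reasoning more reliable under weak supervision, and builds Vad-R1-Plus, a model that supports adaptive hierarchical reasoning. Experiments show that it outperforms existing baselines on the VAR task.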
Details
Motivation: Existing MLLM-based video anomaly detection methods are mostly limited to localization or post-hoc description. They lack explicit reasoning, risk awareness, and decision-oriented understanding, and cannot support in-depth analysis of anomalous events in complex scenes, so a new task and model with structured multi-stage reasoning are needed.
Method: The paper proposes the Video Anomaly Reasoning (VAR) task and builds a large-scale dataset of 8,641 videos and more than 50,000 samples, annotated with a structured Perception-Cognition-Action Chain-of-Thought (PerCoAct-CoT). It proposes Anomaly-Aware Group Relative Policy Optimization to improve reasoning reliability under weak supervision, and designs an end-to-end MLLM-based model, Vad-R1-Plus, that supports adaptive hierarchical reasoning and risk-aware decision making.
Result: Experiments show that the proposed method significantly outperforms open-source and closed-source baselines on the VAR task, validating the task definition, dataset design, and model architecture, and improving the reasoning ability and decision quality of MLLMs in video anomaly understanding.
Conclusion: This work moves video anomaly understanding from descriptive analysis toward structured reasoning. By introducing the VAR task, a large-scale annotated dataset, and a chain-of-thought modeling paradigm, it provides a new benchmark and path for multimodal large models in safety-sensitive scenarios.
Abstract: Recent progress in reasoning capabilities of Multimodal Large Language Models (MLLMs) has highlighted their potential for performing complex video understanding tasks. However, in the domain of Video Anomaly Detection and Understanding (VAD&U), existing MLLM-based methods are largely limited to anomaly localization or post-hoc description, lacking explicit reasoning processes, risk awareness, and decision-oriented interpretation. To address this gap, we define a new task termed Video Anomaly Reasoning (VAR), which elevates video anomaly analysis from descriptive understanding to structured, multi-stage reasoning. VAR explicitly requires models to perform progressive reasoning over anomalous events before answering anomaly-related questions, encompassing visual perception, causal interpretation, and risk-aware decision making. To support this task, we present a new dataset with 8,641 videos, where each video is annotated with diverse question types corresponding to different reasoning depths, totaling more than 50,000 samples, making it one of the largest datasets for video anomaly. The annotations are based on a structured Perception-Cognition-Action Chain-of-Thought (PerCoAct-CoT), which formalizes domain-specific reasoning priors for video anomaly understanding. This design enables systematic evaluation of multi-stage and adaptive anomaly reasoning.
In addition, we propose Anomaly-Aware Group Relative Policy Optimization to further enhance reasoning reliability under weak supervision. Building upon the proposed task and dataset, we develop an end-to-end MLLM-based VAR model termed Vad-R1-Plus, which supports adaptive hierarchical reasoning and risk-aware decision making. Extensive experiments demonstrate that the proposed benchmark and method effectively advance the reasoning capabilities of MLLMs on VAR tasks, outperforming both open-source and proprietary baselines.
[109] RAG-3DSG: Enhancing 3D Scene Graphs with Re-Shot Guided Retrieval-Augmented Generation
Yue Chang,Rufeng Chen,Zhaofan Zhang,Yi Chen,Sihong Xie
Main category: cs.CV
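TL;DR: This paper proposes RAG-3DSG, which uses re-shot guided uncertainty estimation and retrieval-augmented generation to improve the accuracy and efficiency of open-vocabulary 3D scene graph generation.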
Details
Motivation: Existing open-vocabulary 3D scene graph generation methods perform poorly in both object recognition accuracy and speed, mainly because of constrained viewpoints, occlusions, and redundant surface density.
Method: RAG-3DSG combines re-shot guided uncertainty estimation to suppress aggregation noise, uses low-uncertainty objects to support retrieval-augmented generation, and adds a dynamic downsample-mapping strategy to accelerate cross-image object aggregation.
Result: Experiments on the Replica dataset show that RAG-3DSG significantly improves the accuracy of node captions in the 3D scene graph and reduces mapping time by two-thirds.
Conclusion: RAG-3DSG effectively improves the quality and efficiency of open-vocabulary 3D scene graph generation, giving stronger support for semantic understanding in robotics tasks.
Abstract: Open-vocabulary 3D Scene Graph (3DSG) generation can enhance various downstream tasks in robotics, such as manipulation and navigation, by leveraging structured semantic representations. A 3DSG is constructed from multiple images of a scene, where objects are represented as nodes and relationships as edges. However, existing works for open-vocabulary 3DSG generation suffer from both low object-level recognition accuracy and low speed, mainly due to constrained viewpoints, occlusions, and redundant surface density. To address these challenges, we propose RAG-3DSG to mitigate aggregation noise through re-shot guided uncertainty estimation and support object-level Retrieval-Augmented Generation (RAG) via reliable low-uncertainty objects. Furthermore, we propose a dynamic downsample-mapping strategy to accelerate cross-image object aggregation with adaptive granularity. Experiments on the Replica dataset demonstrate that RAG-3DSG significantly improves node captioning accuracy in 3DSG generation while reducing the mapping time by two-thirds compared to the vanilla version.
[110] From Physical Degradation Models to Task-Aware All-in-One Image Restoration
Hu Gao,Xiaoning Lei,Xichen Xu,Xingjian Wang,Lizhuang Ma
Main category: cs.CV
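TL;DR: This paper proposes OPIR, an efficient all-in-one image restoration framework that uses physical degradation modeling to predict a task-aware inverse degradation operator and introduces an uncertainty perception map to guide two-stage restoration, combining high performance with high efficiency.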
Details
Motivation: Existing all-in-one image restoration methods add extra learning modules, which makes systems complex and hard to use in real time. This paper starts from physical degradation modeling to design a more efficient, compact model for fast and reliable multi-task image restoration.
Method: The OPIR framework has two stages. The first stage uses the predicted inverse degradation operator to produce an initial restored image and an uncertainty perception map. The second stage refines the restoration under the guidance of that map. Both stages use the same inverse operator prediction network, task-aware parameters adapt it to different degradation tasks, and the convolution is accelerated to improve efficiency.
Result: Experiments show that OPIR achieves superior all-in-one restoration performance across multiple image restoration tasks, is highly competitive on task-specific restoration, and has good real-time behavior.
Conclusion: With physics-inspired inverse degradation modeling and an uncertainty-guided two-stage mechanism, OPIR achieves efficient, compact, and high-performing all-in-one image restoration, offering a practical option for real applications.
Abstract: All-in-one image restoration aims to adaptively handle multiple restoration tasks with a single trained model. Although existing methods achieve promising results by introducing prompt information or leveraging large models, the added learning modules increase system complexity and hinder real-time applicability. In this paper, we adopt a physical degradation modeling perspective and predict a task-aware inverse degradation operator for efficient all-in-one image restoration. The framework consists of two stages. In the first stage, the predicted inverse operator produces an initial restored image together with an uncertainty perception map that highlights regions difficult to reconstruct, ensuring restoration reliability. In the second stage, the restoration is further refined under the guidance of this uncertainty map. The same inverse operator prediction network is used in both stages, with task-aware parameters introduced after operator prediction to adapt to different degradation tasks. Moreover, by accelerating the convolution of the inverse operator, the proposed method achieves efficient all-in-one image restoration. The resulting tightly integrated architecture, termed OPIR, is extensively validated through experiments, demonstrating superior all-in-one restoration performance while remaining highly competitive on task-aligned restoration.
[111] ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and TEst-time Generative Adaptation
Kim Youwang,Lee Hyoseok,Subin Park,Gerard Pons-Moll,Tae-Hyun Oh
Main category: cs.CV
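TL;DR: ELITE is an efficient method for synthesizing Gaussian head avatars from monocular video. It combines the strengths of 3D data priors and 2D generative priors, using a feed-forward Mesh2Gaussian prior model for fast initialization and a rendering-guided single-step diffusion enhancer for test-time generative adaptation, which markedly improves synthesis speed and realism.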
Details
Motivation: When synthesizing avatars from monocular video, methods that rely on 3D data priors generalize poorly, while methods that rely on 2D generative priors are computationally heavy and prone to identity hallucination. A new method that is efficient, high-fidelity, and generalizes well is therefore needed.
Method: The ELITE framework has two core components: a Mesh2Gaussian Prior Model (MGPM) for fast initialization of the Gaussian avatar, and a test-time generative adaptation stage that uses a rendering-guided single-step diffusion enhancer, supervised by both real and synthetic images, to restore detail.
Result: Experiments show that ELITE produces visually superior results to prior methods even for challenging expressions, while synthesizing 60x faster than the method based on a 2D generative prior.
Conclusion: By combining the strengths of 3D and 2D priors, ELITE achieves efficient, high-fidelity animatable avatar synthesis with strong in-the-wild generalization, offering a new direction for monocular-video-driven digital humans.
Abstract: We introduce ELITE, an Efficient Gaussian head avatar synthesis method from a monocular video via Learned Initialization and TEst-time generative adaptation. Prior works rely either on a 3D data prior or a 2D generative prior to compensate for missing visual cues in monocular videos. However, 3D data prior methods often struggle to generalize in-the-wild, while 2D generative prior methods are computationally heavy and prone to identity hallucination. We identify a complementary synergy between these two priors and design an efficient system that achieves high-fidelity animatable avatar synthesis with strong in-the-wild generalization. Specifically, we introduce a feed-forward Mesh2Gaussian Prior Model (MGPM) that enables fast initialization of a Gaussian avatar. To further bridge the domain gap at test time, we design a test-time generative adaptation stage, leveraging both real and synthetic images as supervision. Unlike previous full diffusion denoising strategies that are slow and hallucination-prone, we propose a rendering-guided single-step diffusion enhancer that restores missing visual details, grounded on Gaussian avatar renderings. Our experiments demonstrate that ELITE produces visually superior avatars to prior works, even for challenging expressions, while achieving 60x faster synthesis than the 2D generative prior method.
[112] Beyond Inpainting: Unleash 3D Understanding for Precise Camera-Controlled Video Generation
Dong-Yu Chen,Yixin Guo,Shuojin Yang,Tai-Jiang Mu,Shi-Min Hu
Main category: cs.CV
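TL;DR: This paper proposes DepthDirector, a video re-rendering framework that uses depth information to control the camera trajectory precisely while keeping the video content consistent.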
Details
Motivation: When changing the camera trajectory, existing methods fail to make full use of the 3D priors in video diffusion models and often fall into the inpainting trap, causing subject inconsistency and degraded generation quality.
Method: The paper designs a view-content dual-stream conditioning mechanism that injects both the source video and the warped depth sequence rendered under the target viewpoint into a pretrained video generation model, and trains the framework with a lightweight LoRA-based adapter.
Result: Experiments show that DepthDirector outperforms existing methods in both camera controllability and visual quality.
Conclusion: DepthDirector effectively combines 3D geometric guidance with the prior knowledge of video diffusion models, achieving high-quality, highly consistent video re-rendering.
Abstract: Camera control has been extensively studied in conditioned video generation; however, precisely altering the camera trajectories while faithfully preserving the video content remains a challenging task. The mainstream approach to achieving precise camera control is warping a 3D representation according to the target trajectory. However, such methods fail to fully leverage the 3D priors of video diffusion models (VDMs) and often fall into the Inpainting Trap, resulting in subject inconsistency and degraded generation quality. To address this problem, we propose DepthDirector, a video re-rendering framework with precise camera controllability. By leveraging the depth video from explicit 3D representation as camera-control guidance, our method can faithfully reproduce the dynamic scene of an input video under novel camera trajectories. Specifically, we design a View-Content Dual-Stream Condition mechanism that injects both the source video and the warped depth sequence rendered under the target viewpoint into the pretrained video generation model. This geometric guidance signal enables VDMs to comprehend camera movements and leverage their 3D understanding capabilities, thereby facilitating precise camera control and consistent content generation. Next, we introduce a lightweight LoRA-based video diffusion adapter to train our framework, fully preserving the knowledge priors of VDMs. Additionally, we construct a large-scale multi-camera synchronized dataset named MultiCam-WarpData using Unreal Engine 5, containing 8K videos across 1K dynamic scenes. Extensive experiments show that DepthDirector outperforms existing methods in both camera controllability and visual quality.
Our code and dataset will be publicly available.
[113] Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge
Sicheng Yang,Yukai Huang,Shitong Sun,Weitong Cai,Jiankang Deng,Jifei Song,Zhensong Zhang
Main category: cs.CV
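TL;DR: Proposes a framework that integrates query/option pre-processing, domain-specific fine-tuning, Temporal Chain-of-Thought prompting, and post-processing, significantly improving MLLM performance on complex video question answering.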
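The post-processing step named in this entry targets non-standardized model outputs. The abstract does not describe how the authors normalize them, so the sketch below is only one plausible approach: map free-form text to a multiple-choice option with regular expressions. The function name, the five-option default, and the matching rules are all assumptions.

```python
import re

def extract_choice(raw_output, num_options=5):
    """Map free-form model text to a 0-based option index, or None if no option is found."""
    letters = "ABCDEFGHIJ"[:num_options]   # option labels A, B, C, ... (assumed labeling scheme)
    # Prefer an explicit "answer is X" / "answer: X" statement (keyword is case-insensitive).
    m = re.search(rf"(?i:answer\s*(?:is|:)?)\s*\(?([{letters}])\)?\b", raw_output)
    if m:
        return letters.index(m.group(1))
    # Otherwise fall back to the last standalone capital option letter in the text.
    lone = re.findall(rf"\b([{letters}])\b", raw_output)
    return letters.index(lone[-1]) if lone else None
```

For example, `extract_choice("The answer is (C).")` yields index 2. A simple rule set like this will misread sentences that begin with the article "A", so a real pipeline would need stricter rules or constrained decoding.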
Details
Motivation: Existing MLLMs perform poorly on complex video QA tasks such as HD-EPIC VQA, mainly because of ambiguous queries, weak long-range temporal reasoning, and non-standardized outputs.
Method: The framework combines query and option pre-processing, domain-specific fine-tuning of Qwen2.5-VL, a new Temporal Chain-of-Thought (T-CoT) prompting method for multi-step reasoning, and a robust post-processing mechanism.
Result: The framework reaches 41.6% accuracy on the HD-EPIC VQA dataset, significantly better than baseline methods.
Conclusion: Complex video understanding requires systematic optimization of the whole reasoning pipeline, and holistic pipeline design is critical to performance.
Abstract: Multimodal Large Language Models (MLLMs) struggle with complex video QA benchmarks like HD-EPIC VQA due to ambiguous queries/options, poor long-range temporal reasoning, and non-standardized outputs. We propose a framework integrating query/choice pre-processing, domain-specific Qwen2.5-VL fine-tuning, a novel Temporal Chain-of-Thought (T-CoT) prompting for multi-step reasoning, and robust post-processing. This system achieves 41.6% accuracy on HD-EPIC VQA, highlighting the need for holistic pipeline optimization in demanding video understanding. Our code and fine-tuned models are available at https://github.com/YoungSeng/Egocentric-Co-Pilot.
[114] Attend to what I say: Highlighting relevant content on slides
Megha Mariam K M,C. V. Jawahar
Main category: cs.CV
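TL;DR: This paper proposes a method that automatically identifies and highlights the most relevant slide regions based on the speaker's narration, improving synchronization between auditory and visual information and aiding comprehension of content-dense videos.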
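Matching spoken content against slide elements, as this entry describes, can be sketched with a simple lexical baseline. The paper explores several approaches that the abstract does not detail, so the token-overlap (Jaccard) scoring and the function names below are assumptions chosen only to make the idea concrete.

```python
import re

def tokens(text):
    """Lowercased alphanumeric word set of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def most_relevant_region(spoken_sentence, regions):
    """Return the index of the slide region whose text best overlaps the spoken sentence."""
    spoken = tokens(spoken_sentence)

    def jaccard(region_text):
        words = tokens(region_text)
        union = spoken | words
        return len(spoken & words) / len(union) if union else 0.0

    scores = [jaccard(region) for region in regions]   # one score per region text
    return max(range(len(regions)), key=scores.__getitem__)
```

A lexical baseline of this kind cannot handle paraphrase or purely graphical regions, which is presumably why methods that understand slide text, graphics, and layout are needed.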
Details
Motivation: When watching fast-paced, content-dense presentations such as conference talks, listeners often find that what they hear and what they are looking at fall out of sync, which makes the material hard to follow. Existing ways of viewing slides give no dynamic guidance toward key regions, adding to cognitive load.
Method: The method analyzes the speaker's spoken content and matches it with textual or graphical elements on the slide to locate the most relevant visual region. The authors explore several ways of solving the problem and assess their success and failure cases.
Result: The proposed method aligns speech with slide content, automatically highlights key regions, reduces cognitive load, and improves how efficiently viewers understand multimedia content.
Conclusion: The method helps make multimedia documents such as educational videos and conference talks easier to understand. It is a step toward intelligent content-aware systems and has practical value.
Abstract: Imagine sitting in a presentation, trying to follow the speaker while simultaneously scanning the slides for relevant information. While the entire slide is visible, identifying the relevant regions can be challenging. As you focus on one part of the slide, the speaker moves on to a new sentence, leaving you scrambling to catch up visually. This constant back-and-forth creates a disconnect between what is being said and the most important visual elements, making it hard to absorb key details, especially in fast-paced or content-heavy presentations such as conference talks. This requires an understanding of slides, including text, graphics, and layout. We introduce a method that automatically identifies and highlights the most relevant slide regions based on the speaker's narrative. By analyzing spoken content and matching it with textual or graphical elements in the slides, our approach ensures better synchronization between what listeners hear and what they need to attend to. We explore different ways of solving this problem and assess their success and failure cases. Analyzing multimedia documents is emerging as a key requirement for seamless understanding of content-rich videos, such as educational videos and conference talks, by reducing cognitive strain and improving comprehension. Code and dataset are available at: https://github.com/meghamariamkm2002/Slide_Highlight
[115] DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset
Hengyu Shen,Tiancheng Gu,Bin Qin,Lan Wu,Yuling Wu,Shuo Tan,Zelong Sun,Jun Wang,Nan Wu,Xiang An,Weidong Cai,Ziyong Feng,Kaicheng Yang
Main category: cs.CV
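TL;DR: This paper presents DanQing, a high-quality cross-modal dataset of 100 million Chinese image-text pairs. Built with a more rigorous filtering pipeline from 2024-2025 web data, it significantly improves the performance of Chinese vision-language pre-trained models on downstream tasks.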
Details
Motivation: Because high-quality Chinese image-text data is scarce, Chinese vision-language pre-training lags behind its English counterpart. This paper aims to fill that data gap.
Method: The authors build a complete data processing pipeline that collects and filters high-quality Chinese image-text pairs from Common Crawl to form the DanQing dataset, then validate it through continual pre-training of the SigLIP2 model.
Result: On Chinese downstream tasks including zero-shot classification, cross-modal retrieval, and LMM-based evaluation, models trained on DanQing consistently outperform those trained on existing datasets.
Conclusion: DanQing is a high-quality, up-to-date Chinese image-text dataset that advances Chinese vision-language pre-training, and it will be open-sourced to support further research.
Abstract: Vision-Language Pre-training (VLP) models demonstrate strong performance across various downstream tasks by learning from large-scale image-text pairs through contrastive pretraining. The release of extensive English image-text datasets (e.g., COYO-700M and LAION-400M) has enabled widespread adoption of models such as CLIP and SigLIP in tasks including cross-modal retrieval and image captioning. However, the advancement of Chinese vision-language pretraining has substantially lagged behind, due to the scarcity of high-quality Chinese image-text data. To address this gap, we develop a comprehensive pipeline for constructing a high-quality Chinese cross-modal dataset. As a result, we propose DanQing, which contains 100 million image-text pairs collected from Common Crawl. Different from existing datasets, DanQing is curated through a more rigorous selection process, yielding superior data quality. Moreover, DanQing is primarily built from 2024-2025 web data, enabling models to better capture evolving semantic trends and thus offering greater practical utility. We compare DanQing with existing datasets by continual pre-training of the SigLIP2 model. Experimental results show that DanQing consistently achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations. To facilitate further research in Chinese vision-language pre-training, we will open-source the DanQing dataset under the Creative Commons CC-BY 4.0 license.
[116] Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models
Peng-Fei Zhang,Zi Huang
Main category: cs.CV
TL;DR: Proposes HRA, a hierarchically refined universal multimodal attack framework for efficiently attacking vision-language pre-trained models.
Details
Motivation: Existing adversarial attacks are mostly sample-specific, which makes them hard to scale to large datasets or new scenarios and computationally expensive.
Method: HRA refines universal adversarial perturbations at both the sample level and the optimization level. For images, it disentangles adversarial examples and introduces a ScMix augmentation strategy. For text, it combines intra-sentence and inter-sentence importance to select globally influential words as universal perturbations. During optimization, it uses a temporal hierarchy of historical and predicted gradients to avoid local optima.
Result: Experiments across multiple downstream tasks, VLP models, and datasets validate the effectiveness and generalization of HRA, which significantly outperforms existing methods.
Conclusion: HRA is an efficient universal multimodal attack framework that generates adversarial perturbations with strong transferability and robustness across modalities.
Abstract: Existing adversarial attacks for VLP models are mostly sample-specific, resulting in substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. HRA refines universal adversarial perturbations (UAPs) at both the sample level and the optimization level. For the image modality, we disentangle adversarial examples into clean images and perturbations, allowing each component to be handled independently for more effective disruption of cross-modal alignment. We further introduce a ScMix augmentation strategy that diversifies visual contexts and strengthens both global and local utility of UAPs, thereby reducing reliance on spurious features. In addition, we refine the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients to avoid local minima and stabilize universal perturbation learning. For the text modality, HRA identifies globally influential words by combining intra-sentence and inter-sentence importance measures, and subsequently utilizes these words as universal text perturbations. Extensive experiments across various downstream tasks, VLP models, and datasets demonstrate the superiority of the proposed universal multimodal attacks.
[117] ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding
Xueyun Tian,Wei Li,Bingbing Xu,Heng Dong,Yuanzhuo Wang,Huawei Shen
Main category: cs.CV
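TL;DR: Proposes ROMA, a real-time omni-multimodal assistant for unified real-time audio, video, and text understanding. It uses synchronized multimodal units and a lightweight speak head to support both reactive and proactive interaction, and performs strongly across 12 benchmarks.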
Details
Motivation: Existing omni-multimodal models either support modalities incompletely or lack autonomous proactive monitoring in streaming audio-video understanding, making unified real-time interaction difficult.
Method: ROMA processes continuous inputs as synchronized multimodal units that align dense audio with discrete video frames, introduces a lightweight speak head that decouples response triggering from generation, is trained on a dedicated streaming dataset with a two-stage curriculum, and is evaluated on a unified benchmark suite.
Result: Across 12 benchmarks, ROMA achieves state-of-the-art performance on proactive tasks (such as alerts and narration) and is competitive on reactive tasks (such as QA).
Conclusion: ROMA delivers strong unified real-time omni-multimodal understanding that is both reactive and proactive, validating its effectiveness and robustness in streaming multimodal settings.
Abstract: Recent Omni-multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate that ROMA achieves state-of-the-art performance on proactive tasks while remaining competitive in reactive settings, validating its robustness in unified real-time omni-multimodal understanding.
[118] SRAW-Attack: Space-Reweighted Adversarial Warping Attack for SAR Target Recognition
Yiming Zhang,Weibo Qin,Yuntian Liu,Feng Wang
Main category: cs.CV
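TL;DR: Proposes a new attack method, Space-Reweighted Adversarial Warping (SRAW), which optimizes spatial deformation with reweighted budgets across foreground and background regions to generate stealthier and more transferable adversarial examples, significantly degrading the performance of SAR-ATR models.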
Details
Motivation: Existing adversarial attacks on SAR-ATR usually need visible distortion to be effective, and the models are easily disturbed by background regions, so current attacks lack sufficient stealth and robustness.
Method: SRAW generates adversarial examples through spatial deformation and optimizes the deformation with reweighted budgets for foreground and background regions, improving imperceptibility and transferability.
Result: Experiments show that SRAW significantly reduces the recognition accuracy of advanced SAR-ATR models and outperforms existing attacks in both imperceptibility and transferability.
Conclusion: With a structured spatial perturbation strategy, SRAW makes adversarial attacks both more effective and stealthier, offering a new way to evaluate the robustness of SAR-ATR systems.
Abstract: Synthetic aperture radar (SAR) imagery exhibits intrinsic information sparsity due to its unique electromagnetic scattering mechanism. Despite the widespread adoption of deep neural network (DNN)-based SAR automatic target recognition (SAR-ATR) systems, they remain vulnerable to adversarial examples and tend to over-rely on background regions, leading to degraded adversarial robustness. Existing adversarial attacks for SAR-ATR often require visually perceptible distortions to achieve effective performance, thereby necessitating an attack method that balances effectiveness and stealthiness. In this paper, a novel attack method termed Space-Reweighted Adversarial Warping (SRAW) is proposed, which generates adversarial examples through optimized spatial deformation with reweighted budgets across foreground and background regions. Extensive experiments demonstrate that SRAW significantly degrades the performance of state-of-the-art SAR-ATR models and consistently outperforms existing methods in terms of imperceptibility and adversarial transferability. Code is made available at https://github.com/boremycin/SAR-ATR-TransAttack.
[119] Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders
Siqi Kou,Jiachun Jin,Zetong Zhou,Ye Ma,Yugang Wang,Quan Chen,Peng Jiang,Xiao Yang,Jun Zhu,Kai Yu,Zhijie Deng
Main category: cs.CV
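TL;DR: This paper proposes a new "think-then-generate" (T2G) paradigm that uses the reasoning ability of a large language model to rewrite text prompts and co-trains the text encoder and the diffusion model through dual-reward optimization, significantly improving factual consistency, semantic alignment, and visual realism in text-to-image generation.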
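The Dual-GRPO co-optimization named in this entry's abstract builds on Group Relative Policy Optimization, whose core step scores each sampled output relative to the other outputs in its group. The sketch below shows only that standard normalization; how the paper assigns image-grounded rewards to the encoder and the diffusion backbone is not reproduced, and the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)   # population std; some implementations use the sample std instead
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Outputs that beat their group's average receive positive advantages and the rest negative ones, so no separate value network is needed to form a baseline.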
Details
Motivation: Most existing text-to-image diffusion models use large language models only as text encoders and do not exploit their reasoning ability, leaving generated images lacking in semantic and factual consistency. This paper introduces a reasoning mechanism so the model can understand and infer what visual content should be generated.
Method: A lightweight supervised fine-tuning step first activates the LLM's think-then-rewrite pattern. The Dual-GRPO framework then jointly optimizes the LLM encoder and the diffusion model: the encoder is reinforced with image-grounded rewards to strengthen reasoning, while the diffusion model is optimized to produce semantically consistent and visually coherent images.
Result: Experiments show substantial gains on several reasoning-based image generation and editing benchmarks, with a WISE score of 0.79, close to the level of GPT-4, and strong factual consistency, semantic alignment, and visual realism.
Conclusion: The T2G paradigm effectively combines the reasoning ability of large language models with the generative ability of diffusion models, advancing next-generation unified models with reasoning, expression, and demonstration capacities.
Abstract: Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders, remain text-pixel mappers -- they employ LLMs merely as text encoders, without leveraging their inherent reasoning capabilities to infer what should be visually depicted given the textual prompt. To move beyond such literal generation, we propose the think-then-generate (T2G) paradigm, where the LLM-based text encoder is encouraged to reason about and rewrite raw user prompts; the states of the rewritten prompts then serve as diffusion conditioning. To achieve this, we first activate the think-then-rewrite pattern of the LLM encoder with a lightweight supervised fine-tuning process. Subsequently, the LLM encoder and diffusion backbone are co-optimized to ensure faithful reasoning about the context and accurate rendering of the semantics via Dual-GRPO. In particular, the text encoder is reinforced using image-grounded rewards to infer and recall world knowledge, while the diffusion backbone is pushed to produce semantically consistent and visually coherent images. Experiments show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks, achieving 0.79 on WISE score, nearly on par with GPT-4.
Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.
[120] An analytic theory of convolutional neural network inverse problems solvers
Minh Hai Nguyen,Quoc Bao Do,Edouard Pauwels,Pierre Weiss
Main category: cs.CV
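TL;DR: This paper proposes LE-MMSE, an interpretable theoretical framework based on the minimum mean square error (MMSE) estimator combined with the inductive biases of CNNs (translation equivariance and local receptive fields), to analyze how supervised convolutional neural networks behave on imaging inverse problems; extensive experiments show that it closely matches the outputs of trained networks.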
Details
Motivation: Supervised CNNs perform very well on imaging inverse problems, yet they lack theoretical understanding and are often treated as black boxes. This paper aims to close the gap between theory and practice by introducing an MMSE variant that carries the structural priors of CNNs. Method: Starting from the MMSE estimator, the authors add functional constraints for translation equivariance and local receptive fields and derive an analytic, interpretable, and tractable formula called Local-Equivariant MMSE (LE-MMSE), which they compare with the outputs of trained CNNs across several inverse problems, datasets, and architectures. Result: The LE-MMSE predictions closely match the outputs of trained networks (PSNR ≳ 25 dB), consistently across tasks (denoising, inpainting, deconvolution), datasets, and architectures (U-Net, ResNet, PatchMLP). The analysis also reveals differences between physics-aware and physics-agnostic estimators and the influence of factors such as the density of the training distribution. Conclusion: The paper provides a solid theoretical basis for understanding supervised CNNs on imaging inverse problems, showing that their behavior can be explained by a structurally constrained MMSE estimator and making it easier to interpret why CNNs work. Abstract: Supervised convolutional neural networks (CNNs) are widely used to solve imaging inverse problems, achieving state-of-the-art performance in numerous applications. However, despite their empirical success, these methods are poorly understood from a theoretical perspective and often treated as black boxes. To bridge this gap, we analyze trained neural networks through the lens of the Minimum Mean Square Error (MMSE) estimator, incorporating functional constraints that capture two fundamental inductive biases of CNNs: translation equivariance and locality via finite receptive fields. Under the empirical training distribution, we derive an analytic, interpretable, and tractable formula for this constrained variant, termed Local-Equivariant MMSE (LE-MMSE). Through extensive numerical experiments across various inverse problems (denoising, inpainting, deconvolution), datasets (FFHQ, CIFAR-10, FashionMNIST), and architectures (U-Net, ResNet, PatchMLP), we demonstrate that our theory matches the neural networks' outputs (PSNR $\gtrsim 25$ dB). Furthermore, we provide insights into the differences between physics-aware and physics-agnostic estimators, the impact of high-density regions in the training (patch) distribution, and the influence of other factors (dataset size, patch size, etc.).
[121] Fine-Grained Human Pose Editing Assessment via Layer-Selective MLLMs
Ningyu Sun,Zhaolin Cai,Zitong Xu,Peihang Chen,Huiyu Duan,Yichao Yan,Xiongkuo Min,Xiaokang Yang
Main category: cs.CV
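TL;DR: This paper introduces HPE-Bench, a standardized benchmark of 1,700 samples for evaluating text-guided human pose editing, together with a layer-selective multimodal large language model framework that uses contrastive LoRA tuning and layer sensitivity analysis to assess authenticity and quality more accurately.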
Details
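Motivation: Existing evaluation methods for pose editing separate authenticity judgment from quality scoring and lack fine-grained analysis of pose-specific inconsistencies, which falls short of what AIGC applications need. Method: The authors build the HPE-Bench benchmark, containing 1,700 samples from 17 state-of-the-art models with authenticity labels and multi-dimensional quality scores, and propose a layer-selective multimodal large language model framework that uses contrastive LoRA tuning and layer sensitivity analysis (LSA) to identify the optimal feature layer for evaluation. Result: The framework achieves superior performance on both authenticity detection and multi-dimensional quality regression, clearly outperforming existing metrics and unifying forensic detection with quality assessment. Conclusion: HPE-Bench provides a reliable evaluation standard for pose editing, and the proposed framework improves the accuracy and granularity of evaluation through layer selection, advancing controllable generation in AIGC. Abstract: Text-guided human pose editing has gained significant traction in AIGC applications. However, it remains plagued by structural anomalies and generative artifacts. Existing evaluation metrics often isolate authenticity detection from quality assessment, failing to provide fine-grained insights into pose-specific inconsistencies. To address these limitations, we introduce HPE-Bench, a specialized benchmark comprising 1,700 standardized samples from 17 state-of-the-art editing models, offering both authenticity labels and multi-dimensional quality scores. Furthermore, we propose a unified framework based on layer-selective multimodal large language models (MLLMs). By employing contrastive LoRA tuning and a novel layer sensitivity analysis (LSA) mechanism, we identify the optimal feature layer for pose evaluation. Our framework achieves superior performance in both authenticity detection and multi-dimensional quality regression, effectively bridging the gap between forensic detection and quality assessment.
[122] Towards Efficient Low-rate Image Compression with Frequency-aware Diffusion Prior Refinement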
Yichong Xia,Yimin Zhou,Jinpeng Wang,Bin Chen
Main category: cs.CV
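TL;DR: Proposes DiffCR, a framework that uses consistency prior refinement for efficient, high-fidelity diffusion-based image compression, substantially reducing bitrate and accelerating decoding.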
Details
Motivation: Existing diffusion-based image compression methods suffer from slow sampling and suboptimal bit allocation, a consequence of fragmented training paradigms. Method: The proposed DiffCR framework comprises a Frequency-aware Skip Estimation (FaSE) module and a lightweight consistency estimator; FaSE uses Frequency Decoupling Attention (FDA) to refine the ε-prediction prior of a pre-trained diffusion model and align it with compressed latents at different timesteps, enabling fast two-step decoding. Result: Without updating the backbone diffusion model, DiffCR achieves bitrate savings of 27.2% BD-rate (LPIPS) and 65.1% BD-rate (PSNR) over state-of-the-art diffusion-based compression methods, along with a decoding speed-up of more than 10×. Conclusion: By introducing consistency prior refinement and a frequency-aware module, DiffCR improves the compression efficiency and reconstruction quality of diffusion models at extremely low bitrates and shows practical potential. Abstract: Recent advancements in diffusion-based generative priors have enabled visually plausible image compression at extremely low bit rates. However, existing approaches suffer from slow sampling processes and suboptimal bit allocation due to fragmented training paradigms. In this work, we propose Accelerate Diffusion-based Image Compression via Consistency Prior Refinement (DiffCR), a novel compression framework for efficient and high-fidelity image reconstruction. At the heart of DiffCR is a Frequency-aware Skip Estimation (FaSE) module that refines the $ε$-prediction prior from a pre-trained latent diffusion model and aligns it with compressed latents at different timesteps via Frequency Decoupling Attention (FDA). Furthermore, a lightweight consistency estimator enables fast two-step decoding by preserving the semantic trajectory of diffusion sampling. Without updating the backbone diffusion model, DiffCR achieves substantial bitrate savings (27.2% BD-rate (LPIPS) and 65.1% BD-rate (PSNR)) and over $10\times$ speed-up compared to SOTA diffusion-based compression baselines.
[123] Global Context Compression with Interleaved Vision-Text Transformation
Dian Jiao,Jiaxin Duan,Shuai Zhao,Jiabing Leng,Yiran Zhang,Feng Huang
Main category: cs.CV
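TL;DR: This paper proposes VIST2, a novel Transformer that achieves global context compression through visual encoding, reducing the number of text tokens at both the prefilling and inference stages and thereby substantially speeding up generation while lowering memory and compute costs.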
Details
Motivation: Existing methods compress the input through visual encoding only at the prefilling stage and cannot save computation during token-by-token inference, so a method is needed that keeps compressing the context throughout generation. Method: Text chunks are rendered into sketch images, and the VIST2 model interleaves raw text with its visual encoding as input while relying exclusively on visual tokens in the pre-context to predict the next text token; training proceeds in stages, with curriculum-scheduled pretraining for optical language modeling followed by modal-interleaved instruction tuning. Result: At a 4× compression ratio, VIST2 achieves on average a 3× speedup in first-token generation, a 77% reduction in memory usage, and a 74% reduction in FLOPS on long-form generation tasks. Conclusion: Through global visual compression, VIST2 reduces the compute and memory burden of Transformers on long sequences and offers a new direction for efficient language modeling. Abstract: Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This has motivated earlier works that render the Transformer's input into images for prefilling, which effectively reduces the number of tokens through visual encoding, thereby alleviating the quadratically growing attention computation. However, this partial compression fails to save computational or memory costs at token-by-token inference. In this paper, we investigate global context compression, which saves tokens at both prefilling and inference stages. Consequently, we propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encoding, while depending exclusively on visual tokens in the pre-context to predict the next text token distribution. Around this idea, we render text chunks into sketch images and train VIST2 in multiple stages, starting from curriculum-scheduled pretraining for optical language modeling, followed by modal-interleaved instruction tuning. We conduct extensive experiments using VIST2 families scaled from 0.6B to 8B to explore the training recipe and hyperparameters. With a 4$\times$ compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks, achieving, on average, a 3$\times$ speedup in first-token generation, 77% reduction in memory usage, and 74% reduction in FLOPS. Our code and datasets will be made public to support further studies.
[124] Handling Missing Modalities in Multimodal Survival Prediction for Non-Small Cell Lung Cancer
Filippo Ruffini,Camillo Maria Caruso,Claudia Tacconi,Lorenzo Nibid,Francesca Miccolis,Marta Lovino,Carlo Greco,Edy Ippolito,Michele Fiore,Alessio Cortellini,Bruno Beomonte Zobel,Giuseppe Perrone,Bruno Vincenzi,Claudio Marrocco,Alessandro Bria,Elisa Ficarra,Sara Ramella,Valerio Guarrasi,Paolo Soda
Main category: cs.CV
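TL;DR: Proposes a missing-aware multimodal survival prediction framework that combines CT, whole-slide histopathology images, and clinical variables to model overall survival in unresectable stage II-III non-small cell lung cancer; it is robust to missing modalities and improves predictive performance through intermediate fusion.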
Details
Motivation: Existing deep learning methods for multimodal data are constrained by small sample sizes and missing modalities, and often have to drop incomplete samples or impute aggressively, which limits clinical use. A survival prediction model that can make effective use of incomplete multimodal data is therefore needed. Method: Foundation models extract modality-specific features, and a missing-aware encoding strategy enables intermediate multimodal fusion, so the model can be trained and run with naturally missing modalities without discarding patient data. Result: Intermediate fusion consistently outperforms unimodal baselines as well as early and late fusion, with the best result from fusing WSI and clinical variables (73.30 C-index); the model also adaptively down-weights less informative modalities such as CT. Conclusion: The framework integrates heterogeneous multimodal medical data effectively and maintains high performance when modalities are missing, improving the applicability of multimodal deep learning in real clinical settings. Abstract: Accurate survival prediction in Non-Small Cell Lung Cancer (NSCLC) requires the integration of heterogeneous clinical, radiological, and histopathological information. While Multimodal Deep Learning (MDL) offers promise for precision prognosis and survival prediction, its clinical applicability is severely limited by small cohort sizes and the presence of missing modalities, often forcing complete-case filtering or aggressive imputation. In this work, we present a missing-aware multimodal survival framework that integrates Computed Tomography (CT), Whole-Slide Histopathology Images (WSI), and structured clinical variables for overall survival modeling in unresectable stage II-III NSCLC. By leveraging Foundation Models (FM) for modality-specific feature extraction and a missing-aware encoding strategy, the proposed approach enables intermediate multimodal fusion under naturally incomplete modality profiles. The proposed architecture is resilient to missing modalities by design, allowing the model to utilize all available data without being forced to drop patients during training or inference. Experimental results demonstrate that intermediate fusion consistently outperforms unimodal baselines as well as early and late fusion strategies, with the strongest performance achieved by the fusion of WSI and clinical modalities (73.30 C-index).
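For readers unfamiliar with the metric, the C-index quoted above can be illustrated with a minimal sketch of Harrell's concordance index. The function name, the tie handling (0.5 credit for equal risks), and the toy cohort are assumptions for illustration; the abstract does not say which C-index variant the authors computed.

```python
def concordance_index(times, events, risks):
    """Harrell-style C-index: fraction of comparable pairs ranked correctly.

    times  -- observed follow-up times
    events -- 1 if the event (death) was observed, 0 if censored
    risks  -- predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (made-up numbers): the third patient is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
score = concordance_index(times, events, [0.9, 0.7, 0.3, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking; if the reported 73.30 is a percentage, it would correspond to 0.7330 on this scale.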
Further analyses of modality importance reveal an adaptive behavior in which less informative modalities, i.e., the CT modality, are automatically down-weighted and contribute less to the final survival prediction.
[125] Multi-Temporal Frames Projection for Dynamic Processes Fusion in Fluorescence Microscopy
Hassan Eshkiki,Sarah Costa,Mostafa Mohammadpour,Farinaz Tanhaei,Christopher H. George,Fabio Caraffini
Main category: cs.CV
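TL;DR: Proposes a new computational framework that fuses multi-temporal fluorescence microscopy frames into a single high-quality image, substantially improving cell counts and image quality.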
Details
Motivation: Fluorescence microscopy recordings are often affected by noise, temporal variability, and signal fluctuation, which limits their usefulness for analyzing living biological samples. Method: Several explainable computer vision techniques are combined to integrate information from multiple time-resolved frames into a single high-quality image while preserving the biological content of the original video. Result: Validated on a challenging dataset of dynamic, heterogeneous, and morphologically complex 2D monolayers of cardiac cells, the method raises the average cell count by 44% compared with previous methods. Conclusion: The framework can be applied to other imaging domains that need to fuse multi-temporal image stacks into high-quality 2D images, which helps annotation and downstream segmentation. Abstract: Fluorescence microscopy is widely employed for the analysis of living biological samples; however, the utility of the resulting recordings is frequently constrained by noise, temporal variability, and inconsistent visualisation of signals that oscillate over time. We present a unique computational framework that integrates information from multiple time-resolved frames into a single high-quality image, while preserving the underlying biological content of the original video. We evaluate the proposed method through an extensive number of configurations (n = 111) and on a challenging dataset comprising dynamic, heterogeneous, and morphologically complex 2D monolayers of cardiac cells. Results show that our framework, which consists of a combination of explainable techniques from different computer vision application fields, is capable of generating composite images that preserve and enhance the quality and information of individual microscopy frames, yielding a 44% average increase in cell count compared to previous methods. The proposed pipeline is applicable to other imaging domains that require the fusion of multi-temporal image stacks into high-quality 2D images, thereby facilitating annotation and downstream segmentation.
[126] Lunar-G2R: Geometry-to-Reflectance Learning for High-Fidelity Lunar BRDF Estimation
Clementine Grethen,Nicolas Menga,Roland Brochard,Geraldine Morin,Simone Gasparini,Jeremy Lebreton,Manuel Sanchez Gestido
Main category: cs.CV
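TL;DR: Proposes Lunar-G2R, a geometry-to-reflectance learning framework that predicts spatially varying lunar surface reflectance directly from terrain geometry, without multi-view imagery or dedicated hardware, substantially improving rendering realism and navigation accuracy.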
Details
Motivation: Existing lunar rendering pipelines rely on simplified or spatially uniform BRDF models whose parameters are hard to estimate and that cannot capture local reflectance variations, which limits photometric realism. Method: The Lunar-G2R framework combines a U-Net with differentiable rendering to predict spatially varying BRDF parameters directly from a lunar digital elevation model (DEM), minimizing the photometric difference to real orbital images under known viewing and illumination conditions. Result: On a geographically held-out region of the Tycho crater, photometric error drops by 38% relative to a state-of-the-art baseline, with higher PSNR and SSIM and better perceptual similarity, and fine-scale reflectance variations are captured. Conclusion: This is the first method to infer a spatially varying reflectance model from terrain geometry alone, opening a new route to high-fidelity rendering and vision-based navigation. Abstract: We address the problem of estimating realistic, spatially varying reflectance for complex planetary surfaces such as the lunar regolith, which is critical for high-fidelity rendering and vision-based navigation. Existing lunar rendering pipelines rely on simplified or spatially uniform BRDF models whose parameters are difficult to estimate and fail to capture local reflectance variations, limiting photometric realism. We propose Lunar-G2R, a geometry-to-reflectance learning framework that predicts spatially varying BRDF parameters directly from a lunar digital elevation model (DEM), without requiring multi-view imagery, controlled illumination, or dedicated reflectance-capture hardware at inference time. The method leverages a U-Net trained with differentiable rendering to minimize photometric discrepancies between real orbital images and physically based renderings under known viewing and illumination geometry. Experiments on a geographically held-out region of the Tycho crater show that our approach reduces photometric error by 38% compared to a state-of-the-art baseline, while achieving higher PSNR and SSIM and improved perceptual similarity, capturing fine-scale reflectance variations absent from spatially uniform models. To our knowledge, this is the first method to infer a spatially varying reflectance model directly from terrain geometry.
[127] Urban Socio-Semantic Segmentation with Vision-Language Reasoning
Yu Wang,Yi Wang,Rui Dai,Yujie Wang,Kaikui Liu,Xiangxiang Chu,Yansheng Li
Main category: cs.CV
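TL;DR: Proposes an urban socio-semantic segmentation method based on vision-language model reasoning, introducing the SocioSeg dataset and the SocioReasoner framework; reinforcement learning optimizes cross-modal recognition and multi-stage reasoning, yielding accurate segmentation of socially defined categories such as schools and parks and strong zero-shot generalization.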
Details
Motivation: Existing semantic segmentation models perform well on entities defined by physical attributes (e.g., buildings, water bodies) but struggle to recognize socially defined semantic categories (e.g., schools, parks), and there is a lack of datasets and methods that bring social semantics into pixel-level segmentation. Method: The authors build SocioSeg, a socio-semantic segmentation dataset with satellite imagery, digital maps, and hierarchical pixel-level labels, and propose the SocioReasoner framework, which uses a vision-language model for cross-modal recognition and multi-stage reasoning to simulate how humans identify social semantic entities, with reinforcement learning optimizing the non-differentiable reasoning process. Result: Experiments show that the method outperforms existing state-of-the-art models on socio-semantic segmentation and generalizes strongly in the zero-shot setting. Conclusion: A socio-semantically annotated dataset and a vision-language reasoning framework together improve the segmentation of socially defined categories in urban remote sensing imagery and offer a new approach for smart city analysis. Abstract: As hubs of human activity, urban surfaces consist of a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks). In this work, we achieve socio-semantic segmentation by vision-language model reasoning. To facilitate this, we introduce the Urban Socio-Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose a novel vision-language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic entities via cross-modal recognition and multi-stage reasoning. We employ reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model. Experiments demonstrate our approach's gains over state-of-the-art models and strong zero-shot generalization. Our dataset and code are available at https://github.com/AMAP-ML/SocioReasoner.
[128] mergetune: Continued fine-tuning of vision-language models
Wenqing Wang,Da Li,Xiatian Zhu,Josef Kittler
Main category: cs.CV
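TL;DR: This paper proposes MERGETUNE, a new continued fine-tuning (CFT) paradigm that recovers pretrained knowledge after a vision-language model has been fine-tuned; guided by linear mode connectivity (LMC), it needs neither architectural changes nor large-scale data replay and substantially improves generalization.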
Details
Motivation: Fine-tuning vision-language models often causes catastrophic forgetting, which existing methods cannot fully avoid, so a new method is needed that recovers pretrained knowledge after fine-tuning. Method: MERGETUNE uses linear mode connectivity (LMC) to continue optimizing the trainable parameters of a fine-tuned model, searching for a model with low-loss paths to both the zero-shot and the fine-tuned solutions, and approximates the constraint with a second-order surrogate to avoid replaying pretraining data. Result: MERGETUNE improves the harmonic mean of base-novel generalization over CoOp by 5.6% without adding parameters; on robustness evaluations it surpasses ensemble baselines at lower inference cost and reaches state-of-the-art results when ensembled with the zero-shot model. Conclusion: MERGETUNE is an effective and general post-adaptation strategy that recovers forgotten pretrained knowledge without changing the model architecture and substantially improves generalization and robustness. Abstract: Fine-tuning vision-language models (VLMs) such as CLIP often leads to catastrophic forgetting of pretrained knowledge. Prior work primarily aims to mitigate forgetting during adaptation; however, forgetting often remains inevitable during this process. We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted. We propose a simple, model-agnostic CFT strategy (named MERGETUNE) guided by linear mode connectivity (LMC), which can be applied post hoc to existing fine-tuned models without requiring architectural changes. Given a fine-tuned model, we continue fine-tuning its trainable parameters (e.g., soft prompts or linear heads) to search for a continued model which has two low-loss paths to the zero-shot (e.g., CLIP) and the fine-tuned (e.g., CoOp) solutions. By exploiting the geometry of the loss landscape, the continued model implicitly merges the two solutions, restoring pretrained knowledge lost in the fine-tuned counterpart. A challenge is that the vanilla LMC constraint requires data replay from the pretraining task. We approximate this constraint for the zero-shot model via a second-order surrogate, eliminating the need for large-scale data replay. Experiments show that MERGETUNE improves the harmonic mean of CoOp by +5.6% on base-novel generalisation without adding parameters.
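Two quantities in this abstract lend themselves to a small illustration: the harmonic mean used to summarize base and novel accuracy, and the notion of a low-loss linear path between two solutions. The sketch below is an assumption-laden toy, not the authors' procedure: `harmonic_mean`, `loss_barrier`, and the double-well `toy_loss` are hypothetical names, and the barrier definition (excess loss over the interpolated endpoint losses) is one common convention among several.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2.0 * base_acc * novel_acc / (base_acc + novel_acc)


def loss_barrier(loss_fn, theta_a, theta_b, steps=11):
    """Largest excess loss on the straight line between two parameter vectors,
    measured against the linear interpolation of the two endpoint losses."""
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    worst = 0.0
    for k in range(steps):
        alpha = k / (steps - 1)
        theta = [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]
        baseline = (1 - alpha) * loss_a + alpha * loss_b
        worst = max(worst, loss_fn(theta) - baseline)
    return worst


def toy_loss(theta):
    # Hypothetical double-well loss with minima at theta[0] = +1 and -1.
    return (theta[0] ** 2 - 1) ** 2 + theta[1] ** 2


h = harmonic_mean(80.0, 60.0)                                     # made-up accuracies
barrier_far = loss_barrier(toy_loss, [1.0, 0.0], [-1.0, 0.0])     # crosses the ridge
barrier_near = loss_barrier(toy_loss, [1.0, 0.0], [1.0, 0.5])     # stays in one basin
```

A barrier near zero means the two solutions are linearly connected in the sense used above; a large barrier means the straight path leaves the low-loss region. The harmonic mean penalizes a model that gains base accuracy at the expense of novel-class accuracy.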
On robust fine-tuning evaluations, the LMC-merged model from MERGETUNE surpasses ensemble baselines with lower inference cost, achieving further gains and state-of-the-art results when ensembled with the zero-shot model. Our code is available at https://github.com/Surrey-UP-Lab/MERGETUNE.
[129] SatMap: Revisiting Satellite Maps as Prior for Online HD Map Construction
Kanak Mazumder,Fabian B. Flohr
Main category: cs.CV
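TL;DR: This paper proposes SatMap, an online vectorized HD map estimation method that combines satellite maps with multi-view camera observations to improve map construction accuracy for autonomous driving.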
Details
Motivation: Existing onboard camera-based methods suffer from limited depth perception and accuracy loss under occlusion, so a more robust map construction approach is needed. Method: SatMap uses the lane-level semantics and texture of bird's-eye-view satellite imagery as a global prior, fuses it with multi-view camera data, and directly predicts a vectorized HD map. Result: On the nuScenes dataset, SatMap improves mAP by 34.8% over the camera-only baseline and by 8.5% over the camera-LiDAR fusion baseline, and its advantage is confirmed in long-range and adverse weather conditions. Conclusion: By introducing a satellite map prior, SatMap mitigates depth ambiguity and occlusion and substantially improves online HD map construction. Abstract: Online high-definition (HD) map construction is an essential part of a safe and robust end-to-end autonomous driving (AD) pipeline. Onboard camera-based approaches suffer from limited depth perception and degraded accuracy due to occlusion. In this work, we propose SatMap, an online vectorized HD map estimation method that integrates satellite maps with multi-view camera observations and directly predicts a vectorized HD map for downstream prediction and planning modules. Our method leverages lane-level semantics and texture from satellite imagery captured from a Bird's Eye View (BEV) perspective as a global prior, effectively mitigating depth ambiguity and occlusion. In our experiments on the nuScenes dataset, SatMap achieves a 34.8% mAP improvement over the camera-only baseline and an 8.5% mAP improvement over the camera-LiDAR fusion baseline. Moreover, we evaluate our model in long-range and adverse weather conditions to demonstrate the advantages of using a satellite prior map. Source code will be available at https://iv.ee.hm.edu/satmap/.
[130] BikeActions: An Open Platform and Benchmark for Cyclist-Centric VRU Action Recognition
Max A. Buettner,Kanak Mazumder,Luca Koecher,Mario Finkbeiner,Sebastian Niebler,Fabian B. Flohr
Main category: cs.CV
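TL;DR: This paper introduces FUSE-Bike, the first open perception platform built around the cyclist's viewpoint, and BikeActions, a multi-modal dataset for improving behavior modeling of vulnerable road users (VRUs). Data collected with the platform is used to establish a benchmark with graph convolution and Transformer models, advancing research on VRU intention prediction for autonomous driving.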
Details
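Motivation: Current autonomous driving research focuses mainly on pedestrian crossing behavior seen from the vehicle, while the interactions of vulnerable road users such as cyclists in complex shared spaces remain underexplored. Closing this gap requires high-quality close-range data captured from the cyclist's own viewpoint to better model behavioral intent. Method: The authors develop FUSE-Bike, an open perception platform equipped with two LiDARs, a camera, and GNSS that captures high-fidelity data from the cyclist's viewpoint; on this basis they build BikeActions, a multi-modal dataset of 852 samples across 5 action classes, publish the data splits, and evaluate state-of-the-art graph convolution and Transformer models to establish the first performance benchmark. Result: The BikeActions dataset is released together with the full hardware design, data annotation tools, and benchmark code; the experiments establish the first performance baselines for this task and confirm the usefulness of the data for understanding VRU behavior. Conclusion: The FUSE-Bike platform and the BikeActions dataset are an important resource for studying VRU behavior from the cyclist's viewpoint, and the benchmark helps autonomous driving systems understand and predict the behavior of non-motorized road users more safely in complex traffic. Abstract: Anticipating the intentions of Vulnerable Road Users (VRUs) is a critical challenge for safe autonomous driving (AD) and mobile robotics. While current research predominantly focuses on pedestrian crossing behaviors from a vehicle's perspective, interactions within dense shared spaces remain underexplored. To bridge this gap, we introduce FUSE-Bike, the first fully open perception platform of its kind. Equipped with two LiDARs, a camera, and GNSS, it facilitates high-fidelity, close-range data capture directly from a cyclist's viewpoint. Leveraging this platform, we present BikeActions, a novel multi-modal dataset comprising 852 annotated samples across 5 distinct action classes, specifically tailored to improve VRU behavior modeling. We establish a rigorous benchmark by evaluating state-of-the-art graph convolution and transformer-based models on our publicly released data splits, establishing the first performance baselines for this challenging task. We release the full dataset together with data curation tools, the open hardware design, and the benchmark code to foster future research in VRU action understanding at https://iv.ee.hm.edu/bikeactions/.
[131] SVII-3D: Advancing Roadside Infrastructure Inventory with Decimeter-level 3D Localization and Comprehension from Sparse Street Imagery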
Chong Liu,Luxuan Fu,Yang Jia,Zhen Dong,Bisheng Yang
Main category: cs.CV
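TL;DR: SVII-3D is a unified framework for high-fidelity infrastructure digitization that fuses open-set detection, geometry-guided refinement, and a vision-language model to deliver robust asset identification, precise 3D localization, and fine-grained state diagnosis from sparse imagery.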
Details
Motivation: Existing methods for building digital twins from sparse imagery fall short in robustness, localization accuracy, and fine-grained state understanding, and cannot meet the needs of smart cities and facility lifecycle management. Method: The SVII-3D framework (1) combines LoRA fine-tuned open-set detection with a spatial-attention matching network to robustly associate observations across sparse views; (2) introduces a geometry-guided refinement mechanism that corrects structural errors and achieves decimeter-level localization; and (3) integrates a vision-language model agent driven by multi-modal prompting to automatically diagnose fine-grained operational states. Result: Experiments show that SVII-3D significantly improves asset identification accuracy and reduces localization error, achieving high-precision 3D asset digitization and state awareness from sparse imagery. Conclusion: SVII-3D offers a scalable, low-cost solution for automated digital twin construction of infrastructure, bridging the gap between sparse perception and intelligent maintenance. Abstract: The automated creation of digital twins and precise asset inventories is a critical task in smart city construction and facility lifecycle management. However, utilizing cost-effective sparse imagery remains challenging due to limited robustness, inaccurate localization, and a lack of fine-grained state understanding. To address these limitations, SVII-3D, a unified framework for holistic asset digitization, is proposed. First, LoRA fine-tuned open-set detection is fused with a spatial-attention matching network to robustly associate observations across sparse views. Second, a geometry-guided refinement mechanism is introduced to resolve structural errors, achieving precise decimeter-level 3D localization. Third, transcending static geometric mapping, a Vision-Language Model agent leveraging multi-modal prompting is incorporated to automatically diagnose fine-grained operational states. Experiments demonstrate that SVII-3D significantly improves identification accuracy and minimizes localization errors. Consequently, this framework offers a scalable, cost-effective solution for high-fidelity infrastructure digitization, effectively bridging the gap between sparse perception and automated intelligent maintenance.
[132] Enhancing the quality of gauge images captured in smoke and haze scenes through deep learning
Oscar H. Ramírez-Agudelo,Akshay N. Shewatkar,Edoardo Milana,Roland C. Aydin,Kai Franke
Main category: cs.CV
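TL;DR: This study uses two deep learning models, FFA-Net and AECR-Net, to improve the readability of analog gauge images captured in smoke and haze; a synthetic dataset of more than 14,000 images was generated for training, AECR-Net performed better on dehazing, and the enhanced images can be used for automatic gauge reading.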
Details
Motivation: Smoke and haze reduce image visibility, which hampers infrastructure monitoring and emergency response, and no public gauge image dataset exists; methods are therefore needed that improve image quality in such conditions to support automatic reading. Method: Two deep learning architectures, FFA-Net and AECR-Net, are applied; a synthetic dataset of more than 14,000 analog gauge images with artificial haze and smoke is generated with Unreal Engine and split 80% / 10% / 10% into training, validation, and test sets. Result: On the synthetic haze dataset, SSIM reaches 0.98 and PSNR about 43 dB, close to the state of the art, with AECR-Net outperforming FFA-Net; results on the smoke dataset are weaker because smoke is inhomogeneous and dense and the models were designed for dehazing rather than desmoking. Conclusion: Deep learning models can substantially improve the quality of analog gauge images captured in smoke and haze, and the enhanced images can be used for subsequent automatic reading, with potential for emergency response applications. Abstract: Images captured in hazy and smoky environments suffer from reduced visibility, posing a challenge when monitoring infrastructures and hindering emergency services during critical situations. The proposed work investigates the use of deep learning models to enhance the automatic, machine-based readability of gauges in smoky environments, with accurate gauge data interpretation serving as a valuable tool for first responders. The study utilizes two deep learning architectures, FFA-Net and AECR-Net, to improve the visibility of gauge images corrupted by light to dense haze and smoke. Since benchmark datasets of analog gauge images are unavailable, a new synthetic dataset, containing over 14,000 images, was generated using the Unreal Engine. The models were trained with an 80% train, 10% validation, and 10% test split for the haze and smoke datasets. For the synthetic haze dataset, the SSIM and PSNR metrics are about 0.98 and 43 dB, respectively, comparing well to state-of-the-art results. Additionally, more robust results are obtained from AECR-Net than from FFA-Net. Although the results from the synthetic smoke dataset are poorer, the trained models achieve interesting results. In general, images captured in the presence of smoke are more difficult to enhance given the inhomogeneity and high density of smoke. In addition, FFA-Net and AECR-Net are designed to dehaze, not to desmoke, images. This work shows that the use of deep learning architectures can improve the quality of analog gauge images captured in smoke and haze scenes immensely.
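To put the reported PSNR in context, here is a minimal sketch of the standard PSNR formula, 10·log10(MAX² / MSE). The function name, the nested-list image representation, and the 8-bit peak value of 255 are assumptions for illustration; the abstract does not state the bit depth or the implementation the authors used.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images,
    each given as a nested list of pixel intensities."""
    ref = [p for row in reference for p in row]
    out = [p for row in restored for p in row]
    if len(ref) != len(out):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Tiny made-up example: every pixel is off by one gray level, so MSE = 1.
clean = [[10, 20], [30, 40]]
dehazed = [[11, 19], [31, 39]]
value_db = psnr(clean, dehazed)
```

Under the 8-bit assumption, a PSNR of about 43 dB corresponds to a root-mean-square error of roughly 1.8 gray levels between the enhanced image and its clean reference.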
Finally, the enhanced output images can be successfully post-processed for automatic reading of gauges.
[133] Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure
Luxuan Fu,Chong Liu,Bisheng Yang,Zhen Dong
Main category: cs.CV
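TL;DR: Proposes a domain-adapted framework that turns large vision-language models (VLMs) into specialized agents for intelligent infrastructure analysis, combining data-efficient fine-tuning with knowledge-grounded reasoning to reach 58.9 mAP and 95.5% accuracy on detection and attribute recognition of urban roadside facilities.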
Details
Motivation: General-purpose models struggle to capture the fine-grained attributes and domain rules required for urban roadside infrastructure, and existing VLMs follow engineering standards poorly, which makes them unreliable in practice. Method: Open-vocabulary fine-tuning of Grounding DINO localizes assets, LoRA fine-tuning of Qwen-VL provides semantic attribute reasoning, and a dual-modality Retrieval-Augmented Generation (RAG) module dynamically retrieves industry standards and visual exemplars to reduce hallucinations and ensure compliance. Result: On a newly built dataset of urban roadside scenes, detection reaches 58.9 mAP and attribute recognition accuracy reaches 95.5%. Conclusion: The framework improves the accuracy and compliance of VLMs for professional infrastructure analysis and provides a reliable technical solution for smart city management. Abstract: Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often struggle to capture the necessary fine-grained attributes and domain rules. While Large Vision Language Models (VLMs) excel at open-world recognition, they often struggle to accurately interpret complex facility states in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.
[134] Inference-time Physics Alignment of Video Generative Models with Latent World Models
Jianhao Yuan,Xiaofeng Zhang,Felix Friedrich,Nicolas Beltran-Velez,Melissa Hall,Reyhane Askari-Hemmat,Xiaochuang Han,Nicolas Ballas,Michal Drozdzal,Adriana Romero-Soriano
Main category: cs.CV
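TL;DR: This paper proposes WMReward, which uses a latent world model as a reward to steer the inference process of video generation, substantially improving the physics plausibility of generated content.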
Details
Motivation: Existing video generation models produce promising visuals but often violate basic physical laws, which limits their use. The authors argue that this stems not only from training data but also from the inference strategy. Method: The WMReward framework uses a pretrained latent world model (such as VJEPA-2) as a physics prior at inference time and applies it as a reward to search over and steer multiple denoising trajectories, improving the physics plausibility of the generated video. Result: Physics plausibility improves substantially across several generation settings (image-conditioned, multiframe-conditioned, and text-conditioned), and the method won first place in the ICCV 2025 PhysicsIQ Challenge with a score of 62.64%, surpassing the previous state of the art by 7.42%. Conclusion: Inference-time alignment with a latent world model is an effective and broadly applicable way to improve the physics plausibility of video generation. Abstract: State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve physics plausibility of video generation, beyond this specific instantiation or parameterization.
[135] DeepUrban: Interaction-Aware Trajectory Prediction and Planning for Automated Driving by Aerial Imagery
Constantin Selzer,Fabian B. Flohr
Main category: cs.CV
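TL;DR: DeepUrban is a new drone dataset focused on trajectory prediction and planning in dense urban environments; 3D traffic objects are extracted from high-resolution imagery and accompanied by rich map and scene information. Experiments show that combining DeepUrban with nuScenes substantially improves the accuracy of vehicle prediction and planning.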
Details
Motivation: Existing autonomous driving benchmarks contain few dense traffic scenarios, which limits the understanding and modeling of complex interactions among road users. Method: Together with the industrial partner DeepScenario, the authors build DeepUrban, a new drone dataset of high-resolution images captured over urban intersections from about 100 meters altitude, from which 3D traffic objects are extracted; detailed map and scene information is provided to strengthen the training and evaluation of trajectory prediction and planning models. Result: DeepUrban is validated with existing SOTA prediction and planning methods; adding it to nuScenes training improves vehicle trajectory prediction by up to 44.1% / 44.3% on the ADE / FDE metrics and strengthens generalization. Conclusion: DeepUrban fills the gap in dense urban traffic data and substantially improves prediction and planning for autonomous driving in complex environments, making it valuable for both research and applications. Abstract: The efficacy of autonomous driving systems hinges critically on robust prediction and planning capabilities. However, current benchmarks are impeded by a notable scarcity of scenarios featuring dense traffic, which is essential for understanding and modeling complex interactions among road users. To address this gap, we collaborated with our industrial partner, DeepScenario, to develop DeepUrban, a new drone dataset designed to enhance trajectory prediction and planning benchmarks focusing on dense urban settings. DeepUrban provides a rich collection of 3D traffic objects, extracted from high-resolution images captured over urban intersections at approximately 100 meters altitude. The dataset is further enriched with comprehensive map and scene information to support advanced modeling and simulation tasks. We evaluate state-of-the-art (SOTA) prediction and planning methods, and conduct experiments on generalization capabilities. Our findings demonstrate that adding DeepUrban to nuScenes can boost the accuracy of vehicle predictions and planning, achieving improvements up to 44.1% / 44.3% on the ADE / FDE metrics. Website: https://iv.ee.hm.edu/deepurban
[136] Jordan-Segmentable Masks: A Topology-Aware definition for characterizing Binary Image Segmentation
Serena Grazia De Benedictis,Amedeo Altavilla,Nicoletta Del Buono
Main category: cs.CV
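TL;DR: Proposes a topology-aware evaluation method for image segmentation based on the Jordan Curve Theorem and digital topology, using Betti numbers to verify the structural coherence of a segmentation mask and to ensure that it partitions the image into two connected regions.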
Details
Motivation: Conventional segmentation metrics struggle to capture the structural and topological consistency of a segmentation. In applications such as medical imaging, small boundary errors or fragmented predictions can yield high scores for segmentations that are not sensible. Method: Building on the Jordan Curve Theorem and the topology of the digital plane, the authors define the notion of a "Jordan-segmentable mask", extract a 4-curve candidate from the mask, and verify its topological validity with Betti numbers from homology theory (β₀ = β₁ = 1). Result: An unsupervised, mathematically rigorous criterion for the structural coherence of segmentations that can identify results which score well but are topologically wrong. Conclusion: Combining digital Jordan theory with homological invariants offers a new perspective on segmentation evaluation, especially for applications where topological correctness must be preserved. Abstract: Image segmentation plays a central role in computer vision. However, widely used evaluation metrics, whether pixel-wise, region-based, or boundary-focused, often struggle to capture the structural and topological coherence of a segmentation. In many practical scenarios, such as medical imaging or object delineation, small inaccuracies in boundary, holes, or fragmented predictions can result in high metric scores, despite the fact that the resulting masks fail to preserve the object's global shape or connectivity. This highlights a limitation of conventional metrics: they are unable to assess whether a predicted segmentation partitions the image into meaningful interior and exterior regions. In this work, we introduce a topology-aware notion of segmentation based on the Jordan Curve Theorem, and adapted for use in digital planes. We define the concept of a \emph{Jordan-segmentable mask}, which is a binary segmentation whose structure ensures a topological separation of the image domain into two connected components. We analyze segmentation masks through the lens of digital topology and homology theory, extracting a $4$-curve candidate from the mask, verifying its topological validity using Betti numbers. A mask is considered Jordan-segmentable when this candidate forms a digital 4-curve with $β_0 = β_1 = 1$, or equivalently when its complement splits into exactly two $8$-connected components. This framework provides a mathematically rigorous, unsupervised criterion with which to assess the structural coherence of segmentation masks. 
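The complement condition stated in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the 4-curve candidate has already been extracted and is given as a 0/1 grid (1 = curve pixel), and it checks only the "exactly two 8-connected components" condition, not the Betti numbers. The function names and the two example grids are hypothetical.

```python
from collections import deque

# Offsets of the 8 neighbours of a pixel (8-connectivity).
NEIGHBORS_8 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]


def count_background_components(curve):
    """Count the 8-connected components formed by the zero pixels of a 0/1 grid."""
    rows, cols = len(curve), len(curve[0])
    seen = [[False] * cols for _ in range(rows)]
    components = 0
    for r in range(rows):
        for c in range(cols):
            if curve[r][c] != 0 or seen[r][c]:
                continue
            # New background component: flood-fill it with a breadth-first search.
            components += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr, dc in NEIGHBORS_8:
                    nr, nc = cr + dr, cc + dc
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and curve[nr][nc] == 0 and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
    return components


def complement_has_two_components(curve):
    """True when the complement of the curve splits into exactly two components."""
    return count_background_components(curve) == 2


# A closed 4-connected ring: its complement is the outside plus one interior pixel.
RING = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

# The same ring with one pixel removed: interior and exterior merge.
BROKEN_RING = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
```

In this sketch the closed ring passes the check while the ring with a one-pixel gap does not, which is the kind of small boundary defect that pixel-wise metrics tend to overlook.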
By combining digital Jordan theory and homological invariants, our approach provides a valuable alternative to standard evaluation metrics, especially in applications where topological correctness must be preserved.
[137] Adversarial Evasion Attacks on Computer Vision using SHAP Values
Frank Mollard,Marcus Becker,Florian Roehrbein
Main category: cs.CV
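TL;DR: This paper proposes a white-box attack on computer vision models based on SHAP values. It generates adversarial examples effectively and is more robust than FGSM, particularly in gradient hiding scenarios.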
Details
Motivation: To explore how vulnerable deep learning models are to adversarial attacks built on explainability methods, in particular more covert and effective attacks that exploit SHAP values. Method: SHAP values are used to quantify the importance of input features to the model output, adversarial examples are generated accordingly, and the attack is compared with FGSM. Result: The SHAP attack is better at inducing misclassifications and is more robust than FGSM, especially under gradient hiding. Conclusion: The SHAP-based attack is an effective white-box adversarial attack. It shows that explainability methods can be misused and points to the need for stronger security defenses around explainable AI. Abstract: The paper introduces a white-box attack on computer vision models using SHAP values. It demonstrates how adversarial evasion attacks can compromise the performance of deep learning models by reducing output confidence or inducing misclassifications. Such attacks are particularly insidious as they can deceive the perception of an algorithm while eluding human perception due to their imperceptibility to the human eye. The proposed attack leverages SHAP values to quantify the significance of individual inputs to the output at the inference stage. A comparison is drawn between the SHAP attack and the well-known Fast Gradient Sign Method. We find evidence that SHAP attacks are more robust in generating misclassifications, particularly in gradient hiding scenarios.
[138] Action100M: A Large-scale Video Action Dataset
Delong Chen,Tejaswi Kasarla,Yejin Bang,Mustafa Shukor,Willy Chung,Jade Yu,Allen Bolourchi,Theo Moutakanni,Pascale Fung
Main category: cs.CV
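TL;DR: Introduces Action100M, a large-scale, open-vocabulary video action dataset with roughly 100 million temporally annotated action segments extracted from 1.2 million instructional videos. Structured annotations are produced by a fully automated pipeline, with the aim of advancing research in video understanding and world modeling.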
Details
Motivation: Advancing machine intelligence in the physical world requires large-scale, open-vocabulary datasets for inferring physical actions from visual observations; existing datasets fall short in scale and diversity. Method: A fully automated pipeline performs hierarchical temporal segmentation with V-JEPA 2 embeddings, generates multi-level frame and segment captions (a Tree-of-Captions), and aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured action annotations. Result: The resulting Action100M dataset covers 1.2 million instructional videos (about 14.6 years of footage) and roughly 100 million temporally annotated action segments with rich open-vocabulary action supervision and multi-level captions. On several action recognition benchmarks, a VL-JEPA model shows consistent data-scaling gains and strong zero-shot performance. Conclusion: Action100M provides a scalable data foundation for video understanding and world modeling, and demonstrates that automatically generating video annotations at scale is feasible and effective. Abstract: Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world. Achieving this requires large-scale, open-vocabulary video action datasets that span broad domains. We introduce Action100M, a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration), yielding O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions. Action100M is generated by a fully automated pipeline that (i) performs hierarchical temporal segmentation using V-JEPA 2 embeddings, (ii) produces multi-level frame and segment captions organized as a Tree-of-Captions, and (iii) aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (brief/detailed action, actor, brief/detailed caption). Training VL-JEPA on Action100M demonstrates consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks, establishing Action100M as a new foundation for scalable research in video understanding and world modeling.
[139] RSATalker: Realistic Socially-Aware Talking Head Generation for Multi-Turn Conversation
Peng Chen,Xiaobao Wei,Yi Yang,Naiming Yao,Hui Chen,Feng Tian
Main category: cs.CV
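TL;DR: This paper proposes RSATalker, the first socially-aware talking head generation framework based on 3D Gaussian Splatting (3DGS). It supports multi-turn conversation, balances realism with efficiency, and models complex social relationships through a learnable query mechanism.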
Details
Motivation: Existing talking head generation methods fall short in realism, computational efficiency, or modeling of social relationships: mesh-based 3D methods lack realistic textures, large-model-based 2D methods are computationally expensive, and current 3DGS methods handle only a single speaker and ignore social interaction. A dual-person talking head generation approach that is efficient, realistic, and able to express social relationships is therefore needed. Method: The method first drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to the mesh facets to render high-fidelity 2D video. A socially-aware module encodes social relationships (blood and non-blood, equal and unequal) into high-level embeddings through a learnable query mechanism. The authors design a three-stage training paradigm and construct the RSATalker dataset of speech-mesh-image triplets annotated with social relationships. Result: Experiments show that RSATalker reaches SOTA in both visual realism and social awareness, clearly outperforming existing methods while keeping rendering efficient. Conclusion: RSATalker achieves efficient, realistic, and socially-aware dual-person multi-turn talking head generation, making social interaction in virtual reality more realistic and intelligent. Abstract: Talking head generation is increasingly important in virtual reality (VR), especially for social scenarios involving multi-turn conversation. Existing approaches face notable limitations: mesh-based 3D methods can model dual-person dialogue but lack realistic textures, while large-model-based 2D methods produce natural appearances but incur prohibitive computational costs. Recently, 3D Gaussian Splatting (3DGS) based methods achieve efficient and realistic rendering but remain speaker-only and ignore social relationships. We introduce RSATalker, the first framework that leverages 3DGS for realistic and socially-aware talking head generation with support for multi-turn conversation. Our method first drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to mesh facets to render high-fidelity 2D avatar videos. To capture interpersonal dynamics, we propose a socially-aware module that encodes social relationships, including blood and non-blood as well as equal and unequal, into high-level embeddings through a learnable query mechanism. We design a three-stage training paradigm and construct the RSATalker dataset with speech-mesh-image triplets annotated with social relationships. Extensive experiments demonstrate that RSATalker achieves state-of-the-art performance in both realism and social awareness. The code and dataset will be released.
[140] Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Christopher Clark,Jieyu Zhang,Zixian Ma,Jae Sung Park,Mohammadreza Salehi,Rohun Tripathi,Sangho Lee,Zhongzheng Ren,Chris Dongjoo Kim,Yinuo Yang,Vincent Shao,Yue Yang,Weikai Huang,Ziqi Gao,Taira Anderson,Jianrui Zhang,Jitesh Jain,George Stoica,Winson Han,Ali Farhadi,Ranjay Krishna
Main category: cs.CV
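TL;DR: Molmo2 is a family of open video-language models. By introducing 7 new video datasets and 2 multi-image datasets, it reaches state-of-the-art performance among open-source models and performs strongly on point-driven grounding tasks.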
Details
Motivation: Today's strongest video-language models are mostly proprietary, and open models either rely on synthetic data or do not disclose their training details, which holds back the community. Existing models also lack pixel-level grounding capability. Method: The authors build new high-quality video and multi-image datasets without using closed models to generate data, train with an efficient packing and message-tree encoding scheme, and introduce bi-directional attention on vision tokens together with a novel token-weighting strategy. Result: Molmo2 outperforms comparable open models on short-video understanding, counting, and captioning, and is competitive on long-video tasks. On video grounding it significantly surpasses open models such as Qwen3-VL and beats proprietary models such as Gemini 3 Pro on some tasks (for example, 38.4 vs 20.0 F1 on video pointing). Conclusion: Molmo2 gives the open-source community a reproducible, high-performing foundation for video-language modeling and advances fine-grained visual understanding and pixel-level grounding across images and video. Abstract: Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding -- either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show bi-directional attention on vision tokens and a novel token-weight strategy improves performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long videos. 
On video grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
[141] CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos
Chengfeng Zhao,Jiazhi Shu,Yubo Zhao,Tianyu Huang,Jiahao Lu,Zekai Gu,Chengwei Ren,Zhiyang Dou,Qing Shuai,Yuan Liu
Main category: cs.CV
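TL;DR: This paper proposes CoMoVi, a diffusion framework that couples 2D video generation with 3D human motion generation, producing both synchronously through a shared representation and cross attention. It also introduces the CoMoVi Dataset, a large-scale dataset with text and motion annotations.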
Details
Motivation: 3D human motion generation and 2D video generation are intrinsically linked: 3D motion supplies a structural prior that keeps videos plausible, while pre-trained video models help motion generation generalize, so the two should be modeled jointly. Method: CoMoVi uses a dual-branch diffusion model to generate 3D motion and 2D video synchronously within a single denoising loop. A new 2D motion representation is designed to inherit the prior of pre-trained video diffusion models (VDMs), and the branches are coupled through mutual feature interaction and 3D-2D cross attention. Result: The method performs strongly on both 3D motion generation and 2D video generation, supporting its effectiveness. The curated CoMoVi Dataset contains diverse real-world human motions with text and motion annotations. Conclusion: Jointly generating 3D motion and 2D video benefits both tasks. CoMoVi offers a new direction for multimodal human generation with good generation quality and consistency. Abstract: In this paper, we find that the generation of 3D human motions and 2D human videos is intrinsically coupled. 3D motions provide the structural prior for plausibility and consistency in videos, while pre-trained video models offer strong generalization capabilities for motions, which necessitates coupling their generation processes. Based on this, we present CoMoVi, a co-generative framework that couples two video diffusion models (VDMs) to generate 3D human motions and videos synchronously within a single diffusion denoising loop. To achieve this, we first propose an effective 2D human motion representation that can inherit the powerful prior of pre-trained VDMs. Then, we design a dual-branch diffusion model to couple human motion and video generation process with mutual feature interaction and 3D-2D cross attentions. Moreover, we curate CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations, covering diverse and challenging human motions. Extensive experiments demonstrate the effectiveness of our method in both 3D human motion and video generation tasks.
[142] CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
Darshan Singh,Arsha Nagrani,Kawshik Manikantan,Harman Singh,Dinesh Tewari,Tobias Weyand,Cordelia Schmid,Anelia Angelova,Shachi Dave
Main category: cs.CV
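TL;DR: This paper introduces CURVE, a benchmark for multicultural and multilingual video reasoning with high-quality human annotations from 18 locales. It emphasizes deep understanding of visual cultural context and exposes the weaknesses of current state-of-the-art video LLMs in cross-cultural understanding.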
Details
Motivation: Existing video understanding benchmarks focus mainly on Western culture and English-language content, which biases evaluation and leaves model reasoning in diverse global cultural contexts largely unassessed. Method: The authors build CURVE, a new benchmark with fully human-annotated questions, answers, and multi-step reasoning chains written in native languages for 18 locales. They use the reasoning traces to construct evidence graphs and propose a graph-based iterative strategy for fine-grained analysis of model errors. Result: Experiments show that current state-of-the-art video LLMs perform far below human level on CURVE, with errors stemming mainly from poor understanding of culturally relevant visual elements. Conclusion: CURVE sets a new standard for evaluating the cross-cultural understanding of video models, exposes marked deficiencies in culturally situated reasoning, and encourages future work on truly diverse video understanding. Abstract: Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce CURVE (Cultural Understanding and Reasoning in Video Evaluation), a challenging benchmark for multicultural and multilingual video reasoning. CURVE comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, CURVE provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on CURVE requires a deeply situated understanding of visual cultural context. Furthermore, we leverage CURVE's reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. CURVE will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file\#minerva-cultural
[143] A continental-scale dataset of ground beetles with high-resolution images and validated morphological trait measurements
S M Rayeed,Mridul Khurana,Alyson East,Isadora E. Fluck,Elizabeth G. Campolongo,Samuel Stevens,Iuliia Zarubiieva,Scott C. Lowe,Michael W. Denslow,Evan D. Donoso,Jiaman Wu,Michelle Ramirez,Benjamin Baiser,Charles V. Stewart,Paula Mabee,Tanya Berger-Wolf,Anuj Karpatne,Hilmar Lapp,Robert P. Guralnick,Graham W. Taylor,Sydne Record
Main category: cs.CV
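TL;DR: This study digitizes more than 13,200 NEON ground beetle specimens from 30 US sites using high-resolution imaging, builds a multimodal dataset, and achieves automated morphological trait extraction with sub-millimeter precision, advancing the use of AI in invertebrate ecology and biodiversity monitoring.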
Details
Motivation: Global trait databases cover invertebrates very poorly, which limits ecological analysis of high-diversity groups such as ground beetles. Although ground beetles are important bioindicators of ecosystem health, their specimens are held mostly as physical collections, making large-scale study and broad sharing difficult. Method: The large collection of ground beetle specimens held by the National Ecological Observatory Network (NEON) is digitized through high-resolution imaging, and the elytra length and width of each specimen are measured digitally, yielding a morphological trait database suited to AI analysis. The digital measurements are compared with manual measurements to validate their accuracy. Result: A digitized multimodal dataset of more than 13,200 ground beetles from 30 sites across the continental US and Hawaii; the digitally measured elytra traits reach sub-millimeter precision and are highly reliable. Conclusion: The dataset helps address the under-representation of invertebrates in trait databases, provides a basis for AI-driven automated species identification and trait research, and supports more efficient, larger-scale biodiversity monitoring and conservation. Abstract: Despite the ecological significance of invertebrates, global trait databases remain heavily biased toward vertebrates and plants, limiting comprehensive ecological analyses of high-diversity groups like ground beetles. Ground beetles (Coleoptera: Carabidae) serve as critical bioindicators of ecosystem health, providing valuable insights into biodiversity shifts driven by environmental changes. While the National Ecological Observatory Network (NEON) maintains an extensive collection of carabid specimens from across the United States, these primarily exist as physical collections, restricting widespread research access and large-scale analysis. To address these gaps, we present a multimodal dataset digitizing over 13,200 NEON carabids from 30 sites spanning the continental US and Hawaii through high-resolution imaging, enabling broader access and computational analysis. The dataset includes digitally measured elytra length and width of each specimen, establishing a foundation for automated trait extraction using AI. Validated against manual measurements, our digital trait extraction achieves sub-millimeter precision, ensuring reliability for ecological and computational studies. By addressing invertebrate under-representation in trait databases, this work supports AI-driven tools for automated species identification and trait-based research, fostering advancements in biodiversity monitoring and conservation.
[144] See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection
Amir Mallak,Erfan Aasi,Shiva Sreeram,Tsun-Hsuan Wang,Daniela Rus,Alaa Maalouf
Main category: cs.CV
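TL;DR: This paper proposes Stochastic-Patch-Selection (SPS), which randomly masks a subset of image patch features to improve the robustness and generalization of end-to-end autonomous driving policies in out-of-distribution (OOD) scenarios. Experiments show that the method surpasses the prior state of the art, is markedly more efficient, and transfers to a real vehicle.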
Details
Motivation: Patch features from foundation models are highly redundant (for example, self-attention causes information to overlap across patches), so policies trained directly on them tend to overfit spurious correlations, which hurts OOD robustness. A mechanism is needed to reduce the effect of this redundancy. Method: Stochastic-Patch-Selection (SPS) randomly masks a fraction of patch descriptors in every frame while preserving the spatial layout of the remaining patches, forcing the policy to base its decisions on invariant, more essential features. Result: Across OOD scenarios SPS improves performance by 6.2% on average and by up to 20.4%, while being 2.4× faster; 8 of the 9 systems trained in the ablations surpass the prior SOTA; the policy transfers to a real car without tuning. Conclusion: By introducing random patch masking, SPS mitigates the overfitting caused by redundant patch features in foundation models, substantially improves the robustness, generalization, and efficiency of driving policies, and shows potential for real-world deployment. Abstract: Recent advances in end-to-end autonomous driving show that policies trained on patch-aligned features extracted from foundation models generalize better to Out-of-Distribution (OOD) scenarios. We hypothesize that due to the self-attention mechanism, each patch feature implicitly embeds/contains information from all other patches, represented in a different way and intensity, making these descriptors highly redundant. We quantify redundancy in such (BLIP2) features via PCA and cross-patch similarity: $90$% of variance is captured by $17/64$ principal components, and strong inter-token correlations are pervasive. Training on such overlapping information leads the policy to overfit spurious correlations, hurting OOD robustness. We present Stochastic-Patch-Selection (SPS), a simple yet effective approach for learning policies that are more robust, generalizable, and efficient. For every frame, SPS randomly masks a fraction of patch descriptors, not feeding them to the policy model, while preserving the spatial layout of the remaining patches. Thus, the policy is provided with different stochastic but complete views of the (same) scene: every random subset of patches acts like a different, yet still sensible, coherent projection of the world. The policy thus bases its decisions on features that are invariant to which specific tokens survive. Extensive experiments confirm that across all OOD scenarios, our method outperforms the state of the art (SOTA), achieving a $6.2$% average improvement and up to $20.4$% in closed-loop simulations, while being $2.4\times$ faster. 
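The per-frame masking step described in the abstract might look roughly like the NumPy sketch below. It is an illustration under stated assumptions, not the authors' code: the function name, the `mask_ratio` value, and the choice to preserve spatial layout by returning each surviving patch's grid coordinates are all hypothetical.

```python
import numpy as np


def stochastic_patch_selection(patch_features, mask_ratio, rng):
    """Randomly drop a fraction of patch descriptors from one frame.

    patch_features: array of shape (H, W, D), one D-dim descriptor per patch.
    mask_ratio:     fraction of patches to mask (assumed hyperparameter).
    rng:            a numpy.random.Generator, so the selection is reproducible.

    Returns the surviving descriptors, shape (K, D), and their (row, col)
    grid positions, shape (K, 2), in the original raster order.
    """
    height, width, dim = patch_features.shape
    num_patches = height * width
    num_keep = max(1, int(round(num_patches * (1.0 - mask_ratio))))

    # Sample the surviving patches without replacement; sorting keeps raster order.
    keep_idx = np.sort(rng.choice(num_patches, size=num_keep, replace=False))

    tokens = patch_features.reshape(num_patches, dim)[keep_idx]
    # Grid coordinates let a downstream policy know where each kept patch sits.
    positions = np.stack(np.unravel_index(keep_idx, (height, width)), axis=1)
    return tokens, positions


# Example: an 8x8 grid of 4-dim descriptors with half of the patches masked.
features = np.arange(8 * 8 * 4, dtype=np.float64).reshape(8, 8, 4)
tokens, positions = stochastic_patch_selection(features, 0.5, np.random.default_rng(0))
```

Calling the function with a fresh random draw for every frame gives the policy a different subset of the same scene each time, which is the property the abstract relies on.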
We conduct ablations over masking rates and patch-feature reorganization, training and evaluating 9 systems, with 8 of them surpassing prior SOTA. Finally, we show that the same learned policy transfers to a physical, real-world car without any tuning.
[145] From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion
Cheng Chen,Yuyu Guo,Pengpeng Zeng,Jingkuan Song,Peng Di,Hang Yu,Lianli Gao
Main category: cs.CV
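TL;DR: Proposes the Cross-Layer Injection (CLI) framework, which addresses the visual feature bottleneck in vision-language models through dynamic many-to-many cross-layer connections.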
Details
Motivation: Existing VLMs statically connect only the output of the vision encoder to the input of the large language model, so hierarchical visual knowledge cannot be fully aligned and the model's ability to combine local details with global semantics in its reasoning is limited. Method: The CLI framework comprises an Adaptive Multi-Projection (AMP) module and an Adaptive Gating Fusion (AGF) mechanism. Together they enable dynamic, multi-level feature interaction between the vision and language modalities, letting the LLM selectively inject the most relevant visual information according to its decoding context. Result: Integrating CLI into LLaVA-OneVision and LLaVA-1.5 yields significant gains across 18 benchmarks, supporting its effectiveness and generality. Conclusion: CLI offers a scalable paradigm that achieves deeper multimodal understanding by giving the LLM on-demand access to the full visual hierarchy. Abstract: Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.
[146] Alterbute: Editing Intrinsic Attributes of Objects in Images
Tal Reiss,Daniel Winter,Matan Cohen,Alex Rav-Acha,Yael Pritch,Ariel Shamir,Yedid Hoshen
Main category: cs.CV
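TL;DR: Proposes Alterbute, a diffusion-based method for editing the attributes of objects in images. It can change intrinsic attributes such as color and material while preserving the object's identity and the scene context.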
Details
Motivation: Existing methods for editing an object's intrinsic attributes struggle to balance identity preservation against attribute change: they either lose identity because they rely on unsupervised priors, or restrict plausible variation because their supervision is too strong. Method: A relaxed training objective conditions the model on an identity reference image, a text description, a background image, and a mask to control both intrinsic and extrinsic changes. At inference the background and mask are fixed so that only the target intrinsic attributes change. Fine-grained Visual Named Entities (VNEs) provide scalable, identity-preserving supervision. Result: The method edits intrinsic attributes such as color, texture, material, and shape more effectively while preserving object identity, outperforming existing methods. Conclusion: Through VNEs and conditional diffusion, Alterbute achieves high-quality, identity-preserving editing of intrinsic object attributes and improves the balance between attribute controllability and identity consistency in image editing. Abstract: We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs) - fine-grained visual identity categories (e.g., "Porsche 911 Carrera") that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.
[147] WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments
Xuweiyi Chen,Wentao Zhou,Zezhou Cheng
Main category: cs.CV
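TL;DR: Proposes WildRayZer, a self-supervised framework for novel view synthesis in dynamic environments. It separates static from dynamic content through an analysis-by-synthesis test, which effectively improves synthesis quality.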