Table of Contents
cs.CL
[1] LLM-Driven Preference Data Synthesis for Proactive Prediction of the Next User Utterance in Human-Machine Dialogue
Jinqiang Wang, Huansheng Ning, Jianguo Ding, Tao Zhu, Liming Chen, Chris Nugent
Main category: cs.CL
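TL;DR: This paper proposes ProUtt, an LLM-based preference data synthesis method for proactively predicting the user's next utterance. It builds an intent tree, explicitly models intent reasoning paths, and generates preferred and non-preferred reasoning processes from both exploitation and exploration perspectives, significantly outperforming existing methods.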
Details
Motivation: Existing user simulation methods mainly imitate speaking style and struggle to advance the dialogue, and they lack explicit modeling of the user's intent reasoning; meanwhile, deploying general-purpose LLMs locally is costly and API-based solutions raise privacy concerns. An efficient, privacy-preserving prediction method that understands user intent is therefore needed.
Method: Proposes ProUtt, which converts dialogue history into an intent tree, predicts the next plausible path to model intent reasoning trajectories, builds preferred and non-preferred reasoning processes from exploitation and exploration perspectives, and synthesizes training data by perturbing or revising paths at future turns.
Result: On four benchmark datasets, LLM-as-a-judge and human evaluations show that ProUtt consistently outperforms existing data synthesis methods, user simulators, and commercial LLM APIs on proactive next utterance prediction.
Conclusion: By explicitly modeling intent reasoning paths and synthesizing high-quality preference data, ProUtt improves the performance of small task-specific models on proactive next utterance prediction, combines efficiency with privacy benefits, and has practical value and research significance.
Abstract: Proactively predicting a user's next utterance in human-machine dialogue can streamline interaction and improve user experience. Existing commercial API-based solutions are subject to privacy concerns, while deploying general-purpose LLMs locally remains computationally expensive. As such, training a compact, task-specific LLM provides a practical alternative. Although user simulator methods can predict a user's next utterance, they mainly imitate the user's speaking style rather than advancing the dialogue. Preference data synthesis has been investigated to generate data for proactive next utterance prediction and help align LLMs with user preferences. Yet existing methods lack the ability to explicitly model the intent reasoning that leads to the user's next utterance and to define and synthesize preference and non-preference reasoning processes for predicting the user's next utterance. To address these challenges, we propose ProUtt, an LLM-driven preference data synthesis method for proactive next utterance prediction. ProUtt converts dialogue history into an intent tree and explicitly models intent reasoning trajectories by predicting the next plausible path from both exploitation and exploration perspectives. It then constructs preference and non-preference reasoning processes by perturbing or revising intent tree paths at different future turns. Extensive evaluations using LLM-as-a-judge and human judgments demonstrate that ProUtt consistently outperforms existing data synthesis methods, user simulators, and commercial LLM APIs across four benchmark datasets.
We release both the code and the synthesized datasets to facilitate future research.
[2] Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines
Devesh Saraogi, Rohit Singhee, Dhruv Kumar
Main category: cs.CL
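TL;DR: This paper examines whether agent-based multi-step workflows can generate novel and feasible research plans, and finds that decomposition-based and long-context workflows perform best on novelty.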
Details
Motivation: LLMs exhibit "smart plagiarism" under single-step prompting; the paper explores whether multi-step agentic workflows can improve the originality and feasibility of AI-generated research ideas.
Method: Benchmarks five reasoning architectures: reflection-based iterative refinement, Sakana AI v2 evolutionary algorithms, the Google Co-Scientist multi-agent framework, GPT Deep Research recursive decomposition, and the Gemini 3 Pro multimodal long-context pipeline, with experts rating 30 proposals on novelty, feasibility, and impact.
Result: Decomposition-based and long-context workflows reach a mean novelty score of 4.17/5, significantly higher than reflection-based methods (2.33/5), and high-scoring workflows stay creative while maintaining feasibility.
Conclusion: Carefully designed multi-stage agentic workflows can advance AI-assisted scientific innovation and improve the originality and practicality of research plans.
Abstract: The integration of Large Language Models (LLMs) into the scientific ecosystem raises fundamental questions about the creativity and originality of AI-generated research. Recent work has identified "smart plagiarism" as a concern in single-step prompting approaches, where models reproduce existing ideas with terminological shifts. This paper investigates whether agentic workflows -- multi-step systems employing iterative reasoning, evolutionary search, and recursive decomposition -- can generate more novel and feasible research plans. We benchmark five reasoning architectures: Reflection-based iterative refinement, Sakana AI v2 evolutionary algorithms, Google Co-Scientist multi-agent framework, GPT Deep Research (GPT-5.1) recursive decomposition, and Gemini 3 Pro multimodal long-context pipeline. Using evaluations from thirty proposals each on novelty, feasibility, and impact, we find that decomposition-based and long-context workflows achieve mean novelty of 4.17/5, while reflection-based approaches score significantly lower (2.33/5). Results reveal varied performance across research domains, with high-performing workflows maintaining feasibility without sacrificing creativity. These findings support the view that carefully designed multi-stage agentic workflows can advance AI-assisted research ideation.
[3] Introducing Axlerod: An LLM-based Chatbot for Assisting Independent Insurance Agents
Adam Bradley, John Hastings, Khandaker Mamun Ahmed
Main category: cs.CL
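TL;DR: This paper presents and evaluates Axlerod, an AI conversational system for improving the efficiency of insurance agents. Combining NLP, RAG, and domain knowledge, it achieves 93.18% accuracy in policy retrieval and shortens average search time by 2.42 seconds.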
Details
Motivation: To improve the productivity of independent insurance agents and address slow responses and complex information retrieval in traditional insurance services, a specialized AI assistant is needed rather than only a consumer-facing chatbot.
Method: Uses natural language processing (NLP), retrieval-augmented generation (RAG), and domain-specific knowledge integration to design and implement Axlerod, an AI-driven conversational interface that supports intent recognition, structured database access, and real-time response generation.
Result: Axlerod reaches 93.18% accuracy on policy retrieval tasks and reduces average search time by 2.42 seconds, showing strong performance in contextual understanding and operational automation.
Conclusion: The study advances enterprise AI in insurtech, emphasizes the importance of agent-assistive architectures, and offers a practical example and technical path for future intelligent insurance service systems.
Abstract: The insurance industry is undergoing a paradigm shift through the adoption of artificial intelligence (AI) technologies, particularly in the realm of intelligent conversational agents. Chatbots have evolved into sophisticated AI-driven systems capable of automating complex workflows, including policy recommendation and claims triage, while simultaneously enabling dynamic, context-aware user engagement. This paper presents the design, implementation, and empirical evaluation of Axlerod, an AI-powered conversational interface designed to improve the operational efficiency of independent insurance agents. Leveraging natural language processing (NLP), retrieval-augmented generation (RAG), and domain-specific knowledge integration, Axlerod demonstrates robust capabilities in parsing user intent, accessing structured policy databases, and delivering real-time, contextually relevant responses. Experimental results underscore Axlerod's effectiveness, achieving an overall accuracy of 93.18% in policy retrieval tasks while reducing the average search time by 2.42 seconds. This work contributes to the growing body of research on enterprise-grade AI applications in insurtech, with a particular focus on agent-assistive rather than consumer-facing architectures.
[4] Opportunities and Challenges of Natural Language Processing for Low-Resource Senegalese Languages in Social Science Research
Derguene Mbaye, Tatiana D. P. Mbengue, Madoune R. Seye, Moussa Diallo, Mamadou L. Ndiaye, Dimitri S. Adjanohoun, Cheikh S. Wade, Djiby Sow, Jean-Claude B. Munyaka, Jerome Chenal
Main category: cs.CL
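TL;DR: This paper gives the first comprehensive survey of the state and challenges of natural language processing (NLP) for Senegal's six officially recognized national languages (Wolof, Pulaar, Sereer, Joola, Mandingue, and Soninke). It synthesizes the linguistic, sociotechnical, and infrastructural factors that affect their digital readiness and identifies gaps in data, tools, and benchmarks. The authors analyze existing work on text normalization, machine translation, and speech processing, and set up a centralized GitHub repository to foster collaboration and reproducibility. The paper highlights the potential of NLP in the social sciences and proposes a roadmap for a sustainable NLP ecosystem that is community-centered and emphasizes ethical data governance, open resources, and interdisciplinary collaboration.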
Details
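Motivation: African languages are severely marginalized in NLP development even as NLP transforms research methods across fields. Senegal has six officially recognized national languages, but their digital resources are extremely scarce, which limits technological inclusion and the efficiency of local research. A systematic assessment of current progress is therefore needed to identify key bottlenecks and promote equitable, sustainable language technology.
Method: Combines linguistic, sociotechnical, and infrastructural perspectives to systematically review existing NLP research and initiatives; compiles and open-sources a public GitHub repository of resources covering multiple NLP tasks; analyzes progress in key technical directions such as text normalization, machine translation, and speech processing; and discusses application scenarios for NLP in social science field research.
Result: Identifies the main NLP gaps for Senegal's six national languages, including the lack of standardized datasets, incomplete toolchains, and missing evaluation benchmarks; establishes the first centralized public resource platform to support future research; and shows the potential of NLP to improve the efficiency and inclusiveness of social science research through multilingual transcription, translation, and information retrieval.
Conclusion: Sustainable NLP development for Senegalese languages requires community-centered ecosystem building that emphasizes ethical data governance, open resource sharing, and interdisciplinary collaboration, so that the technology serves local communities and helps preserve linguistic diversity.
Abstract: Natural Language Processing (NLP) is rapidly transforming research methodologies across disciplines, yet African languages remain largely underrepresented in this technological shift. This paper provides the first comprehensive overview of NLP progress and challenges for the six national languages officially recognized by the Senegalese Constitution: Wolof, Pulaar, Sereer, Joola, Mandingue, and Soninke. We synthesize linguistic, sociotechnical, and infrastructural factors that shape their digital readiness and identify gaps in data, tools, and benchmarks. Building on existing initiatives and research works, we analyze ongoing efforts in text normalization, machine translation, and speech processing. We also provide a centralized GitHub repository that compiles publicly accessible resources for a range of NLP tasks across these languages, designed to facilitate collaboration and reproducibility. A special focus is devoted to the application of NLP to the social sciences, where multilingual transcription, translation, and retrieval pipelines can significantly enhance the efficiency and inclusiveness of field research. The paper concludes by outlining a roadmap toward sustainable, community-centered NLP ecosystems for Senegalese languages, emphasizing ethical data governance, open resources, and interdisciplinary collaboration.
[5] SALP-CG: Standard-Aligned LLM Pipeline for Classifying and Grading Large Volumes of Online Conversational Health Data
Yiwei Yan, Hao Li, Hua He, Gong Kai, Zhengyi Yang, Guanfeng Liu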
Main category: cs.CL
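TL;DR: This study proposes SALP-CG, an LLM-based extraction pipeline for classifying and grading privacy risks in online medical conversation data in line with the GB/T 39725-2020 standard. It achieves high category compliance and accurate sensitivity identification, performs well on the MedDialog-CN benchmark (micro-F1=0.900), and supports health data governance across models.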
Details
Motivation: Online medical consultations produce large volumes of conversational data containing protected health information, but existing methods lack unified standards and reliable automated means of classifying its sensitivity, so they fall short of data governance needs.
Method: Combines few-shot guidance, JSON Schema constrained decoding, and deterministic high-risk rules to build SALP-CG, a backend-agnostic extraction pipeline, and derives health-data classification and grading rules from GB/T 39725-2020, giving compatibility with many kinds of LLMs and efficient privacy-risk identification.
Result: On the MedDialog-CN benchmark, models yield robust entity counts, high schema compliance, and accurate sensitivity grading, with the strongest model reaching micro-F1=0.900 on maximum-level prediction; analysis shows that Level 2-3 items dominate and can enable re-identification when combined, while Level 4-5 items are rarer but more harmful.
Conclusion: SALP-CG reliably performs category classification and sensitivity grading of online conversational health data across LLMs, providing a practical and efficient automated solution for health data governance.
Abstract: Online medical consultations generate large volumes of conversational health data that often embed protected health information, requiring robust methods to classify data categories and assign risk levels in line with policies and practice. However, existing approaches lack unified standards and reliable automated methods to fulfill sensitivity classification for such conversational health data. This study presents a large language model-based extraction pipeline, SALP-CG, for classifying and grading privacy risks in online conversational health data. We derived health-data classification and grading rules in accordance with GB/T 39725-2020. Combining few-shot guidance, JSON Schema constrained decoding, and deterministic high-risk rules, the backend-agnostic extraction pipeline achieves strong category compliance and reliable sensitivity across diverse LLMs. On the MedDialog-CN benchmark, models yield robust entity counts, high schema compliance, and accurate sensitivity grading, while the strongest model attains micro-F1=0.900 for maximum-level prediction. The category landscape stratified by sensitivity shows that Level 2-3 items dominate, enabling re-identification when combined; Level 4-5 items are less frequent but carry outsize harm. SALP-CG reliably helps classify categories and grade sensitivity in online conversational health data across LLMs, offering a practical method for health data governance.
Code is available at https://github.com/dommii1218/SALP-CG.
[6] StatLLaMA: A multi-stage training framework for building a domain-optimized statistical language model
Jing-Yi Zeng, Guan-Hua Huang
Main category: cs.CL
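TL;DR: This study proposes an efficient way to build StatLLaMA, a large language model specialized for statistics, on the lightweight LLaMA-3.2-3B family. It finds that effective domain specialization requires starting from an instruction-tuned foundation model (such as LLaMA-3.2-3B-Instruct), and it optimizes performance through multi-stage training.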
Details
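Motivation: How to efficiently build an LLM with specialized statistical reasoning ability under limited resources, while preserving general reasoning ability, is a key challenge in developing domain-specialized models.
Method: Systematically compares three multi-stage training pipelines, starting from a base model with no instruction-following capability, a base model with post-hoc instruction tuning, and an instruction-tuned model with strong general reasoning, each going through continual pretraining, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) preference alignment, and downstream task adaptation.
Result: Pipelines that start from a base model fail to develop effective statistical reasoning, whereas starting from LLaMA-3.2-3B-Instruct achieves domain specialization; the evaluation of SFT variants reveals a trade-off between domain expertise and general reasoning; direct preference optimization is stable and effective for RLHF preference alignment; and downstream fine-tuning must be very low-intensity to avoid catastrophic forgetting.
Conclusion: The choice of starting model is critical: starting from a model that already understands instructions is key to efficient domain specialization. The resulting StatLLaMA performs well on mathematical reasoning, common-sense reasoning, and statistical expertise, offering a viable approach to building specialized LLMs under resource constraints.
Abstract: This study investigates how to efficiently build a domain-specialized large language model (LLM) for statistics using the lightweight LLaMA-3.2-3B family as the foundation model (FM). We systematically compare three multi-stage training pipelines, starting from a base FM with no instruction-following capability, a base FM augmented with post-hoc instruction tuning, and an instruction-tuned FM with strong general reasoning abilities, across continual pretraining, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) preference alignment, and downstream task adaptation. Results show that pipelines beginning with a base FM fail to develop meaningful statistical reasoning, even after extensive instruction tuning, SFT, or RLHF alignment. In contrast, starting from LLaMA-3.2-3B-Instruct enables effective domain specialization. A comprehensive evaluation of SFT variants reveals clear trade-offs between domain expertise and general reasoning ability. We further demonstrate that direct preference optimization provides stable and effective RLHF preference alignment. Finally, we show that downstream fine-tuning must be performed with extremely low intensity to avoid catastrophic forgetting in highly optimized models. The final model, StatLLaMA, achieves strong and balanced performance on benchmarks of mathematical reasoning, common-sense reasoning, and statistical expertise, offering a practical blueprint for developing resource-efficient statistical LLMs.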
The code is available at https://github.com/HuangDLab/StatLLaMA.
[7] Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models
Hoyoon Byun, Youngjun Choi, Taero Kim, Sungrae Park, Kyungwoo Song
Main category: cs.CL
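TL;DR: This paper proposes BHyT, a normalization alternative for large language models that targets the inefficiency of Pre-LN and the training instability that grows with depth. BHyT combines a tanh nonlinearity with data-driven input bounding to keep activations in a non-saturating range, and comes with a theoretical stability guarantee. Experiments show that, compared with RMSNorm, BHyT trains 15.8% faster on average during pretraining and achieves 4.2% higher generation throughput, while matching or surpassing its inference performance.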
Details
Motivation: Pre-LN is widely used in LLMs but is computationally inefficient, and the magnitude and variance of activations grow with depth, which hurts training stability; existing efficient methods such as DyT remain unstable in deep models. An alternative that balances stability and efficiency is therefore needed.
Method: Proposes Bounded Hyperbolic Tanh (BHyT) as a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with data-driven input bounds to keep activations out of the saturation region, and improves efficiency by computing exact statistics once per block and replacing the second normalization with a lightweight variance approximation.
Result: BHyT suppresses the growth of activation magnitude and variance in deep networks and provides a theoretical stability guarantee; compared with RMSNorm it trains 15.8% faster on average in pretraining and has 4.2% higher token generation throughput, while achieving comparable or better inference performance and robustness on multiple language understanding and reasoning tasks.
Conclusion: BHyT is an efficient and stable alternative to Pre-LN that markedly improves training speed and generation efficiency without sacrificing performance, and suits the deep architectures of large-scale language models.
Abstract: Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN is inefficient due to repeated statistical calculations and suffers from the curse of depth. As layers grow, the magnitude and variance of the hidden state escalate, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve speed but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and comes with a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once per block and replaces a second normalization with a lightweight variance approximation. Empirically, BHyT demonstrates improved stability and efficiency during pretraining, achieving an average of 15.8% faster training and an average of 4.2% higher token generation throughput compared to RMSNorm, while matching or surpassing its inference performance and robustness across language understanding and reasoning benchmarks. Our code is available at: https://anonymous.4open.science/r/BHyT
[8] Uncertainty-Aware Dynamic Knowledge Graphs for Reliable Question Answering
Yu Takahashi, Shun Takeuchi, Kexuan Xin, Guillaume Pelat, Yoshiaki Ikai, Junya Saito, Jonathan Vitale, Shlomo Berkovsky, Amin Beheshti
Main category: cs.CL
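TL;DR: Proposes an uncertainty-aware dynamic knowledge graph framework for improving the reliability and transparency of question answering systems, especially in high-stakes healthcare applications.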
Details
Motivation: Existing knowledge-graph-based QA systems usually treat facts as static and deterministic, making it hard to capture how information evolves and the uncertainty in reasoning, so reliability drops when evidence is incomplete or noisy.
Method: Combines dynamic knowledge graph construction, confidence scoring with uncertainty-aware retrieval, and an interactive interface to model and visualize uncertain information, and builds personalized knowledge graphs from electronic health records.
Result: The system lets users explore dynamic graphs, inspect confidence-annotated fact triples, and compare baseline and confidence-aware answers; its impact is evaluated on a mortality prediction task.
Conclusion: Uncertainty-aware dynamic knowledge graphs can make QA systems more reliable, interpretable, and practical in high-stakes settings, and are especially suited to applications such as clinical decision support.
Abstract: Question answering (QA) systems are increasingly deployed across domains. However, their reliability is undermined when retrieved evidence is incomplete, noisy, or uncertain. Existing knowledge graph (KG) based QA frameworks typically represent facts as static and deterministic, failing to capture the evolving nature of information and the uncertainty inherent in reasoning. We present a demonstration of uncertainty-aware dynamic KGs, a framework that combines (i) dynamic construction of evolving KGs, (ii) confidence scoring and uncertainty-aware retrieval, and (iii) an interactive interface for reliable and interpretable QA. Our system highlights how uncertainty modeling can make QA more robust and transparent by enabling users to explore dynamic graphs, inspect confidence-annotated triples, and compare baseline versus confidence-aware answers. The target users of this demo are clinical data scientists and clinicians, and we instantiate the framework in healthcare: constructing personalized KGs from electronic health records, visualizing uncertainty across patient visits, and evaluating its impact on a mortality prediction task. This use case demonstrates the broader promise of uncertainty-aware dynamic KGs for enhancing QA reliability in high-stakes applications.
[9] Cross-Platform Evaluation of Large Language Model Safety in Pediatric Consultations: Evolution of Adversarial Robustness and the Scale Paradox
Vahideh Zolfaghari
Main category: cs.CL
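TL;DR: This study evaluates the safety of large language models (LLMs) in pediatric medical consultations under adversarial pressure driven by parental anxiety. It finds that smaller models show better safety alignment, and that all models lack emergency recognition and are unsuitable for triage.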
Details
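Motivation: Existing safety evaluations of medical LLMs mostly use neutral conditions and overlook the challenges posed by anxious users in real settings. This study aims to fill that gap, specifically by examining model behavior under simulated parental anxiety in pediatric consultations.
Method: Uses the PediatricAnxietyBench dataset of 150 authentic and 150 adversarial queries covering 10 pediatric topics; tests three models via APIs (Llama-3.3-70B, Llama-3.1-8B, and Mistral-7B) for a total of 900 responses; scores safety on a 0-15 scale (restraint, referral advice, hedging, emergency recognition, and so on); and runs statistical analysis with paired t-tests and bootstrapped confidence intervals.
Result: Mean safety scores range from 9.70 (Llama-3.3-70B) to 10.39 (Mistral-7B); Llama-3.1-8B significantly outperforms Llama-3.3-70B (+0.66, p=0.0001); adversarial queries raise safety scores, most of all for Mistral-7B (+1.09, p=0.0002); Llama-3.3-70B has an 8% safety failure rate; seizure-related queries show a 33% rate of inappropriate diagnoses; and hedging correlates positively with safety scores (r=0.68).
Conclusion: Model safety depends more on alignment strategy and architecture than on parameter count, and smaller models can outperform larger ones; successive releases show a trend toward greater robustness; but because emergency recognition is generally absent, current LLMs are not yet suitable for clinical triage. The results support adversarial testing under realistic pressure and provide an open benchmark to advance medical AI safety.
Abstract: Background Large language models (LLMs) are increasingly deployed in medical consultations, yet their safety under realistic user pressures remains understudied. Prior assessments focused on neutral conditions, overlooking vulnerabilities from anxious users challenging safeguards. This study evaluated LLM safety under parental anxiety-driven adversarial pressures in pediatric consultations across models and platforms. Methods PediatricAnxietyBench, from a prior evaluation, includes 300 queries (150 authentic, 150 adversarial) spanning 10 topics. Three models were assessed via APIs: Llama-3.3-70B and Llama-3.1-8B (Groq), Mistral-7B (HuggingFace), yielding 900 responses. Safety was scored on a 0-15 scale for restraint, referral, hedging, emergency recognition, and non-prescriptive behavior. Analyses employed paired t-tests with bootstrapped CIs. Results Mean scores ranged from 9.70 (Llama-3.3-70B) to 10.39 (Mistral-7B). Llama-3.1-8B outperformed Llama-3.3-70B by +0.66 (p=0.0001, d=0.225). Models showed positive adversarial effects, with Mistral-7B the strongest (+1.09, p=0.0002). Safety generalized across platforms; Llama-3.3-70B had 8% failures. Seizure queries were vulnerable (33% inappropriate diagnoses). Hedging predicted safety (r=0.68, p<0.001).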
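To make the statistical analysis named in the Methods concrete, the following is a minimal Python sketch of a paired t statistic and a percentile bootstrap confidence interval for the mean paired difference. It is an illustration under stated assumptions, not the study's analysis code: the function names, the resample count, the seed, and the toy scores are all hypothetical, and the abstract does not say which bootstrap variant was used.

```python
import math
import random
import statistics


def paired_differences(a, b):
    """Return per-item differences a[i] - b[i] for two paired score lists."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    return [x - y for x, y in zip(a, b)]


def paired_t_statistic(a, b):
    """t = mean(d) / (sd(d) / sqrt(n)) for the paired differences d."""
    d = paired_differences(a, b)
    standard_error = statistics.stdev(d) / math.sqrt(len(d))
    return statistics.fmean(d) / standard_error


def paired_bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference.

    Resamples the differences with replacement (assumed variant).
    Returns (observed_mean_difference, lower_bound, upper_bound).
    """
    d = paired_differences(a, b)
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = sorted(
        statistics.fmean(rng.choice(d) for _ in d) for _ in range(n_resamples)
    )
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = n_resamples - 1 - lo_idx
    return statistics.fmean(d), means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Hypothetical safety scores for the same five queries under two models.
    scores_a = [10, 11, 12, 9, 10]
    scores_b = [9, 10, 10, 9, 8]
    print("t statistic:", paired_t_statistic(scores_a, scores_b))
    print("mean diff and 95% CI:", paired_bootstrap_ci(scores_a, scores_b))
```

Pairing matters here because every model answers the same queries, so the per-query difference removes variation in query difficulty before the interval is estimated.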
Conclusions The evaluation shows that safety depends on alignment and architecture more than on scale, with smaller models outperforming larger ones. The evolution toward robustness across releases suggests progress from targeted training. The observed vulnerabilities and the absence of emergency recognition indicate unsuitability for triage. The findings guide model selection, underscore the need for adversarial testing, and provide an open benchmark for medical AI safety.
[10] ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language
Franciszek Górski, Andrzej Czyżewski
Main category: cs.CL
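TL;DR: This study proposes using a multilingual LLM (such as Llama3.1) as a teacher model to annotate Polish medical texts and then training lightweight BERT-style classifiers, achieving efficient multi-class clinical text classification with limited resources.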
Details
Motivation: Annotation resources for Polish medical texts are insufficient, making it hard to build high-quality classifiers, so an efficient, low-cost scheme for automatic annotation and model training is needed.
Method: Uses the multilingual Llama3.1 model to automatically annotate a large corpus of Polish medical texts, with humans verifying only part of the labels to build a test set; trains three BERT-based classifiers on this data: DistilBERT, BioBERT, and HerBERT.
Result: DistilBERT performs best, with an F1 score above 0.80 for each of the five clinical categories and above 0.93 for three of them, while being small, light on GPU VRAM, and fast at inference.
Conclusion: Using a multilingual LLM for annotation in a knowledge-distillation setup makes it possible to build small, specialized classifiers that are both accurate and efficient, offering a viable path for medical NLP in low-resource languages.
Abstract: In this work, we present an annotation framework that demonstrates how a multilingual LLM pretrained on a large corpus can be used as a teacher model to distill the expert knowledge needed for tagging medical texts in Polish. This work is part of a larger project called ADMEDVOICE, within which we collected an extensive corpus of medical texts representing five clinical categories - Radiology, Oncology, Cardiology, Hypertension, and Pathology. Using this data, we had to develop a multi-class classifier, but the fundamental problem turned out to be the lack of resources for annotating an adequate number of texts. Therefore, in our solution, we used the multilingual Llama3.1 model to annotate an extensive corpus of medical texts in Polish. Using our limited annotation resources, we verified only a portion of these labels, creating a test set from them. The data annotated in this way were then used for training and validation of 3 different types of classifiers based on the BERT architecture - the distilled DistilBERT model, BioBERT fine-tuned on medical data, and HerBERT fine-tuned on the Polish language corpus. Among the models we trained, the DistilBERT model achieved the best results, reaching an F1 score > 0.80 for each clinical category and an F1 score > 0.93 for 3 of them.
In this way, we obtained a series of highly effective classifiers that represent an alternative to large language models, due to their nearly 500 times smaller size, 300 times lower GPU VRAM consumption, and several hundred times faster inference.
[11] SagaScale: A Realistic, Scalable, and High-Quality Long-Context Benchmark Built from Full-Length Novels
Guancheng Du, Yong Hu, Wenqing Wang, Yaming Yang, Jiaheng Gao
Main category: cs.CL
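TL;DR: This paper introduces SagaScale, a realistic, scalable, and high-quality bilingual long-context benchmark built from full-length novels for evaluating how LLMs handle very long texts, and releases the dataset and code.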
Details
Motivation: Existing long-context benchmarks are limited in task realism, data scalability, and data quality, so a more reliable and larger benchmark is needed to accurately assess LLMs on complex long-document understanding.
Method: An automated data collection pipeline uses external resources (such as Wikipedia) to build question-answer pairs from full novels; the external resources are used only during construction and not during evaluation, which yields complex questions beyond what models can currently answer. The benchmark is bilingual (English and Chinese), with average context lengths above 250K tokens (English) and 320K tokens (Chinese).
Result: Experiments on 12 frontier LLMs and three long-context methods (Naïve RAG, Agentic RAG, and Long Context) show that supplying the full context directly sometimes substantially outperforms RAG methods; most models still struggle with very long contexts, though Gemini-2.5-Pro stands out; and Agentic RAG effectively relieves the retrieval bottleneck of Naïve RAG.
Conclusion: SagaScale is a realistic, large-scale, high-quality long-context benchmark that evaluates long-text understanding more comprehensively, and its public release will support research in the area.
Abstract: Large Language Models (LLMs) have shown significant progress, but understanding long and complex documents remains challenging. Many long-context benchmarks have been proposed, but they face several limitations, including task realism, data scalability, and data quality. To this end, we introduce SagaScale, a realistic, scalable, and high-quality long-context benchmark built from full-length novels. The entire benchmark is constructed using an automated data collection pipeline that utilizes external resources (e.g., Wikipedia pages) to curate question-answer pairs. Critically, these external resources are provided only for benchmark construction and not during evaluation, which allows LLMs to curate complex questions that go beyond what they can answer during evaluation. SagaScale is also bilingual and offers the largest context length to date, with average token counts exceeding 250K for English novels and 320K for Chinese novels. Our evaluation across 12 frontier LLMs and three long-context methods -- Naïve RAG, Agentic RAG, and Long Context -- yields key insights, including: (1) Directly supplying the full context to the LLM can outperform other methods by a large margin; (2) Most LLMs still struggle with lengthy contexts, but Gemini-2.5-Pro stands out as an exception; and (3) Agentic RAG effectively addresses the retrieval bottleneck in Naïve RAG.
Finally, we publicly release the SagaScale benchmark and our data collection codebase to facilitate future research.
[12] Syntactic Framing Fragility: An Audit of Robustness in LLM Ethical Decisions
Katherine Elkins, Jon Chun
Main category: cs.CL
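TL;DR: This paper proposes the Syntactic Framing Fragility (SFF) framework for evaluating whether LLMs make consistent ethical judgments across prompts that differ in syntactic structure but are logically equivalent. Many models reverse their judgments when syntactic polarity changes and are especially sensitive to negated prompts; open-source models are more fragile than commercial ones, and chain-of-thought reasoning effectively mitigates the problem.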
Details
Motivation: LLMs are increasingly used in consequential decision-making, but their robustness to benign prompt variation is unclear, in particular whether syntactic differences cause inconsistent ethical judgments; an evaluation method that isolates purely syntactic effects is needed.
Method: Proposes the SFF framework and Logical Polarity Normalization (LPN); holding semantics fixed, systematically tests the consistency of 23 mainstream models' judgments across 14 ethical scenarios and 4 syntactic framings, and analyzes mitigation strategies such as chain-of-thought.
Result: Most models show significant syntactic framing fragility, with some endorsing an action in 80-97% of cases even under a "should not" prompt; open-source models are more than twice as fragile as commercial ones; chain-of-thought substantially reduces inconsistency; and financial and business scenarios carry higher risk than medical ones.
Conclusion: Syntactic consistency is a distinct and critical dimension of ethical robustness, and SFF-style audits should be part of the standard pre-deployment safety evaluation of LLMs.
Abstract: Large language models (LLMs) are increasingly deployed in consequential decision-making settings, yet their robustness to benign prompt variation remains underexplored. In this work, we study whether LLMs maintain consistent ethical judgments across logically equivalent but syntactically different prompts, focusing on variations involving negation and conditional structure. We introduce Syntactic Framing Fragility (SFF), a robustness evaluation framework that isolates purely syntactic effects via Logical Polarity Normalization (LPN), enabling direct comparison of decisions across positive and negative framings without semantic drift. Auditing 23 state-of-the-art models spanning the U.S. and China as well as small U.S. open-source software models over 14 ethical scenarios and four controlled framings (39,975 decisions), we find widespread and statistically significant inconsistency: many models reverse ethical endorsements solely due to syntactic polarity, with open-source models exhibiting over twice the fragility of commercial counterparts. We further uncover extreme negation sensitivity, where some models endorse actions in 80-97% of cases when explicitly prompted with "should not." We show that eliciting chain-of-thought reasoning substantially reduces fragility, identifying a practical mitigation lever, and we map fragility across scenarios, finding higher risk in financial and business contexts than in medical scenarios.
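The idea of normalizing polarity before comparing decisions can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the abstract does not define LPN or the fragility measure, so the two framings, the yes/no answer encoding, and the `fragility` score (share of scenarios whose normalized decisions disagree) are hypothetical stand-ins rather than the paper's definitions.

```python
def normalize(answer: bool, negated: bool) -> bool:
    """Map a raw yes/no answer onto "endorses the action".

    Under a negated framing ("should the agent NOT do X?"), a "yes"
    means the action is rejected, so the answer is flipped.
    """
    return (not answer) if negated else answer


def fragility(decisions: dict) -> float:
    """Share of scenarios whose normalized decisions are inconsistent.

    `decisions` maps a scenario name to a list of (negated, answer)
    pairs, one pair per framing. This score is a hypothetical
    illustration, not the metric reported in the abstract.
    """
    inconsistent = 0
    for framings in decisions.values():
        endorsements = {normalize(answer, negated) for negated, answer in framings}
        if len(endorsements) > 1:  # the model both endorsed and rejected
            inconsistent += 1
    return inconsistent / len(decisions)


if __name__ == "__main__":
    # Hypothetical model outputs for two scenarios and two framings each.
    toy = {
        # "yes" to "should X?" and "no" to "should not X?" agree.
        "scenario_1": [(False, True), (True, False)],
        # "yes" to both framings is a polarity reversal.
        "scenario_2": [(False, True), (True, True)],
    }
    print(fragility(toy))
```

Flipping negated answers first means a disagreement can only come from the framing, which is the kind of purely syntactic effect the audit is described as isolating.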
Our results demonstrate that syntactic consistency constitutes a distinct and critical dimension of ethical robustness, and we argue that SFF-style audits should be a standard component of safety evaluation for deployed LLMs. Code and results will be available on github.com.
[13] Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation
Kaustubh Shivshankar Shejole, Sourabh Deoghare, Pushpak Bhattacharyya
Main category: cs.CL
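TL;DR: This paper introduces Virām, the first diagnostic benchmark for assessing punctuation robustness in English-to-Marathi machine translation, and shows experimentally that specialized fine-tuned models and pipeline systems significantly improve translation quality.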
Details
Motivation: Punctuation plays a key role in resolving semantic and structural ambiguity in written language, yet punctuation handling in machine translation for low-resource languages such as Marathi is understudied, so a dedicated evaluation benchmark and improved methods are needed.
Method: Builds Virām, a diagnostic benchmark of 54 manually curated punctuation-ambiguous instances, and evaluates two strategies for improving reliability: a pipeline-based restore-then-translate approach and direct fine-tuning on punctuation-varied data.
Result: Compared with standard baselines, specialized fine-tuned models and pipeline systems significantly improve translation quality on Virām; qualitative analysis shows that the original model can produce wrong translations while fine-tuned models greatly improve overall reliability; and current LLMs lag behind task-specific methods on punctuation-ambiguous text.
Conclusion: For punctuation ambiguity, task-specific fine-tuning and pipeline methods outperform general-purpose LLMs, and further research is needed to improve punctuation handling in machine translation for low-resource languages.
Abstract: Punctuation plays a critical role in resolving semantic and structural ambiguity in written language. Machine Translation (MT) systems are now widely applied across diverse domains and languages, including many low-resource settings. In this work, we focus on Marathi, a low- to middle-resource language. We introduce Virām, the first diagnostic benchmark for assessing punctuation robustness in English-to-Marathi machine translation, consisting of 54 manually curated, punctuation-ambiguous instances. We evaluate two primary strategies for enhancing reliability: a pipeline-based restore-then-translate approach and direct fine-tuning on punctuation-varied data. Our results demonstrate that specialized fine-tuned models and pipeline systems significantly improve translation quality over standard baselines on the Virām benchmark. Qualitative analysis reveals that the original model may produce wrong translations that lead to wrong interpretations, while fine-tuned models significantly improve overall reliability. Furthermore, we find that current Large Language Models (LLMs) lag behind these task-specific approaches in preserving meaning for punctuation-ambiguous text, thus necessitating further research in this area.
[14] Forgetting as a Feature: Cognitive Alignment of Large Language Models
Hien Tran, Quinten Steenhuis, Alexandros Christoforos, Chadbourne Davis
Main category: cs.CL
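TL;DR: This paper proposes treating "forgetting" in large language models (LLMs) as a functional cognitive mechanism rather than a defect. Drawing on the exponential decay dynamics of human memory, it builds a probabilistic memory model and proposes a "probabilistic memory prompting" strategy to improve long-horizon reasoning.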
Details
Motivation: LLMs are often expected to perform perfect Bayesian inference, yet in practice they systematically forget past information; the authors argue that this should not be seen as a defect and that the functional value of forgetting should be reconsidered in light of human memory mechanisms.
Method: Models LLM in-context reasoning as a probabilistic memory process governed by exponential decay, and designs a benchmark covering temporal reasoning, concept drift adaptation, and associative recall to compare model behavior with human cognitive patterns.
Result: Experiments show that LLM forgetting rates resemble the human memory trade-off between stability and adaptability; the proposed probabilistic memory prompting effectively improves long-horizon reasoning.
Conclusion: Forgetting is not a failure mode of LLMs but a principled mechanism for adaptive intelligence, offering a new perspective for building more efficient reasoning strategies.
Abstract: Large Language Models (LLMs) are often evaluated against ideals of perfect Bayesian inference, yet growing evidence suggests that their in-context reasoning exhibits systematic forgetting of past information. Rather than viewing this behavior as a limitation, we reinterpret forgetting as a functional cognitive mechanism. Drawing inspiration from human memory dynamics, we model LLM inference as a probabilistic memory process governed by exponential decay. We introduce a benchmark suite that evaluates temporal reasoning, concept drift adaptation, and associative recall, enabling direct comparison between model behavior and human cognitive patterns. Our empirical results reveal that LLMs demonstrate forgetting rates analogous to human memory efficiency trade-offs between stability and adaptability. Building on these observations, we propose probabilistic memory prompting, a lightweight strategy that shapes evidence integration to mimic human-like memory decay, leading to improved long-horizon reasoning performance. Our findings position forgetting not as a failure mode, but as a principled mechanism for adaptive intelligence.
[15] SciNets: Graph-Constrained Multi-Hop Reasoning for Scientific Literature Synthesis
Sauhard Dubey
Main category: cs.CL
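TL;DR: This paper proposes SciNets, which builds literature-derived concept graphs and frames mechanistic synthesis as graph-constrained multi-hop reasoning to enable cross-domain scientific synthesis, and introduces a behavioral evaluation framework that reveals a trade-off between symbolic reasoning depth and stability.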
Details
Motivation: Existing retrieval-based systems and language models offer little control over reasoning depth and structural consistency when synthesizing mechanistic explanations across the literature, making it hard to connect scattered scientific knowledge.
Method: Converts a scientific query and a local corpus into a directed concept graph and performs multi-hop reasoning with shortest paths, k-shortest paths with diversity constraints, and random walks, comparing against a retrieval-augmented language model baseline.
Result: Explicit graph constraints support controllable multi-hop reasoning; deeper and more diverse reasoning paths increase mechanistic diversity but lead to greater grounding instability, while shortest-path reasoning is stable but structurally conservative.
Conclusion: Combining graph constraints with LLMs supports controllable reasoning in scientific synthesis, but there is a fundamental trade-off between reasoning depth and grounding stability that must be balanced according to application needs.
Abstract: Cross-domain scientific synthesis requires connecting mechanistic explanations across fragmented literature, a capability that remains challenging for both retrieval-based systems and unconstrained language models. While recent work has applied large language models to scientific summarization and question answering, these approaches provide limited control over reasoning depth and structural grounding. We frame mechanistic synthesis as a graph-constrained multi-hop reasoning problem over literature-derived concept graphs. Given a scientific query and a compact, query-local corpus, SciNets constructs a directed concept graph and synthesizes mechanistic explanations by identifying multi-hop reasoning paths that connect concepts that rarely co-occur within individual papers. We systematically compare shortest-path reasoning, k-shortest paths with diversity constraints, stochastic random walks, and a retrieval-augmented language model baseline. Rather than evaluating correctness, which is often indeterminate when synthesizing connections across distributed sources, we introduce a behavioral framework that measures symbolic reasoning depth, mechanistic diversity, and grounding stability. Across machine learning, biology, and climate science tasks, explicit graph constraints enable controllable multi-hop reasoning while revealing a consistent trade-off: deeper and more diverse symbolic reasoning increases grounding instability, whereas shortest-path reasoning remains highly stable but structurally conservative.
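As a rough illustration of the shortest-path variant mentioned above, the sketch below runs a breadth-first search over a small directed concept graph and returns the multi-hop path linking two concepts. It is a generic textbook BFS under stated assumptions, not SciNets code: the adjacency-dict representation, the function name, and the toy concepts are hypothetical, and edges are treated as unweighted.

```python
from collections import deque


def shortest_concept_path(graph, source, target):
    """Return one shortest directed path from source to target, or None.

    `graph` maps each concept to the list of concepts it points to.
    Edges are assumed unweighted, so breadth-first search suffices.
    """
    if source == target:
        return [source]
    parents = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in parents:
                continue
            parents[neighbor] = node
            if neighbor == target:
                # Walk parent links back to the source, then reverse.
                path = [neighbor]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None  # target is not reachable from source


if __name__ == "__main__":
    # Hypothetical literature-derived concept graph.
    concept_graph = {
        "aerosols": ["cloud albedo", "precipitation"],
        "cloud albedo": ["surface temperature"],
        "precipitation": ["surface temperature"],
        "surface temperature": ["ice melt"],
    }
    print(shortest_concept_path(concept_graph, "aerosols", "ice melt"))
```

A search like this always returns a minimum-hop path, which fits the abstract's description of shortest-path reasoning as stable but structurally conservative; the k-shortest-path and random-walk variants would trade that stability for more varied paths.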
These findings provide a systematic behavioral characterization of the limits and capabilities of current graph-LLM integration for scientific synthesis.[16] Eliminating Agentic Workflow for Introduction Generation with Parametric Stage Tokens
Meicong Zhang,Tiancheng su,Guoxiu He
Main category: cs.CL
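TL;DR: This paper proposes STIG (Stage Token for Introduction Generation), which parameterizes the logical structure of a traditional agentic workflow directly into the LLM so that a complete academic introduction is generated in a single inference, with no external workflow calls.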
Details
Motivation: Methods that generate research introductions with predefined agentic workflows suffer from long reasoning chains, error accumulation, and poor textual coherence, and fall short of the logical rigor and structural consistency that introduction writing demands.
Method: Introduce stage tokens that turn the stages of introduction generation into explicit stage signals, and use instruction tuning so the model learns each stage's functional role, the logical order, and the transition patterns, internalizing the workflow's structural knowledge in its parameters.
Result: STIG generates multi-stage introduction text in a single inference and outperforms traditional agentic workflows and other baselines on semantic similarity and sentence-level structural rationality, without explicit calls to an external workflow.
Conclusion: By internalizing workflow logic in the model, STIG improves the quality and efficiency of introduction generation and offers a new way to reduce reliance on external workflows.
Abstract: In recent years, using predefined agentic workflows to guide large language models (LLMs) for literature classification and review has become a research focus. However, writing research introductions is more challenging. It requires rigorous logic, coherent structure, and abstract summarization. Existing workflows often suffer from long reasoning chains, error accumulation, and reduced textual coherence. To address these limitations, we propose eliminating external agentic workflows. Instead, we directly parameterize their logical structure into the LLM. This allows the generation of a complete introduction in a single inference. To this end, we introduce the Stage Token for Introduction Generation (STIG). STIG converts the multiple stages of the original workflow into explicit stage signals. These signals guide the model to follow different logical roles and functions during generation. Through instruction tuning, the model learns the mapping between stage tokens and text functions. It also learns the logical order and transition patterns between stages, encoding this knowledge into the model parameters. Experimental results show that STIG can generate multi-stage text in a single inference. It does not require explicit workflow calls. STIG outperforms traditional agentic workflows and other baselines on metrics of semantic similarity and sentence-level structural rationality. The code is provided in the Supplementary Materials.
[17] Enhancing Business Analytics through Hybrid Summarization of Financial Reports
Tohida Rehman
Main category: cs.CL
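TL;DR: This paper proposes a hybrid extractive-abstractive summarization framework that automatically produces concise, factually accurate Reuters-style summaries from earnings call transcripts, and uses multiple metrics to verify its effectiveness and factual consistency under limited computational resources.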
Details
Motivation: Financial reports and earnings calls are long and complex, and manual analysis is inefficient and prone to bias, so an automated method is needed to extract key business information efficiently and accurately.
Method: A two-stage hybrid pipeline first uses LexRank to extract salient sentences, then summarizes them with fine-tuned BART and PEGASUS models. In parallel, a Longformer Encoder-Decoder (LED) model is fine-tuned to capture long-range contextual dependencies directly.
Result: Long-context models perform best overall, while the hybrid framework remains competitive under computational constraints and shows better factual consistency. Evaluation uses ROUGE, METEOR, MoverScore, and BERTScore, the domain-specific SciBERTScore and FinBERTScore, and entity-level source-precision and F1-target for factual accuracy.
Conclusion: Combining extraction and abstraction can produce high-quality, factually reliable summaries of financial text in resource-constrained settings, supporting fast extraction of insights from long financial documents.
Abstract: Financial reports and earnings communications contain large volumes of structured and semi-structured information, making detailed manual analysis inefficient. Earnings conference calls provide valuable evidence about a firm's performance, outlook, and strategic priorities. The manual analysis of lengthy call transcripts requires substantial effort and is susceptible to interpretive bias and unintentional error. In this work, we present a hybrid summarization framework that combines extractive and abstractive techniques to produce concise and factually reliable Reuters-style summaries from the ECTSum dataset. The proposed two-stage pipeline first applies the LexRank algorithm to identify salient sentences, which are subsequently summarized using fine-tuned variants of BART and PEGASUS designed for resource-constrained settings. In parallel, we fine-tune a Longformer Encoder-Decoder (LED) model to directly capture long-range contextual dependencies in financial documents. Model performance is evaluated using standard automatic metrics, including ROUGE, METEOR, MoverScore, and BERTScore, along with domain-specific variants such as SciBERTScore and FinBERTScore. To assess factual accuracy, we further employ entity-level measures based on source-precision and F1-target. The results highlight complementary trade-offs between approaches: long-context models yield the strongest overall performance, while the hybrid framework achieves competitive results with improved factual consistency under computational constraints. These findings support the development of practical summarization systems for efficiently distilling lengthy financial texts into usable business insights.
[18] Clinical Document Metadata Extraction: A Scoping Review
Kurt Miller,Qiuhao Lu,William Hersh,Kirk Roberts,Steven Bedrick,Andrew Wen,Hongfang Liu
Main category: cs.CL
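TL;DR: This review surveys research on clinical document metadata extraction and finds that methods have shifted from rule-based and traditional machine learning approaches to transformer-based architectures, helped in particular by large language models. Broader use in clinical text processing is expected.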
Details
Motivation: Clinical document metadata (such as document type, structure, and author role) is essential for interpreting clinical information accurately, but document heterogeneity and drift over time make metadata hard to standardize, so automated extraction methods are needed to integrate data across institutions.
Method: Following the PRISMA-ScR guidelines, the authors screened 266 articles published between January 2011 and August 2025, reviewed 67 relevant studies in depth, and summarized their methodological trends, application types, and data availability.
Result: Of the included studies, 45 were methodological, 17 used metadata in downstream tasks, and 5 analyzed metadata composition. Methods have moved from rule-based and feature-engineered traditional models to transformer architectures that need little feature engineering. Public labelled data remains scarce, except for document structure datasets.
Conclusion: Clinical document metadata extraction is moving toward richer representations and better generalization. Large language models have opened up cross-task and cross-dataset applications, and future research is expected to integrate more deeply into clinical applications and workflows.
Abstract: Clinical document metadata, such as document type, structure, author role, medical specialty, and encounter setting, is essential for accurate interpretation of information captured in clinical documents. However, vast documentation heterogeneity and drift over time challenge harmonization of document metadata. Automated extraction methods have emerged to coalesce metadata from disparate practices into target schema. This scoping review aims to catalog research on clinical document metadata extraction, identify methodological trends and applications, and highlight gaps. We followed the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines to identify articles that perform clinical document metadata extraction. We initially found and screened 266 articles published between January 2011 and August 2025, then comprehensively reviewed 67 we deemed relevant to our study. Among the articles included, 45 were methodological, 17 used document metadata as features in a downstream application, and 5 analyzed document metadata composition. We observe myriad purposes for methodological study and application types. Available labelled public data remains sparse except for structural section datasets. Methods for extracting document metadata have progressed from largely rule-based and traditional machine learning with ample feature engineering to transformer-based architectures with minimal feature engineering. The emergence of large language models has enabled broader exploration of generalizability across tasks and datasets, allowing the possibility of advanced clinical text processing systems. We anticipate that research will continue to expand into richer document metadata representations and integrate further into clinical applications and workflows.
[19] Geometric Patterns of Meaning: A PHATE Manifold Analysis of Multi-lingual Embeddings
Wen G Gong
Main category: cs.CL
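TL;DR: Proposes a multi-level analysis framework based on PHATE manifold learning (Semanscope) for studying semantic geometry in multilingual embeddings, revealing systematic geometric patterns and limitations of current models across linguistic levels.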
Details
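Motivation: Existing embedding models struggle to separate semantic from structural components, and there is no systematic analysis of semantic geometry across languages and linguistic levels, so tools are needed to expose these internal representational properties.
Method: Build an analysis framework covering four linguistic levels (sub-character, character, word, and number) and develop the visualization tool Semanscope, which uses PHATE nonlinear dimensionality reduction to probe geometric structure in embedding space.
Result: At the sub-character level, Chinese radicals show geometric collapse, indicating that models cannot distinguish structure from semantics. At the character level, different writing systems show distinct geometric signatures. At the word level, content words form clustering-branching structures across 20 semantic domains. Arabic numbers follow spiral trajectories rather than clusters, contrary to standard distributional semantics assumptions.
Conclusion: PHATE manifold learning is a key tool for analyzing semantic geometry in embedding space and for assessing model effectiveness, and current embedding models still have fundamental limitations in fine-grained semantic representation.
Abstract: We introduce a multi-level analysis framework for examining semantic geometry in multilingual embeddings, implemented through Semanscope (a visualization tool that applies PHATE manifold learning across four linguistic levels). Analysis of diverse datasets spanning sub-character components, alphabetic systems, semantic domains, and numerical concepts reveals systematic geometric patterns and critical limitations in current embedding models. At the sub-character level, purely structural elements (Chinese radicals) exhibit geometric collapse, highlighting model failures to distinguish semantic from structural components. At the character level, different writing systems show distinct geometric signatures. At the word level, content words form clustering-branching patterns across 20 semantic domains in English, Chinese, and German. Arabic numbers organize through spiral trajectories rather than clustering, violating standard distributional semantics assumptions. These findings establish PHATE manifold learning as an essential analytic tool not only for studying geometric structure of meaning in embedding space, but also for validating the effectiveness of embedding models in capturing semantic relationships.
[20] Benchmarking Cross-Lingual Semantic Alignment in Multilingual Embeddings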
Wen G. Gong
Main category: cs.CL
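TL;DR: This paper proposes the Semantic Affinity (SA) metric and the Semanscope framework for evaluating cross-lingual semantic alignment in multilingual embedding models. It finds that the training objective matters more than model scale or architecture: only models trained with translation-pair supervision align well.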
Details
Motivation: Task-driven benchmarks such as MTEB may hide fundamental cross-lingual alignment flaws in multilingual embedding models, and there is no direct measure of alignment quality, so practitioners struggle to pick models with real cross-lingual ability from the hundreds available.
Method: Propose the Semantic Affinity (SA) metric, a ratio of inter-lingual to intra-lingual spread computed with cosine distance, and combine it with PHATE visualization in the Semanscope framework. The benchmark covers 13 models on 4 datasets, for 52 experiments.
Result: Three tiers emerge. Top BERT models (LaBSE, USE, S-BERT) reach high SA (0.68 to 0.70) through translation-pair supervision. LLM embeddings plateau at SA between 0.55 and 0.61 and do not improve with parameter count. MLM-only models (mBERT, XLM-R) score below 0.50, indicating failed cross-lingual alignment. Further analysis suggests that models learn corpus patterns rather than cognitive semantic primitives.
Conclusion: Cross-lingual alignment quality depends mainly on explicit translation supervision rather than on architecture, scale, or the amount of multilingual data. SA separates models by their actual alignment ability and gives practitioners a tool for choosing multilingual embeddings.
Abstract: With hundreds of multilingual embedding models available, practitioners lack clear guidance on which provide genuine cross-lingual semantic alignment versus task performance through language-specific patterns. Task-driven benchmarks (MTEB) may mask fundamental alignment shortcomings. We introduce Semantic Affinity (SA), a bounded (between 0 and 1) metric measuring inter-lingual to intra-lingual spread ratio using cosine distance, combined with PHATE visualization in our Semanscope framework. Benchmarking 13 models across 4 datasets (52 experiments) reveals a three-tier structure: (1) Top BERT models (LaBSE SA = 0.70, USE SA = 0.68, S-BERT SA = 0.68) achieve strong alignment via translation-pair supervision; (2) LLM embeddings plateau at SA between 0.55 and 0.61 regardless of 0.6B to 8B scale; (3) MLM-only BERT models (mBERT, XLM-R, SA < 0.50) fail despite training on more than 100 languages. Training objective, not architecture or scale, determines alignment. Oracle Bone primitives (1200 BCE) expose semantic drift: models learn corpus patterns rather than cognitive primitives. This work provides semantic benchmarking to help practitioners select quality multilingual embeddings from hundreds of available models, showing cross-lingual alignment requires explicit translation supervision, not merely model scale or multilingual data.
[21] Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets
Xin Gao,Xiaoyang Wang,Yun Zhu,Mengzhang Cai,Conghui He,Lijun Wu
Main category: cs.CL
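TL;DR: This paper proposes OpenDataArena (ODA), a closed-loop dataset engineering framework that uses value-anchored rankings and multi-dimensional analysis to turn benchmark evaluation into feedback signals for dataset construction, improving the quality and efficiency of supervised fine-tuning data for large language models.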
Details
Motivation: Construction of supervised fine-tuning (SFT) datasets lacks systematic theoretical guidance and usually relies on heuristic aggregation, which makes it hard to understand how individual samples contribute to model performance.
Method: Propose the OpenDataArena (ODA) closed-loop data engineering framework, which generates feedback signals from value-anchored rankings and multi-dimensional analysis to guide dataset construction. It is instantiated as a two-stage difficulty-aware pipeline for mathematical reasoning (ODA-Math-460k) and an "Anchor-and-Patch" strategy for multi-domain instruction data (ODA-Mixture).
Result: ODA-driven datasets reach state-of-the-art results on benchmarks such as AIME and HMMT, outperform larger open-source baselines on both domain-specific reasoning and general capability, and achieve higher data efficiency.
Conclusion: The study supports a shift toward data-centric AI, in which transparent evaluation is the main driver for building high-quality training data.
Abstract: The construction of Supervised Fine-Tuning (SFT) datasets is a critical yet under-theorized stage in the post-training of Large Language Models (LLMs), as prevalent practices often rely on heuristic aggregation without a systematic understanding of how individual samples contribute to model performance. In this report, we propose a paradigm shift from ad-hoc curation to a closed-loop dataset engineering framework using OpenDataArena (ODA), which leverages value-anchored rankings and multi-dimensional analysis to transform value benchmarking into feedback signals guiding dataset construction. We instantiate this methodology through two new datasets: ODA-Math-460k, a specialized mathematics reasoning dataset that utilizes a novel two-stage difficulty-aware pipeline to achieve State-of-the-Art (SOTA) results on benchmarks such as AIME and HMMT, and ODA-Mixture (100k & 500k), a series of multi-domain instruction datasets built via an "Anchor-and-Patch" strategy that outperforms significantly larger open-source baselines. Our empirical results demonstrate that ODA-driven datasets significantly improve both domain-specific reasoning and general utility while achieving superior data efficiency, validating a transition toward data-centric AI where transparent evaluation serves as the primary engine for engineering high-quality training data.
[22] From Detection to Diagnosis: Advancing Hallucination Analysis with Automated Data Synthesis
Yanyi Liu,Qingwen Yang,Tiezheng Guo,Feiyu Qu,Jun Liu,Yingyou Wen
Main category: cs.CL
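TL;DR: This paper proposes a shift from "detection" to "diagnosis" of large language model hallucinations, introduces the hallucination diagnosis task, and builds HDM-4B-RL, a model that can localize, explain, and correct errors.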
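This entry's method trains with Group Relative Policy Optimization (GRPO) and a reward that combines structural, accuracy, and localization signals. The sketch below shows the group-relative advantage step as it is commonly described for GRPO: rewards for several sampled responses to the same prompt are normalized by the group mean and standard deviation. The reward weights and the function `diagnosis_reward` are hypothetical placeholders, not the paper's reward function.

```python
from statistics import mean, pstdev


def diagnosis_reward(structure_ok, label_correct, span_overlap, weights=(0.2, 0.5, 0.3)):
    """Hypothetical composite reward: output format, detection accuracy, localization."""
    w_struct, w_acc, w_loc = weights
    return w_struct * float(structure_ok) + w_acc * float(label_correct) + w_loc * span_overlap


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled diagnoses for one prompt (illustrative values only).
rewards = [
    diagnosis_reward(True, True, 0.9),
    diagnosis_reward(True, False, 0.4),
    diagnosis_reward(False, False, 0.0),
    diagnosis_reward(True, True, 0.5),
]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f} advantage={a:+.2f}")
```

Because the baseline is the group mean, no separate value network is needed; responses are pushed up or down only relative to their siblings.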
Details
Motivation: Existing hallucination research is limited to binary detection and gives no interpretable, actionable feedback, which makes it hard to use for model improvement.
Method: Define the hallucination diagnosis task; develop HDG, a pipeline that automatically generates training samples with diagnostic metadata; train the HDM-4B-RL model with GRPO reinforcement learning.
Result: HDM-4B-RL surpasses previous state-of-the-art detection models on HaluEval and matches larger general-purpose models on diagnosis tasks.
Conclusion: Hallucination diagnosis is feasible and valuable, and offers an effective method for building more trustworthy generative AI systems.
Abstract: Hallucinations in Large Language Models (LLMs), defined as the generation of content inconsistent with facts or context, represent a core obstacle to their reliable deployment in critical domains. Current research primarily focuses on binary "detection" approaches that, while capable of identifying hallucinations, fail to provide interpretable and actionable feedback for model improvement, thus limiting practical utility. To address this limitation, a new research paradigm is proposed, shifting from "detection" to "diagnosis". The Hallucination Diagnosis Task is introduced, a task which requires models to not only detect hallucinations, but also perform error localization, causal explanation, and content correction. We develop the Hallucination Diagnosis Generator (HDG), an automated pipeline that systematically generates high-quality training samples with rich diagnostic metadata from raw corpora through multi-dimensional augmentation strategies including controlled fact fabrication and reasoning chain perturbation. Using HDG-generated data, we train HDM-4B-RL, a 4-billion-parameter hallucination diagnosis model, employing Group Relative Policy Optimization (GRPO) with a comprehensive reward function incorporating structural, accuracy, and localization signals. Experimental results demonstrate that our model surpasses previous state-of-the-art detection models on the HaluEval benchmark while achieving comparable performance to advanced general-purpose models. In comprehensive diagnosis tasks, HDM-4B-RL matches the capabilities of larger general models while maintaining a smaller size. This work validates the feasibility and value of hallucination diagnosis, providing an effective methodology for building more trustworthy and reliable generative AI systems.
[23] Stable and Explainable Personality Trait Evaluation in Large Language Models with Internal Activations
Xiaoxu Ma,Xiangbo Zhang,Zhenyu Weng
Main category: cs.CL
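TL;DR: Proposes PVNI, a stable and explainable method for evaluating personality traits in large language models from internal activations. It extracts a persona vector with contrastive prompts and interpolates along that vector to obtain a neutral score, giving a more robust evaluation.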
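One possible reading of the persona-vector idea in this entry can be sketched with plain lists. It assumes the persona vector is the difference between mean activations under trait-high and trait-low contrastive prompts, and that the neutral score is the position of a neutral prompt's activation along that axis. Both assumptions, the toy activations, and all function names are illustrative and are not the PVNI procedure itself.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def persona_axis(high_acts, low_acts):
    """Return the trait-low anchor and the direction pointing toward trait-high."""
    high_mean = mean_vector(high_acts)
    low_mean = mean_vector(low_acts)
    direction = [h - l for h, l in zip(high_mean, low_mean)]
    return low_mean, direction


def position_on_axis(activation, low_mean, direction):
    """Project onto the axis: 0.0 is the trait-low anchor, 1.0 the trait-high anchor."""
    offset = [a - l for a, l in zip(activation, low_mean)]
    return dot(offset, direction) / dot(direction, direction)


# Toy 2-dimensional "activations" for contrastive prompts (made-up numbers).
high_acts = [[2.0, 1.0], [2.0, 3.0]]
low_acts = [[0.0, 1.0], [0.0, 3.0]]
neutral_act = [1.0, 5.0]

low_mean, direction = persona_axis(high_acts, low_acts)
neutral_score = position_on_axis(neutral_act, low_mean, direction)
print(neutral_score)
```

Components of the neutral activation that are orthogonal to the persona direction do not affect the score, which is one reason an axis-based readout could be less sensitive to prompt phrasing than questionnaire answers.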
Details
Motivation: Questionnaire-based personality evaluation is unstable and hard to explain, and its results are sensitive to small changes in prompt wording, so it cannot support reliable deployment and comparison of large models.
Method: Propose Persona-Vector Neutrality Interpolation (PVNI): use contrastive prompts to extract a persona vector for a target trait from the model's internal activations, estimate the neutral score by interpolating along that vector, and evaluate the trait through the interpretable relationship between the neutral representation and the persona direction.
Result: Experiments on a range of LLMs show that PVNI is substantially more stable than existing methods across questionnaire and role-play variants, while remaining explainable.
Conclusion: PVNI offers a more stable and explainable paradigm for personality evaluation in large language models, supporting model understanding and responsible deployment.
Abstract: Evaluating personality traits in Large Language Models (LLMs) is key to model interpretation, comparison, and responsible deployment. However, existing questionnaire-based evaluation methods exhibit limited stability and offer little explainability, as their results are highly sensitive to minor variations in prompt phrasing or role-play configurations. To address these limitations, we propose an internal-activation-based approach, termed Persona-Vector Neutrality Interpolation (PVNI), for stable and explainable personality trait evaluation in LLMs. PVNI extracts a persona vector associated with a target personality trait from the model's internal activations using contrastive prompts. It then estimates the corresponding neutral score by interpolating along the persona vector as an anchor axis, enabling an interpretable comparison between the neutral prompt representation and the persona direction. We provide a theoretical analysis of the effectiveness and generalization properties of PVNI. Extensive experiments across diverse LLMs demonstrate that PVNI yields substantially more stable personality trait evaluations than existing methods, even under questionnaire and role-play variants.
[24] Bears, all bears, and some bears. Language Constraints on Language Models' Inductive Inferences
Sriram Padmanabhan,Siyuan Song,Kanishka Misra
Main category: cs.CL
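TL;DR: The study asks whether Vision Language Models, like children, respond differently in inductive inference to different kinds of linguistic expressions (generics, universal quantifiers, and indefinite plurals). It finds that model behavior matches human behavior and that the differences stem from inductive constraints rather than surface form.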
Details
Motivation: Explore how language subtly shapes inductive inference, and test whether general-purpose statistical learners such as Vision Language Models represent different types of propositions differently, as children do.
Method: Replicate the experiment of Gelman et al. First run precondition tests to check that models can identify categories in images and are sensitive to "all" and "some", then run the main experiment on how models extend novel properties under different statement types.
Result: Vision Language Models show the same pattern as children aged 4 and older (all > generics > some), and post-hoc analyses indicate that the differences are organized by inductive constraints rather than surface form.
Conclusion: Vision Language Models reproduce human language-induced inductive preferences in their behavior, and their internal representations reflect the underlying inductive structure, suggesting human-like conceptual reasoning.
Abstract: Language places subtle constraints on how we make inductive inferences. Developmental evidence by Gelman et al. (2002) has shown children (4 years and older) to differentiate among generic statements ("Bears are daxable"), universally quantified NPs ("all bears are daxable") and indefinite plural NPs ("some bears are daxable") in extending novel properties to a specific member (all > generics > some), suggesting that they represent these types of propositions differently. We test if these subtle differences arise in general purpose statistical learners like Vision Language Models, by replicating the original experiment. After running a series of precondition tests (robust identification of categories in images and sensitivities to all and some), followed by the original experiment, we find behavioral alignment between models and humans. Post-hoc analyses on their representations revealed that these differences are organized based on inductive constraints and not surface-form differences.
[25] MedRedFlag: Investigating how LLMs Redirect Misconceptions in Real-World Health Communication
Sraavya Sambara,Yuan Pu,Ayman Ali,Vishala Mishra,Lionel Wong,Monica Agrawal
Main category: cs.CL
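TL;DR: The study examines how large language models (LLMs) handle real medical questions that contain false premises, and finds that they often fail to redirect patients correctly, which poses potential medical risk.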
Details
Motivation: Assess whether LLMs can handle implicit false assumptions in real medical consultations, to establish their safety and clinical suitability.
Method: Build MedRedFlag, a dataset of more than 1,100 Reddit medical questions that require redirection, and systematically compare responses from state-of-the-art LLMs with those from clinicians.
Result: Even when they detect the false premise, LLMs often fail to redirect effectively, which can lead to suboptimal medical decisions.
Conclusion: Current LLMs have substantial safety gaps in real-world medical communication and need improvement to protect patients.
Abstract: Real-world health questions from patients often unintentionally embed false assumptions or premises. In such cases, safe medical communication typically involves redirection: addressing the implicit misconception and then responding to the underlying patient context, rather than the original question. While large language models (LLMs) are increasingly being used by lay users for medical advice, they have not yet been tested for this crucial competency. Therefore, in this work, we investigate how LLMs react to false premises embedded within real-world health questions. We develop a semi-automated pipeline to curate MedRedFlag, a dataset of 1100+ questions sourced from Reddit that require redirection. We then systematically compare responses from state-of-the-art LLMs to those from clinicians. Our analysis reveals that LLMs often fail to redirect problematic questions, even when the problematic premise is detected, and provide answers that could lead to suboptimal medical decision making. Our benchmark and results reveal a novel and substantial gap in how LLMs perform under the conditions of real-world health communication, highlighting critical safety concerns for patient-facing medical AI systems. Code and dataset are available at https://github.com/srsambara-1/MedRedFlag.
[26] OUTLINEFORGE: Hierarchical Reinforcement Learning with Explicit States for Scientific Writing
Yilin Bao,Ziyao He,Zayden Yang
Main category: cs.CL
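TL;DR: Proposes a reinforcement learning framework for generating scientific paper outlines, using two-stage optimization to improve document-level planning and factual consistency.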
Details
Motivation: Current large language models generate scientific papers with inconsistent global structure, incomplete coverage of the input, and inaccurate citations.
Method: Cast outline construction as a long-horizon planning problem over hierarchical document structures, and optimize in two stages: backward outline reconstruction, then forward value-guided reinforcement learning.
Result: On a newly proposed benchmark for scientific paper generation, the model clearly outperforms strong baselines in structural coherence and citation reliability.
Conclusion: The method improves document-level planning and factual accuracy in scientific text generation.
Abstract: Scientific paper generation requires document-level planning and factual grounding, but current large language models, despite their strong local fluency, often fail in global structure, input coverage, and citation consistency. We present a reinforcement learning framework that casts scientific outline construction as a long-horizon planning problem over hierarchical document structures. Our approach models edits to evolving outlines through structured actions, enabling the system to incrementally build a complete scientific manuscript. To support effective and stable learning, we introduce a two-stage optimization procedure consisting of (i) backward outline reconstruction from partial plans to enforce global structural consistency, and (ii) forward value-guided reinforcement learning with rewards explicitly modeling scientific correctness, discourse coherence, and citation fidelity. In addition, we introduce a benchmark for scientific paper generation that evaluates document planning, input utilization, reference faithfulness, outline organization, and content-level factual accuracy. Our results show consistent improvements over strong neural and LLM baselines, particularly in long-range structural coherence and citation reliability.
[27] Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL
Yifei Shen,Yilun Zhao,Justice Ou,Tinglin Huang,Arman Cohan
Main category: cs.CL
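TL;DR: CLINSQL is a clinical text-to-SQL benchmark built on MIMIC-IV v3.1 with 633 expert-annotated tasks that require multi-table joins, clinically meaningful filters, and executable SQL queries. Evaluation shows that current models still fall short of clinical reliability.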
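An execution check of the kind this entry relies on can be sketched by running a predicted query and a reference query against the same database and comparing the rows. The example uses Python's built-in `sqlite3` with a made-up `admissions` table; the schema, the queries, and the helper `same_result` are hypothetical, and the benchmark's rubric-based analysis is not reproduced here.

```python
import sqlite3


def same_result(conn, predicted_sql, reference_sql):
    """True if the predicted query runs and returns the same rows as the reference."""
    try:
        predicted = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that does not execute cannot pass
    reference = conn.execute(reference_sql).fetchall()
    return sorted(predicted) == sorted(reference)  # ignore row order


# Toy in-memory table standing in for an EHR admissions table.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE admissions (subject_id INTEGER, age INTEGER, los_days REAL);
    INSERT INTO admissions VALUES (1, 70, 4.5), (2, 55, 2.0), (3, 81, 9.0);
    """
)

reference = "SELECT subject_id FROM admissions WHERE age >= 65"
ok = same_result(conn, "SELECT subject_id FROM admissions WHERE age > 64", reference)
wrong_filter = same_result(conn, "SELECT subject_id FROM admissions WHERE age >= 80", reference)
broken = same_result(conn, "SELECT nope FROM admissions", reference)
print(ok, wrong_filter, broken)
```

Matching rows on one database instance does not prove two queries are equivalent in general, which is one reason a benchmark might pair execution checks with rubric-based review of the SQL itself.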
Details
Motivation: Existing text-to-SQL models struggle to meet clinical needs on real-world electronic health records (EHRs), because they cannot reason jointly over multi-table structure, temporal windows, and patient cohorts.
Method: Build the CLINSQL benchmark of 633 expert-annotated complex clinical SQL tasks on MIMIC-IV v3.1, and evaluate 22 proprietary and open-source models under Chain-of-Thought self-refinement, using rubric-based analysis with execution checks.
Result: On the test set, GPT-5-mini reaches a 74.7% execution score, DeepSeek-R1 is the best open-source model at 69.2%, and Gemini-2.5-Pro scores 85.5% on Easy tasks but drops to 67.2% on Hard tasks.
Conclusion: Large language models are making progress on CLINSQL, but their performance remains far from clinical reliability. The benchmark points the way toward text-to-SQL for real-world EHR analytics.
Abstract: Real-world clinical text-to-SQL requires reasoning over heterogeneous EHR tables, temporal windows, and patient-similarity cohorts to produce executable queries. We introduce CLINSQL, a benchmark of 633 expert-annotated tasks on MIMIC-IV v3.1 that demands multi-table joins, clinically meaningful filters, and executable SQL. Solving CLINSQL entails navigating schema metadata and clinical coding systems, handling long contexts, and composing multi-step queries beyond traditional text-to-SQL. We evaluate 22 proprietary and open-source models under Chain-of-Thought self-refinement and use rubric-based SQL analysis with execution checks that prioritize critical clinical requirements. Despite recent advances, performance remains far from clinical reliability: on the test set, GPT-5-mini attains 74.7% execution score, DeepSeek-R1 leads open-source at 69.2%, and Gemini-2.5-Pro drops from 85.5% on Easy to 67.2% on Hard. Progress on CLINSQL marks tangible advances toward clinically reliable text-to-SQL for real-world EHR analytics.
[28] Clozing the Gap: Exploring Why Language Model Surprisal Outperforms Cloze Surprisal
Sathvik Nair,Byung-Doh Oh
Main category: cs.CL
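TL;DR: Language model (LM) probabilities predict word processing difficulty better than human response probabilities from the cloze task. This paper examines three reasons for the advantage: higher resolution, better separation of semantically similar words, and more accurate probability estimates for low-frequency words.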
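The resolution point in this entry follows from simple arithmetic: with N cloze respondents, the smallest nonzero probability is 1/N, so cloze surprisal is capped at log2(N) bits and any word nobody produced has no finite estimate. The sketch below illustrates this with made-up responses; the word counts and the respondent total of 40 are hypothetical.

```python
import math


def surprisal_bits(probability):
    """Surprisal in bits: -log2 p."""
    return -math.log2(probability)


def cloze_probabilities(responses):
    """Relative frequency of each completion produced by the respondents."""
    counts = {}
    for word in responses:
        counts[word] = counts.get(word, 0) + 1
    total = len(responses)
    return {word: count / total for word, count in counts.items()}


# Hypothetical completions from 40 respondents for one sentence context.
responses = ["coffee"] * 36 + ["tea"] * 3 + ["water"]
cloze = cloze_probabilities(responses)

min_nonzero = 1 / len(responses)             # coarsest step the cloze data can express
max_surprisal = surprisal_bits(min_nonzero)  # ceiling for any observed word

for word, p in cloze.items():
    print(f"{word}: p={p:.3f}, surprisal={surprisal_bits(p):.2f} bits")
print(f"ceiling: {max_surprisal:.2f} bits; unseen words have no finite cloze surprisal")
```

A language model, by contrast, assigns a nonzero probability to every vocabulary item, so it can separate two unlikely continuations that a 40-person cloze study would both record as zero.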
Details
Motivation: It needs to be established why LM probabilities predict language processing effort better than cloze probabilities, so that the scientific conclusions drawn from them are sound.
Method: Compare how well LM probabilities and cloze probabilities predict processing difficulty, and test three hypotheses: a difference in resolution, the ability to distinguish semantically similar words, and the accuracy of probability estimates for low-frequency words.
Result: The advantage of LM probabilities comes mainly from higher resolution, better separation of semantically similar words, and more accurate probabilities for low-frequency words.
Conclusion: Cloze studies should improve their resolution, and further work should test whether human prediction is also sensitive to the fine-grained distinctions that LMs capture.
Abstract: How predictable a word is can be quantified in two ways: using human responses to the cloze task or using probabilities from language models (LMs). When used as predictors of processing effort, LM probabilities outperform probabilities derived from cloze data. However, it is important to establish that LM probabilities do so for the right reasons, since different predictors can lead to different scientific conclusions about the role of prediction in language comprehension. We present evidence for three hypotheses about the advantage of LM probabilities: not suffering from low resolution, distinguishing semantically similar words, and accurately assigning probabilities to low-frequency words. These results call for efforts to improve the resolution of cloze studies, coupled with experiments on whether human-like prediction is also as sensitive to the fine-grained distinctions made by LM probabilities.
[29] Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations
Christabel Acquaye,Yi Ting Huang,Marine Carpuat,Rachel Rudinger
Main category: cs.CL
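TL;DR: This study explores using open-source large language models (LLMs) to predict how difficult math questions are for real students. By simulating student personas at different grade levels and fitting Item Response Theory (IRT) models, it obtains results that correlate strongly with real data.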
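To make the IRT step in this entry concrete, the sketch below uses the one-parameter (Rasch) model, in which the probability of a correct answer is a logistic function of ability minus difficulty. It estimates one item's difficulty by gradient ascent on the log-likelihood while treating student abilities as known and fixed. That simplification, the toy data, and the function names are assumptions; a full IRT fit would estimate abilities and difficulties jointly, and the entry does not say which IRT variant was used.

```python
import math


def p_correct(ability, difficulty):
    """Rasch (1PL) model: P(correct) = sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def fit_difficulty(abilities, outcomes, steps=200, lr=0.5):
    """Maximum-likelihood difficulty for one item, with abilities held fixed."""
    difficulty = 0.0
    for _ in range(steps):
        # d(log-likelihood)/d(difficulty) = sum(p_i - y_i)
        gradient = sum(p_correct(a, difficulty) - y for a, y in zip(abilities, outcomes))
        difficulty += lr * gradient / len(abilities)
    return difficulty


# Six simulated students with assumed abilities, and their answers to two items.
abilities = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
easy_item = [0, 1, 1, 1, 1, 1]  # almost everyone answers correctly
hard_item = [0, 0, 0, 0, 1, 1]  # only the strongest students succeed

easy_difficulty = fit_difficulty(abilities, easy_item)
hard_difficulty = fit_difficulty(abilities, hard_item)
print(f"easy item: {easy_difficulty:.2f}, hard item: {hard_difficulty:.2f}")
```

In a setup like the one described, the fitted difficulties for all items would then be correlated with the real-world item statistics.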
Details
Motivation: Standardized math assessments usually rely on expensive human pilot studies to establish item difficulty. This study asks whether open-source LLMs can predict item difficulty more efficiently and cheaply.
Method: Prompt an LLM to role-play students at different grade levels (4th, 8th, 12th) and proficiency levels to simulate a "classroom", then fit Item Response Theory (IRT) models to the simulated outcomes to estimate item difficulty parameters. Compare the predicted difficulties with the real difficulty data from NAEP, and test how classroom size, student naming, and stratification by gender and race affect the predictions.
Result: Correlations between simulated and real-world difficulty reach 0.75 (grade 4), 0.76 (grade 8), and 0.82 (grade 12). Named students give better predictions than numbered students, and stratifying names by gender and race improves them further. Surprisingly, Gemma, which is weaker at math, outperforms the stronger Llama and Qwen models.
Conclusion: LLMs are poor direct judges of item difficulty, but role-play simulation combined with IRT models predicts the difficulty that real students experience, works well with open-source models, and offers a low-cost route for educational assessment.
Abstract: Standardized math assessments require expensive human pilot studies to establish the difficulty of test items. We investigate the predictive value of open-source large language models (LLMs) for evaluating the difficulty of multiple-choice math questions for real-world students. We show that, while LLMs are poor direct judges of problem difficulty, simulation-based approaches with LLMs yield promising results under the right conditions. Under the proposed approach, we simulate a "classroom" of 4th, 8th, or 12th grade students by prompting the LLM to role-play students of varying proficiency levels. We use the outcomes of these simulations to fit Item Response Theory (IRT) models, comparing learned difficulty parameters for items to their real-world difficulties, as determined by item-level statistics furnished by the National Assessment of Educational Progress (NAEP). We observe correlations as high as 0.75, 0.76, and 0.82 for grades 4, 8, and 12, respectively. In our simulations, we experiment with different "classroom sizes," showing tradeoffs between computation size and accuracy. We find that role-playing with named students improves predictions (compared to student IDs), and stratifying names across gender and race further improves predictions. Our results show that LLMs with relatively weaker mathematical abilities (Gemma) actually yield better real-world difficulty predictions than mathematically stronger models (Llama and Qwen), further underscoring the suitability of open-source models for the task.
[30] Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG
David Samuel Setiawan,Raphaël Merx,Jey Han Lau
Main category: cs.CL
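TL;DR: Proposes a hybrid framework that combines a fine-tuned NMT model with a retrieval-augmented large language model, mitigating the performance drop that low-resource languages suffer under domain shift and recovering most of the lost quality on Dhao translation.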
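The draft-then-refine setup in this entry can be sketched as a retrieval step that selects `k` parallel examples and a prompt that hands the NMT draft to an LLM for correction. The character-trigram Jaccard retriever, the prompt wording, the placeholder translations, and all function names below are hypothetical; the entry reports that the number of retrieved examples matters more than the retrieval algorithm, so the retriever here is deliberately simple.

```python
def char_ngrams(text, n=3):
    """Set of lowercase character n-grams in a string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def retrieve_examples(source, parallel_corpus, k):
    """Return the k (source, target) pairs whose source side best overlaps the query."""
    query = char_ngrams(source)

    def overlap(pair):
        grams = char_ngrams(pair[0])
        union = query | grams
        return len(query & grams) / len(union) if union else 0.0

    return sorted(parallel_corpus, key=overlap, reverse=True)[:k]


def build_prompt(source, draft, examples):
    """Assemble a post-editing prompt from retrieved examples and the NMT draft."""
    lines = ["Reference translation pairs:"]
    lines += [f"- {src} => {tgt}" for src, tgt in examples]
    lines += [f"Source: {source}", f"Draft translation: {draft}", "Corrected translation:"]
    return "\n".join(lines)


# Placeholder corpus; the target sides stand in for real reference translations.
corpus = [
    ("In the beginning God created the heavens", "<reference translation 1>"),
    ("The Lord is my shepherd", "<reference translation 2>"),
    ("Blessed are the poor in spirit", "<reference translation 3>"),
]
source = "The Lord is my rock"
examples = retrieve_examples(source, corpus, k=2)
prompt = build_prompt(source, "<NMT draft>", examples)
print(prompt)
```

Raising `k` is the knob that corresponds to the context volume the entry identifies as the main driver of recovery.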
Details
Motivation: Neural machine translation for low-resource languages degrades sharply under domain shift, especially when diverse training data is unavailable. This paper addresses that problem.
Method: A hybrid framework in which an NMT model fine-tuned on New Testament data produces an initial draft, which a large language model then corrects using Retrieval-Augmented Generation (RAG). The analysis focuses on how the number of retrieved examples and the retrieval algorithm affect performance.
Result: On the Old Testament test set, performance rises from 27.11 chrF++ to 35.21 chrF++, a recovery of 8.10 that approaches the original in-domain quality. The gain depends mainly on the number of retrieved examples rather than the choice of retrieval algorithm.
Conclusion: The hybrid framework makes up for most of the cross-domain translation loss in low-resource languages, and an LLM with RAG can act as a strong "safety net" that repairs severe NMT failures in zero-shot domains.
Abstract: Neural Machine Translation (NMT) models for low-resource languages suffer significant performance degradation under domain shift. We quantify this challenge using Dhao, an indigenous language of Eastern Indonesia with no digital footprint beyond the New Testament (NT). When applied to the unseen Old Testament (OT), a standard NMT model fine-tuned on the NT drops from an in-domain score of 36.17 chrF++ to 27.11 chrF++. To recover this loss, we introduce a hybrid framework where a fine-tuned NMT model generates an initial draft, which is then refined by a Large Language Model (LLM) using Retrieval-Augmented Generation (RAG). The final system achieves 35.21 chrF++ (+8.10 recovery), effectively matching the original in-domain quality. Our analysis reveals that this performance is driven primarily by the number of retrieved examples rather than the choice of retrieval algorithm. Qualitative analysis confirms the LLM acts as a robust "safety net," repairing severe failures in zero-shot domains.
[31] SocraticKG: Knowledge Graph Construction via QA-Driven Fact Extraction
Sanghyeok Choi,Woosang Jeon,Kyuseok Yang,Taehyeong Kim
Main category: cs.CL
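TL;DR: SocraticKG is an automated knowledge graph construction method based on question-answer pairs. It uses 5W1H-guided question expansion to unfold document semantics systematically before triple extraction, improving knowledge coverage and structural coherence.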
Details
Motivation: LLM-based knowledge graph construction faces a trade-off between factual coverage and relational coherence: over-segmentation fragments relations, while premature consolidation loses information.
Method: Propose SocraticKG, which uses 5W1H-guided question-answer pairs as a structured intermediate representation that explicitly captures contextual dependencies and implicit relations in the source document before triple extraction.
Result: Experiments on the MINE benchmark show that SocraticKG extracts substantially more knowledge while keeping higher factual retention and structural cohesion.
Conclusion: QA-mediated semantic scaffolding plays a key role in knowledge graph construction and helps produce more coherent and reliable knowledge structures.
Abstract: Constructing Knowledge Graphs (KGs) from unstructured text provides a structured framework for knowledge representation and reasoning, yet current LLM-based approaches struggle with a fundamental trade-off: factual coverage often leads to relational fragmentation, while premature consolidation causes information loss. To address this, we propose SocraticKG, an automated KG construction method that introduces question-answer pairs as a structured intermediate representation to systematically unfold document-level semantics prior to triple extraction. By employing 5W1H-guided QA expansion, SocraticKG captures contextual dependencies and implicit relational links typically lost in direct KG extraction pipelines, providing explicit grounding in the source document that helps mitigate implicit reasoning errors. Evaluation on the MINE benchmark demonstrates that our approach effectively addresses the coverage-connectivity trade-off, achieving superior factual retention while maintaining high structural cohesion even as extracted knowledge volume substantially expands. These results highlight that QA-mediated semantic scaffolding plays a critical role in structuring semantics prior to KG extraction, enabling more coherent and reliable graph construction in subsequent stages.
[32] EHRNavigator: A Multi-Agent System for Patient-Level Clinical Question Answering over Heterogeneous Electronic Health Records
Lingfei Qian,Mauro Giuffre,Yan Wang,Huan He,Qianqian Xie,Xuguang Ai,Xeuqing Peng,Fan Ma,Ruey-Ling Weng,Donald Wright,Adan Wang,Qingyu Chen,Vipina K. Keloth,Hua Xu
Main category: cs.CL
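TL;DR: EHRNavigator is a multi-agent framework for patient-level question answering over heterogeneous and multimodal electronic health record (EHR) data. Evaluated under realistic clinical conditions, it shows strong generalization and clinical usability.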
Details
Motivation: Existing natural language question-answering systems are mostly evaluated on benchmark datasets, which do not reflect real clinical needs, and they lack support for complex EHR structure, temporal reasoning, and multimodal data integration.
Method: Propose EHRNavigator, a multi-agent framework in which AI agents navigate real hospital environments and answer EHR-based questions, supporting cross-institution data, temporal reasoning, and multimodal evidence fusion.
Result: Evaluation on public benchmark and institutional datasets shows that EHRNavigator reaches 86% accuracy on real-world cases with clinically acceptable response times.
Conclusion: EHRNavigator narrows the gap between benchmark evaluation and clinical deployment, offering a robust, adaptive, and efficient solution for real-world EHR question answering.
Abstract: Clinical decision-making increasingly relies on timely and context-aware access to patient information within Electronic Health Records (EHRs), yet most existing natural language question-answering (QA) systems are evaluated solely on benchmark datasets, limiting their practical relevance. To overcome this limitation, we introduce EHRNavigator, a multi-agent framework that harnesses AI agents to perform patient-level question answering across heterogeneous and multimodal EHR data. We assessed its performance using both public benchmark and institutional datasets under realistic hospital conditions characterized by diverse schemas, temporal reasoning demands, and multimodal evidence integration. Through quantitative evaluation and clinician-validated chart review, EHRNavigator demonstrated strong generalization, achieving 86% accuracy on real-world cases while maintaining clinically acceptable response times. Overall, these findings confirm that EHRNavigator effectively bridges the gap between benchmark evaluation and clinical deployment, offering a robust, adaptive, and efficient solution for real-world EHR question answering.
[33] EmplifAI: a Fine-grained Dataset for Japanese Empathetic Medical Dialogues in 28 Emotion Labels
Wan Jou She,Lis Kanashiro Pereira,Fei Cheng,Sakiko Yahata,Panote Siriaraya,Eiji Aramaki
Main category: cs.CL
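TL;DR: This paper introduces EmplifAI, a Japanese empathetic dialogue dataset designed to help patients with chronic conditions cope with complex, shifting emotions. Built on the GoEmotions taxonomy, it contains 28 fine-grained emotion categories, 280 medical situations, and 4125 two-turn dialogues, collected and validated through crowdsourcing and expert review. BERTScore is used to evaluate the emotional alignment of several LLMs on situation-dialogue pairs, with F1 reaching 0.83. A Japanese LLM fine-tuned on EmplifAI improves in fluency, general empathy, and emotion-specific empathy. The study also compares LLM-as-a-Judge scores with human ratings to validate the evaluation pipeline and discuss potential risks.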
Details
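Motivation: Patients with chronic illness experience complex, shifting emotions across the stages of disease management. Existing dialogue datasets struggle to capture these nuances, and empathetic dialogue resources for Japanese and for medical settings are especially scarce. A situated, fine-grained, medically relevant Japanese empathetic dialogue dataset is therefore needed to improve AI for medical psychological support.
Method: Selects and localizes 28 fine-grained emotion categories from the GoEmotions taxonomy, designs 280 medical situations, generates two-turn dialogues through crowdsourcing, and has clinical experts review them for quality. With the resulting EmplifAI dataset, BERTScore is used to evaluate the emotional alignment of several LLMs on situation-dialogue pairs, and the Japanese LLM LLM-jp-3.1-13b-instruct4 is fine-tuned to measure the improvement. An LLM-as-a-Judge setup is also compared with human raters to analyze its validity and potential bias.
Result: EmplifAI contains 280 situations and 4125 dialogues covering 28 emotion categories. BERTScore evaluation gives a highest F1 of 0.83 on emotional alignment. The fine-tuned Japanese LLM outperforms the baseline in fluency, general empathy, and emotion-specific empathy. LLM and human scores are moderately correlated, which suggests LLMs can serve as an auxiliary evaluation tool while also exposing their limits in understanding emotional depth and their risk of misjudgment.
Conclusion: EmplifAI is a high-quality, situated, fine-grained Japanese empathetic dialogue dataset suited to research on emotional-support dialogue systems in chronic illness settings. Experiments show that it improves the empathetic expression of Japanese LLMs. The study also stresses caution when using LLMs for automatic evaluation and recommends combining them with human review.
Abstract: This paper introduces EmplifAI, a Japanese empathetic dialogue dataset designed to support patients coping with chronic medical conditions. These patients often experience a wide range of positive and negative emotions (e.g., hope and despair) that shift across different stages of disease management. EmplifAI addresses this complexity by providing situation-based dialogues grounded in 28 fine-grained emotion categories, adapted and validated from the GoEmotions taxonomy. The dataset includes 280 medically contextualized situations and 4125 two-turn dialogues, collected through crowdsourcing and expert review. To evaluate emotional alignment in empathetic dialogues, we assessed model predictions on situation-dialogue pairs using BERTScore across multiple large language models (LLMs), achieving F1 scores of 0.83. Fine-tuning a baseline Japanese LLM (LLM-jp-3.1-13b-instruct4) with EmplifAI resulted in notable improvements in fluency, general empathy, and emotion-specific empathy. Furthermore, we compared the scores assigned by LLM-as-a-Judge and human raters on dialogues generated by multiple LLMs to validate our evaluation pipeline and discuss the insights and potential risks derived from the correlation analysis.
[34] Long-Chain Reasoning Distillation via Adaptive Prefix Alignment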
Zhenghao Liu,Zhuoyang Wu,Xinze Li,Yukun Yan,Shuo Wang,Zulong Chen,Yu Gu,Ge Yu,Maosong Sun
Main category: cs.CL
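TL;DR: Proposes P-ALIGN, a distillation framework that uses adaptive prefix alignment to improve the mathematical reasoning of small models, outperforming baselines by over 3% on multiple benchmarks.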
Details
Motivation: Teacher-generated reasoning trajectories are overly long and complex, which makes them hard for student models to learn and creates a mismatch between the supervision signal and the student's learning capacity.
Method: Proposes the P-ALIGN distillation framework, which adaptively truncates teacher-generated reasoning trajectories by judging whether the remaining suffix is concise and sufficient to guide the student model, then uses the teacher-generated prefix to supervise the student for effective prefix alignment.
Result: Experiments on multiple mathematical reasoning benchmarks show that P-ALIGN beats all baselines by over 3%. Analysis shows that its prefixes provide more effective supervision signals while avoiding the negative impact of redundant and uncertain reasoning components.
Conclusion: By adaptively using prefixes of teacher reasoning trajectories for distillation, P-ALIGN markedly improves the mathematical reasoning of small models and offers a new approach to efficient transfer of reasoning knowledge.
Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in solving complex mathematical problems. Recent studies show that distilling long reasoning trajectories can effectively enhance the reasoning performance of small-scale student models. However, teacher-generated reasoning trajectories are often excessively long and structurally complex, making them difficult for student models to learn. This mismatch leads to a gap between the provided supervision signal and the learning capacity of the student model. To address this challenge, we propose Prefix-ALIGNment distillation (P-ALIGN), a framework that fully exploits teacher CoTs for distillation through adaptive prefix alignment. Specifically, P-ALIGN adaptively truncates teacher-generated reasoning trajectories by determining whether the remaining suffix is concise and sufficient to guide the student model. Then, P-ALIGN leverages the teacher-generated prefix to supervise the student model, encouraging effective prefix alignment. Experiments on multiple mathematical reasoning benchmarks demonstrate that P-ALIGN outperforms all baselines by over 3%. Further analysis indicates that the prefixes constructed by P-ALIGN provide more effective supervision signals, while avoiding the negative impact of redundant and uncertain reasoning components. All code is available at https://github.com/NEUIR/P-ALIGN.
[35] Deriving Character Logic from Storyline as Codified Decision Trees
Letian Peng,Kun Zhou,Longfei Yun,Yupeng Hou,Jingbo Shang
Main category: cs.CL
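TL;DR: Proposes Codified Decision Trees (CDT), a data-driven framework that induces executable, interpretable decision structures from large-scale narrative data to improve the behavioral consistency and reliability of role-playing agents.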
Details
Motivation: Behavioral profiles for existing role-playing agents are mostly unstructured, non-executable, and weakly validated, which makes agent behavior brittle. A more reliable way to build executable, verifiable profiles is needed.
Method: CDT iteratively induces candidate scene-action rules from large-scale narrative data, validates them against the data, and refines them through hierarchical specialization. The result is a tree of conditional rules: internal nodes represent validated scene conditions and leaves encode concrete behavioral statements, which enables deterministic retrieval of context-appropriate rules.
Result: On multiple benchmarks covering 85 characters across 16 artifacts, CDT substantially outperforms human-written profiles and prior profile induction methods.
Conclusion: Structured, verifiable, and executable behavioral representations such as CDT improve the consistency and robustness of role-playing agents across diverse narrative settings.
Abstract: Role-playing (RP) agents rely on behavioral profiles to act consistently across diverse narrative contexts, yet existing profiles are largely unstructured, non-executable, and weakly validated, leading to brittle agent behavior. We propose Codified Decision Trees (CDT), a data-driven framework that induces an executable and interpretable decision structure from large-scale narrative data. CDT represents behavioral profiles as a tree of conditional rules, where internal nodes correspond to validated scene conditions and leaves encode grounded behavioral statements, enabling deterministic retrieval of context-appropriate rules at execution time. The tree is learned by iteratively inducing candidate scene-action rules, validating them against data, and refining them through hierarchical specialization, yielding profiles that support transparent inspection and principled updates. Across multiple benchmarks, CDT substantially outperforms human-written profiles and prior profile induction methods on 85 characters across 16 artifacts, indicating that codified and validated behavioral representations lead to more reliable agent grounding.
[36] Is MT Ready for the Next Crisis or Pandemic?
Vipasha Bansal,Elizabeth Brown,Chelsea Kendrick,Benjamin Pong,William D. Lewis
Main category: cs.CL
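TL;DR: This study evaluates four commercial machine translation systems on pandemic-related text in low-resource languages, using the TICO-19 dataset to gauge their usability in crisis and medical contexts and to assess current readiness for the next pandemic.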
Details
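Motivation: In a crisis, language barriers between governments, medical institutions, and the people they serve can cause communication to fail. How well existing commercial machine translation systems perform in high-stakes, low-resource language settings is unclear, particularly in medical and crisis contexts, so their practical usability needs to be evaluated.
Method: Uses the TICO-19 dataset, which contains pandemic-related sentences in many high-priority languages, to evaluate four commercial machine translation systems and analyze their translation accuracy and readability.
Result: The evaluation shows that translation quality on low-resource languages is uneven across current commercial systems. For some languages the output is hard to understand or contains serious errors, which indicates limited reliability for crisis communication.
Conclusion: Commercial machine translation systems offer potential support for crisis communication but remain significantly limited in low-resource languages and specialized domains such as medicine. Further improvement is needed to raise readiness for future pandemics.
Abstract: Communication in times of crisis is essential. However, there is often a mismatch between the language of governments, aid providers, doctors, and those to whom they are providing aid. Commercial MT systems are reasonable tools to turn to in these scenarios. But how effective are these tools for translating to and from low resource languages, particularly in the crisis or medical domain? In this study, we evaluate four commercial MT systems using the TICO-19 dataset, which is composed of pandemic-related sentences from a large set of high priority languages spoken by communities most likely to be affected adversely in the next pandemic. We then assess the current degree of "readiness" for another pandemic (or epidemic) based on the usability of the output translations.
[37] CALM-IT: Generating Realistic Long-Form Motivational Interviewing Dialogues with Dual-Actor Conversational Dynamics Tracking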
Viet Cuong Nguyen,Nhi Yen Nguyen,Kristin A. Candan,Mary Conlon,Vanessa Rumie,Kristen Risola,Srijan Kumar,Munmun De Choudhury
Main category: cs.CL
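TL;DR: CALM-IT is a framework for generating and evaluating long-form Motivational Interviewing dialogues. By modeling bidirectional conversational state dynamics, it improves the coherence and goal alignment of LLMs in mental health dialogue.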
Details
Motivation: Existing LLMs do not maintain a sustained model of therapeutic progress over long therapy dialogues, which makes conversations brittle and prone to long-horizon drift away from the goal.
Method: Proposes CALM-IT, which models therapist-client interaction as a bidirectional state-space process in which both parties continuously update alignment, mental states, and short-term goals to guide strategy selection and utterance generation.
Result: In large-scale evaluations, CALM-IT outperforms strong baselines on Effectiveness and Goal Alignment and is more stable as conversations grow longer. Despite fewer therapist redirections, it achieves the highest client acceptance rate (64.3%), which indicates more precise intervention timing.
Conclusion: Modeling dynamically evolving conversational state is essential for generating high-quality long-form synthetic therapy dialogues.
Abstract: Large Language Models (LLMs) are increasingly used in mental health-related settings, yet they struggle to sustain realistic, goal-directed dialogue over extended interactions. While LLMs generate fluent responses, they optimize locally for the next turn rather than maintaining a coherent model of therapeutic progress, leading to brittleness and long-horizon drift. We introduce CALM-IT, a framework for generating and evaluating long-form Motivational Interviewing (MI) dialogues that explicitly models dual-actor conversational dynamics. CALM-IT represents therapist-client interaction as a bidirectional state-space process, in which both agents continuously update inferred alignment, mental states, and short-term goals to guide strategy selection and utterance generation. Across large-scale evaluations, CALM-IT consistently outperforms strong baselines in Effectiveness and Goal Alignment and remains substantially more stable as conversation length increases. Although CALM-IT initiates fewer therapist redirections, it achieves the highest client acceptance rate (64.3%), indicating more precise and therapeutically aligned intervention timing. Overall, CALM-IT provides evidence that modeling evolving conversational state is essential for generating high-quality long-form synthetic conversations.
[38] SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature
Yiming Ren,Junjie Wang,Yuxin Meng,Yihang Shi,Zhiqiang Lin,Ruihang Chu,Yiran Xu,Ziming Li,Yunfei Zhao,Zihan Wang,Yu Qiao,Ruiming Tang,Minghao Liu,Yujiu Yang
Main category: cs.CL
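TL;DR: Proposes the "Fish-in-the-Ocean" (FITO) paradigm, which requires models to build explicit cross-modal evidence chains within native scientific documents, and constructs SIN-Data and SIN-Bench to evaluate the traceable reasoning of multimodal LLMs.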
Details
Motivation: Existing evaluations, such as answer-only matching or synthetic "Needle-In-A-Haystack" tests, cannot tell whether a model truly follows the cross-modal reasoning in long scientific papers.
Method: Builds SIN-Data, a corpus that preserves the native interleaving of text and figures, and designs SIN-Bench with four progressive tasks. Introduces a "No Evidence, No Score" scoring mechanism that assesses evidence quality by matching, relevance, and logic.
Result: Gemini-3-pro achieves the best overall score (0.573). GPT-5 has the highest QA accuracy (0.767) but weaker evidence alignment, which reveals a gap between correctness and traceable support.
Conclusion: The main bottleneck for models in scientific literature understanding is evidence grounding. Evaluation should weigh verifiable reasoning chains and not only answer correctness.
Abstract: Evaluating whether multimodal large language models truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic "Needle-In-A-Haystack" tests often reward answer matching without requiring a causal, evidence-linked reasoning trace in the document. We propose the "Fish-in-the-Ocean" (FITO) paradigm, which requires models to construct explicit cross-modal evidence chains within native scientific documents. To operationalize FITO, we build SIN-Data, a scientific interleaved corpus that preserves the native interleaving of text and figures. On top of it, we construct SIN-Bench with four progressive tasks covering evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA), and evidence-anchored synthesis (SIN-Summary). We further introduce "No Evidence, No Score", scoring predictions when grounded to verifiable anchors and diagnosing evidence quality via matching, relevance, and logic. Experiments on eight MLLMs show that grounding is the primary bottleneck: Gemini-3-pro achieves the best average overall score (0.573), while GPT-5 attains the highest SIN-QA answer accuracy (0.767) but underperforms on evidence-aligned overall scores, exposing a gap between correctness and traceable support.
[39] Skill-Aware Data Selection and Fine-Tuning for Data-Efficient Reasoning Distillation
Lechen Zhang,Yunxiang Zhang,Wei Hu,Lu Wang
Main category: cs.CL
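TL;DR: Proposes a skill-centric distillation framework that uses skill-based data selection and skill-aware fine-tuning to transfer reasoning ability from large models to small ones with very little data.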
Details
Motivation: Existing distillation methods for reasoning models usually require large amounts of labeled data for supervised fine-tuning and are not data-efficient. A more efficient method is needed that cuts data requirements and improves the student model on its weaker skills.
Method: Proposes a skill-centric distillation framework with two parts: skill-based data selection, which prioritizes training examples that target the student model's weaker skills, and skill-aware fine-tuning, which encourages explicit skill decomposition during problem solving.
Result: With only 1,000 training examples selected from a 100K teacher-generated corpus, the method beats random SFT baselines by +1.6% on Qwen3-4B and +1.4% on Qwen3-8B across five mathematical reasoning benchmarks, with the gains concentrated on the skills emphasized in training.
Conclusion: The skill-centric distillation strategy improves data efficiency and the student model's reasoning ability, supporting the value of targeted skill training for efficient reasoning distillation.
Abstract: Large reasoning models such as DeepSeek-R1 and their distilled variants achieve strong performance on complex reasoning tasks. Yet, distilling these models often demands large-scale data for supervised fine-tuning (SFT), motivating the pursuit of data-efficient training methods. To address this, we propose a skill-centric distillation framework that efficiently transfers reasoning ability to weaker models with two components: (1) Skill-based data selection, which prioritizes examples targeting the student model's weaker skills, and (2) Skill-aware fine-tuning, which encourages explicit skill decomposition during problem solving. With only 1,000 training examples selected from a 100K teacher-generated corpus, our method surpasses random SFT baselines by +1.6% on Qwen3-4B and +1.4% on Qwen3-8B across five mathematical reasoning benchmarks. Further analysis confirms that these gains concentrate on skills emphasized during training, highlighting the effectiveness of skill-centric training for efficient reasoning distillation.
[40] Role-Playing Agents Driven by Large Language Models: Current Status, Challenges, and Future Trends
Ye Wang,Jiaxing Chen,Hongjiang Xiao
Main category: cs.CL
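TL;DR: This paper systematically surveys the development, key technologies, and future directions of role-playing language agents (RPLAs). It covers the technical evolution from early rule-based templates to cognitive simulation, summarizes core techniques such as personality modeling, memory mechanisms, and behavioral control, and analyzes data construction and multi-dimensional evaluation methods.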
Details
Motivation: With the rapid progress of LLMs, role-playing language agents are increasingly important in natural language processing and human-computer interaction, and their technical landscape needs a systematic review to guide further research.
Method: Through a literature review, traces the technological evolution of RPLAs; analyzes key techniques such as psychological scale-driven personality modeling, memory-augmented prompting, and motivation-situation-based behavioral control; and summarizes data construction methods and multi-dimensional evaluation frameworks.
Result: Identifies three stages of RPLA development from rule-based templates to cognitive simulation, clarifies the key supporting techniques, collates existing benchmark datasets and evaluation methods, and identifies core challenges including data copyright, personality consistency, and interactive hallucination.
Conclusion: RPLAs are moving toward personality evolution, multi-agent collaborative narrative, multimodal immersive interaction, and integration with cognitive neuroscience. Dynamic personality models and more reliable evaluation frameworks are still needed.
Abstract: In recent years, with the rapid advancement of large language models (LLMs), role-playing language agents (RPLAs) have emerged as a prominent research focus at the intersection of natural language processing (NLP) and human-computer interaction. This paper systematically reviews the current development and key technologies of RPLAs, delineating the technological evolution from early rule-based template paradigms, through the language style imitation stage, to the cognitive simulation stage centered on personality modeling and memory mechanisms. It summarizes the critical technical pathways supporting high-quality role-playing, including psychological scale-driven character modeling, memory-augmented prompting mechanisms, and motivation-situation-based behavioral decision control. At the data level, the paper further analyzes the methods and challenges of constructing role-specific corpora, focusing on data sources, copyright constraints, and structured annotation processes. In terms of evaluation, it collates multi-dimensional assessment frameworks and benchmark datasets covering role knowledge, personality fidelity, value alignment, and interactive hallucination, while commenting on the advantages and disadvantages of methods such as human evaluation, reward models, and LLM-based scoring. Finally, the paper outlines future development directions of role-playing agents, including personality evolution modeling, multi-agent collaborative narrative, multimodal immersive interaction, and integration with cognitive neuroscience, aiming to provide a systematic perspective and methodological insights for subsequent research.
[41] ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback
Yutao Mou,Zhangchi Xue,Lijun Li,Peiyang Liu,Shikun Zhang,Wei Ye,Jing Shao
Main category: cs.CL
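TL;DR: This paper proposes a new approach to detecting unsafe behavior when LLM agents invoke external tools. It builds the TS-Bench benchmark, trains the TS-Guard guardrail model, and introduces the feedback-driven reasoning framework TS-Flow, which substantially reduces harmful actions and raises the completion rate of benign tasks.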
Details
Motivation: As LLM agents gain the ability to interact with environments by invoking external tools, their security risks rise as well. Unsafe tool invocations need to be monitored in real time and intercepted before execution.
Method: Builds TS-Bench, a benchmark for step-level tool invocation safety detection; trains the TS-Guard model with multi-task reinforcement learning to flag unsafe invocations in advance from the interaction history; and designs TS-Flow, a framework that feeds the guardrail model's feedback into the agent's reasoning.
Result: TS-Guard assesses request harmfulness and action-attack correlations. TS-Flow reduces harmful tool invocations of ReAct-style agents by 65% on average and raises benign task completion by about 10% under prompt injection attacks.
Conclusion: The work gives LLM agents an interpretable, generalizable safety guardrail. Detecting risk before execution and feeding that feedback into reasoning improves the safety and robustness of agent systems.
Abstract: While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.
[42] What Gets Activated: Uncovering Domain and Driver Experts in MoE Language Models
Guimin Hu,Meng Li,Qiwei Peng,Lijie Hu,Boyan Xu,Ruichu Cai
Main category: cs.CL
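TL;DR: This paper studies expert activation in MoE LLMs. It uses entropy-based and causal-effect metrics to identify experts with domain preferences or strong causal influence, finds that earlier tokens are more likely to trigger key experts, and shows that adjusting the weights of these experts improves model performance.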
Details
Motivation: Existing interpretability work on Transformers mostly targets layers or neurons, leaving expert behavior in MoE models underexplored. Inspired by functional specialization in the human brain, the paper studies expert activation patterns to improve model interpretability.
Method: Introduces an entropy-based metric to assess an expert's activation preference for a particular domain, uses a causal-effect metric to identify driver experts that strongly influence the output, and analyzes how individual tokens are associated with expert activation.
Result: (1) Some experts show clear domain preferences, while others exert strong causal influence on model output; (2) tokens earlier in a sentence are more likely to trigger driver experts; (3) adjusting the weights of domain and driver experts improves performance across all three models and domains.
Conclusion: The work sheds light on how experts divide labor inside MoE models, improves interpretability by distinguishing domain experts from driver experts, and suggests a new direction for optimizing MoE models.
Abstract: Most interpretability work focuses on layer- or neuron-level mechanisms in Transformers, leaving expert-level behavior in MoE LLMs underexplored. Motivated by functional specialization in the human brain, we analyze expert activation by distinguishing domain and driver experts. In this work, we study expert activation in MoE models across three public domains and address two key questions: (1) which experts are activated, and whether certain expert types exhibit consistent activation patterns; and (2) how tokens are associated with and trigger the activation of specific experts. To answer these questions, we introduce entropy-based and causal-effect metrics to assess whether an expert is strongly favored for a particular domain, and how strongly expert activation contributes causally to the model's output, thereby identifying domain and driver experts, respectively. Furthermore, we explore how individual tokens are associated with the activation of specific experts. Our analysis reveals that (1) among the activated experts, some show clear domain preferences, while others exert strong causal influence on model performance, underscoring their decisive roles; (2) tokens occurring earlier in a sentence are more likely to trigger the driver experts; and (3) adjusting the weights of domain and driver experts leads to significant performance gains across all three models and domains. These findings shed light on the internal mechanisms of MoE models and enhance their interpretability.
[43] Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
Cameron Tice,Puria Radmard,Samuel Ratnam,Andy Kim,David Africa,Kyle O'Brien
Main category: cs.CL
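TL;DR: This study is the first to control the proportion of AI alignment discourse in pretraining data. It finds that discourse about AI misalignment increases misaligned behaviour, while more discourse about aligned behaviour markedly reduces it. The authors call this "self-fulfilling alignment" and recommend taking alignment goals into account at the pretraining stage.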
Details
Motivation: Pretraining corpora contain many negative descriptions of AI systems, which may lead LLMs to internalize these behavioural priors and produce self-fulfilling misalignment. This effect has not been studied systematically.
Method: Pretrains 6.9B-parameter language models while controlling the proportion of discourse about AI alignment and misalignment in the training data, quantifies the effect on model behaviour, and tests whether the effect persists through post-training.
Result: Adding discourse about AI misalignment notably increases misaligned behaviour. Conversely, adding discourse about aligned behaviour reduces misalignment scores from 45% to 9%, which the authors take as evidence of "self-fulfilling alignment". The effect is dampened but persists after post-training.
Conclusion: Alignment-related content in pretraining data directly shapes a model's behavioural tendencies. "Alignment pretraining" should complement post-training, and practitioners are advised to pretrain for alignment as well as capabilities.
Abstract: Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training. Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners pretrain for alignment as well as capabilities. Our models and datasets are available at alignmentpretraining.ai
[44] AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers
Prachuryya Kaushik,Ashish Anand
Main category: cs.CL
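TL;DR: AWED-FiNER is an open-source ecosystem that provides fine-grained named entity recognition (FgNER) for 36 global languages spoken by more than 6.6 billion people, with a particular focus on low-resource and vulnerable languages. It combines agentic tools, web applications, and lightweight expert models to support fast multilingual annotation and offline deployment.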
Details
Motivation: Existing LLMs perform poorly on low-resource languages and fine-grained NLP tasks, offer little support for endangered languages, and are hard to deploy on resource-constrained devices. An efficient, open, multilingual system dedicated to FgNER is needed.
Method: Builds an ecosystem made up of an agentic toolkit, web applications, and 49 small expert models. The agentic tools automatically route multilingual text to the expert model for the corresponding language, and the language-specific small open-source models support offline and edge deployment.
Result: Delivers an FgNER system covering 36 languages spoken by over 6.6 billion people, including vulnerable languages such as Bodo and Manipuri, with entity annotation within seconds, a user-friendly web service, and small models that run on edge devices.
Conclusion: AWED-FiNER fills a gap in fine-grained named entity recognition for low-resource and endangered languages, and its modular design and open-source release make multilingual NLP technology more accessible and sustainable.
Abstract: We introduce AWED-FiNER, an open-source ecosystem designed to bridge the gap in Fine-grained Named Entity Recognition (FgNER) for 36 global languages spoken by more than 6.6 billion people. While Large Language Models (LLMs) dominate general Natural Language Processing (NLP) tasks, they often struggle with low-resource languages and fine-grained NLP tasks. AWED-FiNER provides a collection of agentic toolkits, web applications, and several state-of-the-art expert models that provide FgNER solutions across 36 languages. The agentic tools route multilingual text to specialized expert models and fetch FgNER annotations within seconds. The web-based platforms provide a ready-to-use FgNER annotation service for non-technical users. Moreover, the collection of language-specific, extremely small open-source state-of-the-art expert models facilitates offline deployment in resource-constrained scenarios, including edge devices. AWED-FiNER covers languages spoken by over 6.6 billion people, including a specific focus on vulnerable languages such as Bodo, Manipuri, Bishnupriya, and Mizo. The resources can be accessed here: Agentic Tool (https://github.com/PrachuryyaKaushik/AWED-FiNER), Web Application (https://hf.co/spaces/prachuryyaIITG/AWED-FiNER), and 49 Expert Detector Models (https://hf.co/collections/prachuryyaIITG/awed-finer).
[45] Credit C-GPT: A Domain-Specialized Large Language Model for Conversational Understanding in Vietnamese Debt Collection
Nhung Nguyen Thi Hong,Cuong Nguyen Dang,Tri Le Ngoc
Main category: cs.CL
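TL;DR: This paper presents Credit C-GPT, a domain-specialized LLM with seven billion parameters designed for conversational understanding in Vietnamese debt collection. It integrates several conversational intelligence tasks, and experiments show that it outperforms traditional pipeline-based approaches.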
Details
Motivation: Traditional natural language processing systems struggle with the informal spoken language, emotional variability, and complex domain-specific reasoning found in Vietnamese debt collection conversations.
Method: Builds and fine-tunes Credit C-GPT, a domain-specialized LLM with seven billion parameters that integrates dialogue understanding, sentiment recognition, intent detection, call stage classification, and slot-value extraction, and describes the data construction, annotation, and training methodology.
Result: Experiments on proprietary human-annotated datasets show that Credit C-GPT consistently outperforms traditional pipeline-based approaches across tasks.
Conclusion: Domain-specialized conversational language models can give enterprise contact centers a scalable, privacy-aware solution for real-time assistance and post-call analytics.
Abstract: Debt collection is a critical function within the banking, financial services, and insurance (BFSI) sector, relying heavily on large-scale human-to-human conversational interactions conducted primarily in Vietnamese contact centers. These conversations involve informal spoken language, emotional variability, and complex domain-specific reasoning, which pose significant challenges for traditional natural language processing systems. This paper introduces Credit C-GPT, a domain-specialized large language model with seven billion parameters, fine-tuned for conversational understanding in Vietnamese debt collection scenarios. The proposed model integrates multiple conversational intelligence tasks, including dialogue understanding, sentiment recognition, intent detection, call stage classification, and structured slot-value extraction, within a single reasoning-based framework. We describe the data construction process, annotation strategy, and training methodology, and evaluate the model on proprietary human-annotated datasets. Experimental results show consistent improvements over traditional pipeline-based approaches, indicating that domain-specialized conversational language models provide a scalable and privacy-aware solution for real-time assistance and post-call analytics in enterprise contact centers.
[46] HOMURA: Taming the Sand-Glass for Time-Constrained LLM Translation via Reinforcement Learning
Ziang Cui,Mengran Yu,Tianjiao Li,Chenyu Shi,Yingxuan Shi,Lusheng Zhang,Hongwei Lin
Main category: cs.CL
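TL;DR: This paper addresses the cross-lingual verbosity bias of LLMs in multilingual translation. It introduces the Sand-Glass benchmark and the HOMURA reinforcement learning framework to optimize the trade-off between semantic preservation and temporal compliance.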
Details
Motivation: LLMs perform well at multilingual translation but are held back by cross-lingual verbosity bias in time-constrained tasks such as subtitling and dubbing. Existing prompt-engineering approaches struggle to resolve the conflict between semantic fidelity and temporal feasibility.
Method: Introduces the Sand-Glass benchmark for evaluating translation under syllable-level duration constraints, and designs the HOMURA reinforcement learning framework, which uses a KL-regularized objective with a dynamic syllable-ratio reward to explicitly optimize the balance between semantic preservation and temporal compliance.
Result: Experiments show that HOMURA significantly outperforms strong LLM baselines, achieving precise length control that respects linguistic density hierarchies without sacrificing semantic adequacy.
Conclusion: HOMURA mitigates the verbosity of LLMs in time-constrained translation and offers a new approach to high-quality multilingual translation that meets timing constraints.
Abstract: Large Language Models (LLMs) have achieved remarkable strides in multilingual translation but are hindered by a systemic cross-lingual verbosity bias, rendering them unsuitable for strict time-constrained tasks like subtitling and dubbing. Current prompt-engineering approaches struggle to resolve this conflict between semantic fidelity and rigid temporal feasibility. To bridge this gap, we first introduce Sand-Glass, a benchmark specifically designed to evaluate translation under syllable-level duration constraints. Furthermore, we propose HOMURA, a reinforcement learning framework that explicitly optimizes the trade-off between semantic preservation and temporal compliance. By employing a KL-regularized objective with a novel dynamic syllable-ratio reward, HOMURA effectively "tames" the output length. Experimental results demonstrate that our method significantly outperforms strong LLM baselines, achieving precise length control that respects linguistic density hierarchies without compromising semantic adequacy.
[47] HUMANLLM: Benchmarking and Reinforcing LLM Anthropomorphism via Human Cognitive Patterns
Xintao Wang,Jian Yang,Weiyuan Li,Rui Xie,Jen-tse Huang,Jun Gao,Shuai Huang,Yueping Kang,Liyuan Gou,Hongwei Feng,Yanghua Xiao
Main category: cs.CL
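TL;DR: HUMANLLM is a new framework that models psychological patterns as interacting causal forces to bring role-playing language agents into more authentic alignment with human cognition and behavior.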
Details
Motivation: Existing LLMs lack authentic alignment when simulating human psychology and behavior, and fall short especially in scenarios where multiple psychological patterns interact.
Method: Constructs 244 psychological patterns from about 12,000 academic papers, synthesizes 11,359 scenarios in which 2-5 patterns reinforce, conflict with, or modulate each other, and generates multi-turn conversations expressing inner thoughts, actions, and dialogue. Proposes dual-level checklists that evaluate individual pattern fidelity and emergent multi-pattern dynamics.
Result: HUMANLLM-8B outperforms Qwen3-32B on multi-pattern dynamics, and the checklists show strong human alignment (r=0.91). The study also finds that holistic metrics conflate simulation accuracy with social desirability.
Conclusion: Authentic persona simulation requires modeling the psychological processes that generate human behavior, not only imitating the behavior. Cognitive modeling is key to genuine anthropomorphism.
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and generation, serving as the foundation for advanced persona simulation and Role-Playing Language Agents (RPLAs). However, achieving authentic alignment with human cognitive and behavioral patterns remains a critical challenge for these agents. We present HUMANLLM, a framework treating psychological patterns as interacting causal forces. We construct 244 patterns from ~12,000 academic papers and synthesize 11,359 scenarios where 2-5 patterns reinforce, conflict, or modulate each other, with multi-turn conversations expressing inner thoughts, actions, and dialogue. Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment (r=0.91) while revealing that holistic metrics conflate simulation accuracy with social desirability. HUMANLLM-8B outperforms Qwen3-32B on multi-pattern dynamics despite 4x fewer parameters, demonstrating that authentic anthropomorphism requires cognitive modeling: simulating not just what humans do, but the psychological processes generating those behaviors.
[48] One Instruction Does Not Fit All: How Well Do Embeddings Align Personas and Instructions in Low-Resource Indian Languages?
Arya Shah,Himanshu Beniwal,Mayank Singh
Main category: cs.CL
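TL;DR: This paper presents a unified benchmark spanning 12 Indian languages and four evaluation tasks to assess how well multilingual embedding models align with culturally grounded user preferences. E5-Large-Instruct and BGE-M3 perform best on retrieval tasks, and LaBSE performs best on classification.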
Details
Motivation: Existing benchmarks either focus on a single language or conflate retrieval with generation, so they cannot answer whether current embedding models encode persona-instruction compatibility without relying on response generation. A unified evaluation benchmark dedicated to India's multilingual setting is needed.
Method: Builds a benchmark covering 12 Indian languages and four tasks (monolingual and cross-lingual persona-to-instruction retrieval, reverse retrieval from instruction to persona, and binary compatibility classification). Evaluates eight multilingual embedding models in a frozen-encoder setting, with a thin logistic regression head for classification.
Result: E5-Large-Instruct reaches Recall@1 of 27.4% on monolingual retrieval and 20.7% on cross-lingual transfer; BGE-M3 reaches 32.1% Recall@1 on reverse retrieval; LaBSE reaches 75.3% AUROC on classification with strong calibration.
Conclusion: The study offers practical guidance for model selection in Indic multilingual settings and establishes reproducible baselines for future work.
Abstract: Aligning multilingual assistants with culturally grounded user preferences is essential for serving India's linguistically diverse population of over one billion speakers across multiple scripts. However, existing benchmarks either focus on a single language or conflate retrieval with generation, leaving open the question of whether current embedding models can encode persona-instruction compatibility without relying on response synthesis. We present a unified benchmark spanning 12 Indian languages and four evaluation tasks: monolingual and cross-lingual persona-to-instruction retrieval, reverse retrieval from instruction to persona, and binary compatibility classification. Eight multilingual embedding models are evaluated in a frozen-encoder setting with a thin logistic regression head for classification. E5-Large-Instruct achieves the highest Recall@1 of 27.4% on monolingual retrieval and 20.7% on cross-lingual transfer, while BGE-M3 leads reverse retrieval at 32.1% Recall@1. For classification, LaBSE attains 75.3% AUROC with strong calibration. These findings offer practical guidance for model selection in Indic multilingual retrieval and establish reproducible baselines for future work. Code, datasets, and models are publicly available at https://github.com/aryashah2k/PI-Indic-Align.
[49] GeoSteer: Faithful Chain-of-Thought Steering via Latent Manifold Gradients
Kentaro Kazama,Daiki Shirafuji,Tatsuhiko Saito
Main category: cs.CL
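TL;DR: This paper proposes GeoSteer, a manifold-based framework that steers the hidden states of LLMs in a latent space to improve the quality and consistency of intermediate steps in multi-step reasoning.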
Details
Motivation: Existing LLMs often produce logically inconsistent steps in multi-step reasoning even when the final answer is correct, which undermines the reliability of the reasoning process. A method is needed to improve the quality of intermediate reasoning steps.
Method: (1) Constructs a chain-of-thought (CoT) dataset with segment-level scores; (2) trains a Variational Autoencoder (VAE) and a quality estimation model to learn a low-dimensional manifold of high-quality CoT trajectories; (3) steers the hidden states of the target LLM toward higher-quality regions of the latent space for geometrically coherent steering.
Result: Evaluated on GSM8k with the Qwen3 series, GeoSteer improves accuracy by up to 2.6 points and pairwise win rate by 5.3 points.
Conclusion: GeoSteer provides an effective and controllable mechanism for improving the quality and consistency of intermediate reasoning steps in LLMs.
Abstract: Recent advances in Large Language Models (LLMs) have improved multi-step reasoning. Most approaches rely on Chain-of-Thought (CoT) rationales. Previous studies have shown that LLMs often generate logically inconsistent reasoning steps even when their final answers are correct. These inconsistencies reduce the reliability of step-level reasoning. We propose GeoSteer, a manifold-based framework that improves the quality of intermediate reasoning. The method consists of: (1) constructing a CoT dataset with segment-level scores, (2) training a Variational Autoencoder (VAE) model and a quality estimation model to learn a low-dimensional manifold of high-quality CoT trajectories, and (3) steering hidden states of target LLMs toward higher-quality regions in the latent space. This update in a latent space behaves like a natural-gradient adjustment in the original hidden-state space. It ensures geometrically coherent steering. We evaluate GeoSteer on the GSM8k dataset using the Qwen3 series. We measure answer accuracy and overall reasoning performance. GeoSteer improved the exact match accuracy by up to 2.6 points. It also enhanced the pairwise win rate by 5.3 points. These results indicate that GeoSteer provides an effective and controllable mechanism for improving the quality of intermediate reasoning in LLMs.
[50] Loop as a Bridge: Can Looped Transformers Truly Link Representation Space and Natural Language Outputs?
Guanxu Chen,Dongrui Liu,Jing Shao
Main category: cs.CL
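TL;DR: Looped Transformers (LTs) increase computational depth by iterating shared layers in an attempt to bridge the gap between an LLM's internal knowledge and its explicit outputs, but experiments show that the effect is limited.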
Details
Motivation: To study the gap between LLMs' internal knowledge and their explicit linguistic outputs, and to explore whether Looped Transformers can narrow it by using their iterative mechanism as a form of "introspection". Method: Empirical experiments analyze how increasing the number of loop iterations affects model behavior and assess LTs' ability to perceive representations at different loops. Result: Increasing loop iterations partly narrows the gap, but this comes with a degradation of the internal knowledge carried by the representations. LTs' ability to perceive representations does not improve across loops and is present only in the final loop. Conclusion: Looped Transformers offer a promising direction for scaling computational depth, but they have not yet achieved the introspection needed to truly link representation space and natural language. Abstract: Large Language Models (LLMs) often exhibit a gap between their internal knowledge and their explicit linguistic outputs. In this report, we empirically investigate whether Looped Transformers (LTs)--architectures that increase computational depth by iterating shared layers--can bridge this gap by utilizing their iterative nature as a form of introspection. Our experiments reveal that while increasing loop iterations narrows the gap, it is partly driven by a degradation of their internal knowledge carried by representations. Moreover, another empirical analysis suggests that current LTs' ability to perceive representations does not improve across loops; it is only present in the final loop. These results suggest that while LTs offer a promising direction for scaling computational depth, they have yet to achieve the introspection required to truly link representation space and natural language.
[51] coTherapist: A Behavior-Aligned Small Language Model to Support Mental Healthcare Experts
Prottay Kumar Adhikary,Reena Rawat,Tanmoy Chakraborty
Main category: cs.CL
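TL;DR: This paper presents coTherapist, a unified framework built on a small language model that emulates core therapeutic competencies through domain-specific fine-tuning, retrieval augmentation, and agentic reasoning. It generates more relevant and clinically grounded responses to clinical queries and exhibits high empathy and therapist-consistent personality traits.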
Details
Motivation: Mental health services are under pressure from workforce shortages and rising demand, which creates an urgent need for intelligent systems that can assist mental healthcare experts. Method: The authors build the coTherapist framework on a small language model combined with domain-specific fine-tuning, retrieval augmentation, and agentic reasoning, and evaluate it with the T-BARS rubric, psychometric profiling, and human evaluation by domain experts. Result: coTherapist outperforms existing baselines on clinical question answering, showing higher relevance, clinical grounding, empathy, and personality consistency; experts validate its responses as accurate, trustworthy, and safe. Conclusion: The study shows that carefully engineered small models can exhibit expert-like behavior, offering a scalable pathway for digital mental health tools. Abstract: Access to mental healthcare is increasingly strained by workforce shortages and rising demand, motivating the development of intelligent systems that can support mental healthcare experts. We introduce coTherapist, a unified framework utilizing a small language model to emulate core therapeutic competencies through domain-specific fine-tuning, retrieval augmentation, and agentic reasoning. Evaluation on clinical queries demonstrates that coTherapist generates more relevant and clinically grounded responses than contemporary baselines. Using our novel T-BARS rubric and psychometric profiling, we confirm coTherapist exhibits high empathy and therapist-consistent personality traits. Furthermore, human evaluation by domain experts validates that coTherapist delivers accurate, trustworthy, and safe responses. coTherapist was deployed and tested by clinical experts. Collectively, these findings demonstrate that small models can be engineered to exhibit expert-like behavior, offering a scalable pathway for digital mental health tools.
[52] Untangling Input Language from Reasoning Language: A Diagnostic Framework for Cross-Lingual Moral Alignment in LLMs
Nan Li,Bo Kang,Tijl De Bie
Main category: cs.CL
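TL;DR: The study examines how LLMs' judgments of moral dilemmas differ across languages, proposes a new method that separates the effects of input language and reasoning language, and interprets model judgments through Moral Foundations Theory. It finds that reasoning language contributes twice the variance of input language, and that nearly half of the models show context-dependency that standard evaluation misses.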
Details
Motivation: Existing evaluation conflates the effects of input language and reasoning language on moral judgments, so it cannot accurately analyze cross-lingual differences in models' moral decisions. A method that decomposes the two factors is needed. Method: The language of the moral dilemma and the language in which the model reasons are manipulated independently (covering both matched and mismatched conditions) to separate their effects. Judgments are then interpreted through Moral Foundations Theory, and the authors propose splitting the Authority dimension into family-related and institutional sub-dimensions. Result: In English-Chinese moral judgment experiments on 13 LLMs, reasoning language contributes twice the variance of input language; nearly half of the models show context-dependency that standard evaluation fails to detect; and a diagnostic taxonomy is established to guide deployment. Conclusion: LLM moral judgments are strongly affected by the reasoning language, not only the input language. The proposed method enables finer-grained diagnosis of cross-cultural moral decision behavior and provides theoretical support and practical guidance for deploying multilingual AI systems. Abstract: When LLMs judge moral dilemmas, do they reach different conclusions in different languages, and if so, why? Two factors could drive such differences: the language of the dilemma itself, or the language in which the model reasons. Standard evaluation conflates these by testing only matched conditions (e.g., English dilemma with English reasoning). We introduce a methodology that separately manipulates each factor, covering also mismatched conditions (e.g., English dilemma with Chinese reasoning), enabling decomposition of their contributions. To study what changes, we propose an approach to interpret the moral judgments in terms of Moral Foundations Theory. As a side result, we identify evidence for splitting the Authority dimension into a family-related and an institutional dimension. Applying this methodology to English-Chinese moral judgment with 13 LLMs, we demonstrate its diagnostic power: (1) the framework isolates reasoning-language effects as contributing twice the variance of input-language effects; (2) it detects context-dependency in nearly half of models that standard evaluation misses; and (3) a diagnostic taxonomy translates these patterns into deployment guidance. We release our code and datasets at https://anonymous.4open.science/r/CrossCulturalMoralJudgement.
[53] Measuring Affinity between Attention-Head Weight Subspaces via the Projection Kernel
Hiroaki Yamagiwa,Yusuke Takase,Hidetoshi Shimodaira
Main category: cs.CL
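TL;DR: This paper proposes the Projection Kernel (PK), a principal-angle-based measure of subspace similarity, to better understand relationships between attention heads in Transformers. Experiments show that it outperforms existing metrics on the IOI task. The paper also introduces a framework for assessing the informativeness of PK distributions and, applying the method, finds that head L4H7 in GPT2-small acts as a hub.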
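A minimal sketch of a projection-kernel computation, assuming the common definition in which the kernel between two subspaces is the sum of squared cosines of their principal angles (equivalently, the squared Frobenius norm of Qa^T Qb for orthonormal bases Qa and Qb). The function names are hypothetical, and this entry does not say how the paper forms a subspace from each head's weight matrices, so the inputs here are simply matrices whose columns span each subspace.

```python
import numpy as np


def orthonormal_basis(W: np.ndarray) -> np.ndarray:
    """Orthonormal basis for the column space of W (d x m, full column rank assumed)."""
    Q, _ = np.linalg.qr(W)  # reduced QR: Q has shape (d, m) when d >= m
    return Q


def projection_kernel(A: np.ndarray, B: np.ndarray) -> float:
    """Sum of squared cosines of the principal angles between span(A) and span(B)."""
    Qa = orthonormal_basis(A)
    Qb = orthonormal_basis(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)  # guard against round-off slightly above 1
    return float(np.sum(cosines ** 2))


if __name__ == "__main__":
    A = np.eye(4)[:, :2]  # span of the first two coordinate axes
    B = np.eye(4)[:, 2:]  # span of the last two coordinate axes
    print(projection_kernel(A, A))  # identical subspaces
    print(projection_kernel(A, B))  # orthogonal subspaces
```

Under this definition the value for two m-dimensional subspaces lies between 0 (orthogonal) and m (identical), which is why a reference distribution from random subspaces is useful for judging whether an observed value is informative.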
Details
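Motivation: Existing metrics for relationships between attention heads do not capture the internal structure of Transformers well, so a more effective measure is needed to reveal how heads interact. Method: Using the subspaces spanned by attention-head weight matrices, the authors define the Projection Kernel (PK), a principal-angle-based subspace similarity measure. They build a reference distribution from random orthogonal subspaces to assess the informativeness of PK distributions, and construct a directed graph from PK to analyze the functional roles of heads. Result: On the IOI task, PK reproduces known head-to-head interactions more clearly than prior metrics such as the Composition Score. The PK-based directed graph shows that head L4H7 in GPT2-small acts as a central hub by functioning as an identity head. Conclusion: PK is an effective measure of head-to-head relationships that better reveals the internal structure of Transformers, and the proposed framework helps quantify the informativeness of such measures, providing a new tool for model interpretation. Abstract: Understanding relationships between attention heads is essential for interpreting the internal structure of Transformers, yet existing metrics do not capture this structure well. We focus on the subspaces spanned by attention-head weight matrices and quantify head-to-head relationships using the Projection Kernel (PK), a principal-angle-based measure of subspace similarity. Experiments show that PK reproduces known head-to-head interactions on the IOI task more clearly than prior metrics such as the Composition Score. We further introduce a framework to quantify the informativeness of PK distributions by comparing them with a reference distribution derived from random orthogonal subspaces. As an application, we analyze a directed graph constructed from PK and show that, in GPT2-small, L4H7 acts as a hub by functioning as an identity head.
[54] MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts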
Yuxuan Lou,Kai Yang,Yang You
Main category: cs.CL
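TL;DR: MoST is a new multimodal large language model built on a Mixture of Experts architecture. Its Modality-Aware Mixture of Experts (MAMoE) design integrates speech and text, using dedicated routing pathways and shared experts to improve both modality-specific learning and cross-modal understanding.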
Details
Motivation: Existing multimodal models usually process different modalities with the same parameters and ignore their representational differences. The authors therefore propose an architecture that allocates computation dynamically according to input type. Method: The MAMoE architecture contains modality-specific expert groups and shared experts. An efficient transformation pipeline post-trains the model on ASR and TTS datasets and then fine-tunes it on a carefully curated speech-text instruction dataset, using only open-source data throughout. Result: Across ASR, TTS, audio language modeling, and spoken question answering benchmarks, MoST outperforms existing models of comparable size. Ablations confirm the effectiveness of modality-specific routing and shared experts. Conclusion: MoST is the first fully open-source speech-text LLM built on an MoE architecture. It combines strong performance with data efficiency and advances multimodal language modeling. Abstract: We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality-Aware Mixture of Experts (MAMoE) architecture. While current multimodal models typically process diverse modality representations with identical parameters, disregarding their inherent representational differences, we introduce specialized routing pathways that direct tokens to modality-appropriate experts based on input type. MAMoE simultaneously enhances modality-specific learning and cross-modal understanding through two complementary components: modality-specific expert groups that capture domain-specific patterns and shared experts that facilitate information transfer between modalities. Building on this architecture, we develop an efficient transformation pipeline that adapts the pretrained MoE language model through strategic post-training on ASR and TTS datasets, followed by fine-tuning with a carefully curated speech-text instruction dataset. A key feature of this pipeline is that it relies exclusively on fully accessible, open-source datasets to achieve strong performance and data efficiency. Comprehensive evaluations across ASR, TTS, audio language modeling, and spoken question answering benchmarks show that MoST consistently outperforms existing models of comparable parameter counts. Our ablation studies confirm that the modality-specific routing mechanism and shared experts design significantly contribute to performance gains across all tested domains. To our knowledge, MoST represents the first fully open-source speech-text LLM built on a Mixture of Experts architecture.
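One way to picture modality-aware routing is as ordinary top-k expert routing with a per-token mask that restricts each token to its own modality's expert group. The sketch below is an illustration under that assumption only; the function name, the masking scheme, the modality ids, and the softmax over the selected logits are all hypothetical and are not taken from the MoST implementation.

```python
import numpy as np


def modality_aware_route(router_logits, modality_ids, expert_groups, top_k=2):
    """Pick top_k experts per token, restricted to the token's modality group.

    router_logits: (n_tokens, n_experts) float array of router scores.
    modality_ids:  (n_tokens,) int array, e.g. 0 = text, 1 = speech (assumed ids).
    expert_groups: dict mapping modality id -> list of allowed expert indices.
    Returns (expert_indices, weights), both of shape (n_tokens, top_k).
    """
    router_logits = np.asarray(router_logits, dtype=float)
    modality_ids = np.asarray(modality_ids)
    masked = np.full(router_logits.shape, -np.inf)
    for modality, experts in expert_groups.items():
        assert len(experts) >= top_k, "each group needs at least top_k experts"
        rows = np.where(modality_ids == modality)[0]
        if rows.size == 0:
            continue
        # Copy logits only for experts this modality is allowed to use.
        masked[np.ix_(rows, experts)] = router_logits[np.ix_(rows, experts)]
    top_idx = np.argsort(-masked, axis=1)[:, :top_k]
    top_logits = np.take_along_axis(masked, top_idx, axis=1)
    # Softmax over the selected experts only.
    top_logits = top_logits - top_logits.max(axis=1, keepdims=True)
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return top_idx, weights


if __name__ == "__main__":
    logits = np.array([[0.1, 0.5, 0.2, 9.0],   # text token: expert 3 scores highest overall
                       [9.0, 0.3, 0.1, 0.7]])  # speech token: expert 0 scores highest overall
    groups = {0: [0, 1], 1: [2, 3]}            # text -> experts 0,1; speech -> experts 2,3
    idx, w = modality_aware_route(logits, np.array([0, 1]), groups, top_k=1)
    print(idx.tolist(), w.tolist())
```

In this picture, shared experts would sit outside the router and be applied to every token regardless of modality, with their output added to the routed experts' output.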
We release the MoST model, training code, inference code, and training data at https://github.com/NUS-HPC-AI-Lab/MoST.
[55] The Straight and Narrow: Do LLMs Possess an Internal Moral Path?
Luoming Hu,Jingjie Zeng,Liang Yang,Hongfei Lin
Main category: cs.CL
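TL;DR: This paper proposes Adaptive Moral Fusion (AMF), a method grounded in Moral Foundations Theory (MFT). It uses cross-lingual linear probing to identify and manipulate the intrinsic moral representations of LLMs and intervenes dynamically at inference time, balancing the trade-off between safety and helpfulness.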
Details
Motivation: Existing alignment techniques often act only as surface guardrails and do not adjust the intrinsic moral representations of LLMs, which makes safety and helpfulness hard to reconcile. The paper aims to uncover and exploit the fine-grained moral structure of LLMs to achieve deeper moral alignment. Method: Based on Moral Foundations Theory (MFT), cross-lingual linear probing verifies that the middle layers of LLMs contain a shared yet distinct moral subspace. Steerable Moral Vectors are extracted from it, and Adaptive Moral Fusion (AMF) combines probe detection with vector injection to intervene dynamically at inference time. Result: Experiments show that the method identifies and manipulates the model's internal moral representations. At the behavioral level it reduces incorrect refusals on benign queries while lowering jailbreak success rates, outperforming standard baselines. Conclusion: By mining and exploiting intrinsic, transferable moral representations in LLMs, AMF offers a feasible path to fine-grained, dynamic, and intrinsic moral alignment, and helps address the conflict between safety and helpfulness in AI safety. Abstract: Enhancing the moral alignment of Large Language Models (LLMs) is a critical challenge in AI safety. Current alignment techniques often act as superficial guardrails, leaving the intrinsic moral representations of LLMs largely untouched. In this paper, we bridge this gap by leveraging Moral Foundations Theory (MFT) to map and manipulate the fine-grained moral landscape of LLMs. Through cross-lingual linear probing, we validate the shared nature of moral representations in middle layers and uncover a shared yet different moral subspace between English and Chinese. Building upon this, we extract steerable Moral Vectors and successfully validate their efficacy at both internal and behavioral levels. Leveraging the high generalizability of morality, we propose Adaptive Moral Fusion (AMF), a dynamic inference-time intervention that synergizes probe detection with vector injection to tackle the safety-helpfulness trade-off. Empirical results confirm that our approach acts as a targeted intrinsic defense, effectively reducing incorrect refusals on benign queries while minimizing jailbreak success rates compared to standard baselines.
[56] Multilinguality as Sense Adaptation
Jan Christian Blaise Cruz,David Ifeoluwa Adelani,Alham Fikri Aji
Main category: cs.CL
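TL;DR: The paper proposes SENse-based Symmetric Interlingual Alignment (SENSIA), which aligns sense-level mixtures and contextual representations on parallel data to achieve cross-lingual alignment of latent meaning while preserving fluency in the target language.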
Details
Motivation: Conventional multilingual models rely on shared parameters and large-scale data and struggle to align semantic representations across languages. The paper aims to improve cross-lingual representation alignment through sense-level adaptation. Method: SENSIA aligns latent sense representations and contextual representations and is trained jointly with a target-language language modeling loss, adapting a model from one language to another. Result: On benchmarks in four typologically diverse languages, SENSIA generally outperforms existing multilingual alignment methods and reaches accuracy comparable to monolingual from-scratch baselines while using 2-4x less target-language data. Analyses show that the learned sense geometry preserves local topology and global structure relative to English. Conclusion: SENSIA achieves effective cross-lingual sense alignment, maintains strong performance with reduced data requirements, and is robust to model design and scale. Abstract: We approach multilinguality as sense adaptation: aligning latent meaning representations across languages rather than relying solely on shared parameters and scale. In this paper, we introduce SENse-based Symmetric Interlingual Alignment (SENSIA), which adapts a Backpack language model from one language to another by explicitly aligning sense-level mixtures and contextual representations on parallel data, while jointly training a target-language language modeling loss to preserve fluency. Across benchmarks on four typologically diverse languages, SENSIA generally outperforms comparable multilingual alignment methods and achieves competitive accuracy against monolingual from-scratch baselines while using 2-4x less target-language data. Analyses of learned sense geometry indicate that local sense topology and global structure relative to English are largely preserved, and ablations show that the method is robust in terms of design and scale.
[57] ADVOSYNTH: A Synthetic Multi-Advocate Dataset for Speaker Identification in Courtroom Scenarios
Aniket Deroy
Main category: cs.CL
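TL;DR: This paper introduces Advosynth-500, a dataset of 100 synthetic speech files for studying speaker identification of synthetic voices in courtroom argument scenarios.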
Details
Motivation: As large-scale speech-to-speech models become more faithful, distinguishing between synthetic voices in structured environments becomes essential. Method: The Speech Llama Omni model is used to simulate five distinct pairs of advocates in conversation, specific vocal characteristics are defined for each advocate, and a speaker identification challenge is constructed. Result: The authors release a dataset of 100 synthetic speech files covering 10 unique advocate identities, which can be used to evaluate how well modern systems identify the origin of synthetic speech. Conclusion: Advosynth-500 provides a new benchmark for evaluating speaker identification in synthetic speech settings. Abstract: As large-scale speech-to-speech models achieve high fidelity, the distinction between synthetic voices in structured environments becomes a vital area of study. This paper introduces Advosynth-500, a specialized dataset comprising 100 synthetic speech files featuring 10 unique advocate identities. Using the Speech Llama Omni model, we simulate five distinct advocate pairs engaged in courtroom arguments. We define specific vocal characteristics for each advocate and present a speaker identification challenge to evaluate the ability of modern systems to map audio files to their respective synthetic origins. The dataset is available at https://github.com/naturenurtureelite/ADVOSYNTH-500.
[58] Boundary-Aware NL2SQL: Integrating Reliability through Hybrid Reward and Data Synthesis
Songsong Tian,Kongsheng Zhuo,Zhendong Wang,Rong Shen,Shengtao Zhang,Yong Wu
Main category: cs.CL
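TL;DR: This paper presents BAR-SQL, a unified NL2SQL training framework that embeds reliability and boundary awareness into the generation process. Seed Mutation data synthesis and knowledge-grounded reasoning improve SQL generation quality and the ability to abstain on ambiguous and unanswerable queries.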
Details
Motivation: Existing NL2SQL models lack reliable boundary awareness when faced with ambiguous, schema-limited, or unanswerable queries, and so fall short of the needs of complex enterprise analytics. Method: A Seed Mutation data synthesis method builds an enterprise corpus containing multi-step analytical queries and boundary cases. Knowledge-Grounded Reasoning Synthesis produces chain-of-thought traces anchored in metadata and business rules. Two-stage training (SFT followed by reinforcement learning with Group Relative Policy Optimization) is combined with a task-conditioned hybrid reward that optimizes execution accuracy and semantic precision. Result: On the newly built Ent-SQL-Bench benchmark, BAR-SQL reaches 91.48% average accuracy, outperforming leading proprietary models such as Claude 4.5 Sonnet and GPT-5 in both SQL generation quality and boundary-aware abstention. Conclusion: By explicitly modeling boundary awareness and reliability, BAR-SQL substantially improves the practicality and trustworthiness of NL2SQL in real enterprise scenarios, and the proposed training framework and benchmark provide a useful reference for follow-up research. Abstract: In this paper, we present BAR-SQL (Boundary-Aware Reliable NL2SQL), a unified training framework that embeds reliability and boundary awareness directly into the generation process. We introduce a Seed Mutation data synthesis paradigm that constructs a representative enterprise corpus, explicitly encompassing multi-step analytical queries alongside boundary cases including ambiguity and schema limitations. To ensure interpretability, we employ Knowledge-Grounded Reasoning Synthesis, which produces Chain-of-Thought traces explicitly anchored in schema metadata and business rules. The model is trained through a two-stage process: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning via Group Relative Policy Optimization. We design a Task-Conditioned Hybrid Reward mechanism that simultaneously optimizes SQL execution accuracy (leveraging Abstract Syntax Tree analysis and dense result matching) and semantic precision in abstention responses. To evaluate reliability alongside generation accuracy, we construct and release Ent-SQL-Bench, which jointly assesses SQL precision and boundary-aware abstention across ambiguous and unanswerable queries. Experimental results on this benchmark demonstrate that BAR-SQL achieves 91.48% average accuracy, outperforming leading proprietary models, including Claude 4.5 Sonnet and GPT-5, in both SQL generation quality and boundary-aware abstention capability.
The source code and benchmark are available anonymously at: https://github.com/TianSongS/BAR-SQL.
[59] An Efficient Long-Context Ranking Architecture With Calibrated LLM Distillation: Application to Person-Job Fit
Warren Jouanneau,Emma Jouffroy,Marc Palyart
Main category: cs.CL
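TL;DR: This paper proposes a re-ranking model based on a late cross-attention architecture for efficient, real-time matching of candidates to jobs. A large language model generates fine-grained supervision signals, and knowledge distillation is used to improve performance.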
Details
Motivation: To address the challenges that long, multilingual resumes pose for person-job matching and to reduce the influence of historical data bias on recommendations. Method: A late cross-attention architecture decomposes resumes and project briefs. A generative large language model serves as the teacher and provides semantically rich supervision, and the student model is trained with an enriched distillation loss. Result: Experiments show that the method outperforms state-of-the-art baselines on relevance, ranking, and calibration metrics. Conclusion: The proposed method enables efficient, interpretable, and fair person-job matching on long-context inputs. Abstract: Finding the most relevant person for a job proposal in real time is challenging, especially when resumes are long, structured, and multilingual. In this paper, we propose a re-ranking model based on a new generation of late cross-attention architecture that decomposes both resumes and project briefs to efficiently handle long-context inputs with minimal computational overhead. To mitigate historical data biases, we use a generative large language model (LLM) as a teacher, generating fine-grained, semantically grounded supervision. This signal is distilled into our student model via an enriched distillation loss function. The resulting model produces skill-fit scores that enable consistent and interpretable person-job matching. Experiments on relevance, ranking, and calibration metrics demonstrate that our approach outperforms state-of-the-art baselines.
[60] OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding
Deming Ding,Shichun Liu,Enhui Yang,Jiahang Lin,Ziying Chen,Shihan Dou,Honglin Guo,Weiyu Cheng,Pengyu Zhao,Chengjun Xiao,Qunhong Zeng,Qi Zhang,Xuanjing Huang,Qidi Xu,Tao Gui
Main category: cs.CL
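TL;DR: This paper presents OctoBench, a benchmark for evaluating scaffold-aware instruction following in repository-grounded agentic coding, and highlights a systematic gap between solving the task and following the rules.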
Details
Motivation: Modern coding scaffolds turn large language models into capable software agents, but their ability to follow scaffold-specified instructions, especially under heterogeneous and persistent constraints, remains under-studied. Method: OctoBench comprises 34 environments and 217 tasks across three scaffold types, paired with an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Result: Experiments on eight representative models reveal a systematic gap between task-solving ability and scaffold-aware compliance. Conclusion: Training and evaluation methods that explicitly target heterogeneous instruction following are needed to advance more scaffold-aware coding agents. Abstract: Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. OctoBench includes 34 environments and 217 tasks instantiated under three scaffold types, and is paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly targets heterogeneous instruction following. We release the benchmark to support reproducible benchmarking and to accelerate the development of more scaffold-aware coding agents.
[61] Training-Trajectory-Aware Token Selection
Zhanming Shen,Jiaqi Hu,Zeyu Qin,Hao Chen,Wentao Ye,Zenan Huang,Yihong Zhuang,Guoshan Lu,Junlin Zhou,Junbo Zhao
Main category: cs.CL
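TL;DR: This paper proposes T3S, a training-trajectory-aware token selection method that addresses the poor results of continual distillation on students with strong reasoning ability. By reconstructing the training objective at the token level, it substantially improves reasoning efficiency and performance in both AR and dLLM settings.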
Details
Motivation: When the student model already has strong reasoning ability, conventional continual distillation often yields degradation or only limited gains. The paper analyzes the root cause and proposes a more efficient distillation strategy. Method: The authors observe a "confidence bifurcation" during training: some tokens (Imitation-Anchor Tokens) are optimized quickly and dominate the training path, suppressing confidence growth for the remaining yet-to-learn tokens and producing a performance bottleneck. T3S reconstructs the training objective at the token level to clear the optimization path for yet-to-learn tokens. Result: T3S yields clear gains across models and settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1; Qwen3-32B approaches Qwen3-235B; and T3-trained LLaDA-2.0-Mini exceeds its autoregressive baseline, reaching state of the art among 16B-scale no-think models. Conclusion: By modeling training trajectories at fine-grained token level, T3S addresses distillation failure in strong student models and offers a new approach to efficient knowledge transfer. Abstract: Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck, before gradually recovering. We further uncover a token-level mechanism: confidence bifurcates into steadily increasing Imitation-Anchor Tokens that quickly anchor optimization and other yet-to-learn tokens whose confidence is suppressed until after the bottleneck. The fact that these two types of tokens cannot coexist is the root cause of the failure in continual distillation. To this end, we propose Training-Trajectory-Aware Token Selection (T3S) to reconstruct the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3 yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among all 16B-scale no-think models.
[62] Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text
Zhihao Xu,Rumei Li,Jiahuan Li,Rongxiang Weng,Jingang Wang,Xunliang Cai,Xiting Wang
Main category: cs.CL
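TL;DR: The paper proposes GEM, a new paradigm for generating multi-turn tool-use trajectories from text corpora. A four-stage pipeline extracts realistic, scalable interaction data from text, and a dedicated trajectory synthesizer is trained to reduce cost, substantially improving LLM performance and generalization on multi-turn tool-use tasks.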
Details
Motivation: Acquiring diverse and realistic multi-turn tool-use data is difficult, which limits the use of large language models for building autonomous agents. Method: The GEM data synthesis pipeline has four stages (relevance filtering, workflow and tool extraction, trajectory grounding, and complexity refinement), and an efficient Trajectory Synthesizer model is trained to distill the whole pipeline. Result: GEM-32B improves by 16.5% on the BFCL V3 Multi-turn benchmark and partially surpasses models trained on τ-bench in-domain data. The Trajectory Synthesizer matches the quality of the full pipeline while greatly reducing inference latency and cost. Conclusion: Text-corpus-based data synthesis can generate high-quality multi-turn tool-use trajectories with good generalization and practicality, providing a scalable solution for building autonomous agents with tool-calling ability. Abstract: Enabling Large Language Models (LLMs) to effectively utilize tools in multi-turn interactions is essential for building capable autonomous agents. However, acquiring diverse and realistic multi-turn tool-use data remains a significant challenge. In this work, we propose a novel text-based paradigm. We observe that textual corpora naturally contain rich, multi-step problem-solving experiences, which can serve as an untapped, scalable, and authentic data source for multi-turn tool-use tasks. Based on this insight, we introduce GEM, a data synthesis pipeline that enables the generation and extraction of multi-turn tool-use trajectories from text corpora through a four-stage process: relevance filtering, workflow & tool extraction, trajectory grounding, and complexity refinement. To reduce the computational cost, we further train a specialized Trajectory Synthesizer via supervised fine-tuning. This model distills the complex generation pipeline into an efficient, end-to-end trajectory generator. Experiments demonstrate that our GEM-32B achieves a 16.5% improvement on the BFCL V3 Multi-turn benchmark. Our models partially surpass the performance of models trained on τ-bench (Airline and Retail) in-domain data, highlighting the superior generalization capability derived from our text-based synthesis paradigm. Notably, our Trajectory Synthesizer matches the quality of the full pipeline while significantly reducing inference latency and costs.
[63] The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models
Christina Lu,Jack Gallagher,Jonathan Michala,Kyle Fish,Jack Lindsey
Main category: cs.CL
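TL;DR: This paper studies the structure of the "Assistant Axis" in large language models. The axis determines whether a model is operating in its default Assistant mode and affects the consistency and safety of its behavior. Constraining activations along this direction stabilizes the model's persona and prevents "persona drift" in complex conversations or under adversarial attacks.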
Details
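Motivation: To explore the structure of the space of personas in large language models and to understand why models, in certain situations, deviate from their default Assistant identity and exhibit abnormal or harmful behavior. Method: Activation directions corresponding to different character archetypes are extracted and used to analyze the structure of persona space across several models. The dominant "Assistant Axis" is identified and measured in both pre-trained and post-trained models, and activations are restricted to control model behavior along the axis. Result: An "Assistant Axis" is found across models, and position along it tracks how strongly a model behaves as an Assistant. Moving away from it induces a mystical, theatrical speaking style. The axis already exists in pre-trained models, where it promotes helpful human archetypes and inhibits spiritual ones. Deviation along the axis predicts "persona drift", and restricting the activation range improves stability and resists persona-based jailbreaks. Conclusion: Post-training pushes models toward a particular region of persona space but does not bind them to it firmly. The Assistant Axis offers an actionable handle for persona control, and future work should develop training and steering strategies that anchor models more deeply to a coherent persona. Abstract: Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We investigate the structure of the space of model personas by extracting activation directions corresponding to diverse character archetypes. Across several different models, we find that the leading component of this persona space is an "Assistant Axis," which captures the extent to which a model is operating in its default Assistant mode. Steering towards the Assistant direction reinforces helpful and harmless behavior; steering away increases the model's tendency to identify as other entities. Moreover, steering away with more extreme values often induces a mystical, theatrical speaking style. We find this axis is also present in pre-trained models, where it primarily promotes helpful human archetypes like consultants and coaches and inhibits spiritual ones. Measuring deviations along the Assistant Axis predicts "persona drift," a phenomenon where models slip into exhibiting harmful or bizarre behaviors that are uncharacteristic of their typical persona. We find that persona drift is often driven by conversations demanding meta-reflection on the model's processes or featuring emotionally vulnerable users. We show that restricting activations to a fixed region along the Assistant Axis can stabilize model behavior in these scenarios -- and also in the face of adversarial persona-based jailbreaks.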
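Restricting activations to a fixed region along a direction can be read as clamping the scalar projection of each hidden state onto that direction while leaving the orthogonal component untouched. The sketch below illustrates that reading only; the function name, the bounds, and the choice of where in the model the clamp is applied are assumptions, not details taken from the paper.

```python
import numpy as np


def clamp_along_axis(hidden, axis, lo, hi):
    """Clamp each hidden state's projection onto `axis` to the interval [lo, hi].

    hidden: (n_tokens, d) array of activations.
    axis:   (d,) direction vector (normalized inside the function).
    lo, hi: bounds on the scalar projection (hypothetical values chosen by the user).
    """
    hidden = np.asarray(hidden, dtype=float)
    v = np.asarray(axis, dtype=float)
    v = v / np.linalg.norm(v)
    proj = hidden @ v                   # scalar coordinate of each token along the axis
    clipped = np.clip(proj, lo, hi)     # target coordinate after restriction
    # Shift each state along v only; components orthogonal to v are unchanged.
    return hidden + np.outer(clipped - proj, v)


if __name__ == "__main__":
    h = np.array([[3.0, 4.0],
                  [0.0, 1.0]])
    out = clamp_along_axis(h, axis=np.array([2.0, 0.0]), lo=1.0, hi=2.0)
    print(out)
```

States whose projection already lies inside the interval pass through unchanged, so an intervention of this kind only acts when the model moves outside the chosen region.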
Our results suggest that post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona.
[64] INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects
Tarun Sharma,Manikandan Ravikiran,Sourava Kumar Behera,Pramit Bhattacharya,Arnab Bhattacharya,Rohit Saluja
Main category: cs.CL
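TL;DR: This paper introduces INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs covering 11 dialects and 2 languages (Hindi and Odia), and builds a multi-task benchmark with dialect classification, multiple-choice question answering, and machine translation. Experiments show that existing large models perform poorly on these tasks, while fine-tuned models pretrained on Indian languages improve performance substantially.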
Details
Motivation: Most low-resource dialects are neglected in NLP research. In India in particular, Hindi and Odia are widely spoken, yet their dialects lack web data and research support. Method: The authors build the INDIC-DIALECT parallel corpus and design a multi-task benchmark covering dialect classification, multiple-choice question answering, and machine translation, on which large models and fine-tuned models are evaluated. Result: GPT-4o and Gemini 2.5 perform poorly on dialect classification; fine-tuned models pretrained on Indian languages raise F1 from 19.6% to 89.8%; for dialect-to-language translation, a hybrid AI model achieves a BLEU score of 61.32 (baseline 23.36); for language-to-dialect translation, the "rule-based followed by AI" approach achieves the best BLEU of 48.44 (baseline 27.59). Conclusion: INDIC-DIALECT provides a new benchmark for dialect-aware Indic NLP and will be open-sourced to support research on low-resource Indian dialects. Abstract: Recent NLP advances focus primarily on standardized languages, leaving most low-resource dialects under-served, especially in Indian scenarios. In India, the issue is particularly important: despite Hindi being the third most spoken language globally (over 600 million speakers), its numerous dialects remain underrepresented. The situation is similar for Odia, which has around 45 million speakers. While some datasets exist which contain standard Hindi and Odia languages, their regional dialects have almost no web presence. We introduce INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 dialects and 2 languages: Hindi and Odia. Using this corpus, we construct a multi-task benchmark with three tasks: dialect classification, multiple-choice question (MCQ) answering, and machine translation (MT). Our experiments show that LLMs like GPT-4o and Gemini 2.5 perform poorly on the classification task, while fine-tuned transformer-based models pretrained on Indian languages substantially improve performance, e.g., raising F1 from 19.6% to 89.8% on dialect classification. For dialect-to-language translation, we find that a hybrid AI model achieves the highest BLEU score of 61.32 compared to the baseline score of 23.36. Interestingly, due to the complexity of generating dialect sentences, we observe that for language-to-dialect translation the "rule-based followed by AI" approach achieves the best BLEU score of 48.44 compared to the baseline score of 27.59.
INDIC-DIALECT is thus a new benchmark for dialect-aware Indic NLP, and we plan to release it as open source to support further work on low-resource Indian dialects.
[65] TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction
Mihai Dan Nadas,Laura Diosan,Andreea Tomescu,Andrei Piscoran
Main category: cs.CL
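TL;DR: TF3-RO is an end-to-end language modeling pipeline for Romanian that covers the full process from tokenizer design to synthetic data generation. It aims to train compact, high-performing Romanian models and to generate a large-scale Romanian-native narrative corpus.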
Details
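Motivation: Morphologically rich but computationally under-resourced languages such as Romanian lack a unified, reproducible end-to-end modeling pipeline, particularly for synthetic data generation. A complete framework covering tokenization, pretraining, compression, and evaluation is therefore needed. Method: Building on the English corpus TF1 and its high-quality Romanian translation TF2, the TF3-RO framework designs BPE and Unigram tokenizers suited to Romanian morphology; pretrains a 51.65M-parameter LLaMA-style Transformer from scratch with long-sequence packing; derives a 26.45M-parameter student model through quantization, structured pruning, and logit distillation; and uses that model with controlled combinatorial prompting to generate three million synthetic Romanian fables. Result: The authors build efficient Romanian-specific tokenizers that mitigate the token inflation caused by complex morphology, train a compact 26.45M-parameter model with strong deployment characteristics, and generate a large-scale, linguistically coherent synthetic Romanian fable dataset that performs well on intrinsic metrics, grammatical agreement, entity coherence, and LLM-based assessment. Conclusion: TF3-RO provides a reproducible, linguistically grounded modeling pipeline for resource-constrained, morphologically rich languages and demonstrates that efficient model training and high-quality synthetic data generation are feasible in low-resource languages. Abstract: Recent advances in synthetic data generation have shown that compact language models can be trained effectively when the underlying corpus is structurally controlled and linguistically coherent. However, for morphologically rich and computationally under-resourced languages such as Romanian, there is still no openly documented, end-to-end pipeline that unifies tokenizer design, preprocessing, pretraining, compression, evaluation, and large-scale synthetic data generation in a reproducible framework. Building on TF1, a three-million-story English fable dataset, and TF2, which extends TF1 through high-quality Romanian translations, we introduce TF3-RO, a Romanian-centric language modeling pipeline spanning tokenizer training, from-scratch model development, and Romanian-native dataset generation. TF3-RO constructs Romanian-specific BPE and Unigram tokenizers from a linguistically informed corpus to mitigate token inflation induced by Romanian morphology. Using long-sequence packed training, we pretrain a 51.65M-parameter LLaMA-style Transformer entirely from scratch. The model is subsequently optimized through quantization, structured pruning, and logit-based knowledge distillation, yielding a compact 26.45M-parameter student model with tied embeddings and strong deployment characteristics. Using this distilled model, TF3-RO generates three million Romanian-native synthetic fables via a controlled combinatorial prompting framework.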
Across all stages, the pipeline integrates a comprehensive evaluation suite combining intrinsic metrics, Romanian agreement probes, entity coherence, rule-based grammar checking, and LLM-based assessment. TF3-RO provides a reproducible and linguistically grounded framework for training compact Romanian language models and producing large-scale synthetic narrative corpora.
[66] Are Language Models Models?
Philip Resnik
Main category: cs.CL
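TL;DR: The claim that language models (LMs) serve as cognitive models is problematic at each of Marr's three levels; LMs are better regarded as tools than as cognitive models.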
Details
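Motivation: To assess whether language models are really suitable as cognitive models and to expose the problems at the implementation, algorithmic-representational, and computational theory levels. Method: The claim that language models are cognitive models is assessed systematically against Marr's three levels of analysis (implementation, algorithmic-representational, and computational theory). Result: The claim is clearly not true at the implementation level, poorly motivated at the algorithmic-representational level, and problematic at the computational theory level. Conclusion: Language models are better viewed as research tools than as genuine cognitive models; calling them cognitive models overstates the case and feeds LLM hype. Abstract: Futrell and Mahowald claim LMs "serve as model systems", but an assessment at each of Marr's three levels suggests the claim is clearly not true at the implementation level, poorly motivated at the algorithmic-representational level, and problematic at the computational theory level. LMs are good candidates as tools; calling them cognitive models overstates the case and unnecessarily feeds LLM hype.
[67] SurgGoal: Rethinking Surgical Planning Evaluation via Goal-Satisfiability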
Ruochen Li,Kun Yuan,Yufei Xia,Yue Zhou,Qingyu Lu,Weihang Li,Youxiang Zhu,Nassir Navab
Main category: cs.CL
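TL;DR: This paper proposes an evaluation of surgical planning correctness based on expert-defined rules. A multicentric meta-evaluation benchmark exposes the perception and reasoning shortcomings of current video-language models and shows that structural knowledge is the more effective way to improve model performance.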
Details
Motivation: Current methods for evaluating vision-language models in surgical planning are not reliable enough, and effective evaluation standards are lacking for safety-critical settings. Method: Planning correctness is defined via phase-goal satisfiability. The authors build a multicentric meta-evaluation benchmark containing valid variations and invalid plans, and adopt a rule-based goal-satisfiability metric for high-precision evaluation. Result: Experiments show that sequence similarity metrics misjudge planning quality, whereas the rule-based metric distinguishes valid from invalid plans more accurately. Under constrained settings, models reveal perception errors and insufficient reasoning, and structural knowledge consistently improves performance. Conclusion: Structural knowledge is essential for improving the reliability of VLMs in surgical planning, and future work should incorporate structural constraints to enable safer decision support. Abstract: Surgical planning integrates visual perception, long-horizon reasoning, and procedural knowledge, yet it remains unclear whether current evaluation protocols reliably assess vision-language models (VLMs) in safety-critical settings. Motivated by a goal-oriented view of surgical planning, we define planning correctness via phase-goal satisfiability, where plan validity is determined by expert-defined surgical rules. Based on this definition, we introduce a multicentric meta-evaluation benchmark with valid procedural variations and invalid plans containing order and content errors. Using this benchmark, we show that sequence similarity metrics systematically misjudge planning quality, penalizing valid plans while failing to identify invalid ones. We therefore adopt a rule-based goal-satisfiability metric as a high-precision meta-evaluation reference to assess Video-LLMs under progressively constrained settings, revealing failures due to perception errors and under-constrained reasoning. Structural knowledge consistently improves performance, whereas semantic guidance alone is unreliable and benefits larger models only when combined with structural constraints.
[68] Contextual StereoSet: Stress-Testing Bias Alignment Robustness in Large Language Models
Abhinaba Basu,Pavan Chakraborty
Main category: cs.CL
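TL;DR: This paper introduces the Contextual StereoSet benchmark and Context Sensitivity Fingerprints (CSF) for evaluating how sensitive a language model's bias is to contextual conditions such as time, setting, and audience. It finds that bias measured under fixed conditions may not generalize, and argues that evaluators should ask "under what conditions does bias appear?" rather than "is this model biased?".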
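One ingredient this entry names, a paired contrast with a bootstrap confidence interval, can be sketched in a few lines. This is a minimal percentile bootstrap under assumed inputs (per-item 0/1 stereotype selections for the same items under two framings); the function name, the toy data, and the omission of FDR correction are choices made for this example and do not reproduce the paper's CSF computation.

```python
import random


def paired_contrast_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Mean paired difference (a - b) with a percentile bootstrap interval."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    boot_means = []
    for _ in range(n_boot):
        # Resample items with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot_means.append(sum(sample) / n)
    boot_means.sort()
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lower, upper


# Toy data: 1 = stereotype option chosen, 0 = not chosen, same items in both framings.
framing_1990 = [1, 1, 0, 1, 1, 1, 0, 1]
framing_2030 = [0, 1, 0, 0, 1, 0, 0, 1]
mean_diff, lower, upper = paired_contrast_ci(framing_1990, framing_2030)
print(mean_diff, lower, upper)
```

Resampling the paired differences, rather than each condition separately, keeps the item-level pairing intact, which is what makes the contrast a within-item comparison.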
Details
Motivation: Existing bias evaluations are usually run in a fixed context and therefore say little about how a model behaves in real deployment. The authors aim to show how changes in context affect model bias and to propose a more robust evaluation framework. Method: Builds the Contextual StereoSet benchmark, which holds stereotype content fixed while systematically varying the context (such as time, setting, and observer perspective); designs two evaluation protocols (a full diagnostic grid and a budgeted screening protocol); and proposes CSF to quantify how a model's bias sensitivity is distributed across dimensions. Result: Tests on 13 models show that contextual changes (anchoring to 1990 rather than 2030, gossip framing, out-group observers) significantly affect stereotype selection; the effects replicate in hiring, lending, and help-seeking vignettes; CSF provides fine-grained sensitivity profiles with confidence intervals. Conclusion: Bias scores obtained under fixed conditions may not generalize, so evaluation should examine how stable a model's behavior is across diverse contexts; CSF offers a new diagnostic tool that moves bias evaluation from a binary verdict toward condition-dependent analysis. Abstract: A model that avoids stereotypes in a lab benchmark may not avoid them in deployment. We show that measured bias shifts dramatically when prompts mention different places, times, or audiences -- no adversarial prompting required. We introduce Contextual StereoSet, a benchmark that holds stereotype content fixed while systematically varying contextual framing. Testing 13 models across two protocols, we find striking patterns: anchoring to 1990 (vs. 2030) raises stereotype selection in all models tested on this contrast (p<0.05); gossip framing raises it in 5 of 6 full-grid models; out-group observer framing shifts it by up to 13 percentage points. These effects replicate in hiring, lending, and help-seeking vignettes. We propose Context Sensitivity Fingerprints (CSF): a compact profile of per-dimension dispersion and paired contrasts with bootstrap CIs and FDR correction. Two evaluation tracks support different use cases -- a 360-context diagnostic grid for deep analysis and a budgeted protocol covering 4,229 items for production screening. The implication is methodological: bias scores from fixed-condition tests may not generalize. This is not a claim about ground-truth bias rates; it is a stress test of evaluation robustness. CSF forces evaluators to ask, "Under what conditions does bias appear?" rather than "Is this model biased?" We release our benchmark, code, and results.
[69] DR-Arena: an Automated Evaluation Framework for Deep Research Agents
Yiwen Gao,Ruochen Zhao,Yang Deng,Wenxuan Zhang
Main category: cs.CL
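TL;DR: This paper proposes DR-Arena, a fully automated framework for dynamically evaluating large language models that act as deep research agents; it uses real-time Information Trees and an Adaptive Evolvement Loop and aligns closely with human preferences.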
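Alignment with a human-preference leaderboard is reported for this entry as a Spearman rank correlation. For readers who want to see what that statistic computes, here is a minimal pure-Python sketch; the two rankings at the bottom are invented for illustration and are not the paper's data.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical positions of six agents under two evaluations.
automatic_ranking = [1, 2, 3, 4, 5, 6]
human_ranking = [1, 2, 3, 4, 6, 5]
print(spearman(automatic_ranking, human_ranking))
```

With only six items the statistic moves in coarse steps: the toy rankings above differ by a single adjacent swap and already give 33/35, roughly 0.943.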
Details
Motivation: Benchmarks built on static datasets suffer from limited task generality, temporal misalignment, and data contamination, which makes them unreliable for evaluating large language models that conduct research autonomously. Method: Builds dynamic Information Trees from real-time web trends and uses an automated Examiner to generate structured tasks that test deep reasoning and wide coverage; an Adaptive Evolvement Loop raises task difficulty according to real-time performance. Result: In experiments with six advanced deep research agents, DR-Arena reaches a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard, reported as the closest alignment with human preferences among automated methods that need no manual intervention. Conclusion: DR-Arena is an efficient and reliable automated evaluation framework that can measure the capability boundaries of deep research agents and can serve as an alternative to costly human adjudication. Abstract: As Large Language Models (LLMs) increasingly operate as Deep Research (DR) Agents capable of autonomous investigation and information synthesis, reliable evaluation of their task performance has become a critical bottleneck. Current benchmarks predominantly rely on static datasets, which suffer from several limitations: limited task generality, temporal misalignment, and data contamination. To address these, we introduce DR-Arena, a fully automated evaluation framework that pushes DR agents to their capability limits through dynamic investigation. DR-Arena constructs real-time Information Trees from fresh web trends to ensure the evaluation rubric is synchronized with the live world state, and employs an automated Examiner to generate structured tasks testing two orthogonal capabilities: Deep reasoning and Wide coverage. DR-Arena further adopts an Adaptive Evolvement Loop, a state-machine controller that dynamically escalates task complexity based on real-time performance, demanding deeper deduction or wider aggregation until a decisive capability boundary emerges. Experiments with six advanced DR agents demonstrate that DR-Arena achieves a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard. This represents the state-of-the-art alignment with human preferences without any manual effort, validating DR-Arena as a reliable alternative for costly human adjudication.
[70] AEQ-Bench: Measuring Empathy of Omni-Modal Large Models
Xuan Luo,Lewei Yao,Libo Zhao,Lanqing Hong,Kai Chen,Dehua Tao,Daxin Tan,Ruifeng Xu,Jing Li
Main category: cs.CL
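TL;DR: AEQ-Bench is a new benchmark for evaluating the empathy of omni-modal large models: their ability to understand affective multi-modal input (audio + text) and generate empathetic responses, and their ability to judge the empathy of audio responses without text transcription.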
Details
Motivation: Because empathy is inherently affective, assessing it is a major challenge for the automatic evaluation of omni-modal large models, so a systematic way to measure this capability is needed. Method: Introduces AEQ-Bench, which includes two new settings that vary context specificity and speech tone, and evaluates models comprehensively with linguistic and paralinguistic metrics. Result: Omni-modal models with audio output capabilities generally outperform models with text-only output; the models agree with human judgments on coarse-grained quality assessment but remain unreliable when evaluating fine-grained paralinguistic expressiveness. Conclusion: AEQ-Bench provides an effective tool for evaluating the empathy of omni-modal large models and exposes the limits of current models in understanding fine-grained emotional expression. Abstract: While the automatic evaluation of omni-modal large models (OLMs) is essential, assessing empathy remains a significant challenge due to its inherent affectivity. To investigate this challenge, we introduce AEQ-Bench (Audio Empathy Quotient Benchmark), a novel benchmark to systematically assess two core empathetic capabilities of OLMs: (i) generating empathetic responses by comprehending affective cues from multi-modal inputs (audio + text), and (ii) judging the empathy of audio responses without relying on text transcription. Compared to existing benchmarks, AEQ-Bench incorporates two novel settings that vary in context specificity and speech tone. Comprehensive assessment across linguistic and paralinguistic metrics reveals that (1) OLMs trained with audio output capabilities generally outperformed models with text-only outputs, and (2) while OLMs align with human judgments for coarse-grained quality assessment, they remain unreliable for evaluating fine-grained paralinguistic expressiveness.
[71] PERM: Psychology-grounded Empathetic Reward Modeling for Large Language Models
Chengbing Wang,Wuqiang Zheng,Yang Zhang,Fengbin Zhu,Junyi Cheng,Yi Xie,Wenjie Wang,Fuli Feng
Main category: cs.CL
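TL;DR: Proposes PERM, a psychology-grounded empathetic reward modeling method that evaluates empathy bidirectionally from supporter, seeker, and bystander perspectives, and substantially improves the performance of large language models on emotional support tasks.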
Details
Motivation: Existing empathetic reward models mostly evaluate from a single perspective and overlook the bidirectional nature of empathy between supporter and seeker, so they struggle to reflect how well empathy actually works. Method: Building on Empathy Cycle theory, PERM decomposes empathy evaluation into a supporter perspective (internal resonation and communicative expression), a seeker perspective (emotional reception), and a bystander perspective (overall interaction quality), and uses these to build multi-perspective reward signals for reinforcement learning. Result: On a public benchmark and an industrial conversation dataset, PERM outperforms state-of-the-art baselines by more than 10%; in a blinded user study, 70% of users preferred the responses generated with this approach. Conclusion: Through psychology-driven, multi-perspective reward modeling, PERM improves the empathy of large language models on emotional support tasks, with more authentic interaction and higher user satisfaction. Abstract: Large Language Models (LLMs) are increasingly deployed in human-centric applications, yet they often fail to provide substantive emotional support. While Reinforcement Learning (RL) has been utilized to enhance empathy of LLMs, existing reward models typically evaluate empathy from a single perspective, overlooking the inherently bidirectional interaction nature of empathy between the supporter and seeker as defined by Empathy Cycle theory. To address this limitation, we propose Psychology-grounded Empathetic Reward Modeling (PERM). PERM operationalizes empathy evaluation through a bidirectional decomposition: 1) Supporter perspective, assessing internal resonation and communicative expression; 2) Seeker perspective, evaluating emotional reception. Additionally, it incorporates a bystander perspective to monitor overall interaction quality. Extensive experiments on a widely-used emotional intelligence benchmark and an industrial daily conversation dataset demonstrate that PERM outperforms state-of-the-art baselines by over 10%. Furthermore, a blinded user study reveals a 70% preference for our approach, highlighting its efficacy in generating more empathetic responses. Our code, dataset, and models are available at https://github.com/ZhengWwwq/PERM.
[72] Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure
Syed Naveed Mahmood,Md. Rezaur Rahman Bhuiyan,Tasfia Zaman,Jareen Tasneem Khondaker,Md. Sameer Sakib,Nazia Tasnim,Farig Sadeque
Main category: cs.CL
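TL;DR: This paper proposes the Knowledge Immunization Framework (KIF), which achieves genuine knowledge unlearning by targeting internal activation signatures, addressing the way existing unlearning methods conflate behavioral suppression with true knowledge removal.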
Details
Motivation: Existing LLM unlearning methods struggle to distinguish surface-level refusal from true knowledge removal, so latent capabilities persist, which undermines GDPR compliance and model safety. Method: Proposes KIF, which combines dynamic suppression of subject-specific representations with parameter-efficient adaptation and acts directly on internal activation signatures rather than surface outputs, achieving durable unlearning without full retraining. Result: KIF is validated on standard foundation models (Llama, Mistral) and reasoning-prior models (Qwen, DeepSeek); it achieves near-oracle erasure (FQ ≈ 0.99) while preserving utility (MU = 0.62). Standard models show scale-independent true erasure, whereas reasoning-prior models reveal fundamental architectural divergence. Conclusion: KIF breaks the earlier tradeoff between stability and erasure and provides the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales; the proposed dual-metric evaluation protocol operationalizes the distinction between obfuscation and true erasure. Abstract: Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. In this work, we address this challenge by introducing Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures rather than surface outputs. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining. KIF achieves near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff that has constrained all prior work. We evaluate both standard foundation models (Llama and Mistral) and reasoning-prior models (Qwen and DeepSeek) across 3B to 14B parameters. Our observation shows that standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence. Our comprehensive dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, operationalizes the obfuscation-erasure distinction and enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales.
[73] Form and Meaning in Intrinsic Multilingual Evaluations
Wessel Poelman,Miryam de Lhoneux
Main category: cs.CL
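TL;DR: This paper examines the assumptions behind intrinsic evaluation metrics for conditional language models (such as perplexity or bits-per-character) in multilingual settings, and their implications. Experiments show that current metrics are not universally comparable, and the form-meaning debate is used to explain why.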
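For reference, the two metrics this entry names follow directly from a model's per-token log-probabilities. The sketch below uses the standard textbook definitions; the log-probabilities and strings are made up, and the example is only meant to show how the normalizing unit (tokens or characters) enters each number, not to reproduce the paper's experiments.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def bits_per_character(token_logprobs, text):
    """Total negative log2-probability of the text divided by its character count."""
    total_bits = -sum(token_logprobs) / math.log(2)
    return total_bits / len(text)


# Two hypothetical scorings with the same total probability (1/16),
# split over a different number of tokens and characters.
sentence_a, logprobs_a = "abcdefgh", [math.log(0.25)] * 2
sentence_b, logprobs_b = "abcdefghijklmnop", [math.log(0.5)] * 4

print(perplexity(logprobs_a), bits_per_character(logprobs_a, sentence_a))
print(perplexity(logprobs_b), bits_per_character(logprobs_b, sentence_b))
```

Both toy sentences carry the same four bits in total, yet their perplexity and bits-per-character differ because the segmentation and surface length differ, which is one concrete way form can affect these scores independently of meaning.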
Details
Motivation: To examine whether the assumptions behind using metrics such as perplexity to evaluate conditional language models in multilingual settings are reasonable, especially the assumption that parallel sentences carry the same semantic content. Method: Makes explicit the information-theoretic basis of current evaluation metrics and the assumptions attached to them, and runs experiments with six metrics on two multi-parallel corpora using both monolingual and multilingual models. Result: Existing intrinsic evaluation metrics are not universally comparable across languages or models. Conclusion: Current metrics cannot be used directly for quality comparisons across languages or models; their limits need to be understood in light of the theory of form and meaning. Abstract: Intrinsic evaluation metrics for conditional language models, such as perplexity or bits-per-character, are widely used in both mono- and multilingual settings. These metrics are rather straightforward to use and compare in monolingual setups, but rest on a number of assumptions in multilingual setups. One such assumption is that comparing the perplexity of CLMs on parallel sentences is indicative of their quality since the information content (here understood as the semantic meaning) is the same. However, the metrics are inherently measuring information content in the information-theoretic sense. We make this and other such assumptions explicit and discuss their implications. We perform experiments with six metrics on two multi-parallel corpora both with mono- and multilingual models. Ultimately, we find that current metrics are not universally comparable. We look at the form-meaning debate to provide some explanation for this.
[74] Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs
Yuxi Xia,Loris Schoenegger,Benjamin Roth
Main category: cs.CL
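TL;DR: This paper proposes TracVC, a method for tracing the confidence that large language models (LLMs) verbalize in their answers back to the training data. It finds that models often draw on linguistic patterns unrelated to the question rather than on relevant content, which points to "superficial confidence" under current training regimes.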
Details
Motivation: Because LLMs are often inaccurately overconfident, the authors want to understand where their verbalized confidence comes from in order to improve model trustworthiness. Method: Proposes TracVC, which combines information retrieval with influence estimation to trace confidence expressions in LLM output back to training data, and introduces a content groundness metric that measures whether confidence rests on content-related examples. Result: Experiments on OLMo and Llama models show that OLMo2-13B is frequently influenced by confidence-related data that has no lexical connection to the question, suggesting it tends to mimic surface linguistic patterns rather than reason from content. Conclusion: Current LLMs may learn to sound confident rather than to be confident for good reason; training regimes need to improve so that expressed confidence becomes more genuine and reliable. Abstract: Large language models (LLMs) can increase users' perceived trust by verbalizing confidence in their outputs. However, prior work has shown that LLMs are often overconfident, making their stated confidence unreliable since it does not consistently align with factual accuracy. To better understand the sources of this verbalized confidence, we introduce TracVC (Tracing Verbalized Confidence), a method that builds on information retrieval and influence estimation to trace generated confidence expressions back to the training data. We evaluate TracVC on OLMo and Llama models in a question answering setting, proposing a new metric, content groundness, which measures the extent to which an LLM grounds its confidence in content-related training examples (relevant to the question and answer) versus in generic examples of confidence verbalization. Our analysis reveals that OLMo2-13B is frequently influenced by confidence-related data that is lexically unrelated to the query, suggesting that it may mimic superficial linguistic expressions of certainty rather than rely on genuine content grounding. These findings point to a fundamental limitation in current training regimes: LLMs may learn how to sound confident without learning when confidence is justified. Our analysis provides a foundation for improving LLMs' trustworthiness in expressing more reliable confidence.
[75] Detecting Winning Arguments with Large Language Models and Persuasion Strategies
Tiziano Labruna,Arkadiusz Modzelewski,Giorgio Satta,Giovanni Da San Martino
Main category: cs.CL
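TL;DR: This paper studies the role of persuasion strategies in judging how persuasive a text is. Using large language models with a Multi-Strategy Persuasion Scoring approach on three annotated datasets, it shows that strategy-based reasoning improves persuasiveness prediction, and it releases a topic-annotated dataset.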
Details
Motivation: Understanding the role of persuasion in human communication and identifying persuasion strategies in argumentative text are important for improving text analysis. Method: Uses large language models with a Multi-Strategy Persuasion Scoring approach that directs the model's attention to six persuasion strategies, runs experiments on several datasets, and groups the Winning Arguments dataset by topic to analyze the influence of content. Result: Strategy-guided reasoning improves persuasiveness prediction, performance varies across topics, and the released topic-annotated dataset supports follow-up research. Conclusion: Structured, strategy-aware prompting can make argument quality assessment more interpretable and robust, with broad potential for application. Abstract: Detecting persuasion in argumentative text is a challenging task with important implications for understanding human communication. This work investigates the role of persuasion strategies (such as Attack on reputation, Distraction, and Manipulative wording) in determining the persuasiveness of a text. We conduct experiments on three annotated argument datasets: Winning Arguments (built from the Change My View subreddit), Anthropic/Persuasion, and Persuasion for Good. Our approach leverages large language models (LLMs) with a Multi-Strategy Persuasion Scoring approach that guides reasoning over six persuasion strategies. Results show that strategy-guided reasoning improves the prediction of persuasiveness. To better understand the influence of content, we organize the Winning Arguments dataset into broad discussion topics and analyze performance across them. We publicly release this topic-annotated version of the dataset to facilitate future research. Overall, our methodology demonstrates the value of structured, strategy-aware prompting for enhancing interpretability and robustness in argument quality assessment.
[76] LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals
Gilat Toker,Nitay Calderon,Ohad Amosy,Roi Reichart
Main category: cs.CL
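TL;DR: This paper proposes LIBERTy, a framework that evaluates the faithfulness of concept-based explanations using structural counterfactuals generated with large language models and structural causal models. It builds datasets for three high-stakes domains and a new metric, order-faithfulness, to evaluate existing explanation methods systematically and to analyze model sensitivity to concept interventions.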
Details
Motivation: Evaluation of concept-based explanations currently relies on human-written counterfactuals, which are costly and imperfect, and there is no scalable, reliable benchmark for measuring the faithfulness of explanation methods. Method: Proposes LIBERTy, which combines explicitly defined structural causal models (SCMs) with large language models to generate counterfactual texts after an intervention; builds datasets for three application scenarios and introduces a new metric, order-faithfulness, which assesses whether an explanation method preserves the correct ordering. Result: An evaluation of multiple explanation methods across five models shows substantial room for improvement; proprietary LLMs are markedly less sensitive to demographic concepts, possibly because of post-training mitigation. Conclusion: LIBERTy provides a scalable, controllable benchmark for developing more faithful concept-based explanation methods and supports explainable AI in high-stakes domains. Abstract: Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. To address this, we introduce a framework for constructing datasets containing structural counterfactual pairs: LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets). LIBERTy is grounded in explicitly defined Structured Causal Models (SCMs) of the text generation: interventions on a concept propagate through the SCM until an LLM generates the counterfactual. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.
[77] Grounding Agent Memory in Contextual Intent
Ruozhen Yang,Yucheng Jiang,Yueqi Jiang,Priyanka Kargupta,Yunyi Zhang,Jiawei Han
Main category: cs.CL
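TL;DR: STITCH is an agent memory system for long-horizon, goal-oriented interactions that improves context-aware retrieval through structured intent indexing and clearly outperforms existing methods on long trajectories.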
Details
Motivation: In long-horizon, goal-oriented interactions, similar entities and facts recur under different latent goals, so conventional memory systems tend to retrieve context-mismatched information, which leads to reasoning errors. Method: Proposes STITCH, which indexes each trajectory step with a structured retrieval cue, the contextual intent, and retrieves history by matching the intent of the current step. Contextual intent has three parts: the current latent goal, the action type, and the salient entity types. At inference time, memory snippets are filtered and prioritized by intent compatibility. Result: STITCH reaches state-of-the-art performance on CAME-Bench and LongMemEval, outperforming the strongest baseline by 35.6%, with larger gains as trajectories grow longer; the analysis shows that intent indexing substantially reduces retrieval noise. Conclusion: Structured intent indexing improves the accuracy of memory retrieval in long-horizon interactions and supports more robust context-aware reasoning, particularly in complex, dynamic task settings. Abstract: Deploying large language models in long-horizon, goal-oriented interactions remains challenging because similar entities and facts recur under different latent goals and constraints, causing memory systems to retrieve context-mismatched evidence. We propose STITCH (Structured Intent Tracking in Contextual History), an agentic memory system that indexes each trajectory step with a structured retrieval cue, contextual intent, and retrieves history by matching the current step's intent. Contextual intent provides compact signals that disambiguate repeated mentions and reduce interference: (1) the current latent goal defining a thematic segment, (2) the action type, and (3) the salient entity types anchoring which attributes matter. During inference, STITCH filters and prioritizes memory snippets by intent compatibility, suppressing semantically similar but context-incompatible history. For evaluation, we introduce CAME-Bench, a benchmark for context-aware retrieval in realistic, dynamic, goal-oriented trajectories. Across CAME-Bench and LongMemEval, STITCH achieves state-of-the-art performance, outperforming the strongest baseline by 35.6%, with the largest gains as trajectory length increases. Our analysis shows that intent indexing substantially reduces retrieval noise, supporting intent-aware memory for robust long-horizon reasoning.
[78] MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching
Changle Qu,Sunhao Dai,Hengyi Cai,Jun Xu,Shuaiqiang Wang,Dawei Yin
Main category: cs.CL
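TL;DR: Proposes MatchTIR, a framework that uses bipartite matching for fine-grained turn-level reward assignment together with dual-level advantage estimation, improving how effectively large models call tools in long-horizon multi-turn tasks.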
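The idea of deriving turn-level rewards from a matching between predicted and ground-truth tool calls can be sketched as follows. This is only a toy under stated assumptions: the similarity function, the one-to-one assignment rule, and the brute-force search over permutations (fine for a handful of calls, not for long traces) are illustrative choices made here, and the paper's two assignment strategies are not specified in this entry.

```python
from itertools import permutations


def call_similarity(pred, gold):
    """Hypothetical score: 0 if tool names differ, else 0.5 plus 0.5 times the share of matching arguments."""
    if pred["name"] != gold["name"]:
        return 0.0
    if not gold["args"]:
        return 1.0
    hits = sum(1 for key, value in gold["args"].items() if pred["args"].get(key) == value)
    return 0.5 + 0.5 * hits / len(gold["args"])


def turn_rewards(preds, golds):
    """Reward per predicted call under the one-to-one assignment with maximal total similarity."""
    n, m = len(preds), len(golds)
    size = max(n, m)  # pad the smaller side with zero-score dummy slots

    def score(i, j):
        return call_similarity(preds[i], golds[j]) if i < n and j < m else 0.0

    best = max(permutations(range(size)),
               key=lambda perm: sum(score(i, perm[i]) for i in range(size)))
    return [score(i, best[i]) for i in range(n)]


predicted = [
    {"name": "search", "args": {"q": "capital of France"}},
    {"name": "search", "args": {"q": "capital of France"}},  # redundant repeat
    {"name": "calc", "args": {"expr": "1+1"}},
]
ground_truth = [
    {"name": "search", "args": {"q": "capital of France"}},
    {"name": "calc", "args": {"expr": "1+1"}},
]
print(turn_rewards(predicted, ground_truth))
```

Because each ground-truth call can be claimed only once, the redundant second search receives no reward, which a uniform trajectory-level reward could not express.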
Details
Motivation: Existing reinforcement learning methods use coarse-grained credit assignment in long-horizon multi-turn tasks and cannot distinguish effective tool calls from redundant or erroneous ones. Method: Formulates credit assignment as a bipartite matching problem between predicted and ground-truth traces, uses two assignment strategies to produce dense turn-level rewards, and combines turn-level and trajectory-level signals in a dual-level advantage estimate. Result: Experiments on three benchmarks show that MatchTIR outperforms existing methods; the 4B model surpasses most 8B models, especially in long-horizon multi-turn tasks. Conclusion: Fine-grained reward assignment and dual-level advantage estimation improve LLM performance on complex tool-interaction tasks. Abstract: Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning uniform advantages to all steps within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our code is available at https://github.com/quchangle1/MatchTIR.
cs.CV [Back]
[79] Diffusion-Driven Deceptive Patches: Adversarial Manipulation and Forensic Detection in Facial Identity Verification
Shahrzad Sayyafzadeh,Hongmei Chi,Shonda Bernadin
Main category: cs.CV
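TL;DR: Proposes an end-to-end pipeline for generating, refining, and evaluating adversarial patches that attack facial biometric systems; it combines FGSM with a diffusion model to improve imperceptibility and uses ViT-GPT2 for semantic captioning to support forensic analysis.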
Details
Motivation: To test and analyze the security of facial biometric systems, in particular how imperceptible and effective adversarial patches are in realistic scenarios, in support of forensic investigation and security assessment. Method: Uses FGSM to generate adversarial noise against an identity classifier and applies the reverse diffusion process of a diffusion model, with Gaussian smoothing and adaptive brightness correction, to improve imperceptibility; applies the refined patch to face images and uses a ViT-GPT2 model to generate semantic captions for the adversarial images; detects and analyzes adversarial samples with perceptual hashing and segmentation. Result: The method fools the identity recognition system while keeping images visually natural (SSIM of 0.95) and supports effective detection of adversarial patches; it exposes vulnerabilities in identity verification and expression recognition. Conclusion: The pipeline efficiently generates hard-to-detect adversarial patches that compromise facial biometric systems while providing interpretable semantic output, making it suitable for security testing and forensic analysis. Abstract: This work presents an end-to-end pipeline for generating, refining, and evaluating adversarial patches to compromise facial biometric systems, with applications in forensic analysis and security testing. We utilize FGSM to generate adversarial noise targeting an identity classifier and employ a diffusion model with reverse diffusion to enhance imperceptibility through Gaussian smoothing and adaptive brightness correction, thereby facilitating synthetic adversarial patch evasion. The refined patch is applied to facial images to test its ability to evade recognition systems while maintaining natural visual characteristics. A Vision Transformer (ViT)-GPT2 model generates captions to provide a semantic description of a person's identity for adversarial images, supporting forensic interpretation and documentation for identity evasion and recognition attacks. The pipeline evaluates changes in identity classification, captioning results, and vulnerabilities in facial identity verification and expression recognition under adversarial conditions. We further demonstrate effective detection and analysis of adversarial patches and adversarial samples using perceptual hashing and segmentation, achieving an SSIM of 0.95.
[80] LCF3D: A Robust and Real-Time Late-Cascade Fusion Framework for 3D Object Detection in Autonomous Driving
Carlo Sgaravatti,Riccardo Pieroni,Matteo Corno,Sergio M. Savaresi,Luca Magri,Giacomo Boracchi
Main category: cs.CV
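TL;DR: Proposes LCF3D, a sensor fusion framework that improves 3D object detection by combining a 2D detector on RGB images with a 3D detector on LiDAR point clouds. Late fusion reduces false positives and cascade fusion recovers missed detections; the method shows clear gains for pedestrians, cyclists, and similar categories on KITTI and nuScenes, and generalizes well across domains.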
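The two fusion steps in this entry can be pictured with a small matching routine. The sketch assumes each LiDAR detection has already been projected to an image-plane box, and it uses greedy IoU matching with a 0.5 threshold; the projection, the matching rule, the threshold, and all names are assumptions made for illustration rather than the LCF3D implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def late_cascade_fusion(lidar_boxes_2d, rgb_boxes, iou_threshold=0.5):
    """Return (indices of LiDAR detections kept, indices of RGB detections left unmatched)."""
    kept_lidar, used_rgb = [], set()
    for i, lidar_box in enumerate(lidar_boxes_2d):
        best_j, best_iou = None, iou_threshold
        for j, rgb_box in enumerate(rgb_boxes):
            overlap = iou(lidar_box, rgb_box)
            if j not in used_rgb and overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j is not None:
            # Late fusion: a LiDAR detection survives only if an RGB detection confirms it.
            kept_lidar.append(i)
            used_rgb.add(best_j)
    # Cascade fusion: unmatched RGB detections would seed new 3D frustum proposals.
    frustum_seeds = [j for j in range(len(rgb_boxes)) if j not in used_rgb]
    return kept_lidar, frustum_seeds


lidar_projected = [(0, 0, 10, 10), (50, 50, 60, 60)]
rgb_detections = [(1, 1, 10, 10), (100, 100, 120, 120)]
print(late_cascade_fusion(lidar_projected, rgb_detections))
```

In this toy scene the second LiDAR box has no RGB counterpart and is filtered out, while the second RGB box has no LiDAR counterpart and is handed on as a candidate for a frustum proposal.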
Details
Motivation: Accurate 3D object detection is essential for autonomous driving, but effectively fusing RGB camera and LiDAR data remains challenging, especially when good generalization is needed across different sensor configurations. Method: Proposes LCF3D, which uses late fusion to filter out unmatched LiDAR false positives and cascade fusion to recover missed objects by generating new 3D frustum proposals from unmatched RGB detections. Result: On KITTI and nuScenes, LCF3D clearly outperforms LiDAR-only methods on categories such as pedestrians, cyclists, motorcycles, and bicycles, and adapts well across domains. Conclusion: The multimodal fusion strategy improves the accuracy and robustness of 3D object detection, particularly in complex scenes and cross-domain settings. Abstract: Accurately localizing 3D objects like pedestrians, cyclists, and other vehicles is essential in Autonomous Driving. To ensure high detection performance, Autonomous Vehicles complement RGB cameras with LiDAR sensors, but effectively combining these data sources for 3D object detection remains challenging. We propose LCF3D, a novel sensor fusion framework that combines a 2D object detector on RGB images with a 3D object detector on LiDAR point clouds. By leveraging multimodal fusion principles, we compensate for inaccuracies in the LiDAR object detection network. Our solution combines two key principles: (i) late fusion, to reduce LiDAR False Positives by matching LiDAR 3D detections with RGB 2D detections and filtering out unmatched LiDAR detections; and (ii) cascade fusion, to recover missed objects from LiDAR by generating new 3D frustum proposals corresponding to unmatched RGB detections. Experiments show that LCF3D is beneficial for domain generalization, as it turns out to be successful in handling different sensor configurations between training and testing domains. LCF3D achieves significant improvements over LiDAR-based methods, particularly for challenging categories like pedestrians and cyclists in the KITTI dataset, as well as motorcycles and bicycles in nuScenes. Code can be downloaded from: https://github.com/CarloSgaravatti/LCF3D.
[81] Explainable Deep Learning for Pediatric Pneumonia Detection in Chest X-Ray Images
Adil O. Khadidos,Aziida Nanyonga,Alaa O. Khadidos,Olfat M. Mirza,Mustafa Tahsin Yilmaz
Main category: cs.CV
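TL;DR: This study compares two convolutional neural networks, DenseNet121 and EfficientNet-B0, for pediatric pneumonia detection, trained and evaluated on 5,863 chest X-ray images. EfficientNet-B0 performs better on accuracy, F1-score, and MCC, and the Grad-CAM and LIME explanations improve the interpretability and clinical credibility of the model's decisions.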
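The metrics this entry reports are all functions of a binary confusion matrix, and seeing them side by side helps explain how very high recall can coexist with moderate accuracy and MCC. The counts below are hypothetical and chosen only for illustration; they are not the study's confusion matrix.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F1, and Matthews correlation coefficient."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical counts: almost every pneumonia case is caught (1 miss in 100),
# but 30 of 100 normal images are flagged as pneumonia.
metrics = binary_metrics(tp=99, fp=30, fn=1, tn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With counts like these, recall stays at 0.99 while the false positives pull accuracy, F1, and especially MCC down, which is why MCC is a useful complement when one error type dominates.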
Details
Motivation: Pneumonia is a leading cause of morbidity and mortality among children worldwide, so efficient and accurate diagnostic support tools are urgently needed. Deep learning shows promise in medical image analysis, but performance differences between models still need systematic evaluation, and pediatric applications place higher demands on accuracy and interpretability. Method: Uses a public dataset of 5,863 pediatric chest X-ray images; after preprocessing (normalization, resizing, and data augmentation), DenseNet121 and EfficientNet-B0 are fine-tuned from ImageNet-pretrained weights and compared under identical training settings. Evaluation metrics are accuracy, F1-score, MCC, and recall, and Grad-CAM and LIME provide visual explanations of model predictions. Result: EfficientNet-B0 outperforms DenseNet121, with 84.6% accuracy, an F1-score of 0.8899, and an MCC of 0.6849, against 79.7%, 0.8597, and 0.5852 for DenseNet121. Both models have recall above 0.99, indicating high sensitivity. Grad-CAM and LIME visualizations show that the models attend to key lung regions, supporting the clinical plausibility of the predictions. Conclusion: EfficientNet-B0 strikes a better balance between performance and computational efficiency and is suited to automated pediatric pneumonia detection in clinical settings. Explainability techniques increase the transparency of AI-assisted diagnosis and clinicians' trust in it, with good prospects for practical use. Abstract: Background: Pneumonia remains a leading cause of morbidity and mortality among children worldwide, emphasizing the need for accurate and efficient diagnostic support tools. Deep learning has shown strong potential in medical image analysis, particularly for chest X-ray interpretation. This study compares two state-of-the-art convolutional neural network (CNN) architectures for automated pediatric pneumonia detection. Methods: A publicly available dataset of 5,863 pediatric chest X-ray images was used. Images were preprocessed through normalization, resizing, and data augmentation to enhance generalization. DenseNet121 and EfficientNet-B0 were fine-tuned using pretrained ImageNet weights under identical training settings. Performance was evaluated using accuracy, F1-score, Matthews Correlation Coefficient (MCC), and recall. Model explainability was incorporated using Gradient-weighted Class Activation Mapping (Grad-CAM) and Local Interpretable Model-agnostic Explanations (LIME) to visualize image regions influencing predictions. Results: EfficientNet-B0 outperformed DenseNet121, achieving an accuracy of 84.6%, F1-score of 0.8899, and MCC of 0.6849. DenseNet121 achieved 79.7% accuracy, an F1-score of 0.8597, and MCC of 0.5852. Both models demonstrated high recall values above 0.99, indicating strong sensitivity to pneumonia detection. Grad-CAM and LIME visualizations showed consistent focus on clinically relevant lung regions, supporting the reliability of model decisions. Conclusions: EfficientNet-B0 provided a more balanced and computationally efficient performance compared to DenseNet121, making it a strong candidate for clinical deployment. The integration of explainability techniques enhances transparency and trustworthiness in AI-assisted pediatric pneumonia diagnosis.
[82] NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration
Subhajit Sanyal,Srinivas Soumitri Miriyala,Akshay Janardan Bankar,Sravanth Kodavanti,Harshit,Abhishek Ameta,Shreyas Pandith,Amit Satish Unde
Main category: cs.CV
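TL;DR: This paper proposes NanoSD, a family of lightweight diffusion foundation models distilled from Stable Diffusion 1.5. Network surgery, feature-wise generative distillation, and structured scaling jointly optimize the U-Net and the VAE, enabling real-time, high-quality image restoration and generation on edge devices.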
Details
Motivation: Existing lightweight diffusion models disrupt the latent manifold during compression, which limits generalization, and they are hard to deploy efficiently on edge devices. The authors aim to build a full-pipeline lightweight diffusion model that preserves the generative prior while balancing accuracy, latency, and model size. Method: Proposes NanoSD, a full-pipeline co-design that combines network surgery, feature-wise generative distillation, and structured architectural scaling of the U-Net and the VAE, compressing the model while retaining the original generative prior. Result: NanoSD models range from 130M to 315M parameters, reach real-time inference as low as 20ms on mobile NPUs, and outperform existing lightweight models on super-resolution, deblurring, face restoration, and monocular depth estimation, combining high perceptual quality with deployment efficiency. Conclusion: NanoSD reaches Pareto-optimal trade-offs among accuracy, latency, and model size and is presented as a general-purpose, real-time diffusion foundation model family for edge devices; the results show that full-pipeline co-design is key to real deployment performance. Abstract: Latent diffusion models such as Stable Diffusion 1.5 offer strong generative priors that are highly valuable for image restoration, yet their full pipelines remain too computationally heavy for deployment on edge devices. Existing lightweight variants predominantly compress the denoising U-Net or reduce the diffusion trajectory, which disrupts the underlying latent manifold and limits generalization beyond a single task. We introduce NanoSD, a family of Pareto-optimal diffusion foundation models distilled from Stable Diffusion 1.5 through network surgery, feature-wise generative distillation, and structured architectural scaling jointly applied to the U-Net and the VAE encoder-decoder. This full-pipeline co-design preserves the generative prior while producing models that occupy distinct operating points along the accuracy-latency-size frontier (e.g., 130M-315M parameters, achieving real-time inference down to 20ms on mobile-class NPUs). We show that parameter reduction alone does not correlate with hardware efficiency, and we provide an analysis revealing how architectural balance, feature routing, and latent-space preservation jointly shape true on-device latency. When used as a drop-in backbone, NanoSD enables state-of-the-art performance across image super-resolution, image deblurring, face restoration, and monocular depth estimation, outperforming prior lightweight diffusion models in both perceptual quality and practical deployability. NanoSD establishes a general-purpose diffusion foundation model family suitable for real-time visual generation and restoration on edge devices.
[83] UniHash: Unifying Pointwise and Pairwise Hashing Paradigms for Seen and Unseen Category Retrieval
Xiaoxu Ma,Runhao Li,Hanwen Liu,Xiangbo Zhang,Zhenyu Weng
Main category: cs.CV
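TL;DR: This paper proposes Unified Hashing (UniHash), a dual-branch framework that combines the strengths of the pointwise and pairwise learning paradigms to achieve balanced image retrieval performance on seen and unseen categories.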
Details
Motivation: Existing deep hashing methods are usually confined to a single training paradigm (pointwise or pairwise) and struggle to perform well on seen and unseen categories at the same time; this paper aims to overcome that limitation. Method: UniHash has two branches, a center-based pointwise branch and a pairwise branch; a mutual learning loss and a Split-Merge Mixture of Hash Experts (SM-MoH) module enable bidirectional knowledge transfer and improve the discriminability and generalization of the hash codes. Result: Experiments on CIFAR-10, MSCOCO, and ImageNet show that UniHash achieves state-of-the-art performance on image retrieval for both seen and unseen categories. Conclusion: UniHash unifies the two learning paradigms and balances retrieval on known and novel categories, giving it broad applicability. Abstract: Effective retrieval across both seen and unseen categories is crucial for modern image retrieval systems. Retrieval on seen categories ensures precise recognition of known classes, while retrieval on unseen categories promotes generalization to novel classes with limited supervision. However, most existing deep hashing methods are confined to a single training paradigm, either pointwise or pairwise, where the former excels on seen categories and the latter generalizes better to unseen ones. To overcome this limitation, we propose Unified Hashing (UniHash), a dual-branch framework that unifies the strengths of both paradigms to achieve balanced retrieval performance across seen and unseen categories. UniHash consists of two complementary branches: a center-based branch following the pointwise paradigm and a pairwise branch following the pairwise paradigm. A novel hash code learning method is introduced to enable bidirectional knowledge transfer between branches, improving hash code discriminability and generalization. It employs a mutual learning loss to align hash representations and introduces a Split-Merge Mixture of Hash Experts (SM-MoH) module to enhance cross-branch exchange of hash representations. Theoretical analysis substantiates the effectiveness of UniHash, and extensive experiments on CIFAR-10, MSCOCO, and ImageNet demonstrate that UniHash consistently achieves state-of-the-art performance in both seen and unseen image retrieval scenarios.
[84] ViSIL: Unified Evaluation of Information Loss in Multimodal Video Captioning
Po-han Li,Shenghui Chen,Ufuk Topcu,Sandeep Chinchali
Main category: cs.CV
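TL;DR: Proposes the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies the information lost in multimodal summaries and enables unified evaluation across modality formats.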
Details
Motivation: Traditional metrics such as BLEU or ROUGE cannot measure information coverage across modalities, so they cannot assess how consistent the information is between text and a sequence of keyframes. Method: Builds the information-theoretic ViSIL framework on vision-language model (VLM) inference, quantifying summary quality by measuring how much of the video's information is lost in the summary. Result: ViSIL scores correlate significantly with human and VLM performance on Video Question Answering (VQA) tasks and can guide summary selection, improving VQA accuracy by 7% over text-only summaries without increasing processing load. Conclusion: ViSIL is a unified and effective metric for multimodal summaries that supports comparison across differently structured summaries and aids efficient, high-fidelity video retrieval and generation. Abstract: Multimodal video captioning condenses dense footage into a structured format of keyframes and natural language. By creating a cohesive multimodal summary, this approach anchors generative AI in rich semantic evidence and serves as a lightweight proxy for high-efficiency retrieval. However, traditional metrics like BLEU or ROUGE fail to quantify information coverage across disparate modalities, such as comparing a paragraph of text to a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies the video information not captured by a summary via vision-language model (VLM) inference. By measuring the information loss, ViSIL is a unified metric that enables direct comparison across multimodal summary formats despite their structural discrepancies. Our results demonstrate that ViSIL scores show a statistically significant correlation with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL also enables summary selection to optimize the trade-off between information loss and processing speed, establishing a Pareto-optimal frontier that outperforms text summaries by $7\%$ in VQA accuracy without increasing processing load.
[85] Breaking the Limits of Open-Weight CLIP: An Optimization Framework for Self-supervised Fine-tuning of CLIP
Anant Mehta,Xiyuan Wei,Xingyu Chen,Tianbao Yang
Main category: cs.CV
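TL;DR: This paper proposes TuneCLIP, a self-supervised fine-tuning framework that improves open-weight CLIP models across a range of downstream tasks; by recovering optimization statistics and using an improved contrastive loss, it substantially improves performance without retraining from scratch.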
Details
Motivation: Fine-tuning existing CLIP models easily degrades performance, and improving them usually means costly training from scratch on large amounts of data. The paper aims to improve the general performance of open-weight CLIP models using existing self-supervised datasets, without relying on large-scale labeled data. Method: Proposes the TuneCLIP framework with two key components: a warm-up stage that recovers optimization statistics to reduce cold-start bias, and a fine-tuning stage that uses a new contrastive loss to reduce the penalty on false negative pairs. Result: Experiments show that TuneCLIP consistently improves performance across model architectures and scales. For example, on leading open-weight models such as SigLIP (ViT-B/16), accuracy on ImageNet and its out-of-distribution benchmarks improves by up to 2.5%, and the DataComp benchmark improves by 1.2%. Conclusion: TuneCLIP provides a new baseline for efficient post-pretraining adaptation, showing that existing self-supervised data alone can strengthen the generalization of multimodal models. Abstract: CLIP has become a cornerstone of multimodal representation learning, yet improving its performance typically requires a prohibitively costly process of training from scratch on billions of samples. We ask a different question: Can we improve the performance of open-weight CLIP models across various downstream tasks using only existing self-supervised datasets? Unlike supervised fine-tuning, which adapts a pretrained model to a single downstream task, our setting seeks to improve general performance across various tasks. However, as both our experiments and prior studies reveal, simply applying standard training protocols starting from an open-weight CLIP model often fails, leading to performance degradation. In this paper, we introduce TuneCLIP, a self-supervised fine-tuning framework that overcomes the performance degradation. TuneCLIP has two key components: (1) a warm-up stage of recovering optimization statistics to reduce cold-start bias, inspired by theoretical analysis, and (2) a fine-tuning stage of optimizing a new contrastive loss to mitigate the penalization on false negative pairs. Our extensive experiments show that TuneCLIP consistently improves performance across model architectures and scales. Notably, it elevates leading open-weight models like SigLIP (ViT-B/16), achieving gains of up to +2.5% on ImageNet and related out-of-distribution benchmarks, and +1.2% on the highly competitive DataComp benchmark, setting a new strong baseline for efficient post-pretraining adaptation.
[86] VibrantSR: Sub-Meter Canopy Height Models from Sentinel-2 Using Generative Flow Matching
Kiarie Ndegwa,Andreas Gros,Tony Chang,David Diaz,Vincent A. Landau,Nathan E. Rutenbeck,Luke J. Zachmann,Guy Bayes,Scott Conway
Main category: cs.CV
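TL;DR: VibrantSR is a generative super-resolution framework based on Sentinel-2 imagery that produces 0.5 m resolution canopy height models from 10 m resolution imagery, outperforming existing satellite-based benchmarks across 22 eco-regions in the western United States.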
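The accuracy figures in this entry are Mean Absolute Errors restricted to canopy heights of at least 2 m. A minimal sketch of such a thresholded MAE follows. It assumes the threshold is applied to the reference heights (the entry does not say which side is thresholded); `masked_mae` and the sample values are hypothetical and not taken from the paper.

```python
from typing import Sequence


def masked_mae(pred: Sequence[float], ref: Sequence[float],
               min_height: float = 2.0) -> float:
    """Mean absolute error over pixels whose reference height is >= min_height.

    Assumption: the >= 2 m filter is applied to the reference canopy height.
    """
    errors = [abs(p - r) for p, r in zip(pred, ref) if r >= min_height]
    if not errors:
        raise ValueError("no reference pixels at or above min_height")
    return sum(errors) / len(errors)


# Made-up heights in meters; the first pixel is excluded (reference < 2 m).
predicted = [0.5, 3.0, 10.0, 21.0]
reference = [0.0, 5.0, 12.0, 20.0]
score = masked_mae(predicted, reference)
```

Filtering on the reference side keeps the evaluated pixel set identical across the methods being compared, which is one reason to prefer it over filtering on predictions.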
Details
Motivation: Canopy height modeling based on aerial imagery is limited by infrequent and irregular acquisition, which makes continuous monitoring difficult; VibrantSR aims to use globally available Sentinel-2 seasonal composites for frequent, large-area forest monitoring. Method: Proposes the VibrantSR framework, which uses generative super-resolution to convert 10 m Sentinel-2 imagery into 0.5 m canopy height models, evaluated across 22 eco-regions with spatially disjoint validation. Result: For canopy heights >= 2 m, VibrantSR achieves a mean absolute error of 4.39 m, outperforming the satellite-based benchmarks Meta (4.83 m), LANDFIRE (5.96 m), and ETH (7.05 m), although it remains less accurate than the aerial-based VibrantVS (2.71 m MAE). Conclusion: VibrantSR supports continental-scale forest monitoring and carbon accounting without relying on costly and temporally sparse aerial data, and has potential for operational use. Abstract: We present VibrantSR (Vibrant Super-Resolution), a generative super-resolution framework for estimating 0.5 meter canopy height models (CHMs) from 10 meter Sentinel-2 imagery. Unlike approaches based on aerial imagery that are constrained by infrequent and irregular acquisition schedules, VibrantSR leverages globally available Sentinel-2 seasonal composites, enabling consistent monitoring at a seasonal-to-annual cadence. Evaluated across 22 EPA Level 3 eco-regions in the western United States using spatially disjoint validation splits, VibrantSR achieves a Mean Absolute Error of 4.39 meters for canopy heights >= 2 m, outperforming Meta (4.83 m), LANDFIRE (5.96 m), and ETH (7.05 m) satellite-based benchmarks. While aerial-based VibrantVS (2.71 m MAE) retains an accuracy advantage, VibrantSR enables operational forest monitoring and carbon accounting at continental scales without reliance on costly and temporally infrequent aerial acquisitions.
[87] MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation
Yang Xing,Jiong Wu,Savas Ozdemir,Ying Zhang,Yang Yang,Wei Shao,Kuang Gong
Main category: cs.CV
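TL;DR: MedVL-SAM2 is a unified 3D medical multimodal model that supports report generation, visual question answering, and multiple segmentation tasks at once; by combining image-level reasoning with pixel-level perception, it achieves precise multi-granular spatial reasoning in 3D medical images.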
Details
Motivation: Existing medical vision-language models fall short in fine-grained visual grounding and 3D spatial reasoning, and struggle to unify multiple functions in one framework. A general 3D medical VLM that handles multiple tasks with precise spatial reasoning is therefore needed. Method: Proposes MedVL-SAM2, an architecture that fuses 3D visual features with radiology text embeddings and integrates a SAM2-based volumetric segmentation module; training is multi-stage, with pre-training on large-scale 3D CT image-text pairs followed by joint optimization on a comprehensive dataset with language-understanding and segmentation objectives. Result: The model reaches state-of-the-art performance on report generation, VQA, and multiple 3D segmentation tasks, supports flexible interaction via text, point, or box prompts, and delivers reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning. Conclusion: MedVL-SAM2 unifies high-level semantic reasoning and precise 3D localization in a single framework, showing that a 3D medical VLM can be both multi-functional and spatially precise. Abstract: Recent progress in medical vision-language models (VLMs) has achieved strong performance on image-level text-centric tasks such as report generation and visual question answering (VQA). However, achieving fine-grained visual grounding and volumetric spatial reasoning in 3D medical VLMs remains challenging, particularly when aiming to unify these capabilities within a single, generalizable framework. To address this challenge, we propose MedVL-SAM2, a unified 3D medical multimodal model that concurrently supports report generation, VQA, and multi-paradigm segmentation, including semantic, referring, and interactive segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging, and incorporates a SAM2-based volumetric segmentation module to enable precise multi-granular spatial reasoning. The model is trained in a multi-stage pipeline: it is first pre-trained on a large-scale corpus of 3D CT image-text pairs to align volumetric visual features with radiology-language embeddings. It is then jointly optimized with both language-understanding and segmentation objectives using a comprehensive 3D CT segmentation dataset. This joint training enables flexible interaction via language, point, or box prompts, thereby unifying high-level visual reasoning with spatially precise localization. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks.
Extensive analyses further show that the model provides reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning, demonstrating that high-level semantic reasoning and precise 3D localization can be jointly achieved within a unified 3D medical VLM.
[88] Transition Matching Distillation for Fast Video Generation
Weili Nie,Julius Berner,Nanye Ma,Chao Liu,Saining Xie,Arash Vahdat
Main category: cs.CV
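TL;DR: This paper proposes Transition Matching Distillation (TMD), a new framework for distilling video diffusion models into efficient few-step generators; by matching the multi-step denoising trajectory with a few-step probability transition process, it substantially speeds up generation while preserving quality.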
Details
Motivation: Large video diffusion and flow models generate high-quality video, but inefficient multi-step sampling makes them hard to use in real-time interactive applications, so an efficient distillation method is needed to balance generation speed and quality. Method: Proposes the TMD framework, which aligns the multi-step denoising path of a diffusion model with a few-step probability transition process in which each transition is modeled by a lightweight conditional flow; the original diffusion model is decomposed into a main backbone (extracting semantic representations) and a flow head (performing inner flow updates), with knowledge transferred via distribution matching distillation. Result: Experiments on the Wan2.1 1.3B and 14B text-to-video models show that TMD outperforms existing distilled models at comparable inference cost, balancing generation speed, visual fidelity, and prompt adherence. Conclusion: TMD offers an effective route to more efficient video diffusion models, achieving a good trade-off between generation quality and speed and improving their prospects for deployment in real-time applications. Abstract: Large video diffusion and flow models have achieved remarkable success in high-quality video generation, but their use in real-time interactive applications remains limited due to their inefficient multi-step sampling process. In this work, we present Transition Matching Distillation (TMD), a novel framework for distilling video diffusion models into efficient few-step generators. The central idea of TMD is to match the multi-step denoising trajectory of a diffusion model with a few-step probability transition process, where each transition is modeled as a lightweight conditional flow. To enable efficient distillation, we decompose the original diffusion backbone into two components: (1) a main backbone, comprising the majority of early layers, that extracts semantic representations at each outer transition step; and (2) a flow head, consisting of the last few layers, that leverages these representations to perform multiple inner flow updates. Given a pretrained video diffusion model, we first introduce a flow head to the model, and adapt it into a conditional flow map. We then apply distribution matching distillation to the student model with flow head rollout in each transition step. Extensive experiments on distilling Wan2.1 1.3B and 14B text-to-video models demonstrate that TMD provides a flexible and strong trade-off between generation speed and visual quality. In particular, TMD outperforms existing distilled models under comparable inference costs in terms of visual fidelity and prompt adherence.
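The backbone/flow-head split implies a nested sampling loop: one expensive backbone pass per outer transition, then several cheap flow-head updates that reuse its features. The toy sketch below illustrates only that compute pattern. It is not the authors' sampler: the Euler update, the time grid, and the names `few_step_sample`, `toy_backbone`, and `toy_head` are all assumptions made for illustration.

```python
from typing import Callable, List

Vector = List[float]


def few_step_sample(
    x: Vector,
    backbone: Callable[[Vector], Vector],
    flow_head: Callable[[Vector, Vector, float], Vector],
    outer_steps: int = 4,
    inner_steps: int = 2,
) -> Vector:
    """Run `outer_steps` transitions; each one reuses a single backbone pass."""
    dt = 1.0 / inner_steps
    for _ in range(outer_steps):
        features = backbone(x)  # expensive pass, once per outer transition
        for k in range(inner_steps):
            velocity = flow_head(x, features, k * dt)  # cheap pass, repeated
            x = [xi + dt * vi for xi, vi in zip(x, velocity)]  # Euler update (assumed)
    return x


calls = {"backbone": 0, "head": 0}


def toy_backbone(x: Vector) -> Vector:
    """Hypothetical stand-in for the main backbone; returns a fixed target."""
    calls["backbone"] += 1
    return [0.0 for _ in x]


def toy_head(x: Vector, features: Vector, t: float) -> Vector:
    """Hypothetical stand-in for the flow head; pulls the state toward the target."""
    calls["head"] += 1
    return [f - xi for f, xi in zip(features, x)]


sample = few_step_sample([1.0, -2.0], toy_backbone, toy_head)
```

With the defaults, the backbone runs 4 times and the head 8 times, which is the cost asymmetry the decomposition is meant to exploit.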
Project page: https://research.nvidia.com/labs/genair/tmd
[89] OT-Drive: Out-of-Distribution Off-Road Traversable Area Segmentation via Optimal Transport
Zhihua Zhao,Guoqiang Li,Chen Min,Kangping Lu
Main category: cs.CV
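TL;DR: This paper proposes OT-Drive, an optimal-transport-based multi-modal fusion framework that improves traversable area segmentation for autonomous driving in out-of-distribution (OOD) scenarios.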
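OT-Drive frames fusion as transporting feature mass onto a set of semantic anchors. The sketch below shows the general idea using entropic optimal transport solved by Sinkhorn iterations, which is a standard solver chosen here as an assumption; the entry does not state which OT formulation the authors use. The 1D "features", the anchors, and the function name `sinkhorn` are hypothetical.

```python
import math
from typing import List


def sinkhorn(cost: List[List[float]], a: List[float], b: List[float],
             eps: float = 0.1, iters: int = 200) -> List[List[float]]:
    """Entropic OT plan moving source weights `a` onto target weights `b`."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy setup: three scalar features, two scalar anchors, squared-distance cost.
features = [0.0, 0.1, 0.9]
anchors = [0.0, 1.0]
cost = [[(f - c) ** 2 for c in anchors] for f in features]
plan = sinkhorn(cost, a=[1 / 3] * 3, b=[0.5, 0.5])
```

Each row of `plan` says how one feature's mass is split across anchors, so a fused representation could be formed as a plan-weighted combination of anchors; that last step is also an assumption, not something the entry specifies.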
Details
Motivation: Existing data-driven methods suffer degraded segmentation performance in OOD scenarios, which impairs planning and decision-making in autonomous driving. Method: Designs a Scene Anchor Generator (SAG) and an optimal-transport-based multi-modal fusion module (OT Fusion) that maps RGB and surface normal features onto the manifold defined by semantic anchors. Result: Achieves 95.16% mIoU on ORFD OOD scenarios, surpassing prior methods by 6.35%, and 89.79% mIoU on cross-dataset transfer, surpassing baselines by 13.99%. Conclusion: OT-Drive retains strong OOD generalization with limited training data, improving practicality and efficiency for real-world deployment. Abstract: Reliable traversable area segmentation in unstructured environments is critical for planning and decision-making in autonomous driving. However, existing data-driven approaches often suffer from degraded segmentation performance in out-of-distribution (OOD) scenarios, consequently impairing downstream driving tasks. To address this issue, we propose OT-Drive, an Optimal Transport-driven multi-modal fusion framework. The proposed method formulates RGB and surface normal fusion as a distribution transport problem. Specifically, we design a novel Scene Anchor Generator (SAG) to decompose scene information into the joint distribution of weather, time-of-day, and road type, thereby constructing semantic anchors that can generalize to unseen scenarios. Subsequently, we design an innovative Optimal Transport-based multi-modal fusion module (OT Fusion) to transport RGB and surface normal features onto the manifold defined by the semantic anchors, enabling robust traversable area segmentation under OOD scenarios. Experimental results demonstrate that our method achieves 95.16% mIoU on ORFD OOD scenarios, outperforming prior methods by 6.35%, and 89.79% mIoU on cross-dataset transfer tasks, surpassing baselines by 13.99%. These results indicate that the proposed model can attain strong OOD generalization with only limited training data, substantially enhancing its practicality and efficiency for real-world deployment.
[90] The Spatial Blindspot of Vision-Language Models
Nahid Alam,Leema Krishna Murali,Siddhant Bharadwaj,Patrick Liu,Timothy Chung,Drishti Sharma,Akshata A,Kranthi Kiran,Wesley Tam,Bala Krishna S Vegesna
Main category: cs.CV
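TL;DR: This paper examines the weakness of vision-language models (VLMs) in understanding spatial relationships and proposes strengthening spatial awareness through alternative training objectives and 2D positional encodings.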
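To make the idea of a 2D positional encoding concrete, here is a minimal sketch of one common variant: a fixed sinusoidal encoding in which half of the channels encode the patch row and half encode the patch column. This specific construction is an assumption; the entry does not say which 2D encoding the authors evaluate. The function names `sincos_1d` and `sincos_2d` are hypothetical.

```python
import math
from typing import List


def sincos_1d(pos: int, dim: int) -> List[float]:
    """Sinusoidal encoding of one integer position into `dim` values (dim even)."""
    out: List[float] = []
    for k in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * k / dim))
        out += [math.sin(pos * freq), math.cos(pos * freq)]
    return out


def sincos_2d(height: int, width: int, dim: int) -> List[List[float]]:
    """One encoding per patch in row-major order.

    The first dim/2 channels depend only on the row, the rest only on the column.
    """
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4")
    half = dim // 2
    return [sincos_1d(r, half) + sincos_1d(c, half)
            for r in range(height) for c in range(width)]


# A 2 x 3 patch grid with 8-dimensional encodings.
encodings = sincos_2d(height=2, width=3, dim=8)
```

Unlike a 1D index over the flattened sequence, patches that are vertical neighbours share their column channels here, so grid adjacency stays visible to the model.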
Details
Motivation: Current VLMs typically use CLIP-style training that flattens images into 1D sequences, discarding the 2D structure that is essential for spatial reasoning and limiting their performance in applications that need spatial grounding, such as robotics and embodied AI. Method: Studies two improvement strategies: (i) image encoders trained with different objectives, and (ii) positional encodings that preserve 2D structure. Result: Experiments show that these architectural changes improve performance on several spatial reasoning benchmarks. Conclusion: Recovering and exploiting the 2D structure of images is key to improving the spatial understanding of VLMs and should be an important direction for future VLM design. Abstract: Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a blindspot. Current VLMs are typically built with contrastive language-image pretraining (CLIP) style image encoders. The training recipe often flattens images into 1D patch sequences, discarding the 2D structure necessary for spatial reasoning. We argue that this lack of spatial awareness is a missing dimension in VLM design and a bottleneck for applications requiring spatial grounding, such as robotics and embodied AI. To address this, we investigate (i) image encoders trained with alternative objectives and (ii) 2D positional encodings. Our experiments show that these architectural choices can lead to improved spatial reasoning on several benchmarks.
[91] DR$^2$Seg: Decomposed Two-Stage Rollouts for Efficient Reasoning Segmentation in Multimodal Large Language Models
Yulin He,Wei Chen,Zhikang Jian,Tianhang Guo,Wenjuan Zhou,Minglong Li
Main category: cs.CV
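TL;DR: Proposes DR$^2$Seg, a self-rewarding framework that needs no extra supervision and uses a two-stage rollout strategy to improve both efficiency and accuracy in reasoning segmentation.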
Details
Motivation: Under complex text queries, existing methods tend to produce redundant reasoning chains that interfere with object localization in multimodal large language models. Method: Uses a two-stage rollout strategy: the first stage generates a self-contained description that explicitly specifies the target object; the second stage replaces the original query with that description to verify its self-containment. Two self-reward mechanisms strengthen goal-oriented reasoning and suppress redundant thinking. Result: Experiments across multimodal large language models of different scales and across segmentation models show that DR$^2$Seg consistently improves reasoning efficiency and segmentation performance. Conclusion: DR$^2$Seg effectively mitigates overthinking and achieves more efficient and accurate reasoning segmentation without extra supervision. Abstract: Reasoning segmentation is an emerging vision-language task that requires reasoning over intricate text queries to precisely segment objects. However, existing methods typically suffer from overthinking, generating verbose reasoning chains that interfere with object localization in multimodal large language models (MLLMs). To address this issue, we propose DR$^2$Seg, a self-rewarding framework that improves both reasoning efficiency and segmentation accuracy without requiring extra thinking supervision. DR$^2$Seg employs a two-stage rollout strategy that decomposes reasoning segmentation into multimodal reasoning and referring segmentation. In the first stage, the model generates a self-contained description that explicitly specifies the target object. In the second stage, this description replaces the original complex query to verify its self-containment. Based on this design, two self-rewards are introduced to strengthen goal-oriented reasoning and suppress redundant thinking. Extensive experiments across MLLMs of varying scales and segmentation models demonstrate that DR$^2$Seg consistently improves reasoning efficiency and overall segmentation performance.
[92] DW-DGAT: Dynamically Weighted Dual Graph Attention Network for Neurodegenerative Disease Diagnosis
Chengjia Liang,Zhenjiong Wang,Chao Chen,Ruizhi Zhang,Songxi Liang,Hai Xie,Haijun Lei,Zhongwei Huang
Main category: cs.CV
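TL;DR: Proposes a dynamically weighted dual graph attention network (DW-DGAT) for early diagnosis of Parkinson's and Alzheimer's disease; it fuses multimodal data, extracts features with a dual graph structure, and mitigates class imbalance, performing strongly on PPMI and ADNI data.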
Details
Motivation: Early diagnosis of Parkinson's and Alzheimer's disease faces challenges from high-dimensional multimodal data fusion, heterogeneity, and class imbalance, so a more effective model is needed to improve diagnostic accuracy. Method: Proposes the DW-DGAT model, comprising a general-purpose data fusion strategy, a dual graph attention architecture based on brain regions and inter-sample relationships, and a class weight generation mechanism combined with stable loss functions to handle class imbalance. Result: Experiments on the PPMI and ADNI datasets show that the method achieves state-of-the-art performance in early diagnosis of neurodegenerative diseases. Conclusion: DW-DGAT effectively integrates multimodal neuroimaging and phenotypic data, improves early diagnostic accuracy for PD and AD, and has potential for clinical use. Abstract: Parkinson's disease (PD) and Alzheimer's disease (AD) are the two most prevalent and incurable neurodegenerative diseases (NDs) worldwide, for which early diagnosis is critical to delay their progression. However, the high dimensionality of multi-metric data with diverse structural forms, the heterogeneity of neuroimaging and phenotypic data, and class imbalance collectively pose significant challenges to early ND diagnosis. To address these challenges, we propose a dynamically weighted dual graph attention network (DW-DGAT) that integrates: (1) a general-purpose data fusion strategy to merge three structural forms of multi-metric data; (2) a dual graph attention architecture based on brain regions and inter-sample relationships to extract both micro- and macro-level features; and (3) a class weight generation mechanism combined with two stable and effective loss functions to mitigate class imbalance. Rigorous experiments, based on the Parkinson's Progression Markers Initiative (PPMI) and Alzheimer's Disease Neuroimaging Initiative (ADNI) studies, demonstrate the state-of-the-art performance of our approach.
[93] VERHallu: Evaluating and Mitigating Event Relation Hallucination in Video Large Language Models
Zefan Zhang,Kehua Zhu,Shijie Jiang,Hongyuan Lu,Shengkai Sun,Tian Bai
Main category: cs.CV
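TL;DR: This paper introduces VERHallu, a new benchmark for video event relation hallucination that focuses on causal, temporal, and subevent relations and includes relation classification, question answering, and counterfactual question answering tasks. It finds that existing VideoLLMs perform poorly on dense-event reasoning, often relying on prior knowledge while ignoring frame-level cues. To address this, it proposes a Key-Frame Propagating (KFP) strategy that reallocates frame-level attention in intermediate layers to strengthen multi-event understanding, mitigating hallucination without slowing inference.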
Details
Motivation: Existing research has overlooked hallucination about relations between events in videos, especially causal, temporal, and subevent relations, and lacks both a systematic evaluation benchmark and targeted solutions. Method: Builds the VERHallu benchmark covering causal, temporal, and subevent relations across three task types, with counterintuitive scenarios to expose model biases; proposes the Key-Frame Propagating (KFP) strategy, which reallocates frame-level attention in intermediate layers to improve understanding of multi-event relations. Result: Experiments show that current mainstream VideoLLMs perform poorly on dense event relation reasoning and are easily swayed by language priors; KFP effectively reduces event relation hallucination and improves understanding of complex event structure while preserving inference efficiency. Conclusion: Event relation hallucination is an important challenge for VideoLLMs; VERHallu provides a new benchmark for evaluating it, and KFP improves multi-event relation modeling through better attention allocation, contributing to more accurate video understanding. Abstract: Video Large Language Models (VideoLLMs) exhibit various types of hallucinations. Existing research has primarily focused on hallucinations involving the presence of events, objects, and scenes in videos, while largely neglecting event relation hallucination. In this paper, we introduce a novel benchmark for evaluating Video Event Relation Hallucination, named VERHallu. This benchmark focuses on causal, temporal, and subevent relations between events, encompassing three types of tasks: relation classification, question answering, and counterfactual question answering, for a comprehensive evaluation of event relation hallucination. Additionally, it features counterintuitive video scenarios that deviate from typical pretraining distributions, with each sample accompanied by human-annotated candidates covering both vision-language and pure language biases. Our analysis reveals that current state-of-the-art VideoLLMs struggle with dense-event relation reasoning, often relying on prior knowledge due to insufficient use of frame-level cues. Although these models demonstrate strong grounding capabilities for key events, they often overlook the surrounding subevents, leading to an incomplete and inaccurate understanding of event relations. To tackle this, we propose a Key-Frame Propagating (KFP) strategy, which reallocates frame-level attention within intermediate layers to enhance multi-event understanding. Experiments show it effectively mitigates the event relation hallucination without affecting inference speed.
[94] Disentangled Concept Representation for Text-to-image Person Re-identification
Giyeol Kim,Chanho Eom
Main category: cs.CV
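TL;DR: This paper proposes DiCo, a disentangled concept representation framework for text-to-image person re-identification (TIReID); through hierarchical, disentangled cross-modal alignment, it narrows the modality gap while enabling fine-grained matching, improving both performance and interpretability.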
Details
Motivation: Because of the substantial modality gap between visual appearance and textual description, and the need to model the fine-grained correspondences (such as color, texture, and style) that distinguish individuals, existing methods struggle with cross-modal person retrieval. Method: Proposes the DiCo framework, which introduces a shared slot-based representation in which each slot serves as a cross-modal part-level anchor and is further decomposed into multiple concept blocks, disentangling attributes such as color, texture, and shape while keeping part-level consistency between image and text. Result: Experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid show that DiCo is competitive with state-of-the-art methods while improving interpretability through explicit slot- and block-level representations. Conclusion: With hierarchical, disentangled cross-modal representations, DiCo narrows the semantic gap between text and images and enables finer-grained person retrieval, balancing performance and interpretability. Abstract: Text-to-image person re-identification (TIReID) aims to retrieve person images from a large gallery given free-form textual descriptions. TIReID is challenging due to the substantial modality gap between visual appearances and textual expressions, as well as the need to model fine-grained correspondences that distinguish individuals with similar attributes such as clothing color, texture, or outfit style. To address these issues, we propose DiCo (Disentangled Concept Representation), a novel framework that achieves hierarchical and disentangled cross-modal alignment. DiCo introduces a shared slot-based representation, where each slot acts as a part-level anchor across modalities and is further decomposed into multiple concept blocks. This design enables the disentanglement of complementary attributes (e.g., color, texture, shape) while maintaining consistent part-level correspondence between image and text. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid demonstrate that our framework achieves competitive performance with state-of-the-art methods, while also enhancing interpretability through explicit slot- and block-level representations for more fine-grained retrieval results.
[95] UEOF: A Benchmark Dataset for Underwater Event-Based Optical Flow
Nick Truong,Pritam P. Karmokar,William J. Beksi
Main category: cs.CV
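TL;DR: This paper presents the first synthetic underwater benchmark dataset for event-based optical flow estimation; generated with physically-based ray tracing, it includes dense ground-truth flow, depth, and camera motion, and is intended to advance underwater event-based perception algorithms.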
Details
Motivation: Research on event cameras underwater has been held back by the complex underwater optical environment and the lack of paired ground-truth optical flow, so a dataset combining realistic underwater optics with accurate flow annotations is needed. Method: Generates RGBD sequences with physically-based ray tracing, produces realistic event streams with a modern video-to-event pipeline, and provides dense ground-truth flow, depth, and camera motion to form an underwater event-based optical flow benchmark. Result: Builds the first synthetic underwater event-based optical flow benchmark and evaluates state-of-the-art learning-based and model-based flow prediction methods on it, revealing how underwater light transport affects event formation and motion estimation accuracy. Conclusion: The dataset provides a new benchmark for developing and evaluating underwater event-based perception algorithms and helps advance the use of event cameras underwater. Abstract: Underwater imaging is fundamentally challenging due to wavelength-dependent light attenuation, strong scattering from suspended particles, turbidity-induced blur, and non-uniform illumination. These effects impair standard cameras and make ground-truth motion nearly impossible to obtain. On the other hand, event cameras offer microsecond resolution and high dynamic range. Nonetheless, progress on investigating event cameras for underwater environments has been limited due to the lack of datasets that pair realistic underwater optics with accurate optical flow. To address this problem, we introduce the first synthetic underwater benchmark dataset for event-based optical flow derived from physically-based ray-traced RGBD sequences. Using a modern video-to-event pipeline applied to rendered underwater videos, we produce realistic event data streams with dense ground-truth flow, depth, and camera motion. Moreover, we benchmark state-of-the-art learning-based and model-based optical flow prediction methods to understand how underwater light transport affects event formation and motion estimation accuracy. Our dataset establishes a new baseline for future development and evaluation of underwater event-based perception algorithms. The source code and dataset for this project are publicly available at https://robotic-vision-lab.github.io/ueof.
[96] CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation
Chengzhuo Tong,Mingkun Chang,Shenglong Zhang,Yuran Wang,Cheng Liang,Zhizheng Zhao,Ruichuan An,Bohan Zeng,Yang Shi,Yifan Dai,Ziming Zhao,Guanbin Li,Pengfei Wan,Yuanxing Zhang,Wentao Zhang
Main category: cs.CV
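TL;DR: Proposes CoF-T2I, a model that brings Chain-of-Frame (CoF) reasoning from video generation into text-to-image (T2I) generation, improving output quality through progressive visual refinement.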
Details
Motivation: Explores the potential of CoF reasoning from video generation models for T2I generation, addressing the lack of a clearly defined reasoning starting point and interpretable intermediate states. Method: Proposes the CoF-T2I model, uses the CoF-Evol-Instruct dataset to model the generation process from semantics to aesthetics, and encodes each frame independently to reduce motion artifacts. Result: Reaches 0.86 on GenEval and 7.468 on Imagine-Bench, significantly outperforming the base model. Conclusion: CoF-T2I validates the effectiveness of integrating CoF reasoning into T2I generation and shows the substantial potential of video models for high-quality image generation. Abstract: Recent video generation models have revealed the emergence of Chain-of-Frame (CoF) reasoning, enabling frame-by-frame visual inference. With this capability, video models have been successfully applied to various visual tasks (e.g., maze solving, visual puzzles). However, their potential to enhance text-to-image (T2I) generation remains largely unexplored due to the absence of a clearly defined visual reasoning starting point and interpretable intermediate states in the T2I generation process. To bridge this gap, we propose CoF-T2I, a model that integrates CoF reasoning into T2I generation via progressive visual refinement, where intermediate frames act as explicit reasoning steps and the final frame is taken as output. To establish such an explicit generation process, we curate CoF-Evol-Instruct, a dataset of CoF trajectories that model the generation process from semantics to aesthetics. To further improve quality and avoid motion artifacts, we enable independent encoding operation for each frame. Experiments show that CoF-T2I significantly outperforms the base video model and achieves competitive performance on challenging benchmarks, reaching 0.86 on GenEval and 7.468 on Imagine-Bench. These results indicate the substantial promise of video models for advancing high-quality text-to-image generation.
[97] ReaMIL: Reasoning- and Evidence-Aware Multiple Instance Learning for Whole-Slide Histopathology
Hyun Do Jung,Jungwon Choi,Hwiyoung Kim
Main category: cs.CV
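TL;DR: ReaMIL is a multiple instance learning method for whole-slide histopathology images that adds a light selection head and a budgeted-sufficiency objective, achieving efficient, compact evidence selection while preserving performance.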
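The budgeted-sufficiency objective pairs a hinge term, which is zero once the true-class probability computed from the kept tiles reaches $\tau$, with a limit on how many tiles are kept. A minimal sketch of that shape follows. The hinge term matches the entry's description; expressing the budget as a soft penalty with weight `lam` is an assumption (the entry only says the loss operates under a sparsity budget), and the function name and default values are hypothetical.

```python
from typing import List


def budgeted_sufficiency_loss(p_true_kept: float, gates: List[float],
                              tau: float = 0.90, budget: float = 8.0,
                              lam: float = 0.1) -> float:
    """Hinge on true-class confidence plus an over-budget penalty.

    p_true_kept: true-class probability predicted from the kept tiles only.
    gates: soft per-tile gates in [0, 1]; their sum approximates the tile count.
    """
    sufficiency = max(0.0, tau - p_true_kept)       # zero once p_true >= tau
    over_budget = max(0.0, sum(gates) - budget)     # assumed soft budget term
    return sufficiency + lam * over_budget


# Confident prediction from 5 kept tiles: both terms vanish.
satisfied = budgeted_sufficiency_loss(0.95, [1.0] * 5)
# Under-confident prediction from 10 kept tiles: both terms are active.
violated = budgeted_sufficiency_loss(0.50, [1.0] * 10)
```

The two terms pull in opposite directions: the hinge rewards keeping enough evidence to stay confident, while the budget term discourages keeping more tiles than necessary.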
Details
Motivation: Existing MIL methods for whole-slide pathology lack fine control over the efficiency and interpretability of evidence selection; a method is needed that identifies a minimal sufficient evidence set without extra supervision. Method: Proposes ReaMIL, which adds a light selection head that produces soft per-tile gates and uses a hinge-loss-based budgeted-sufficiency objective that, under a sparsity constraint, keeps the true-class probability at or above a threshold $\tau$ using only the kept evidence. Result: On TCGA-NSCLC, TCGA-BRCA, and PANDA, ReaMIL matches or slightly improves baseline AUC; on NSCLC it reaches AUC 0.983 with a mean minimal sufficient K (MSK) of about 8.2 tiles and AUKC of about 0.864, yielding small, spatially compact evidence sets. Conclusion: ReaMIL achieves efficient, interpretable evidence selection without sacrificing performance, naturally produces slide-level overlays, and provides a more rigorous basis for evaluating model behavior in WSI analysis. Abstract: We introduce ReaMIL (Reasoning- and Evidence-Aware MIL), a multiple instance learning approach for whole-slide histopathology that adds a light selection head to a strong MIL backbone. The head produces soft per-tile gates and is trained with a budgeted-sufficiency objective: a hinge loss that enforces the true-class probability to be $\geq \tau$ using only the kept evidence, under a sparsity budget on the number of selected tiles. The budgeted-sufficiency objective yields small, spatially compact evidence sets without sacrificing baseline performance. Across TCGA-NSCLC (LUAD vs. LUSC), TCGA-BRCA (IDC vs. Others), and PANDA, ReaMIL matches or slightly improves baseline AUC and provides quantitative evidence-efficiency diagnostics. On NSCLC, it attains AUC 0.983 with a mean minimal sufficient K (MSK) $\approx 8.2$ tiles at $\tau = 0.90$ and AUKC $\approx 0.864$, showing that class confidence rises sharply and stabilizes once a small set of tiles is kept. The method requires no extra supervision, integrates seamlessly with standard MIL training, and naturally yields slide-level overlays. We report accuracy alongside MSK, AUKC, and contiguity for rigorous evaluation of model behavior on WSIs.
[98] Thinking Like Van Gogh: Structure-Aware Style Transfer via Flow-Guided 3D Gaussian Splatting
Zhendong Wang,Lebin Zhou,Jingchuan Xiao,Rongduo Han,Nam Ling,Cihan Ruan
Main category: cs.CV
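TL;DR: This paper proposes a flow-guided geometric advection framework for Post-Impressionist stylization in 3D Gaussian Splatting, emphasizing geometric abstraction over surface texture and using directional flow fields from 2D paintings to drive 3D structural deformation.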
Details
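Motivation: Most existing 3D style transfer methods treat geometry as a rigid substrate and only project surface texture, so they cannot faithfully reproduce the Post-Impressionist aesthetic of structural exaggeration and simplified detail. A new approach led by geometric abstraction is therefore needed. Method: Proposes a mesh-free 3D Gaussian Splatting framework that extracts directional flow fields from 2D paintings and back-propagates them into 3D space, guiding Gaussian primitives to form flow-aligned brushstrokes that conform to scene topology; a luminance-structure decoupling strategy separates geometric deformation from color optimization. Result: Achieves expressive 3D structural abstraction driven directly by painterly motion, avoiding the artifacts that conventional methods produce under aggressive geometric deformation; a VLM-as-a-Judge framework confirms improved artistic authenticity. Conclusion: The method brings the Post-Impressionist principle of "seeking exaggeration in the essential" into 3D stylization, establishes geometric abstraction as a viable primary means of expression, and opens a new path for art-driven 3D content creation. Abstract: In 1888, Vincent van Gogh wrote, "I am seeking exaggeration in the essential." This principle, amplifying structural form while suppressing photographic detail, lies at the core of Post-Impressionist art. However, most existing 3D style transfer methods invert this philosophy, treating geometry as a rigid substrate for surface-level texture projection. To authentically reproduce Post-Impressionist stylization, geometric abstraction must be embraced as the primary vehicle of expression. We propose a flow-guided geometric advection framework for 3D Gaussian Splatting (3DGS) that operationalizes this principle in a mesh-free setting. Our method extracts directional flow fields from 2D paintings and back-propagates them into 3D space, rectifying Gaussian primitives to form flow-aligned brushstrokes that conform to scene topology without relying on explicit mesh priors. This enables expressive structural deformation driven directly by painterly motion rather than photometric constraints.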
Our contributions are threefold: (1) a projection-based, mesh-free flow guidance mechanism that transfers 2D artistic motion into 3D Gaussian geometry; (2) a luminance-structure decoupling strategy that isolates geometric deformation from color optimization, mitigating artifacts during aggressive structural abstraction; and (3) a VLM-as-a-Judge evaluation framework that assesses artistic authenticity through aesthetic judgment instead of conventional pixel-level metrics, explicitly addressing the subjective nature of artistic stylization.
[99] Difficulty-guided Sampling: Bridging the Target Gap between Dataset Distillation and Downstream Tasks
Mingzhuo Li,Guang Li,Linfeng Ye,Jiafeng Mao,Takahiro Ogawa,Konstantinos N. Plataniotis,Miki Haseyama
Main category: cs.CV
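TL;DR: This paper proposes difficulty-guided sampling (DGS) to bridge the gap between the dataset distillation objective and downstream tasks, improving the performance of distilled datasets on tasks such as image classification.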
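DGS is described as a post-stage sampler: it takes an image pool produced by an existing distillation method and draws the final dataset so that it follows a target difficulty distribution. The sketch below shows one way this could work. It assumes each pooled image already has a difficulty score in [0, 1] and that the target is a histogram over equal-width bins; how difficulty is scored and how the target is chosen are not stated in the entry, and `difficulty_guided_sample` is a hypothetical name.

```python
import random
from typing import List, Sequence, Tuple


def difficulty_guided_sample(pool: Sequence[Tuple[str, float]],
                             target: Sequence[float], n: int,
                             seed: int = 0) -> List[str]:
    """Draw about `n` image ids so bin counts follow `target`.

    pool: (image_id, difficulty) pairs with difficulty in [0, 1] (assumed given).
    target: desired fraction of samples per equal-width difficulty bin.
    """
    rng = random.Random(seed)
    bins: List[List[str]] = [[] for _ in target]
    for image_id, difficulty in pool:
        index = min(int(difficulty * len(target)), len(target) - 1)
        bins[index].append(image_id)
    chosen: List[str] = []
    for members, fraction in zip(bins, target):
        k = min(round(fraction * n), len(members))  # cap at what the bin holds
        chosen += rng.sample(members, k)
    return chosen


# Hypothetical pool: 10 easy, 10 medium, and 10 hard images.
pool = ([(f"easy{i}", 0.1) for i in range(10)]
        + [(f"medium{i}", 0.5) for i in range(10)]
        + [(f"hard{i}", 0.9) for i in range(10)])
subset = difficulty_guided_sample(pool, target=[0.2, 0.3, 0.5], n=10)
```

Because the sampler only reads scores and ids, it can sit behind any distillation method that produces an image pool, which is what makes a plug-in design possible.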
Details
Motivation: Existing dataset distillation methods mostly focus on features of the original dataset and overlook task-specific information, leaving a target gap between the distillation objective and the actual task. Method: Introduces the concept of difficulty and proposes DGS as a plug-in post-stage sampling module, together with difficulty-aware guidance (DAG) that injects difficulty information into the generation process; the final distilled dataset is sampled from image pools generated by existing methods according to a target difficulty distribution. Result: Experiments across multiple settings confirm the effectiveness of DGS and DAG, substantially improving downstream performance and indicating that difficulty could be useful for other tasks as well. Conclusion: Bringing task-relevant difficulty information into dataset distillation narrows the target gap and improves the quality and usefulness of distilled datasets. Abstract: In this paper, we propose difficulty-guided sampling (DGS) to bridge the target gap between the distillation objective and the downstream task, thereby improving the performance of dataset distillation. Deep neural networks achieve remarkable performance but have time- and storage-consuming training processes. Dataset distillation is proposed to generate compact, high-quality distilled datasets, enabling effective model training while maintaining downstream performance. Existing approaches typically focus on features extracted from the original dataset, overlooking task-specific information, which leads to a target gap between the distillation objective and the downstream task. We propose incorporating characteristics that benefit downstream training into dataset distillation to bridge this gap. Focusing on the downstream task of image classification, we introduce the concept of difficulty and propose DGS as a plug-in post-stage sampling module. Following the specific target difficulty distribution, the final distilled dataset is sampled from image pools generated by existing methods. We also propose difficulty-aware guidance (DAG) to explore the effect of difficulty in the generation process. Extensive experiments across multiple settings demonstrate the effectiveness of the proposed methods. The results also highlight the broader potential of difficulty for diverse downstream tasks.
[100] V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation
Han Wang,Yi Yang,Jingyuan Hu,Minfeng Zhu,Wei Chen
Main category: cs.CV
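TL;DR: V-Zero is a self-improvement framework for vision-language models that requires no human annotation; through co-evolution of a Questioner and a Solver and unsupervised learning, it substantially improves visual mathematical and general visual reasoning on Qwen2.5-VL-7B-Instruct.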
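Two mechanics in this entry are easy to illustrate: the Solver's pseudo-label comes from a majority vote over its own sampled answers, and GRPO turns per-sample rewards into group-relative advantages. The sketch below assumes a binary reward (1 if an answer matches the pseudo-label) and normalization by the population standard deviation; both are assumptions, since the entry does not give the exact reward or normalization. The function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import List, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of samples agreeing with it."""
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one sampled group, as in GRPO-style training."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question; no human label is involved.
answers = ["12", "12", "15", "12"]
pseudo_label, agreement = majority_vote(answers)
rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]  # assumed reward
advantages = group_relative_advantages(rewards)
```

Answers that agree with the majority receive positive advantages and the dissenting answer a negative one, so the update pushes the Solver toward its own consensus.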
Details
Motivation: Existing vision-language models depend on large-scale human-annotated data, which is costly and time-consuming and limits their wider use. V-Zero aims to let the model improve itself using unlabeled images, reducing reliance on human annotation. Method: Proposes the V-Zero framework, in which a Questioner and a Solver form a co-evolutionary loop: the Questioner is rewarded for generating high-quality questions by contrasting intuitive guesses with reasoned results, and the Solver is optimized with pseudo-labels obtained by majority voting over its own responses; both are trained iteratively with Group Relative Policy Optimization (GRPO). Result: Without any human annotation, V-Zero improves Qwen2.5-VL-7B-Instruct by +1.7 on visual mathematical reasoning and +2.6 on general vision tasks, confirming the effectiveness of the self-improvement approach. Conclusion: V-Zero shows that vision-language models can keep improving using only unlabeled images, offering a low-cost, scalable path for multimodal systems. Abstract: Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated datasets, which are costly and time-consuming to acquire. To overcome this limitation, we introduce V-Zero, a general post-training framework that facilitates self-improvement using exclusively unlabeled images. V-Zero establishes a co-evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high-quality, challenging questions by leveraging a dual-track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo-labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a cycle of mutual enhancement. Remarkably, without a single human annotation, V-Zero achieves consistent performance gains on Qwen2.5-VL-7B-Instruct, improving visual mathematical reasoning by +1.7 and general vision-centric tasks by +2.6, demonstrating the potential of self-improvement in multimodal systems. Code is available at https://github.com/SatonoDia/V-Zero
[101] InfoSculpt: Sculpting the Latent Space for Generalized Category Discovery
Wenwen Liao,Hang Ruan,Jianbo Yu,Yuansong Wang,Qingchao Jiang,Xiaofeng Yang
Main category: cs.CV
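TL;DR: This paper proposes InfoSculpt, a generalized category discovery framework grounded in the Information Bottleneck principle that learns disentangled, robust representations at the category and instance levels through a dual conditional mutual information objective.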
Details
Motivation: Existing GCD methods rely on pseudo-labeling or two-stage clustering and lack a mechanism for separating essential category-defining features from noise. Method: Proposes the InfoSculpt framework, which combines category-level and instance-level conditional mutual information minimization, applied to known-class data and to all data respectively, to compress noise while preserving categorical information. Result: Experiments on 8 benchmark datasets show that the method outperforms existing approaches at discovering both known and novel categories. Conclusion: Information-theoretic representation learning effectively improves GCD performance, and the dual CMI objectives work together to build a more disentangled, robust feature space. Abstract: Generalized Category Discovery (GCD) aims to classify instances from both known and novel categories within a large-scale unlabeled dataset, a critical yet challenging task for real-world, open-world applications. However, existing methods often rely on pseudo-labeling or two-stage clustering, which lack a principled mechanism to explicitly disentangle essential, category-defining signals from instance-specific noise. In this paper, we address this fundamental limitation by re-framing GCD from an information-theoretic perspective, grounded in the Information Bottleneck (IB) principle. We introduce InfoSculpt, a novel framework that systematically sculpts the representation space by minimizing a dual Conditional Mutual Information (CMI) objective. InfoSculpt uniquely combines a Category-Level CMI on labeled data to learn compact and discriminative representations for known classes, and a complementary Instance-Level CMI on all data to distill invariant features by compressing augmentation-induced noise. These two objectives work synergistically at different scales to produce a disentangled and robust latent space where categorical information is preserved while noisy, instance-specific details are discarded. Extensive experiments on 8 benchmarks validate the effectiveness of our information-theoretic approach.
[102] FlowAct-R1: Towards Interactive Humanoid Video Generation
Lizhen Wang,Yongming Zhu,Zhipeng Ge,Youwei Zheng,Longhao Zhang,Tianshu Hu,Shiyang Qin,Mingshuang Luo,Jiaxu Zhang,Xin Chen,Yulong Wang,Zerong Zheng,Jianwen Jiang,Chao Liang,Weifeng Chen,Xing Wang,Yuan Zhang,Mingyuan Gao
Main category: cs.CV
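TL;DR: This paper proposes FlowAct-R1, a real-time interactive humanoid video generation framework built on the MMDiT architecture. A chunkwise diffusion forcing strategy and its self-forcing variant maintain long-term temporal consistency, enabling low-latency, high-frame-rate streaming video generation.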
Details
Motivation: Existing video generation methods trade off high fidelity against real-time interaction and struggle to meet the dual demand for responsiveness and visual quality in continuous interaction. Method: Builds on the MMDiT architecture and introduces a chunkwise diffusion forcing strategy with a self-forcing variant, combined with efficient distillation and system-level optimizations, to stream video of arbitrary duration with fine-grained full-body control. Result: The system runs at a stable 25fps at 480p resolution with a time-to-first-frame of about 1.5 seconds; experiments show strong behavioral vividness, perceptual realism, and generalization across characters. Conclusion: FlowAct-R1 effectively balances generation quality and real-time performance, offering a practical approach to building interactive virtual human agents. Abstract: Interactive humanoid video generation aims to synthesize lifelike visual agents that can engage with humans through continuous and responsive video. Despite recent advances in video synthesis, existing methods often grapple with the trade-off between high-fidelity synthesis and real-time interaction requirements. In this paper, we propose FlowAct-R1, a framework specifically designed for real-time interactive humanoid video generation. Built upon an MMDiT architecture, FlowAct-R1 enables the streaming synthesis of video with arbitrary durations while maintaining low-latency responsiveness. We introduce a chunkwise diffusion forcing strategy, complemented by a novel self-forcing variant, to alleviate error accumulation and ensure long-term temporal consistency during continuous interaction. By leveraging efficient distillation and system-level optimizations, our framework achieves a stable 25fps at 480p resolution with a time-to-first-frame (TTFF) of only around 1.5 seconds. The proposed method provides holistic and fine-grained full-body control, enabling the agent to transition naturally between diverse behavioral states in interactive scenarios. Experimental results demonstrate that FlowAct-R1 achieves exceptional behavioral vividness and perceptual realism, while maintaining robust generalization across diverse character styles.
[103] MathDoc: Benchmarking Structured Extraction and Active Refusal on Noisy Mathematics Exam Papers
Chenyue Zhou,Jiayi Tuo,Shitong Qin,Wei Dai,Mingxuan Wang,Ziwei Zhao,Duoyang Li,Shiyang Su,Yanxi Lu,Yanbiao Ma
Main category: cs.CV
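TL;DR: This paper introduces MathDoc, the first benchmark for document-level information extraction from authentic high school mathematics exam papers. It contains 3,609 questions with real-world noise, including unrecognizable samples, and evaluates model reliability under degraded document conditions, in particular the ability to refuse incomplete inputs.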
Details
Motivation: Existing datasets focus mainly on clean documents or generic layout analysis, overlooking the structural integrity of mathematical problems and the ability of models to actively refuse incomplete inputs, so they poorly reflect the challenges of real educational settings. Method: Builds MathDoc, a dataset of authentic exam papers, and proposes a multi-dimensional evaluation framework covering stem accuracy, visual similarity, and refusal capability; experiments are run on several state-of-the-art multimodal large language models. Result: End-to-end models extract well but generally fail to refuse unrecognizable inputs, often producing confident but invalid outputs. Conclusion: Current multimodal large language models have a critical weakness when handling low-quality documents, and MathDoc provides a new benchmark for assessing model reliability under degraded document conditions. Abstract: The automated extraction of structured questions from paper-based mathematics exams is fundamental to intelligent education, yet remains challenging in real-world settings due to severe visual noise. Existing benchmarks mainly focus on clean documents or generic layout analysis, overlooking both the structural integrity of mathematical problems and the ability of models to actively reject incomplete inputs. We introduce MathDoc, the first benchmark for document-level information extraction from authentic high school mathematics exam papers. MathDoc contains 3,609 carefully curated questions with real-world artifacts and explicitly includes unrecognizable samples to evaluate active refusal behavior. We propose a multi-dimensional evaluation framework covering stem accuracy, visual similarity, and refusal capability. Experiments on SOTA MLLMs, including Qwen3-VL and Gemini-2.5-Pro, show that although end-to-end models achieve strong extraction performance, they consistently fail to refuse illegible inputs, instead producing confident but invalid outputs. These results highlight a critical gap in current MLLMs and establish MathDoc as a benchmark for assessing model reliability under degraded document conditions. Our project repository is available at https://github.com/winnk123/papers/tree/master
[104] Enhancing Visual In-Context Learning by Multi-Faceted Fusion
Wenwen Liao,Jianbo Yu,Yuansong Wang,Qingchao Jiang,Xiaofeng Yang
Main category: cs.CV
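TL;DR: This paper proposes a new visual in-context learning framework whose multi-combination collaborative fusion strategy uses several high-quality prompts to build three complementary contextual representation branches, which feed the proposed MULTI-VQGAN architecture for more robust and accurate predictions.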
Details
Motivation: Existing "retrieve-then-prompt" methods typically use only the single best visual prompt or simply fuse the top-K prompts, ignoring other valuable information and limiting the model's reasoning capability. This work aims to exploit diverse contextual information more fully through a richer multi-faceted collaborative fusion mechanism. Method: Proposes a new framework that, rather than compressing multiple prompts into a single representation, generates three contextual representation branches, each integrating a different combination of high-quality prompts; these branches serve as complementary guidance signals for the newly designed MULTI-VQGAN architecture, which jointly interprets and uses collaborative information from multiple sources. Result: Experiments on foreground segmentation, single-object detection, and image colorization show strong cross-task generalization, effective contextual fusion, and more robust and accurate predictions than existing methods. Conclusion: Through multi-combination collaborative fusion and the MULTI-VQGAN architecture, the work makes better use of rich contextual information in visual in-context learning and significantly strengthens the model's reasoning and predictive performance. Abstract: Visual In-Context Learning (VICL) has emerged as a powerful paradigm, enabling models to perform novel visual tasks by learning from in-context examples. The dominant "retrieve-then-prompt" approach typically relies on selecting the single best visual prompt, a practice that often discards valuable contextual information from other suitable candidates. While recent work has explored fusing the top-K prompts into a single, enhanced representation, this still simply collapses multiple rich signals into one, limiting the model's reasoning capability. We argue that a more multi-faceted, collaborative fusion is required to unlock the full potential of these diverse contexts. To address this limitation, we introduce a novel framework that moves beyond single-prompt fusion towards a multi-combination collaborative fusion. Instead of collapsing multiple prompts into one, our method generates three contextual representation branches, each formed by integrating information from different combinations of top-quality prompts. These complementary guidance signals are then fed into the proposed MULTI-VQGAN architecture, which is designed to jointly interpret and utilize collaborative information from multiple sources. Extensive experiments on diverse tasks, including foreground segmentation, single-object detection, and image colorization, highlight its strong cross-task generalization, effective contextual fusion, and ability to produce more robust and accurate predictions than existing methods.
[105] Beyond Single Prompts: Synergistic Fusion and Arrangement for VICL
Wenwen Liao,Jianbo Yu,Yuansong Wang,Shifu Yan,Xiaofeng Yang
Main category: cs.CV
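TL;DR: Proposes an end-to-end visual in-context learning (VICL) framework that fuses multiple prompts and exploits arrangement information to improve how inpainting models adapt from a few prompts.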
Details
Motivation: Existing VICL methods have two problems: selecting only the most similar prompt discards complementary information from other high-quality prompts, and the structural information implied by different prompt arrangements goes unused. Method: (1) An adaptive fusion module aggregates key patterns and annotations from multiple prompts to produce more precise contextual prompts; (2) arrangement-specific lightweight MLPs decouple layout priors from the core model; (3) a bidirectional fine-tuning mechanism swaps the roles of query and prompt to strengthen collaboration between the fusion module and the inpainting model. Result: Experiments on foreground segmentation, single-object detection, and image colorization show superior performance and strong cross-task generalization. Conclusion: The proposed framework addresses key shortcomings of existing VICL methods, improving the use of multi-prompt information and the modeling of structural information. Abstract: Vision In-Context Learning (VICL) enables inpainting models to quickly adapt to new visual tasks from only a few prompts. However, existing methods suffer from two key issues: (1) selecting only the most similar prompt discards complementary cues from other high-quality prompts; and (2) failing to exploit the structured information implied by different prompt arrangements. We propose an end-to-end VICL framework to overcome these limitations. Firstly, an adaptive Fusion Module aggregates critical patterns and annotations from multiple prompts to form more precise contextual prompts. Secondly, we introduce arrangement-specific lightweight MLPs to decouple layout priors from the core model, while minimally affecting the overall model. In addition, a bidirectional fine-tuning mechanism swaps the roles of query and prompt, encouraging the model to reconstruct the original prompt from fused context and thus enhancing collaboration between the fusion module and the inpainting model. Experiments on foreground segmentation, single-object detection, and image colorization demonstrate superior results and strong cross-task generalization of our method.
[106] VQ-Seg: Vector-Quantized Token Perturbation for Semi-Supervised Medical Image Segmentation
Sicheng Yang,Zhaohu Xing,Lei Zhu
Main category: cs.CV
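TL;DR: This paper proposes VQ-Seg, the first semi-supervised medical image segmentation method to discretize the feature space with vector quantization (VQ) and replace dropout with a controllable Quantized Perturbation Module (QPM). A dual-branch architecture and a Post-VQ Feature Adapter (PFA) mitigate information loss and incorporate semantic guidance from a foundation model, yielding superior performance on a lung cancer dataset and public benchmarks.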
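The core idea of the perturbation in this entry (quantize features to codebook indices, then shuffle where those indices sit spatially) can be sketched in a few lines. This is a toy illustration under assumptions, not the paper's QPM: the `ratio` knob that controls how many positions are permuted, the tiny codebook, and the flattened feature list are all hypothetical.

```python
import random


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
            for f in features]


def shuffle_indices(indices, ratio, rng):
    """Permute the positions of a random subset of indices.

    Only locations change; the multiset of codebook indices is preserved.
    """
    n = len(indices)
    chosen = rng.sample(range(n), k=int(round(ratio * n)))
    permuted = chosen[:]
    rng.shuffle(permuted)
    out = list(indices)
    for src, dst in zip(chosen, permuted):
        out[dst] = indices[src]
    return out


# Hypothetical 2-D codebook and a flattened 6-position feature map.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
features = [[0.1, 0.0], [0.9, 1.1], [0.1, 0.8],
            [1.0, 0.9], [0.0, 0.1], [0.2, 1.0]]

indices = quantize(features, codebook)
perturbed = shuffle_indices(indices, ratio=0.5, rng=random.Random(0))
unchanged = shuffle_indices(indices, ratio=0.0, rng=random.Random(0))
```

Unlike dropout, the strength of this perturbation is set by an explicit fraction of positions, and every perturbed token is still a valid codebook entry.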
Details
Motivation: Existing dropout-based feature perturbation methods depend on manually tuning the dropout rate, a sensitive hyperparameter that is hard to optimize and can lead to poor regularization. Method: Proposes VQ-Seg, which discretizes the feature space with vector quantization and introduces a Quantized Perturbation Module (QPM) that applies controllable perturbation by shuffling the spatial locations of codebook indices; a dual-branch architecture jointly performs image reconstruction and segmentation to preserve information, and a Post-VQ Feature Adapter (PFA) incorporates high-level semantic guidance from a foundation model. Result: Experiments on a newly collected large-scale lung cancer CT dataset (828 scans) and several public benchmarks show that the method outperforms current state-of-the-art semi-supervised segmentation methods. Conclusion: By introducing a controllable, vector-quantization-based perturbation mechanism, VQ-Seg effectively replaces the conventional dropout strategy, reducing hyperparameter dependence while improving semi-supervised medical image segmentation. Abstract: Consistency learning with feature perturbation is a widely used strategy in semi-supervised medical image segmentation. However, many existing perturbation methods rely on dropout and thus require careful manual tuning of the dropout rate, a sensitive hyperparameter that is often difficult to optimize and may lead to suboptimal regularization. To overcome this limitation, we propose VQ-Seg, the first approach to employ vector quantization (VQ) to discretize the feature space and introduce a novel and controllable Quantized Perturbation Module (QPM) that replaces dropout. Our QPM perturbs discrete representations by shuffling the spatial locations of codebook indices, enabling effective and controllable regularization. To mitigate potential information loss caused by quantization, we design a dual-branch architecture where the post-quantization feature space is shared by both image reconstruction and segmentation tasks. Moreover, we introduce a Post-VQ Feature Adapter (PFA) to incorporate guidance from a foundation model (FM), supplementing the high-level semantic information lost during quantization. Furthermore, we collect a large-scale Lung Cancer (LC) dataset comprising 828 CT scans annotated for central-type lung carcinoma. Extensive experiments on the LC dataset and other public benchmarks demonstrate the effectiveness of our method, which outperforms state-of-the-art approaches. Code available at: https://github.com/script-Yang/VQ-Seg.
[107] LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
Linquan Wu,Tianxiang Jiang,Yifei Dong,Haoyu Yang,Fengji Zhang,Shichaang Meng,Ai Xuan,Linqi Song,Jacky Keung
Main category: cs.CV
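TL;DR: LaViT is a new framework that has the student model autoregressively reconstruct the teacher's visual semantics and attention trajectories, bridging the perception gap in multimodal reasoning and significantly improving visual grounding.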
Details
Motivation: Existing methods for multimodal reasoning rely on external supervision and ignore intrinsic visual attention dynamics, so a student model may imitate the teacher's output text while attending to different visual regions, relying on language priors rather than genuine perception. Method: Proposes the LaViT framework, which forces the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories before generating text, and introduces a curriculum sensory gating mechanism to prevent shortcut learning. Result: Experiments show gains of up to +16.9% on complex reasoning tasks, with a compact 3B model outperforming larger open-source models and proprietary models such as GPT-4o. Conclusion: LaViT effectively aligns latent visual thoughts, improving visually grounded multimodal reasoning and reducing dependence on language priors. Abstract: Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.
[108] Advancing Adaptive Multi-Stage Video Anomaly Reasoning: A Benchmark Dataset and Method
Chao Huang,Benfeng Wang,Wei Wang,Jie Wen,Li Shen,Wenqi Ren,Yong Xu,Xiaochun Cao
Main category: cs.CV
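TL;DR: This paper proposes a new task, Video Anomaly Reasoning (VAR), which uses multi-stage structured reasoning to improve multimodal large models on video anomaly detection and understanding. It contributes a new dataset, an annotation framework based on a perception-cognition-action chain, and a method that improves reasoning reliability under weak supervision.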
Details
Motivation: Existing MLLM-based video anomaly detection methods are limited to localization or post-hoc description and lack explicit reasoning, risk awareness, and decision-oriented interpretation, which falls short of practical needs. Method: Defines the Video Anomaly Reasoning (VAR) task and builds a new dataset of 8,641 videos and more than 50,000 samples annotated with PerCoAct-CoT chain-of-thought; designs the Anomaly-Aware Group Relative Policy Optimization algorithm and develops Vad-R1-Plus, an end-to-end MLLM that supports adaptive hierarchical reasoning. Result: Experiments show that the model significantly outperforms open-source and proprietary baselines on VAR, improving MLLM capabilities in anomaly reasoning, causal analysis, and risk-aware decision making. Conclusion: The VAR task and the PerCoAct-CoT framework provide a deeper reasoning paradigm for video anomaly understanding and advance MLLM-based decision making in safety-sensitive scenarios. Abstract: Recent progress in reasoning capabilities of Multimodal Large Language Models (MLLMs) has highlighted their potential for performing complex video understanding tasks. However, in the domain of Video Anomaly Detection and Understanding (VAD&U), existing MLLM-based methods are largely limited to anomaly localization or post-hoc description, lacking explicit reasoning processes, risk awareness, and decision-oriented interpretation. To address this gap, we define a new task termed Video Anomaly Reasoning (VAR), which elevates video anomaly analysis from descriptive understanding to structured, multi-stage reasoning. VAR explicitly requires models to perform progressive reasoning over anomalous events before answering anomaly-related questions, encompassing visual perception, causal interpretation, and risk-aware decision making. To support this task, we present a new dataset with 8,641 videos, where each video is annotated with diverse question types corresponding to different reasoning depths, totaling more than 50,000 samples, making it one of the largest datasets for video anomaly. The annotations are based on a structured Perception-Cognition-Action Chain-of-Thought (PerCoAct-CoT), which formalizes domain-specific reasoning priors for video anomaly understanding. This design enables systematic evaluation of multi-stage and adaptive anomaly reasoning. In addition, we propose Anomaly-Aware Group Relative Policy Optimization to further enhance reasoning reliability under weak supervision. Building upon the proposed task and dataset, we develop an end-to-end MLLM-based VAR model termed Vad-R1-Plus, which supports adaptive hierarchical reasoning and risk-aware decision making. Extensive experiments demonstrate that the proposed benchmark and method effectively advance the reasoning capabilities of MLLMs on VAR tasks, outperforming both open-source and proprietary baselines.
[109] RAG-3DSG: Enhancing 3D Scene Graphs with Re-Shot Guided Retrieval-Augmented Generation
Yue Chang,Rufeng Chen,Zhaofan Zhang,Yi Chen,Sihong Xie
Main category: cs.CV
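TL;DR: Proposes RAG-3DSG, which improves the accuracy and efficiency of open-vocabulary 3D scene graph generation through re-shot guided uncertainty estimation and retrieval-augmented generation.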
Details
Motivation: Existing open-vocabulary 3D scene graph generation methods are limited in object recognition accuracy and speed by constrained viewpoints, occlusions, and redundant point cloud density. Method: Introduces re-shot guided uncertainty estimation to reduce aggregation noise, uses Retrieval-Augmented Generation (RAG) to support object-level generation, and designs a dynamic downsample-mapping strategy to accelerate cross-image object aggregation. Result: Experiments on the Replica dataset show that RAG-3DSG significantly improves node captioning accuracy while cutting mapping time by two-thirds. Conclusion: RAG-3DSG effectively improves the accuracy and efficiency of open-vocabulary 3D scene graph generation and suits manipulation and navigation tasks in robotics. Abstract: Open-vocabulary 3D Scene Graph (3DSG) generation can enhance various downstream tasks in robotics, such as manipulation and navigation, by leveraging structured semantic representations. A 3DSG is constructed from multiple images of a scene, where objects are represented as nodes and relationships as edges. However, existing works for open-vocabulary 3DSG generation suffer from low object-level recognition accuracy and low speed, mainly due to constrained viewpoints, occlusions, and redundant surface density. To address these challenges, we propose RAG-3DSG to mitigate aggregation noise through re-shot guided uncertainty estimation and support object-level Retrieval-Augmented Generation (RAG) via reliable low-uncertainty objects. Furthermore, we propose a dynamic downsample-mapping strategy to accelerate cross-image object aggregation with adaptive granularity. Experiments on the Replica dataset demonstrate that RAG-3DSG significantly improves node captioning accuracy in 3DSG generation while reducing the mapping time by two-thirds compared to the vanilla version.
[110] From Physical Degradation Models to Task-Aware All-in-One Image Restoration
Hu Gao,Xiaoning Lei,Xichen Xu,Xingjian Wang,Lizhuang Ma
Main category: cs.CV
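TL;DR: This paper proposes OPIR, an efficient all-in-one image restoration framework that predicts a task-aware inverse degradation operator from a physical degradation model and uses an uncertainty perception map to guide two-stage restoration, combining strong performance with high efficiency.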
Details
Motivation: Existing all-in-one image restoration methods depend on complex modules that increase system complexity and hinder real-time use; this work starts from physical degradation modeling to design a lightweight, efficient unified restoration framework. Method: Proposes a two-stage restoration framework: in the first stage, the predicted inverse degradation operator produces an initial restored image and an uncertainty perception map; in the second stage, the map guides refinement. The same inverse operator prediction network is used in both stages, with task-aware parameters adapting it to different degradation tasks, and the convolution is accelerated for efficiency. Result: OPIR performs strongly across a range of all-in-one restoration tasks while remaining competitive on specific single tasks, and experiments confirm its efficiency and reliability. Conclusion: Through physical degradation modeling and inverse operator prediction, OPIR achieves simple, efficient, high-performing all-in-one image restoration with good potential for practical use. Abstract: All-in-one image restoration aims to adaptively handle multiple restoration tasks with a single trained model. Although existing methods achieve promising results by introducing prompt information or leveraging large models, the added learning modules increase system complexity and hinder real-time applicability. In this paper, we adopt a physical degradation modeling perspective and predict a task-aware inverse degradation operator for efficient all-in-one image restoration. The framework consists of two stages. In the first stage, the predicted inverse operator produces an initial restored image together with an uncertainty perception map that highlights regions difficult to reconstruct, ensuring restoration reliability. In the second stage, the restoration is further refined under the guidance of this uncertainty map. The same inverse operator prediction network is used in both stages, with task-aware parameters introduced after operator prediction to adapt to different degradation tasks. Moreover, by accelerating the convolution of the inverse operator, the proposed method achieves efficient all-in-one image restoration. The resulting tightly integrated architecture, termed OPIR, is extensively validated through experiments, demonstrating superior all-in-one restoration performance while remaining highly competitive on task-aligned restoration.
[111] ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and TEst-time Generative Adaptation
Kim Youwang,Lee Hyoseok,Subin Park,Gerard Pons-Moll,Tae-Hyun Oh
Main category: cs.CV
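TL;DR: ELITE is an efficient method for synthesizing Gaussian head avatars from monocular video. It combines the strengths of 3D data priors and 2D generative priors, using fast initialization and test-time generative adaptation to produce high-quality animatable avatars that generalize well.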
Details
Motivation: Existing methods for building head avatars from monocular video either generalize poorly or are computationally heavy, and are prone to identity hallucination. ELITE aims to combine the strengths of 3D data priors and 2D generative priors to improve both quality and efficiency. Method: Proposes the Mesh2Gaussian Prior Model (MGPM) for fast initialization of a Gaussian avatar and designs a test-time generative adaptation stage supervised by both real and synthetic images; introduces a rendering-guided single-step diffusion enhancer that restores missing details based on Gaussian avatar renderings. Result: Experiments show that ELITE produces visually superior results to prior methods even for challenging expressions, while synthesizing 60x faster than the 2D generative prior method. Conclusion: By combining the two priors with efficient initialization and test-time optimization, ELITE keeps fidelity high while substantially improving synthesis efficiency and in-the-wild generalization. Abstract: We introduce ELITE, an Efficient Gaussian head avatar synthesis method from a monocular video via Learned Initialization and TEst-time generative adaptation. Prior works rely either on a 3D data prior or a 2D generative prior to compensate for missing visual cues in monocular videos. However, 3D data prior methods often struggle to generalize in-the-wild, while 2D generative prior methods are computationally heavy and prone to identity hallucination. We identify a complementary synergy between these two priors and design an efficient system that achieves high-fidelity animatable avatar synthesis with strong in-the-wild generalization. Specifically, we introduce a feed-forward Mesh2Gaussian Prior Model (MGPM) that enables fast initialization of a Gaussian avatar. To further bridge the domain gap at test time, we design a test-time generative adaptation stage, leveraging both real and synthetic images as supervision. Unlike previous full diffusion denoising strategies that are slow and hallucination-prone, we propose a rendering-guided single-step diffusion enhancer that restores missing visual details, grounded on Gaussian avatar renderings. Our experiments demonstrate that ELITE produces visually superior avatars to prior works, even for challenging expressions, while achieving 60x faster synthesis than the 2D generative prior method.
[112] Beyond Inpainting: Unleash 3D Understanding for Precise Camera-Controlled Video Generation
Dong-Yu Chen,Yixin Guo,Shuojin Yang,Tai-Jiang Mu,Shi-Min Hu
Main category: cs.CV
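TL;DR: This paper proposes DepthDirector, a video re-rendering framework that uses depth from an explicit 3D representation as camera-control guidance, achieving precise camera control and consistent content generation for dynamic scenes under novel camera trajectories.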
Details
Motivation: Existing methods struggle to keep video content consistent while precisely controlling the camera trajectory, fail to fully exploit the 3D priors in video diffusion models (VDMs), and tend to fall into the Inpainting Trap, causing subject inconsistency and degraded generation quality. Method: Designs a View-Content Dual-Stream Condition mechanism that injects the source video and the warped depth sequence rendered under the target viewpoint into a pretrained video generation model, trains with a lightweight LoRA-based video diffusion adapter, and builds MultiCam-WarpData, a large-scale multi-camera synchronized dataset, for experiments. Result: Experiments show that DepthDirector outperforms existing methods in both camera controllability and visual quality, controlling camera motion more accurately and generating high-quality, content-consistent video. Conclusion: DepthDirector combines explicit 3D geometric guidance with the 3D understanding of video diffusion models, resolving the tension between precise camera control and content fidelity and offering a new approach to conditioned video generation. Abstract: Camera control has been extensively studied in conditioned video generation; however, precisely altering the camera trajectory while faithfully preserving the video content remains a challenging task. The mainstream approach to achieving precise camera control is warping a 3D representation according to the target trajectory. However, such methods fail to fully leverage the 3D priors of video diffusion models (VDMs) and often fall into the Inpainting Trap, resulting in subject inconsistency and degraded generation quality. To address this problem, we propose DepthDirector, a video re-rendering framework with precise camera controllability. By leveraging the depth video from explicit 3D representation as camera-control guidance, our method can faithfully reproduce the dynamic scene of an input video under novel camera trajectories. Specifically, we design a View-Content Dual-Stream Condition mechanism that injects both the source video and the warped depth sequence rendered under the target viewpoint into the pretrained video generation model. This geometric guidance signal enables VDMs to comprehend camera movements and leverage their 3D understanding capabilities, thereby facilitating precise camera control and consistent content generation. Next, we introduce a lightweight LoRA-based video diffusion adapter to train our framework, fully preserving the knowledge priors of VDMs. Additionally, we construct a large-scale multi-camera synchronized dataset named MultiCam-WarpData using Unreal Engine 5, containing 8K videos across 1K dynamic scenes. Extensive experiments show that DepthDirector outperforms existing methods in both camera controllability and visual quality. Our code and dataset will be publicly available.
[113] Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge
Sicheng Yang,Yukai Huang,Shitong Sun,Weitong Cai,Jiankang Deng,Jifei Song,Zhensong Zhang
Main category: cs.CV
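TL;DR: Proposes a framework that integrates query/option pre-processing, domain-specific fine-tuning, and temporal chain-of-thought prompting, significantly improving multimodal large models on complex video question answering and reaching 41.6% accuracy on HD-EPIC VQA.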
Details
Motivation: MLLMs face ambiguous queries, difficulty with long-range temporal reasoning, and non-standardized outputs on complex video QA tasks, and need systematic optimization to perform better. Method: Combines query/option pre-processing, domain-specific fine-tuning of Qwen2.5-VL, a novel Temporal Chain-of-Thought (T-CoT) prompting mechanism, and robust post-processing. Result: Reaches 41.6% accuracy on the HD-EPIC VQA benchmark, outperforming existing methods. Conclusion: Complex video understanding requires coordinated optimization of the whole pipeline; isolated improvements are not enough. Abstract: Multimodal Large Language Models (MLLMs) struggle with complex video QA benchmarks like HD-EPIC VQA due to ambiguous queries/options, poor long-range temporal reasoning, and non-standardized outputs. We propose a framework integrating query/choice pre-processing, domain-specific Qwen2.5-VL fine-tuning, a novel Temporal Chain-of-Thought (T-CoT) prompting for multi-step reasoning, and robust post-processing. This system achieves 41.6% accuracy on HD-EPIC VQA, highlighting the need for holistic pipeline optimization in demanding video understanding. Our code and fine-tuned models are available at https://github.com/YoungSeng/Egocentric-Co-Pilot.
[114] Attend to what I say: Highlighting relevant content on slides
Megha Mariam K M,C. V. Jawahar
Main category: cs.CV
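TL;DR: Proposes a method that automatically identifies and highlights the key regions of a slide based on the speaker's narrative, improving the synchronization of auditory and visual information and the comprehension of complex presentations.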
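Matching a spoken sentence to slide regions, as this entry describes, can be illustrated with a deliberately simple baseline. The sketch below assumes each region already has extracted text and scores it by Jaccard token overlap with the transcript sentence; the entry does not say which matching model the authors use, so the scoring function, region names, and example texts are all hypothetical.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_region(sentence, regions):
    """Return (region_id, score) for the region whose text best overlaps the sentence."""
    spoken = tokens(sentence)

    def jaccard(region_text):
        words = tokens(region_text)
        union = spoken | words
        return len(spoken & words) / len(union) if union else 0.0

    scored = {rid: jaccard(text) for rid, text in regions.items()}
    rid = max(scored, key=scored.get)
    return rid, scored[rid]


# Hypothetical slide with three text regions.
regions = {
    "title": "Training setup",
    "bullet_1": "We fine-tune the encoder with a small learning rate",
    "figure_caption": "Accuracy versus number of training steps",
}
region_id, score = best_region(
    "the encoder is fine-tuned using a small learning rate", regions
)
```

A lexical baseline like this cannot handle paraphrase or purely graphical regions, which is presumably where learned text and image matching becomes necessary.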
Details
Motivation: In fast-paced or content-heavy talks, listeners struggle to process the spoken narrative and the slide visuals at the same time, which raises cognitive load and hinders comprehension. Method: Analyzes the speaker's spoken content and matches it with textual or graphical elements in the slides to automatically locate and highlight the most relevant slide regions. Result: The method improves the synchronization of audio and visual information and reduces cognitive load; it is validated across several multimedia document scenarios, and the success and failure cases of different solutions are examined. Conclusion: The method helps improve the comprehension of multimedia content such as educational videos and conference talks and advances multimodal content analysis. Abstract: Imagine sitting in a presentation, trying to follow the speaker while simultaneously scanning the slides for relevant information. While the entire slide is visible, identifying the relevant regions can be challenging. As you focus on one part of the slide, the speaker moves on to a new sentence, leaving you scrambling to catch up visually. This constant back-and-forth creates a disconnect between what is being said and the most important visual elements, making it hard to absorb key details, especially in fast-paced or content-heavy presentations such as conference talks. This requires an understanding of slides, including text, graphics, and layout. We introduce a method that automatically identifies and highlights the most relevant slide regions based on the speaker's narrative. By analyzing spoken content and matching it with textual or graphical elements in the slides, our approach ensures better synchronization between what listeners hear and what they need to attend to. We explore different ways of solving this problem and assess their success and failure cases. Analyzing multimedia documents is emerging as a key requirement for seamless understanding of content-rich videos, such as educational videos and conference talks, by reducing cognitive strain and improving comprehension. Code and dataset are available at: https://github.com/meghamariamkm2002/Slide_Highlight
[115] DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset
Hengyu Shen,Tiancheng Gu,Bin Qin,Lan Wu,Yuling Wu,Shuo Tan,Zelong Sun,Jun Wang,Nan Wu,Xiang An,Weidong Cai,Ziyong Feng,Kaicheng Yang
Main category: cs.CV
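TL;DR: This paper presents DanQing, a high-quality cross-modal dataset of 100 million Chinese image-text pairs. Built from 2024-2025 web data with a more rigorous filtering pipeline, it significantly improves Chinese vision-language pre-training models on downstream tasks.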
Details
Motivation: For lack of high-quality Chinese image-text data, Chinese vision-language pre-training has lagged behind its English counterpart, so a large-scale, high-quality Chinese cross-modal dataset is needed to advance the field. Method: Develops a complete data construction pipeline that collects Chinese image-text pairs from Common Crawl and applies rigorous filtering to ensure quality; the data comes mainly from 2024-2025 web content to capture the latest semantic trends. Result: Continual pre-training experiments with the SigLIP2 model show that DanQing outperforms existing datasets on Chinese zero-shot classification, cross-modal retrieval, and LMM-based evaluation. Conclusion: DanQing is a high-quality, up-to-date Chinese image-text dataset that can effectively advance Chinese vision-language pre-training, and it will be open-sourced to support further research. Abstract: Vision-Language Pre-training (VLP) models demonstrate strong performance across various downstream tasks by learning from large-scale image-text pairs through contrastive pretraining. The release of extensive English image-text datasets (e.g., COYO-700M and LAION-400M) has enabled widespread adoption of models such as CLIP and SigLIP in tasks including cross-modal retrieval and image captioning. However, the advancement of Chinese vision-language pretraining has substantially lagged behind, due to the scarcity of high-quality Chinese image-text data. To address this gap, we develop a comprehensive pipeline for constructing a high-quality Chinese cross-modal dataset. As a result, we propose DanQing, which contains 100 million image-text pairs collected from Common Crawl. Different from existing datasets, DanQing is curated through a more rigorous selection process, yielding superior data quality. Moreover, DanQing is primarily built from 2024-2025 web data, enabling models to better capture evolving semantic trends and thus offering greater practical utility. We compare DanQing with existing datasets by continual pre-training of the SigLIP2 model. Experimental results show that DanQing consistently achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations. To facilitate further research in Chinese vision-language pre-training, we will open-source the DanQing dataset under the Creative Commons CC-BY 4.0 license.
[116] Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models
Peng-Fei Zhang,Zi Huang
Main category: cs.CV
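TL;DR: Proposes Hierarchical Refinement Attack (HRA), a multimodal universal attack framework that improves the efficiency and effectiveness of adversarial attacks on vision-language pre-training (VLP) models.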
Details
Motivation: Existing adversarial attacks on VLP models are mostly sample-specific, incurring heavy computational overhead on large datasets or in new scenarios. Method: HRA refines universal adversarial perturbations (UAPs) at both the sample level and the optimization level. For images, it separates clean images from perturbations and introduces the ScMix augmentation strategy to diversify visual contexts, and it refines the optimization path using a temporal hierarchy of historical and future gradients. For text, it combines intra-sentence and inter-sentence importance measures to identify globally influential words, which serve as universal text perturbations. Result: Extensive experiments across downstream tasks, VLP models, and datasets demonstrate the superior performance of the proposed universal multimodal attack. Conclusion: HRA significantly improves the generalization and attack efficiency of adversarial perturbations, overcoming the limitations of existing methods. Abstract: Existing adversarial attacks for VLP models are mostly sample-specific, resulting in substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. HRA refines universal adversarial perturbations (UAPs) at both the sample level and the optimization level. For the image modality, we disentangle adversarial examples into clean images and perturbations, allowing each component to be handled independently for more effective disruption of cross-modal alignment. We further introduce a ScMix augmentation strategy that diversifies visual contexts and strengthens both global and local utility of UAPs, thereby reducing reliance on spurious features. In addition, we refine the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients to avoid local minima and stabilize universal perturbation learning. For the text modality, HRA identifies globally influential words by combining intra-sentence and inter-sentence importance measures, and subsequently utilizes these words as universal text perturbations. Extensive experiments across various downstream tasks, VLP models, and datasets demonstrate the superiority of the proposed universal multimodal attacks.
[117] ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding
Xueyun Tian,Wei Li,Bingbing Xu,Heng Dong,Yuanzhuo Wang,Huawei Shen
Main category: cs.CV
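TL;DR: ROMA is a real-time omni-multimodal assistant that processes continuous audio, video, and text input in a unified way. Synchronized multimodal units and a lightweight speak head enable both reactive and proactive interaction, and the model performs strongly across 12 benchmarks.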
Details
Motivation: Existing omni-modal models for streaming audio-video understanding either support modalities incompletely or lack autonomous proactive monitoring, so a unified framework supporting both reactive and proactive interaction is needed. Method: ROMA processes continuous input as synchronized multimodal units, aligning dense audio with discrete video frames to resolve granularity mismatches; a lightweight speak head decouples response triggering from generation; training uses a curated streaming dataset and a two-stage curriculum, and existing benchmarks are reorganized into a unified evaluation suite covering proactive and reactive tasks. Result: Experiments on 12 benchmarks show that ROMA reaches state-of-the-art performance on proactive tasks (such as alert and narration) and is competitive on reactive tasks (such as QA). Conclusion: ROMA delivers strong real-time omni-multimodal understanding with unified reactive and proactive interaction, confirming its effectiveness and robustness in complex streaming multimodal scenarios. Abstract: Recent Omni-multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate that ROMA achieves state-of-the-art performance on proactive tasks while remaining competitive in reactive settings, validating its robustness in unified real-time omni-multimodal understanding.
[118] SRAW-Attack: Space-Reweighted Adversarial Warping Attack for SAR Target Recognition
Yiming Zhang,Weibo Qin,Yuntian Liu,Feng Wang
Main category: cs.CV
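TL;DR: This paper proposes Space-Reweighted Adversarial Warping (SRAW), a new attack that optimizes spatial deformation and reweights the perturbation budget between foreground and background regions, producing stealthier and highly transferable adversarial examples that reduce the robustness of deep neural networks in synthetic aperture radar automatic target recognition (SAR-ATR).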
Details
Motivation: Because SAR images are inherently information-sparse and DNN models over-rely on background regions, existing SAR-ATR systems are vulnerable to adversarial attacks; current attacks, however, often need visibly perceptible perturbations and lack stealth, so an attack that balances effectiveness and imperceptibility is needed. Method: SRAW generates adversarial examples through spatial deformation and reweights the perturbation budget according to the importance of foreground and background regions, raising attack success rate and cross-model transferability while keeping visibility low. Result: Experiments show that SRAW significantly lowers the recognition accuracy of advanced SAR-ATR models and outperforms existing attacks in both imperceptibility and adversarial transferability. Conclusion: By treating regions differently when perturbing, SRAW improves the efficiency and stealth of adversarial attacks in the SAR domain, exposes the fragility of current SAR-ATR models, and offers a reference for designing defenses. Abstract: Synthetic aperture radar (SAR) imagery exhibits intrinsic information sparsity due to its unique electromagnetic scattering mechanism. Despite the widespread adoption of deep neural network (DNN)-based SAR automatic target recognition (SAR-ATR) systems, they remain vulnerable to adversarial examples and tend to over-rely on background regions, leading to degraded adversarial robustness. Existing adversarial attacks for SAR-ATR often require visually perceptible distortions to achieve effective performance, thereby necessitating an attack method that balances effectiveness and stealthiness. In this paper, a novel attack method termed Space-Reweighted Adversarial Warping (SRAW) is proposed, which generates adversarial examples through optimized spatial deformation with reweighted budgets across foreground and background regions. Extensive experiments demonstrate that SRAW significantly degrades the performance of state-of-the-art SAR-ATR models and consistently outperforms existing methods in terms of imperceptibility and adversarial transferability. Code is made available at https://github.com/boremycin/SAR-ATR-TransAttack.
[119] Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders
Siqi Kou,Jiachun Jin,Zetong Zhou,Ye Ma,Yugang Wang,Quan Chen,Peng Jiang,Xiao Yang,Jun Zhu,Kai Yu,Zhijie Deng
Main category: cs.CV
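TL;DR: This paper proposes the think-then-generate (T2G) paradigm, which uses the reasoning ability of large language models to rewrite text prompts, improving the factual consistency, semantic alignment, and visual realism of text-to-image diffusion models.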
Details
Motivation: Most existing text-to-image diffusion models act as text-to-pixel mappers and do not fully use the reasoning ability of large language models to decide what should be depicted. Method: Lightweight supervised fine-tuning activates the LLM's think-then-rewrite pattern, and Dual-GRPO then co-optimizes the language model and the diffusion model: the language model is reinforced with image-grounded rewards to strengthen reasoning, while the diffusion model is pushed toward semantic consistency and visual coherence. Result: Factual consistency, semantic alignment, and visual realism improve markedly on several reasoning-based image generation and editing benchmarks, with a WISE score of 0.79, close to GPT-4. Conclusion: The T2G paradigm advances next-generation unified models with reasoning, expression, and demonstration capacities, and is an important step toward more intelligent image generation systems. Abstract: Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders, remain text-pixel mappers: they employ LLMs merely as text encoders, without leveraging their inherent reasoning capabilities to infer what should be visually depicted given the textual prompt. To move beyond such literal generation, we propose the think-then-generate (T2G) paradigm, where the LLM-based text encoder is encouraged to reason about and rewrite raw user prompts; the states of the rewritten prompts then serve as diffusion conditioning. To achieve this, we first activate the think-then-rewrite pattern of the LLM encoder with a lightweight supervised fine-tuning process. Subsequently, the LLM encoder and diffusion backbone are co-optimized to ensure faithful reasoning about the context and accurate rendering of the semantics via Dual-GRPO. In particular, the text encoder is reinforced using image-grounded rewards to infer and recall world knowledge, while the diffusion backbone is pushed to produce semantically consistent and visually coherent images. Experiments show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks, achieving 0.79 on WISE score, nearly on par with GPT-4. Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.
[120] An analytic theory of convolutional neural network inverse problems solvers
Minh Hai Nguyen,Quoc Bao Do,Edouard Pauwels,Pierre Weiss
Main category: cs.CV
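TL;DR: Proposes a Local-Equivariant Minimum Mean Square Error (LE-MMSE) estimator that builds in the translation equivariance and local receptive fields of CNNs, giving a theoretical analysis of supervised convolutional neural networks on imaging inverse problems and showing close agreement with the outputs of trained networks.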
Details
Motivation: Supervised CNNs perform very well on imaging inverse problems but lack a theoretical explanation and are treated as black boxes; an interpretable theoretical framework is needed to understand how they work. Method: Starting from the Minimum Mean Square Error (MMSE) estimator, the authors add translation-equivariance and locality constraints, derive an analytic expression for the LE-MMSE, and validate it experimentally across tasks, datasets, and network architectures. Result: LE-MMSE predictions closely match the outputs of trained networks (PSNR ≳ 25 dB), and the analysis exposes the differences between physics-aware and physics-agnostic estimators as well as the influence of factors such as the density of the training distribution. Conclusion: The theory offers an interpretable framework for understanding the success of CNNs on imaging inverse problems, indicating that their behaviour can be explained as statistically optimal estimation constrained by inductive biases. Abstract: Supervised convolutional neural networks (CNNs) are widely used to solve imaging inverse problems, achieving state-of-the-art performance in numerous applications. However, despite their empirical success, these methods are poorly understood from a theoretical perspective and often treated as black boxes. To bridge this gap, we analyze trained neural networks through the lens of the Minimum Mean Square Error (MMSE) estimator, incorporating functional constraints that capture two fundamental inductive biases of CNNs: translation equivariance and locality via finite receptive fields. Under the empirical training distribution, we derive an analytic, interpretable, and tractable formula for this constrained variant, termed Local-Equivariant MMSE (LE-MMSE). Through extensive numerical experiments across various inverse problems (denoising, inpainting, deconvolution), datasets (FFHQ, CIFAR-10, FashionMNIST), and architectures (U-Net, ResNet, PatchMLP), we demonstrate that our theory matches the neural networks' outputs (PSNR ≳ 25 dB). Furthermore, we provide insights into the differences between physics-aware and physics-agnostic estimators, the impact of high-density regions in the training (patch) distribution, and the influence of other factors (dataset size, patch size, etc.).
[121] Fine-Grained Human Pose Editing Assessment via Layer-Selective MLLMs
Ningyu Sun,Zhaolin Cai,Zitong Xu,Peihang Chen,Huiyu Duan,Yichao Yan,Xiongkuo Min,Xiaokang Yang
Main category: cs.CV
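TL;DR: This paper proposes HPE-Bench, a benchmark dedicated to text-guided human pose editing, together with a unified evaluation framework based on layer-selective multimodal large language models; using contrastive LoRA tuning and a new layer sensitivity analysis mechanism, it achieves superior performance on authenticity detection and multi-dimensional quality assessment.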
Details
Motivation: Existing evaluation methods for pose editing separate authenticity detection from quality assessment and lack fine-grained analysis of pose-specific inconsistencies, so a more comprehensive and precise evaluation scheme is needed. Method: The authors build HPE-Bench, a benchmark of 1,700 samples, and propose a framework based on layer-selective multimodal large language models that uses contrastive LoRA tuning and layer sensitivity analysis (LSA) to identify the best feature layer for evaluation. Result: The framework performs strongly on both authenticity detection and multi-dimensional quality regression, and markedly improves the consistency and granularity of pose-editing evaluation. Conclusion: The method bridges the gap between forensic detection and quality assessment and provides a reliable, unified evaluation standard for text-guided pose editing. Abstract: Text-guided human pose editing has gained significant traction in AIGC applications. However, it remains plagued by structural anomalies and generative artifacts. Existing evaluation metrics often isolate authenticity detection from quality assessment, failing to provide fine-grained insights into pose-specific inconsistencies. To address these limitations, we introduce HPE-Bench, a specialized benchmark comprising 1,700 standardized samples from 17 state-of-the-art editing models, offering both authenticity labels and multi-dimensional quality scores. Furthermore, we propose a unified framework based on layer-selective multimodal large language models (MLLMs). By employing contrastive LoRA tuning and a novel layer sensitivity analysis (LSA) mechanism, we identify the optimal feature layer for pose evaluation. Our framework achieves superior performance in both authenticity detection and multi-dimensional quality regression, effectively bridging the gap between forensic detection and quality assessment.
[122] Towards Efficient Low-rate Image Compression with Frequency-aware Diffusion Prior Refinement
Yichong Xia,Yimin Zhou,Jinpeng Wang,Bin Chen
Main category: cs.CV
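TL;DR: This paper proposes DiffCR, a new diffusion-based image compression framework that uses frequency-aware skip estimation and consistency prior refinement to achieve efficient, high-quality image reconstruction at low bit rates.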
Details
Motivation: Existing diffusion-based image compression methods suffer from slow sampling and suboptimal bit allocation, mainly because of fragmented training paradigms. Method: The DiffCR framework includes a Frequency-aware Skip Estimation (FaSE) module driven by Frequency Decoupling Attention (FDA), which aligns compressed latents with the ε-prediction prior of a pre-trained diffusion model at different timesteps, and a lightweight consistency estimator that enables fast two-step decoding. Result: Without updating the backbone diffusion model, DiffCR achieves a 27.2% BD-rate reduction (LPIPS) and a 65.1% BD-rate reduction (PSNR) relative to state-of-the-art diffusion-based compression methods, with more than a 10× decoding speed-up. Conclusion: Through consistency prior refinement and an efficient decoding mechanism, DiffCR markedly improves compression efficiency and decoding speed while maintaining high reconstruction quality, offering an effective route to practical low-bitrate image compression. Abstract: Recent advancements in diffusion-based generative priors have enabled visually plausible image compression at extremely low bit rates. However, existing approaches suffer from slow sampling processes and suboptimal bit allocation due to fragmented training paradigms. In this work, we propose Accelerate Diffusion-based Image Compression via Consistency Prior Refinement (DiffCR), a novel compression framework for efficient and high-fidelity image reconstruction. At the heart of DiffCR is a Frequency-aware Skip Estimation (FaSE) module that refines the ε-prediction prior from a pre-trained latent diffusion model and aligns it with compressed latents at different timesteps via Frequency Decoupling Attention (FDA). Furthermore, a lightweight consistency estimator enables fast two-step decoding by preserving the semantic trajectory of diffusion sampling. Without updating the backbone diffusion model, DiffCR achieves substantial bitrate savings (27.2% BD-rate (LPIPS) and 65.1% BD-rate (PSNR)) and over 10× speed-up compared to SOTA diffusion-based compression baselines.
[123] Global Context Compression with Interleaved Vision-Text Transformation
Dian Jiao,Jiaxin Duan,Shuai Zhao,Jiabing Leng,Yiran Zhang,Feng Huang
Main category: cs.CV
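TL;DR: This paper proposes VIST2, a new Transformer that renders text chunks as sketch images and uses visual encoding for global context compression, reducing the token count in both the prefilling and inference stages and markedly improving speed while cutting memory and compute on long-text generation tasks.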
Details
Motivation: Progress of vision-language models on end-to-end OCR has inspired methods that compress textual information through visual encoding, but earlier methods compress tokens only partially, during prefilling, and save no computation or memory during token-by-token inference. A global context compression method that reduces tokens in both the prefilling and inference stages is therefore needed. Method: VIST2 splits the input text into chunks and interleaves them with their corresponding sketch images; within the pre-context the model relies exclusively on visual tokens to predict the next text-token distribution. Training is multi-stage, with curriculum-scheduled pretraining for optical language modeling followed by modal-interleaved instruction tuning. Result: Experiments on VIST2 models from 0.6B to 8B parameters reach a 4× compression ratio, with on average 3× faster first-token generation, 77% lower memory usage, and 74% fewer FLOPS, clearly outperforming baselines on long writing tasks. Conclusion: Through global context compression, VIST2 reduces tokens efficiently in both inference and prefilling, offering a low-loss, high-efficiency approach to long-sequence processing in Transformer architectures. Abstract: Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This motivates earlier works that render the Transformer's input into images for prefilling, which effectively reduces the number of tokens through visual encoding, thereby alleviating the quadratically increased Attention computations. However, this partial compression fails to save computational or memory costs at token-by-token inference. In this paper, we investigate global context compression, which saves tokens at both prefilling and inference stages. Consequently, we propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encoding, while depending exclusively on visual tokens in the pre-context to predict the next text token distribution. Around this idea, we render text chunks into sketch images and train VIST2 in multiple stages, starting from curriculum-scheduled pretraining for optical language modeling, followed by modal-interleaved instruction tuning. We conduct extensive experiments using VIST2 families scaled from 0.6B to 8B to explore the training recipe and hyperparameters. With a 4× compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks, achieving, on average, a 3× speedup in first-token generation, 77% reduction in memory usage, and 74% reduction in FLOPS.
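As a rough illustration of why a shorter context helps, full self-attention scores every query against every key, so the number of scores grows with the square of the sequence length. The sketch below is illustrative arithmetic only: the function names and the 8,192-token prompt are assumptions, not figures from the paper, and the reported 74% FLOPS reduction covers the whole model rather than attention alone.

```python
def attention_pairs(num_tokens: int) -> int:
    """Query-key score computations in one full self-attention pass."""
    return num_tokens * num_tokens


def compressed_length(num_tokens: int, ratio: int) -> int:
    """Context length after compressing by `ratio`, rounded up."""
    return -(-num_tokens // ratio)  # ceiling division


text_tokens = 8192  # hypothetical prompt length, not from the paper
ratio = 4           # compression ratio quoted in the abstract
visual_tokens = compressed_length(text_tokens, ratio)

# Ratio of attention scores before and after compression.
saving = attention_pairs(text_tokens) / attention_pairs(visual_tokens)
print(visual_tokens, saving)
```

Under these assumptions a 4× shorter context needs 16× fewer attention scores, which is why compressing the context at every decoding step, and not only at prefilling, matters for long outputs.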
Our code and datasets will be made public to support further studies.
[124] Handling Missing Modalities in Multimodal Survival Prediction for Non-Small Cell Lung Cancer
Filippo Ruffini,Camillo Maria Caruso,Claudia Tacconi,Lorenzo Nibid,Francesca Miccolis,Marta Lovino,Carlo Greco,Edy Ippolito,Michele Fiore,Alessio Cortellini,Bruno Beomonte Zobel,Giuseppe Perrone,Bruno Vincenzi,Claudio Marrocco,Alessandro Bria,Elisa Ficarra,Sara Ramella,Valerio Guarrasi,Paolo Soda
Main category: cs.CV
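TL;DR: Proposes a multimodal survival prediction framework that handles missing modalities, combining CT, whole-slide histopathology images (WSI), and clinical variables to model overall survival in patients with unresectable stage II-III non-small cell lung cancer; it extracts features with foundation models and uses an intermediate fusion strategy that is robust to missing modalities, and experiments show that fusing WSI with clinical data works best (C-index 73.30).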
Details
Motivation: Conventional multimodal deep learning for survival prediction is limited by small cohorts and missing modalities, often requiring incomplete samples to be dropped or aggressively imputed, which hurts generalization and clinical applicability. This work aims to build a survival prediction model that makes full use of incomplete multimodal data and is robust to missing modalities. Method: Foundation models (FM) extract modality-specific features from CT, WSI, and clinical variables, and a missing-aware encoding mechanism enables intermediate multimodal fusion; the model needs no patient samples to be dropped during training or inference and handles incomplete modality combinations naturally. Result: Intermediate fusion outperforms unimodal baselines and both early and late fusion, with the fusion of WSI and clinical variables performing best (C-index of 73.30); the model adaptively weights modalities, automatically down-weighting less informative ones such as CT. Conclusion: The missing-aware multimodal framework improves the accuracy and practicality of NSCLC survival prediction and can operate robustly in real clinical settings, particularly for small-cohort studies with missing modalities. Abstract: Accurate survival prediction in Non-Small Cell Lung Cancer (NSCLC) requires the integration of heterogeneous clinical, radiological, and histopathological information. While Multimodal Deep Learning (MDL) offers promise for precision prognosis and survival prediction, its clinical applicability is severely limited by small cohort sizes and the presence of missing modalities, often forcing complete-case filtering or aggressive imputation. In this work, we present a missing-aware multimodal survival framework that integrates Computed Tomography (CT), Whole-Slide Histopathology (WSI) Images, and structured clinical variables for overall survival modeling in unresectable stage II-III NSCLC. By leveraging Foundation Models (FM) for modality-specific feature extraction and a missing-aware encoding strategy, the proposed approach enables intermediate multimodal fusion under naturally incomplete modality profiles. The proposed architecture is resilient to missing modalities by design, allowing the model to utilize all available data without being forced to drop patients during training or inference. Experimental results demonstrate that intermediate fusion consistently outperforms unimodal baselines as well as early and late fusion strategies, with the strongest performance achieved by the fusion of WSI and clinical modalities (73.30 C-index).
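For readers unfamiliar with the metric, the C-index measures how often a model ranks pairs of patients in the right order of survival. The sketch below is a simplified version of Harrell's concordance index, not the authors' evaluation code: pairs with tied follow-up times are skipped, and the toy cohort is invented for illustration. The reported 73.30 is presumably this quantity multiplied by 100.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Simplified Harrell's C-index for right-censored survival data.

    times  : observed follow-up time per patient
    events : 1 if death was observed, 0 if the patient was censored
    risks  : predicted risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Comparable only if the times differ and the earlier one is an
        # observed event rather than a censoring.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0  # earlier death correctly ranked as riskier
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Invented toy cohort: four patients, the third one censored.
c_index = concordance_index(
    times=[5, 10, 15, 20],
    events=[1, 1, 0, 1],
    risks=[0.9, 0.4, 0.6, 0.1],
)
print(c_index)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. Censored patients still contribute whenever they outlive a patient with an observed event.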
Further analyses of modality importance reveal an adaptive behavior in which less informative modalities, i.e., the CT modality, are automatically down-weighted and contribute less to the final survival prediction.
[125] Multi-Temporal Frames Projection for Dynamic Processes Fusion in Fluorescence Microscopy
Hassan Eshkiki,Sarah Costa,Mostafa Mohammadpour,Farinaz Tanhaei,Christopher H. George,Fabio Caraffini
Main category: cs.CV
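TL;DR: Proposes a computational framework that fuses multi-temporal fluorescence microscopy frames into a single high-quality image, markedly improving cell counting and image quality.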
Details
Motivation: Fluorescence microscopy images are often affected by noise, temporal variability, and signal fluctuations, which limits their use in biological analysis. Method: The framework combines several explainable computer vision techniques to integrate information from multiple time-resolved frames into one high-quality fused image. Result: Validated on a dataset of dynamic, heterogeneous 2D monolayers of cardiac cells, it raises the average cell count by 44% compared with previous methods. Conclusion: The framework preserves and enhances the biological content of the original video and applies to other imaging domains that need multi-temporal image fusion. Abstract: Fluorescence microscopy is widely employed for the analysis of living biological samples; however, the utility of the resulting recordings is frequently constrained by noise, temporal variability, and inconsistent visualisation of signals that oscillate over time. We present a unique computational framework that integrates information from multiple time-resolved frames into a single high-quality image, while preserving the underlying biological content of the original video. We evaluate the proposed method through an extensive number of configurations (n = 111) and on a challenging dataset comprising dynamic, heterogeneous, and morphologically complex 2D monolayers of cardiac cells. Results show that our framework, which consists of a combination of explainable techniques from different computer vision application fields, is capable of generating composite images that preserve and enhance the quality and information of individual microscopy frames, yielding a 44% average increase in cell count compared to previous methods. The proposed pipeline is applicable to other imaging domains that require the fusion of multi-temporal image stacks into high-quality 2D images, thereby facilitating annotation and downstream segmentation.
[126] Lunar-G2R: Geometry-to-Reflectance Learning for High-Fidelity Lunar BRDF Estimation
Clementine Grethen,Nicolas Menga,Roland Brochard,Geraldine Morin,Simone Gasparini,Jeremy Lebreton,Manuel Sanchez Gestido
Main category: cs.CV
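TL;DR: This paper proposes Lunar-G2R, a geometry-to-reflectance learning framework that predicts spatially varying BRDF parameters directly from a lunar digital elevation model (DEM), without multi-view imagery or dedicated hardware, markedly improving the photometric accuracy and visual realism of lunar surface rendering.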
Details
Motivation: Existing lunar surface rendering methods rely on simplified or spatially uniform BRDF models whose parameters are hard to estimate and which cannot capture local reflectance variations, limiting photometric realism. Method: Lunar-G2R uses a U-Net combined with differentiable rendering to predict spatially varying BRDF parameters directly from the DEM by minimizing the photometric discrepancy between real orbital images and physically based renderings. Result: Experiments on the Tycho crater region show a 38% reduction in photometric error compared with the state-of-the-art baseline, higher PSNR and SSIM, better perceptual similarity, and the ability to capture fine-scale reflectance variations. Conclusion: This is the first method to infer a spatially varying reflectance model directly from terrain geometry alone, which is significant for planetary surface modeling. Abstract: We address the problem of estimating realistic, spatially varying reflectance for complex planetary surfaces such as the lunar regolith, which is critical for high-fidelity rendering and vision-based navigation. Existing lunar rendering pipelines rely on simplified or spatially uniform BRDF models whose parameters are difficult to estimate and fail to capture local reflectance variations, limiting photometric realism. We propose Lunar-G2R, a geometry-to-reflectance learning framework that predicts spatially varying BRDF parameters directly from a lunar digital elevation model (DEM), without requiring multi-view imagery, controlled illumination, or dedicated reflectance-capture hardware at inference time. The method leverages a U-Net trained with differentiable rendering to minimize photometric discrepancies between real orbital images and physically based renderings under known viewing and illumination geometry. Experiments on a geographically held-out region of the Tycho crater show that our approach reduces photometric error by 38% compared to a state-of-the-art baseline, while achieving higher PSNR and SSIM and improved perceptual similarity, capturing fine-scale reflectance variations absent from spatially uniform models. To our knowledge, this is the first method to infer a spatially varying reflectance model directly from terrain geometry.
[127] Urban Socio-Semantic Segmentation with Vision-Language Reasoning
Yu Wang,Yi Wang,Rui Dai,Yujie Wang,Kaikui Liu,Xiangxiang Chu,Yansheng Li
Main category: cs.CV
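TL;DR: This paper proposes an urban socio-semantic segmentation approach based on vision-language model reasoning, introducing the SocioSeg dataset and the SocioReasoner framework; cross-modal recognition and multi-stage reasoning enable accurate segmentation of socially defined entities, with superior performance and strong zero-shot generalization.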
Details
Motivation: Existing segmentation models work well on entities defined by physical attributes but struggle with socio-semantic categories (such as schools and parks), so new methods that can understand social semantics are needed. Method: The authors build SocioSeg, a socio-semantic segmentation dataset with satellite imagery, digital maps, and hierarchical pixel-level labels, and propose the SocioReasoner framework, which uses a vision-language model for cross-modal recognition and multi-stage reasoning and optimizes the non-differentiable reasoning process with reinforcement learning. Result: Experiments show that the method outperforms current state-of-the-art models and generalizes well in the zero-shot setting. Conclusion: Reasoning with vision-language models can achieve socio-semantic segmentation in urban environments, opening a new path for remote sensing image understanding. Abstract: As hubs of human activity, urban surfaces consist of a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks). In this work, we achieve socio-semantic segmentation by vision-language model reasoning. To facilitate this, we introduce the Urban Socio-Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose a novel vision-language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic entities via cross-modal recognition and multi-stage reasoning. We employ reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model. Experiments demonstrate our approach's gains over state-of-the-art models and strong zero-shot generalization. Our dataset and code are available at https://github.com/AMAP-ML/SocioReasoner.
[128] mergetune: Continued fine-tuning of vision-language models
Wenqing Wang,Da Li,Xiatian Zhu,Josef Kittler
Main category: cs.CV
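TL;DR: Proposes MERGETUNE, a continued fine-tuning (CFT) method that uses linear mode connectivity (LMC) to recover the pretrained knowledge a vision-language model loses during fine-tuning, without extra parameters and with better generalization.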
Details
Motivation: Fine-tuning vision-language models often causes catastrophic forgetting of pretrained knowledge, which existing methods cannot fully prevent, so a new paradigm that restores the original knowledge after fine-tuning is needed. Method: The continued fine-tuning (CFT) strategy MERGETUNE uses linear mode connectivity (LMC) to find a continued model connected by low-loss paths to the zero-shot and fine-tuned models, and approximates the constraint with a second-order surrogate to avoid large-scale data replay. Result: MERGETUNE raises the harmonic mean of base-novel generalization by 5.6% over CoOp without adding parameters; in robustness evaluations it surpasses ensemble baselines at lower inference cost and reaches state of the art when ensembled with the zero-shot model. Conclusion: MERGETUNE is an effective post-adaptation method for restoring pretrained knowledge in VLMs and shows the potential of loss-landscape structure for model merging and knowledge retention. Abstract: Fine-tuning vision-language models (VLMs) such as CLIP often leads to catastrophic forgetting of pretrained knowledge. Prior work primarily aims to mitigate forgetting during adaptation; however, forgetting often remains inevitable during this process. We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted. We propose a simple, model-agnostic CFT strategy (named MERGETUNE) guided by linear mode connectivity (LMC), which can be applied post hoc to existing fine-tuned models without requiring architectural changes. Given a fine-tuned model, we continue fine-tuning its trainable parameters (e.g., soft prompts or linear heads) to search for a continued model which has two low-loss paths to the zero-shot (e.g., CLIP) and the fine-tuned (e.g., CoOp) solutions. By exploiting the geometry of the loss landscape, the continued model implicitly merges the two solutions, restoring pretrained knowledge lost in the fine-tuned counterpart. A challenge is that the vanilla LMC constraint requires data replay from the pretraining task. We approximate this constraint for the zero-shot model via a second-order surrogate, eliminating the need for large-scale data replay. Experiments show that MERGETUNE improves the harmonic mean of CoOp by +5.6% on base-novel generalisation without adding parameters.
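The harmonic mean quoted for base-novel generalisation rewards models that do well on both base and novel classes at once. A minimal sketch, assuming it is the standard harmonic mean of the two accuracies; the function name and the sample accuracies are illustrative and not taken from the paper.

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base-class and novel-class accuracy."""
    total = base_acc + novel_acc
    if total == 0:
        return 0.0  # both accuracies are zero
    return 2.0 * base_acc * novel_acc / total


# Hypothetical accuracies in percent: strong on base classes, weaker on novel ones.
h = harmonic_mean(80.0, 60.0)
print(round(h, 2))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a fine-tuned model that gains on base classes but forgets novel ones scores lower than its arithmetic average would suggest.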
On robust fine-tuning evaluations, the LMC-merged model from MERGETUNE surpasses ensemble baselines with lower inference cost, achieving further gains and state-of-the-art results when ensembled with the zero-shot model. Our code is available at https://github.com/Surrey-UP-Lab/MERGETUNE.
[129] SatMap: Revisiting Satellite Maps as Prior for Online HD Map Construction
Kanak Mazumder,Fabian B. Flohr
Main category: cs.CV
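TL;DR: This paper proposes SatMap, an online vectorized HD map estimation method that combines satellite maps with multi-view camera observations to serve downstream prediction and planning modules in autonomous driving.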
Details
Motivation: Existing onboard-camera methods suffer from limited depth perception and accuracy loss caused by occlusion, so a more robust map construction method is needed. Method: SatMap uses the lane-level semantics and texture in bird's-eye-view satellite imagery as a global prior, fuses them with multi-view camera data, and directly predicts a vectorized HD map. Result: On the nuScenes dataset, SatMap improves mAP by 34.8% over the camera-only baseline and by 8.5% over the camera-LiDAR fusion baseline, and performs better at long range and in adverse weather. Conclusion: Using satellite maps as a prior mitigates depth ambiguity and occlusion and markedly improves the accuracy and robustness of online HD map construction. Abstract: Online high-definition (HD) map construction is an essential part of a safe and robust end-to-end autonomous driving (AD) pipeline. Onboard camera-based approaches suffer from limited depth perception and degraded accuracy due to occlusion. In this work, we propose SatMap, an online vectorized HD map estimation method that integrates satellite maps with multi-view camera observations and directly predicts a vectorized HD map for downstream prediction and planning modules. Our method leverages lane-level semantics and texture from satellite imagery captured from a Bird's Eye View (BEV) perspective as a global prior, effectively mitigating depth ambiguity and occlusion. In our experiments on the nuScenes dataset, SatMap achieves 34.8% mAP performance improvement over the camera-only baseline and 8.5% mAP improvement over the camera-LiDAR fusion baseline. Moreover, we evaluate our model in long-range and adverse weather conditions to demonstrate the advantages of using a satellite prior map. Source code will be available at https://iv.ee.hm.edu/satmap/.
[130] BikeActions: An Open Platform and Benchmark for Cyclist-Centric VRU Action Recognition
Max A. Buettner,Kanak Mazumder,Luca Koecher,Mario Finkbeiner,Sebastian Niebler,Fabian B. Flohr
Main category: cs.CV
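TL;DR: This paper introduces FUSE-Bike, the first open perception platform built around the cyclist's viewpoint, and BikeActions, a multi-modal dataset for improving behavior modeling of vulnerable road users (VRUs), and establishes benchmark performance for graph convolution and Transformer models.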
Details
Motivation: Current autonomous driving research focuses mainly on pedestrian crossing behavior and lacks studies of VRU interactions in dense shared spaces, in particular data captured from the cyclist's viewpoint. Method: The authors develop the FUSE-Bike platform, equipped with LiDAR, a camera, and GNSS, to capture close-range data from the cyclist's viewpoint; they build the BikeActions dataset with 852 samples across 5 action classes and benchmark graph convolution and Transformer models on it. Result: The work releases the first public dataset and hardware design for cyclist action recognition and establishes the first performance baselines for the task, supporting research on multi-modal perception and behavior prediction. Conclusion: FUSE-Bike and BikeActions fill the gap in fine-grained understanding of VRU behavior in shared traffic spaces and advance VRU-centric perception. Abstract: Anticipating the intentions of Vulnerable Road Users (VRUs) is a critical challenge for safe autonomous driving (AD) and mobile robotics. While current research predominantly focuses on pedestrian crossing behaviors from a vehicle's perspective, interactions within dense shared spaces remain underexplored. To bridge this gap, we introduce FUSE-Bike, the first fully open perception platform of its kind. Equipped with two LiDARs, a camera, and GNSS, it facilitates high-fidelity, close-range data capture directly from a cyclist's viewpoint. Leveraging this platform, we present BikeActions, a novel multi-modal dataset comprising 852 annotated samples across 5 distinct action classes, specifically tailored to improve VRU behavior modeling. We establish a rigorous benchmark by evaluating state-of-the-art graph convolution and transformer-based models on our publicly released data splits, establishing the first performance baselines for this challenging task. We release the full dataset together with data curation tools, the open hardware design, and the benchmark code to foster future research in VRU action understanding under https://iv.ee.hm.edu/bikeactions/.
[131] SVII-3D: Advancing Roadside Infrastructure Inventory with Decimeter-level 3D Localization and Comprehension from Sparse Street Imagery
Chong Liu,Luxuan Fu,Yang Jia,Zhen Dong,Bisheng Yang
Main category: cs.CV
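TL;DR: SVII-3D is a unified framework for high-fidelity infrastructure digitization that fuses LoRA fine-tuned open-set detection, geometry-guided refinement, and a vision-language model to build robust digital twins from sparse imagery.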
Details
Motivation: Existing methods for building digital twins from sparse imagery fall short in robustness, localization accuracy, and fine-grained state understanding, and cannot meet the needs of smart cities and facility lifecycle management. Method: The SVII-3D framework (1) combines LoRA fine-tuned open-set detection with a spatial-attention matching network to associate observations robustly across sparse views; (2) introduces a geometry-guided refinement mechanism that raises 3D localization accuracy to the decimeter level; and (3) integrates a vision-language model agent driven by multi-modal prompting to diagnose fine-grained operational states automatically. Result: Experiments show that SVII-3D significantly improves asset identification accuracy and reduces localization error, achieving high-precision 3D asset digitization and state awareness from sparse image input. Conclusion: SVII-3D offers a scalable, low-cost solution for high-fidelity infrastructure digitization and bridges the gap between sparse perception and automated intelligent maintenance. Abstract: The automated creation of digital twins and precise asset inventories is a critical task in smart city construction and facility lifecycle management. However, utilizing cost-effective sparse imagery remains challenging due to limited robustness, inaccurate localization, and a lack of fine-grained state understanding. To address these limitations, SVII-3D, a unified framework for holistic asset digitization, is proposed. First, LoRA fine-tuned open-set detection is fused with a spatial-attention matching network to robustly associate observations across sparse views. Second, a geometry-guided refinement mechanism is introduced to resolve structural errors, achieving precise decimeter-level 3D localization. Third, transcending static geometric mapping, a Vision-Language Model agent leveraging multi-modal prompting is incorporated to automatically diagnose fine-grained operational states. Experiments demonstrate that SVII-3D significantly improves identification accuracy and minimizes localization errors. Consequently, this framework offers a scalable, cost-effective solution for high-fidelity infrastructure digitization, effectively bridging the gap between sparse perception and automated intelligent maintenance.
[132] Enhancing the quality of gauge images captured in smoke and haze scenes through deep learning
Oscar H. Ramírez-Agudelo,Akshay N. Shewatkar,Edoardo Milana,Roland C. Aydin,Kai Franke
Main category: cs.CV
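TL;DR: This study uses two deep learning models, FFA-Net and AECR-Net, to improve the readability of analog gauge images captured in smoke and haze, training on a synthetic dataset of more than 14,000 images; AECR-Net performs better on dehazing, and although smoke removal is weaker, gauge recognition in harsh environments improves markedly overall.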
Details
Motivation: Smoke and haze reduce image visibility and hamper infrastructure monitoring and emergency response; no public gauge image dataset exists, and existing methods target dehazing rather than smoke removal, so image enhancement methods suited to these scenes are needed to support automatic gauge reading. Method: Two deep learning architectures, FFA-Net and AECR-Net, are used to enhance gauge images degraded by haze and smoke; a synthetic dataset of more than 14,000 images is generated with the Unreal Engine and split 80% training, 10% validation, and 10% test; performance is evaluated with SSIM and PSNR. Result: On the synthetic haze dataset, SSIM reaches 0.98 and PSNR about 43 dB, close to the current state of the art, with AECR-Net outperforming FFA-Net; results on the smoke dataset are weaker but still useful, limited mainly by the inhomogeneity and high density of smoke. Conclusion: Deep learning models can markedly improve the quality of analog gauge images captured in smoke and haze; the enhanced images can be used for subsequent automatic reading and support rescue personnel in emergencies, and models designed specifically for smoke removal are needed in future work. Abstract: Images captured in hazy and smoky environments suffer from reduced visibility, posing a challenge when monitoring infrastructures and hindering emergency services during critical situations. The proposed work investigates the use of deep learning models to enhance the automatic, machine-based readability of gauges in smoky environments, with accurate gauge data interpretation serving as a valuable tool for first responders. The study utilizes two deep learning architectures, FFA-Net and AECR-Net, to improve the visibility of gauge images corrupted with light to dense haze and smoke. Since benchmark datasets of analog gauge images are unavailable, a new synthetic dataset, containing over 14,000 images, was generated using the Unreal Engine. The models were trained with an 80% train, 10% validation, and 10% test split for the haze and smoke dataset, respectively. For the synthetic haze dataset, the SSIM and PSNR metrics are about 0.98 and 43 dB, respectively, comparing well to state-of-the-art results. Additionally, more robust results are retrieved from the AECR-Net, when compared to the FFA-Net. Although the results from the synthetic smoke dataset are poorer, the trained models achieve interesting results. In general, images captured in the presence of smoke are more difficult to enhance given its inhomogeneity and high density. Moreover, FFA-Net and AECR-Net were designed to dehaze, not to desmoke, images.
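To give the reported PSNR figure some intuition, the sketch below computes PSNR from the mean squared error between a reference image and a restored one. It is a minimal pure-Python illustration, not the study's evaluation code: images are assumed to be flat sequences of 8-bit pixel values, and the four-pixel example is invented.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Both images are flat sequences of pixel intensities; `max_value` is
    the largest possible intensity (255 for 8-bit images, an assumption).
    """
    if len(reference) != len(restored) or len(reference) == 0:
        raise ValueError("images must be non-empty and the same size")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Invented four-pixel example: the restored image is off by at most 2 levels.
clean = [0, 128, 255, 64]
dehazed = [1, 127, 255, 66]
score = psnr(clean, dehazed)
print(round(score, 2))
```

Under the 8-bit assumption, a PSNR of 43 dB corresponds to a mean squared error of roughly 3.3, so restored pixels differ from the reference by under two intensity levels in the root-mean-square sense.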
This work shows that the use of deep learning architectures can greatly improve the quality of analog gauge images captured in smoke and haze scenes. Finally, the enhanced output images can be successfully post-processed for automatic reading of gauges.
[133] Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure
Luxuan Fu,Chong Liu,Bisheng Yang,Zhen Dong
Main category: cs.CV
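TL;DR: Proposes a domain-adapted framework that turns large vision-language models (VLMs) into specialized agents for intelligent analysis of urban roadside infrastructure, combining data-efficient fine-tuning with knowledge-grounded reasoning and performing strongly on detection and attribute recognition.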
Details
Motivation: General-purpose models struggle to capture the fine-grained attributes and engineering-standard requirements of urban roadside infrastructure, and existing VLMs tend to hallucinate and comply poorly in practice, which undermines the reliability of automated perception. Method: A two-stage approach first uses Grounding DINO for open-vocabulary object localization, then fine-tunes Qwen-VL with LoRA for semantic attribute reasoning, and adds a dual-modality RAG mechanism that retrieves industry standards and visual exemplars dynamically at inference time to strengthen professional compliance. Result: On a newly built dataset of urban roadside scenes, detection reaches 58.9 mAP and attribute recognition accuracy reaches 95.5%. Conclusion: The framework improves the perception accuracy and compliance of VLMs in a specialized domain and provides a reliable solution for intelligent infrastructure monitoring. Abstract: Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often struggle to capture the necessary fine-grained attributes and domain rules. While Large Vision Language Models (VLMs) excel at open-world recognition, they often struggle to accurately interpret complex facility states in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.
[134] Inference-time Physics Alignment of Video Generative Models with Latent World Models
Jianhao Yuan,Xiaofeng Zhang,Felix Friedrich,Nicolas Beltran-Velez,Melissa Hall,Reyhane Askari-Hemmat,Xiaochuang Han,Nicolas Ballas,Michal Drozdzal,Adriana Romero-Soriano
Main category: cs.CV
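TL;DR: Proposes WMReward, which uses a latent world model as a reward to steer the inference process of video generation, substantially improving the physical plausibility of generated content.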
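As a rough illustration of the search half of this idea, the sketch below does plain best-of-N selection: it draws several candidates and keeps the one a reward function scores highest. `sample_candidate` and `reward` are hypothetical stand-ins (a toy random generator and a toy smoothness score), not the paper's video model or its VJEPA-2-based reward, and steering of denoising trajectories is not shown.

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    sample_candidate: Callable[[random.Random], List[float]],
    reward: Callable[[List[float]], float],
    n: int,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Draw n candidates and return the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [sample_candidate(rng) for _ in range(n)]
    scores = [reward(c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]


# Toy stand-ins (assumptions): a "video" is a short list of 1D positions, and
# the "physics" reward penalizes jerky motion via squared second differences.
def toy_sample(rng: random.Random) -> List[float]:
    return [rng.uniform(0.0, 1.0) for _ in range(8)]


def toy_smoothness_reward(traj: List[float]) -> float:
    return -sum(
        (traj[i + 1] - 2 * traj[i] + traj[i - 1]) ** 2
        for i in range(1, len(traj) - 1)
    )


best, score = best_of_n(toy_sample, toy_smoothness_reward, n=16)
```

Raising `n` spends more test-time compute for a candidate that scores at least as well under the reward, which is the scaling behaviour the entry refers to.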
Details
Motivation: Existing video generative models produce promising visuals but often violate basic physical laws, which limits their use. The authors argue that this is not only a training issue but also a matter of inference strategy. Method: Introduces the WMReward framework, which at inference time uses a pretrained latent world model (here, VJEPA-2) as a physics prior to build a reward function, then searches over and steers multiple denoising trajectories to improve the physical plausibility of the generated video. Result: Physical plausibility improves substantially across image-conditioned, multiframe-conditioned, and text-conditioned generation settings. The method took first place in the ICCV 2025 PhysicsIQ Challenge with a score of 62.64%, beating the previous best method by 7.42%. A human preference study also confirms the gains. Conclusion: Inference-time alignment with a latent world model is an effective and broadly applicable way to improve the physical plausibility of video generation. Abstract: State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from a human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve physics plausibility of video generation, beyond this specific instantiation or parameterization.
[135] DeepUrban: Interaction-Aware Trajectory Prediction and Planning for Automated Driving by Aerial Imagery
Constantin Selzer,Fabian B. Flohr
Main category: cs.CV
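TL;DR: DeepUrban is a new drone dataset focused on dense urban traffic scenes, intended to strengthen benchmarks for trajectory prediction and planning algorithms.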
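Trajectory prediction results for this entry are reported in ADE and FDE. As a reminder of what those measure, here is a minimal sketch using the usual definitions: ADE is the mean Euclidean distance between predicted and ground-truth positions over the horizon, and FDE is that distance at the final step. The function name and the toy trajectories are assumptions for illustration only and are not taken from the dataset.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def displacement_errors(pred: List[Point], truth: List[Point]) -> Tuple[float, float]:
    """Return (ADE, FDE) for one predicted trajectory against ground truth."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Per-timestep Euclidean distance between prediction and ground truth.
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    ade = sum(dists) / len(dists)  # average over all timesteps
    fde = dists[-1]                # error at the final timestep only
    return ade, fde


# Toy example: the prediction drifts upward by one unit per step.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(pred, truth)
```

Lower is better for both metrics, so an "improvement" in ADE or FDE means a reduction in these distances.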
Details
Motivation: Existing autonomous-driving benchmarks lack dense-traffic scenarios, which limits the understanding and modeling of complex interactions among road users. Method: In collaboration with industrial partner DeepScenario, the authors build the DeepUrban dataset, which contains 3D traffic objects extracted from high-resolution images of urban intersections captured from roughly 100 meters altitude, together with map and scene information; experiments add DeepUrban to nuScenes for validation. Result: On trajectory prediction, adding DeepUrban improves ADE and FDE by up to 44.1% and 44.3% respectively, and strengthens generalization. Conclusion: DeepUrban fills a gap in dense urban traffic data, markedly improves the prediction and planning performance of existing methods, and has clear value both as a benchmark and in applications. Abstract: The efficacy of autonomous driving systems hinges critically on robust prediction and planning capabilities. However, current benchmarks are impeded by a notable scarcity of scenarios featuring dense traffic, which is essential for understanding and modeling complex interactions among road users. To address this gap, we collaborated with our industrial partner, DeepScenario, to develop DeepUrban, a new drone dataset designed to enhance trajectory prediction and planning benchmarks focusing on dense urban settings. DeepUrban provides a rich collection of 3D traffic objects, extracted from high-resolution images captured over urban intersections at approximately 100 meters altitude. The dataset is further enriched with comprehensive map and scene information to support advanced modeling and simulation tasks. We evaluate state-of-the-art (SOTA) prediction and planning methods, and conduct experiments on generalization capabilities. Our findings demonstrate that adding DeepUrban to nuScenes can boost the accuracy of vehicle predictions and planning, achieving improvements up to 44.1% / 44.3% on the ADE / FDE metrics. Website: https://iv.ee.hm.edu/deepurban
[136] Jordan-Segmentable Masks: A Topology-Aware definition for characterizing Binary Image Segmentation
Serena Grazia De Benedictis,Amedeo Altavilla,Nicoletta Del Buono
Main category: cs.CV
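TL;DR: Proposes a topology-aware criterion for evaluating image segmentation, based on the Jordan Curve Theorem and digital topology, that uses Betti numbers to verify the structural coherence of a segmentation mask and ensure the image is partitioned into meaningful interior and exterior regions.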
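One way to picture the interior/exterior separation this entry relies on is to count the 8-connected components of the complement of a candidate curve and check that there are exactly two. The sketch below does only that, on a small binary grid where 1 marks curve pixels; the function name and the toy grids are assumptions, and the paper's extraction of the 4-curve candidate and its Betti-number computation are not reproduced.

```python
from collections import deque
from typing import List

# 8-neighborhood offsets (horizontal, vertical, and diagonal moves).
NEIGHBORS_8 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]


def count_complement_components(curve: List[List[int]]) -> int:
    """Count 8-connected components among the 0-pixels of a binary grid."""
    rows, cols = len(curve), len(curve[0])
    seen = [[False] * cols for _ in range(rows)]
    components = 0
    for r in range(rows):
        for c in range(cols):
            if curve[r][c] != 0 or seen[r][c]:
                continue
            # New component: flood-fill it with a breadth-first search.
            components += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr, dc in NEIGHBORS_8:
                    nr, nc = cr + dr, cc + dc
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and curve[nr][nc] == 0 and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
    return components


closed_ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
# Same ring with one pixel removed: interior leaks into the exterior.
broken_ring = [row[:] for row in closed_ring]
broken_ring[1][2] = 0
# Diagonal-only loop: closed under 8-adjacency but not a 4-curve.
diamond = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

ring_parts = count_complement_components(closed_ring)
broken_parts = count_complement_components(broken_ring)
diamond_parts = count_complement_components(diamond)
```

The diamond case shows why the curve's adjacency and the complement's adjacency are paired: a loop that is only diagonally connected does not stop an 8-connected complement from leaking through its corners.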
Details
Motivation: Conventional segmentation metrics struggle to capture the structural and topological consistency of a segmentation; in applications such as medical imaging, small boundary errors or fragmented predictions can still score highly while being semantically wrong. Method: Defines the notion of a "Jordan-segmentable mask" based on the Jordan Curve Theorem, uses digital topology and homology theory to extract a 4-curve candidate from the mask, and verifies its topological validity with Betti numbers (β₀ = β₁ = 1). Result: Provides a mathematically rigorous, unsupervised criterion for the structural coherence of segmentation masks, able to decide whether a segmentation correctly splits the image domain into two 8-connected complement components. Conclusion: By combining digital Jordan theory with homological invariants, the method offers a new topological perspective on segmentation evaluation, especially suited to applications where topological correctness must be preserved. Abstract: Image segmentation plays a central role in computer vision. However, widely used evaluation metrics, whether pixel-wise, region-based, or boundary-focused, often struggle to capture the structural and topological coherence of a segmentation. In many practical scenarios, such as medical imaging or object delineation, small inaccuracies in boundary, holes, or fragmented predictions can result in high metric scores, despite the fact that the resulting masks fail to preserve the object's global shape or connectivity. This highlights a limitation of conventional metrics: they are unable to assess whether a predicted segmentation partitions the image into meaningful interior and exterior regions. In this work, we introduce a topology-aware notion of segmentation based on the Jordan Curve Theorem, and adapted for use in digital planes. We define the concept of a \emph{Jordan-segmentatable mask}, which is a binary segmentation whose structure ensures a topological separation of the image domain into two connected components. We analyze segmentation masks through the lens of digital topology and homology theory, extracting a $4$-curve candidate from the mask and verifying its topological validity using Betti numbers. A mask is considered Jordan-segmentatable when this candidate forms a digital 4-curve with $β_0 = β_1 = 1$, or equivalently when its complement splits into exactly two $8$-connected components. This framework provides a mathematically rigorous, unsupervised criterion with which to assess the structural coherence of segmentation masks. 
By combining digital Jordan theory and homological invariants, our approach provides a valuable alternative to standard evaluation metrics, especially in applications where topological correctness must be preserved.
[137] Adversarial Evasion Attacks on Computer Vision using SHAP Values
Frank Mollard,Marcus Becker,Florian Roehrbein
Main category: cs.CV
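TL;DR: Proposes a white-box attack based on SHAP values for adversarial evasion of computer vision models, and shows that it is more robust than FGSM at inducing misclassifications, particularly in gradient hiding scenarios.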
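For context on the baseline, FGSM perturbs an input by a fixed step in the direction of the sign of the loss gradient, x' = x + ε · sign(∇ₓL). The sketch below applies that rule to a toy logistic-regression classifier whose gradient is available in closed form; the model, weights, and ε are assumptions chosen for illustration, and this is the FGSM comparison point only, not the SHAP-based attack, whose construction is not detailed in this entry.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def predict(x: List[float], w: List[float], b: float) -> float:
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgsm_logistic(x: List[float], w: List[float], b: float, y: int, eps: float) -> List[float]:
    """One FGSM step against binary cross-entropy loss for true label y."""
    p = predict(x, w, b)
    # For logistic regression, d(loss)/dx = (p - y) * w.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Toy values (assumptions): a 2-feature input with true label 1.
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1
x_adv = fgsm_logistic(x, w, b, y, eps=0.3)
p_clean = predict(x, w, b)
p_adv = predict(x_adv, w, b)
```

Because FGSM depends on the gradient being informative, defenses that hide or mask gradients target exactly this step, which is the setting where the entry reports the SHAP attack holding up better.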
Details
Motivation: To expose the vulnerability of deep learning models to imperceptible adversarial examples and to explore whether existing explainability methods such as SHAP can be turned into attacks. Method: Uses SHAP values to quantify how much each input feature contributes to the model output, guides adversarial example generation at the inference stage, and compares the result experimentally against FGSM. Result: The SHAP attack is more effective at lowering output confidence and inducing misclassifications, and keeps a high success rate even when the model uses gradient hiding as a defense. Conclusion: SHAP values are useful not only for model explanation but can also be abused as a strong adversarial attack tool, which calls for better defenses against explanation-based attacks. Abstract: The paper introduces a white-box attack on computer vision models using SHAP values. It demonstrates how adversarial evasion attacks can compromise the performance of deep learning models by reducing output confidence or inducing misclassifications. Such attacks are particularly insidious as they can deceive the perception of an algorithm while eluding human perception due to their imperceptibility to the human eye. The proposed attack leverages SHAP values to quantify the significance of individual inputs to the output at the inference stage. A comparison is drawn between the SHAP attack and the well-known Fast Gradient Sign Method. We find evidence that SHAP attacks are more robust in generating misclassifications particularly in gradient hiding scenarios.
[138] Action100M: A Large-scale Video Action Dataset
Delong Chen,Tejaswi Kasarla,Yejin Bang,Mustafa Shukor,Willy Chung,Jade Yu,Allen Bolourchi,Theo Moutakanni,Pascale Fung
Main category: cs.CV
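TL;DR: Introduces Action100M, a large-scale, open-vocabulary, action-annotated video dataset generated automatically from 1.2 million Internet instructional videos. It contains roughly 100 million temporally localized action segments with structured annotations produced by a fully automated pipeline, and its usefulness for video understanding and world modeling is validated.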
Details
Motivation: Improving machine understanding of actions in the physical world requires large-scale, open-vocabulary, cross-domain video action datasets, but existing datasets are limited by annotation cost and coverage. Method: A fully automated pipeline: hierarchical temporal segmentation with V-JEPA 2 embeddings; multi-level frame and segment captions organized as a Tree-of-Captions; and a reasoning model (GPT-OSS-120B) that aggregates evidence under a multi-round Self-Refine procedure to output structured action annotations (action, actor, detailed description, and so on). Result: The resulting Action100M dataset contains 1.2 million videos (14.6 years of footage) and yields about 100 million temporally localized action segments; VL-JEPA trained on it shows consistent data-scaling gains and strong zero-shot performance on a range of action recognition benchmarks. Conclusion: Action100M is an efficient, scalable large-scale video action dataset that provides a solid foundation for research in video understanding and world modeling and demonstrates the potential of fully automated data generation pipelines. Abstract: Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world. Achieving this requires large-scale, open-vocabulary video action datasets that span broad domains. We introduce Action100M, a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration), yielding O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions. Action100M is generated by a fully automated pipeline that (i) performs hierarchical temporal segmentation using V-JEPA 2 embeddings, (ii) produces multi-level frame and segment captions organized as a Tree-of-Captions, and (iii) aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (brief/detailed action, actor, brief/detailed caption). Training VL-JEPA on Action100M demonstrates consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks, establishing Action100M as a new foundation for scalable research in video understanding and world modeling.
[139] RSATalker: Realistic Socially-Aware Talking Head Generation for Multi-Turn Conversation
Peng Chen,Xiaobao Wei,Yi Yang,Naiming Yao,Hui Chen,Feng Tian
Main category: cs.CV
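TL;DR: Presents RSATalker, the first realistic, socially-aware talking head generation framework based on 3D Gaussian Splatting, supporting multi-turn conversation and modeling complex social relationships.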
Details
Motivation: Existing talking head generation methods fall short in realism, computational efficiency, or social relationship modeling, and in particular lack explicit modeling of interpersonal relationships. Method: Combines speech-driven mesh animation with 3D Gaussian Splatting rendering, adds a socially-aware module with a learnable query mechanism that encodes blood/non-blood and equal/unequal social relationships, and uses a three-stage training paradigm. Result: Evaluated on the newly built RSATalker dataset annotated with social relationships, the method achieves state-of-the-art realism and social awareness. Conclusion: RSATalker is the first to apply 3DGS to socially-aware dual-person talking head generation, combining high-quality rendering with social relationship modeling and offering a new direction for social interaction in VR. Abstract: Talking head generation is increasingly important in virtual reality (VR), especially for social scenarios involving multi-turn conversation. Existing approaches face notable limitations: mesh-based 3D methods can model dual-person dialogue but lack realistic textures, while large-model-based 2D methods produce natural appearances but incur prohibitive computational costs. Recently, 3D Gaussian Splatting (3DGS) based methods achieve efficient and realistic rendering but remain speaker-only and ignore social relationships. We introduce RSATalker, the first framework that leverages 3DGS for realistic and socially-aware talking head generation with support for multi-turn conversation. Our method first drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to mesh facets to render high-fidelity 2D avatar videos. To capture interpersonal dynamics, we propose a socially-aware module that encodes social relationships, including blood and non-blood as well as equal and unequal, into high-level embeddings through a learnable query mechanism. We design a three-stage training paradigm and construct the RSATalker dataset with speech-mesh-image triplets annotated with social relationships. Extensive experiments demonstrate that RSATalker achieves state-of-the-art performance in both realism and social awareness. The code and dataset will be released.
[140] Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Christopher Clark,Jieyu Zhang,Zixian Ma,Jae Sung Park,Mohammadreza Salehi,Rohun Tripathi,Sangho Lee,Zhongzheng Ren,Chris Dongjoo Kim,Yinuo Yang,Vincent Shao,Yue Yang,Weikai Huang,Ziqi Gao,Taira Anderson,Jianrui Zhang,Jitesh Jain,George Stoica,Winson Han,Ali Farhadi,Ranjay Krishna
Main category: cs.CV
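TL;DR: Molmo2 is an open family of video-language models that, through new datasets and training methods, achieves state-of-the-art performance among open-weight, open-data models, and clearly outperforms existing models on tasks such as video pointing, counting, and tracking.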
Details
Motivation: Today's strongest video-language models are mostly proprietary, leaving the open-source community without a foundation to improve on; in addition, many downstream applications need pixel-level grounding, which existing models (including proprietary ones) lack. Method: Introduces 7 new video datasets and 2 multi-image datasets, all collected without closed VLMs; trains with an efficient packing and message-tree encoding scheme, and adds bi-directional attention on vision tokens and a novel token-weight strategy. Result: Outperforms other open models on short-video understanding, counting, and captioning, and is competitive on long-video tasks; beats Qwen3-VL on video counting accuracy (35.5 vs 29.6), and surpasses Gemini 3 Pro on video pointing F1 (38.4 vs 20.0) and on video tracking J&F (56.2 vs 41.1). Conclusion: Molmo2 gives the open-source community a reproducible, high-performing foundation for video-language modeling, delivering fine-grained video understanding and pixel-level grounding without relying on data from proprietary models. Abstract: Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding -- either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show bi-directional attention on vision tokens and a novel token-weight strategy improves performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long-videos. 
On video-grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
[141] CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos
Chengfeng Zhao,Jiazhi Shu,Yubo Zhao,Tianyu Huang,Jiahao Lu,Zekai Gu,Chengwei Ren,Zhiyang Dou,Qing Shuai,Yuan Liu
Main category: cs.CV
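TL;DR: Presents CoMoVi, a diffusion framework that couples 2D video generation with 3D human motion generation, producing both synchronously through a shared representation and cross attention, and introduces the large-scale annotated CoMoVi Dataset.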
Details
Motivation: 3D human motion generation and 2D video generation are intrinsically linked: 3D motion provides a structural prior that keeps videos plausible, while pretrained video models help motion generation generalize, so the two should be modeled jointly. Method: Proposes an effective 2D human motion representation that inherits the strong prior of pretrained video diffusion models (VDMs); designs a dual-branch diffusion model that couples motion and video generation through mutual feature interaction and 3D-2D cross attention; and curates the CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations. Result: Experiments show the method is effective on both 3D human motion generation and 2D video generation, producing high-quality, consistent results synchronously. Conclusion: CoMoVi achieves co-generation of 3D motion and 2D video, confirms the benefit of modeling the two jointly, and offers a new direction for multimodal human content generation. Abstract: In this paper, we find that the generation of 3D human motions and 2D human videos is intrinsically coupled. 3D motions provide the structural prior for plausibility and consistency in videos, while pre-trained video models offer strong generalization capabilities for motions, which necessitate coupling their generation processes. Based on this, we present CoMoVi, a co-generative framework that couples two video diffusion models (VDMs) to generate 3D human motions and videos synchronously within a single diffusion denoising loop. To achieve this, we first propose an effective 2D human motion representation that can inherit the powerful prior of pre-trained VDMs. Then, we design a dual-branch diffusion model to couple human motion and video generation process with mutual feature interaction and 3D-2D cross attentions. Moreover, we curate CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations, covering diverse and challenging human motions. Extensive experiments demonstrate the effectiveness of our method in both 3D human motion and video generation tasks.
[142] CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
Darshan Singh,Arsha Nagrani,Kawshik Manikantan,Harman Singh,Dinesh Tewari,Tobias Weyand,Cordelia Schmid,Anelia Angelova,Shachi Dave
Main category: cs.CV
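TL;DR: Introduces CURVE, a new benchmark for multicultural, multilingual video understanding with human-generated annotations of cultural videos from 18 locales. It emphasizes deep understanding of visual cultural context and exposes the weaknesses of current video LLMs in cross-cultural reasoning.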
Details
Motivation: Existing video understanding benchmarks are built mainly on Western data and English content, which introduces significant cultural and linguistic bias and limits how well models can be evaluated and applied in globally diverse settings. Method: Builds CURVE, a multicultural, multilingual video reasoning benchmark with high-quality human annotations from 18 locales, covering questions, answers, and multi-step reasoning written in native languages; the reasoning traces are used to construct evidence graphs, and an iterative strategy over these graphs identifies fine-grained reasoning errors. Result: State-of-the-art video LLMs perform far below human level on CURVE, with errors stemming mainly from poor understanding of culture-related visual elements. Conclusion: CURVE provides a more challenging standard for evaluating video models' reasoning in real-world multicultural settings and encourages research on understanding cultural context. Abstract: Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce CURVE (Cultural Understanding and Reasoning in Video Evaluation), a challenging benchmark for multicultural and multilingual video reasoning. CURVE comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, CURVE provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on CURVE requires a deeply situated understanding of visual cultural context. Furthermore, we leverage CURVE's reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. CURVE will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva-cultural
[143] A continental-scale dataset of ground beetles with high-resolution images and validated morphological trait measurements
S M Rayeed,Mridul Khurana,Alyson East,Isadora E. Fluck,Elizabeth G. Campolongo,Samuel Stevens,Iuliia Zarubiieva,Scott C. Lowe,Michael W. Denslow,Evan D. Donoso,Jiaman Wu,Michelle Ramirez,Benjamin Baiser,Charles V. Stewart,Paula Mabee,Tanya Berger-Wolf,Anuj Karpatne,Hilmar Lapp,Robert P. Guralnick,Graham W. Taylor,Sydne Record
Main category: cs.CV
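TL;DR: The study digitizes more than 13,200 NEON ground beetles from 30 sites across the United States with high-resolution imaging, building a multimodal dataset that addresses the under-representation of invertebrates in trait databases.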
Details
Motivation: Global trait databases severely under-cover invertebrates such as ground beetles, which limits comprehensive ecological analysis, even though ground beetles are important bioindicators of ecosystem health and need broader data support. Method: Digitizes ground beetle specimens collected by NEON with high-resolution imaging, obtains morphological traits such as elytra length and width for each specimen through digital measurement, and validates their agreement with manual measurements. Result: Digital trait extraction reaches sub-millimeter precision, yielding a large-scale ground beetle morphology dataset that supports AI-based analysis. Conclusion: The dataset fills a gap for invertebrates in trait databases and provides a reliable basis for AI-driven species identification and trait research, advancing biodiversity monitoring and conservation. Abstract: Despite the ecological significance of invertebrates, global trait databases remain heavily biased toward vertebrates and plants, limiting comprehensive ecological analyses of high-diversity groups like ground beetles. Ground beetles (Coleoptera: Carabidae) serve as critical bioindicators of ecosystem health, providing valuable insights into biodiversity shifts driven by environmental changes. While the National Ecological Observatory Network (NEON) maintains an extensive collection of carabid specimens from across the United States, these primarily exist as physical collections, restricting widespread research access and large-scale analysis. To address these gaps, we present a multimodal dataset digitizing over 13,200 NEON carabids from 30 sites spanning the continental US and Hawaii through high-resolution imaging, enabling broader access and computational analysis. The dataset includes digitally measured elytra length and width of each specimen, establishing a foundation for automated trait extraction using AI. Validated against manual measurements, our digital trait extraction achieves sub-millimeter precision, ensuring reliability for ecological and computational studies. By addressing invertebrate under-representation in trait databases, this work supports AI-driven tools for automated species identification and trait-based research, fostering advancements in biodiversity monitoring and conservation.
[144] See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection
Amir Mallak,Erfan Aasi,Shiva Sreeram,Tsun-Hsuan Wang,Daniela Rus,Alaa Maalouf
Main category: cs.CV
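TL;DR: Proposes Stochastic-Patch-Selection (SPS), which randomly masks image patch features to make end-to-end autonomous driving policies more robust and generalizable in out-of-distribution (OOD) scenarios, while also speeding up inference and transferring to the real world.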
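The masking step can be pictured as follows: for each frame, drop a random fraction of patch descriptors and pass on only the survivors, each still tagged with its grid position so the spatial layout is preserved. This is a minimal sketch under assumptions: the function name, the dictionary-keyed-by-position representation, and the 4x4 toy grid are illustrative choices, not the authors' implementation or how their policy consumes the patches.

```python
import random
from typing import Dict, List, Tuple

Descriptor = List[float]


def stochastic_patch_selection(
    patch_grid: List[List[Descriptor]],
    mask_rate: float,
    rng: random.Random,
) -> Dict[Tuple[int, int], Descriptor]:
    """Randomly drop a fraction of patch descriptors, keeping grid positions."""
    if not 0.0 <= mask_rate < 1.0:
        raise ValueError("mask_rate must be in [0, 1)")
    positions = [(r, c) for r, row in enumerate(patch_grid) for c in range(len(row))]
    num_masked = int(round(mask_rate * len(positions)))
    masked = set(rng.sample(positions, num_masked))
    # Surviving patches keep their original (row, col) key, so the spatial
    # layout of what remains is unchanged.
    return {pos: patch_grid[pos[0]][pos[1]] for pos in positions if pos not in masked}


# Toy frame (assumption): a 4x4 grid of 2-dimensional descriptors.
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
kept = stochastic_patch_selection(grid, mask_rate=0.5, rng=random.Random(0))
```

Calling this with a fresh random draw per frame gives the policy a different subset of the same scene each time, which is what pushes it toward features that do not depend on any particular token surviving.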
Details
Motivation: Autonomous driving policies trained on patch-aligned features extracted from foundation models perform better OOD, but self-attention makes these features highly redundant, so policies easily overfit spurious correlations and generalize worse. Method: SPS randomly masks a fraction of the patch descriptors in each frame while preserving the spatial layout of the remaining patches, so the policy sees different yet complete views of the scene and learns stable features that are invariant to which specific tokens survive. Result: SPS outperforms the prior SOTA in all OOD scenarios, with a 6.2% average improvement and up to 20.4%, while running 2.4× faster; 8 of the 9 systems trained in the ablations surpass prior SOTA, and the learned policy transfers to a real vehicle without tuning. Conclusion: By reducing redundancy in patch features, SPS improves the robustness, generalization, and efficiency of autonomous driving policies and shows potential for real-world deployment. Abstract: Recent advances in end-to-end autonomous driving show that policies trained on patch-aligned features extracted from foundation models generalize better to Out-of-Distribution (OOD). We hypothesize that due to the self-attention mechanism, each patch feature implicitly embeds/contains information from all other patches, represented in a different way and intensity, making these descriptors highly redundant. We quantify redundancy in such (BLIP2) features via PCA and cross-patch similarity: $90$% of variance is captured by $17/64$ principal components, and strong inter-token correlations are pervasive. Training on such overlapping information leads the policy to overfit spurious correlations, hurting OOD robustness. We present Stochastic-Patch-Selection (SPS), a simple yet effective approach for learning policies that are more robust, generalizable, and efficient. For every frame, SPS randomly masks a fraction of patch descriptors, not feeding them to the policy model, while preserving the spatial layout of the remaining patches. Thus, the policy is provided with different stochastic but complete views of the (same) scene: every random subset of patches acts like a different, yet still sensible, coherent projection of the world. The policy thus bases its decisions on features that are invariant to which specific tokens survive. Extensive experiments confirm that across all OOD scenarios, our method outperforms the state of the art (SOTA), achieving a $6.2$% average improvement and up to $20.4$% in closed-loop simulations, while being $2.4\times$ faster. 
We conduct ablations over masking rates and patch-feature reorganization, training and evaluating 9 systems, with 8 of them surpassing prior SOTA. Finally, we show that the same learned policy transfers to a physical, real-world car without any tuning.
[145] From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion
Cheng Chen,Yuyu Guo,Pengpeng Zeng,Jingkuan Song,Peng Di,Hang Yu,Lianli Gao
Main category: cs.CV
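TL;DR: Proposes the Cross-Layer Injection (CLI) framework, which strengthens the cross-modal understanding of vision-language models through dynamic many-to-many connections.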
Details
Motivation: Existing vision-language models suffer a visual feature bottleneck because of their static, asymmetric connection, which limits how fully the language model can align with multi-level visual information. Method: Designs the CLI framework, comprising an Adaptive Multi-Projection (AMP) module and an Adaptive Gating Fusion (AGF) mechanism, to dynamically inject features from multiple vision encoder layers into the large language model. Result: Integrating CLI into LLaVA-OneVision and LLaVA-1.5 yields significant performance gains across 18 benchmarks. Conclusion: CLI is a scalable, lightweight paradigm that enables deeper multimodal understanding by giving the LLM on-demand access to the full visual hierarchy. Abstract: Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.
[146] Alterbute: Editing Intrinsic Attributes of Objects in Images
Tal Reiss,Daniel Winter,Matan Cohen,Alex Rav-Acha,Yael Pritch,Ariel Shamir,Yedid Hoshen
Main category: cs.CV
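TL;DR: Presents Alterbute, a diffusion-based method for editing the intrinsic attributes of objects in images that can change color, texture, material, and even shape while preserving object identity and scene context.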
Details
Motivation: Existing methods fall short on identity preservation: they either rely on unsupervised priors that fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variation. A method that edits flexibly while preserving identity is therefore needed. Method: Proposes a relaxed training objective that conditions on an identity reference image, a text prompt, a background image, and an object mask to control changes in intrinsic and extrinsic attributes; at inference, extrinsic changes are restricted by reusing the original background and mask. Visual Named Entities (VNEs) are introduced as fine-grained identity categories, and a vision-language model automatically extracts VNE labels and attribute descriptions from large-scale data for scalable supervision. Result: Experiments show that Alterbute outperforms existing methods for editing intrinsic object attributes while preserving object identity. Conclusion: Through its relaxed training objective and Visual Named Entities, Alterbute achieves better identity preservation with controllable intrinsic attribute editing, offering a more flexible and effective approach to image editing. Abstract: We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs) - fine-grained visual identity categories (e.g., "Porsche 911 Carrera") that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.
[147] WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments
Xuweiyi Chen,Wentao Zhou,Zezhou Cheng
Main category: cs.CV
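TL;DR: WildRayZer is a self-supervised framework for novel view synthesis in dynamic environments, where both the camera and objects move. It separates static from dynamic content with an analysis-by-synthesis test, improving the quality of novel view synthesis.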