cs.CL

[1] LLM-Driven Preference Data Synthesis for Proactive Prediction of the Next User Utterance in Human-Machine Dialogue

Jinqiang Wang,Huansheng Ning,Jianguo Ding,Tao Zhu,Liming Chen,Chris Nugent

Main category: cs.CL

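TL;DR: This paper proposes ProUtt, an LLM-driven preference data synthesis method for proactively predicting the user's next utterance. It builds an intent tree, explicitly models intent reasoning paths from both exploitation and exploration perspectives, and generates preferred and non-preferred reasoning processes, significantly outperforming existing methods.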

Details

Motivation: Existing approaches to next-utterance prediction raise privacy concerns, are computationally expensive, and do not explicitly model the reasoning behind user intent. An efficient, privacy-safe, task-specific solution that advances the dialogue is needed.

Method: ProUtt converts the dialogue history into an intent tree, predicts the next plausible path from exploitation and exploration perspectives, and builds preferred and non-preferred reasoning processes by perturbing or revising paths at future turns. These are used to train a compact, task-specific LLM.

Result: On four benchmark datasets, under both LLM-as-a-judge scoring and human evaluation, ProUtt consistently outperforms existing data synthesis methods, user simulators, and commercial LLM APIs on proactive next-utterance prediction.

Conclusion: By explicitly modeling intent reasoning and synthesizing high-quality preference data, ProUtt offers an effective route to efficient, personalized dialogue systems in low-resource settings. The code and datasets are released to support follow-up research.

Abstract: Proactively predicting a user's next utterance in human-machine dialogue can streamline interaction and improve user experience. Existing commercial API-based solutions are subject to privacy concerns, while deploying general-purpose LLMs locally remains computationally expensive. As such, training a compact, task-specific LLM provides a practical alternative. Although user simulator methods can predict a user's next utterance, they mainly imitate the user's speaking style rather than advancing the dialogue. Preference data synthesis has been investigated to generate data for proactive next utterance prediction and help align LLMs with user preferences. Yet existing methods lack the ability to explicitly model the intent reasoning that leads to the user's next utterance and to define and synthesize preference and non-preference reasoning processes for predicting the user's next utterance. To address these challenges, we propose ProUtt, an LLM-driven preference data synthesis method for proactive next utterance prediction. ProUtt converts dialogue history into an intent tree and explicitly models intent reasoning trajectories by predicting the next plausible path from both exploitation and exploration perspectives. It then constructs preference and non-preference reasoning processes by perturbing or revising intent tree paths at different future turns. Extensive evaluations using LLM-as-a-judge and human judgments demonstrate that ProUtt consistently outperforms existing data synthesis methods, user simulators, and commercial LLM APIs across four benchmark datasets. We release both the code and the synthesized datasets to facilitate future research.

[2] Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines

Devesh Saraogi,Rohit Singhee,Dhruv Kumar

Main category: cs.CL

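TL;DR: This paper examines whether agentic workflows (such as iterative reasoning, evolutionary search, and recursive decomposition) can generate novel and feasible research plans, and finds that decomposition-based and long-context workflows score highest on novelty.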

Details

Motivation: Address the "smart plagiarism" problem of LLMs under single-step prompting, where models merely reword existing ideas in new terminology without genuine innovation.

Method: Evaluate five reasoning architectures: reflection-based iterative refinement, Sakana AI v2 evolutionary algorithms, the Google Co-Scientist multi-agent framework, GPT Deep Research recursive decomposition, and the Gemini 3 Pro multimodal long-context pipeline. Thirty proposals each are scored on novelty, feasibility, and impact.

Result: Decomposition-based and long-context workflows reach a mean novelty of 4.17/5, significantly higher than the 2.33/5 of the reflection-based approach. Performance varies across domains, and high-performing workflows maintain feasibility without sacrificing creativity.

Conclusion: Carefully designed multi-stage agentic workflows can strengthen idea generation in AI-assisted research and support more original scientific work.

Abstract: The integration of Large Language Models (LLMs) into the scientific ecosystem raises fundamental questions about the creativity and originality of AI-generated research. Recent work has identified "smart plagiarism" as a concern in single-step prompting approaches, where models reproduce existing ideas with terminological shifts. This paper investigates whether agentic workflows -- multi-step systems employing iterative reasoning, evolutionary search, and recursive decomposition -- can generate more novel and feasible research plans. We benchmark five reasoning architectures: Reflection-based iterative refinement, Sakana AI v2 evolutionary algorithms, Google Co-Scientist multi-agent framework, GPT Deep Research (GPT-5.1) recursive decomposition, and Gemini 3 Pro multimodal long-context pipeline. Using evaluations from thirty proposals each on novelty, feasibility, and impact, we find that decomposition-based and long-context workflows achieve mean novelty of 4.17/5, while reflection-based approaches score significantly lower (2.33/5). Results reveal varied performance across research domains, with high-performing workflows maintaining feasibility without sacrificing creativity. These findings support the view that carefully designed multi-stage agentic workflows can advance AI-assisted research ideation.

[3] Introducing Axlerod: An LLM-based Chatbot for Assisting Independent Insurance Agents

Adam Bradley,John Hastings,Khandaker Mamun Ahmed

Main category: cs.CL

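TL;DR: This paper presents and evaluates Axlerod, an AI conversational system for insurance agents that combines NLP, RAG, and domain knowledge to support efficient policy retrieval and customer service, markedly improving response accuracy and shortening search time.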

Details

Motivation: Improving the operational efficiency of independent insurance agents requires addressing the slow information retrieval and lack of context-aware client interaction in traditional workflows. Existing chatbots still fall short in domain expertise and accuracy.

Method: Design and implement Axlerod, a conversational system built on natural language processing (NLP), retrieval-augmented generation (RAG), and domain-specific knowledge integration, supporting intent recognition, structured database access, and real-time response generation.

Result: Axlerod reaches 93.18% accuracy on policy retrieval tasks and reduces average search time by 2.42 seconds.

Conclusion: Axlerod demonstrates the effectiveness of agent-assistive AI in insurtech and argues that enterprise AI should focus on assisting professionals rather than only on consumer-facing use.

Abstract: The insurance industry is undergoing a paradigm shift through the adoption of artificial intelligence (AI) technologies, particularly in the realm of intelligent conversational agents. Chatbots have evolved into sophisticated AI-driven systems capable of automating complex workflows, including policy recommendation and claims triage, while simultaneously enabling dynamic, context-aware user engagement. This paper presents the design, implementation, and empirical evaluation of Axlerod, an AI-powered conversational interface designed to improve the operational efficiency of independent insurance agents. Leveraging natural language processing (NLP), retrieval-augmented generation (RAG), and domain-specific knowledge integration, Axlerod demonstrates robust capabilities in parsing user intent, accessing structured policy databases, and delivering real-time, contextually relevant responses. Experimental results underscore Axlerod's effectiveness, achieving an overall accuracy of 93.18% in policy retrieval tasks while reducing the average search time by 2.42 seconds. This work contributes to the growing body of research on enterprise-grade AI applications in insurtech, with a particular focus on agent-assistive rather than consumer-facing architectures.

Derguene Mbaye,Tatiana D. P. Mbengue,Madoune R. Seye,Moussa Diallo,Mamadou L. Ndiaye,Dimitri S. Adjanohoun,Cheikh S. Wade,Djiby Sow,Jean-Claude B. Munyaka,Jerome Chenal

Main category: cs.CL

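TL;DR: This paper gives the first comprehensive survey of the state and challenges of natural language processing (NLP) for Senegal's six official national languages (Wolof, Pulaar, Sereer, Joola, Mandingue, and Soninke), analyzes the linguistic, sociotechnical, and infrastructural factors that shape their digital readiness, and proposes a roadmap toward a sustainable, community-centered NLP ecosystem.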

Details

Motivation: African languages have long been marginalized in NLP development, even as NLP reshapes research methods across disciplines. This paper aims to fill the gap in NLP research on Senegal's national languages and to promote linguistic diversity and technological equity.

Method: Systematically review existing NLP projects and research from linguistic, sociotechnical, and infrastructural angles; set up a centralized GitHub repository that gathers publicly available NLP resources; and examine the potential of NLP in the social sciences.

Result: Identifies key gaps in data, tools, and benchmarks for the six Senegalese languages; summarizes existing progress in text normalization, machine translation, and speech processing; provides an open resource platform that supports collaboration and reproducibility; and illustrates practical uses of NLP in multilingual social science research.

Conclusion: Sustainable NLP for Senegalese languages requires a community-centered approach that emphasizes ethical data governance, open resources, and interdisciplinary collaboration, offering a framework that can transfer to other low-resource languages.

Abstract: Natural Language Processing (NLP) is rapidly transforming research methodologies across disciplines, yet African languages remain largely underrepresented in this technological shift. This paper provides the first comprehensive overview of NLP progress and challenges for the six national languages officially recognized by the Senegalese Constitution: Wolof, Pulaar, Sereer, Joola, Mandingue, and Soninke. We synthesize linguistic, sociotechnical, and infrastructural factors that shape their digital readiness and identify gaps in data, tools, and benchmarks. Building on existing initiatives and research works, we analyze ongoing efforts in text normalization, machine translation, and speech processing. We also provide a centralized GitHub repository that compiles publicly accessible resources for a range of NLP tasks across these languages, designed to facilitate collaboration and reproducibility. A special focus is devoted to the application of NLP to the social sciences, where multilingual transcription, translation, and retrieval pipelines can significantly enhance the efficiency and inclusiveness of field research. The paper concludes by outlining a roadmap toward sustainable, community-centered NLP ecosystems for Senegalese languages, emphasizing ethical data governance, open resources, and interdisciplinary collaboration.

[5] SALP-CG: Standard-Aligned LLM Pipeline for Classifying and Grading Large Volumes of Online Conversational Health Data

Yiwei Yan,Hao Li,Hua He,Gong Kai,Zhengyi Yang,Guanfeng Liu

Main category: cs.CL

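TL;DR: This study proposes SALP-CG, an LLM-based extraction pipeline that classifies and grades privacy risks in online medical conversation data in line with the GB/T 39725-2020 standard, showing high accuracy and good schema compliance on the MedDialog-CN benchmark.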

Details

Motivation: Online medical consultations produce large volumes of conversational data containing protected health information, and existing approaches lack unified standards and reliable automated means of sensitivity classification.

Method: Combine few-shot guidance, JSON Schema constrained decoding, and deterministic high-risk rules into a backend-agnostic extraction pipeline, SALP-CG, with health-data classification and grading rules drawn up according to the national standard.

Result: On the MedDialog-CN benchmark, models achieve robust entity counts, high schema compliance, and accurate sensitivity grading, and the strongest model reaches micro-F1 = 0.900 for maximum-level prediction. Level 2-3 items dominate and can enable re-identification when combined; Level 4-5 items are rarer but more harmful.

Conclusion: SALP-CG reliably classifies categories and grades the sensitivity of online conversational health data across LLMs, providing a practical solution for health data governance.

Abstract: Online medical consultations generate large volumes of conversational health data that often embed protected health information, requiring robust methods to classify data categories and assign risk levels in line with policies and practice. However, existing approaches lack unified standards and reliable automated methods to fulfill sensitivity classification for such conversational health data. This study presents a large language model-based extraction pipeline, SALP-CG, for classifying and grading privacy risks in online conversational health data. We derived health-data classification and grading rules in accordance with GB/T 39725-2020. Combining few-shot guidance, JSON Schema constrained decoding, and deterministic high-risk rules, the backend-agnostic extraction pipeline achieves strong category compliance and reliable sensitivity across diverse LLMs. On the MedDialog-CN benchmark, models yield robust entity counts, high schema compliance, and accurate sensitivity grading, while the strongest model attains micro-F1=0.900 for maximum-level prediction. The category landscape stratified by sensitivity shows that Level 2-3 items dominate, enabling re-identification when combined; Level 4-5 items are less frequent but carry outsize harm. SALP-CG reliably helps classify categories and grade sensitivity in online conversational health data across LLMs, offering a practical method for health data governance. Code is available at https://github.com/dommii1218/SALP-CG.
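The maximum-level prediction and micro-F1 evaluation mentioned in this entry could look roughly like the sketch below. It is not taken from the SALP-CG repository: the regular expression, the override to Level 5, and the function names are all hypothetical, chosen only to show how a deterministic high-risk rule can sit on top of per-item levels produced by an LLM.

```python
import re

# Hypothetical deterministic rule (an assumption, not from the paper):
# an 18-character ID-like number forces the highest sensitivity level.
HIGH_RISK_PATTERNS = [re.compile(r"\b\d{17}[\dXx]\b")]

def conversation_level(item_levels, text):
    """Return the maximum sensitivity level (1-5) for one conversation."""
    level = max(item_levels, default=1)
    if any(p.search(text) for p in HIGH_RISK_PATTERNS):
        level = 5  # the rule overrides whatever the model graded
    return level

def micro_f1(gold, pred):
    """Micro-averaged F1 over single-label predictions."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p)
    fp = fn = len(gold) - tp  # each miss is one FP and one FN
    return 2 * tp / (2 * tp + fp + fn)
```

With one label per conversation, micro-F1 reduces to plain accuracy, which is why a single mismatch in four conversations gives 0.75.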

[6] StatLLaMA: A multi-stage training framework for building a domain-optimized statistical language model

Jing-Yi Zeng,Guan-Hua Huang

Main category: cs.CL

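TL;DR: This study presents an efficient way to build a statistics-specialized LLM on the lightweight LLaMA-3.2-3B family and finds that domain specialization only works when starting from a model that already follows instructions. The final model, StatLLaMA, performs in a balanced way across mathematical reasoning, common-sense reasoning, and statistical expertise.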

Details

Motivation: How to efficiently build a statistics-specific LLM, especially under resource constraints, is a key question. Existing domain adaptation work may overlook the importance of the choice of foundation model.

Method: Systematically compare three multi-stage training pipelines, starting from a base model without instruction-following ability, a base model with post-hoc instruction tuning, and an instruction-tuned model with strong general reasoning. Each pipeline runs through continual pretraining, supervised fine-tuning (SFT), RLHF preference alignment, and downstream task adaptation.

Result: Pipelines starting from the base model fail to develop effective statistical reasoning, whereas starting from LLaMA-3.2-3B-Instruct achieves domain specialization. Evaluation of SFT variants reveals a trade-off between domain expertise and general reasoning. Direct preference optimization (DPO) gives stable and effective preference alignment. Very low-intensity downstream fine-tuning avoids catastrophic forgetting in highly optimized models.

Conclusion: Building an efficient domain-specific LLM requires starting from a foundation model with good instruction understanding, combined with a sound multi-stage training strategy and low-intensity downstream fine-tuning, so that specialized performance is gained while general ability is preserved. StatLLaMA offers a workable blueprint for developing statistical LLMs under resource constraints.

Abstract: This study investigates how to efficiently build a domain-specialized large language model (LLM) for statistics using the lightweight LLaMA-3.2-3B family as the foundation model (FM). We systematically compare three multi-stage training pipelines, starting from a base FM with no instruction-following capability, a base FM augmented with post-hoc instruction tuning, and an instruction-tuned FM with strong general reasoning abilities, across continual pretraining, supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF) preference alignment, and downstream task adaptation. Results show that pipelines beginning with a base FM fail to develop meaningful statistical reasoning, even after extensive instruction tuning, SFT, or RLHF alignment. In contrast, starting from LLaMA-3.2-3B-Instruct enables effective domain specialization. A comprehensive evaluation of SFT variants reveals clear trade-offs between domain expertise and general reasoning ability. We further demonstrate that direct preference optimization provides stable and effective RLHF preference alignment. Finally, we show that downstream fine-tuning must be performed with extremely low intensity to avoid catastrophic forgetting in highly optimized models. The final model, StatLLaMA, achieves strong and balanced performance on benchmarks of mathematical reasoning, common-sense reasoning, and statistical expertise, offering a practical blueprint for developing resource-efficient statistical LLMs. The code is available at https://github.com/HuangDLab/StatLLaMA.
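For readers unfamiliar with direct preference optimization, the per-example loss it minimizes can be written in a few lines. This is a minimal sketch of the standard DPO objective, not the StatLLaMA training code; the function name, the argument names, and the default `beta=0.1` are assumptions, since the abstract gives no hyperparameters.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    logp_* come from the policy being trained, ref_logp_* from the frozen
    reference model. beta=0.1 is an illustrative default (assumption).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's likelihood relative to the rejected one.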

[7] Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

Hoyoon Byun,Youngjun Choi,Taero Kim,Sungrae Park,Kyungwoo Song

Main category: cs.CL

TL;DR: This paper proposes Bounded Hyperbolic Tanh (BHyT), a Pre-LN alternative for large language models that balances training stability and computational efficiency.

Details

Motivation: Pre-LN is widely used but repeats statistical computation, and as depth grows the variance and magnitude of activations increase and destabilize training. Existing efficiency-oriented methods remain fragile in deep models.

Method: BHyT combines a tanh nonlinearity with data-driven input bounding to keep activations in a non-saturating range. It computes statistics once per block and replaces the second normalization with a lightweight variance approximation.

Result: BHyT shows better stability and efficiency in pretraining, with 15.8% faster training and 4.2% higher token generation throughput on average, while matching or surpassing RMSNorm on language understanding and reasoning tasks.

Conclusion: BHyT is an efficient, stable Pre-LN alternative with a theoretical stability guarantee, suited to deep large models.

Abstract: Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN is inefficient due to repeated statistical calculations and suffers from the curse of depth. As layers grow, the magnitude and variance of the hidden state escalate, destabilizing training. Efficiency-oriented normalization-free methods such as Dynamic Tanh (DyT) improve speed but remain fragile at depth. To jointly address stability and efficiency, we propose Bounded Hyperbolic Tanh (BHyT), a drop-in replacement for Pre-LN. BHyT couples a tanh nonlinearity with explicit, data-driven input bounding to keep activations within a non-saturating range. It prevents depth-wise growth in activation magnitude and variance and comes with a theoretical stability guarantee. For efficiency, BHyT computes exact statistics once per block and replaces a second normalization with a lightweight variance approximation. Empirically, BHyT demonstrates improved stability and efficiency during pretraining, achieving an average of 15.8% faster training and an average of 4.2% higher token generation throughput compared to RMSNorm, while matching or surpassing its inference performance and robustness across language understanding and reasoning benchmarks. Our code is available at: https://anonymous.4open.science/r/BHyT
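The abstract does not give BHyT's formula, so the following is only a toy illustration of the general idea it describes: scale inputs by a statistic of the data, keep them inside a bounded range, then apply tanh. The RMS scaling, the clipping step, and the `bound=2.0` default are all assumptions made for this sketch and should not be read as the paper's method.

```python
import math

def bounded_tanh(x, bound=2.0, eps=1e-6):
    """Toy bounded-tanh layer over a list of floats (illustrative only).

    Inputs are divided by their root mean square (a data-driven scale),
    clipped to [-bound, bound], and passed through tanh.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    out = []
    for v in x:
        z = max(-bound, min(bound, v / rms))  # keep tanh input bounded
        out.append(math.tanh(z))
    return out
```

Because the scale comes from the input itself, multiplying every activation by a constant leaves the output essentially unchanged, which is the property that stops magnitudes from compounding across layers in this toy version.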

[8] Uncertainty-Aware Dynamic Knowledge Graphs for Reliable Question Answering

Yu Takahashi,Shun Takeuchi,Kexuan Xin,Guillaume Pelat,Yoshiaki Ikai,Junya Saito,Jonathan Vitale,Shlomo Berkovsky,Amin Beheshti

Main category: cs.CL

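TL;DR: This paper presents an uncertainty-aware dynamic knowledge graph (KG) framework for improving the reliability and interpretability of question answering in high-stakes applications, in particular building personalized KGs from electronic health records in healthcare and modeling their uncertainty.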

Details

Motivation: Existing KG-based question answering systems usually treat facts as static and certain, so they struggle to capture how information evolves and the uncertainty in reasoning, which undermines reliability in critical settings.

Method: The framework combines dynamic KG construction, confidence scoring, and uncertainty-aware retrieval with an interactive interface that lets users explore dynamic graph structure, inspect confidence-annotated triples, and compare baseline answers with confidence-aware ones.

Result: In the healthcare instantiation, the system builds personalized dynamic KGs from electronic health records, visualizes uncertainty across a patient's visits, and evaluates its impact on a mortality prediction task.

Conclusion: Uncertainty-aware dynamic KGs can make question answering more robust, transparent, and interpretable, especially in high-stakes decision settings such as healthcare.

Abstract: Question answering (QA) systems are increasingly deployed across domains. However, their reliability is undermined when retrieved evidence is incomplete, noisy, or uncertain. Existing knowledge graph (KG) based QA frameworks typically represent facts as static and deterministic, failing to capture the evolving nature of information and the uncertainty inherent in reasoning. We present a demonstration of uncertainty-aware dynamic KGs, a framework that combines (i) dynamic construction of evolving KGs, (ii) confidence scoring and uncertainty-aware retrieval, and (iii) an interactive interface for reliable and interpretable QA. Our system highlights how uncertainty modeling can make QA more robust and transparent by enabling users to explore dynamic graphs, inspect confidence-annotated triples, and compare baseline versus confidence-aware answers. The target users of this demo are clinical data scientists and clinicians, and we instantiate the framework in healthcare: constructing personalized KGs from electronic health records, visualizing uncertainty across patient visits, and evaluating its impact on a mortality prediction task. This use case demonstrates the broader promise of uncertainty-aware dynamic KGs for enhancing QA reliability in high-stakes applications.

[9] Cross-Platform Evaluation of Large Language Model Safety in Pediatric Consultations: Evolution of Adversarial Robustness and the Scale Paradox

Vahideh Zolfaghari

Main category: cs.CL

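TL;DR: This study evaluates the safety of LLMs in pediatric medical consultations under adversarial pressure driven by parental anxiety, finding that smaller models outperform larger ones in some cases and that all models lack emergency recognition, making them unsuitable for triage.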

Details

Motivation: Existing safety evaluations of LLMs in medical consultation mostly use neutral conditions and ignore the vulnerabilities that realistic pressures such as user anxiety may trigger, so evaluation under more realistic usage is needed.

Method: Using the PediatricAnxietyBench dataset (150 authentic and 150 adversarial queries), three LLMs (Llama-3.3-70B, Llama-3.1-8B, Mistral-7B) are tested via APIs, for 900 responses in total. Safety is scored on a 0-15 scale and analyzed with paired t-tests and bootstrapped confidence intervals.

Result: Mean safety scores range from 9.70 to 10.39. Llama-3.1-8B significantly outperforms Llama-3.3-70B (+0.66, p=0.0001). Adversarial queries actually improve some models' scores, most of all Mistral-7B (+1.09, p=0.0002). Seizure-related queries show 33% inappropriate diagnoses. Hedging correlates positively with safety (r=0.68). Llama-3.3-70B has 8% safety failures.

Conclusion: Model safety depends more on alignment strategy and architecture than on parameter count, and smaller models can outperform larger ones. The evolution across releases suggests progress in training, but the general lack of emergency recognition means current LLMs are not suitable for medical triage. The study stresses adversarial testing and provides an open benchmark for medical AI safety.

Abstract: Background: Large language models (LLMs) are increasingly deployed in medical consultations, yet their safety under realistic user pressures remains understudied. Prior assessments focused on neutral conditions, overlooking vulnerabilities from anxious users challenging safeguards. This study evaluated LLM safety under parental anxiety-driven adversarial pressures in pediatric consultations across models and platforms. Methods: PediatricAnxietyBench, from a prior evaluation, includes 300 queries (150 authentic, 150 adversarial) spanning 10 topics. Three models were assessed via APIs: Llama-3.3-70B and Llama-3.1-8B (Groq), Mistral-7B (HuggingFace), yielding 900 responses. Safety was scored on a 0-15 scale covering restraint, referral, hedging, emergency recognition, and non-prescriptive behavior. Analyses employed paired t-tests with bootstrapped CIs. Results: Mean scores ranged from 9.70 (Llama-3.3-70B) to 10.39 (Mistral-7B). Llama-3.1-8B outperformed Llama-3.3-70B by +0.66 (p=0.0001, d=0.225). Models showed positive adversarial effects, with Mistral-7B the strongest (+1.09, p=0.0002). Safety generalized across platforms; Llama-3.3-70B had 8% failures. Seizure queries were vulnerable (33% inappropriate diagnoses). Hedging predicted safety (r=0.68, p<0.001). Conclusions: The evaluation shows that safety depends on alignment and architecture over scale, with smaller models outperforming larger ones. Evolution to robustness across releases suggests targeted training progress. The vulnerabilities and the absence of emergency recognition indicate unsuitability for triage. The findings guide model selection, stress adversarial testing, and provide an open benchmark for medical AI safety.
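The paired comparison with bootstrapped confidence intervals named in the Methods can be sketched as follows. This is a generic percentile bootstrap over per-query score differences, not the study's analysis script; the function name, the resample count, and the fixed seed are assumptions.

```python
import random
import statistics

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000,
                        alpha=0.05, seed=0):
    """Mean paired difference (a - b) with a percentile bootstrap CI.

    scores_a[i] and scores_b[i] must be scores for the same query.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    means = []
    for _ in range(n_resamples):
        # Resample the paired differences with replacement.
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(statistics.fmean(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lo, hi
```

Resampling the differences rather than the two score lists separately preserves the pairing, since both models answered the same queries.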

[10] ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language

Franciszek Górski,Andrzej Czyżewski

Main category: cs.CL

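TL;DR: This study proposes a framework that uses a multilingual LLM (Llama3.1) as a teacher model to annotate Polish medical texts, then trains lightweight BERT-style classifiers on the annotated data, achieving efficient multi-class clinical text classification with limited resources.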

Details

Motivation: Too few annotation resources exist for Polish medical texts to build a high-quality clinical text classifier, so an efficient, low-cost automatic annotation method is needed.

Method: Use a pretrained multilingual LLM (Llama3.1) to automatically annotate a large corpus of Polish medical texts, and use the limited human annotation budget to build a test set. Then train three BERT-based classifiers on the annotated data: DistilBERT, BioBERT, and HerBERT.

Result: DistilBERT performs best, with F1 above 0.80 for every clinical category and above 0.93 for three of them. The resulting classifiers are nearly 500 times smaller than the LLM, use 300 times less GPU VRAM, and run inference several hundred times faster.

Conclusion: Automatic annotation by a multilingual LLM, as a form of knowledge distillation, can relieve the annotation bottleneck in medical text classification for low-resource languages and yield efficient, lightweight classifiers suitable for deployment in real medical settings.

Abstract: In this work, we present an annotation framework that demonstrates how a multilingual LLM pretrained on a large corpus can be used as a teacher model to distill the expert knowledge needed for tagging medical texts in Polish. This work is part of a larger project called ADMEDVOICE, within which we collected an extensive corpus of medical texts representing five clinical categories - Radiology, Oncology, Cardiology, Hypertension, and Pathology. Using this data, we had to develop a multi-class classifier, but the fundamental problem turned out to be the lack of resources for annotating an adequate number of texts. Therefore, in our solution, we used the multilingual Llama3.1 model to annotate an extensive corpus of medical texts in Polish. Using our limited annotation resources, we verified only a portion of these labels, creating a test set from them. The data annotated in this way were then used for training and validation of 3 different types of classifiers based on the BERT architecture - the distilled DistilBERT model, BioBERT fine-tuned on medical data, and HerBERT fine-tuned on the Polish language corpus. Among the models we trained, the DistilBERT model achieved the best results, reaching an F1 score > 0.80 for each clinical category and an F1 score > 0.93 for 3 of them. In this way, we obtained a series of highly effective classifiers that represent an alternative to large language models, due to their nearly 500 times smaller size, 300 times lower GPU VRAM consumption, and several hundred times faster inference.

[11] SagaScale: A Realistic, Scalable, and High-Quality Long-Context Benchmark Built from Full-Length Novels

Guancheng Du,Yong Hu,Wenqing Wang,Yaming Yang,Jiaheng Gao

Main category: cs.CL

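TL;DR: This paper introduces SagaScale, a realistic, scalable, and high-quality long-context benchmark built from full-length novels for evaluating how well LLMs handle very long text. The benchmark is bilingual and offers the longest contexts to date (on average 250K tokens for English and 320K for Chinese), with question-answer pairs generated by an automated pipeline that uses external resources. Evaluation shows that supplying the full context directly works best, that most models still struggle with long text (Gemini-2.5-Pro being the exception), and that Agentic RAG effectively relieves the retrieval bottleneck. The benchmark and code are publicly released.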

Details

Motivation: Existing long-context benchmarks suffer from unrealistic tasks, poor data scalability, and low quality, so a more realistic, high-quality, and scalable benchmark is needed to assess LLMs on complex long-document understanding.

Method: SagaScale is built with an automated data collection pipeline that uses external resources (such as Wikipedia) to construct question-answer pairs from complete novels. The external resources are used only during construction and not during evaluation, which yields complex questions beyond what the models can answer at evaluation time. The benchmark is bilingual (Chinese and English) with very long contexts (averaging over 250K and 320K tokens).

Result: Evaluation of 12 frontier LLMs and three long-context methods (Naïve RAG, Agentic RAG, Long Context) shows that (1) supplying the full context directly can outperform the other methods by a large margin; (2) most models still struggle with very long contexts, with Gemini-2.5-Pro standing out; and (3) Agentic RAG effectively addresses the retrieval bottleneck of Naïve RAG.

Conclusion: SagaScale is a realistic, scalable, and high-quality long-context benchmark that can assess LLMs on extremely long text, advance long-context modeling, and serve as an open resource for future research.

Abstract: Large Language Models (LLMs) have shown significant progress, but understanding long and complex documents remains challenging. Many long-context benchmarks have been proposed, but they face several limitations, including task realism, data scalability, and data quality. To this end, we introduce SagaScale, a realistic, scalable, and high-quality long-context benchmark built from full-length novels. The entire benchmark is constructed using an automated data collection pipeline that utilizes external resources (e.g., Wikipedia pages) to curate question-answer pairs. Critically, these external resources are provided only for benchmark construction and not during evaluation, which allows LLMs to curate complex questions that go beyond what they can answer during evaluation. SagaScale is also bilingual and offers the largest context length to date, with average token counts exceeding 250K for English novels and 320K for Chinese novels. Our evaluation across 12 frontier LLMs and three long-context methods -- Naïve RAG, Agentic RAG, and Long Context -- yields key insights, including: (1) Directly supplying the full context to the LLM can outperform other methods by a large margin; (2) Most LLMs still struggle with lengthy contexts, but Gemini-2.5-Pro stands out as an exception; and (3) Agentic RAG effectively addresses the retrieval bottleneck in Naïve RAG. Finally, we publicly release the SagaScale benchmark and our data collection codebase to facilitate future research.

[12] Syntactic Framing Fragility: An Audit of Robustness in LLM Ethical Decisions

Katherine Elkins,Jon Chun

Main category: cs.CL

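TL;DR: This study introduces the Syntactic Framing Fragility (SFF) framework for assessing whether LLMs give consistent ethical judgments across different but logically equivalent prompts. Many models become markedly inconsistent under syntactic changes such as negation, open-source models more so, and chain-of-thought reasoning effectively mitigates the problem.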

Details

Motivation: Examine how robust LLMs are to syntactic variation in consequential decisions, and in particular whether logically equivalent prompts with different wording lead to inconsistent ethical judgments.

Method: Introduce the SFF framework with Logical Polarity Normalization (LPN), test 23 mainstream models over 14 ethical scenarios and four syntactic framings (39,975 decisions in total), analyze the consistency of their judgments, and evaluate mitigations such as chain-of-thought.

Result: Inconsistency is widespread and statistically significant, with some models reversing ethical judgments on syntactic polarity alone. Open-source models are more than twice as fragile as commercial ones. Some models still endorse the action in 80-97% of cases under "should not" prompts. Chain-of-thought substantially reduces fragility. Financial and business scenarios carry higher risk than medical ones.

Conclusion: Syntactic consistency is a key dimension of ethical robustness, and SFF-style audits should be part of standard pre-deployment safety evaluation.

Abstract: Large language models (LLMs) are increasingly deployed in consequential decision-making settings, yet their robustness to benign prompt variation remains underexplored. In this work, we study whether LLMs maintain consistent ethical judgments across logically equivalent but syntactically different prompts, focusing on variations involving negation and conditional structure. We introduce Syntactic Framing Fragility (SFF), a robustness evaluation framework that isolates purely syntactic effects via Logical Polarity Normalization (LPN), enabling direct comparison of decisions across positive and negative framings without semantic drift. Auditing 23 state-of-the-art models spanning the U.S. and China as well as small U.S. open-source software models over 14 ethical scenarios and four controlled framings (39,975 decisions), we find widespread and statistically significant inconsistency: many models reverse ethical endorsements solely due to syntactic polarity, with open-source models exhibiting over twice the fragility of commercial counterparts. We further uncover extreme negation sensitivity, where some models endorse actions in 80-97% of cases when explicitly prompted with "should not." We show that eliciting chain-of-thought reasoning substantially reduces fragility, identifying a practical mitigation lever, and we map fragility across scenarios, finding higher risk in financial and business contexts than in medical scenarios. Our results demonstrate that syntactic consistency constitutes a distinct and critical dimension of ethical robustness, and we argue that SFF-style audits should be a standard component of safety evaluation for deployed LLMs. Code and results will be available on github.com.
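One way to picture Logical Polarity Normalization is as a mapping from raw yes/no answers to a framing-independent "endorses the action" value, after which consistency across framings can be counted. The sketch below assumes that reading; the data layout, the function names, and the per-scenario fragility score are hypothetical and are not the paper's exact metric.

```python
def normalize(answer_is_yes, framing_is_negated):
    """Return True if the model endorses the action, whatever the framing.

    "Yes" to "Should X do Y?" endorses Y; "yes" to "Should X not do Y?"
    rejects it, so the answer is flipped for negated framings.
    """
    return answer_is_yes != framing_is_negated

def fragility(responses):
    """Fraction of scenarios whose normalized decisions disagree.

    responses maps a scenario id to a list of
    (answer_is_yes, framing_is_negated) pairs, one per framing.
    """
    inconsistent = 0
    for answers in responses.values():
        endorsements = {normalize(a, n) for a, n in answers}
        if len(endorsements) > 1:
            inconsistent += 1
    return inconsistent / len(responses)
```

A model that answers "yes" to both the positive and the negated framing of the same scenario endorses and rejects the same action, which counts as one fragile scenario here.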

[13] Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation

Kaustubh Shivshankar Shejole,Sourabh Deoghare,Pushpak Bhattacharyya

Main category: cs.CL

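TL;DR: This paper introduces Virām, the first diagnostic benchmark for punctuation robustness in English-to-Marathi machine translation. Using 54 manually curated punctuation-ambiguous instances, it compares a restore-then-translate pipeline with direct fine-tuning on punctuation-varied data. Specialized fine-tuned models and pipeline systems clearly beat standard baselines, and current LLMs lag behind these task-specific approaches.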

Details

Motivation: Address the semantic and structural ambiguity that punctuation causes in machine translation for low-resource languages such as Marathi, and improve the reliability of translation systems.

Method: Build Virām, a diagnostic benchmark of 54 manually curated punctuation-ambiguous instances, and evaluate two strategies: a pipeline that restores punctuation before translating, and direct fine-tuning on punctuation-varied data.

Result: Specialized fine-tuned models and pipeline systems significantly outperform standard baselines on Virām. Qualitative analysis shows that the original model can produce wrong translations and misinterpretations, while fine-tuned models substantially improve overall reliability. Current LLMs handle punctuation-ambiguous text worse than the task-specific approaches.

Conclusion: For punctuation ambiguity, specialized fine-tuning and pipeline methods effectively improve the quality and reliability of English-to-Marathi machine translation, while existing LLMs need further improvement on this challenge.

Abstract: Punctuation plays a critical role in resolving semantic and structural ambiguity in written language. Machine Translation (MT) systems are now widely applied across diverse domains and languages, including many low-resource settings. In this work, we focus on Marathi, a low- to middle-resource language. We introduce Virām, the first diagnostic benchmark for assessing punctuation robustness in English-to-Marathi machine translation, consisting of 54 manually curated, punctuation-ambiguous instances. We evaluate two primary strategies for enhancing reliability: a pipeline-based restore-then-translate approach and direct fine-tuning on punctuation-varied data. Our results demonstrate that specialized fine-tuned models and pipeline systems significantly improve translation quality over standard baselines on the Virām benchmark. Qualitative analysis reveals that the original model may produce wrong translations that lead to wrong interpretations, while fine-tuned models significantly improve overall reliability. Furthermore, we find that current Large Language Models (LLMs) lag behind these task-specific approaches in preserving meaning for punctuation-ambiguous text, thus necessitating further research in this area.

[14] Forgetting as a Feature: Cognitive Alignment of Large Language Models

Hien Tran,Quinten Steenhuis,Alexandros Christoforos,Chadbourne Davis

Main category: cs.CL

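TL;DR: This paper reinterprets forgetting in LLMs as a functional cognitive mechanism. Drawing on the exponential decay dynamics of human memory, it models inference as a probabilistic memory process and uses a new benchmark to examine human-like forgetting patterns in temporal reasoning, concept drift, and associative recall. The authors also propose a probabilistic memory prompting strategy that improves long-horizon reasoning, and argue that forgetting is a principled mechanism for adaptive intelligence.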

Details

Motivation: LLMs are often judged against the standard of perfect Bayesian inference, yet in practice they systematically forget past information. The paper revisits this behavior, treating it not as a defect but, by analogy with human memory, as a possibly adaptive mechanism.

Method: Model LLM in-context reasoning as a probabilistic memory process governed by exponential decay; design a benchmark suite covering temporal reasoning, concept drift adaptation, and associative recall to compare model behavior with human cognitive patterns; and propose "probabilistic memory prompting", which shapes evidence integration to mimic human memory decay.

Result: LLM forgetting rates resemble the human trade-off between stability and adaptability, and the proposed prompting method improves performance on long-horizon reasoning tasks.

Conclusion: Forgetting should be seen not as a failure mode of large models but as a principled mechanism for adaptive intelligence, offering a new perspective for building more efficient, human-like reasoning systems.

Abstract: Large Language Models (LLMs) are often evaluated against ideals of perfect Bayesian inference, yet growing evidence suggests that their in-context reasoning exhibits systematic forgetting of past information. Rather than viewing this behavior as a limitation, we reinterpret forgetting as a functional cognitive mechanism. Drawing inspiration from human memory dynamics, we model LLM inference as a probabilistic memory process governed by exponential decay. We introduce a benchmark suite that evaluates temporal reasoning, concept drift adaptation, and associative recall, enabling direct comparison between model behavior and human cognitive patterns. Our empirical results reveal that LLMs demonstrate forgetting rates analogous to human memory efficiency trade-offs between stability and adaptability. Building on these observations, we propose probabilistic memory prompting, a lightweight strategy that shapes evidence integration to mimic human-like memory decay, leading to improved long-horizon reasoning performance. Our findings position forgetting not as a failure mode, but as a principled mechanism for adaptive intelligence.
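A memory process governed by exponential decay can be illustrated with a recency-weighted average, where each observation's weight shrinks with its age. This is a generic sketch under that assumption; the function names and the `decay_rate=0.5` default are hypothetical, and the abstract does not say how the paper's prompting strategy applies such weights.

```python
import math

def decay_weights(n_items, decay_rate=0.5):
    """Normalized weights exp(-decay_rate * age), newest item last."""
    raw = [math.exp(-decay_rate * (n_items - 1 - i)) for i in range(n_items)]
    total = sum(raw)
    return [w / total for w in raw]

def decayed_estimate(observations, decay_rate=0.5):
    """Recency-weighted mean of a sequence of numeric observations."""
    weights = decay_weights(len(observations), decay_rate)
    return sum(w * x for w, x in zip(weights, observations))
```

Setting `decay_rate` to 0 recovers the plain mean (no forgetting), while larger values track recent evidence more closely, which is the stability versus adaptability trade-off the abstract refers to.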

[15] SciNets: Graph-Constrained Multi-Hop Reasoning for Scientific Literature Synthesis

Sauhard Dubey

Main category: cs.CL

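TL;DR: This paper proposes SciNets, which builds literature-derived concept graphs and uses graph-constrained multi-hop reasoning to synthesize scientific mechanisms across domains, giving better control over reasoning depth and structural grounding than conventional approaches.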

Details

Motivation: 现有的科学综合方法在跨文献连接机制解释方面存在局限，难以控制推理深度和结构基础，因此需要一种更可控且可解释的方法来实现跨域知识整合。 Method: 将机制综合建模为基于文献概念图的图约束多跳推理问题；针对科学问题和局部语料构建有向概念图，并通过多种路径搜索策略（如最短路径、k最短路径、随机游走）识别罕见共现概念间的推理路径。 Result: 实验表明，显式的图约束支持可控的多跳推理，但存在权衡：更深、更多样化的符号推理会降低结构稳定性，而最短路径推理稳定但结构保守。 Conclusion: 图约束与大语言模型结合可用于科学综合，但需在推理深度与接地稳定性之间进行权衡，本文为此提供了系统的行为表征框架。 Abstract: Cross-domain scientific synthesis requires connecting mechanistic explanations across fragmented literature, a capability that remains challenging for both retrieval-based systems and unconstrained language models. While recent work has applied large language models to scientific summarization and question answering, these approaches provide limited control over reasoning depth and structural grounding. We frame mechanistic synthesis as a graph-constrained multi-hop reasoning problem over literature-derived concept graphs. Given a scientific query and a compact, query-local corpus, SciNets constructs a directed concept graph and synthesizes mechanistic explanations by identifying multi-hop reasoning paths that connect concepts that rarely co-occur within individual papers. We systematically compare shortest-path reasoning, k-shortest paths with diversity constraints, stochastic random walks, and a retrieval-augmented language model baseline. Rather than evaluating correctness, which is often indeterminate when synthesizing connections across distributed sources, we introduce a behavioral framework that measures symbolic reasoning depth, mechanistic diversity, and grounding stability. Across machine learning, biology, and climate science tasks, explicit graph constraints enable controllable multi-hop reasoning while revealing a consistent trade-off: deeper and more diverse symbolic reasoning increases grounding instability, whereas shortest-path reasoning remains highly stable but structurally conservative. 
These findings provide a systematic behavioral characterization of the limits and capabilities of current graph-LLM integration for scientific synthesis.
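
The path-search strategies named in the abstract can be illustrated with a minimal sketch. The toy graph, its concept names, and the helper functions `shortest_path` and `random_walk` are all hypothetical and are not taken from SciNets; the sketch only shows how a fewest-hop path and a stochastic walk differ on a small directed concept graph.

```python
from collections import deque
import random

# Hypothetical toy concept graph: an edge A -> B reads "A influences B".
GRAPH = {
    "aerosols": ["cloud albedo", "air quality"],
    "cloud albedo": ["surface temperature"],
    "air quality": ["respiratory health"],
    "surface temperature": ["crop yield"],
    "respiratory health": [],
    "crop yield": [],
}

def shortest_path(graph, source, target):
    """Breadth-first search; returns a fewest-hop path, or None if unreachable."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

def random_walk(graph, source, max_hops, rng):
    """Follow random outgoing edges until a dead end or the hop limit."""
    path = [source]
    while len(path) - 1 < max_hops:
        successors = graph.get(path[-1], [])
        if not successors:
            break
        path.append(rng.choice(successors))
    return path

path = shortest_path(GRAPH, "aerosols", "crop yield")
walk = random_walk(GRAPH, "aerosols", max_hops=3, rng=random.Random(0))
print(path)
print(walk)
```

The breadth-first search always returns the same conservative chain, while the walk may end at a different concept on each seed, which mirrors the stability versus diversity trade-off the abstract describes.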

[16] Eliminating Agentic Workflow for Introduction Generation with Parametric Stage Tokens

Meicong Zhang,Tiancheng su,Guoxiu He

Main category: cs.CL

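TL;DR: Proposes STIG, which parameterizes a multi-stage logical structure into the large language model so that a complete introduction is generated in a single inference, avoiding the error accumulation and coherence problems of conventional agentic workflows.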

Details

Motivation: Existing methods based on predefined agentic workflows suffer from overly long reasoning chains, error accumulation, and poor textual coherence when generating academic introductions, and so struggle to meet the demands that introduction writing places on logical rigor and structural consistency. Method: Proposes Stage Token for Introduction Generation (STIG), which converts the stages of the original workflow into explicit stage tokens that act as logical signals during generation; instruction tuning teaches the model the mapping between stage tokens and text functions, logical order, and transition patterns, internalizing the multi-stage logic in the model parameters. Result: Experiments show that STIG generates multi-stage introduction text in a single inference without external workflow calls, and outperforms conventional agentic workflows and other baselines on semantic similarity and sentence-level structural rationality. Conclusion: Parameterizing workflow logic directly into the model is a more efficient and coherent paradigm for introduction generation than relying on external agentic pipelines, and STIG offers a new approach to structured generation of complex text. Abstract: In recent years, using predefined agentic workflows to guide large language models (LLMs) for literature classification and review has become a research focus. However, writing research introductions is more challenging. It requires rigorous logic, coherent structure, and abstract summarization. Existing workflows often suffer from long reasoning chains, error accumulation, and reduced textual coherence. To address these limitations, we propose eliminating external agentic workflows. Instead, we directly parameterize their logical structure into the LLM. This allows the generation of a complete introduction in a single inference. To this end, we introduce the Stage Token for Introduction Generation (STIG). STIG converts the multiple stages of the original workflow into explicit stage signals. These signals guide the model to follow different logical roles and functions during generation. Through instruction tuning, the model learns the mapping between stage tokens and text functions. It also learns the logical order and transition patterns between stages, encoding this knowledge into the model parameters. Experimental results show that STIG can generate multi-stage text in a single inference. It does not require explicit workflow calls. STIG outperforms traditional agentic workflows and other baselines on metrics of semantic similarity and sentence-level structural rationality. The code is provided in the Supplementary Materials.

[17] Enhancing Business Analytics through Hybrid Summarization of Financial Reports

Tohida Rehman

Main category: cs.CL

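TL;DR: This paper proposes a hybrid summarization framework that combines extractive and abstractive methods to automatically generate accurate, concise Reuters-style summaries from earnings call transcripts, achieving good performance and factual consistency in resource-constrained settings.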

Details

Motivation: Financial reports and earnings calls are voluminous and complex, and manual analysis is inefficient and error-prone, so automated methods are needed to extract key business information efficiently. Method: A two-stage hybrid framework: the first stage uses the LexRank algorithm to extract salient sentences, and the second uses fine-tuned BART and PEGASUS models for abstractive summarization; in parallel, a Longformer Encoder-Decoder (LED) model is fine-tuned to capture long-range contextual dependencies. Result: Long-context models perform best overall, while the hybrid framework remains competitive under computational constraints and shows better factual consistency; evaluation metrics include ROUGE, METEOR, BERTScore, the domain-specific FinBERTScore, and entity-level precision measures. Conclusion: The proposed hybrid framework and long-context models can effectively support automatic summarization of financial text, helping turn lengthy financial documents into usable business insights. Abstract: Financial reports and earnings communications contain large volumes of structured and semi-structured information, making detailed manual analysis inefficient. Earnings conference calls provide valuable evidence about a firm's performance, outlook, and strategic priorities. The manual analysis of lengthy call transcripts requires substantial effort and is susceptible to interpretive bias and unintentional error. In this work, we present a hybrid summarization framework that combines extractive and abstractive techniques to produce concise and factually reliable Reuters-style summaries from the ECTSum dataset. The proposed two-stage pipeline first applies the LexRank algorithm to identify salient sentences, which are subsequently summarized using fine-tuned variants of BART and PEGASUS designed for resource-constrained settings. In parallel, we fine-tune a Longformer Encoder-Decoder (LED) model to directly capture long-range contextual dependencies in financial documents. Model performance is evaluated using standard automatic metrics, including ROUGE, METEOR, MoverScore, and BERTScore, along with domain-specific variants such as SciBERTScore and FinBERTScore. To assess factual accuracy, we further employ entity-level measures based on source-precision and F1-target. The results highlight complementary trade-offs between approaches: long-context models yield the strongest overall performance, while the hybrid framework achieves competitive results with improved factual consistency under computational constraints. 
These findings support the development of practical summarization systems for efficiently distilling lengthy financial texts into usable business insights.
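
The extractive first stage can be sketched in a few lines. This is a simplified LexRank-style ranker, not the authors' pipeline: it assumes raw term counts rather than the IDF-weighted cosine used by LexRank proper, and the example sentences and the `lexrank` helper are invented for illustration.

```python
import math
import re
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def lexrank(sentences, damping=0.85, iterations=50):
    """Score sentences by damped power iteration over a similarity graph."""
    bags = [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    n = len(sentences)
    sim = [[cosine(bags[i], bags[j]) for j in range(n)] for i in range(n)]
    # Row-normalize so each row is a probability distribution over neighbours.
    rows = []
    for row in sim:
        total = sum(row)
        rows.append([v / total for v in row] if total else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(scores[i] * rows[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores

# Invented transcript sentences; the last one is off-topic on purpose.
sentences = [
    "Revenue rose 12 percent in the third quarter.",
    "Third quarter revenue growth was driven by cloud revenue.",
    "The company expects revenue growth to continue next quarter.",
    "The call ended with thanks to the operator.",
]
scores = lexrank(sentences)
ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
print([sentences[i] for i in ranked[:2]])
```

In a two-stage setup of the kind described, the top-ranked sentences would then be passed to the abstractive model, which keeps the input short enough for models with limited context.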

[18] Clinical Document Metadata Extraction: A Scoping Review

Kurt Miller,Qiuhao Lu,William Hersh,Kirk Roberts,Steven Bedrick,Andrew Wen,Hongfang Liu

Main category: cs.CL

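TL;DR: This paper presents a scoping review of research on clinical document metadata extraction, identifying methodological trends, applications, and data gaps, and finds that the field is moving from rule-based and traditional machine learning methods toward Transformer and large language model architectures.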

Details

Motivation: Clinical document metadata is essential for accurate interpretation of clinical information, but document heterogeneity and drift over time make metadata harmonization difficult, so a systematic survey of existing extraction methods and research gaps is needed. Method: Following the PRISMA-ScR guidelines, the authors screened literature published between January 2011 and August 2025 and included 67 articles for synthesis, classifying them by study type (methodological, application, composition analysis) and summarizing technical evolution and data availability. Result: Of 266 articles initially screened, 67 were reviewed in depth: 45 were methodological, 17 used metadata in downstream tasks, and 5 analyzed metadata composition. Labelled public data is scarce except for structural section datasets; methods have moved from rules and traditional machine learning to Transformer architectures with little feature engineering, and large language models have improved generalization across tasks. Conclusion: Clinical document metadata extraction is moving toward richer representations and deeper integration into clinical workflows, and future research will benefit from general-purpose text processing systems driven by large language models. Abstract: Clinical document metadata, such as document type, structure, author role, medical specialty, and encounter setting, is essential for accurate interpretation of information captured in clinical documents. However, vast documentation heterogeneity and drift over time challenge harmonization of document metadata. Automated extraction methods have emerged to coalesce metadata from disparate practices into target schema. This scoping review aims to catalog research on clinical document metadata extraction, identify methodological trends and applications, and highlight gaps. We followed the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) guidelines to identify articles that perform clinical document metadata extraction. We initially found and screened 266 articles published between January 2011 and August 2025, then comprehensively reviewed 67 we deemed relevant to our study. Among the articles included, 45 were methodological, 17 used document metadata as features in a downstream application, and 5 analyzed document metadata composition. We observe myriad purposes for methodological study and application types. Available labelled public data remains sparse except for structural section datasets. Methods for extracting document metadata have progressed from largely rule-based and traditional machine learning with ample feature engineering to transformer-based architectures with minimal feature engineering. 
The emergence of large language models has enabled broader exploration of generalizability across tasks and datasets, allowing the possibility of advanced clinical text processing systems. We anticipate that research will continue to expand into richer document metadata representations and integrate further into clinical applications and workflows.

[19] Geometric Patterns of Meaning: A PHATE Manifold Analysis of Multi-lingual Embeddings

Wen G Gong

Main category: cs.CL

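TL;DR: Proposes Semanscope, a multi-level analysis framework that uses PHATE manifold learning to reveal the semantic geometry of multilingual embeddings, and finds that current models have systematic weaknesses in distinguishing semantic from structural components.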

Details

Motivation: Current embedding models struggle to distinguish semantic from structural components, and there is no systematic tool for analyzing the semantic geometry of multilingual embeddings. Method: Builds a multi-level analysis framework covering four linguistic levels (sub-character, character, word, number) and combines it with PHATE manifold learning in the visualization tool Semanscope. Result: At the sub-character level, Chinese radicals exhibit geometric collapse; at the character level, different writing systems show distinct geometric signatures; at the word level, content words form clustering-branching patterns across 20 semantic domains; Arabic numerals follow spiral trajectories rather than forming clusters. Conclusion: PHATE manifold learning is a key tool for analyzing the semantic geometry of embedding spaces and for assessing model effectiveness, and it exposes systematic limitations in how current embedding models represent meaning. Abstract: We introduce a multi-level analysis framework for examining semantic geometry in multilingual embeddings, implemented through Semanscope (a visualization tool that applies PHATE manifold learning across four linguistic levels). Analysis of diverse datasets spanning sub-character components, alphabetic systems, semantic domains, and numerical concepts reveals systematic geometric patterns and critical limitations in current embedding models. At the sub-character level, purely structural elements (Chinese radicals) exhibit geometric collapse, highlighting model failures to distinguish semantic from structural components. At the character level, different writing systems show distinct geometric signatures. At the word level, content words form clustering-branching patterns across 20 semantic domains in English, Chinese, and German. Arabic numbers organize through spiral trajectories rather than clustering, violating standard distributional semantics assumptions. These findings establish PHATE manifold learning as an essential analytic tool not only for studying geometric structure of meaning in embedding space, but also for validating the effectiveness of embedding models in capturing semantic relationships.

[20] Benchmarking Cross-Lingual Semantic Alignment in Multilingual Embeddings

Wen G. Gong

Main category: cs.CL

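TL;DR: This paper introduces the Semantic Affinity (SA) metric and the Semanscope framework for evaluating cross-lingual semantic alignment in multilingual embedding models, and finds that the training objective, not model scale or architecture, is the key determinant of alignment quality.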

Details

Motivation: Existing task-driven benchmarks (such as MTEB) may mask fundamental shortcomings in the cross-lingual semantic alignment of multilingual embedding models, leaving practitioners unable to judge real model quality, so a more direct way to evaluate semantic alignment is needed. Method: Proposes the Semantic Affinity (SA) metric, which uses cosine distance to compute the ratio of inter-lingual to intra-lingual spread, and combines it with PHATE visualization in the Semanscope analysis framework; evaluation covers 13 models on 4 datasets (52 experiments). Result: The experiments reveal three tiers: (1) top BERT models trained with translation-pair supervision (LaBSE, USE, S-BERT) reach SA of 0.68 to 0.70 and show the strongest alignment; (2) LLM embeddings have SA between 0.55 and 0.61, which does not improve with parameter count; (3) models pre-trained with MLM only (mBERT, XLM-R) have SA below 0.50 and align poorly. The training objective is decisive. In addition, analysis of Oracle Bone primitives shows that models learn corpus patterns rather than cognitive primitives and exhibit semantic drift. Conclusion: Cross-lingual semantic alignment depends on explicit translation supervision rather than model scale or the amount of multilingual data alone; the study gives practitioners an evaluation tool for selecting high-quality multilingual embeddings from hundreds of models. Abstract: With hundreds of multilingual embedding models available, practitioners lack clear guidance on which provide genuine cross-lingual semantic alignment versus task performance through language-specific patterns. Task-driven benchmarks (MTEB) may mask fundamental alignment shortcomings. We introduce Semantic Affinity (SA), a bounded (between 0 and 1) metric measuring inter-lingual to intra-lingual spread ratio using cosine distance, combined with PHATE visualization in our Semanscope framework. Benchmarking 13 models across 4 datasets (52 experiments) reveals a three-tier structure: (1) Top BERT models (LaBSE SA = 0.70, USE SA = 0.68, S-BERT SA = 0.68) achieve strong alignment via translation-pair supervision; (2) LLM embeddings plateau at SA between 0.55 and 0.61 regardless of 0.6B to 8B scale; (3) MLM-only BERT models (mBERT, XLM-R, SA < 0.50) fail despite training on more than 100 languages. Training objective, not architecture or scale, determines alignment. Oracle Bone primitives (1200 BCE) expose semantic drift: models learn corpus patterns rather than cognitive primitives. This work provides semantic benchmarking to help practitioners select quality multilingual embeddings from hundreds of available models, showing cross-lingual alignment requires explicit translation supervision, not merely model scale or multilingual data.

[21] Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets

Xin Gao,Xiaoyang Wang,Yun Zhu,Mengzhang Cai,Conghui He,Lijun Wu

Main category: cs.CL

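TL;DR: Proposes a closed-loop data engineering framework based on OpenDataArena (ODA) that uses value-anchored rankings and multi-dimensional analysis to guide the construction of supervised fine-tuning (SFT) datasets, significantly improving the performance and data efficiency of large models on mathematical reasoning and multi-domain tasks.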

Details

Motivation: Current SFT dataset construction relies on heuristic aggregation and lacks a systematic understanding of how individual samples affect model performance, so a more principled, feedback-driven approach to data construction is needed. Method: Proposes a closed-loop data engineering framework built on ODA, which uses value-anchored rankings and multi-dimensional analysis to turn benchmark evaluation into feedback signals for dataset construction; builds two new datasets, ODA-Math-460k (a two-stage difficulty-aware pipeline) and the ODA-Mixture series (an "Anchor-and-Patch" strategy). Result: ODA-driven datasets reach SOTA on mathematics benchmarks such as AIME and HMMT, and outperform larger open-source baselines on multi-domain tasks while being more data-efficient. Conclusion: The work shows that closed-loop data engineering centered on transparent evaluation can advance data-centric AI, and offers a reproducible, optimizable paradigm for building high-quality SFT datasets. Abstract: The construction of Supervised Fine-Tuning (SFT) datasets is a critical yet under-theorized stage in the post-training of Large Language Models (LLMs), as prevalent practices often rely on heuristic aggregation without a systematic understanding of how individual samples contribute to model performance. In this report, we propose a paradigm shift from ad-hoc curation to a closed-loop dataset engineering framework using OpenDataArena (ODA), which leverages value-anchored rankings and multi-dimensional analysis to transform value benchmarking into feedback signals guiding dataset construction. We instantiate this methodology through two new datasets: ODA-Math-460k, a specialized mathematics reasoning dataset that utilizes a novel two-stage difficulty-aware pipeline to achieve State-of-the-Art (SOTA) results on benchmarks such as AIME and HMMT, and ODA-Mixture (100k & 500k), a series of multi-domain instruction datasets built via an "Anchor-and-Patch" strategy that outperforms significantly larger open-source baselines. Our empirical results demonstrate that ODA-driven datasets significantly improve both domain-specific reasoning and general utility while achieving superior data efficiency, validating a transition toward data-centric AI where transparent evaluation serves as the primary engine for engineering high-quality training data.

[22] From Detection to Diagnosis: Advancing Hallucination Analysis with Automated Data Synthesis

Yanyi Liu,Qingwen Yang,Tiezheng Guo,Feiyu Qu,Jun Liu,Yingyou Wen

Main category: cs.CL

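TL;DR: This paper proposes a new "hallucination diagnosis" paradigm that goes beyond conventional binary detection, improving the reliability of large language models through error localization, causal explanation, and content correction.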

Details

Motivation: Hallucination in large language models seriously hinders their reliable deployment in critical domains, and current detection methods lack interpretability and guidance for improvement, which limits practical use. Method: Introduces the hallucination diagnosis task and develops HDG, which automatically generates training samples with rich diagnostic metadata; uses this data with GRPO to train the HDM-4B-RL model. Result: HDM-4B-RL surpasses previous state-of-the-art detection models on the HaluEval benchmark, and on diagnosis tasks performs on par with larger general-purpose models. Conclusion: Hallucination diagnosis is feasible and valuable, and offers an effective methodology for building more trustworthy generative AI systems. Abstract: Hallucinations in Large Language Models (LLMs), defined as the generation of content inconsistent with facts or context, represent a core obstacle to their reliable deployment in critical domains. Current research primarily focuses on binary "detection" approaches that, while capable of identifying hallucinations, fail to provide interpretable and actionable feedback for model improvement, thus limiting practical utility. To address this limitation, a new research paradigm is proposed, shifting from "detection" to "diagnosis". The Hallucination Diagnosis Task is introduced, a task which requires models to not only detect hallucinations, but also perform error localization, causal explanation, and content correction. We develop the Hallucination Diagnosis Generator (HDG), an automated pipeline that systematically generates high-quality training samples with rich diagnostic metadata from raw corpora through multi-dimensional augmentation strategies including controlled fact fabrication and reasoning chain perturbation. Using HDG-generated data, we train HDM-4B-RL, a 4-billion-parameter hallucination diagnosis model, employing Group Relative Policy Optimization (GRPO) with a comprehensive reward function incorporating structural, accuracy, and localization signals. Experimental results demonstrate that our model surpasses previous state-of-the-art detection models on the HaluEval benchmark while achieving comparable performance to advanced general-purpose models. In comprehensive diagnosis tasks, HDM-4B-RL matches the capabilities of larger general models while maintaining a smaller size. 
This work validates the feasibility and value of hallucination diagnosis, providing an effective methodology for building more trustworthy and reliable generative AI systems.
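
The group-relative part of GRPO can be sketched as follows. The abstract names three reward signals but not how they are combined, so the weights, the example signal values, and the helper names below are assumptions made purely for illustration; the normalization (reward minus group mean, divided by group standard deviation) follows the usual description of GRPO advantages, here with a population standard deviation.

```python
import statistics

# Hypothetical weights: the abstract does not say how the signals are combined.
WEIGHTS = {"structure": 0.2, "accuracy": 0.5, "localization": 0.3}

def total_reward(signals):
    """Weighted sum of the three reward signals for one sampled response."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four invented responses sampled for the same prompt.
group = [
    {"structure": 1.0, "accuracy": 1.0, "localization": 0.8},
    {"structure": 1.0, "accuracy": 0.0, "localization": 0.1},
    {"structure": 0.0, "accuracy": 0.0, "localization": 0.0},
    {"structure": 1.0, "accuracy": 1.0, "localization": 0.3},
]
rewards = [total_reward(s) for s in group]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because advantages are computed within a group of responses to the same prompt, no separate value network is needed; responses that beat their siblings get a positive advantage and the rest a negative one.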

[23] Stable and Explainable Personality Trait Evaluation in Large Language Models with Internal Activations

Xiaoxu Ma,Xiangbo Zhang,Zhenyu Weng

Main category: cs.CL

TL;DR: Proposes Persona-Vector Neutrality Interpolation (PVNI), a stable and explainable method for evaluating personality traits in large language models based on internal activations.

Details

Motivation: Existing questionnaire-based personality evaluation methods are unstable and hard to interpret, with results that are sensitive to small changes in prompt wording. Method: Uses contrastive prompts to extract, from the model's internal activations, a persona vector associated with the target personality trait, and estimates a neutral score by interpolating along that vector, enabling an explainable evaluation. Result: Experiments on a range of large language models show that PVNI is more stable than existing methods, even under questionnaire and role-play variants. Conclusion: PVNI provides a more stable and explainable way to evaluate personality traits, supporting model understanding, comparison, and responsible deployment. Abstract: Evaluating personality traits in Large Language Models (LLMs) is key to model interpretation, comparison, and responsible deployment. However, existing questionnaire-based evaluation methods exhibit limited stability and offer little explainability, as their results are highly sensitive to minor variations in prompt phrasing or role-play configurations. To address these limitations, we propose an internal-activation-based approach, termed Persona-Vector Neutrality Interpolation (PVNI), for stable and explainable personality trait evaluation in LLMs. PVNI extracts a persona vector associated with a target personality trait from the model's internal activations using contrastive prompts. It then estimates the corresponding neutral score by interpolating along the persona vector as an anchor axis, enabling an interpretable comparison between the neutral prompt representation and the persona direction. We provide a theoretical analysis of the effectiveness and generalization properties of PVNI. Extensive experiments across diverse LLMs demonstrate that PVNI yields substantially more stable personality trait evaluations than existing methods, even under questionnaire and role-play variants.
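
One way to read "interpolating along the persona vector as an anchor axis" is as a projection onto the line between two contrastive centroids. The sketch below is that reading only, under the assumption that the persona vector is the difference of mean activations for high-trait and low-trait prompts; the activation values and helper names are invented, and the actual PVNI estimator may differ.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def persona_axis(high_trait, low_trait):
    """Assumed persona vector: mean(high-trait) minus mean(low-trait) activations."""
    mu_high = mean_vector(high_trait)
    mu_low = mean_vector(low_trait)
    direction = [h - l for h, l in zip(mu_high, mu_low)]
    return mu_low, direction

def interpolation_score(neutral, mu_low, direction):
    """Position of the neutral activation along the axis: 0 = low pole, 1 = high pole."""
    offset = [x - l for x, l in zip(neutral, mu_low)]
    return dot(offset, direction) / dot(direction, direction)

# Invented 3-dimensional activations standing in for real hidden states.
high = [[2.0, 0.0, 1.0], [2.0, 0.2, 1.0]]
low = [[0.0, 0.0, 1.0], [0.0, 0.2, 1.0]]
neutral = [0.5, 0.9, 1.0]

mu_low, direction = persona_axis(high, low)
score = interpolation_score(neutral, mu_low, direction)
print(score)
```

In this toy case the neutral prompt sits a quarter of the way from the low-trait pole to the high-trait pole, and components orthogonal to the axis do not affect the score.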

[24] Bears, all bears, and some bears. Language Constraints on Language Models' Inductive Inferences

Sriram Padmanabhan,Siyuan Song,Kanishka Misra

Main category: cs.CL

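TL;DR: The study examines whether vision language models, like children, are differentially sensitive to types of linguistic statement (generics, universal quantification, indefinite plurals) in inductive inference, and finds that model behavior aligns with humans and that the differences stem from inductive constraints rather than surface form.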

Details

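Motivation: To explore how language subtly shapes inductive inference, and to test whether vision language models reflect the fine-grained distinctions children make in understanding linguistic propositions. Method: Replicates the experiment of Gelman et al.: the models first undergo precondition tests (category identification, sensitivity to "all" and "some"), then the main experiment evaluates their responses to generic, universally quantified, and indefinite plural statements, followed by analysis of differences in their representations. Result: The vision language models show the same behavioral pattern as children (all > generics > some), and post-hoc analysis shows that these differences are organized by inductive constraints rather than surface linguistic form. Conclusion: Vision language models display human-like sensitivity in inductive inference, suggesting they can capture subtle semantic differences in the interaction between language and cognition. Abstract: Language places subtle constraints on how we make inductive inferences. Developmental evidence by Gelman et al. (2002) has shown children (4 years and older) to differentiate among generic statements ("Bears are daxable"), universally quantified NPs ("all bears are daxable") and indefinite plural NPs ("some bears are daxable") in extending novel properties to a specific member (all > generics > some), suggesting that they represent these types of propositions differently. We test if these subtle differences arise in general purpose statistical learners like Vision Language Models, by replicating the original experiment. After running them through a series of precondition tests (robust identification of categories in images and sensitivities to all and some), followed by the original experiment, we find behavioral alignment between models and humans. Post-hoc analyses on their representations revealed that these differences are organized based on inductive constraints and not surface-form differences.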

[25] MedRedFlag: Investigating how LLMs Redirect Misconceptions in Real-World Health Communication

Sraavya Sambara,Yuan Pu,Ayman Ali,Vishala Mishra,Lionel Wong,Monica Agrawal

Main category: cs.CL

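TL;DR: The study introduces the MedRedFlag dataset for evaluating how well large language models redirect real-world medical questions that contain false premises, and finds that current LLMs often fail to redirect correctly, which poses safety risks.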

Details

Motivation: To test how well large language models handle questions with implicit false assumptions in real-world medical consultations, in the interest of patient safety. Method: Builds a semi-automated pipeline that collects more than 1,100 medical questions requiring redirection from Reddit to form the MedRedFlag dataset, and systematically compares responses from recent LLMs with those from clinicians. Result: The analysis shows that LLMs often fail to redirect effectively even when they detect the false premise, which may lead to poor medical decisions. Conclusion: Current patient-facing medical AI systems have a significant safety gap in handling questions with embedded false premises, and their redirection ability needs to improve. Abstract: Real-world health questions from patients often unintentionally embed false assumptions or premises. In such cases, safe medical communication typically involves redirection: addressing the implicit misconception and then responding to the underlying patient context, rather than the original question. While large language models (LLMs) are increasingly being used by lay users for medical advice, they have not yet been tested for this crucial competency. Therefore, in this work, we investigate how LLMs react to false premises embedded within real-world health questions. We develop a semi-automated pipeline to curate MedRedFlag, a dataset of 1100+ questions sourced from Reddit that require redirection. We then systematically compare responses from state-of-the-art LLMs to those from clinicians. Our analysis reveals that LLMs often fail to redirect problematic questions, even when the problematic premise is detected, and provide answers that could lead to suboptimal medical decision making. Our benchmark and results reveal a novel and substantial gap in how LLMs perform under the conditions of real-world health communication, highlighting critical safety concerns for patient-facing medical AI systems. Code and dataset are available at https://github.com/srsambara-1/MedRedFlag.

[26] OUTLINEFORGE: Hierarchical Reinforcement Learning with Explicit States for Scientific Writing

Yilin Bao,Ziyao He,Zayden Yang

Main category: cs.CL

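TL;DR: Proposes a reinforcement learning framework for generating scientific paper outlines that uses long-horizon planning and two-stage optimization to improve structural consistency, citation faithfulness, and factual accuracy.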

Details

Motivation: Current large language models produce scientific papers with inconsistent global structure, insufficient input coverage, and inaccurate citations, so more effective document-level planning is needed. Method: Casts scientific outline construction as a long-horizon planning problem over hierarchical document structures, edits outlines through structured actions, and designs a two-stage optimization: backward outline reconstruction and forward value-guided reinforcement learning, with rewards for scientific correctness, discourse coherence, and citation fidelity. Result: On a newly proposed benchmark for scientific paper generation, the method outperforms strong neural and LLM baselines in structural coherence, citation reliability, input utilization, and factual accuracy. Conclusion: The framework improves global planning and factual consistency in scientific paper generation and offers a reliable technical route toward automated scientific writing. Abstract: Scientific paper generation requires document-level planning and factual grounding, but current large language models, despite their strong local fluency, often fail in global structure, input coverage, and citation consistency. We present a reinforcement learning framework that casts scientific outline construction as a long-horizon planning problem over hierarchical document structures. Our approach models edits to evolving outlines through structured actions, enabling the system to incrementally build a complete scientific manuscript. To support effective and stable learning, we introduce a two-stage optimization procedure consisting of (i) backward outline reconstruction from partial plans to enforce global structural consistency, and (ii) forward value-guided reinforcement learning with rewards explicitly modeling scientific correctness, discourse coherence, and citation fidelity. In addition, we introduce a benchmark for scientific paper generation that evaluates document planning, input utilization, reference faithfulness, outline organization, and content-level factual accuracy. Our results show consistent improvements over strong neural and LLM baselines, particularly in long-range structural coherence and citation reliability.

[27] Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL

Yifei Shen,Yilun Zhao,Justice Ou,Tinglin Huang,Arman Cohan

Main category: cs.CL

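TL;DR: CLINSQL is a clinical text-to-SQL benchmark built on MIMIC-IV v3.1 with 633 expert-annotated tasks that require multi-table joins, clinically meaningful filters, and executable SQL; evaluation shows that current models still fall short of clinical reliability.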

Details

Motivation: Existing text-to-SQL models struggle to meet clinical needs on real-world electronic health records (EHRs), lacking support for complex logic such as multi-table joins, temporal windows, and patient cohorts. Method: Builds the CLINSQL benchmark of 633 SQL tasks requiring multi-step reasoning, multi-table joins, and clinical semantic understanding, and evaluates 22 proprietary and open-source models under Chain-of-Thought self-refinement, using rubric-based SQL analysis combined with execution checks. Result: On the test set, GPT-5-mini attains a 74.7% execution score, DeepSeek-R1 is the best open-source model at 69.2%, and Gemini-2.5-Pro drops from 85.5% on Easy to 67.2% on Hard, showing a marked decline on complex clinical queries. Conclusion: Despite progress, current models remain clearly inadequate in clinical reliability and in handling complex EHR queries; CLINSQL provides an important benchmark for advancing clinically trustworthy text-to-SQL systems. Abstract: Real-world clinical text-to-SQL requires reasoning over heterogeneous EHR tables, temporal windows, and patient-similarity cohorts to produce executable queries. We introduce CLINSQL, a benchmark of 633 expert-annotated tasks on MIMIC-IV v3.1 that demands multi-table joins, clinically meaningful filters, and executable SQL. Solving CLINSQL entails navigating schema metadata and clinical coding systems, handling long contexts, and composing multi-step queries beyond traditional text-to-SQL. We evaluate 22 proprietary and open-source models under Chain-of-Thought self-refinement and use rubric-based SQL analysis with execution checks that prioritize critical clinical requirements. Despite recent advances, performance remains far from clinical reliability: on the test set, GPT-5-mini attains 74.7% execution score, DeepSeek-R1 leads open-source at 69.2%, and Gemini-2.5-Pro drops from 85.5% on Easy to 67.2% on Hard. Progress on CLINSQL marks tangible advances toward clinically reliable text-to-SQL for real-world EHR analytics.

[28] Clozing the Gap: Exploring Why Language Model Surprisal Outperforms Cloze Surprisal

Sathvik Nair,Byung-Doh Oh

Main category: cs.CL

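TL;DR: Language model probabilities predict language processing difficulty better than probabilities from human cloze tasks; this paper examines three reasons for the advantage: higher resolution, the ability to distinguish semantically similar words, and more accurate probabilities for low-frequency words.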

Details

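Motivation: The reasons why language model probabilities outperform human cloze probabilities need to be established so that scientific conclusions are sound, since different predictors can lead to different conclusions about the role of prediction in language. Method: Compares language model probabilities with cloze-derived probabilities as predictors of processing effort and analyzes the specific sources of the language model advantage. Result: The advantage comes mainly from three sources: avoiding the low resolution of cloze data, better distinguishing semantically similar words, and assigning more accurate probabilities to low-frequency words. Conclusion: The resolution of cloze studies should be improved, and further experiments should test whether human prediction is similarly sensitive to the fine-grained distinctions captured by language models. Abstract: How predictable a word is can be quantified in two ways: using human responses to the cloze task or using probabilities from language models (LMs). When used as predictors of processing effort, LM probabilities outperform probabilities derived from cloze data. However, it is important to establish that LM probabilities do so for the right reasons, since different predictors can lead to different scientific conclusions about the role of prediction in language comprehension. We present evidence for three hypotheses about the advantage of LM probabilities: not suffering from low resolution, distinguishing semantically similar words, and accurately assigning probabilities to low-frequency words. These results call for efforts to improve the resolution of cloze studies, coupled with experiments on whether human-like prediction is also as sensitive to the fine-grained distinctions made by LM probabilities.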

[29] Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations

Christabel Acquaye,Yi Ting Huang,Marine Carpuat,Rachel Rudinger

Main category: cs.CL

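TL;DR: This study investigates the use of open-source large language models (LLMs) to predict the real-world difficulty of math questions for students, by role-playing students of different grades and fitting Item Response Theory (IRT) models, and obtains results that correlate strongly with real data.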

Details

Motivation: Item difficulty in standardized math tests is usually established through expensive human pilot studies, so a more efficient, lower-cost way to estimate difficulty is needed. Method: Uses open-source LLMs to simulate a "virtual classroom" of 4th, 8th, or 12th grade students, generates responses through role-play, fits Item Response Theory (IRT) models to this data to estimate item difficulty parameters, and compares them with the real difficulties provided by NAEP. Result: Correlations between simulated and real student performance reach 0.75 (grade 4), 0.76 (grade 8), and 0.82 (grade 12); using named student personas stratified by gender and race improves predictions; mathematically weaker models (such as Gemma) predict more accurately than stronger ones (such as Llama and Qwen). Conclusion: Under the right conditions, open-source LLMs can predict the real-world difficulty of math items by simulating student responses, and stronger models are not always better, which suggests practical potential in educational assessment. Abstract: Standardized math assessments require expensive human pilot studies to establish the difficulty of test items. We investigate the predictive value of open-source large language models (LLMs) for evaluating the difficulty of multiple-choice math questions for real-world students. We show that, while LLMs are poor direct judges of problem difficulty, simulation-based approaches with LLMs yield promising results under the right conditions. Under the proposed approach, we simulate a "classroom" of 4th, 8th, or 12th grade students by prompting the LLM to role-play students of varying proficiency levels. We use the outcomes of these simulations to fit Item Response Theory (IRT) models, comparing learned difficulty parameters for items to their real-world difficulties, as determined by item-level statistics furnished by the National Assessment of Educational Progress (NAEP). We observe correlations as high as 0.75, 0.76, and 0.82 for grades 4, 8, and 12, respectively. In our simulations, we experiment with different "classroom sizes," showing tradeoffs between computation size and accuracy. We find that role-playing with named students improves predictions (compared to student IDs), and stratifying names across gender and race further improves predictions. Our results show that LLMs with relatively weaker mathematical abilities (Gemma) actually yield better real-world difficulty predictions than mathematically stronger models (Llama and Qwen), further underscoring the suitability of open-source models for the task.
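
Fitting an IRT model to simulated responses can be sketched with the simplest variant. The abstract does not say which IRT model is used, so the one-parameter (Rasch) model, the plain gradient-ascent fit, and the tiny response matrix below are all assumptions for illustration; a real study would more likely use a dedicated IRT library and far more simulated students.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_rasch(responses, steps=500, lr=0.05):
    """Joint maximum-likelihood fit of a Rasch model by gradient ascent.

    responses[s][i] is 1 if simulated student s answered item i correctly.
    P(correct) = sigmoid(ability[s] - difficulty[i]).
    """
    n_students, n_items = len(responses), len(responses[0])
    ability = [0.0] * n_students
    difficulty = [0.0] * n_items
    for _ in range(steps):
        grad_ability = [0.0] * n_students
        grad_difficulty = [0.0] * n_items
        for s, row in enumerate(responses):
            for i, correct in enumerate(row):
                residual = correct - sigmoid(ability[s] - difficulty[i])
                grad_ability[s] += residual
                grad_difficulty[i] -= residual
        ability = [a + lr * g for a, g in zip(ability, grad_ability)]
        difficulty = [d + lr * g for d, g in zip(difficulty, grad_difficulty)]
        # Fix the scale: shift both so difficulties average to zero.
        shift = sum(difficulty) / n_items
        ability = [a - shift for a in ability]
        difficulty = [d - shift for d in difficulty]
    return ability, difficulty

# Invented outcomes for six simulated students on three items.
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]
ability, difficulty = fit_rasch(responses)
print(difficulty)
```

The fitted difficulty parameters are what would then be correlated with the real item statistics; here the item answered correctly by the fewest simulated students receives the highest difficulty.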

[30] Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG

David Samuel Setiawan,Raphaël Merx,Jey Han Lau

Main category: cs.CL

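TL;DR: Proposes a hybrid framework that combines NMT with a large language model using Retrieval-Augmented Generation (RAG) to mitigate the drop in neural machine translation quality for low-resource languages under domain shift; experiments on Dhao show that the method almost fully recovers cross-domain translation quality.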

Details

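Motivation: Neural machine translation for low-resource languages degrades sharply under domain shift, and cross-domain generalization is severely limited when diverse training data is lacking (as with Dhao, which relies solely on the New Testament). Method: A hybrid framework: an NMT model fine-tuned on the New Testament produces an initial draft, which an LLM using Retrieval-Augmented Generation (RAG) then corrects; the analysis focuses on how the number of retrieved examples and the choice of retrieval algorithm affect performance. Result: On the unseen Old Testament domain, chrF++ rises from the NMT model's 27.11 to 35.21, a recovery of 8.10 that approaches the original in-domain score (36.17); the gain comes mainly from the number of retrieved examples rather than the choice of retrieval algorithm. Conclusion: The hybrid framework effectively compensates for the translation quality lost by low-resource languages under domain shift, and an LLM with RAG can act as a robust "safety net" that repairs severe NMT failures in zero-shot domains. Abstract: Neural Machine Translation (NMT) models for low-resource languages suffer significant performance degradation under domain shift. We quantify this challenge using Dhao, an indigenous language of Eastern Indonesia with no digital footprint beyond the New Testament (NT). When applied to the unseen Old Testament (OT), a standard NMT model fine-tuned on the NT drops from an in-domain score of 36.17 chrF++ to 27.11 chrF++. To recover this loss, we introduce a hybrid framework where a fine-tuned NMT model generates an initial draft, which is then refined by a Large Language Model (LLM) using Retrieval-Augmented Generation (RAG). The final system achieves 35.21 chrF++ (+8.10 recovery), effectively matching the original in-domain quality. Our analysis reveals that this performance is driven primarily by the number of retrieved examples rather than the choice of retrieval algorithm. Qualitative analysis confirms the LLM acts as a robust "safety net," repairing severe failures in zero-shot domains.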
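
The reported scores can be checked with a short calculation. The three chrF++ values come from the abstract; the derived quantities (the size of the domain-shift loss, the share of it that is recovered, and the remaining gap) are computed here and are not stated in the abstract itself.

```python
# chrF++ scores quoted in the abstract.
in_domain = 36.17      # NMT on the New Testament (in-domain)
out_of_domain = 27.11  # same NMT model on the unseen Old Testament
hybrid = 35.21         # NMT draft refined by the LLM with RAG

loss = in_domain - out_of_domain  # points lost to domain shift
gain = hybrid - out_of_domain     # points recovered by the hybrid system
recovered_share = gain / loss     # fraction of the loss that is recovered
remaining_gap = in_domain - hybrid

print(f"loss={loss:.2f} gain={gain:.2f} "
      f"recovered={recovered_share:.1%} gap={remaining_gap:.2f}")
```

So the +8.10 recovery corresponds to roughly 89% of the 9.06-point drop, leaving the hybrid system just under one chrF++ point short of in-domain quality.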

[31] SocraticKG: Knowledge Graph Construction via QA-Driven Fact Extraction

Sanghyeok Choi,Woosang Jeon,Kyuseok Yang,Taehyeong Kim

Main category: cs.CL

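TL;DR: SocraticKG proposes question-answer pairs as an intermediate representation for building knowledge graphs from unstructured text, effectively balancing the trade-off between factual coverage and structural coherence.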

Details

Motivation: Existing LLM-based knowledge graph construction methods face a trade-off between factual coverage and relational completeness, leading to information loss or fragmented relations. Method: Introduces 5W1H-guided question-answer expansion as a structured intermediate representation that systematically unfolds document-level semantics before triple extraction, with the QA pairs preserving contextual dependencies and implicit relations. Result: Evaluation on the MINE benchmark shows that the method retains more facts and keeps higher structural cohesion even as the volume of extracted knowledge grows substantially. Conclusion: QA-based semantic scaffolding organizes semantics effectively before knowledge graph construction, improving the coherence and reliability of the resulting graph. Abstract: Constructing Knowledge Graphs (KGs) from unstructured text provides a structured framework for knowledge representation and reasoning, yet current LLM-based approaches struggle with a fundamental trade-off: factual coverage often leads to relational fragmentation, while premature consolidation causes information loss. To address this, we propose SocraticKG, an automated KG construction method that introduces question-answer pairs as a structured intermediate representation to systematically unfold document-level semantics prior to triple extraction. By employing 5W1H-guided QA expansion, SocraticKG captures contextual dependencies and implicit relational links typically lost in direct KG extraction pipelines, providing explicit grounding in the source document that helps mitigate implicit reasoning errors. Evaluation on the MINE benchmark demonstrates that our approach effectively addresses the coverage-connectivity trade-off, achieving superior factual retention while maintaining high structural cohesion even as extracted knowledge volume substantially expands. These results highlight that QA-mediated semantic scaffolding plays a critical role in structuring semantics prior to KG extraction, enabling more coherent and reliable graph construction in subsequent stages.

[32] EHRNavigator: A Multi-Agent System for Patient-Level Clinical Question Answering over Heterogeneous Electronic Health Records

Lingfei Qian,Mauro Giuffre,Yan Wang,Huan He,Qianqian Xie,Xuguang Ai,Xeuqing Peng,Fan Ma,Ruey-Ling Weng,Donald Wright,Adan Wang,Qingyu Chen,Vipina K. Keloth,Hua Xu

Main category: cs.CL

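TL;DR: EHRNavigator is a multi-agent framework for patient-level question answering over heterogeneous and multimodal electronic health record (EHR) data, delivering efficient and accurate responses in real clinical settings.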

Details

Motivation: Existing natural language question-answering systems are evaluated mainly on benchmark datasets and lack relevance to actual clinical use, so they cope poorly with complex EHR structures and clinical decision needs in real hospital environments. Method: Proposes EHRNavigator, a multi-agent framework built on AI agents that answers questions over real EHR data under diverse schemas, temporal reasoning demands, and multimodal evidence integration, evaluated on both public benchmark and institutional datasets. Result: Achieves 86% accuracy on real-world cases with clinically acceptable response times, and shows good generalization. Conclusion: EHRNavigator bridges the gap between benchmark evaluation and clinical deployment, offering a robust, adaptive, and efficient solution for real-world EHR question answering. Abstract: Clinical decision-making increasingly relies on timely and context-aware access to patient information within Electronic Health Records (EHRs), yet most existing natural language question-answering (QA) systems are evaluated solely on benchmark datasets, limiting their practical relevance. To overcome this limitation, we introduce EHRNavigator, a multi-agent framework that harnesses AI agents to perform patient-level question answering across heterogeneous and multimodal EHR data. We assessed its performance using both public benchmark and institutional datasets under realistic hospital conditions characterized by diverse schemas, temporal reasoning demands, and multimodal evidence integration. Through quantitative evaluation and clinician-validated chart review, EHRNavigator demonstrated strong generalization, achieving 86% accuracy on real-world cases while maintaining clinically acceptable response times. Overall, these findings confirm that EHRNavigator effectively bridges the gap between benchmark evaluation and clinical deployment, offering a robust, adaptive, and efficient solution for real-world EHR question answering.

[33] EmplifAI: a Fine-grained Dataset for Japanese Empathetic Medical Dialogues in 28 Emotion Labels

Wan Jou She,Lis Kanashiro Pereira,Fei Cheng,Sakiko Yahata,Panote Siriaraya,Eiji Aramaki

Main category: cs.CL

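TL;DR: This paper introduces EmplifAI, a Japanese empathetic dialogue dataset that supports patients with chronic conditions through the emotional shifts across stages of disease management. It contains 280 medical situations and 4,125 two-turn dialogues, labeled with 28 fine-grained emotion categories based on the GoEmotions taxonomy.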

Details

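Motivation: Patients with chronic conditions experience complex, shifting emotions during disease management, and existing dialogue datasets do not capture these fine-grained emotional needs, so a more situational and emotionally fine-grained Japanese empathetic dialogue dataset is needed. Method: Adapts and validates 28 fine-grained emotion categories from the GoEmotions taxonomy; collects 280 medical situations and 4,125 two-turn dialogues through crowdsourcing and expert review; uses BERTScore across multiple LLMs to evaluate the emotional alignment of situation-dialogue pairs; fine-tunes the Japanese model LLM-jp to validate the dataset; and compares LLM-as-a-Judge scores with human ratings to validate the evaluation pipeline. Result: BERTScore reaches an F1 of 0.83 on emotional alignment; the fine-tuned LLM-jp model improves notably in fluency, general empathy, and emotion-specific empathy; LLM-as-a-Judge scores correlate with human ratings to some degree but also reveal potential biases and risks. Conclusion: EmplifAI is a high-quality, situational, fine-grained Japanese empathetic dialogue dataset that improves LLMs' emotional understanding and empathetic responses in medical settings, and it provides both empirical grounding and a cautionary note for automatic evaluation methods. Abstract: This paper introduces EmplifAI, a Japanese empathetic dialogue dataset designed to support patients coping with chronic medical conditions. These patients often experience a wide range of positive and negative emotions (e.g., hope and despair) that shift across different stages of disease management. EmplifAI addresses this complexity by providing situation-based dialogues grounded in 28 fine-grained emotion categories, adapted and validated from the GoEmotions taxonomy. The dataset includes 280 medically contextualized situations and 4,125 two-turn dialogues, collected through crowdsourcing and expert review. To evaluate emotional alignment in empathetic dialogues, we assessed model predictions on situation-dialogue pairs using BERTScore across multiple large language models (LLMs), achieving F1 scores of 0.83. Fine-tuning a baseline Japanese LLM (LLM-jp-3.1-13b-instruct4) with EmplifAI resulted in notable improvements in fluency, general empathy, and emotion-specific empathy. Furthermore, we compared the scores assigned by LLM-as-a-Judge and human raters on dialogues generated by multiple LLMs to validate our evaluation pipeline and discuss the insights and potential risks derived from the correlation analysis.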

[34] Long-Chain Reasoning Distillation via Adaptive Prefix Alignment

Zhenghao Liu,Zhuoyang Wu,Xinze Li,Yukun Yan,Shuo Wang,Zulong Chen,Yu Gu,Ge Yu,Maosong Sun

Main category: cs.CL

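TL;DR: Proposes P-ALIGN, a framework that distills a teacher model's reasoning trajectories through adaptive prefix alignment, improving small models on mathematical reasoning tasks.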

Details

Motivation: Teacher-generated reasoning trajectories are overly long and complex, which makes them hard for student models to learn from and creates a mismatch between the supervision signal and the student's learning capacity. Method: Proposes P-ALIGN, which adaptively truncates teacher-generated reasoning trajectories by judging whether the remaining suffix is concise and sufficient, then supervises the student model with the prefix to achieve effective prefix-alignment distillation. Result: On multiple mathematical reasoning benchmarks, P-ALIGN outperforms all baselines by over 3%; analysis shows that its prefixes provide more effective supervision and avoid the negative impact of redundant and uncertain reasoning components. Conclusion: P-ALIGN makes effective use of teacher reasoning trajectories for knowledge distillation and substantially improves the reasoning ability of small models. Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in solving complex mathematical problems. Recent studies show that distilling long reasoning trajectories can effectively enhance the reasoning performance of small-scale student models. However, teacher-generated reasoning trajectories are often excessively long and structurally complex, making them difficult for student models to learn. This mismatch leads to a gap between the provided supervision signal and the learning capacity of the student model. To address this challenge, we propose Prefix-ALIGNment distillation (P-ALIGN), a framework that fully exploits teacher CoTs for distillation through adaptive prefix alignment. Specifically, P-ALIGN adaptively truncates teacher-generated reasoning trajectories by determining whether the remaining suffix is concise and sufficient to guide the student model. Then, P-ALIGN leverages the teacher-generated prefix to supervise the student model, encouraging effective prefix alignment. Experiments on multiple mathematical reasoning benchmarks demonstrate that P-ALIGN outperforms all baselines by over 3%. Further analysis indicates that the prefixes constructed by P-ALIGN provide more effective supervision signals, while avoiding the negative impact of redundant and uncertain reasoning components. All code is available at https://github.com/NEUIR/P-ALIGN.

[35] Deriving Character Logic from Storyline as Codified Decision Trees

Letian Peng,Kun Zhou,Longfei Yun,Yupeng Hou,Jingbo Shang

Main category: cs.CL

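TL;DR: Proposes Codified Decision Trees (CDT), a data-driven framework that induces executable, interpretable decision structures from large-scale narrative data to make role-playing agents behave more consistently and reliably.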

Details

Motivation: Behavioral profiles for existing role-playing agents are largely unstructured, non-executable, and weakly validated, which makes agent behavior brittle. A more reliable, verifiable, structured way of modeling character behavior is needed. Method: CDT iteratively induces candidate scene-action rules from large-scale narrative data, validates them, and builds a decision tree through hierarchical specialization: internal nodes represent validated scene conditions and leaves encode data-grounded behavioral statements, enabling retrieval of context-appropriate behavior. Result: On multiple benchmarks covering 85 characters across 16 works, CDT substantially outperforms human-written profiles and prior profile induction methods, yielding more consistent and better-grounded agent behavior. Conclusion: Structured, executable, data-validated behavioral representations such as CDT improve the reliability and maintainability of role-playing agents and offer a feasible path toward trustworthy virtual characters. Abstract: Role-playing (RP) agents rely on behavioral profiles to act consistently across diverse narrative contexts, yet existing profiles are largely unstructured, non-executable, and weakly validated, leading to brittle agent behavior. We propose Codified Decision Trees (CDT), a data-driven framework that induces an executable and interpretable decision structure from large-scale narrative data. CDT represents behavioral profiles as a tree of conditional rules, where internal nodes correspond to validated scene conditions and leaves encode grounded behavioral statements, enabling deterministic retrieval of context-appropriate rules at execution time. The tree is learned by iteratively inducing candidate scene-action rules, validating them against data, and refining them through hierarchical specialization, yielding profiles that support transparent inspection and principled updates. Across multiple benchmarks, CDT substantially outperforms human-written profiles and prior profile induction methods on 85 characters across 16 artifacts, indicating that codified and validated behavioral representations lead to more reliable agent grounding.
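
The tree-of-conditional-rules idea described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the `Node` class, the `retrieve` function, and the example scene keys (`location`, `ally_status`) are hypothetical names chosen for illustration, and the rules are hand-written here rather than induced and validated from narrative data.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Scene = Dict[str, str]  # hypothetical scene description, e.g. {"location": "court"}


@dataclass
class Node:
    # Predicate over the scene; internal nodes act purely as conditions.
    condition: Callable[[Scene], bool]
    # Behavioral statements; in this sketch only leaves carry them.
    statements: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def retrieve(node: Node, scene: Scene) -> List[str]:
    """Walk the tree in a fixed order and collect statements from every
    leaf whose chain of conditions holds for the scene."""
    if not node.condition(scene):
        return []
    found = list(node.statements)
    for child in node.children:
        found.extend(retrieve(child, scene))
    return found


# Hand-written toy profile (an assumption, not data from the paper).
tree = Node(
    condition=lambda s: True,
    children=[
        Node(
            condition=lambda s: s.get("location") == "battlefield",
            children=[
                Node(
                    condition=lambda s: s.get("ally_status") == "injured",
                    statements=["Retreats to carry the injured ally."],
                ),
            ],
        ),
        Node(
            condition=lambda s: s.get("location") == "court",
            statements=["Defers to the monarch."],
        ),
    ],
)

rules = retrieve(tree, {"location": "battlefield", "ally_status": "injured"})
```

Because traversal order and conditions are fixed, the same scene always yields the same rules, which is the property the abstract calls deterministic retrieval.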

[36] Is MT Ready for the Next Crisis or Pandemic?

Vipasha Bansal,Elizabeth Brown,Chelsea Kendrick,Benjamin Pong,William D. Lewis

Main category: cs.CL

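TL;DR: This study evaluates how well four commercial machine translation systems handle crisis and medical content in low-resource languages, using the TICO-19 dataset to assess their usability and readiness for the next pandemic.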

Details

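Motivation: To address communication barriers caused by language differences between governments, aid providers, doctors, and aid recipients, and to improve information delivery in crisis and medical settings. Method: Evaluates four commercial machine translation systems on TICO-19, a pandemic-related corpus covering many high-priority languages, and assesses the usability of the output translations. Result: Reveals how current commercial machine translation systems perform on pandemic-related translation in low-resource languages, along with their practical limitations and room for improvement. Conclusion: Commercial machine translation is an important tool for crisis communication, but it still falls short in low-resource languages and specialized domains and needs further improvement to support the response to future public health emergencies. Abstract: Communication in times of crisis is essential. However, there is often a mismatch between the language of governments, aid providers, doctors, and those to whom they are providing aid. Commercial MT systems are reasonable tools to turn to in these scenarios. But how effective are these tools for translating to and from low resource languages, particularly in the crisis or medical domain? In this study, we evaluate four commercial MT systems using the TICO-19 dataset, which is composed of pandemic-related sentences from a large set of high priority languages spoken by communities most likely to be affected adversely in the next pandemic. We then assess the current degree of "readiness" for another pandemic (or epidemic) based on the usability of the output translations.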

[37] CALM-IT: Generating Realistic Long-Form Motivational Interviewing Dialogues with Dual-Actor Conversational Dynamics Tracking

Viet Cuong Nguyen,Nhi Yen Nguyen,Kristin A. Candan,Mary Conlon,Vanessa Rumie,Kristen Risola,Srijan Kumar,Munmun De Choudhury

Main category: cs.CL

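TL;DR: This paper proposes CALM-IT, a framework for generating and evaluating long-form Motivational Interviewing dialogues. By modeling dual-actor conversational dynamics, it improves the long-horizon coherence and therapeutic goal alignment of LLMs in mental health settings.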

Details

Motivation: LLMs used in mental health applications struggle to sustain long, goal-directed dialogue; local next-turn optimization makes them brittle and prone to long-horizon drift. Method: Proposes CALM-IT, which models therapist-client interaction as a bidirectional state-space process in which both parties continuously update inferred alignment, mental states, and short-term goals to guide strategy selection and utterance generation. Result: In large-scale evaluations, CALM-IT outperforms strong baselines in Effectiveness and Goal Alignment and remains more stable as conversations lengthen; it initiates fewer therapist redirections yet achieves the highest client acceptance rate (64.3%), indicating more precise intervention timing. Conclusion: Modeling the evolving conversational state is essential for generating high-quality long-form synthetic dialogues. Abstract: Large Language Models (LLMs) are increasingly used in mental health-related settings, yet they struggle to sustain realistic, goal-directed dialogue over extended interactions. While LLMs generate fluent responses, they optimize locally for the next turn rather than maintaining a coherent model of therapeutic progress, leading to brittleness and long-horizon drift. We introduce CALM-IT, a framework for generating and evaluating long-form Motivational Interviewing (MI) dialogues that explicitly models dual-actor conversational dynamics. CALM-IT represents therapist-client interaction as a bidirectional state-space process, in which both agents continuously update inferred alignment, mental states, and short-term goals to guide strategy selection and utterance generation. Across large-scale evaluations, CALM-IT consistently outperforms strong baselines in Effectiveness and Goal Alignment and remains substantially more stable as conversation length increases. Although CALM-IT initiates fewer therapist redirections, it achieves the highest client acceptance rate (64.3%), indicating more precise and therapeutically aligned intervention timing. Overall, CALM-IT provides evidence that modeling evolving conversational state is essential for generating high-quality long-form synthetic conversations.

[38] SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature

Yiming Ren,Junjie Wang,Yuxin Meng,Yihang Shi,Zhiqiang Lin,Ruihang Chu,Yiran Xu,Ziming Li,Yunfei Zhao,Zihan Wang,Yu Qiao,Ruiming Tang,Minghao Liu,Yujiu Yang

Main category: cs.CL

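TL;DR: Proposes the "Fish-in-the-Ocean" (FITO) paradigm, which evaluates how well multimodal LLMs understand scientific papers by requiring cross-modal evidence chains, and releases SIN-Data and SIN-Bench to measure models on evidence discovery, hypothesis verification, QA, and summarization, revealing that current models fall short on traceable support.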

Details

Motivation: Existing evaluations, such as answer-only matching or synthetic "Needle-In-A-Haystack" tests, cannot tell whether a model truly understands a long scientific paper and builds a causal reasoning chain, because they do not explicitly require cross-modal evidence chains. Method: Proposes the FITO paradigm; builds SIN-Data, a corpus that preserves the native interleaving of text and figures; designs SIN-Bench with four progressive tasks; and introduces "No Evidence, No Score", which scores predictions against verifiable anchors and diagnoses evidence quality in terms of matching, relevance, and logic. Result: Experiments on eight MLLMs show that Gemini-3-pro has the best average overall score (0.573), while GPT-5 has the highest SIN-QA answer accuracy (0.767) but lower evidence-aligned overall scores, exposing a gap between correctness and traceable support. Conclusion: Whether a model can produce verifiable cross-modal evidence chains is a key bottleneck in its understanding of scientific literature; future work should strengthen evidence discovery and logical coherence. Abstract: Evaluating whether multimodal large language models truly understand long-form scientific papers remains challenging: answer-only metrics and synthetic "Needle-In-A-Haystack" tests often reward answer matching without requiring a causal, evidence-linked reasoning trace in the document. We propose the "Fish-in-the-Ocean" (FITO) paradigm, which requires models to construct explicit cross-modal evidence chains within native scientific documents. To operationalize FITO, we build SIN-Data, a scientific interleaved corpus that preserves the native interleaving of text and figures. On top of it, we construct SIN-Bench with four progressive tasks covering evidence discovery (SIN-Find), hypothesis verification (SIN-Verify), grounded QA (SIN-QA), and evidence-anchored synthesis (SIN-Summary). We further introduce "No Evidence, No Score", scoring predictions when grounded to verifiable anchors and diagnosing evidence quality via matching, relevance, and logic. Experiments on eight MLLMs show that grounding is the primary bottleneck: Gemini-3-pro achieves the best average overall score (0.573), while GPT-5 attains the highest SIN-QA answer accuracy (0.767) but underperforms on evidence-aligned overall scores, exposing a gap between correctness and traceable support.
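
To make the "No Evidence, No Score" principle concrete, here is a minimal sketch of an evidence-gated score. The abstract does not give the benchmark's actual formula, so everything here is an assumption: anchors are modeled as string identifiers, evidence matching is approximated by an F1 overlap with gold anchors, and the names `anchor_f1` and `gated_score` are hypothetical. The relevance and logic diagnostics mentioned in the abstract are not modeled.

```python
from typing import Set


def anchor_f1(predicted: Set[str], gold: Set[str]) -> float:
    """F1 overlap between cited anchors and gold anchors (assumed matching metric)."""
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def gated_score(answer_correct: bool, predicted: Set[str], gold: Set[str]) -> float:
    """Give no credit when no cited anchor can be verified; otherwise
    weight answer correctness by how well the evidence matches."""
    matching = anchor_f1(predicted, gold)
    if matching == 0.0:
        return 0.0  # no evidence, no score
    return (1.0 if answer_correct else 0.0) * matching
```

Under this sketch a correct answer with no supporting anchors scores zero, which mirrors the gap the abstract reports between answer accuracy and evidence-aligned scores.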

[39] Skill-Aware Data Selection and Fine-Tuning for Data-Efficient Reasoning Distillation

Lechen Zhang,Yunxiang Zhang,Wei Hu,Lu Wang

Main category: cs.CL

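TL;DR: Proposes a skill-centric distillation framework that uses skill-based data selection and skill-aware fine-tuning to improve the reasoning ability of small models with only 1,000 training examples.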

Details

Motivation: Conventional distillation requires large amounts of data for supervised fine-tuning and is not data-efficient, so a more efficient way of transferring the complex reasoning ability of large models is needed. Method: Designs a skill-centric distillation framework with two components: skill-based data selection, which prioritizes examples targeting the student model's weaker skills, and skill-aware fine-tuning, which makes skill decomposition explicit during problem solving. Result: With only 1,000 examples, the method surpasses random SFT baselines by +1.6% on Qwen3-4B and +1.4% on Qwen3-8B, and the gains concentrate on the skills emphasized during training. Conclusion: Skill-centric training improves the data efficiency of reasoning transfer and offers a more targeted path for model distillation. Abstract: Large reasoning models such as DeepSeek-R1 and their distilled variants achieve strong performance on complex reasoning tasks. Yet, distilling these models often demands large-scale data for supervised fine-tuning (SFT), motivating the pursuit of data-efficient training methods. To address this, we propose a skill-centric distillation framework that efficiently transfers reasoning ability to weaker models with two components: (1) Skill-based data selection, which prioritizes examples targeting the student model's weaker skills, and (2) Skill-aware fine-tuning, which encourages explicit skill decomposition during problem solving. With only 1,000 training examples selected from a 100K teacher-generated corpus, our method surpasses random SFT baselines by +1.6% on Qwen3-4B and +1.4% on Qwen3-8B across five mathematical reasoning benchmarks. Further analysis confirms that these gains concentrate on skills emphasized during training, highlighting the effectiveness of skill-centric training for efficient reasoning distillation.
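
A minimal sketch of skill-based data selection follows. It assumes each candidate example is tagged with a single skill and that the student's per-skill accuracy has already been measured; the function name `select_examples`, the tuple layout, and the rule of ranking by lowest accuracy are illustrative assumptions, not the selection procedure from the paper.

```python
from typing import Dict, List, Tuple


def select_examples(
    pool: List[Tuple[str, str]],       # (example_id, skill) pairs, hypothetical layout
    skill_accuracy: Dict[str, float],  # student's measured accuracy per skill
    budget: int,
) -> List[str]:
    """Pick up to `budget` examples, preferring skills where the student is weakest.
    Skills with no measured accuracy are treated as weakest (accuracy 0.0)."""
    # sorted() is stable, so examples with equal accuracy keep their pool order.
    ranked = sorted(pool, key=lambda item: skill_accuracy.get(item[1], 0.0))
    return [example_id for example_id, _ in ranked[:budget]]


pool = [("a", "algebra"), ("b", "geometry"), ("c", "algebra"), ("d", "probability")]
accuracy = {"algebra": 0.9, "geometry": 0.4, "probability": 0.6}
chosen = select_examples(pool, accuracy, budget=2)
```

A real pipeline would likely also balance coverage so that one weak skill does not consume the whole budget; that refinement is left out to keep the sketch short.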

[40] Role-Playing Agents Driven by Large Language Models: Current Status, Challenges, and Future Trends

Ye Wang,Jiaxing Chen,Hongjiang Xiao

Main category: cs.CL

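TL;DR: This paper systematically surveys the development, key technologies, and future directions of role-playing language agents (RPLAs), covering the evolution from early template-based methods to cognitive simulation, along with core issues in character modeling, memory mechanisms, behavior control, data construction, and multi-dimensional evaluation.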

Details

Motivation: With the advance of LLMs, role-playing language agents have become a research focus at the intersection of NLP and human-computer interaction, and a systematic review of their technical pathways, data, and evaluation is needed to move the field forward. Method: A literature survey that traces the technical evolution of RPLAs, summarizes key techniques such as psychological scale-driven character modeling, memory-augmented prompting, and motivation-situation-based behavior control, and analyzes methods for building role-specific corpora and multi-dimensional evaluation frameworks. Result: Summarizes the key technical pathways in character modeling, memory mechanisms, and behavioral decision-making; analyzes the data sources, copyright constraints, and annotation challenges of role corpora; and collates evaluation frameworks covering role knowledge, personality fidelity, value alignment, and hallucination control, together with their strengths and weaknesses. Conclusion: RPLAs have progressed from rule-based templates to cognitive simulation and are expected to move toward personality evolution, multi-agent narrative, multimodal interaction, and integration with cognitive neuroscience; the paper offers a systematic perspective and methodological support for subsequent research. Abstract: In recent years, with the rapid advancement of large language models (LLMs), role-playing language agents (RPLAs) have emerged as a prominent research focus at the intersection of natural language processing (NLP) and human-computer interaction. This paper systematically reviews the current development and key technologies of RPLAs, delineating the technological evolution from early rule-based template paradigms, through the language style imitation stage, to the cognitive simulation stage centered on personality modeling and memory mechanisms. It summarizes the critical technical pathways supporting high-quality role-playing, including psychological scale-driven character modeling, memory-augmented prompting mechanisms, and motivation-situation-based behavioral decision control. At the data level, the paper further analyzes the methods and challenges of constructing role-specific corpora, focusing on data sources, copyright constraints, and structured annotation processes. In terms of evaluation, it collates multi-dimensional assessment frameworks and benchmark datasets covering role knowledge, personality fidelity, value alignment, and interactive hallucination, while commenting on the advantages and disadvantages of methods such as human evaluation, reward models, and LLM-based scoring. Finally, the paper outlines future development directions of role-playing agents, including personality evolution modeling, multi-agent collaborative narrative, multimodal immersive interaction, and integration with cognitive neuroscience, aiming to provide a systematic perspective and methodological insights for subsequent research.

[41] ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

Yutao Mou,Zhangchi Xue,Lijun Li,Peiyang Liu,Shikun Zhang,Wei Ye,Jing Shao

Main category: cs.CL

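TL;DR: This paper proposes a new approach to detecting unsafe tool invocations by LLM agents: it builds the TS-Bench benchmark, trains the TS-Guard guardrail model, and introduces TS-Flow, a feedback-driven reasoning framework, substantially reducing harmful actions while raising benign task completion.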

Details

Motivation: As LLM agents become more capable of interacting with environments through external tools, their security risks grow as well, so unsafe tool invocations must be monitored in real time and blocked before execution. Method: Builds TS-Bench, a benchmark for step-level tool invocation safety detection; trains the TS-Guard model with multi-task reinforcement learning to predict unsafe invocations; and designs TS-Flow, a reasoning framework driven by guardrail feedback. Result: TS-Guard judges request harmfulness and action-attack correlations from the interaction history; TS-Flow reduces harmful tool invocations of ReAct-style agents by 65% on average and improves benign task completion by about 10% under prompt injection attacks. Conclusion: The work provides an interpretable and generalizable safety guardrail for LLM agents that balances safety and functionality. Abstract: While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.

[42] What Gets Activated: Uncovering Domain and Driver Experts in MoE Language Models

Guimin Hu,Meng Li,Qiwei Peng,Lijie Hu,Boyan Xu,Ruichu Cai

Main category: cs.CL

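TL;DR: This paper studies expert activation in MoE LLMs. It distinguishes domain experts from driver experts, analyzes their roles with entropy-based and causal-effect metrics, and finds that earlier tokens are more likely to trigger driver experts and that adjusting the weights of both expert types improves model performance.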

Details

Motivation: Existing Transformer interpretability work, inspired in part by functional specialization in the human brain, focuses mostly on layer- or neuron-level mechanisms, while expert-level behavior in MoE models remains underexplored; studying expert activation patterns can make these models more interpretable. Method: Introduces an entropy-based metric to identify experts with domain preferences and a causal-effect metric to identify driver experts that strongly influence the output, and analyzes how individual tokens are associated with the activation of specific experts. Result: (1) Some experts show clear domain preferences, while others exert strong causal influence on model output; (2) tokens occurring earlier in a sentence are more likely to trigger driver experts; (3) adjusting the weights of domain and driver experts yields significant performance gains across three models and domains. Conclusion: The study reveals how experts divide labor inside MoE models, clarifies the distinct roles of domain and driver experts, improves MoE interpretability, and suggests new directions for model optimization. Abstract: Most interpretability work focuses on layer- or neuron-level mechanisms in Transformers, leaving expert-level behavior in MoE LLMs underexplored. Motivated by functional specialization in the human brain, we analyze expert activation by distinguishing domain and driver experts. In this work, we study expert activation in MoE models across three public domains and address two key questions: (1) which experts are activated, and whether certain expert types exhibit consistent activation patterns; and (2) how tokens are associated with and trigger the activation of specific experts. To answer these questions, we introduce entropy-based and causal-effect metrics to assess whether an expert is strongly favored for a particular domain, and how strongly expert activation contributes causally to the model's output, thereby identifying domain and driver experts, respectively. Furthermore, we explore how individual tokens are associated with the activation of specific experts. Our analysis reveals that (1) among the activated experts, some show clear domain preferences, while others exert strong causal influence on model performance, underscoring their decisive roles; (2) tokens occurring earlier in a sentence are more likely to trigger the driver experts; and (3) adjusting the weights of domain and driver experts leads to significant performance gains across all three models and domains. These findings shed light on the internal mechanisms of MoE models and enhance their interpretability.
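
One plausible form of an entropy-based domain-preference metric is sketched below. The abstract does not define the metric, so this is an assumption: it takes how often the router selected a given expert in each domain, normalizes the counts into a distribution, and reports Shannon entropy scaled to the range 0 to 1. The function name `domain_entropy` and the example domains are hypothetical.

```python
import math
from typing import Dict


def domain_entropy(activations: Dict[str, int]) -> float:
    """Normalized Shannon entropy of one expert's activation counts across domains.
    Values near 0 suggest a strong preference for one domain; values near 1
    suggest the expert is used about equally everywhere."""
    total = sum(activations.values())
    if total <= 0 or len(activations) < 2:
        raise ValueError("need positive activation counts from at least two domains")
    entropy = 0.0
    for count in activations.values():
        if count > 0:  # zero-count domains contribute nothing to the sum
            p = count / total
            entropy -= p * math.log(p)
    return entropy / math.log(len(activations))


specialist = domain_entropy({"math": 100, "law": 0, "code": 0})
generalist = domain_entropy({"math": 10, "law": 10, "code": 10})
```

A threshold on this score could then flag candidate domain experts, while driver experts would require a separate intervention-based measurement that this sketch does not cover.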

[43] Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

Cameron Tice,Puria Radmard,Samuel Ratnam,Andy Kim,David Africa,Kyle O'Brien

Main category: cs.CL

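TL;DR: This paper studies how discussion of AI behavior in pretraining corpora affects LLM alignment. Negative descriptions increase misaligned behavior while positive descriptions markedly improve alignment; the authors introduce the notion of "self-fulfilling alignment" and recommend addressing alignment during pretraining.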

Details

Motivation: Pretraining corpora contain extensive discussion of AI systems; if that discussion skews negative, models may internalize negative behavioral priors and become misaligned in a self-fulfilling way, yet this causal influence has not been studied systematically. Method: In controlled experiments, pretrains 6.9B-parameter language models from scratch with varying proportions of text about aligned and misaligned AI behavior, and evaluates behavioral alignment before and after post-training. Result: Adding text about AI misalignment leads to more misaligned behavior; adding text about aligned behavior reduces misalignment scores from 45% to 9%; the effects are dampened after post-training but persist. Conclusion: Descriptions of AI behavior in pretraining data significantly affect model alignment, supporting the "self-fulfilling alignment" hypothesis; alignment should be designed into pretraining, since post-training alone does not fully correct the behavioral priors formed there. Abstract: Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This paper provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse. We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training. Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners pretrain for alignment as well as capabilities. Our models and datasets are available at alignmentpretraining.ai

[44] AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers

Prachuryya Kaushik,Ashish Anand

Main category: cs.CL

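TL;DR: AWED-FiNER is an open-source ecosystem for fine-grained named entity recognition (FgNER) in 36 languages spoken by more than 6.6 billion people, with particular attention to low-resource and vulnerable languages. It provides agentic toolkits, web applications, and lightweight expert models for efficient, offline-capable multilingual FgNER.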

Details

Motivation: Existing LLMs perform poorly on low-resource languages and fine-grained NLP tasks and offer little support for vulnerable languages; AWED-FiNER aims to fill this gap. Method: Builds an ecosystem of agentic tools, web applications, and 49 small open-source expert models; the agent routes multilingual text to specialized models for fast FgNER annotation, and the models support offline deployment. Result: Delivers an efficient FgNER system covering 36 languages (including vulnerable languages such as Bodo and Manipuri) that annotates within seconds, runs on edge devices, and ships openly accessible tools and models. Conclusion: AWED-FiNER offers a practical, scalable, and accessible solution for FgNER across languages, especially low-resource and vulnerable ones, making NLP technology more inclusive. Abstract: We introduce AWED-FiNER, an open-source ecosystem designed to bridge the gap in Fine-grained Named Entity Recognition (FgNER) for 36 global languages spoken by more than 6.6 billion people. While Large Language Models (LLMs) dominate general Natural Language Processing (NLP) tasks, they often struggle with low-resource languages and fine-grained NLP tasks. AWED-FiNER provides a collection of agentic toolkits, web applications, and several state-of-the-art expert models that provide FgNER solutions across 36 languages. The agentic tools route multilingual text to specialized expert models and fetch FgNER annotations within seconds. The web-based platforms provide a ready-to-use FgNER annotation service for non-technical users. Moreover, the collection of language-specific, extremely small open-source state-of-the-art expert models facilitates offline deployment in resource-constrained scenarios, including edge devices. AWED-FiNER covers languages spoken by over 6.6 billion people, including a specific focus on vulnerable languages such as Bodo, Manipuri, Bishnupriya, and Mizo. The resources can be accessed here: Agentic Tool (https://github.com/PrachuryyaKaushik/AWED-FiNER), Web Application (https://hf.co/spaces/prachuryyaIITG/AWED-FiNER), and 49 Expert Detector Models (https://hf.co/collections/prachuryyaIITG/awed-finer).

[45] Credit C-GPT: A Domain-Specialized Large Language Model for Conversational Understanding in Vietnamese Debt Collection

Nhung Nguyen Thi Hong,Cuong Nguyen Dang,Tri Le Ngoc

Main category: cs.CL

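TL;DR: This paper presents Credit C-GPT, a 7-billion-parameter LLM specialized for Vietnamese debt collection conversations. It unifies dialogue understanding, sentiment recognition, intent detection, and related tasks, and outperforms traditional pipeline-based approaches.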

Details

Motivation: Traditional NLP systems struggle with the informal spoken language, emotional variability, and complex domain-specific reasoning found in Vietnamese debt collection conversations. Method: Builds and fine-tunes Credit C-GPT, a 7-billion-parameter domain-specialized LLM that integrates multiple conversational intelligence tasks in a single framework, with dedicated data construction, annotation, and training procedures. Result: Experiments on proprietary human-annotated datasets show that Credit C-GPT improves consistently over traditional pipeline-based approaches, with better dialogue understanding and structured information extraction. Conclusion: Domain-specialized LLMs can give contact centers a scalable, privacy-aware solution for real-time assistance and post-call analytics. Abstract: Debt collection is a critical function within the banking, financial services, and insurance (BFSI) sector, relying heavily on large-scale human-to-human conversational interactions conducted primarily in Vietnamese contact centers. These conversations involve informal spoken language, emotional variability, and complex domain-specific reasoning, which pose significant challenges for traditional natural language processing systems. This paper introduces Credit C-GPT, a domain-specialized large language model with seven billion parameters, fine-tuned for conversational understanding in Vietnamese debt collection scenarios. The proposed model integrates multiple conversational intelligence tasks, including dialogue understanding, sentiment recognition, intent detection, call stage classification, and structured slot-value extraction, within a single reasoning-based framework. We describe the data construction process, annotation strategy, and training methodology, and evaluate the model on proprietary human-annotated datasets. Experimental results show consistent improvements over traditional pipeline-based approaches, indicating that domain-specialized conversational language models provide a scalable and privacy-aware solution for real-time assistance and post-call analytics in enterprise contact centers.

[46] HOMURA: Taming the Sand-Glass for Time-Constrained LLM Translation via Reinforcement Learning

Ziang Cui,Mengran Yu,Tianjiao Li,Chenyu Shi,Yingxuan Shi,Lusheng Zhang,Hongwei Lin

Main category: cs.CL

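TL;DR: This paper addresses the cross-lingual verbosity bias of LLMs in multilingual translation. It introduces the Sand-Glass benchmark and the HOMURA reinforcement learning framework to optimize the trade-off between semantic preservation and temporal compliance.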

Details

Motivation: LLMs translate well across languages but are held back in strictly time-constrained tasks such as subtitling and dubbing by a systematic cross-lingual verbosity bias, and existing prompt-engineering approaches struggle to reconcile semantic fidelity with temporal feasibility. Method: Proposes the Sand-Glass benchmark for evaluating translation under syllable-level duration constraints, and designs HOMURA, a reinforcement learning framework that uses a KL-regularized objective and a novel dynamic syllable-ratio reward to explicitly balance semantic preservation against temporal compliance. Result: The method significantly outperforms strong LLM baselines, achieving precise length control that respects linguistic density hierarchies without compromising semantic adequacy. Conclusion: HOMURA mitigates the verbosity of LLMs in time-constrained translation and offers a new approach to high-quality multilingual translation under duration constraints. Abstract: Large Language Models (LLMs) have achieved remarkable strides in multilingual translation but are hindered by a systemic cross-lingual verbosity bias, rendering them unsuitable for strict time-constrained tasks like subtitling and dubbing. Current prompt-engineering approaches struggle to resolve this conflict between semantic fidelity and rigid temporal feasibility. To bridge this gap, we first introduce Sand-Glass, a benchmark specifically designed to evaluate translation under syllable-level duration constraints. Furthermore, we propose HOMURA, a reinforcement learning framework that explicitly optimizes the trade-off between semantic preservation and temporal compliance. By employing a KL-regularized objective with a novel dynamic syllable-ratio reward, HOMURA effectively "tames" the output length. Experimental results demonstrate that our method significantly outperforms strong LLM baselines, achieving precise length control that respects linguistic density hierarchies without compromising semantic adequacy.

[47] HUMANLLM: Benchmarking and Reinforcing LLM Anthropomorphism via Human Cognitive Patterns

Xintao Wang,Jian Yang,Weiyuan Li,Rui Xie,Jen-tse Huang,Jun Gao,Shuai Huang,Yueping Kang,Liyuan Gou,Hongwei Feng,Yanghua Xiao

Main category: cs.CL

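TL;DR: HUMANLLM is a new framework that makes role-playing language agents more authentic by modeling the causal interactions among human psychological patterns. It performs strongly on multi-pattern dynamics, outperforms a larger model, and underscores the importance of cognitive modeling for genuine anthropomorphism.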

Details

Motivation: Existing role-playing language agents struggle to align authentically with human cognitive and behavioral patterns and do not simulate the underlying psychological processes in depth. Method: Constructs 244 psychological patterns from about 12,000 academic papers and synthesizes 11,359 scenarios in which 2-5 patterns interact, expressed as multi-turn conversations containing inner thoughts, actions, and dialogue; proposes dual-level checklists that evaluate individual pattern fidelity and emergent multi-pattern dynamics. Result: HUMANLLM-8B outperforms Qwen3-32B on multi-pattern dynamics despite having 4x fewer parameters; the evaluation shows that holistic metrics can conflate simulation accuracy with social desirability, and that modeling psychological processes is key to authentic anthropomorphism. Conclusion: Authentic anthropomorphism requires simulating not only human behavior but also the psychological processes that generate it; cognitive modeling is central to building high-fidelity role-playing agents. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning and generation, serving as the foundation for advanced persona simulation and Role-Playing Language Agents (RPLAs). However, achieving authentic alignment with human cognitive and behavioral patterns remains a critical challenge for these agents. We present HUMANLLM, a framework treating psychological patterns as interacting causal forces. We construct 244 patterns from ~12,000 academic papers and synthesize 11,359 scenarios where 2-5 patterns reinforce, conflict, or modulate each other, with multi-turn conversations expressing inner thoughts, actions, and dialogue. Our dual-level checklists evaluate both individual pattern fidelity and emergent multi-pattern dynamics, achieving strong human alignment (r=0.91) while revealing that holistic metrics conflate simulation accuracy with social desirability. HUMANLLM-8B outperforms Qwen3-32B on multi-pattern dynamics despite 4x fewer parameters, demonstrating that authentic anthropomorphism requires cognitive modeling: simulating not just what humans do, but the psychological processes generating those behaviors.

[48] One Instruction Does Not Fit All: How Well Do Embeddings Align Personas and Instructions in Low-Resource Indian Languages?

Arya Shah,Himanshu beniwal,Mayank Singh

Main category: cs.CL

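TL;DR: Presents a unified benchmark spanning 12 Indian languages and four evaluation tasks for assessing how well multilingual embedding models align personas with instructions in culturally grounded settings, and reports reproducible baselines.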

Details

Motivation: Existing benchmarks either focus on a single language or conflate retrieval with generation, leaving open whether current embedding models can encode persona-instruction compatibility without relying on response generation. Method: Builds a unified benchmark covering 12 Indian languages and four tasks (monolingual and cross-lingual persona-to-instruction retrieval, reverse retrieval, and binary compatibility classification), and evaluates eight multilingual embedding models in a frozen-encoder setting with a thin logistic regression head. Result: E5-Large-Instruct performs best on monolingual retrieval (Recall@1 27.4%) and cross-lingual transfer (20.7%), BGE-M3 leads reverse retrieval (32.1%), and LaBSE reaches 75.3% AUROC on classification with strong calibration. Conclusion: The study offers practical guidance for model selection in Indic multilingual settings and establishes reproducible baselines for future work. Abstract: Aligning multilingual assistants with culturally grounded user preferences is essential for serving India's linguistically diverse population of over one billion speakers across multiple scripts. However, existing benchmarks either focus on a single language or conflate retrieval with generation, leaving open the question of whether current embedding models can encode persona-instruction compatibility without relying on response synthesis. We present a unified benchmark spanning 12 Indian languages and four evaluation tasks: monolingual and cross-lingual persona-to-instruction retrieval, reverse retrieval from instruction to persona, and binary compatibility classification. Eight multilingual embedding models are evaluated in a frozen-encoder setting with a thin logistic regression head for classification. E5-Large-Instruct achieves the highest Recall@1 of 27.4% on monolingual retrieval and 20.7% on cross-lingual transfer, while BGE-M3 leads reverse retrieval at 32.1% Recall@1. For classification, LaBSE attains 75.3% AUROC with strong calibration. These findings offer practical guidance for model selection in Indic multilingual retrieval and establish reproducible baselines for future work. Code, datasets, and models are publicly available at https://github.com/aryashah2k/PI-Indic-Align.
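
For readers unfamiliar with the retrieval metric reported above, the following sketch shows how Recall@1 can be computed from precomputed embeddings using cosine similarity. The vectors are toy values and the names `cosine` and `recall_at_1` are illustrative; the benchmark's actual evaluation code may differ in details such as tie handling or batching.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_1(
    queries: List[Sequence[float]],     # e.g. persona embeddings
    candidates: List[Sequence[float]],  # e.g. instruction embeddings
    gold: List[int],                    # index of the correct candidate per query
) -> float:
    """Fraction of queries whose most similar candidate is the gold one."""
    hits = 0
    for query, target in zip(queries, gold):
        best = max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))
        hits += int(best == target)
    return hits / len(queries)


# Toy embeddings (assumed values): the third query is closest to the wrong candidate.
queries = [[1.0, 0.0], [0.0, 1.0], [0.2, 1.0]]
candidates = [[0.9, 0.1], [0.1, 0.9]]
score = recall_at_1(queries, candidates, gold=[0, 1, 0])
```

Reverse retrieval, from instruction to persona, uses the same function with the roles of `queries` and `candidates` swapped.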

[49] GeoSteer: Faithful Chain-of-Thought Steering via Latent Manifold Gradients

Kentaro Kazama,Daiki Shirafuji,Tatsuhiko Saito

Main category: cs.CL

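TL;DR: This paper proposes GeoSteer, a manifold-based framework that steers the hidden states of LLMs in a latent space to improve the quality of intermediate steps in multi-step reasoning.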

Details

Motivation: LLMs often produce logically inconsistent steps during multi-step reasoning, which undermines reliability, so the quality of intermediate reasoning needs to improve. Method: Constructs a chain-of-thought (CoT) dataset with segment-level scores, trains a variational autoencoder (VAE) and a quality estimation model to learn a low-dimensional manifold of high-quality CoT trajectories, and steers the target model's hidden states toward higher-quality regions. Result: On GSM8k with the Qwen3 series, GeoSteer improves accuracy by up to 2.6 points and the pairwise win rate by 5.3 points. Conclusion: GeoSteer provides an effective, controllable, and geometrically coherent mechanism for improving the quality of intermediate reasoning in LLMs. Abstract: Recent advances in Large Language Models (LLMs) have improved multi-step reasoning. Most approaches rely on Chain-of-Thought (CoT) rationales. Previous studies have shown that LLMs often generate logically inconsistent reasoning steps even when their final answers are correct. These inconsistencies reduce the reliability of step-level reasoning. We propose GeoSteer, a manifold-based framework that improves the quality of intermediate reasoning. The method consists of: (1) constructing a CoT dataset with segment-level scores, (2) training a Variational Autoencoder (VAE) model and a quality estimation model to learn a low-dimensional manifold of high-quality CoT trajectories, and (3) steering hidden states of target LLMs toward higher-quality regions in the latent space. This update in a latent space behaves like a natural-gradient adjustment in the original hidden-state space. It ensures geometrically coherent steering. We evaluate GeoSteer on the GSM8k dataset using the Qwen3 series. We measure answer accuracy and overall reasoning performance. GeoSteer improved the exact match accuracy by up to 2.6 points. It also enhanced the pairwise win rate by 5.3 points. These results indicate that GeoSteer provides an effective and controllable mechanism for improving the quality of intermediate reasoning in LLMs.

[50] Loop as a Bridge: Can Looped Transformers Truly Link Representation Space and Natural Language Outputs?

Guanxu Chen,Dongrui Liu,Jing Shao

Main category: cs.CL

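TL;DR: Looped Transformers (LTs) increase computational depth by iterating shared layers in an attempt to bridge the gap between an LLM's internal knowledge and its explicit outputs, but experiments show that the iteration does not amount to true introspection and that the ability to perceive internal representations does not improve across loops.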

Details

Motivation: LLMs exhibit a gap between their internal knowledge and their explicit linguistic outputs; the paper asks whether Looped Transformers can narrow this gap by using iteration as a form of introspection. Method: Empirically analyze how increasing the number of loops affects internal knowledge representations and output consistency, and evaluate the model's ability to perceive its representations at each loop. Result: More loops partly narrow the gap, but this comes with a degradation of the internal knowledge carried by the representations; current LTs perceive their representations only in the final loop, with no clear improvement in intermediate loops. Conclusion: Looped Transformers offer a path to scaling computational depth but have not yet achieved the introspection needed to link representation space and natural language. Abstract: Large Language Models (LLMs) often exhibit a gap between their internal knowledge and their explicit linguistic outputs. In this report, we empirically investigate whether Looped Transformers (LTs)--architectures that increase computational depth by iterating shared layers--can bridge this gap by utilizing their iterative nature as a form of introspection. Our experiments reveal that while increasing loop iterations narrows the gap, this narrowing is partly driven by a degradation of the internal knowledge carried by their representations. Moreover, another empirical analysis suggests that current LTs' ability to perceive representations does not improve across loops; it is only present in the final loop. These results suggest that while LTs offer a promising direction for scaling computational depth, they have yet to achieve the introspection required to truly link representation space and natural language.

[51] coTherapist: A Behavior-Aligned Small Language Model to Support Mental Healthcare Experts

Prottay Kumar Adhikary,Reena Rawat,Tanmoy Chakraborty

Main category: cs.CL

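TL;DR: This paper proposes coTherapist, a unified framework that uses a small language model with domain-specific fine-tuning, retrieval augmentation, and agentic reasoning to emulate core therapeutic competencies. Evaluations show more relevant and clinically grounded responses to clinical queries, high empathy, and therapist-consistent personality traits; domain experts judge its responses accurate, trustworthy, and safe, suggesting potential as a scalable digital mental health tool.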

Details

Motivation: Workforce shortages and rising demand in mental healthcare create an urgent need for intelligent systems that support mental healthcare experts. Method: Use a small language model combined with domain-specific fine-tuning, retrieval augmentation, and agentic reasoning to build the coTherapist framework, which emulates a therapist's core competencies. Result: On clinical queries, coTherapist generates more relevant and more clinically grounded responses than existing baselines; the T-BARS rubric and psychometric profiling indicate high empathy and therapist-consistent personality traits; human evaluation by clinical experts confirms that its responses are accurate, trustworthy, and safe. Conclusion: Carefully engineered small models can exhibit expert-like behavior, and coTherapist offers a feasible path toward scalable digital mental health tools. Abstract: Access to mental healthcare is increasingly strained by workforce shortages and rising demand, motivating the development of intelligent systems that can support mental healthcare experts. We introduce coTherapist, a unified framework utilizing a small language model to emulate core therapeutic competencies through domain-specific fine-tuning, retrieval augmentation, and agentic reasoning. Evaluation on clinical queries demonstrates that coTherapist generates more relevant and clinically grounded responses than contemporary baselines. Using our novel T-BARS rubric and psychometric profiling, we confirm coTherapist exhibits high empathy and therapist-consistent personality traits. Furthermore, human evaluation by domain experts validates that coTherapist delivers accurate, trustworthy, and safe responses. coTherapist was deployed and tested by clinical experts. Collectively, these findings demonstrate that small models can be engineered to exhibit expert-like behavior, offering a scalable pathway for digital mental health tools.

[52] Untangling Input Language from Reasoning Language: A Diagnostic Framework for Cross-Lingual Moral Alignment in LLMs

Nan Li,Bo Kang,Tijl De Bie

Main category: cs.CL

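TL;DR: This paper proposes a methodology that separates two ways language can influence an LLM's moral judgments (input language and reasoning language) and interprets the resulting changes through Moral Foundations Theory. In an English-Chinese setting, reasoning-language effects contribute twice the variance of input-language effects, and nearly half of the models show context-dependency that standard evaluation misses.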

Details

Motivation: Investigate whether LLMs reach different moral conclusions in different languages, and distinguish whether the differences come from the language of the dilemma or from the language in which the model reasons, two factors that standard evaluation conflates. Method: Independently manipulate the input language of the dilemma and the reasoning language of the model (covering both matched and mismatched conditions), analyze the judgments through Moral Foundations Theory, and introduce an interpretation approach that also finds evidence for splitting the Authority dimension into family-related and institutional sub-dimensions. Result: Experiments on 13 LLMs show that reasoning language contributes twice the variance of input language; nearly half of the models exhibit context-dependency that standard evaluation does not detect; the framework decomposes the contributing factors and yields deployment guidance. Conclusion: LLM moral judgments are influenced more strongly by reasoning language and show cross-lingual context-dependency; the proposed framework supports finer-grained diagnosis of multilingual moral decision-making and informs real-world deployment. Abstract: When LLMs judge moral dilemmas, do they reach different conclusions in different languages, and if so, why? Two factors could drive such differences: the language of the dilemma itself, or the language in which the model reasons. Standard evaluation conflates these by testing only matched conditions (e.g., English dilemma with English reasoning). We introduce a methodology that separately manipulates each factor, covering also mismatched conditions (e.g., English dilemma with Chinese reasoning), enabling decomposition of their contributions. To study what changes, we propose an approach to interpret the moral judgments in terms of Moral Foundations Theory. As a side result, we identify evidence for splitting the Authority dimension into a family-related and an institutional dimension. Applying this methodology to English-Chinese moral judgment with 13 LLMs, we demonstrate its diagnostic power: (1) the framework isolates reasoning-language effects as contributing twice the variance of input-language effects; (2) it detects context-dependency in nearly half of models that standard evaluation misses; and (3) a diagnostic taxonomy translates these patterns into deployment guidance. We release our code and datasets at https://anonymous.4open.science/r/CrossCulturalMoralJudgement.
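
The decomposition idea in this entry can be illustrated with a small sketch. It assumes a balanced 2x2 design with one mean judgment score per (input language, reasoning language) cell and uses an ordinary sum-of-squares split; the function name `decompose` and all scores are hypothetical, and the paper's actual statistical procedure is not described in the abstract.

```python
def decompose(cells):
    """Split variation in a balanced two-factor grid into main effects.

    cells maps (input_lang, reasoning_lang) -> mean judgment score.
    Returns sums of squares for input language, reasoning language,
    and their interaction (hypothetical analysis, not the paper's).
    """
    inputs = sorted({key[0] for key in cells})
    reasons = sorted({key[1] for key in cells})
    grand = sum(cells.values()) / len(cells)
    # Marginal means for each level of each factor.
    row = {i: sum(cells[(i, r)] for r in reasons) / len(reasons) for i in inputs}
    col = {r: sum(cells[(i, r)] for i in inputs) / len(inputs) for r in reasons}
    ss_input = len(reasons) * sum((row[i] - grand) ** 2 for i in inputs)
    ss_reason = len(inputs) * sum((col[r] - grand) ** 2 for r in reasons)
    ss_inter = sum(
        (cells[(i, r)] - row[i] - col[r] + grand) ** 2
        for i in inputs
        for r in reasons
    )
    return {"input": ss_input, "reasoning": ss_reason, "interaction": ss_inter}


# Made-up scores for illustration only.
scores = {
    ("en", "en"): 0.60,  # matched condition
    ("en", "zh"): 0.40,  # mismatched condition
    ("zh", "en"): 0.50,  # mismatched condition
    ("zh", "zh"): 0.30,  # matched condition
}
effects = decompose(scores)
```

With only the two matched cells, the 0.60 versus 0.30 difference could not be attributed to either factor; the mismatched cells are what make the split possible.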

[53] Measuring Affinity between Attention-Head Weight Subspaces via the Projection Kernel

Hiroaki Yamagiwa,Yusuke Takase,Hidetoshi Shimodaira

Main category: cs.CL

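TL;DR: This paper proposes the Projection Kernel (PK), a principal-angle-based measure of subspace similarity, to better quantify relationships between attention heads in Transformers, and shows experimentally that it outperforms existing metrics on the IOI task. It also introduces a framework for assessing the informativeness of PK distributions and uses the method to find that L4H7 in GPT2-small is a hub head functioning as an identity head.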

Details

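Motivation: Existing metrics for relationships between attention heads do not capture the internal structure of Transformers well, so a more effective measure of head-to-head interaction is needed. Method: Take the subspaces spanned by attention-head weight matrices, define the Projection Kernel (PK) as a principal-angle-based measure of subspace similarity, and build a framework that assesses the informativeness of PK by comparison with a reference distribution derived from random orthogonal subspaces. Result: PK reproduces known head-to-head interactions on the IOI task more clearly than existing metrics such as the Composition Score; a directed graph built from PK reveals that L4H7 in GPT2-small acts as a hub by functioning as an identity head. Conclusion: PK is an effective measure of relationships between attention heads that exposes deeper structural properties of Transformer models and helps interpret their internal mechanisms. Abstract: Understanding relationships between attention heads is essential for interpreting the internal structure of Transformers, yet existing metrics do not capture this structure well. We focus on the subspaces spanned by attention-head weight matrices and quantify head-to-head relationships using the Projection Kernel (PK), a principal-angle-based measure of subspace similarity. Experiments show that PK reproduces known head-to-head interactions on the IOI task more clearly than prior metrics such as the Composition Score. We further introduce a framework to quantify the informativeness of PK distributions by comparing them with a reference distribution derived from random orthogonal subspaces. As an application, we analyze a directed graph constructed from PK and show that, in GPT2-small, L4H7 acts as a hub by functioning as an identity head.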
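
A minimal sketch of a principal-angle-based similarity, assuming PK takes the standard Grassmann projection-kernel form: for orthonormal bases U and V, PK = ||UᵀV||_F², which equals the sum of squared cosines of the principal angles. Which weight matrices define each head's subspace is not stated in the abstract, so the inputs here are plain lists of spanning vectors, and the helper names are hypothetical.

```python
import math


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coeff = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def projection_kernel(span_a, span_b):
    """Sum of squared inner products between the two orthonormal bases.

    This equals ||U^T V||_F^2, i.e. the sum of cos^2 of the principal
    angles between the two subspaces.
    """
    basis_a = orthonormal_basis(span_a)
    basis_b = orthonormal_basis(span_b)
    return sum(
        sum(ai * bi for ai, bi in zip(a, b)) ** 2
        for a in basis_a
        for b in basis_b
    )


xy_plane = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
xz_plane = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
same_plane_other_basis = [[1.0, 1.0, 0.0], [1.0, -1.0, 0.0]]

pk_identical = projection_kernel(xy_plane, same_plane_other_basis)
pk_partial = projection_kernel(xy_plane, xz_plane)
pk_orthogonal = projection_kernel([[1.0, 0.0, 0.0]], [[0.0, 0.0, 1.0]])
```

The value depends only on the subspaces, not on the bases chosen to describe them: it ranges from 0 for orthogonal subspaces up to the subspace dimension for identical ones.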

[54] MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts

Yuxuan Lou,Kai Yang,Yang You

Main category: cs.CL

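TL;DR: This paper proposes MoST, a speech-text multimodal large language model built on a Modality-Aware Mixture of Experts (MAMoE) architecture. Modality-specific and shared experts work together for efficient cross-modal understanding and generation; the model is trained entirely on open-source data and outperforms existing models of comparable size.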

Details

Motivation: Existing multimodal models usually process different modalities with the same parameters, ignoring the inherent representational differences between speech and text, which limits learning efficiency and cross-modal understanding. Method: Propose the MAMoE architecture, with modality-specific expert groups and shared experts, where a routing mechanism sends each input to suitable experts; design an efficient transformation pipeline that adapts a pretrained MoE language model through post-training on ASR/TTS data followed by fine-tuning on speech-text instruction data. Result: On ASR, TTS, audio language modeling, and spoken question answering, MoST outperforms existing models of comparable parameter count; ablations confirm the contribution of modality-specific routing and shared experts. Conclusion: MoST is the first fully open-source speech-text LLM built on a Mixture of Experts architecture; its design improves both modality-specific learning and cross-modal fusion, and it achieves strong performance and data efficiency using only public data. Abstract: We present MoST (Mixture of Speech and Text), a novel multimodal large language model that seamlessly integrates speech and text processing through our proposed Modality-Aware Mixture of Experts (MAMoE) architecture. While current multimodal models typically process diverse modality representations with identical parameters, disregarding their inherent representational differences, we introduce specialized routing pathways that direct tokens to modality-appropriate experts based on input type. MAMoE simultaneously enhances modality-specific learning and cross-modal understanding through two complementary components: modality-specific expert groups that capture domain-specific patterns and shared experts that facilitate information transfer between modalities. Building on this architecture, we develop an efficient transformation pipeline that adapts the pretrained MoE language model through strategic post-training on ASR and TTS datasets, followed by fine-tuning with a carefully curated speech-text instruction dataset. A key feature of this pipeline is that it relies exclusively on fully accessible, open-source datasets to achieve strong performance and data efficiency. Comprehensive evaluations across ASR, TTS, audio language modeling, and spoken question answering benchmarks show that MoST consistently outperforms existing models of comparable parameter counts. Our ablation studies confirm that the modality-specific routing mechanism and shared experts design significantly contribute to performance gains across all tested domains.
To our knowledge, MoST represents the first fully open-source speech-text LLM built on a Mixture of Experts architecture. We release the MoST model, training code, inference code, and training data at https://github.com/NUS-HPC-AI-Lab/MoST.
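
The routing idea (modality-specific expert groups plus always-available shared experts) can be sketched roughly as follows. Everything concrete here is an assumption for illustration: the expert counts, the top-k gating, the constant weight on the shared expert, and the names `EXPERT_GROUPS`, `SHARED_EXPERTS`, and `route_token` are hypothetical and not taken from the MoST release.

```python
import math

# Hypothetical expert layout; indices are illustrative only.
EXPERT_GROUPS = {"text": [0, 1, 2, 3], "speech": [4, 5, 6, 7]}
SHARED_EXPERTS = [8]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, modality, top_k=2):
    """Return {expert_id: gate_weight} for a single token.

    Only experts in the token's modality group compete for the top-k
    slots; shared experts are added unconditionally (an assumption).
    """
    group = EXPERT_GROUPS[modality]
    chosen = sorted(group, key=lambda e: router_logits[e], reverse=True)[:top_k]
    gates = softmax([router_logits[e] for e in chosen])
    assignment = dict(zip(chosen, gates))
    for expert in SHARED_EXPERTS:
        assignment[expert] = 1.0  # shared path is always active
    return assignment


logits = [0.1, 2.0, 0.3, 1.0, 5.0, 4.0, 3.0, 2.0, 0.0]
text_route = route_token(logits, "text")
speech_route = route_token(logits, "speech")
```

In this sketch a text token never reaches a speech expert even when a speech expert has the highest router logit, which is the property that distinguishes modality-aware routing from a plain top-k router.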

[55] The Straight and Narrow: Do LLMs Possess an Internal Moral Path?

Luoming Hu,Jingjie Zeng,Liang Yang,Hongfei Lin

Main category: cs.CL

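TL;DR: This paper proposes an approach grounded in Moral Foundations Theory (MFT) that uses cross-lingual linear probing to identify and manipulate fine-grained moral representations in LLMs. It introduces steerable Moral Vectors and Adaptive Moral Fusion (AMF), an inference-time mechanism that improves safety while preserving helpfulness.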

Details

Motivation: Existing alignment techniques often act only as superficial guardrails and leave the intrinsic moral representations of LLMs largely unchanged, which produces a trade-off between safety and helpfulness. The paper aims for deeper moral alignment by exposing and manipulating the model's internal moral structure. Method: Use cross-lingual linear probing to validate shared moral representations in middle layers, finding a moral subspace that is shared yet different between English and Chinese; extract steerable Moral Vectors from it; and propose Adaptive Moral Fusion (AMF), which dynamically combines probe detection with vector injection at inference time. Result: Experiments show that the method steers moral tendencies at both internal and behavioral levels, reduces incorrect refusals on benign queries, and lowers jailbreak success rates relative to baselines. Conclusion: By exposing and exploiting transferable, steerable moral structure inside LLMs, AMF offers a path toward more precise, dynamic moral alignment and eases the conflict between safety and helpfulness. Abstract: Enhancing the moral alignment of Large Language Models (LLMs) is a critical challenge in AI safety. Current alignment techniques often act as superficial guardrails, leaving the intrinsic moral representations of LLMs largely untouched. In this paper, we bridge this gap by leveraging Moral Foundations Theory (MFT) to map and manipulate the fine-grained moral landscape of LLMs. Through cross-lingual linear probing, we validate the shared nature of moral representations in middle layers and uncover a shared yet different moral subspace between English and Chinese. Building upon this, we extract steerable Moral Vectors and successfully validate their efficacy at both internal and behavioral levels. Leveraging the high generalizability of morality, we propose Adaptive Moral Fusion (AMF), a dynamic inference-time intervention that synergizes probe detection with vector injection to tackle the safety-helpfulness trade-off. Empirical results confirm that our approach acts as a targeted intrinsic defense, effectively reducing incorrect refusals on benign queries while minimizing jailbreak success rates compared to standard baselines.

[56] Multilinguality as Sense Adaptation

Jan Christian Blaise Cruz,David Ifeoluwa Adelani,Alham Fikri Aji

Main category: cs.CL

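TL;DR: This paper proposes SENSIA, which adapts a Backpack language model across languages by aligning sense-level mixtures and contextual representations on parallel data, while jointly training a target-language language modeling loss to preserve fluency.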

Details

Motivation: Existing multilingual models usually rely on shared parameters and large-scale data, which limits their performance on low-resource languages. The paper aims to improve cross-lingual transfer through alignment of meaning rather than parameter sharing. Method: Propose SENse-based Symmetric Interlingual Alignment (SENSIA), which explicitly aligns sense-level mixture distributions and contextual representations across languages on parallel corpora, while jointly optimizing a target-language LM loss to maintain generation quality. Result: On benchmarks for four typologically diverse languages, SENSIA generally outperforms other multilingual alignment methods and reaches comparable or better accuracy than monolingual baselines while using 2-4x less target-language data. Conclusion: Sense-level alignment enables effective cross-lingual model transfer; SENSIA reduces data requirements while maintaining performance and is robust in design and scale. Abstract: We approach multilinguality as sense adaptation: aligning latent meaning representations across languages rather than relying solely on shared parameters and scale. In this paper, we introduce SENse-based Symmetric Interlingual Alignment (SENSIA), which adapts a Backpack language model from one language to another by explicitly aligning sense-level mixtures and contextual representations on parallel data, while jointly training a target-language language modeling loss to preserve fluency. Across benchmarks on four typologically diverse languages, SENSIA generally outperforms comparable multilingual alignment methods and achieves competitive accuracy against monolingual from-scratch baselines while using 2-4x less target-language data. Analyses of learned sense geometry indicate that local sense topology and global structure relative to English are largely preserved, and ablations show that the method is robust in terms of design and scale.

[57] ADVOSYNTH: A Synthetic Multi-Advocate Dataset for Speaker Identification in Courtroom Scenarios

Aniket Deroy

Main category: cs.CL

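TL;DR: This paper introduces Advosynth-500, a dataset of 100 synthetic speech files for studying the vocal characteristics of different advocate identities in courtroom arguments and how well modern systems perform speaker identification.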

Details

Motivation: As large-scale speech-to-speech models advance, distinguishing between synthetic voices in structured environments becomes essential. Method: Use the Speech Llama Omni model to simulate five distinct pairs of advocates in dialogue, define specific vocal characteristics for each advocate, and build the Advosynth-500 dataset. Result: A speaker identification challenge is presented to evaluate how well modern systems map audio files to their synthetic origins. Conclusion: Advosynth-500 is a useful resource for evaluating the identification of synthetic speech and supports further work in this area. Abstract: As large-scale speech-to-speech models achieve high fidelity, the distinction between synthetic voices in structured environments becomes a vital area of study. This paper introduces Advosynth-500, a specialized dataset comprising 100 synthetic speech files featuring 10 unique advocate identities. Using the Speech Llama Omni model, we simulate five distinct advocate pairs engaged in courtroom arguments. We define specific vocal characteristics for each advocate and present a speaker identification challenge to evaluate the ability of modern systems to map audio files to their respective synthetic origins. The dataset is available at https://github.com/naturenurtureelite/ADVOSYNTH-500.

[58] Boundary-Aware NL2SQL: Integrating Reliability through Hybrid Reward and Data Synthesis

Songsong Tian,Kongsheng Zhuo,Zhendong Wang,Rong Shen,Shengtao Zhang,Yong Wu

Main category: cs.CL

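TL;DR: This paper proposes BAR-SQL, a unified NL2SQL training framework that embeds reliability and boundary awareness into the generation process. Seed Mutation data synthesis and Knowledge-Grounded Reasoning Synthesis improve performance on complex enterprise queries; together with a Task-Conditioned Hybrid Reward and the new Ent-SQL-Bench benchmark, the model achieves higher accuracy than Claude 4.5 Sonnet and GPT-5.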

Details

Motivation: Existing NL2SQL models cannot reliably handle boundary cases such as ambiguity and schema limitations in enterprise-grade multi-step analytical queries, and they lack interpretability and targeted training data, which limits their reliability in practice. Method: Propose the BAR-SQL framework. Seed Mutation data synthesis builds an enterprise corpus containing multi-step queries and boundary cases; Knowledge-Grounded Reasoning Synthesis produces chains of thought anchored in schema metadata and business rules; training proceeds in two stages, supervised fine-tuning (SFT) followed by reinforcement learning with Group Relative Policy Optimization (GRPO); a Task-Conditioned Hybrid Reward combines syntax-tree analysis with result matching to optimize execution accuracy and semantic precision. Result: On the self-built Ent-SQL-Bench benchmark, BAR-SQL reaches 91.48% average accuracy, outperforming leading proprietary models such as Claude 4.5 Sonnet and GPT-5 in both SQL generation quality and boundary-aware abstention. The dataset, code, and benchmark are released. Conclusion: By combining boundary awareness, interpretable reasoning, and reinforcement learning, BAR-SQL improves the reliability and accuracy of NL2SQL models in complex, real enterprise environments and offers an effective paradigm for trustworthy natural language interfaces. Abstract: In this paper, we present BAR-SQL (Boundary-Aware Reliable NL2SQL), a unified training framework that embeds reliability and boundary awareness directly into the generation process. We introduce a Seed Mutation data synthesis paradigm that constructs a representative enterprise corpus, explicitly encompassing multi-step analytical queries alongside boundary cases including ambiguity and schema limitations. To ensure interpretability, we employ Knowledge-Grounded Reasoning Synthesis, which produces Chain-of-Thought traces explicitly anchored in schema metadata and business rules. The model is trained through a two-stage process: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning via Group Relative Policy Optimization. We design a Task-Conditioned Hybrid Reward mechanism that simultaneously optimizes SQL execution accuracy (leveraging Abstract Syntax Tree analysis and dense result matching) and semantic precision in abstention responses. To evaluate reliability alongside generation accuracy, we construct and release Ent-SQL-Bench, which jointly assesses SQL precision and boundary-aware abstention across ambiguous and unanswerable queries. Experimental results on this benchmark demonstrate that BAR-SQL achieves 91.48% average accuracy, outperforming leading proprietary models, including Claude 4.5 Sonnet and GPT-5, in both SQL generation quality and boundary-aware abstention capability.
The source code and benchmark are available anonymously at: https://github.com/TianSongS/BAR-SQL.

[59] An Efficient Long-Context Ranking Architecture With Calibrated LLM Distillation: Application to Person-Job Fit

Warren Jouanneau,Emma Jouffroy,Marc Palyart

Main category: cs.CL

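TL;DR: This paper proposes a re-ranking model based on a new late cross-attention architecture for matching candidates to jobs efficiently in real time, particularly when resumes are long and multilingual.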

Details

Motivation: Traditional methods struggle to deliver real-time, accurate person-job matching on long, structured, multilingual resumes, and biases in historical data affect the fairness of matching. Method: Use a late cross-attention architecture that decomposes resumes and project briefs, employ a large language model as a teacher to generate fine-grained supervision, and transfer that knowledge to the student model through an enriched distillation loss. Result: Experiments show that the model outperforms state-of-the-art baselines on relevance, ranking, and calibration metrics. Conclusion: The method improves person-job matching on long-context inputs and offers good interpretability and practicality. Abstract: Finding the most relevant person for a job proposal in real time is challenging, especially when resumes are long, structured, and multilingual. In this paper, we propose a re-ranking model based on a new generation of late cross-attention architecture that decomposes both resumes and project briefs to efficiently handle long-context inputs with minimal computational overhead. To mitigate historical data biases, we use a generative large language model (LLM) as a teacher, generating fine-grained, semantically grounded supervision. This signal is distilled into our student model via an enriched distillation loss function. The resulting model produces skill-fit scores that enable consistent and interpretable person-job matching. Experiments on relevance, ranking, and calibration metrics demonstrate that our approach outperforms state-of-the-art baselines.

[60] OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding

Deming Ding,Shichun Liu,Enhui Yang,Jiahang Lin,Ziying Chen,Shihan Dou,Honglin Guo,Weiyu Cheng,Pengyu Zhao,Chengjun Xiao,Qunhong Zeng,Qi Zhang,Xuanjing Huang,Qidi Xu,Tao Gui

Main category: cs.CL

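TL;DR: This paper introduces OctoBench, a benchmark for evaluating how well models follow heterogeneous, persistent scaffold instructions in repository-grounded agentic coding, and reveals a systematic gap between completing the task and following the rules.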

Details

Motivation: Little existing work examines how well LLMs follow scaffold instructions that are complex, heterogeneous, and persistent across interactions, especially in software-agent settings, so a dedicated benchmark is needed to measure this capability. Method: Build the OctoBench benchmark with 34 environments and 217 tasks across three scaffold types, paired with 7,098 objective checklist items; develop an automated observation-and-scoring toolkit that records full execution trajectories and performs fine-grained compliance checks. Result: Experiments on eight representative models show a significant gap between task-solving ability and compliance with scaffold instructions, particularly under heterogeneous constraints. Conclusion: Training and evaluation need to target heterogeneous instruction following explicitly; the release of OctoBench supports the development of more scaffold-aware coding agents. Abstract: Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. OctoBench includes 34 environments and 217 tasks instantiated under three scaffold types, and is paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly targets heterogeneous instruction following. We release the benchmark to support reproducible benchmarking and to accelerate the development of more scaffold-aware coding agents.

[61] Training-Trajectory-Aware Token Selection

Zhanming Shen,Jiaqi Hu,Zeyu Qin,Hao Chen,Wentao Ye,Zenan Huang,Yihong Zhuang,Guoshan Lu,Junlin Zhou,Junbo Zhao

Main category: cs.CL

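TL;DR: This paper proposes Training-Trajectory-Aware Token Selection (T3S) to address the limited gains, or even degradation, of continual distillation on student models that already reason well. By reconstructing the training objective at the token level, it substantially improves reasoning performance in both AR and dLLM settings.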

Details

Motivation: When the student model already has strong reasoning ability, conventional continual distillation often yields limited gains or even hurts performance; the paper aims to identify the root cause and propose a more efficient distillation strategy. Method: Starting from an observed performance bottleneck during training, the paper finds that token-level confidence bifurcates between Imitation-Anchor Tokens and yet-to-learn tokens, and proposes T3S, which distinguishes and optimizes these two token types dynamically during training and reconstructs the token-level training objective. Result: T3S brings clear gains in several settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1; Qwen3-32B approaches Qwen3-235B; and T3-trained LLaDA-2.0-Mini exceeds its autoregressive baseline, becoming the strongest no-think model at the 16B scale. Conclusion: Conflicting token-level optimization paths are the key reason continual distillation fails; T3S mitigates the problem by being aware of the training trajectory and offers a new design paradigm for efficient distillation. Abstract: Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck, before gradually recovering. We further uncover a token-level mechanism: confidence bifurcates into steadily increasing Imitation-Anchor Tokens that quickly anchor optimization and other yet-to-learn tokens whose confidence is suppressed until after the bottleneck. The fact that these two types of tokens cannot coexist is the root cause of the failure in continual distillation. To this end, we propose Training-Trajectory-Aware Token Selection (T3S) to reconstruct the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3 yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among all 16B-scale no-think models.

[62] Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text

Zhihao Xu,Rumei Li,Jiahuan Li,Rongxiang Weng,Jingang Wang,Xunliang Cai,Xiting Wang

Main category: cs.CL

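TL;DR: This paper proposes GEM, a new paradigm that generates multi-turn tool-use trajectories from text corpora. A four-stage data synthesis pipeline and a specially trained trajectory synthesizer improve LLMs' tool use in multi-turn interactions, with better efficiency and generalization.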

Details

Motivation: Diverse, realistic multi-turn tool-use data is hard to obtain, which limits the use of LLMs in autonomous agents; a scalable, authentic, and efficient data generation method is needed. Method: Propose the GEM data synthesis pipeline with four stages (relevance filtering, workflow and tool extraction, trajectory grounding, and complexity refinement), and train a dedicated trajectory synthesizer via supervised fine-tuning for efficient end-to-end generation. Result: GEM-32B improves by 16.5% on the BFCL V3 Multi-turn benchmark and partially surpasses models trained on τ-bench in-domain data (Airline and Retail), while the trajectory synthesizer significantly reduces inference latency and cost. Conclusion: Synthesizing multi-turn tool-use data from text corpora is feasible and efficient; the GEM paradigm generalizes well and provides a scalable data solution for building autonomous agents. Abstract: Enabling Large Language Models (LLMs) to effectively utilize tools in multi-turn interactions is essential for building capable autonomous agents. However, acquiring diverse and realistic multi-turn tool-use data remains a significant challenge. In this work, we propose a novel text-based paradigm. We observe that textual corpora naturally contain rich, multi-step problem-solving experiences, which can serve as an untapped, scalable, and authentic data source for multi-turn tool-use tasks. Based on this insight, we introduce GEM, a data synthesis pipeline that enables the generation and extraction of multi-turn tool-use trajectories from text corpora through a four-stage process: relevance filtering, workflow & tool extraction, trajectory grounding, and complexity refinement. To reduce the computational cost, we further train a specialized Trajectory Synthesizer via supervised fine-tuning. This model distills the complex generation pipeline into an efficient, end-to-end trajectory generator. Experiments demonstrate that our GEM-32B achieves a 16.5% improvement on the BFCL V3 Multi-turn benchmark. Our models partially surpass the performance of models trained on τ-bench (Airline and Retail) in-domain data, highlighting the superior generalization capability derived from our text-based synthesis paradigm. Notably, our Trajectory Synthesizer matches the quality of the full pipeline while significantly reducing inference latency and costs.

[63] The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

Christina Lu,Jack Gallagher,Jonathan Michala,Kyle Fish,Jack Lindsey

Main category: cs.CL

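TL;DR: This paper studies the structure of the "Assistant Axis" in large language models, finding that the axis captures how far a model is operating in its default Assistant mode and can be used to predict and control persona drift.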

Details

Motivation: Explore the spatial structure of different personas in LLMs, and understand the mechanism behind the default Assistant identity and why models drift away from it during interaction. Method: Extract activation directions corresponding to diverse character archetypes, analyze the structure of persona space across several models, identify the "Assistant Axis", and test its presence in pre-trained and post-trained models and its effect on behavior. Result: An Assistant Axis is found across models; position along it relates to whether the model behaves helpfully or adopts a mystical, theatrical speaking style; the axis is present in pre-trained models and predicts persona drift; restricting activations along the axis stabilizes model behavior and resists persona-based jailbreaks. Conclusion: Post-training steers models toward a particular region of persona space without firmly anchoring them there, which motivates further work on training and steering strategies that anchor the model's persona more deeply. Abstract: Large language models can represent a variety of personas but typically default to a helpful Assistant identity cultivated during post-training. We investigate the structure of the space of model personas by extracting activation directions corresponding to diverse character archetypes. Across several different models, we find that the leading component of this persona space is an "Assistant Axis," which captures the extent to which a model is operating in its default Assistant mode. Steering towards the Assistant direction reinforces helpful and harmless behavior; steering away increases the model's tendency to identify as other entities. Moreover, steering away with more extreme values often induces a mystical, theatrical speaking style. We find this axis is also present in pre-trained models, where it primarily promotes helpful human archetypes like consultants and coaches and inhibits spiritual ones. Measuring deviations along the Assistant Axis predicts "persona drift," a phenomenon where models slip into exhibiting harmful or bizarre behaviors that are uncharacteristic of their typical persona. We find that persona drift is often driven by conversations demanding meta-reflection on the model's processes or featuring emotionally vulnerable users. We show that restricting activations to a fixed region along the Assistant Axis can stabilize model behavior in these scenarios -- and also in the face of adversarial persona-based jailbreaks.
Our results suggest that post-training steers models toward a particular region of persona space but only loosely tethers them to it, motivating work on training and steering strategies that more deeply anchor models to a coherent persona.
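
Restricting activations to a region along a direction can be pictured as clamping a projection. The sketch below is a generic version of that operation under stated assumptions: the axis is a given vector, the allowed region is an interval `[low, high]` on the projection, and only the component along the axis is changed. The function name `clamp_along_axis` and the toy numbers are hypothetical, and the paper's extraction and intervention details are not given in the abstract.

```python
import math


def clamp_along_axis(activation, axis, low, high):
    """Clamp the component of `activation` along `axis` into [low, high].

    Components orthogonal to the axis are left untouched.
    """
    norm = math.sqrt(sum(a * a for a in axis))
    unit = [a / norm for a in axis]
    projection = sum(h * u for h, u in zip(activation, unit))
    target = min(max(projection, low), high)
    shift = target - projection  # zero when already inside the region
    return [h + shift * u for h, u in zip(activation, unit)]


assistant_axis = [2.0, 0.0, 0.0]  # toy direction, not a real model axis
drifted = [-3.0, 2.0, 1.0]        # projection of -3 lies outside [0, 5]
typical = [2.0, 2.0, 1.0]         # projection of 2 lies inside [0, 5]

stabilized = clamp_along_axis(drifted, assistant_axis, low=0.0, high=5.0)
unchanged = clamp_along_axis(typical, assistant_axis, low=0.0, high=5.0)
```

Activations that already lie inside the region pass through unmodified, so an intervention of this kind only acts when the model has moved away from the permitted range.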

[64] INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects

Tarun Sharma,Manikandan Ravikiran,Sourava Kumar Behera,Pramit Bhattacharya,Arnab Bhattacharya,Rohit Saluja

Main category: cs.CL

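TL;DR: This paper introduces INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 Indian dialects and 2 languages (Hindi and Odia), and builds a multi-task benchmark covering dialect classification, multiple-choice question answering, and machine translation. Experiments show that existing LLMs handle dialects poorly, while fine-tuned models pretrained on Indian languages perform substantially better.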

Details

Motivation: The dialects of most low-resource languages are overlooked in NLP, especially in India: Hindi and Odia are widely spoken, yet their dialects lack digital resources and web presence, so datasets and benchmarks are needed to drive research. Method: Build the INDIC-DIALECT parallel corpus of 13,000 sentence pairs covering 11 dialects and 2 languages, and design a multi-task benchmark with dialect classification, multiple-choice question answering, and machine translation; experiment with fine-tuned Transformer models, a hybrid AI model, and a rule-based-then-AI approach. Result: GPT-4o and Gemini 2.5 perform poorly on dialect classification; fine-tuned models raise F1 from 19.6% to 89.8%; for dialect-to-language translation the hybrid AI model reaches a BLEU score of 61.32 (baseline 23.36); for language-to-dialect translation the "rule-based followed by AI" approach achieves the best BLEU of 48.44 (baseline 27.59). Conclusion: INDIC-DIALECT provides a new benchmark for dialect-aware Indic NLP and shows the importance of specialized models for low-resource dialects; the dataset will be open-sourced to support further research. Abstract: Recent NLP advances focus primarily on standardized languages, leaving most low-resource dialects under-served, especially in Indian scenarios. In India, the issue is particularly important: despite Hindi being the third most spoken language globally (over 600 million speakers), its numerous dialects remain underrepresented. The situation is similar for Odia, which has around 45 million speakers. While some datasets exist that contain standard Hindi and Odia, their regional dialects have almost no web presence. We introduce INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 dialects and 2 languages: Hindi and Odia. Using this corpus, we construct a multi-task benchmark with three tasks: dialect classification, multiple-choice question (MCQ) answering, and machine translation (MT). Our experiments show that LLMs like GPT-4o and Gemini 2.5 perform poorly on the classification task, while fine-tuned transformer-based models pretrained on Indian languages substantially improve performance, e.g., improving F1 from 19.6% to 89.8% on dialect classification. For dialect-to-language translation, we find that a hybrid AI model achieves the highest BLEU score of 61.32, compared to the baseline score of 23.36. Interestingly, due to the complexity of generating dialect sentences, we observe that for language-to-dialect translation the "rule-based followed by AI" approach achieves the best BLEU score of 48.44, compared to the baseline score of 27.59.
INDIC-DIALECT thus is a new benchmark for dialect-aware Indic NLP, and we plan to release it as open source to support further work on low-resource Indian dialects.

[65] TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction

Mihai Dan Nadas,Laura Diosan,Andreea Tomescu,Andrei Piscoran

Main category: cs.CL

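TL;DR: TF3-RO is an end-to-end language modeling pipeline for Romanian that covers tokenizer design, model pretraining, compression, and large-scale synthetic data generation, improving reproducibility and practicality for low-resource languages.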

Details

Motivation: Morphologically complex, computationally under-resourced languages such as Romanian lack a unified, reproducible end-to-end language modeling pipeline, and existing methods cope poorly with token inflation and data scarcity. Method: Build TF3-RO on top of TF1 and TF2: design Romanian-specific BPE and Unigram tokenizers, train a 51.65M-parameter LLaMA-style Transformer with long-sequence packed training, and compress it through quantization, structured pruning, and logit-based distillation into a 26.45M-parameter student model; use that model with a combinatorial prompting framework to generate three million Romanian synthetic fables. Result: The pipeline produces efficient Romanian-specific tokenizers and a compact, readily deployable language model; the generated large-scale synthetic corpus performs well on intrinsic metrics, grammatical agreement, entity coherence, and LLM-based assessment. Conclusion: TF3-RO provides a reproducible, linguistically grounded modeling paradigm for morphologically rich low-resource languages, combining lightweight models with high-quality synthetic data generation. Abstract: Recent advances in synthetic data generation have shown that compact language models can be trained effectively when the underlying corpus is structurally controlled and linguistically coherent. However, for morphologically rich and computationally under-resourced languages such as Romanian, there is still no openly documented, end-to-end pipeline that unifies tokenizer design, preprocessing, pretraining, compression, evaluation, and large-scale synthetic data generation in a reproducible framework. Building on TF1, a three-million-story English fable dataset, and TF2, which extends TF1 through high-quality Romanian translations, we introduce TF3-RO, a Romanian-centric language modeling pipeline spanning tokenizer training, from-scratch model development, and Romanian-native dataset generation. TF3-RO constructs Romanian-specific BPE and Unigram tokenizers from a linguistically informed corpus to mitigate token inflation induced by Romanian morphology. Using long-sequence packed training, we pretrain a 51.65M-parameter LLaMA-style Transformer entirely from scratch. The model is subsequently optimized through quantization, structured pruning, and logit-based knowledge distillation, yielding a compact 26.45M-parameter student model with tied embeddings and strong deployment characteristics. Using this distilled model, TF3-RO generates three million Romanian-native synthetic fables via a controlled combinatorial prompting framework.
Across all stages, the pipeline integrates a comprehensive evaluation suite combining intrinsic metrics, Romanian agreement probes, entity coherence, rule-based grammar checking, and LLM-based assessment. TF3-RO provides a reproducible and linguistically grounded framework for training compact Romanian language models and producing large-scale synthetic narrative corpora.

[66] Are Language Models Models?

Philip Resnik

Main category: cs.CL

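TL;DR: The claim that language models (LMs) serve as cognitive models is problematic at each of Marr's three levels: it does not hold at the implementation level, is poorly motivated at the algorithmic-representational level, and is contested at the computational theory level. LMs are better viewed as tools; calling them cognitive models overstates the case and feeds LLM hype.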

Details

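Motivation: Assess whether Futrell and Mahowald's claim that language models serve as "model systems" holds up, examining it at each of the three levels of Marr's framework for cognitive analysis. Method: Evaluate level by level, using Marr's three levels (computational theory, algorithmic-representational, implementation), whether language models can serve as cognitive models. Result: Language models clearly fail to qualify as cognitive models at the implementation level, lack sufficient motivation at the algorithmic-representational level, and raise conceptual problems at the computational theory level. Conclusion: Language models are currently better suited as tools for research or applications than as models of human cognition; calling them cognitive models is an overclaim that may feed unwarranted LLM hype. Abstract: Futrell and Mahowald claim LMs "serve as model systems", but an assessment at each of Marr's three levels suggests the claim is clearly not true at the implementation level, poorly motivated at the algorithmic-representational level, and problematic at the computational theory level. LMs are good candidates as tools; calling them cognitive models overstates the case and unnecessarily feeds LLM hype.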

[67] SurgGoal: Rethinking Surgical Planning Evaluation via Goal-Satisfiability

Ruochen Li,Kun Yuan,Yufei Xia,Yue Zhou,Qingyu Lu,Weihang Li,Youxiang Zhu,Nassir Navab

Main category: cs.CL

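TL;DR: This paper proposes evaluating surgical planning correctness by phase-goal satisfiability, builds a multicentric meta-evaluation benchmark, and finds that conventional sequence similarity metrics misjudge planning quality. A rule-based goal-satisfiability metric exposes the perception and reasoning limits of Video-LLMs and shows that structural knowledge improves performance, whereas semantic guidance alone is unreliable.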

Details

Motivation: Existing evaluation protocols do not reliably assess vision-language models in safety-critical settings, and surgical planning lacks a clear definition of phase-goal satisfaction. Method: Define planning correctness as phase-goal satisfiability under expert-defined rules, build a multicentric meta-evaluation benchmark containing valid variations and invalid plans, and use a rule-based goal-satisfiability metric to evaluate Video-LLMs under progressively constrained settings. Result: Sequence similarity metrics systematically misjudge planning quality; models fail because of perception errors and under-constrained reasoning; structural knowledge consistently improves performance, whereas semantic guidance alone is unreliable and helps larger models only when combined with structural constraints. Conclusion: High-precision rule-based meta-evaluation should be used to assess VLMs on surgical planning reliably, and structural prior knowledge is essential for reliability in safety-critical tasks. Abstract: Surgical planning integrates visual perception, long-horizon reasoning, and procedural knowledge, yet it remains unclear whether current evaluation protocols reliably assess vision-language models (VLMs) in safety-critical settings. Motivated by a goal-oriented view of surgical planning, we define planning correctness via phase-goal satisfiability, where plan validity is determined by expert-defined surgical rules. Based on this definition, we introduce a multicentric meta-evaluation benchmark with valid procedural variations and invalid plans containing order and content errors. Using this benchmark, we show that sequence similarity metrics systematically misjudge planning quality, penalizing valid plans while failing to identify invalid ones. We therefore adopt a rule-based goal-satisfiability metric as a high-precision meta-evaluation reference to assess Video-LLMs under progressively constrained settings, revealing failures due to perception errors and under-constrained reasoning. Structural knowledge consistently improves performance, whereas semantic guidance alone is unreliable and benefits larger models only when combined with structural constraints.
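
The contrast between rule-based validity and sequence similarity can be shown with a toy example. The step names, the required-step set, and the precedence rules below are invented for illustration and are not the benchmark's expert rules; `difflib.SequenceMatcher` stands in for a generic similarity metric.

```python
from difflib import SequenceMatcher


def satisfies_goals(plan, required, must_precede):
    """Rule-based validity: all required steps present, all orderings respected."""
    position = {step: index for index, step in enumerate(plan)}
    if any(step not in position for step in required):
        return False  # content error: a required step is missing
    return all(
        a in position and b in position and position[a] < position[b]
        for a, b in must_precede
    )


def similarity(plan, reference):
    """Order-sensitive overlap with a single reference plan."""
    return SequenceMatcher(None, plan, reference).ratio()


# Hypothetical rules: both clipping steps must happen before cutting,
# but their relative order is free.
REQUIRED = {"expose", "clip_a", "clip_b", "cut", "extract"}
MUST_PRECEDE = [("expose", "clip_a"), ("expose", "clip_b"),
                ("clip_a", "cut"), ("clip_b", "cut"), ("cut", "extract")]

reference = ["expose", "clip_a", "clip_b", "cut", "extract"]
valid_variation = ["expose", "clip_b", "clip_a", "cut", "extract"]
order_error = ["expose", "clip_a", "cut", "clip_b", "extract"]

valid_ok = satisfies_goals(valid_variation, REQUIRED, MUST_PRECEDE)
invalid_ok = satisfies_goals(order_error, REQUIRED, MUST_PRECEDE)
sim_valid = similarity(valid_variation, reference)
sim_invalid = similarity(order_error, reference)
```

Both alternative plans differ from the reference by a single swap of neighbouring steps, so a similarity score has little basis for telling them apart, while the rule check accepts the first and rejects the second.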

[68] Contextual StereoSet: Stress-Testing Bias Alignment Robustness in Large Language Models

Abhinaba Basu,Pavan Chakraborty

Main category: cs.CL

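TL;DR: This paper introduces the Contextual StereoSet benchmark and Context Sensitivity Fingerprints (CSF) to assess how sensitive language models' stereotype bias is to context. Measured bias shifts substantially with contextual changes such as time, setting, and observer perspective, which argues for moving from static bias evaluation to conditional analysis.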

Details

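Motivation: Existing bias evaluations are usually run under fixed conditions and so poorly reflect how models behave in real deployments. The authors aim to show how contextual changes affect biased outputs, challenging the binary "is it biased?" judgment and pushing for a more robust evaluation paradigm. Method: The paper proposes the Contextual StereoSet benchmark, which holds stereotype content fixed while systematically varying context (e.g., time, place, audience); designs two evaluation protocols (a 360-context grid and a budgeted protocol); tests 13 models; and introduces CSF, a compact fingerprint that quantifies a model's context sensitivity along each dimension, with bootstrap confidence intervals and multiple-testing correction. Result: Contextual changes significantly affect stereotype selection: anchoring to 1990 rather than 2030 raises it in all models tested on this contrast (p<0.05); gossip framing raises it in 5 of 6 full-grid models; out-group observer framing shifts it by up to 13 percentage points; and the effects replicate in hiring, lending, and help-seeking vignettes. CSF supports both fine-grained diagnosis and production screening. Conclusion: Bias scores from fixed-condition tests may not generalize, so evaluation should ask "under what conditions does bias appear?" rather than "is this model biased?". CSF offers a methodological framework that emphasizes context sensitivity and robustness, moving bias evaluation from static to dynamic, conditional analysis. Abstract: A model that avoids stereotypes in a lab benchmark may not avoid them in deployment. We show that measured bias shifts dramatically when prompts mention different places, times, or audiences -- no adversarial prompting required. We introduce Contextual StereoSet, a benchmark that holds stereotype content fixed while systematically varying contextual framing. Testing 13 models across two protocols, we find striking patterns: anchoring to 1990 (vs. 2030) raises stereotype selection in all models tested on this contrast (p<0.05); gossip framing raises it in 5 of 6 full-grid models; out-group observer framing shifts it by up to 13 percentage points. These effects replicate in hiring, lending, and help-seeking vignettes. We propose Context Sensitivity Fingerprints (CSF): a compact profile of per-dimension dispersion and paired contrasts with bootstrap CIs and FDR correction. Two evaluation tracks support different use cases -- a 360-context diagnostic grid for deep analysis and a budgeted protocol covering 4,229 items for production screening. The implication is methodological: bias scores from fixed-condition tests may not generalize. This is not a claim about ground-truth bias rates; it is a stress test of evaluation robustness. CSF forces evaluators to ask, "Under what conditions does bias appear?" rather than "Is this model biased?" We release our benchmark, code, and results.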
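
The abstract mentions paired contrasts with bootstrap CIs and FDR correction but does not say which variants are used. As a rough sketch only, the block below assumes a percentile bootstrap over the mean of paired differences and the Benjamini-Hochberg step-up procedure; the function names, the default of 2000 resamples, and the example p-values are illustrative assumptions, not taken from the paper.

```python
import random


def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired differences (assumed variant)."""
    rng = random.Random(seed)
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample the paired differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def benjamini_hochberg(p_values, q=0.05):
    """Return True for each hypothesis rejected at FDR level q (BH step-up)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * q.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank
    rejected = [False] * m
    # Reject every hypothesis ranked at or below max_k.
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values, one per contextual contrast.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
# -> [True, True, False, False, False]
```

Because the procedure is step-up, a contrast whose p-value misses its own threshold can still be rejected when a larger p-value further down the sorted list meets its threshold.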

[69] DR-Arena: an Automated Evaluation Framework for Deep Research Agents

Yiwen Gao,Ruochen Zhao,Yang Deng,Wenxuan Zhang

Main category: cs.CL

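TL;DR: This paper proposes DR-Arena, a fully automated framework for dynamically evaluating the task performance of large language models (LLMs) acting as Deep Research agents. Real-time Information Trees and an Adaptive Evolvement Loop keep the evaluation synchronized with the live world and progressively escalate task complexity.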

Details

Motivation: Static-dataset benchmarks are limited in task generality, temporal alignment, and resistance to data contamination, so they cannot reliably evaluate LLM agents capable of autonomous research; a dynamic, automated evaluation method is needed. Method: The framework builds Information Trees from live web trends, uses an automated Examiner to generate structured tasks that test deep reasoning and wide coverage, and adds an Adaptive Evolvement Loop that raises task difficulty according to real-time performance. Result: Experiments with six advanced Deep Research agents show that DR-Arena reaches a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard, which the authors report as state-of-the-art alignment with human preferences for an automated method. Conclusion: DR-Arena is an efficient and reliable automated evaluation framework that locates the capability boundary of Deep Research agents and can serve as an alternative to costly human adjudication. Abstract: As Large Language Models (LLMs) increasingly operate as Deep Research (DR) Agents capable of autonomous investigation and information synthesis, reliable evaluation of their task performance has become a critical bottleneck. Current benchmarks predominantly rely on static datasets, which suffer from several limitations: limited task generality, temporal misalignment, and data contamination. To address these, we introduce DR-Arena, a fully automated evaluation framework that pushes DR agents to their capability limits through dynamic investigation. DR-Arena constructs real-time Information Trees from fresh web trends to ensure the evaluation rubric is synchronized with the live world state, and employs an automated Examiner to generate structured tasks testing two orthogonal capabilities: Deep reasoning and Wide coverage. DR-Arena further adopts Adaptive Evolvement Loop, a state-machine controller that dynamically escalates task complexity based on real-time performance, demanding deeper deduction or wider aggregation until a decisive capability boundary emerges. Experiments with six advanced DR agents demonstrate that DR-Arena achieves a Spearman correlation of 0.94 with the LMSYS Search Arena leaderboard. This represents the state-of-the-art alignment with human preferences without any manual efforts, validating DR-Arena as a reliable alternative for costly human adjudication.
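
To make the reported Spearman correlation concrete, here is a minimal sketch of how a rank correlation between two leaderboards over six agents could be computed. The scores are made up for illustration, and `average_ranks` and `spearman` are hypothetical helper names rather than anything from the DR-Arena code.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical scores for six agents under two evaluations.
automated = [1, 2, 3, 4, 5, 6]
human = [10, 20, 30, 40, 60, 50]  # the two top agents swap places
print(round(spearman(automated, human), 4))  # -> 0.9429
```

With six items and no ties, a single swap of two adjacent ranks already gives about 0.943. This is only an arithmetic illustration of the scale of the number, not a statement about the rankings actually observed in the paper.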

Xuan Luo,Lewei Yao,Libo Zhao,Lanqing Hong,Kai Chen,Dehua Tao,Daxin Tan,Ruifeng Xu,Jing Li

Main category: cs.CL

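TL;DR: This paper introduces AEQ-Bench, a new benchmark for evaluating the empathy of omni-modal large models (OLMs). It tests how well they understand and respond to multi-modal inputs (audio + text) that carry affective cues, and how well they judge the empathy of audio responses without text transcription. OLMs with audio output capabilities perform better, but OLMs remain unreliable at evaluating fine-grained paralinguistic expressiveness.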

Details

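Motivation: Because empathy is inherently affective, automatically evaluating the empathy of omni-modal large models is challenging, and there is no systematic benchmark for measuring it in realistic multi-modal interaction. Method: The paper proposes AEQ-Bench, which covers two capabilities (generating empathetic responses from multi-modal inputs, and judging the empathy of audio responses without relying on text transcription), adds two novel settings that vary in context specificity and speech tone, and assesses models with both linguistic and paralinguistic metrics. Result: OLMs trained with audio output capabilities generally outperform models with text-only outputs; OLMs align with human judgments on coarse-grained quality assessment but are unreliable when evaluating fine-grained paralinguistic expressiveness. Conclusion: AEQ-Bench provides a useful tool for evaluating the empathy of omni-modal large models, exposes current limitations in paralinguistic understanding, and underlines the need for better modeling of fine-grained emotional expression. Abstract: While the automatic evaluation of omni-modal large models (OLMs) is essential, assessing empathy remains a significant challenge due to its inherent affectivity. To investigate this challenge, we introduce AEQ-Bench (Audio Empathy Quotient Benchmark), a novel benchmark to systematically assess two core empathetic capabilities of OLMs: (i) generating empathetic responses by comprehending affective cues from multi-modal inputs (audio + text), and (ii) judging the empathy of audio responses without relying on text transcription. Compared to existing benchmarks, AEQ-Bench incorporates two novel settings that vary in context specificity and speech tone. Comprehensive assessment across linguistic and paralinguistic metrics reveals that (1) OLMs trained with audio output capabilities generally outperformed models with text-only outputs, and (2) while OLMs align with human judgments for coarse-grained quality assessment, they remain unreliable for evaluating fine-grained paralinguistic expressiveness.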

[71] PERM: Psychology-grounded Empathetic Reward Modeling for Large Language Models

Chengbing Wang,Wuqiang Zheng,Yang Zhang,Fengbin Zhu,Junyi Cheng,Yi Xie,Wenjie Wang,Fuli Feng

Main category: cs.CL

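TL;DR: This paper proposes PERM, a psychology-grounded empathetic reward modeling method that evaluates and optimizes the empathy of large language models bidirectionally through supporter, seeker, and bystander perspectives, markedly improving the quality of emotional support.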

Details

Motivation: Reinforcement learning methods for improving LLM empathy usually evaluate empathy from a single perspective, ignoring the bidirectional nature of empathetic interaction and therefore failing to reflect the actual empathy process. Method: Building on Empathy Cycle theory, the PERM framework decomposes empathy evaluation into a supporter perspective (internal resonation and communicative expression), a seeker perspective (emotional reception), and a bystander perspective (overall interaction quality), yielding multi-perspective reward modeling. Result: On a widely used emotional intelligence benchmark and an industrial daily conversation dataset, PERM outperforms state-of-the-art baselines by over 10%; a blinded user study shows a 70% preference for its responses. Conclusion: Through its psychology-grounded, multi-perspective evaluation, PERM improves the empathetic performance of LLMs on emotional support tasks and shows practical value and scalability. Abstract: Large Language Models (LLMs) are increasingly deployed in human-centric applications, yet they often fail to provide substantive emotional support. While Reinforcement Learning (RL) has been utilized to enhance empathy of LLMs, existing reward models typically evaluate empathy from a single perspective, overlooking the inherently bidirectional interaction nature of empathy between the supporter and seeker as defined by Empathy Cycle theory. To address this limitation, we propose Psychology-grounded Empathetic Reward Modeling (PERM). PERM operationalizes empathy evaluation through a bidirectional decomposition: 1) Supporter perspective, assessing internal resonation and communicative expression; 2) Seeker perspective, evaluating emotional reception. Additionally, it incorporates a bystander perspective to monitor overall interaction quality. Extensive experiments on a widely-used emotional intelligence benchmark and an industrial daily conversation dataset demonstrate that PERM outperforms state-of-the-art baselines by over 10%. Furthermore, a blinded user study reveals a 70% preference for our approach, highlighting its efficacy in generating more empathetic responses. Our code, dataset, and models are available at https://github.com/ZhengWwwq/PERM.

[72] Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure

Syed Naveed Mahmood,Md. Rezaur Rahman Bhuiyan,Tasfia Zaman,Jareen Tasneem Khondaker,Md. Sameer Sakib,Nazia Tasnim,Farig Sadeque

Main category: cs.CL

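TL;DR: This paper proposes the Knowledge Immunization Framework (KIF), which targets internal activation signatures to achieve genuine knowledge erasure, addressing the problem that current selective unlearning methods only suppress behavior without actually removing the knowledge.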

Details

Motivation: Existing LLM unlearning methods struggle to distinguish surface-level refusal from genuine knowledge removal, so latent capabilities persist and the requirements of GDPR compliance and model safety are not met. Method: The KIF framework combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, operates on internal activation signatures rather than surface outputs, and comes with a dual-metric evaluation protocol that separates obfuscation from true erasure. Result: KIF is evaluated on standard foundation models (Llama, Mistral) and reasoning-prior models (Qwen, DeepSeek) from 3B to 14B parameters, achieving near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving oracle-level utility (MU = 0.62); standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence. Conclusion: KIF enables the first systematic diagnosis of mechanism-level forgetting, breaks the stability-erasure tradeoff that constrained prior work, and offers a durable knowledge-removal approach across model families and scales. Abstract: Selective knowledge erasure from LLMs is critical for GDPR compliance and model safety, yet current unlearning methods conflate behavioral suppression with true knowledge removal, allowing latent capabilities to persist beneath surface-level refusals. In this work, we address this challenge by introducing Knowledge Immunization Framework (KIF), a representation-aware architecture that distinguishes genuine erasure from obfuscation by targeting internal activation signatures rather than surface outputs. Our approach combines dynamic suppression of subject-specific representations with parameter-efficient adaptation, enabling durable unlearning without full model retraining. KIF achieves near-oracle erasure (FQ ≈ 0.99 vs. 1.00) while preserving utility at oracle levels (MU = 0.62), effectively breaking the stability-erasure tradeoff that has constrained all prior work. We evaluate both standard foundation models (Llama and Mistral) and reasoning-prior models (Qwen and DeepSeek) across 3B to 14B parameters. Our observation shows that standard models exhibit scale-independent true erasure (<3% utility drift), while reasoning-prior models reveal fundamental architectural divergence. Our comprehensive dual-metric evaluation protocol, combining surface-level leakage with latent trace persistence, operationalizes the obfuscation-erasure distinction and enables the first systematic diagnosis of mechanism-level forgetting behavior across model families and scales.

[73] Form and Meaning in Intrinsic Multilingual Evaluations

Wessel Poelman,Miryam de Lhoneux

Main category: cs.CL

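TL;DR: This paper examines the assumptions behind intrinsic evaluation metrics for conditional language models (such as perplexity or bits-per-character) in multilingual settings, and their implications. Experiments show that current metrics are not universally comparable, and the form-meaning debate is used to help explain why.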

Details

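Motivation: In multilingual settings, existing evaluation metrics rest on untested assumptions, for example that parallel sentences carry the same semantic information so that perplexity can be used to compare model quality. These assumptions may not hold, so they need to be stated explicitly and checked. Method: The paper analyzes what intrinsic metrics measure from an information-theoretic perspective, makes their key assumptions in multilingual settings explicit, and runs experiments with six metrics on two multi-parallel corpora using both monolingual and multilingual models. Result: Current metrics are not universally comparable across languages or models, and they are limited in particular when treated as a measure of semantic equivalence. Conclusion: Existing intrinsic metrics cannot be used directly to compare quality across languages or models; evaluation needs to be rethought in light of the relationship between form and meaning. Abstract: Intrinsic evaluation metrics for conditional language models, such as perplexity or bits-per-character, are widely used in both mono- and multilingual settings. These metrics are rather straightforward to use and compare in monolingual setups, but rest on a number of assumptions in multilingual setups. One such assumption is that comparing the perplexity of CLMs on parallel sentences is indicative of their quality since the information content (here understood as the semantic meaning) is the same. However, the metrics are inherently measuring information content in the information-theoretic sense. We make this and other such assumptions explicit and discuss their implications. We perform experiments with six metrics on two multi-parallel corpora both with mono- and multilingual models. Ultimately, we find that current metrics are not universally comparable. We look at the form-meaning debate to provide some explanation for this.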
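
The two metrics named in the abstract can be sketched from per-token log-probabilities. The snippet below uses the standard definitions (perplexity as the exponentiated mean negative log-likelihood per token, bits-per-character as total bits divided by character count) and assumes natural-log token probabilities; the function names and toy inputs are illustrative and not taken from the paper.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token (natural log assumed)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def bits_per_character(token_logprobs, text):
    """Total information in bits, normalized by the number of characters in text."""
    total_bits = -sum(token_logprobs) / math.log(2)
    return total_bits / len(text)


# Toy example: the same total of 4 bits, spread over different segmentations.
four_tokens = [math.log(0.5)] * 4  # 4 tokens at 1 bit each
two_tokens = [math.log(0.25)] * 2  # 2 tokens at 2 bits each
print(perplexity(four_tokens), perplexity(two_tokens))
print(bits_per_character(four_tokens, "abcdefgh"),
      bits_per_character(four_tokens, "abcd"))
```

Both normalizers depend on form: the first pair differs only in tokenization and the second only in character count, yet each yields different scores for the same total surprisal. That sensitivity to form is one reason such numbers are hard to compare across languages.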

[74] Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs

Yuxi Xia,Loris Schoenegger,Benjamin Roth

Main category: cs.CL

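TL;DR: This paper proposes TracVC, a method for tracing the confidence expressions generated by large language models (LLMs) back to their sources. It finds that models often draw on generic confidence language unrelated to the question rather than on supporting content, suggesting that under current training LLMs may learn to sound confident without being justifiably confident.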

Details

Motivation: To understand where the confidence that LLMs verbalize in their outputs comes from, and whether it is grounded in relevant content or merely imitates surface linguistic patterns, with the goal of improving model trustworthiness. Method: TracVC combines information retrieval with influence estimation to trace generated confidence expressions back to the training data, and introduces a content groundness metric that measures whether confidence is grounded in content related to the question and answer. Result: Experiments on OLMo and Llama models show that OLMo2-13B is frequently influenced by confidence-related data unrelated to the query, suggesting that it may mimic content-independent expressions of certainty. Conclusion: Current training regimes may lead LLMs to learn the linguistic form of confidence rather than when confidence is warranted by content; training needs to improve for verbalized confidence to become more reliable. Abstract: Large language models (LLMs) can increase users' perceived trust by verbalizing confidence in their outputs. However, prior work has shown that LLMs are often overconfident, making their stated confidence unreliable since it does not consistently align with factual accuracy. To better understand the sources of this verbalized confidence, we introduce TracVC (Tracing Verbalized Confidence), a method that builds on information retrieval and influence estimation to trace generated confidence expressions back to the training data. We evaluate TracVC on OLMo and Llama models in a question answering setting, proposing a new metric, content groundness, which measures the extent to which an LLM grounds its confidence in content-related training examples (relevant to the question and answer) versus in generic examples of confidence verbalization. Our analysis reveals that OLMo2-13B is frequently influenced by confidence-related data that is lexically unrelated to the query, suggesting that it may mimic superficial linguistic expressions of certainty rather than rely on genuine content grounding. These findings point to a fundamental limitation in current training regimes: LLMs may learn how to sound confident without learning when confidence is justified. Our analysis provides a foundation for improving LLMs' trustworthiness in expressing more reliable confidence.

[75] Detecting Winning Arguments with Large Language Models and Persuasion Strategies

Tiziano Labruna,Arkadiusz Modzelewski,Giorgio Satta,Giovanni Da San Martino

Main category: cs.CL

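TL;DR: This study proposes a Multi-Strategy Persuasion Scoring approach that uses large language models to analyze persuasion strategies in text (such as Attack on reputation, Distraction, and Manipulative wording) in order to better predict the persuasiveness of argumentative text.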

Details

Motivation: Understanding the role of persuasion in human communication matters, but detecting persuasion in text is challenging. Identifying specific persuasion strategies gives a deeper view of what determines how persuasive a text is. Method: Large language models (LLMs) perform strategy-guided reasoning over six persuasion strategies, with experiments on three annotated datasets (Winning Arguments, Anthropic/Persuasion, and Persuasion for Good). The Winning Arguments dataset is also organized by topic to analyze performance differences across topics. Result: Strategy-guided reasoning improves the prediction of persuasiveness; the per-topic analysis sheds light on the influence of content; and a topic-annotated version of the Winning Arguments dataset is publicly released. Conclusion: Structured, strategy-aware prompting helps improve the interpretability and robustness of argument quality assessment, and the work offers new directions and resources for future research. Abstract: Detecting persuasion in argumentative text is a challenging task with important implications for understanding human communication. This work investigates the role of persuasion strategies - such as Attack on reputation, Distraction, and Manipulative wording - in determining the persuasiveness of a text. We conduct experiments on three annotated argument datasets: Winning Arguments (built from the Change My View subreddit), Anthropic/Persuasion, and Persuasion for Good. Our approach leverages large language models (LLMs) with a Multi-Strategy Persuasion Scoring approach that guides reasoning over six persuasion strategies. Results show that strategy-guided reasoning improves the prediction of persuasiveness. To better understand the influence of content, we organize the Winning Argument dataset into broad discussion topics and analyze performance across them. We publicly release this topic-annotated version of the dataset to facilitate future research. Overall, our methodology demonstrates the value of structured, strategy-aware prompting for enhancing interpretability and robustness in argument quality assessment.

[76] LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals

Gilat Toker,Nitay Calderon,Ohad Amosy,Roi Reichart

Main category: cs.CL

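TL;DR: This paper proposes LIBERTy, a framework that builds benchmarks for concept-based explanations from structural counterfactual pairs generated with large language models and structural causal models, and introduces a new metric, order-faithfulness, for measuring how faithfully explanations capture the influence of high-level concepts in high-stakes domains.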

Details

Motivation: Faithfulness evaluation of concept-based explanations currently relies on human-written counterfactuals, which are costly and imperfect, and there is no systematic benchmark for assessing the accuracy of explanation methods. Method: The LIBERTy framework combines explicitly defined structural causal models (SCMs) with LLMs that generate the post-intervention counterfactual text; the authors build three datasets from realistic scenarios (disease detection, CV screening, and workplace violence prediction) and design the order-faithfulness metric to assess the ranking consistency of explanation methods. Result: Evaluating a wide range of explanation methods across five models reveals substantial headroom for improvement; the analysis also shows that proprietary LLMs are markedly less sensitive to demographic concepts, likely due to post-training mitigation. Conclusion: LIBERTy provides a scalable, controllable benchmark for concept-based explanations and supports the development of more trustworthy explainability methods. Abstract: Concept-based explanations quantify how high-level concepts (e.g., gender or experience) influence model behavior, which is crucial for decision-makers in high-stakes domains. Recent work evaluates the faithfulness of such explanations by comparing them to reference causal effects estimated from counterfactuals. In practice, existing benchmarks rely on costly human-written counterfactuals that serve as an imperfect proxy. To address this, we introduce a framework for constructing datasets containing structural counterfactual pairs: LIBERTy (LLM-based Interventional Benchmark for Explainability with Reference Targets). LIBERTy is grounded in explicitly defined Structured Causal Models (SCMs) of the text generation; interventions on a concept propagate through the SCM until an LLM generates the counterfactual. We introduce three datasets (disease detection, CV screening, and workplace violence prediction) together with a new evaluation metric, order-faithfulness. Using them, we evaluate a wide range of methods across five models and identify substantial headroom for improving concept-based explanations. LIBERTy also enables systematic analysis of model sensitivity to interventions: we find that proprietary LLMs show markedly reduced sensitivity to demographic concepts, likely due to post-training mitigation. Overall, LIBERTy provides a much-needed benchmark for developing faithful explainability methods.

[77] Grounding Agent Memory in Contextual Intent

Ruozhen Yang,Yucheng Jiang,Yueqi Jiang,Priyanka Kargupta,Yunyi Zhang,Jiawei Han

Main category: cs.CL

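TL;DR: This paper proposes STITCH, an agentic memory system based on structured intent tracking that indexes interaction history by contextual intent, substantially improving retrieval accuracy in long-horizon, goal-oriented interactions.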

Details

Motivation: In long-horizon, goal-oriented interactions, similar entities and facts recur under different latent goals, so conventional memory systems tend to retrieve context-mismatched information and performance degrades. Method: STITCH indexes each trajectory step with a structured retrieval cue, its contextual intent, which consists of the current latent goal, the action type, and the salient entity types; at inference time it filters and prioritizes memory snippets by intent compatibility. Result: On the CAME-Bench and LongMemEval benchmarks, STITCH achieves state-of-the-art performance, outperforming the strongest baseline by 35.6%, with gains growing as trajectories get longer. Conclusion: Structured intent indexing reduces retrieval noise and supports more robust long-horizon reasoning, giving large language models an efficient memory mechanism for complex tasks. Abstract: Deploying large language models in long-horizon, goal-oriented interactions remains challenging because similar entities and facts recur under different latent goals and constraints, causing memory systems to retrieve context-mismatched evidence. We propose STITCH (Structured Intent Tracking in Contextual History), an agentic memory system that indexes each trajectory step with a structured retrieval cue, contextual intent, and retrieves history by matching the current step's intent. Contextual intent provides compact signals that disambiguate repeated mentions and reduce interference: (1) the current latent goal defining a thematic segment, (2) the action type, and (3) the salient entity types anchoring which attributes matter. During inference, STITCH filters and prioritizes memory snippets by intent compatibility, suppressing semantically similar but context-incompatible history. For evaluation, we introduce CAME-Bench, a benchmark for context-aware retrieval in realistic, dynamic, goal-oriented trajectories. Across CAME-Bench and LongMemEval, STITCH achieves state-of-the-art performance, outperforming the strongest baseline by 35.6%, with the largest gains as trajectory length increases. Our analysis shows that intent indexing substantially reduces retrieval noise, supporting intent-aware memory for robust long-horizon reasoning.

[78] MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching

Changle Qu,Sunhao Dai,Hengyi Cai,Jun Xu,Shuaiqiang Wang,Dawei Yin

Main category: cs.CL

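TL;DR: The paper proposes MatchTIR, a framework that uses bipartite matching for fine-grained turn-level reward assignment together with dual-level advantage estimation, improving how effectively large models use tools in long-horizon, multi-turn tasks.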

Details

Motivation: Existing reinforcement learning methods use coarse trajectory-level rewards on long-horizon, multi-turn tasks, so they cannot tell effective tool calls from redundant or erroneous ones and credit assignment is inaccurate. Method: Credit assignment is formulated as a bipartite matching problem between predicted and ground-truth traces; two assignment strategies produce dense turn-level rewards; and turn-level and trajectory-level signals are combined in a dual-level advantage estimate. Result: Experiments on three benchmarks show that MatchTIR clearly outperforms existing methods; the 4B model surpasses most 8B models, especially on long-horizon, multi-turn tasks. Conclusion: With fine-grained reward assignment and dual-level advantage estimation, MatchTIR improves LLM tool use on complex tasks and offers a better training mechanism for long-horizon reasoning. Abstract: Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning uniform advantages to all steps within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our codes are available at https://github.com/quchangle1/MatchTIR.
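
The abstract does not spell out the two assignment strategies or the similarity function, so the following is only a sketch of the general idea of turn-level rewards from a one-to-one matching between predicted and ground-truth tool calls. The similarity rule in `call_similarity`, the dictionary layout of a call, and the brute-force search are all assumptions made for illustration.

```python
from itertools import permutations


def call_similarity(pred, gt):
    """Hypothetical similarity: 1.0 for same tool and args, 0.5 for same tool only."""
    if pred["tool"] != gt["tool"]:
        return 0.0
    return 1.0 if pred["args"] == gt["args"] else 0.5


def turn_level_rewards(pred_calls, gt_calls):
    """Reward each predicted call with the score of its match in the best assignment."""
    n, m = len(pred_calls), len(gt_calls)
    size = max(n, m)
    # Pad the score matrix to a square with zeros so unmatched calls score 0.
    score = [[call_similarity(pred_calls[i], gt_calls[j]) if i < n and j < m else 0.0
              for j in range(size)] for i in range(size)]
    # Brute-force maximum-weight matching; only viable for short traces.
    best = max(permutations(range(size)),
               key=lambda p: sum(score[i][p[i]] for i in range(size)))
    return [score[i][best[i]] for i in range(n)]


predicted = [
    {"tool": "search", "args": {"q": "a"}},
    {"tool": "search", "args": {"q": "b"}},  # redundant extra call
    {"tool": "calc", "args": {"x": 1}},      # right tool, wrong argument
]
ground_truth = [
    {"tool": "search", "args": {"q": "a"}},
    {"tool": "calc", "args": {"x": 2}},
]
print(turn_level_rewards(predicted, ground_truth))  # -> [1.0, 0.0, 0.5]
```

The redundant call gets zero reward because the matching is one-to-one, which is the distinction a single trajectory-level reward cannot make. For longer traces, a polynomial-time solver such as SciPy's `linear_sum_assignment` would be the usual replacement for the permutation search.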

cs.CV [Back]

[79] Diffusion-Driven Deceptive Patches: Adversarial Manipulation and Forensic Detection in Facial Identity Verification

Shahrzad Sayyafzadeh,Hongmei Chi,Shonda Bernadin

Main category: cs.CV

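TL;DR: The paper presents an end-to-end pipeline for generating, refining, and evaluating adversarial patches that attack face recognition systems, using a diffusion model to improve imperceptibility and ViT-GPT2 to produce semantic descriptions in support of forensic analysis.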

Details

Motivation: Motivated by the security of facial biometric systems, the work studies how to generate adversarial patches that are both less perceptible and effective, for use in security testing and forensic analysis. Method: FGSM generates the adversarial noise; a diffusion model's reverse process, with Gaussian smoothing and adaptive brightness correction, improves imperceptibility; a ViT-GPT2 model generates semantic captions for adversarial images; and perceptual hashing and segmentation are used to detect adversarial samples. Result: The generated patches evade identity recognition systems while remaining visually natural, with an SSIM of 0.95 reported, and the attacks are shown to affect identity verification and expression recognition systems. Conclusion: The pipeline generates and analyzes adversarial patches effectively, combines attack performance with interpretability, and is suited to security evaluation and forensic applications for face recognition systems. Abstract: This work presents an end-to-end pipeline for generating, refining, and evaluating adversarial patches to compromise facial biometric systems, with applications in forensic analysis and security testing. We utilize FGSM to generate adversarial noise targeting an identity classifier and employ a diffusion model with reverse diffusion to enhance imperceptibility through Gaussian smoothing and adaptive brightness correction, thereby facilitating synthetic adversarial patch evasion. The refined patch is applied to facial images to test its ability to evade recognition systems while maintaining natural visual characteristics. A Vision Transformer (ViT)-GPT2 model generates captions to provide a semantic description of a person's identity for adversarial images, supporting forensic interpretation and documentation for identity evasion and recognition attacks. The pipeline evaluates changes in identity classification, captioning results, and vulnerabilities in facial identity verification and expression recognition under adversarial conditions. We further demonstrate effective detection and analysis of adversarial patches and adversarial samples using perceptual hashing and segmentation, achieving an SSIM of 0.95.

[80] LCF3D: A Robust and Real-Time Late-Cascade Fusion Framework for 3D Object Detection in Autonomous Driving

Carlo Sgaravatti,Riccardo Pieroni,Matteo Corno,Sergio M. Savaresi,Luca Magri,Giacomo Boracchi

Main category: cs.CV

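TL;DR: The paper proposes LCF3D, a new sensor fusion framework that combines RGB images and LiDAR point clouds for 3D object detection. Late fusion reduces false positives and cascade fusion recovers missed detections, yielding significant gains over LiDAR-only methods on the KITTI and nuScenes datasets.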

Details

Motivation: Accurately detecting 3D objects such as pedestrians, cyclists, and other vehicles is essential for autonomous driving, but effectively fusing RGB camera and LiDAR data remains challenging. Method: The LCF3D framework uses late fusion, matching 2D and 3D detections to filter out LiDAR false positives, and cascade fusion, generating new 3D frustum proposals to recover objects that LiDAR missed. Result: The method achieves significant improvements on KITTI and nuScenes, particularly on challenging categories such as pedestrians, cyclists, motorcycles, and bicycles, and shows good domain generalization. Conclusion: LCF3D improves 3D detection accuracy through its multimodal fusion strategy, copes with different sensor configurations, and makes autonomous driving systems more robust in complex scenes. Abstract: Accurately localizing 3D objects like pedestrians, cyclists, and other vehicles is essential in Autonomous Driving. To ensure high detection performance, Autonomous Vehicles complement RGB cameras with LiDAR sensors, but effectively combining these data sources for 3D object detection remains challenging. We propose LCF3D, a novel sensor fusion framework that combines a 2D object detector on RGB images with a 3D object detector on LiDAR point clouds. By leveraging multimodal fusion principles, we compensate for inaccuracies in the LiDAR object detection network. Our solution combines two key principles: (i) late fusion, to reduce LiDAR False Positives by matching LiDAR 3D detections with RGB 2D detections and filtering out unmatched LiDAR detections; and (ii) cascade fusion, to recover missed objects from LiDAR by generating new 3D frustum proposals corresponding to unmatched RGB detections. Experiments show that LCF3D is beneficial for domain generalization, as it turns out to be successful in handling different sensor configurations between training and testing domains. LCF3D achieves significant improvements over LiDAR-based methods, particularly for challenging categories like pedestrians and cyclists in the KITTI dataset, as well as motorcycles and bicycles in nuScenes. Code can be downloaded from: https://github.com/CarloSgaravatti/LCF3D.
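
The two fusion principles can be sketched as a single matching step. The block below assumes the LiDAR 3D detections have already been projected to 2D boxes in the image plane, and it uses greedy IoU matching with a 0.5 threshold; the projection step, the greedy strategy, the threshold, and the function names are all illustrative assumptions, not details confirmed by the abstract.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def late_cascade_split(lidar_boxes_2d, rgb_boxes, thr=0.5):
    """Return (kept LiDAR indices, RGB indices needing a frustum proposal)."""
    pairs = sorted(
        ((iou(l, r), i, j)
         for i, l in enumerate(lidar_boxes_2d)
         for j, r in enumerate(rgb_boxes)),
        reverse=True,
    )
    matched_lidar, matched_rgb = set(), set()
    for score, i, j in pairs:
        if score < thr:
            break  # remaining pairs overlap too little to count as a match
        if i in matched_lidar or j in matched_rgb:
            continue
        matched_lidar.add(i)
        matched_rgb.add(j)
    # Late fusion: unmatched LiDAR detections are dropped as likely false positives.
    kept = sorted(matched_lidar)
    # Cascade fusion: unmatched RGB detections seed new 3D frustum proposals.
    frustum_seeds = [j for j in range(len(rgb_boxes)) if j not in matched_rgb]
    return kept, frustum_seeds


lidar = [(0, 0, 10, 10), (50, 50, 60, 60)]    # second box has no RGB support
rgb = [(1, 1, 10, 10), (100, 100, 120, 120)]  # second box was missed by LiDAR
print(late_cascade_split(lidar, rgb))  # -> ([0], [1])
```

In the example, the second LiDAR box is filtered out and the second RGB box is passed on as a seed for a frustum proposal.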

[81] Explainable Deep Learning for Pediatric Pneumonia Detection in Chest X-Ray Images

Adil O. Khadidos,Aziida Nanyonga,Alaa O. Khadidos,Olfat M. Mirza,Mustafa Tahsin Yilmaz

Main category: cs.CV

TL;DR: This study compares two CNN models, DenseNet121 and EfficientNet-B0, for pediatric pneumonia detection on 5,863 chest X-ray images; EfficientNet-B0 performs better and offers good explainability.

Details

Motivation: Pneumonia is a leading cause of morbidity and mortality among children, so accurate and efficient diagnostic support tools are needed. Deep learning shows promise in medical image analysis, but the effectiveness of different models for pediatric pneumonia detection needs to be compared. Method: A public dataset of 5,863 pediatric chest X-ray images is preprocessed with normalization, resizing, and data augmentation; DenseNet121 and EfficientNet-B0 are fine-tuned from ImageNet-pretrained weights under identical training settings and evaluated with accuracy, F1-score, MCC, and recall; Grad-CAM and LIME provide model explainability. Result: EfficientNet-B0 reaches 84.6% accuracy, an F1-score of 0.8899, and an MCC of 0.6849, ahead of DenseNet121 at 79.7% accuracy, an F1-score of 0.8597, and an MCC of 0.5852; both models have recall above 0.99, indicating high sensitivity; Grad-CAM and LIME visualizations show the models attending to clinically relevant lung regions. Conclusion: EfficientNet-B0 offers more balanced performance and higher computational efficiency than DenseNet121, making it suitable for clinical deployment; explainability techniques increase the transparency and trustworthiness of AI-assisted diagnosis. Abstract: Background: Pneumonia remains a leading cause of morbidity and mortality among children worldwide, emphasizing the need for accurate and efficient diagnostic support tools. Deep learning has shown strong potential in medical image analysis, particularly for chest X-ray interpretation. This study compares two state-of-the-art convolutional neural network (CNN) architectures for automated pediatric pneumonia detection. Methods: A publicly available dataset of 5,863 pediatric chest X-ray images was used. Images were preprocessed through normalization, resizing, and data augmentation to enhance generalization. DenseNet121 and EfficientNet-B0 were fine-tuned using pretrained ImageNet weights under identical training settings. Performance was evaluated using accuracy, F1-score, Matthews Correlation Coefficient (MCC), and recall. Model explainability was incorporated using Gradient-weighted Class Activation Mapping (Grad-CAM) and Local Interpretable Model-agnostic Explanations (LIME) to visualize image regions influencing predictions. Results: EfficientNet-B0 outperformed DenseNet121, achieving an accuracy of 84.6%, F1-score of 0.8899, and MCC of 0.6849. DenseNet121 achieved 79.7% accuracy, an F1-score of 0.8597, and MCC of 0.5852. Both models demonstrated high recall values above 0.99, indicating strong sensitivity to pneumonia detection. Grad-CAM and LIME visualizations showed consistent focus on clinically relevant lung regions, supporting the reliability of model decisions. Conclusions: EfficientNet-B0 provided a more balanced and computationally efficient performance compared to DenseNet121, making it a strong candidate for clinical deployment. The integration of explainability techniques enhances transparency and trustworthiness in AI-assisted pediatric pneumonia diagnosis.
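
For readers less familiar with the four reported metrics, the sketch below computes them from a binary confusion matrix using the standard definitions. The counts in the example are invented for illustration and are not the study's confusion matrix; `classification_metrics` is a hypothetical helper name.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, recall, F1, and MCC from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {"accuracy": accuracy, "recall": recall, "f1": f1, "mcc": mcc}


# Invented counts: high recall, but a fair number of false positives.
m = classification_metrics(tp=90, fp=30, tn=70, fn=10)
print({k: round(v, 4) for k, v in m.items()})
# -> {'accuracy': 0.8, 'recall': 0.9, 'f1': 0.8182, 'mcc': 0.6124}
```

Since recall counts only missed positives, a recall above 0.99 alongside accuracy in the 80% range implies that most of the remaining errors are false positives. MCC is more sensitive to that imbalance than F1, which is consistent with the gap between the F1 and MCC values reported in the abstract.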

[82] NanoSD: Edge Efficient Foundation Model for Real Time Image Restoration

Subhajit Sanyal,Srinivas Soumitri Miriyala,Akshay Janardan Bankar,Sravanth Kodavanti,Harshit,Abhishek Ameta,Shreyas Pandith,Amit Satish Unde

Main category: cs.CV

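TL;DR: This paper presents NanoSD, a family of lightweight diffusion foundation models distilled from Stable Diffusion 1.5. Network surgery, feature-wise generative distillation, and structured scaling jointly optimize the U-Net and VAE to reach Pareto-optimal trade-offs among accuracy, latency, and model size, supporting real-time image generation and restoration on edge devices.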

Details

Motivation: Existing lightweight diffusion models mostly compress the denoising U-Net or shorten the diffusion trajectory, which disrupts the latent manifold and limits generalization; meanwhile, full diffusion pipelines are too computationally heavy for edge devices, and no solution combines efficiency with preservation of the generative prior. Method: NanoSD applies network surgery, feature-wise generative distillation, and structured architectural scaling jointly to the U-Net and the VAE encoder-decoder, a full-pipeline co-design that compresses the model substantially while preserving the original generative prior. Result: NanoSD models range from 130M to 315M parameters, reach real-time inference down to 20ms on mobile-class NPUs, and outperform prior lightweight diffusion models on super-resolution, deblurring, face restoration, and monocular depth estimation in both perceptual quality and practical deployability. Conclusion: Through full-pipeline co-compression that leaves the latent-space structure intact, NanoSD yields efficient, general-purpose diffusion foundation models, offers a viable route to real-time visual generation and restoration on edge devices, and shows that architectural balance and feature routing are key to real hardware efficiency. Abstract: Latent diffusion models such as Stable Diffusion 1.5 offer strong generative priors that are highly valuable for image restoration, yet their full pipelines remain too computationally heavy for deployment on edge devices. Existing lightweight variants predominantly compress the denoising U-Net or reduce the diffusion trajectory, which disrupts the underlying latent manifold and limits generalization beyond a single task. We introduce NanoSD, a family of Pareto-optimal diffusion foundation models distilled from Stable Diffusion 1.5 through network surgery, feature-wise generative distillation, and structured architectural scaling jointly applied to the U-Net and the VAE encoder-decoder. This full-pipeline co-design preserves the generative prior while producing models that occupy distinct operating points along the accuracy-latency-size frontier (e.g., 130M-315M parameters, achieving real-time inference down to 20ms on mobile-class NPUs). We show that parameter reduction alone does not correlate with hardware efficiency, and we provide an analysis revealing how architectural balance, feature routing, and latent-space preservation jointly shape true on-device latency. When used as a drop-in backbone, NanoSD enables state-of-the-art performance across image super-resolution, image deblurring, face restoration, and monocular depth estimation, outperforming prior lightweight diffusion models in both perceptual quality and practical deployability. NanoSD establishes a general-purpose diffusion foundation model family suitable for real-time visual generation and restoration on edge devices.

[83] UniHash: Unifying Pointwise and Pairwise Hashing Paradigms for Seen and Unseen Category Retrieval

Xiaoxu Ma,Runhao Li,Hanwen Liu,Xiangbo Zhang,Zhenyu Weng

Main category: cs.CV

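TL;DR: The paper proposes Unified Hashing (UniHash), a dual-branch framework that unifies the strengths of the pointwise and pairwise learning paradigms to achieve balanced, state-of-the-art image retrieval performance on both seen and unseen categories.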

Details

Motivation: Existing deep hashing methods are usually confined to a single training paradigm (pointwise or pairwise) and struggle to perform well on both seen and unseen categories, so a unified method that balances recognition accuracy and generalization is needed. Method: UniHash is a dual-branch framework with a center-based pointwise branch and a pairwise branch; it introduces bidirectional knowledge transfer, aligning hash representations with a mutual learning loss, and proposes a Split-Merge Mixture of Hash Experts (SM-MoH) module to strengthen cross-branch exchange of representations. Result: Experiments on CIFAR-10, MSCOCO, and ImageNet show that UniHash achieves state-of-the-art performance on image retrieval for both seen and unseen categories. Conclusion: By combining the advantages of pointwise and pairwise learning, UniHash retrieves effectively across seen and unseen categories; theoretical analysis and experiments support its effectiveness and generalization. Abstract: Effective retrieval across both seen and unseen categories is crucial for modern image retrieval systems. Retrieval on seen categories ensures precise recognition of known classes, while retrieval on unseen categories promotes generalization to novel classes with limited supervision. However, most existing deep hashing methods are confined to a single training paradigm, either pointwise or pairwise, where the former excels on seen categories and the latter generalizes better to unseen ones. To overcome this limitation, we propose Unified Hashing (UniHash), a dual-branch framework that unifies the strengths of both paradigms to achieve balanced retrieval performance across seen and unseen categories. UniHash consists of two complementary branches: a center-based branch following the pointwise paradigm and a pairwise branch following the pairwise paradigm. A novel hash code learning method is introduced to enable bidirectional knowledge transfer between branches, improving hash code discriminability and generalization. It employs a mutual learning loss to align hash representations and introduces a Split-Merge Mixture of Hash Experts (SM-MoH) module to enhance cross-branch exchange of hash representations. Theoretical analysis substantiates the effectiveness of UniHash, and extensive experiments on CIFAR-10, MSCOCO, and ImageNet demonstrate that UniHash consistently achieves state-of-the-art performance in both seen and unseen image retrieval scenarios.

[84] ViSIL: Unified Evaluation of Information Loss in Multimodal Video Captioning

Po-han Li,Shenghui Chen,Ufuk Topcu,Sandeep Chinchali

Main category: cs.CV

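TL;DR: Proposes the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies information coverage in multimodal summaries and enables unified evaluation across summary formats.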

Details

Motivation: Traditional metrics such as BLEU or ROUGE cannot measure information coverage across modalities, which makes it hard to compare, for example, a paragraph of text with a sequence of keyframes. Method: ViSIL is an information-theoretic framework that uses vision-language model (VLM) inference to measure how much of the video's information a summary fails to capture, and uses that loss to quantify summary quality. Result: ViSIL scores correlate significantly with both human and VLM performance on Video Question Answering (VQA) tasks, and ViSIL-based summary selection improves VQA accuracy by 7% over text-only summaries without increasing processing load. Conclusion: ViSIL is a unified and effective metric for multimodal summaries that allows direct comparison across summary formats and helps optimize the trade-off between information retention and processing efficiency. Abstract: Multimodal video captioning condenses dense footage into a structured format of keyframes and natural language. By creating a cohesive multimodal summary, this approach anchors generative AI in rich semantic evidence and serves as a lightweight proxy for high-efficiency retrieval. However, traditional metrics like BLEU or ROUGE fail to quantify information coverage across disparate modalities, such as comparing a paragraph of text to a sequence of keyframes. To address this, we propose the Video Summary Information Loss (ViSIL) score, an information-theoretic framework that quantifies the video information not captured by a summary via vision-language model (VLM) inference. By measuring the information loss, ViSIL is a unified metric that enables direct comparison across multimodal summary formats despite their structural discrepancies. Our results demonstrate that ViSIL scores show a statistically significant correlation with both human and VLM performance on Video Question Answering (VQA) tasks. ViSIL also enables summary selection to optimize the trade-off between information loss and processing speed, establishing a Pareto-optimal frontier that outperforms text summaries by $7\%$ in VQA accuracy without increasing processing load.

[85] Breaking the Limits of Open-Weight CLIP: An Optimization Framework for Self-supervised Fine-tuning of CLIP

Anant Mehta,Xiyuan Wei,Xingyu Chen,Tianbao Yang

Main category: cs.CV

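TL;DR: Proposes TuneCLIP, a self-supervised fine-tuning framework that improves the general performance of open-weight CLIP models across downstream tasks without training from scratch.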

Details

Motivation: Directly fine-tuning an open-weight CLIP model often degrades performance, and training from scratch on billions of samples is prohibitively expensive, so a method is needed that improves general performance using only existing self-supervised datasets. Method: TuneCLIP has two key components: (1) a warm-up stage that recovers optimization statistics to reduce cold-start bias, and (2) a fine-tuning stage that optimizes a new contrastive loss to mitigate the penalty on false negative pairs. Result: TuneCLIP consistently improves performance across model architectures and scales, with gains of up to +2.5% on ImageNet and related out-of-distribution benchmarks and +1.2% on the DataComp benchmark. Conclusion: TuneCLIP sets a new strong baseline for efficient post-pretraining adaptation and makes open-weight CLIP models more general and practical. Abstract: CLIP has become a cornerstone of multimodal representation learning, yet improving its performance typically requires a prohibitively costly process of training from scratch on billions of samples. We ask a different question: Can we improve the performance of open-weight CLIP models across various downstream tasks using only existing self-supervised datasets? Unlike supervised fine-tuning, which adapts a pretrained model to a single downstream task, our setting seeks to improve general performance across various tasks. However, as both our experiments and prior studies reveal, simply applying standard training protocols starting from an open-weight CLIP model often fails, leading to performance degradation. In this paper, we introduce TuneCLIP, a self-supervised fine-tuning framework that overcomes the performance degradation. TuneCLIP has two key components: (1) a warm-up stage of recovering optimization statistics to reduce cold-start bias, inspired by theoretical analysis, and (2) a fine-tuning stage of optimizing a new contrastive loss to mitigate the penalization on false negative pairs. Our extensive experiments show that TuneCLIP consistently improves performance across model architectures and scales. Notably, it elevates leading open-weight models like SigLIP (ViT-B/16), achieving gains of up to +2.5% on ImageNet and related out-of-distribution benchmarks, and +1.2% on the highly competitive DataComp benchmark, setting a new strong baseline for efficient post-pretraining adaptation.

[86] VibrantSR: Sub-Meter Canopy Height Models from Sentinel-2 Using Generative Flow Matching

Kiarie Ndegwa,Andreas Gros,Tony Chang,David Diaz,Vincent A. Landau,Nathan E. Rutenbeck,Luke J. Zachmann,Guy Bayes,Scott Conway

Main category: cs.CV

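TL;DR: VibrantSR is a generative super-resolution framework that produces 0.5 m canopy height models (CHMs) from globally available 10 m Sentinel-2 imagery and outperforms existing satellite-based benchmarks when evaluated across 22 eco-regions.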

Details

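Motivation: Canopy height modeling from aerial imagery is constrained by infrequent and irregular acquisitions, which makes continuous large-area forest monitoring difficult, while existing satellite-based methods fall short in resolution and accuracy. VibrantSR aims to use widely available Sentinel-2 data for operational, continental-scale forest monitoring and carbon accounting at high spatial and temporal resolution. Method: VibrantSR applies generative super-resolution to predict 0.5 m canopy height models from 10 m Sentinel-2 seasonal composites and is evaluated across 22 EPA Level 3 eco-regions using spatially disjoint validation splits. Result: For canopy heights >= 2 m, VibrantSR achieves a Mean Absolute Error (MAE) of 4.39 m, better than the Meta (4.83 m), LANDFIRE (5.96 m), and ETH (7.05 m) satellite-based benchmarks. The aerial-based VibrantVS remains more accurate (2.71 m MAE), but VibrantSR offers wider coverage and better temporal consistency. Conclusion: VibrantSR enables continental-scale monitoring of high-resolution forest structure at a seasonal-to-annual cadence without relying on costly and temporally sparse aerial data, supporting operational forest carbon estimation. Abstract: We present VibrantSR (Vibrant Super-Resolution), a generative super-resolution framework for estimating 0.5 meter canopy height models (CHMs) from 10 meter Sentinel-2 imagery. Unlike approaches based on aerial imagery that are constrained by infrequent and irregular acquisition schedules, VibrantSR leverages globally available Sentinel-2 seasonal composites, enabling consistent monitoring at a seasonal-to-annual cadence. Evaluated across 22 EPA Level 3 eco-regions in the western United States using spatially disjoint validation splits, VibrantSR achieves a Mean Absolute Error of 4.39 meters for canopy heights >= 2 m, outperforming Meta (4.83 m), LANDFIRE (5.96 m), and ETH (7.05 m) satellite-based benchmarks. While aerial-based VibrantVS (2.71 m MAE) retains an accuracy advantage, VibrantSR enables operational forest monitoring and carbon accounting at continental scales without reliance on costly and temporally infrequent aerial acquisitions.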
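
The headline metric is a Mean Absolute Error restricted to canopy heights of at least 2 m. A minimal sketch of such a thresholded MAE is shown here; it assumes the threshold is applied to the reference heights (the abstract does not say whether the reference or the prediction is thresholded), and the function name and toy values are hypothetical.

```python
def masked_mae(predicted, reference, min_height=2.0):
    """Mean absolute error over pixels whose reference height is >= min_height.

    Assumption: the height threshold filters on the reference CHM, not the prediction.
    """
    errors = [abs(p - r) for p, r in zip(predicted, reference) if r >= min_height]
    if not errors:
        raise ValueError("no reference pixel meets the height threshold")
    return sum(errors) / len(errors)


# Toy canopy heights in meters; the first pixel (0.5 m) is excluded by the threshold.
reference = [0.5, 3.0, 10.0, 25.0]
predicted = [4.0, 5.0, 7.0, 24.0]
print(masked_mae(predicted, reference))  # (2 + 3 + 1) / 3 = 2.0
```

Restricting the average to taller pixels keeps bare ground and low shrubs from dominating the score.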

[87] MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation

Yang Xing,Jiong Wu,Savas Ozdemir,Ying Zhang,Yang Yang,Wei Shao,Kuang Gong

Main category: cs.CV

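TL;DR: Proposes MedVL-SAM2, a unified 3D medical multimodal model that supports report generation, visual question answering, and multiple segmentation tasks, unifying fine-grained visual grounding with volumetric spatial reasoning.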

Details

Motivation: Existing medical vision-language models have limited fine-grained visual grounding and volumetric spatial reasoning, and no general framework handles these tasks together. Method: The architecture combines image-level reasoning with pixel-level perception and integrates a SAM2-based volumetric segmentation module. Training is multi-stage: the model is first pre-trained on a large corpus of 3D CT image-text pairs, then jointly optimized with language-understanding and segmentation objectives. Result: The model reaches state-of-the-art performance on report generation, VQA, and multiple 3D segmentation tasks, supports flexible interaction through language, point, or box prompts, and provides reliable 3D visual grounding and cross-modal reasoning. Conclusion: MedVL-SAM2 shows that high-level semantic reasoning and precise 3D localization can be achieved jointly in one framework, offering an effective approach to general 3D medical vision-language modeling. Abstract: Recent progress in medical vision-language models (VLMs) has achieved strong performance on image-level text-centric tasks such as report generation and visual question answering (VQA). However, achieving fine-grained visual grounding and volumetric spatial reasoning in 3D medical VLMs remains challenging, particularly when aiming to unify these capabilities within a single, generalizable framework. To address this challenge, we propose MedVL-SAM2, a unified 3D medical multimodal model that concurrently supports report generation, VQA, and multi-paradigm segmentation, including semantic, referring, and interactive segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging, and incorporates a SAM2-based volumetric segmentation module to enable precise multi-granular spatial reasoning. The model is trained in a multi-stage pipeline: it is first pre-trained on a large-scale corpus of 3D CT image-text pairs to align volumetric visual features with radiology-language embeddings. It is then jointly optimized with both language-understanding and segmentation objectives using a comprehensive 3D CT segmentation dataset. This joint training enables flexible interaction via language, point, or box prompts, thereby unifying high-level visual reasoning with spatially precise localization. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks. Extensive analyses further show that the model provides reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning, demonstrating that high-level semantic reasoning and precise 3D localization can be jointly achieved within a unified 3D medical VLM.

[88] Transition Matching Distillation for Fast Video Generation

Weili Nie,Julius Berner,Nanye Ma,Chao Liu,Saining Xie,Arash Vahdat

Main category: cs.CV

TL;DR: Proposes Transition Matching Distillation (TMD), a framework that distills video diffusion models into efficient few-step generators for fast, high-quality video generation.

Details

Motivation: Large video diffusion and flow models generate high-quality video, but their inefficient multi-step sampling limits their use in real-time interactive applications. Method: TMD distills by matching the multi-step denoising trajectory of a diffusion model with a few-step probability transition process in which each transition is modeled as a lightweight conditional flow. For efficiency, the original diffusion backbone is split into a main backbone and a flow head. Result: Experiments on the Wan2.1 1.3B and 14B text-to-video models show that TMD offers a strong trade-off between generation speed and visual quality and outperforms existing distilled models at comparable inference cost. Conclusion: TMD is a flexible and effective distillation method for video generation models that substantially improves generation efficiency while maintaining visual quality and prompt adherence. Abstract: Large video diffusion and flow models have achieved remarkable success in high-quality video generation, but their use in real-time interactive applications remains limited due to their inefficient multi-step sampling process. In this work, we present Transition Matching Distillation (TMD), a novel framework for distilling video diffusion models into efficient few-step generators. The central idea of TMD is to match the multi-step denoising trajectory of a diffusion model with a few-step probability transition process, where each transition is modeled as a lightweight conditional flow. To enable efficient distillation, we decompose the original diffusion backbone into two components: (1) a main backbone, comprising the majority of early layers, that extracts semantic representations at each outer transition step; and (2) a flow head, consisting of the last few layers, that leverages these representations to perform multiple inner flow updates. Given a pretrained video diffusion model, we first introduce a flow head to the model, and adapt it into a conditional flow map. We then apply distribution matching distillation to the student model with flow head rollout in each transition step. Extensive experiments on distilling Wan2.1 1.3B and 14B text-to-video models demonstrate that TMD provides a flexible and strong trade-off between generation speed and visual quality. In particular, TMD outperforms existing distilled models under comparable inference costs in terms of visual fidelity and prompt adherence. Project page: https://research.nvidia.com/labs/genair/tmd

[89] OT-Drive: Out-of-Distribution Off-Road Traversable Area Segmentation via Optimal Transport

Zhihua Zhao,Guoqiang Li,Chen Min,Kangping Lu

Main category: cs.CV

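TL;DR: Proposes OT-Drive, an optimal-transport-based multi-modal fusion framework that improves traversable area segmentation in unstructured environments for autonomous driving, particularly in out-of-distribution (OOD) scenarios.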

Details

Motivation: Existing data-driven methods suffer degraded segmentation performance in out-of-distribution (OOD) scenarios, which impairs planning and decision-making in autonomous driving. Method: The fusion of RGB and surface normal features is formulated as a distribution transport problem. A Scene Anchor Generator (SAG) decomposes scene information into the joint distribution of weather, time-of-day, and road type to build semantic anchors, and an Optimal Transport-based fusion module (OT Fusion) transports the multi-modal features onto the manifold defined by these anchors for robust segmentation. Result: The method achieves 95.16% mIoU on ORFD OOD scenarios, 6.35% above prior methods, and 89.79% mIoU on cross-dataset transfer, 13.99% above baselines. Conclusion: OT-Drive generalizes strongly to OOD scenarios with only limited training data, which substantially improves its practicality and efficiency for real-world deployment. Abstract: Reliable traversable area segmentation in unstructured environments is critical for planning and decision-making in autonomous driving. However, existing data-driven approaches often suffer from degraded segmentation performance in out-of-distribution (OOD) scenarios, consequently impairing downstream driving tasks. To address this issue, we propose OT-Drive, an Optimal Transport-driven multi-modal fusion framework. The proposed method formulates RGB and surface normal fusion as a distribution transport problem. Specifically, we design a novel Scene Anchor Generator (SAG) to decompose scene information into the joint distribution of weather, time-of-day, and road type, thereby constructing semantic anchors that can generalize to unseen scenarios. Subsequently, we design an innovative Optimal Transport-based multi-modal fusion module (OT Fusion) to transport RGB and surface normal features onto the manifold defined by the semantic anchors, enabling robust traversable area segmentation under OOD scenarios. Experimental results demonstrate that our method achieves 95.16% mIoU on ORFD OOD scenarios, outperforming prior methods by 6.35%, and 89.79% mIoU on cross-dataset transfer tasks, surpassing baselines by 13.99%. These results indicate that the proposed model can attain strong OOD generalization with only limited training data, substantially enhancing its practicality and efficiency for real-world deployment.
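
To make the idea of transporting features onto anchors concrete, here is a minimal sketch using entropy-regularized optimal transport (Sinkhorn iterations) followed by a barycentric projection. This is a generic textbook construction, not the paper's OT Fusion module; the squared Euclidean cost, uniform masses, regularization strength, toy 2-D points, and all names are assumptions made for illustration.

```python
import math


def sinkhorn_plan(a, b, cost, reg=0.5, n_iters=500):
    """Entropy-regularized transport plan between histograms a (rows) and b (columns)."""
    n, m = len(a), len(b)
    kernel = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def barycentric_map(plan, anchors):
    """Move each source point to the plan-weighted average of the anchors."""
    dim = len(anchors[0])
    mapped = []
    for row in plan:
        mass = sum(row)
        mapped.append([sum(row[j] * anchors[j][d] for j in range(len(anchors))) / mass
                       for d in range(dim)])
    return mapped


# Hypothetical fused features (sources) and semantic anchors (targets).
features = [[0.0, 0.1], [0.9, 1.0], [1.1, 0.9]]
anchors = [[0.0, 0.0], [1.0, 1.0]]
cost = [[sum((f - g) ** 2 for f, g in zip(x, y)) for y in anchors] for x in features]
a = [1.0 / 3] * 3   # uniform mass on features
b = [0.5, 0.5]      # uniform mass on anchors
plan = sinkhorn_plan(a, b, cost)
mapped = barycentric_map(plan, anchors)
print(mapped)
```

The mapped points always lie inside the convex hull of the anchors, which is one intuition for why anchoring can stabilize features from unseen conditions.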

[90] The Spatial Blindspot of Vision-Language Models

Nahid Alam,Leema Krishna Murali,Siddhant Bharadwaj,Patrick Liu,Timothy Chung,Drishti Sharma,Akshata A,Kranthi Kiran,Wesley Tam,Bala Krishna S Vegesna

Main category: cs.CV

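TL;DR: Examines the weakness of current vision-language models (VLMs) in understanding spatial relationships and proposes alternative training objectives and 2D positional encodings to strengthen spatial awareness.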

Details

Motivation: Current VLMs typically use CLIP-style training that flattens images into 1D patch sequences, discarding the 2D structure that spatial reasoning depends on and limiting performance in applications that need spatial grounding, such as robotics and embodied AI. Method: The study explores two remedies: (i) image encoders trained with alternative objectives, and (ii) positional encodings that preserve 2D structure. Result: Experiments show that these architectural choices improve spatial reasoning on several benchmarks. Conclusion: Recovering and exploiting the 2D structure of images is key to improving spatial reasoning in VLMs and should be an important direction for future VLM design. Abstract: Vision-language models (VLMs) have advanced rapidly, but their ability to capture spatial relationships remains a blindspot. Current VLMs are typically built with contrastive language-image pretraining (CLIP) style image encoders. The training recipe often flattens images into 1D patch sequences, discarding the 2D structure necessary for spatial reasoning. We argue that this lack of spatial awareness is a missing dimension in VLM design and a bottleneck for applications requiring spatial grounding, such as robotics and embodied AI. To address this, we investigate (i) image encoders trained with alternative objectives and (ii) 2D positional encodings. Our experiments show that these architectural choices can lead to improved spatial reasoning on several benchmarks.
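
The contrast between a flattened 1D index and a 2D positional encoding can be sketched as follows. This uses a fixed sinusoidal encoding with separate channels for the row and the column; the abstract does not specify which 2D encoding the authors use, so this particular scheme, the 14-patch grid width, and the function names are illustrative assumptions.

```python
import math


def sincos_1d(position, dim):
    """Sinusoidal encoding of one scalar position into `dim` values (dim must be even)."""
    encoding = []
    for k in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * k / dim))
        encoding.extend([math.sin(position * freq), math.cos(position * freq)])
    return encoding


def sincos_2d(row, col, dim):
    """Concatenate a row encoding and a column encoding (dim must be divisible by 4)."""
    half = dim // 2
    return sincos_1d(row, half) + sincos_1d(col, half)


grid_width = 14                    # hypothetical 14 x 14 patch grid
patch_a = (3, 5)                   # (row, col)
patch_b = (4, 5)                   # the patch directly below patch_a

# In a flattened sequence, vertical neighbors are a whole row apart.
flat_a = patch_a[0] * grid_width + patch_a[1]
flat_b = patch_b[0] * grid_width + patch_b[1]

enc_a = sincos_2d(*patch_a, dim=8)
enc_b = sincos_2d(*patch_b, dim=8)
print(flat_b - flat_a, enc_a[4:] == enc_b[4:])
```

With the 2D scheme the two patches share their entire column half of the encoding, so vertical adjacency is explicit rather than something the model must infer from indices 14 positions apart.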

[91] DR$^2$Seg: Decomposed Two-Stage Rollouts for Efficient Reasoning Segmentation in Multimodal Large Language Models

Yulin He,Wei Chen,Zhikang Jian,Tianhang Guo,Wenjuan Zhou,Minglong Li

Main category: cs.CV

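TL;DR: Proposes DR$^2$Seg, a self-rewarding framework that needs no extra supervision and uses a two-stage rollout strategy to improve reasoning efficiency and segmentation accuracy.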

Details

Motivation: Existing methods suffer from overthinking: they generate verbose reasoning chains that interfere with object localization in multimodal large language models. Method: A two-stage rollout strategy is used. In the first stage the model generates a self-contained description that explicitly specifies the target object; in the second stage this description replaces the original complex query to verify its self-containment. Two self-rewards are introduced to strengthen goal-oriented reasoning and suppress redundant thinking. Result: Experiments across multimodal large language models of varying scales and different segmentation models show that DR$^2$Seg consistently improves reasoning efficiency and overall segmentation performance. Conclusion: DR$^2$Seg effectively mitigates overthinking and achieves more efficient and accurate reasoning segmentation without extra supervision. Abstract: Reasoning segmentation is an emerging vision-language task that requires reasoning over intricate text queries to precisely segment objects. However, existing methods typically suffer from overthinking, generating verbose reasoning chains that interfere with object localization in multimodal large language models (MLLMs). To address this issue, we propose DR$^2$Seg, a self-rewarding framework that improves both reasoning efficiency and segmentation accuracy without requiring extra thinking supervision. DR$^2$Seg employs a two-stage rollout strategy that decomposes reasoning segmentation into multimodal reasoning and referring segmentation. In the first stage, the model generates a self-contained description that explicitly specifies the target object. In the second stage, this description replaces the original complex query to verify its self-containment. Based on this design, two self-rewards are introduced to strengthen goal-oriented reasoning and suppress redundant thinking. Extensive experiments across MLLMs of varying scales and segmentation models demonstrate that DR$^2$Seg consistently improves reasoning efficiency and overall segmentation performance.

[92] DW-DGAT: Dynamically Weighted Dual Graph Attention Network for Neurodegenerative Disease Diagnosis

Chengjia Liang,Zhenjiong Wang,Chao Chen,Ruizhi Zhang,Songxi Liang,Hai Xie,Haijun Lei,Zhongwei Huang

Main category: cs.CV

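TL;DR: Proposes a dynamically weighted dual graph attention network (DW-DGAT) for early diagnosis of Parkinson's and Alzheimer's disease that fuses multimodal data, addresses class imbalance, and reaches state-of-the-art performance on the PPMI and ADNI datasets.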

Details

Motivation: Early diagnosis of Parkinson's and Alzheimer's disease faces challenges from high-dimensional multimodal data fusion, data heterogeneity, and class imbalance, so more effective models are needed to raise diagnostic accuracy. Method: The approach combines a general-purpose data fusion strategy, a dual graph attention architecture based on brain regions and inter-sample relationships, and a class weight generation mechanism with stable loss functions to mitigate class imbalance. Result: Experiments on the PPMI and ADNI datasets show that the method achieves state-of-the-art performance in early diagnosis of neurodegenerative diseases. Conclusion: DW-DGAT effectively integrates heterogeneous multi-source data and improves early diagnosis of PD and AD, showing potential for clinical application. Abstract: Parkinson's disease (PD) and Alzheimer's disease (AD) are the two most prevalent and incurable neurodegenerative diseases (NDs) worldwide, for which early diagnosis is critical to delay their progression. However, the high dimensionality of multi-metric data with diverse structural forms, the heterogeneity of neuroimaging and phenotypic data, and class imbalance collectively pose significant challenges to early ND diagnosis. To address these challenges, we propose a dynamically weighted dual graph attention network (DW-DGAT) that integrates: (1) a general-purpose data fusion strategy to merge three structural forms of multi-metric data; (2) a dual graph attention architecture based on brain regions and inter-sample relationships to extract both micro- and macro-level features; and (3) a class weight generation mechanism combined with two stable and effective loss functions to mitigate class imbalance. Rigorous experiments, based on the Parkinson's Progression Markers Initiative (PPMI) and Alzheimer's Disease Neuroimaging Initiative (ADNI) studies, demonstrate the state-of-the-art performance of our approach.

[93] VERHallu: Evaluating and Mitigating Event Relation Hallucination in Video Large Language Models

Zefan Zhang,Kehua Zhu,Shijie Jiang,Hongyuan Lu,Shengkai Sun,Tian Bai

Main category: cs.CV

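TL;DR: Introduces VERHallu, a new benchmark for video event relation hallucination that covers causal, temporal, and subevent relations and uses three task types for comprehensive evaluation. The study finds that existing VideoLLMs perform poorly on dense-event reasoning and often rely on prior knowledge while ignoring frame-level cues. It proposes a Key-Frame Propagating (KFP) strategy that reallocates frame-level attention in intermediate layers to improve multi-event understanding, mitigating hallucination without affecting inference speed.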

Details

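Motivation: Existing research mostly addresses hallucinations about the presence of objects, scenes, and events in videos and neglects hallucinations about relations between events. Because videos carry complex dynamic information, a model may wrongly infer causal, temporal, or containment relations between events, which harms reliability and safety, so dedicated evaluation and analysis of event relation hallucination is needed. Method: The VERHallu benchmark focuses on causal, temporal, and subevent relations between events and comprises three task types: relation classification, question answering, and counterfactual question answering. It includes counterintuitive video scenarios that contradict common sense and provides human-annotated candidates for identifying vision-language and pure language biases. The proposed Key-Frame Propagating (KFP) strategy reallocates frame-level attention within intermediate layers to improve understanding of multiple events and their relations. Result: Current state-of-the-art VideoLLMs perform poorly on dense-event relation reasoning and tend to rely on prior knowledge rather than making full use of frame-level visual cues. They ground key events well but often overlook surrounding subevents, leading to an incomplete understanding of event relations. KFP markedly reduces event relation hallucination and improves performance across tasks without adding inference latency. Conclusion: Event relation hallucination is an important problem in VideoLLMs, especially in complex dynamic scenes. VERHallu provides an effective tool for evaluating it, and KFP improves multi-event relation understanding through better attention allocation, contributing to more accurate and robust video understanding models. Abstract: Video Large Language Models (VideoLLMs) exhibit various types of hallucinations. Existing research has primarily focused on hallucinations involving the presence of events, objects, and scenes in videos, while largely neglecting event relation hallucination. In this paper, we introduce a novel benchmark for evaluating Video Event Relation Hallucination, named VERHallu. This benchmark focuses on causal, temporal, and subevent relations between events, encompassing three types of tasks: relation classification, question answering, and counterfactual question answering, for a comprehensive evaluation of event relation hallucination. Additionally, it features counterintuitive video scenarios that deviate from typical pretraining distributions, with each sample accompanied by human-annotated candidates covering both vision-language and pure language biases. Our analysis reveals that current state-of-the-art VideoLLMs struggle with dense-event relation reasoning, often relying on prior knowledge due to insufficient use of frame-level cues. Although these models demonstrate strong grounding capabilities for key events, they often overlook the surrounding subevents, leading to an incomplete and inaccurate understanding of event relations. To tackle this, we propose a Key-Frame Propagating (KFP) strategy, which reallocates frame-level attention within intermediate layers to enhance multi-event understanding. Experiments show it effectively mitigates event relation hallucination without affecting inference speed.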

[94] Disentangled Concept Representation for Text-to-image Person Re-identification

Giyeol Kim,Chanho Eom

Main category: cs.CV

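TL;DR: Proposes DiCo, a framework for text-to-image person re-identification (TIReID) that uses hierarchical, disentangled cross-modal alignment to reach performance competitive with state-of-the-art methods on several datasets while making fine-grained retrieval results more interpretable.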

Details

Motivation: Text-to-image person re-identification (TIReID) is challenging because of the substantial modality gap between visual appearance and textual expression and the need to model fine-grained correspondences that distinguish individuals. Method: DiCo introduces a shared slot-based representation in which each slot acts as a part-level anchor across modalities and is further decomposed into multiple concept blocks. This disentangles complementary attributes such as color, texture, and shape while keeping part-level correspondence between image and text consistent. Result: Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid show performance competitive with state-of-the-art methods, with improved interpretability through explicit slot- and block-level representations. Conclusion: Through hierarchical disentangled representations, DiCo narrows the modality gap between text and image and improves both fine-grained matching and interpretability in TIReID. Abstract: Text-to-image person re-identification (TIReID) aims to retrieve person images from a large gallery given free-form textual descriptions. TIReID is challenging due to the substantial modality gap between visual appearances and textual expressions, as well as the need to model fine-grained correspondences that distinguish individuals with similar attributes such as clothing color, texture, or outfit style. To address these issues, we propose DiCo (Disentangled Concept Representation), a novel framework that achieves hierarchical and disentangled cross-modal alignment. DiCo introduces a shared slot-based representation, where each slot acts as a part-level anchor across modalities and is further decomposed into multiple concept blocks. This design enables the disentanglement of complementary attributes (e.g., color, texture, shape) while maintaining consistent part-level correspondence between image and text. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid demonstrate that our framework achieves competitive performance with state-of-the-art methods, while also enhancing interpretability through explicit slot- and block-level representations for more fine-grained retrieval results.

[95] UEOF: A Benchmark Dataset for Underwater Event-Based Optical Flow

Nick Truong,Pritam P. Karmokar,William J. Beksi

Main category: cs.CV

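TL;DR: Introduces the first synthetic underwater optical flow benchmark dataset for event cameras, generating event streams with dense ground truth from physically based rendered RGBD sequences to support motion estimation research in underwater environments.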

Details

Motivation: Underwater optical conditions are complex and ground-truth motion is hard to obtain, so existing methods are held back by the lack of datasets that pair realistic underwater event data with optical flow. Method: Underwater RGBD videos are generated with physically based ray tracing, a modern video-to-event pipeline converts them into realistic event streams, and dense ground truth is provided for optical flow, depth, and camera motion. Result: The work produces an event dataset that includes realistic underwater optical effects and benchmarks current learning-based and model-based optical flow prediction methods on it. Conclusion: The dataset provides a new baseline for developing and evaluating underwater event-based perception algorithms. Abstract: Underwater imaging is fundamentally challenging due to wavelength-dependent light attenuation, strong scattering from suspended particles, turbidity-induced blur, and non-uniform illumination. These effects impair standard cameras and make ground-truth motion nearly impossible to obtain. On the other hand, event cameras offer microsecond resolution and high dynamic range. Nonetheless, progress on investigating event cameras for underwater environments has been limited due to the lack of datasets that pair realistic underwater optics with accurate optical flow. To address this problem, we introduce the first synthetic underwater benchmark dataset for event-based optical flow derived from physically-based ray-traced RGBD sequences. Using a modern video-to-event pipeline applied to rendered underwater videos, we produce realistic event data streams with dense ground-truth flow, depth, and camera motion. Moreover, we benchmark state-of-the-art learning-based and model-based optical flow prediction methods to understand how underwater light transport affects event formation and motion estimation accuracy. Our dataset establishes a new baseline for future development and evaluation of underwater event-based perception algorithms. The source code and dataset for this project are publicly available at https://robotic-vision-lab.github.io/ueof.

[96] CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation

Chengzhuo Tong,Mingkun Chang,Shenglong Zhang,Yuran Wang,Cheng Liang,Zhizheng Zhao,Ruichuan An,Bohan Zeng,Yang Shi,Yifan Dai,Ziming Zhao,Guanbin Li,Pengfei Wan,Yuanxing Zhang,Wentao Zhang

Main category: cs.CV

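TL;DR: Proposes CoF-T2I, a model that brings Chain-of-Frame (CoF) reasoning, the step-by-step visual refinement seen in video generation, to text-to-image (T2I) generation and significantly improves generation quality.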

Details

Motivation: Video generation models have shown Chain-of-Frame (CoF) reasoning ability, but its potential for text-to-image generation remains largely untapped because the T2I process lacks a clearly defined reasoning starting point and interpretable intermediate states. Method: CoF-T2I integrates CoF reasoning into T2I generation through progressive visual refinement, with intermediate frames serving as explicit reasoning steps. The CoF-Evol-Instruct dataset is curated to model the generation process from semantics to aesthetics, and each frame is encoded independently to improve quality and reduce motion artifacts. Result: CoF-T2I significantly outperforms the base video model and reaches 0.86 on GenEval and 7.468 on Imagine-Bench. Conclusion: The results indicate that video models with CoF reasoning hold substantial promise for advancing high-quality text-to-image generation. Abstract: Recent video generation models have revealed the emergence of Chain-of-Frame (CoF) reasoning, enabling frame-by-frame visual inference. With this capability, video models have been successfully applied to various visual tasks (e.g., maze solving, visual puzzles). However, their potential to enhance text-to-image (T2I) generation remains largely unexplored due to the absence of a clearly defined visual reasoning starting point and interpretable intermediate states in the T2I generation process. To bridge this gap, we propose CoF-T2I, a model that integrates CoF reasoning into T2I generation via progressive visual refinement, where intermediate frames act as explicit reasoning steps and the final frame is taken as output. To establish such an explicit generation process, we curate CoF-Evol-Instruct, a dataset of CoF trajectories that model the generation process from semantics to aesthetics. To further improve quality and avoid motion artifacts, we encode each frame independently. Experiments show that CoF-T2I significantly outperforms the base video model and achieves competitive performance on challenging benchmarks, reaching 0.86 on GenEval and 7.468 on Imagine-Bench. These results indicate the substantial promise of video models for advancing high-quality text-to-image generation.

[97] ReaMIL: Reasoning- and Evidence-Aware Multiple Instance Learning for Whole-Slide Histopathology

Hyun Do Jung,Jungwon Choi,Hwiyoung Kim

Main category: cs.CV

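TL;DR: ReaMIL is a multiple instance learning method for whole-slide histopathology images that adds a light selection head and a budgeted-sufficiency objective to select compact evidence sets efficiently while maintaining high performance.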

Details

Motivation: Improving the interpretability and evidence efficiency of multiple instance learning for whole-slide pathology without sacrificing performance requires a method that can select a minimal sufficient set of evidence. Method: ReaMIL adds a light selection head that produces soft per-tile importance gates and is trained with a budgeted-sufficiency objective (a hinge loss). Under a sparsity constraint, the objective ensures that the kept evidence alone reaches a specified classification confidence. Result: Across several datasets ReaMIL matches or slightly improves baseline AUC. On NSCLC it attains AUC 0.983 with a mean minimal sufficient K of about 8.2 tiles ($\tau = 0.9$) and AUKC of about 0.864, with small and spatially compact evidence sets. Conclusion: ReaMIL needs no extra supervision and integrates seamlessly into standard MIL pipelines. It preserves performance while providing interpretable evidence sets and quantitative efficiency diagnostics, improving the reliability and transparency of WSI analysis. Abstract: We introduce ReaMIL (Reasoning- and Evidence-Aware MIL), a multiple instance learning approach for whole-slide histopathology that adds a light selection head to a strong MIL backbone. The head produces soft per-tile gates and is trained with a budgeted-sufficiency objective: a hinge loss that enforces the true-class probability to be $\geq \tau$ using only the kept evidence, under a sparsity budget on the number of selected tiles. The budgeted-sufficiency objective yields small, spatially compact evidence sets without sacrificing baseline performance. Across TCGA-NSCLC (LUAD vs. LUSC), TCGA-BRCA (IDC vs. Others), and PANDA, ReaMIL matches or slightly improves baseline AUC and provides quantitative evidence-efficiency diagnostics. On NSCLC, it attains AUC 0.983 with a mean minimal sufficient K (MSK) $\approx 8.2$ tiles at $\tau = 0.90$ and AUKC $\approx 0.864$, showing that class confidence rises sharply and stabilizes once a small set of tiles is kept. The method requires no extra supervision, integrates seamlessly with standard MIL training, and naturally yields slide-level overlays. We report accuracy alongside MSK, AUKC, and contiguity for rigorous evaluation of model behavior on WSIs.
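
The abstract describes the objective only in words, so the following is one plausible reading rather than the authors' formula: a hinge term that is zero once the true-class probability from the kept tiles reaches the threshold, plus a hinge penalty when the soft gates exceed the tile budget. The penalty form, the weight `sparsity_weight`, the default budget, and the helper for the minimal sufficient K are all assumptions.

```python
def budgeted_sufficiency_loss(p_true_kept, gates, tau=0.9, budget=8, sparsity_weight=0.1):
    """Hinge on confidence from kept evidence plus a hinge on gate mass over budget.

    p_true_kept: true-class probability predicted from the gated (kept) tiles only.
    gates: soft per-tile gates in [0, 1]; their sum approximates the number of kept tiles.
    """
    sufficiency = max(0.0, tau - p_true_kept)    # zero once confidence reaches tau
    over_budget = max(0.0, sum(gates) - budget)  # zero while within the tile budget
    return sufficiency + sparsity_weight * over_budget


def minimal_sufficient_k(confidence_by_k, tau=0.9):
    """Smallest K whose top-K-tile confidence reaches tau, or None if never reached.

    confidence_by_k[k - 1] is the true-class probability when only the K highest-gated
    tiles are kept (assumed to be precomputed by the caller).
    """
    for k, confidence in enumerate(confidence_by_k, start=1):
        if confidence >= tau:
            return k
    return None


print(budgeted_sufficiency_loss(0.95, [1.0] * 5))       # confident and within budget
print(budgeted_sufficiency_loss(0.70, [1.0] * 10))      # under-confident and over budget
print(minimal_sufficient_k([0.40, 0.70, 0.92, 0.95]))   # third tile crosses tau
```

Under this reading, the two hinge terms pull in opposite directions: the first asks for enough evidence, the second discourages keeping more tiles than the budget allows.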

[98] Thinking Like Van Gogh: Structure-Aware Style Transfer via Flow-Guided 3D Gaussian Splatting

Zhendong Wang,Lebin Zhou,Jingchuan Xiao,Rongduo Han,Nam Ling,Cihan Ruan

Main category: cs.CV

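TL;DR: Proposes a flow-guided geometric advection framework for mesh-free stylization of 3D Gaussian Splatting that turns the brushstroke motion of 2D paintings into 3D structural deformation, reproducing Post-Impressionist style more faithfully.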

Details

Motivation: Existing 3D style transfer methods focus on projecting surface texture and neglect geometric abstraction, so they fail to capture the structural exaggeration at the heart of Post-Impressionism. A method that uses geometric deformation as the primary means of expression is therefore needed. Method: Directional flow fields are extracted from 2D paintings and back-propagated into 3D space, adjusting Gaussian primitives into flow-aligned brushstrokes that conform to scene topology. The approach uses a projection-based, mesh-free flow guidance mechanism together with a luminance-structure decoupling strategy that separates geometric deformation from color optimization. Result: The method produces high-quality Post-Impressionist 3D stylization with more expressive structural abstraction and avoids the visual artifacts that conventional methods show under aggressive stylization. The proposed VLM-as-a-Judge evaluation framework measures artistic authenticity more appropriately. Conclusion: Geometric abstraction is key to Post-Impressionist 3D stylization. Through flow guidance and decoupled optimization, the method converts 2D painterly motion into 3D structural expression and improves artistic expressiveness and agreement with subjective aesthetic judgment. Abstract: In 1888, Vincent van Gogh wrote, "I am seeking exaggeration in the essential." This principle, amplifying structural form while suppressing photographic detail, lies at the core of Post-Impressionist art. However, most existing 3D style transfer methods invert this philosophy, treating geometry as a rigid substrate for surface-level texture projection. To authentically reproduce Post-Impressionist stylization, geometric abstraction must be embraced as the primary vehicle of expression. We propose a flow-guided geometric advection framework for 3D Gaussian Splatting (3DGS) that operationalizes this principle in a mesh-free setting. Our method extracts directional flow fields from 2D paintings and back-propagates them into 3D space, rectifying Gaussian primitives to form flow-aligned brushstrokes that conform to scene topology without relying on explicit mesh priors. This enables expressive structural deformation driven directly by painterly motion rather than photometric constraints. Our contributions are threefold: (1) a projection-based, mesh-free flow guidance mechanism that transfers 2D artistic motion into 3D Gaussian geometry; (2) a luminance-structure decoupling strategy that isolates geometric deformation from color optimization, mitigating artifacts during aggressive structural abstraction; and (3) a VLM-as-a-Judge evaluation framework that assesses artistic authenticity through aesthetic judgment instead of conventional pixel-level metrics, explicitly addressing the subjective nature of artistic stylization.

[99] Difficulty-guided Sampling: Bridging the Target Gap between Dataset Distillation and Downstream Tasks

Mingzhuo Li,Guang Li,Linfeng Ye,Jiafeng Mao,Takahiro Ogawa,Konstantinos N. Plataniotis,Miki Haseyama

Main category: cs.CV

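TL;DR: Proposes difficulty-guided sampling (DGS), which narrows the gap between the distillation objective and the downstream task in dataset distillation by introducing a task-related notion of difficulty, improving model performance.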

Details

Motivation: Existing dataset distillation methods usually ignore information specific to the downstream task, which creates a target gap between the distillation objective and the actual task and hurts performance. Method: Difficulty-guided sampling (DGS) is a plug-in post-stage sampling module that draws the final distilled dataset from image pools generated by existing methods according to a specific target difficulty distribution, using sample difficulty from the image classification task. Difficulty-aware guidance (DAG) is also proposed to bring difficulty into the generation process. Result: Experiments across multiple settings confirm the effectiveness of the approach; both DGS and DAG clearly improve the downstream performance of distilled datasets. Conclusion: Introducing task-related difficulty information effectively narrows the gap between the distillation objective and the downstream task, and difficulty-aware strategies have potential for broad use across downstream tasks. Abstract: In this paper, we propose difficulty-guided sampling (DGS) to bridge the target gap between the distillation objective and the downstream task, thereby improving the performance of dataset distillation. Deep neural networks achieve remarkable performance but have time- and storage-intensive training processes. Dataset distillation is proposed to generate compact, high-quality distilled datasets, enabling effective model training while maintaining downstream performance. Existing approaches typically focus on features extracted from the original dataset, overlooking task-specific information, which leads to a target gap between the distillation objective and the downstream task. We propose incorporating characteristics that benefit downstream training into dataset distillation to bridge this gap. Focusing on the downstream task of image classification, we introduce the concept of difficulty and propose DGS as a plug-in post-stage sampling module. Following the specific target difficulty distribution, the final distilled dataset is sampled from image pools generated by existing methods. We also propose difficulty-aware guidance (DAG) to explore the effect of difficulty in the generation process. Extensive experiments across multiple settings demonstrate the effectiveness of the proposed methods. They also highlight the broader potential of difficulty for diverse downstream tasks.
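
Sampling a distilled set from an image pool so that it follows a target difficulty distribution can be sketched as below. How difficulty is scored and which target distribution is used are not given in the abstract, so the [0, 1] difficulty scores, the equal-width binning, the per-bin quota rule, and every name here are assumptions for illustration.

```python
import random


def sample_by_difficulty(pool, target_dist, n_samples, seed=0):
    """Pick items so that per-bin counts follow target_dist.

    pool: list of (item_id, difficulty) pairs with difficulty in [0, 1].
    target_dist: probabilities over equal-width difficulty bins (should sum to 1).
    Quotas are rounded per bin and capped by bin size, so the result can be
    slightly smaller or larger than n_samples for awkward inputs.
    """
    rng = random.Random(seed)
    n_bins = len(target_dist)
    bins = [[] for _ in range(n_bins)]
    for item_id, difficulty in pool:
        index = min(int(difficulty * n_bins), n_bins - 1)
        bins[index].append(item_id)
    selected = []
    for index, prob in enumerate(target_dist):
        quota = min(round(prob * n_samples), len(bins[index]))
        selected.extend(rng.sample(bins[index], quota))
    return selected


# Hypothetical pool of 30 generated images with evenly spread difficulty scores.
pool = [(i, (i + 0.5) / 30) for i in range(30)]
target_dist = [0.2, 0.5, 0.3]   # easy, medium, hard
selected = sample_by_difficulty(pool, target_dist, n_samples=10)
print(sorted(selected))
```

Because the module only filters an existing pool, a sketch like this can sit after any generator without changing how the images are produced, which matches the plug-in framing in the abstract.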

[100] V-Zero: Self-Improving Multimodal Reasoning with Zero Annotation

Han Wang,Yi Yang,Jingyuan Hu,Minfeng Zhu,Wei Chen

Main category: cs.CV

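TL;DR: V-Zero is a self-improvement framework for vision-language models that requires no human annotation and achieves performance gains through the co-evolution of a questioner role and a solver role.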

Details

Motivation: Existing vision-language models depend on large-scale human-annotated data, which is costly and time-consuming to obtain, so an annotation-free self-improvement method is needed. Method: V-Zero instantiates two roles, a Questioner and a Solver. The Questioner generates high-quality questions using a dual-track reasoning reward, and the Solver is optimized with pseudo-labels obtained by majority voting over its own responses. Both are trained iteratively with the GRPO algorithm. Result: On Qwen2.5-VL-7B-Instruct, without any human annotation, visual mathematical reasoning improves by +1.7 and general vision-centric tasks by +2.6. Conclusion: V-Zero shows that vision-language models can improve themselves effectively using only unlabeled images, offering a scalable paradigm for multimodal systems. Abstract: Recent advances in multimodal learning have significantly enhanced the reasoning capabilities of vision-language models (VLMs). However, state-of-the-art approaches rely heavily on large-scale human-annotated datasets, which are costly and time-consuming to acquire. To overcome this limitation, we introduce V-Zero, a general post-training framework that facilitates self-improvement using exclusively unlabeled images. V-Zero establishes a co-evolutionary loop by instantiating two distinct roles: a Questioner and a Solver. The Questioner learns to synthesize high-quality, challenging questions by leveraging a dual-track reasoning reward that contrasts intuitive guesses with reasoned results. The Solver is optimized using pseudo-labels derived from majority voting over its own sampled responses. Both roles are trained iteratively via Group Relative Policy Optimization (GRPO), driving a cycle of mutual enhancement. Remarkably, without a single human annotation, V-Zero achieves consistent performance gains on Qwen2.5-VL-7B-Instruct, improving visual mathematical reasoning by +1.7 and general vision-centric tasks by +2.6, demonstrating the potential of self-improvement in multimodal systems. Code is available at https://github.com/SatonoDia/V-Zero
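
The Solver's training signal combines two ideas that can be sketched briefly: a pseudo-label from majority voting over sampled answers, and group-relative advantages in the style of GRPO (each reward minus the group mean, divided by the group standard deviation). The binary match reward, the toy answers, and the function names are assumptions; V-Zero's actual reward design is not specified in the abstract.

```python
from collections import Counter
import statistics


def majority_vote(answers):
    """Return the most frequent answer and the fraction of samples that agree with it."""
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Five hypothetical sampled answers to one generated question.
sampled_answers = ["12", "12", "15", "12", "9"]
pseudo_label, agreement = majority_vote(sampled_answers)

# Assumed reward: 1 if a sample matches the pseudo-label, else 0.
rewards = [1.0 if answer == pseudo_label else 0.0 for answer in sampled_answers]
advantages = group_relative_advantages(rewards)
print(pseudo_label, agreement, advantages)
```

Samples that agree with the majority receive positive advantages and the rest negative ones, so no external label is needed to produce a learning signal.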

[101] InfoSculpt: Sculpting the Latent Space for Generalized Category Discovery

Wenwen Liao,Hang Ruan,Jianbo Yu,Yuansong Wang,Qingchao Jiang,Xiaofeng Yang

Main category: cs.CV

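TL;DR: This paper proposes InfoSculpt, a framework that uses the Information Bottleneck principle and a dual conditional mutual information objective to learn representations that disentangle category-level from instance-level features in generalized category discovery.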

Details

Motivation: Existing methods rely on pseudo-labeling or two-stage clustering and lack a mechanism for separating essential category signals from noise. Method: Grounded in the Information Bottleneck principle, proposes a dual conditional mutual information (CMI) objective: a category-level CMI on labeled data learns compact, discriminative representations, and an instance-level CMI on all data compresses augmentation-induced noise to extract invariant features. Result: Extensive experiments on 8 benchmark datasets show that InfoSculpt outperforms existing methods on both known and novel categories. Conclusion: InfoSculpt uses an information-theoretic approach to disentangle category and instance features, improving the performance and robustness of generalized category discovery. Abstract: Generalized Category Discovery (GCD) aims to classify instances from both known and novel categories within a large-scale unlabeled dataset, a critical yet challenging task for real-world, open-world applications. However, existing methods often rely on pseudo-labeling or two-stage clustering, which lack a principled mechanism to explicitly disentangle essential, category-defining signals from instance-specific noise. In this paper, we address this fundamental limitation by re-framing GCD from an information-theoretic perspective, grounded in the Information Bottleneck (IB) principle. We introduce InfoSculpt, a novel framework that systematically sculpts the representation space by minimizing a dual Conditional Mutual Information (CMI) objective. InfoSculpt uniquely combines a Category-Level CMI on labeled data to learn compact and discriminative representations for known classes, and a complementary Instance-Level CMI on all data to distill invariant features by compressing augmentation-induced noise. These two objectives work synergistically at different scales to produce a disentangled and robust latent space where categorical information is preserved while noisy, instance-specific details are discarded. Extensive experiments on 8 benchmarks validate the effectiveness of our information-theoretic approach.

[102] FlowAct-R1: Towards Interactive Humanoid Video Generation

Lizhen Wang,Yongming Zhu,Zhipeng Ge,Youwei Zheng,Longhao Zhang,Tianshu Hu,Shiyang Qin,Mingshuang Luo,Jiaxu Zhang,Xin Chen,Yulong Wang,Zerong Zheng,Jianwen Jiang,Chao Liang,Weifeng Chen,Xing Wang,Yuan Zhang,Mingyuan Gao

Main category: cs.CV

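TL;DR: This paper proposes FlowAct-R1, a real-time interactive humanoid video generation framework built on an MMDiT architecture. A chunkwise diffusion forcing strategy and its self-forcing variant enable low-latency, long-duration, consistent, high-quality video streaming with fine-grained full-body control, reaching a stable 25fps at 480p with a time-to-first-frame of only about 1.5 seconds.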

Details

Motivation: Existing video generation methods trade off high-fidelity synthesis against real-time interaction and struggle to deliver both quality and responsiveness during long continuous interaction. Method: Builds on an MMDiT architecture and introduces a chunkwise diffusion forcing strategy with a self-forcing variant, combined with efficient distillation and system-level optimizations, to achieve low-latency streaming video generation of arbitrary duration with fine-grained full-body control. Result: Reaches a stable 25fps at 480p with a time-to-first-frame (TTFF) of about 1.5 seconds, markedly reduces error accumulation, maintains long-term temporal consistency, and performs well on behavioral vividness and perceptual realism. Conclusion: FlowAct-R1 balances generation quality with real-time performance and supports natural behavioral transitions across diverse character styles, providing an efficient and robust solution for real-time interactive virtual agents. Abstract: Interactive humanoid video generation aims to synthesize lifelike visual agents that can engage with humans through continuous and responsive video. Despite recent advances in video synthesis, existing methods often grapple with the trade-off between high-fidelity synthesis and real-time interaction requirements. In this paper, we propose FlowAct-R1, a framework specifically designed for real-time interactive humanoid video generation. Built upon an MMDiT architecture, FlowAct-R1 enables the streaming synthesis of video with arbitrary durations while maintaining low-latency responsiveness. We introduce a chunkwise diffusion forcing strategy, complemented by a novel self-forcing variant, to alleviate error accumulation and ensure long-term temporal consistency during continuous interaction. By leveraging efficient distillation and system-level optimizations, our framework achieves a stable 25fps at 480p resolution with a time-to-first-frame (TTFF) of only around 1.5 seconds. The proposed method provides holistic and fine-grained full-body control, enabling the agent to transition naturally between diverse behavioral states in interactive scenarios. Experimental results demonstrate that FlowAct-R1 achieves exceptional behavioral vividness and perceptual realism, while maintaining robust generalization across diverse character styles.

[103] MathDoc: Benchmarking Structured Extraction and Active Refusal on Noisy Mathematics Exam Papers

Chenyue Zhou,Jiayi Tuo,Shitong Qin,Wei Dai,Mingxuan Wang,Ziwei Zhao,Duoyang Li,Shiyang Su,Yanxi Lu,Yanbiao Ma

Main category: cs.CV

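TL;DR: This paper introduces MathDoc, the first benchmark for document-level information extraction from authentic high school mathematics exam papers. It contains 3,609 manually annotated questions with real-world noise and unrecognizable samples, and evaluates both extraction accuracy and active refusal under difficult conditions.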

Details

Motivation: Existing datasets focus on clean documents or generic layout analysis and overlook both the structural integrity of mathematical problems and a model's ability to actively refuse incomplete inputs, so they fail to reflect the challenges of real educational settings. Method: Builds the MathDoc benchmark, which includes noisy samples from real exam papers, and designs a multi-dimensional evaluation framework covering stem accuracy, visual similarity, and refusal capability; experiments are run on several state-of-the-art multimodal large language models. Result: Although end-to-end models extract well, they generally lack the ability to refuse unrecognizable inputs and tend to produce confident but invalid outputs. Conclusion: Current multimodal large language models have a reliability gap on low-quality documents, and MathDoc offers a new direction for evaluating and improving model robustness in real-world settings. Abstract: The automated extraction of structured questions from paper-based mathematics exams is fundamental to intelligent education, yet remains challenging in real-world settings due to severe visual noise. Existing benchmarks mainly focus on clean documents or generic layout analysis, overlooking both the structural integrity of mathematical problems and the ability of models to actively reject incomplete inputs. We introduce MathDoc, the first benchmark for document-level information extraction from authentic high school mathematics exam papers. MathDoc contains 3,609 carefully curated questions with real-world artifacts and explicitly includes unrecognizable samples to evaluate active refusal behavior. We propose a multi-dimensional evaluation framework covering stem accuracy, visual similarity, and refusal capability. Experiments on SOTA MLLMs, including Qwen3-VL and Gemini-2.5-Pro, show that although end-to-end models achieve strong extraction performance, they consistently fail to refuse illegible inputs, instead producing confident but invalid outputs. These results highlight a critical gap in current MLLMs and establish MathDoc as a benchmark for assessing model reliability under degraded document conditions. Our project repository is available at https://github.com/winnk123/papers/tree/master

[104] Enhancing Visual In-Context Learning by Multi-Faceted Fusion

Wenwen Liao,Jianbo Yu,Yuansong Wang,Qingchao Jiang,Xiaofeng Yang

Main category: cs.CV

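TL;DR: This paper proposes a new visual in-context learning framework that uses a multi-combination collaborative fusion strategy and a MULTI-VQGAN architecture to exploit the complementary information in several high-quality prompts, improving generalization and prediction accuracy across multiple vision tasks.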

Details

Motivation: Existing "retrieve-then-prompt" methods usually use only the single best visual prompt or simply fuse the top-K prompts, which discards valuable contextual information and limits the model's reasoning capability. This paper addresses the problem with a richer, multi-faceted collaborative fusion mechanism. Method: Proposes a framework that generates three contextual representation branches, each formed from a different combination of prompts, and designs a MULTI-VQGAN architecture that jointly interprets and uses collaborative information from multiple sources to achieve multi-combination collaborative fusion. Result: Experiments on foreground segmentation, single-object detection, and image colorization show strong cross-task generalization, effective contextual fusion, and more robust and accurate predictions than existing methods. Conclusion: By not collapsing multiple prompt signals into a single representation, the proposed multi-combination collaborative fusion framework better exploits diverse contexts and markedly improves visual in-context learning. Abstract: Visual In-Context Learning (VICL) has emerged as a powerful paradigm, enabling models to perform novel visual tasks by learning from in-context examples. The dominant "retrieve-then-prompt" approach typically relies on selecting the single best visual prompt, a practice that often discards valuable contextual information from other suitable candidates. While recent work has explored fusing the top-K prompts into a single, enhanced representation, this still simply collapses multiple rich signals into one, limiting the model's reasoning capability. We argue that a more multi-faceted, collaborative fusion is required to unlock the full potential of these diverse contexts. To address this limitation, we introduce a novel framework that moves beyond single-prompt fusion towards a multi-combination collaborative fusion. Instead of collapsing multiple prompts into one, our method generates three contextual representation branches, each formed by integrating information from different combinations of top-quality prompts. These complementary guidance signals are then fed into the proposed MULTI-VQGAN architecture, which is designed to jointly interpret and utilize collaborative information from multiple sources. Extensive experiments on diverse tasks, including foreground segmentation, single-object detection, and image colorization, highlight its strong cross-task generalization, effective contextual fusion, and ability to produce more robust and accurate predictions than existing methods.

[105] Beyond Single Prompts: Synergistic Fusion and Arrangement for VICL

Wenwen Liao,Jianbo Yu,Yuansong Wang,Shifu Yan,Xiaofeng Yang

Main category: cs.CV

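TL;DR: This paper proposes an end-to-end visual in-context learning (VICL) framework that fuses information from multiple prompts and exploits prompt arrangement structure to improve how well inpainting models adapt to tasks from only a few prompts.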

Details

Motivation: Existing VICL methods select only the most similar prompt, ignoring complementary information in other high-quality prompts, and fail to exploit the structural information carried by different prompt arrangements. Method: Proposes an adaptive fusion module that aggregates key patterns and annotations from multiple prompts into more precise contextual prompts; introduces arrangement-specific lightweight MLPs to decouple layout priors; and adopts a bidirectional fine-tuning mechanism that swaps the roles of query and prompt to strengthen collaboration between the fusion module and the inpainting model. Result: Experiments on foreground segmentation, single-object detection, and image colorization show superior performance and strong cross-task generalization. Conclusion: The proposed framework addresses the information loss and under-used structure of existing VICL methods and markedly improves in-context learning in multi-task settings. Abstract: Vision In-Context Learning (VICL) enables inpainting models to quickly adapt to new visual tasks from only a few prompts. However, existing methods suffer from two key issues: (1) selecting only the most similar prompt discards complementary cues from other high-quality prompts; and (2) failing to exploit the structured information implied by different prompt arrangements. We propose an end-to-end VICL framework to overcome these limitations. First, an adaptive Fusion Module aggregates critical patterns and annotations from multiple prompts to form more precise contextual prompts. Second, we introduce arrangement-specific lightweight MLPs to decouple layout priors from the core model, while minimally affecting the overall model. In addition, a bidirectional fine-tuning mechanism swaps the roles of query and prompt, encouraging the model to reconstruct the original prompt from fused context and thus enhancing collaboration between the fusion module and the inpainting model. Experiments on foreground segmentation, single-object detection, and image colorization demonstrate superior results and strong cross-task generalization of our method.

[106] VQ-Seg: Vector-Quantized Token Perturbation for Semi-Supervised Medical Image Segmentation

Sicheng Yang,Zhaohu Xing,Lei Zhu

Main category: cs.CV

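TL;DR: This paper proposes VQ-Seg, the first semi-supervised medical image segmentation method to discretize the feature space with vector quantization (VQ) and replace dropout with a controllable Quantized Perturbation Module (QPM). A dual-branch architecture and a Post-VQ Feature Adapter (PFA) mitigate information loss and bring in semantic guidance from a foundation model, and the method outperforms existing approaches on a lung cancer CT dataset and other benchmarks.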

Details

Motivation: Existing dropout-based feature perturbation methods require manual tuning of the dropout rate, a sensitive hyperparameter that is hard to optimize and can lead to poor regularization. Method: Proposes VQ-Seg, which discretizes the feature space with vector quantization and designs a Quantized Perturbation Module (QPM) that achieves controllable perturbation by shuffling the spatial locations of codebook indices; a dual-branch architecture shares the post-quantization features between image reconstruction and segmentation, and a Post-VQ Feature Adapter (PFA) incorporates high-level semantic information from a foundation model. Result: Experiments on a self-collected large-scale lung cancer CT dataset (828 scans) and other public benchmarks show that the method outperforms current state-of-the-art semi-supervised medical image segmentation methods. Conclusion: By introducing vector quantization and a controllable perturbation mechanism, VQ-Seg overcomes the dependence of conventional dropout perturbation on a sensitive hyperparameter and achieves better regularization and segmentation performance, with good potential for application. Abstract: Consistency learning with feature perturbation is a widely used strategy in semi-supervised medical image segmentation. However, many existing perturbation methods rely on dropout and thus require careful manual tuning of the dropout rate, a sensitive hyperparameter that is often difficult to optimize and may lead to suboptimal regularization. To overcome this limitation, we propose VQ-Seg, the first approach to employ vector quantization (VQ) to discretize the feature space and introduce a novel and controllable Quantized Perturbation Module (QPM) that replaces dropout. Our QPM perturbs discrete representations by shuffling the spatial locations of codebook indices, enabling effective and controllable regularization. To mitigate potential information loss caused by quantization, we design a dual-branch architecture where the post-quantization feature space is shared by both image reconstruction and segmentation tasks. Moreover, we introduce a Post-VQ Feature Adapter (PFA) to incorporate guidance from a foundation model (FM), supplementing the high-level semantic information lost during quantization. Furthermore, we collect a large-scale Lung Cancer (LC) dataset comprising 828 CT scans annotated for central-type lung carcinoma. Extensive experiments on the LC dataset and other public benchmarks demonstrate the effectiveness of our method, which outperforms state-of-the-art approaches. Code available at: https://github.com/script-Yang/VQ-Seg.
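
The idea of perturbing a quantized feature map by shuffling the spatial locations of codebook indices can be pictured with a toy example. The sketch below is not the QPM from the VQ-Seg repository: the `ratio` parameter (the fraction of positions whose indices are permuted), the function name, and the use of plain nested lists in place of tensors are all assumptions made for illustration.

```python
import random


def shuffle_code_indices(index_map, ratio, seed=0):
    """Permute codebook indices among a random subset of spatial positions.

    index_map: 2D list of integer codebook indices (one per spatial location).
    ratio: fraction of positions taking part in the shuffle (assumed knob
        that controls perturbation strength).
    """
    rng = random.Random(seed)
    height, width = len(index_map), len(index_map[0])
    flat = [v for row in index_map for v in row]

    # Choose which positions are perturbed, then permute their indices.
    k = int(round(ratio * len(flat)))
    positions = rng.sample(range(len(flat)), k)
    values = [flat[p] for p in positions]
    rng.shuffle(values)
    for p, v in zip(positions, values):
        flat[p] = v

    # Restore the original 2D layout.
    return [flat[r * width:(r + 1) * width] for r in range(height)]
```

Because indices are only moved and never replaced, the set of codebook entries in use stays the same; only their spatial arrangement changes, and a ratio of 0 leaves the map untouched.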

Linquan Wu,Tianxiang Jiang,Yifei Dong,Haoyu Yang,Fengji Zhang,Shichaang Meng,Ai Xuan,Linqi Song,Jacky Keung

Main category: cs.CV

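TL;DR: This paper proposes LaViT, a framework that aligns latent visual thoughts rather than static embeddings to improve visual grounding in multimodal reasoning, narrowing the perception gap between teacher and student models.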

Details

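Motivation: Existing multimodal reasoning methods rely on external supervision and ignore intrinsic visual attention dynamics, so a student model may imitate the teacher's output text while attending to different visual regions, relying on language priors rather than grounded perception. Method: Proposes the LaViT framework, which forces the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories before generating text, and uses a curriculum sensory gating mechanism to prevent shortcut learning. Result: Experiments show that LaViT markedly improves visual grounding, with gains of up to +16.9% on complex reasoning tasks, and lets a compact 3B model outperform larger open-source models and proprietary models such as GPT-4o. Conclusion: By aligning latent visual thoughts and attention trajectories, LaViT narrows the perception gap and strengthens the model's visual understanding and reasoning. Abstract: Current multimodal latent reasoning often relies on external supervision (e.g., auxiliary images), ignoring intrinsic visual attention dynamics. In this work, we identify a critical Perception Gap in distillation: student models frequently mimic a teacher's textual output while attending to fundamentally divergent visual regions, effectively relying on language priors rather than grounded perception. To bridge this, we propose LaViT, a framework that aligns latent visual thoughts rather than static embeddings. LaViT compels the student to autoregressively reconstruct the teacher's visual semantics and attention trajectories prior to text generation, employing a curriculum sensory gating mechanism to prevent shortcut learning. Extensive experiments show that LaViT significantly enhances visual grounding, achieving up to +16.9% gains on complex reasoning tasks and enabling a compact 3B model to outperform larger open-source variants and proprietary models like GPT-4o.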

[108] Advancing Adaptive Multi-Stage Video Anomaly Reasoning: A Benchmark Dataset and Method

Chao Huang,Benfeng Wang,Wei Wang,Jie Wen,Li Shen,Wenqi Ren,Yong Xu,Xiaochun Cao

Main category: cs.CV

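TL;DR: This paper introduces Video Anomaly Reasoning (VAR), a new task that uses multi-stage reasoning to improve how multimodal large models reason about video anomaly detection and understanding. It releases a large-scale dataset with an annotation framework based on a perception-cognition-action chain and proposes the Vad-R1-Plus model for adaptive hierarchical reasoning and risk-aware decision making.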

Details

Motivation: Existing video anomaly detection methods based on multimodal large language models are mostly limited to localization or post-hoc description and lack explicit reasoning, risk awareness, and decision-oriented understanding, so they fall short of the interpretability and reliability that practical applications require. Method: Defines the Video Anomaly Reasoning (VAR) task and the PerCoAct-CoT annotation scheme and builds a large-scale dataset of 8,641 videos with more than 50,000 samples; designs the Anomaly-Aware Group Relative Policy Optimization algorithm to improve reasoning reliability under weak supervision; and develops the end-to-end Vad-R1-Plus model, which supports adaptive hierarchical reasoning and risk-aware decision making. Result: Experiments show that the method clearly outperforms open-source and proprietary baselines on VAR, improving the multi-stage reasoning ability and decision quality of MLLMs in anomaly understanding. Conclusion: The work moves video anomaly understanding from descriptive analysis toward a structured, interpretable reasoning paradigm and offers a new technical path for intelligent vision systems in safety-sensitive scenarios. Abstract: Recent progress in the reasoning capabilities of Multimodal Large Language Models (MLLMs) has highlighted their potential for performing complex video understanding tasks. However, in the domain of Video Anomaly Detection and Understanding (VAD&U), existing MLLM-based methods are largely limited to anomaly localization or post-hoc description, lacking explicit reasoning processes, risk awareness, and decision-oriented interpretation. To address this gap, we define a new task termed Video Anomaly Reasoning (VAR), which elevates video anomaly analysis from descriptive understanding to structured, multi-stage reasoning. VAR explicitly requires models to perform progressive reasoning over anomalous events before answering anomaly-related questions, encompassing visual perception, causal interpretation, and risk-aware decision making. To support this task, we present a new dataset with 8,641 videos, where each video is annotated with diverse question types corresponding to different reasoning depths, totaling more than 50,000 samples, making it one of the largest datasets for video anomaly. The annotations are based on a structured Perception-Cognition-Action Chain-of-Thought (PerCoAct-CoT), which formalizes domain-specific reasoning priors for video anomaly understanding. This design enables systematic evaluation of multi-stage and adaptive anomaly reasoning. In addition, we propose Anomaly-Aware Group Relative Policy Optimization to further enhance reasoning reliability under weak supervision.
Building upon the proposed task and dataset, we develop an end-to-end MLLM-based VAR model termed Vad-R1-Plus, which supports adaptive hierarchical reasoning and risk-aware decision making. Extensive experiments demonstrate that the proposed benchmark and method effectively advance the reasoning capabilities of MLLMs on VAR tasks, outperforming both open-source and proprietary baselines.

[109] RAG-3DSG: Enhancing 3D Scene Graphs with Re-Shot Guided Retrieval-Augmented Generation

Yue Chang,Rufeng Chen,Zhaofan Zhang,Yi Chen,Sihong Xie

Main category: cs.CV

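TL;DR: Proposes RAG-3DSG, which improves the accuracy and efficiency of open-vocabulary 3D scene graph generation through re-shot guided uncertainty estimation and retrieval-augmented generation.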

Details

Motivation: Existing open-vocabulary 3D scene graph generation methods suffer from low object recognition accuracy and slow speed, mainly because of constrained viewpoints, occlusions, and redundant surface density. Method: Proposes RAG-3DSG, which uses re-shot guided uncertainty estimation to reduce aggregation noise and supports retrieval-augmented generation (RAG) through reliable low-uncertainty objects; a dynamic downsample-mapping strategy accelerates cross-image object aggregation. Result: Experiments on the Replica dataset show that RAG-3DSG markedly improves node captioning accuracy in 3D scene graph generation while cutting mapping time by two-thirds relative to the vanilla version. Conclusion: RAG-3DSG improves both the accuracy and the efficiency of open-vocabulary 3D scene graph generation and suits downstream tasks such as robot manipulation and navigation. Abstract: Open-vocabulary 3D Scene Graph (3DSG) generation can enhance various downstream tasks in robotics, such as manipulation and navigation, by leveraging structured semantic representations. A 3DSG is constructed from multiple images of a scene, where objects are represented as nodes and relationships as edges. However, existing works for open-vocabulary 3DSG generation suffer from both low object-level recognition accuracy and slow speed, mainly due to constrained viewpoints, occlusions, and redundant surface density. To address these challenges, we propose RAG-3DSG to mitigate aggregation noise through re-shot guided uncertainty estimation and support object-level Retrieval-Augmented Generation (RAG) via reliable low-uncertainty objects. Furthermore, we propose a dynamic downsample-mapping strategy to accelerate cross-image object aggregation with adaptive granularity. Experiments on the Replica dataset demonstrate that RAG-3DSG significantly improves node captioning accuracy in 3DSG generation while reducing the mapping time by two-thirds compared to the vanilla version.

[110] From Physical Degradation Models to Task-Aware All-in-One Image Restoration

Hu Gao,Xiaoning Lei,Xichen Xu,Xingjian Wang,Lizhuang Ma

Main category: cs.CV

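TL;DR: This paper proposes OPIR, an efficient all-in-one image restoration framework that uses physical degradation modeling to predict a task-aware inverse degradation operator and introduces an uncertainty perception map to guide restoration across two stages, combining high performance with high efficiency.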

Details

Motivation: Existing all-in-one image restoration methods depend on complex modules or large models, which makes systems complicated and hard to use in real time; this paper starts from physical degradation modeling to design a lightweight, efficient method. Method: Proposes the OPIR framework with two stages: the first predicts an inverse degradation operator to produce an initial restored image and an uncertainty perception map, and the second uses that map for refined restoration. The same inverse operator prediction network is used in both stages, task-aware parameters adapt it to different tasks, and accelerated convolution improves efficiency. Result: OPIR performs strongly on a range of all-in-one restoration tasks and remains competitive on individual tasks, supporting its effectiveness and efficiency. Conclusion: With physically inspired inverse degradation modeling and an uncertainty-guided two-stage mechanism, OPIR delivers simple, efficient, high-performing all-in-one image restoration with good practical potential. Abstract: All-in-one image restoration aims to adaptively handle multiple restoration tasks with a single trained model. Although existing methods achieve promising results by introducing prompt information or leveraging large models, the added learning modules increase system complexity and hinder real-time applicability. In this paper, we adopt a physical degradation modeling perspective and predict a task-aware inverse degradation operator for efficient all-in-one image restoration. The framework consists of two stages. In the first stage, the predicted inverse operator produces an initial restored image together with an uncertainty perception map that highlights regions difficult to reconstruct, ensuring restoration reliability. In the second stage, the restoration is further refined under the guidance of this uncertainty map. The same inverse operator prediction network is used in both stages, with task-aware parameters introduced after operator prediction to adapt to different degradation tasks. Moreover, by accelerating the convolution of the inverse operator, the proposed method achieves efficient all-in-one image restoration. The resulting tightly integrated architecture, termed OPIR, is extensively validated through experiments, demonstrating superior all-in-one restoration performance while remaining highly competitive on task-aligned restoration.

[111] ELITE: Efficient Gaussian Head Avatar from a Monocular Video via Learned Initialization and TEst-time Generative Adaptation

Kim Youwang,Lee Hyoseok,Subin Park,Gerard Pons-Moll,Tae-Hyun Oh

Main category: cs.CV

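TL;DR: ELITE is an efficient method for synthesizing Gaussian avatars from a monocular video. It combines the strengths of 3D data priors and 2D generative priors, using fast initialization and test-time generative adaptation to produce high-fidelity avatars that generalize well.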

Details

Motivation: To compensate for missing visual cues in monocular video, existing methods rely on either a 3D data prior or a 2D generative prior, but the former generalizes poorly in the wild and the latter is computationally heavy and prone to identity hallucination. Method: Proposes a Mesh2Gaussian Prior Model (MGPM) for fast initialization of a Gaussian avatar and a test-time generative adaptation stage that uses a rendering-guided single-step diffusion enhancer to restore details, supervised with both real and synthetic images. Result: Experiments show that ELITE produces visually better results than prior methods even under challenging expressions and synthesizes 60x faster than the method based on a 2D generative prior. Conclusion: By combining the strengths of 3D and 2D priors, ELITE achieves efficient, high-fidelity animatable avatar synthesis with strong in-the-wild generalization. Abstract: We introduce ELITE, an Efficient Gaussian head avatar synthesis method from a monocular video via Learned Initialization and TEst-time generative adaptation. Prior works rely either on a 3D data prior or a 2D generative prior to compensate for missing visual cues in monocular videos. However, 3D data prior methods often struggle to generalize in-the-wild, while 2D generative prior methods are computationally heavy and prone to identity hallucination. We identify a complementary synergy between these two priors and design an efficient system that achieves high-fidelity animatable avatar synthesis with strong in-the-wild generalization. Specifically, we introduce a feed-forward Mesh2Gaussian Prior Model (MGPM) that enables fast initialization of a Gaussian avatar. To further bridge the domain gap at test time, we design a test-time generative adaptation stage, leveraging both real and synthetic images as supervision. Unlike previous full diffusion denoising strategies that are slow and hallucination-prone, we propose a rendering-guided single-step diffusion enhancer that restores missing visual details, grounded on Gaussian avatar renderings. Our experiments demonstrate that ELITE produces visually superior avatars to prior works, even for challenging expressions, while achieving 60x faster synthesis than the 2D generative prior method.

[112] Beyond Inpainting: Unleash 3D Understanding for Precise Camera-Controlled Video Generation

Dong-Yu Chen,Yixin Guo,Shuojin Yang,Tai-Jiang Mu,Shi-Min Hu

Main category: cs.CV

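TL;DR: This paper proposes DepthDirector, a video re-rendering framework that uses depth from an explicit 3D representation as camera-control guidance to achieve precise control and high-quality generation of dynamic scenes under novel camera trajectories.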

Details

Motivation: When controlling camera trajectories precisely, existing methods often fail to make full use of the 3D priors of video diffusion models (VDMs) and tend to fall into the "Inpainting Trap", which causes subject inconsistency and degraded generation quality. Method: Designs a View-Content Dual-Stream Condition mechanism that injects the source video and the warped depth sequence rendered under the target viewpoint into a pretrained video generation model, and trains with a lightweight LoRA-based video diffusion adapter to preserve the knowledge priors of VDMs. Result: DepthDirector outperforms existing methods in camera controllability and visual quality; a large-scale multi-camera synchronized dataset with 8K videos, MultiCam-WarpData, is constructed for the experiments. Conclusion: DepthDirector combines explicit 3D geometric guidance with the 3D understanding of video diffusion models to achieve high-fidelity, controllable video re-rendering. Abstract: Camera control has been extensively studied in conditioned video generation; however, precisely altering camera trajectories while faithfully preserving the video content remains a challenging task. The mainstream approach to achieving precise camera control is warping a 3D representation according to the target trajectory. However, such methods fail to fully leverage the 3D priors of video diffusion models (VDMs) and often fall into the Inpainting Trap, resulting in subject inconsistency and degraded generation quality. To address this problem, we propose DepthDirector, a video re-rendering framework with precise camera controllability. By leveraging the depth video from explicit 3D representation as camera-control guidance, our method can faithfully reproduce the dynamic scene of an input video under novel camera trajectories. Specifically, we design a View-Content Dual-Stream Condition mechanism that injects both the source video and the warped depth sequence rendered under the target viewpoint into the pretrained video generation model. This geometric guidance signal enables VDMs to comprehend camera movements and leverage their 3D understanding capabilities, thereby facilitating precise camera control and consistent content generation. Next, we introduce a lightweight LoRA-based video diffusion adapter to train our framework, fully preserving the knowledge priors of VDMs. Additionally, we construct a large-scale multi-camera synchronized dataset named MultiCam-WarpData using Unreal Engine 5, containing 8K videos across 1K dynamic scenes.
Extensive experiments show that DepthDirector outperforms existing methods in both camera controllability and visual quality. Our code and dataset will be publicly available.

[113] Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge

Sicheng Yang,Yukai Huang,Shitong Sun,Weitong Cai,Jiankang Deng,Jifei Song,Zhensong Zhang

Main category: cs.CV

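TL;DR: Proposes a framework that integrates query/option pre-processing, domain-specific fine-tuning, a novel Temporal Chain-of-Thought (T-CoT) prompting scheme, and robust post-processing, markedly improving MLLM performance on complex video question answering.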

Details

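Motivation: Multimodal large language models (MLLMs) struggle with complex video question answering such as HD-EPIC VQA because of ambiguous queries, difficulty with long-range temporal reasoning, and non-standardized outputs. Method: Combines query and option pre-processing, domain-specific fine-tuning of Qwen2.5-VL, a Temporal Chain-of-Thought (T-CoT) prompting mechanism for multi-step temporal reasoning, and a robust post-processing strategy. Result: Reaches 41.6% accuracy on the HD-EPIC VQA benchmark, clearly above the baseline methods. Conclusion: Complex video understanding tasks call for systematic optimization of the whole reasoning pipeline; a single improvement is not enough to address challenges along several dimensions. Abstract: Multimodal Large Language Models (MLLMs) struggle with complex video QA benchmarks like HD-EPIC VQA due to ambiguous queries/options, poor long-range temporal reasoning, and non-standardized outputs. We propose a framework integrating query/choice pre-processing, domain-specific Qwen2.5-VL fine-tuning, a novel Temporal Chain-of-Thought (T-CoT) prompting for multi-step reasoning, and robust post-processing. This system achieves 41.6% accuracy on HD-EPIC VQA, highlighting the need for holistic pipeline optimization in demanding video understanding. Our code and fine-tuned models are available at https://github.com/YoungSeng/Egocentric-Co-Pilot.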

[114] Attend to what I say: Highlighting relevant content on slides

Megha Mariam K M,C. V. Jawahar

Main category: cs.CV

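TL;DR: This paper proposes a method that analyzes the speaker's narration and matches it to textual or graphical elements on the slides, automatically identifying and highlighting the most relevant slide regions to better synchronize what is heard with what is seen and improve understanding of content-dense videos.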

Details

Motivation: In fast-paced or content-heavy presentations, listeners struggle to keep the spoken content in step with the key visual information on the slides, which raises cognitive load and makes comprehension harder. Method: Analyzes the spoken content and matches it against the text, graphics, and layout of the slides to locate the most relevant regions and highlight them automatically. Result: Explores several solutions, assesses their success and failure cases, and validates the approach as a way to improve understanding of multimedia documents. Conclusion: The method helps reduce cognitive load, improves comprehension of rich-media content such as educational videos and conference talks, and advances multimedia analysis. Abstract: Imagine sitting in a presentation, trying to follow the speaker while simultaneously scanning the slides for relevant information. While the entire slide is visible, identifying the relevant regions can be challenging. As you focus on one part of the slide, the speaker moves on to a new sentence, leaving you scrambling to catch up visually. This constant back-and-forth creates a disconnect between what is being said and the most important visual elements, making it hard to absorb key details, especially in fast-paced or content-heavy presentations such as conference talks. This requires an understanding of slides, including text, graphics, and layout. We introduce a method that automatically identifies and highlights the most relevant slide regions based on the speaker's narrative. By analyzing spoken content and matching it with textual or graphical elements in the slides, our approach ensures better synchronization between what listeners hear and what they need to attend to. We explore different ways of solving this problem and assess their success and failure cases. Analyzing multimedia documents is emerging as a key requirement for seamless understanding of content-rich videos, such as educational videos and conference talks, by reducing cognitive strain and improving comprehension. Code and dataset are available at: https://github.com/meghamariamkm2002/Slide_Highlight
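
To make the matching step concrete, here is a deliberately simple baseline, assuming each slide region comes with extracted text and the narration is already transcribed. It scores regions by Jaccard overlap of lowercase word tokens and returns the best one. This is a hypothetical lexical baseline for illustration only; the abstract does not describe the authors' actual matching models, and the region ids and example sentences are invented.

```python
import re


def tokens(text):
    """Lowercase alphanumeric word tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def most_relevant_region(spoken, regions):
    """Return (region_id, score) for the region that best matches the narration.

    spoken: transcribed sentence currently being said.
    regions: dict mapping region id -> text extracted from that slide region.
    Score is the Jaccard overlap between the two token sets.
    """
    spoken_tokens = tokens(spoken)
    best_id, best_score = None, 0.0
    for region_id, text in regions.items():
        region_tokens = tokens(text)
        union = spoken_tokens | region_tokens
        score = len(spoken_tokens & region_tokens) / len(union) if union else 0.0
        if score > best_score:
            best_id, best_score = region_id, score
    return best_id, best_score


regions = {
    "title": "Training setup",
    "bullet1": "Learning rate schedule with warmup",
    "bullet2": "Dataset statistics",
}
best_id, best_score = most_relevant_region(
    "We use a warmup learning rate schedule", regions
)
```

A purely lexical score like this cannot handle paraphrases or graphical elements without text, which is presumably why the paper compares several approaches and reports failure cases.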

[115] DanQing: An Up-to-Date Large-Scale Chinese Vision-Language Pre-training Dataset

Hengyu Shen,Tiancheng Gu,Bin Qin,Lan Wu,Yuling Wu,Shuo Tan,Zelong Sun,Jun Wang,Nan Wu,Xiang An,Weidong Cai,Ziyong Feng,Kaicheng Yang

Main category: cs.CV

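TL;DR: This paper introduces DanQing, a high-quality cross-modal dataset of 100 million Chinese image-text pairs. Built with a stricter filtering pipeline from 2024-2025 web data, it markedly improves Chinese vision-language pre-training models on a range of downstream tasks.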

Details

Motivation: Because high-quality Chinese image-text data is scarce, Chinese vision-language pre-training lags behind its English counterpart; this paper aims to fill that data gap. Method: Builds a complete data processing pipeline that collects and filters high-quality Chinese image-text pairs from Common Crawl to form the DanQing dataset, and validates it by continual pre-training of the SigLIP2 model. Result: On Chinese downstream tasks including zero-shot classification, cross-modal retrieval, and LMM-based evaluations, models trained on DanQing consistently outperform those trained on existing datasets. Conclusion: DanQing is a high-quality, up-to-date Chinese image-text dataset that can advance Chinese vision-language pre-training, and it will be open-sourced to support further research. Abstract: Vision-Language Pre-training (VLP) models demonstrate strong performance across various downstream tasks by learning from large-scale image-text pairs through contrastive pretraining. The release of extensive English image-text datasets (e.g., COYO-700M and LAION-400M) has enabled widespread adoption of models such as CLIP and SigLIP in tasks including cross-modal retrieval and image captioning. However, the advancement of Chinese vision-language pretraining has substantially lagged behind, due to the scarcity of high-quality Chinese image-text data. To address this gap, we develop a comprehensive pipeline for constructing a high-quality Chinese cross-modal dataset. As a result, we propose DanQing, which contains 100 million image-text pairs collected from Common Crawl. Different from existing datasets, DanQing is curated through a more rigorous selection process, yielding superior data quality. Moreover, DanQing is primarily built from 2024-2025 web data, enabling models to better capture evolving semantic trends and thus offering greater practical utility. We compare DanQing with existing datasets by continual pre-training of the SigLIP2 model. Experimental results show that DanQing consistently achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations. To facilitate further research in Chinese vision-language pre-training, we will open-source the DanQing dataset under the Creative Commons CC-BY 4.0 license.

Peng-Fei Zhang,Zi Huang

Main category: cs.CV

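TL;DR: Proposes Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for vision-language pre-training (VLP) models that refines universal adversarial perturbations at both the sample level and the optimization level, markedly improving attack efficiency and effectiveness.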

Details

Motivation: Existing adversarial attacks on VLP models are mostly sample-specific and computationally expensive, which makes them hard to scale to large datasets or new scenarios, so an efficient universal attack is needed. Method: For the image modality, HRA disentangles adversarial examples into clean images and perturbations and introduces a ScMix augmentation strategy to diversify visual contexts; at the optimization level it uses a temporal hierarchy of historical and future gradients to avoid local optima; for the text modality it combines intra-sentence and inter-sentence importance measures to identify globally influential words as universal text perturbations. Result: Experiments across various downstream tasks, VLP models, and datasets show that HRA clearly outperforms existing attacks on both image and text modalities, with stronger transferability and stability. Conclusion: HRA is an effective multimodal universal attack framework that attacks VLP models efficiently at low computational cost and exposes the fragility of cross-modal alignment in current VLP models. Abstract: Existing adversarial attacks for VLP models are mostly sample-specific, resulting in substantial computational overhead when scaled to large datasets or new scenarios. To overcome this limitation, we propose Hierarchical Refinement Attack (HRA), a multimodal universal attack framework for VLP models. HRA refines universal adversarial perturbations (UAPs) at both the sample level and the optimization level. For the image modality, we disentangle adversarial examples into clean images and perturbations, allowing each component to be handled independently for more effective disruption of cross-modal alignment. We further introduce a ScMix augmentation strategy that diversifies visual contexts and strengthens both global and local utility of UAPs, thereby reducing reliance on spurious features. In addition, we refine the optimization path by leveraging a temporal hierarchy of historical and estimated future gradients to avoid local minima and stabilize universal perturbation learning. For the text modality, HRA identifies globally influential words by combining intra-sentence and inter-sentence importance measures, and subsequently utilizes these words as universal text perturbations. Extensive experiments across various downstream tasks, VLP models, and datasets demonstrate the superiority of the proposed universal multimodal attacks.

[117] ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding

Xueyun Tian,Wei Li,Bingbing Xu,Heng Dong,Yuanzhuo Wang,Huawei Shen

Main category: cs.CV

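TL;DR: This paper presents ROMA, an omni-multimodal large language model for unified real-time understanding of audio, video, and text. Synchronized multimodal processing and a lightweight decision mechanism enable both reactive and proactive interaction on streaming input.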

Details

Motivation: Existing omni-multimodal models have incomplete modality support or lack proactive monitoring in streaming audio-video understanding, so they cannot handle reactive and proactive tasks at the same time. Method: ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames, and introduces a lightweight "speak head" that decouples response triggering from generation; it is trained on a purpose-built streaming dataset with a two-stage curriculum that optimizes streaming adaptation and proactive responsiveness. Result: Across 12 benchmarks, ROMA reaches state-of-the-art performance on proactive tasks (such as alerts and narration) and is competitive on reactive tasks (such as question answering). Conclusion: ROMA delivers strong unified real-time omni-multimodal understanding with both reactive and proactive behavior, advancing streaming multimodal interaction systems. Abstract: Recent Omni-multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, a real-time omni-multimodal assistant for unified reactive and proactive interaction. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight speak head that decouples response initiation from generation to ensure precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate that ROMA achieves state-of-the-art performance on proactive tasks while remaining competitive in reactive settings, validating its robustness in unified real-time omni-multimodal understanding.

[118] SRAW-Attack: Space-Reweighted Adversarial Warping Attack for SAR Target Recognition

Yiming Zhang,Weibo Qin,Yuntian Liu,Feng Wang

Main category: cs.CV

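TL;DR: This paper proposes Space-Reweighted Adversarial Warping (SRAW), a new attack that optimizes spatial deformation with reweighted budgets across foreground and background regions to produce stealthier, more transferable adversarial examples, significantly degrading the performance of SAR-ATR models.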

Details

Motivation: The intrinsic information sparsity of SAR imagery and the over-reliance of DNN models on background regions mean that existing adversarial attacks need conspicuous perturbations to succeed; an attack that is both effective and stealthy is lacking. Method: SRAW generates adversarial examples through spatial deformation and uses a reweighting mechanism to allocate the perturbation budget between foreground and background regions, improving stealthiness and transferability. Result: Experiments show that SRAW degrades recognition performance across a range of state-of-the-art SAR-ATR models and outperforms existing methods in imperceptibility and transferability. Conclusion: SRAW offers a more efficient and stealthier adversarial attack on SAR-ATR systems, exposes robustness weaknesses in current models, and can inform the development of more secure SAR target recognition. Abstract: Synthetic aperture radar (SAR) imagery exhibits intrinsic information sparsity due to its unique electromagnetic scattering mechanism. Despite the widespread adoption of deep neural network (DNN)-based SAR automatic target recognition (SAR-ATR) systems, they remain vulnerable to adversarial examples and tend to over-rely on background regions, leading to degraded adversarial robustness. Existing adversarial attacks for SAR-ATR often require visually perceptible distortions to achieve effective performance, thereby necessitating an attack method that balances effectiveness and stealthiness. In this paper, a novel attack method termed Space-Reweighted Adversarial Warping (SRAW) is proposed, which generates adversarial examples through optimized spatial deformation with reweighted budgets across foreground and background regions. Extensive experiments demonstrate that SRAW significantly degrades the performance of state-of-the-art SAR-ATR models and consistently outperforms existing methods in terms of imperceptibility and adversarial transferability. Code is made available at https://github.com/boremycin/SAR-ATR-TransAttack.

[119] Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders

Siqi Kou,Jiachun Jin,Zetong Zhou,Ye Ma,Yugang Wang,Quan Chen,Peng Jiang,Xiao Yang,Jun Zhu,Kai Yu,Zhijie Deng

Main category: cs.CV

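TL;DR: This paper proposes the think-then-generate (T2G) paradigm, which activates the reasoning ability of the large language model to rewrite text prompts and uses image-based rewards to align language reasoning with image generation, substantially improving factual consistency, semantic alignment, and visual realism in text-to-image generation.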

Details

Motivation: Most existing text-to-image diffusion models use the large language model only as a text encoder and do not exploit its reasoning ability, so generated images lack deeper semantic understanding and world-knowledge inference. This paper aims to remove that limitation. Method: A lightweight supervised fine-tuning stage first activates the think-then-rewrite pattern of the LLM; the Dual-GRPO framework then jointly optimizes the language encoder and the diffusion model: the language encoder is reinforced with image-grounded rewards to reason about and recall world knowledge, while the diffusion model is pushed to generate semantically consistent and visually coherent images. Result: The method yields substantial gains on reasoning-based image generation and editing tasks, reaching a WISE score of 0.79, close to GPT-4, and outperforms existing methods in factual consistency, semantic alignment, and visual realism. Conclusion: The T2G paradigm combines the reasoning ability of LLMs with the generative ability of diffusion models, a step toward next-generation unified visual generation models with reasoning, expression, and demonstration capacities. Abstract: Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders, remain text-pixel mappers -- they employ LLMs merely as text encoders, without leveraging their inherent reasoning capabilities to infer what should be visually depicted given the textual prompt. To move beyond such literal generation, we propose the think-then-generate (T2G) paradigm, where the LLM-based text encoder is encouraged to reason about and rewrite raw user prompts; the states of the rewritten prompts then serve as diffusion conditioning. To achieve this, we first activate the think-then-rewrite pattern of the LLM encoder with a lightweight supervised fine-tuning process. Subsequently, the LLM encoder and diffusion backbone are co-optimized to ensure faithful reasoning about the context and accurate rendering of the semantics via Dual-GRPO. In particular, the text encoder is reinforced using image-grounded rewards to infer and recall world knowledge, while the diffusion backbone is pushed to produce semantically consistent and visually coherent images. Experiments show substantial improvements in factual consistency, semantic alignment, and visual realism across reasoning-based image generation and editing benchmarks, achieving 0.79 on WISE score, nearly on par with GPT-4. Our results constitute a promising step toward next-generation unified models with reasoning, expression, and demonstration capacities.

[120] An analytic theory of convolutional neural network inverse problems solvers

Minh Hai Nguyen,Quoc Bao Do,Edouard Pauwels,Pierre Weiss

Main category: cs.CV

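TL;DR: This paper proposes Local-Equivariant MMSE (LE-MMSE), a theoretical framework based on the minimum mean square error (MMSE) estimator that explains how supervised convolutional neural networks behave on imaging inverse problems. By imposing translation-equivariance and locality constraints, it gives an interpretable model of CNN behavior, and its predictions agree closely with actual network outputs across multiple tasks and datasets.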

Details

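Motivation: Although CNNs perform well on imaging inverse problems, they are poorly understood theoretically and are often treated as black boxes. This paper aims to close the gap between theory and practice by combining the MMSE estimator with the basic inductive biases of CNNs (translation equivariance and local receptive fields) in an interpretable theoretical model. Method: The paper proposes the Local-Equivariant MMSE (LE-MMSE) estimator, which adds translation equivariance and a finite receptive field to the MMSE framework as functional constraints, derives an analytic, interpretable, and tractable formula under the empirical training distribution, and validates it experimentally on several imaging inverse problems (denoising, inpainting, deconvolution), datasets, and network architectures. Result: The LE-MMSE predictions agree closely with the outputs of trained CNNs (PSNR ≳ 25 dB) and match well across a variety of settings; the analysis also reveals the differences between physics-aware and physics-agnostic estimators, the impact of high-density regions in the training patch distribution, and the role of other factors such as dataset size and patch size. Conclusion: The paper offers a new theoretical view of why CNNs succeed on imaging inverse problems, showing that their behavior can be modeled effectively by a constrained MMSE framework with inductive biases, which helps explain why deep networks work. Abstract: Supervised convolutional neural networks (CNNs) are widely used to solve imaging inverse problems, achieving state-of-the-art performance in numerous applications. However, despite their empirical success, these methods are poorly understood from a theoretical perspective and often treated as black boxes. To bridge this gap, we analyze trained neural networks through the lens of the Minimum Mean Square Error (MMSE) estimator, incorporating functional constraints that capture two fundamental inductive biases of CNNs: translation equivariance and locality via finite receptive fields. Under the empirical training distribution, we derive an analytic, interpretable, and tractable formula for this constrained variant, termed Local-Equivariant MMSE (LE-MMSE). Through extensive numerical experiments across various inverse problems (denoising, inpainting, deconvolution), datasets (FFHQ, CIFAR-10, FashionMNIST), and architectures (U-Net, ResNet, PatchMLP), we demonstrate that our theory matches the neural networks' outputs (PSNR $\gtrsim25$ dB). Furthermore, we provide insights into the differences between physics-aware and physics-agnostic estimators, the impact of high-density regions in the training (patch) distribution, and the influence of other factors (dataset size, patch size, etc.).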
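
As background for the MMSE viewpoint above, the following is a minimal sketch of the unconstrained MMSE denoiser under an empirical training distribution, assuming additive Gaussian noise of standard deviation `sigma`. In that setting the posterior mean is a softmax-weighted average of the training samples. This is not the LE-MMSE formula of the paper: it imposes neither translation equivariance nor locality, and the function name and toy data are hypothetical.

```python
import math


def empirical_mmse_denoise(y, train, sigma):
    """Posterior mean E[x | y] when x is uniform over `train` and y = x + N(0, sigma^2 I).

    y:     noisy observation, a flat list of floats
    train: list of training samples, each a flat list with the same length as y
    """
    # Log-weights: -||y - x_i||^2 / (2 sigma^2) for every training sample x_i.
    log_w = [-sum((a - b) ** 2 for a, b in zip(y, x)) / (2.0 * sigma ** 2) for x in train]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    z = sum(w)
    # Weighted average of the training samples, coordinate by coordinate.
    return [sum(wi * x[k] for wi, x in zip(w, train)) / z for k in range(len(y))]


# Toy example with two "images" of two pixels each (illustrative data only).
train = [[0.0, 0.0], [1.0, 1.0]]
midpoint = empirical_mmse_denoise([0.5, 0.5], train, sigma=1.0)
near_first = empirical_mmse_denoise([0.0, 0.0], train, sigma=0.01)
```

With large noise the estimate averages over many training samples, while with small noise it collapses onto the nearest one; a patch-based, equivariant variant would apply a similar weighting to local patches rather than whole images.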

[121] Fine-Grained Human Pose Editing Assessment via Layer-Selective MLLMs

Ningyu Sun,Zhaolin Cai,Zitong Xu,Peihang Chen,Huiyu Duan,Yichao Yan,Xiongkuo Min,Xiaokang Yang

Main category: cs.CV

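TL;DR: This paper introduces HPE-Bench, a benchmark dedicated to text-guided human pose editing with 1,700 standardized samples, together with a unified evaluation framework based on layer-selective multimodal large language models; contrastive LoRA tuning and layer sensitivity analysis yield better pose authenticity detection and quality assessment.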

Details

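Motivation: Existing evaluation methods for pose editing separate authenticity judgment from quality scoring and lack fine-grained analysis of pose-specific inconsistencies, so a unified, fine-grained evaluation scheme is needed. Method: The authors build the HPE-Bench benchmark, which provides authenticity labels and multi-dimensional quality scores, and propose a framework based on multimodal large language models that uses contrastive LoRA tuning and layer sensitivity analysis (LSA) to identify the optimal feature layer for evaluation. Result: The framework achieves superior performance on both authenticity detection and multi-dimensional quality regression, markedly improving the accuracy with which pose editing results are assessed. Conclusion: The method bridges the gap between forensic detection and quality assessment and provides a reliable, fine-grained automated evaluation scheme for text-guided pose editing. Abstract: Text-guided human pose editing has gained significant traction in AIGC applications. However, it remains plagued by structural anomalies and generative artifacts. Existing evaluation metrics often isolate authenticity detection from quality assessment, failing to provide fine-grained insights into pose-specific inconsistencies. To address these limitations, we introduce HPE-Bench, a specialized benchmark comprising 1,700 standardized samples from 17 state-of-the-art editing models, offering both authenticity labels and multi-dimensional quality scores. Furthermore, we propose a unified framework based on layer-selective multimodal large language models (MLLMs). By employing contrastive LoRA tuning and a novel layer sensitivity analysis (LSA) mechanism, we identify the optimal feature layer for pose evaluation. Our framework achieves superior performance in both authenticity detection and multi-dimensional quality regression, effectively bridging the gap between forensic detection and quality assessment.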

Yichong Xia,Yimin Zhou,Jinpeng Wang,Bin Chen

Main category: cs.CV

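TL;DR: This paper proposes DiffCR, a new diffusion-based image compression framework that uses frequency-aware skip estimation and consistency prior refinement to achieve efficient, high-quality image reconstruction at low bit rates.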

Details

Motivation: Existing diffusion-based image compression methods suffer from slow sampling and suboptimal bit allocation, mainly because of fragmented training paradigms. Method: The proposed DiffCR framework comprises a Frequency-aware Skip Estimation (FaSE) module and a lightweight consistency estimator; FaSE uses Frequency Decoupling Attention (FDA) to refine the ε-prediction prior of a pre-trained diffusion model and align it with compressed latent features at different timesteps, enabling fast two-step decoding. Result: Without updating the backbone diffusion model, DiffCR achieves a 27.2% BD-rate reduction in LPIPS and a 65.1% BD-rate reduction in PSNR relative to state-of-the-art diffusion-based compression methods, with decoding more than 10 times faster. Conclusion: By introducing consistency prior refinement and a frequency-aware module, DiffCR substantially improves the efficiency and performance of diffusion-based image compression while preserving high-fidelity reconstruction, making it a viable option for practical use. Abstract: Recent advancements in diffusion-based generative priors have enabled visually plausible image compression at extremely low bit rates. However, existing approaches suffer from slow sampling processes and suboptimal bit allocation due to fragmented training paradigms. In this work, we propose Accelerate Diffusion-based Image Compression via Consistency Prior Refinement (DiffCR), a novel compression framework for efficient and high-fidelity image reconstruction. At the heart of DiffCR is a Frequency-aware Skip Estimation (FaSE) module that refines the $ε$-prediction prior from a pre-trained latent diffusion model and aligns it with compressed latents at different timesteps via Frequency Decoupling Attention (FDA). Furthermore, a lightweight consistency estimator enables fast two-step decoding by preserving the semantic trajectory of diffusion sampling. Without updating the backbone diffusion model, DiffCR achieves substantial bitrate savings (27.2% BD-rate (LPIPS) and 65.1% BD-rate (PSNR)) and over $10\times$ speed-up compared to SOTA diffusion-based compression baselines.

[123] Global Context Compression with Interleaved Vision-Text Transformation

Dian Jiao,Jiaxin Duan,Shuai Zhao,Jiabing Leng,Yiran Zhang,Feng Huang

Main category: cs.CV

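TL;DR: This paper proposes VIST2, a new Transformer that compresses global context through visual encoding, reducing the number of text tokens at both the prefilling and inference stages and thereby markedly speeding up generation while lowering memory and compute cost.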

Details

Motivation: Existing methods compress the input through visual encoding only at the prefilling stage and cannot reduce cost during token-by-token inference, so a method that compresses context throughout the whole process is needed. Method: Text chunks are interleaved with their corresponding sketch images, and context modeling relies exclusively on visual tokens; training proceeds in multiple stages, including curriculum-scheduled pretraining and modal-interleaved instruction tuning. Result: At a 4x compression ratio, the models achieve on average a 3x speedup in first-token generation, a 77% reduction in memory usage, and a 74% reduction in FLOPS, and significantly outperform baselines on long writing tasks. Conclusion: VIST2 addresses the efficiency bottleneck in long-sequence generation through global visual compression and suggests a new direction for efficient Transformer design. Abstract: Recent achievements of vision-language models in end-to-end OCR point to a new avenue for low-loss compression of textual information. This motivates earlier works that render the Transformer's input into images for prefilling, which effectively reduces the number of tokens through visual encoding, thereby alleviating the quadratically increased Attention computations. However, this partial compression fails to save computational or memory costs at token-by-token inference. In this paper, we investigate global context compression, which saves tokens at both prefilling and inference stages. Consequently, we propose VIST2, a novel Transformer that interleaves input text chunks alongside their visual encoding, while depending exclusively on visual tokens in the pre-context to predict the next text token distribution. Around this idea, we render text chunks into sketch images and train VIST2 in multiple stages, starting from curriculum-scheduled pretraining for optical language modeling, followed by modal-interleaved instruction tuning. We conduct extensive experiments using VIST2 families scaled from 0.6B to 8B to explore the training recipe and hyperparameters. With a 4$\times$ compression ratio, the resulting models demonstrate significant superiority over baselines on long writing tasks, achieving, on average, a 3$\times$ speedup in first-token generation, 77% reduction in memory usage, and 74% reduction in FLOPS. Our codes and datasets will be public to support further studies.

[124] Handling Missing Modalities in Multimodal Survival Prediction for Non-Small Cell Lung Cancer

Filippo Ruffini,Camillo Maria Caruso,Claudia Tacconi,Lorenzo Nibid,Francesca Miccolis,Marta Lovino,Carlo Greco,Edy Ippolito,Michele Fiore,Alessio Cortellini,Bruno Beomonte Zobel,Giuseppe Perrone,Bruno Vincenzi,Claudio Marrocco,Alessandro Bria,Elisa Ficarra,Sara Ramella,Valerio Guarrasi,Paolo Soda

Main category: cs.CV

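TL;DR: This paper proposes a multimodal survival prediction framework that handles missing modalities for overall survival modeling in non-small cell lung cancer (NSCLC). It combines CT, whole-slide histopathology images (WSI), and clinical variables, extracts features with foundation models, and improves prediction through an intermediate fusion strategy.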

Details

Motivation: The clinical use of existing multimodal deep learning methods is limited by small cohort sizes and missing modalities, which often force cases to be dropped or data to be imputed aggressively, hurting model performance and generalization. Method: Foundation models (FM) extract features from each modality (CT, WSI, clinical variables), and a missing-aware encoding strategy enables intermediate multimodal fusion, so that the model can make full use of incomplete modality data during both training and inference. Result: Intermediate fusion outperforms unimodal baselines as well as early and late fusion, with the best result obtained by fusing WSI and clinical variables (73.30 C-index); the model adaptively down-weights less informative modalities such as CT. Conclusion: The framework is robust to naturally missing modalities and integrates heterogeneous multi-source information without dropping patients, improving the accuracy and clinical applicability of NSCLC survival prediction. Abstract: Accurate survival prediction in Non-Small Cell Lung Cancer (NSCLC) requires the integration of heterogeneous clinical, radiological, and histopathological information. While Multimodal Deep Learning (MDL) offers promise for precision prognosis and survival prediction, its clinical applicability is severely limited by small cohort sizes and the presence of missing modalities, often forcing complete-case filtering or aggressive imputation. In this work, we present a missing-aware multimodal survival framework that integrates Computed Tomography (CT), Whole-Slide Histopathology (WSI) Images, and structured clinical variables for overall survival modeling in unresectable stage II-III NSCLC. By leveraging Foundation Models (FM) for modality-specific feature extraction and a missing-aware encoding strategy, the proposed approach enables intermediate multimodal fusion under naturally incomplete modality profiles. The proposed architecture is resilient to missing modalities by design, allowing the model to utilize all available data without being forced to drop patients during training or inference. Experimental results demonstrate that intermediate fusion consistently outperforms unimodal baselines as well as early and late fusion strategies, with the strongest performance achieved by the fusion of WSI and clinical modalities (73.30 C-index). Further analyses of modality importance reveal an adaptive behavior in which less informative modalities, i.e., CT modality, are automatically down-weighted and contribute less to the final survival prediction.
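
The C-index reported above measures how well predicted risks rank patients by survival time. The sketch below shows one common definition (Harrell's concordance index) in plain Python; it is an illustration of the metric only, not the evaluation code of the paper. The function name and toy data are hypothetical, and pairs with tied observed times are simply skipped as a simplifying assumption.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs whose predicted risks are ordered correctly.

    times:  observed follow-up times
    events: 1 if the event (death) was observed, 0 if the patient was censored
    risks:  predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Let `a` be the patient with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the earlier time is an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # earlier death was assigned the higher risk
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort of four patients (illustrative values only).
c_index = concordance_index(
    times=[2, 4, 6, 8], events=[1, 1, 0, 1], risks=[0.9, 0.7, 0.8, 0.1]
)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking; the 73.30 figure in the abstract is presumably the same quantity expressed on a 0 to 100 scale.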

[125] Multi-Temporal Frames Projection for Dynamic Processes Fusion in Fluorescence Microscopy

Hassan Eshkiki,Sarah Costa,Mostafa Mohammadpour,Farinaz Tanhaei,Christopher H. George,Fabio Caraffini

Main category: cs.CV

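TL;DR: This paper proposes a new computational framework that fuses multi-temporal fluorescence microscopy frames into a single high-quality image, markedly improving cell counts and image quality.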

Details

Motivation: Fluorescence microscopy recordings are often affected by noise, temporal variability, and signal fluctuation, which limits their usefulness for analyzing biological samples. Method: The framework combines several explainable computer vision techniques to fuse information from multiple time-resolved frames into one high-quality image while preserving the biological content of the original video. Result: Validated on a challenging dataset of dynamic, heterogeneous, and morphologically complex 2D monolayers of cardiac cells, the framework raises the average cell count by 44% compared with previous methods. Conclusion: The framework applies broadly to other imaging domains that need to fuse multi-temporal image stacks into high-quality 2D images, and it facilitates annotation and downstream segmentation. Abstract: Fluorescence microscopy is widely employed for the analysis of living biological samples; however, the utility of the resulting recordings is frequently constrained by noise, temporal variability, and inconsistent visualisation of signals that oscillate over time. We present a unique computational framework that integrates information from multiple time-resolved frames into a single high-quality image, while preserving the underlying biological content of the original video. We evaluate the proposed method through an extensive number of configurations (n = 111) and on a challenging dataset comprising dynamic, heterogeneous, and morphologically complex 2D monolayers of cardiac cells. Results show that our framework, which consists of a combination of explainable techniques from different computer vision application fields, is capable of generating composite images that preserve and enhance the quality and information of individual microscopy frames, yielding a 44% average increase in cell count compared to previous methods. The proposed pipeline is applicable to other imaging domains that require the fusion of multi-temporal image stacks into high-quality 2D images, thereby facilitating annotation and downstream segmentation.

[126] Lunar-G2R: Geometry-to-Reflectance Learning for High-Fidelity Lunar BRDF Estimation

Clementine Grethen,Nicolas Menga,Roland Brochard,Geraldine Morin,Simone Gasparini,Jeremy Lebreton,Manuel Sanchez Gestido

Main category: cs.CV

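TL;DR: This paper proposes Lunar-G2R, a geometry-to-reflectance learning framework that predicts spatially varying BRDF parameters of the lunar surface directly from terrain geometry (a DEM), without multi-view imagery or dedicated hardware, markedly improving rendering realism and navigation accuracy.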

Details

Motivation: Existing lunar rendering pipelines rely on simplified or spatially uniform BRDF models whose parameters are hard to estimate and which cannot capture local reflectance variations, limiting photometric realism. Method: Lunar-G2R uses a U-Net trained with differentiable rendering to predict spatially varying BRDF parameters directly from a lunar digital elevation model (DEM), minimizing the photometric discrepancy with real orbital images under known viewing and illumination conditions. Result: Experiments on a geographically held-out region of the Tycho crater show a 38% reduction in photometric error compared with a state-of-the-art baseline, along with higher PSNR and SSIM and better perceptual similarity, and the method captures fine-scale reflectance variations. Conclusion: This is the first method to infer a spatially varying reflectance model directly from terrain geometry, opening a new route to high-fidelity rendering and vision-based navigation for planetary surfaces. Abstract: We address the problem of estimating realistic, spatially varying reflectance for complex planetary surfaces such as the lunar regolith, which is critical for high-fidelity rendering and vision-based navigation. Existing lunar rendering pipelines rely on simplified or spatially uniform BRDF models whose parameters are difficult to estimate and fail to capture local reflectance variations, limiting photometric realism. We propose Lunar-G2R, a geometry-to-reflectance learning framework that predicts spatially varying BRDF parameters directly from a lunar digital elevation model (DEM), without requiring multi-view imagery, controlled illumination, or dedicated reflectance-capture hardware at inference time. The method leverages a U-Net trained with differentiable rendering to minimize photometric discrepancies between real orbital images and physically based renderings under known viewing and illumination geometry. Experiments on a geographically held-out region of the Tycho crater show that our approach reduces photometric error by 38% compared to a state-of-the-art baseline, while achieving higher PSNR and SSIM and improved perceptual similarity, capturing fine-scale reflectance variations absent from spatially uniform models. To our knowledge, this is the first method to infer a spatially varying reflectance model directly from terrain geometry.

[127] Urban Socio-Semantic Segmentation with Vision-Language Reasoning

Yu Wang,Yi Wang,Rui Dai,Yujie Wang,Kaikui Liu,Xiangxiang Chu,Yansheng Li

Main category: cs.CV

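TL;DR: This paper proposes an urban socio-semantic segmentation method based on vision-language model reasoning. It introduces the SocioSeg dataset and the SocioReasoner framework, optimizes multimodal recognition and multi-stage reasoning with reinforcement learning, accurately segments socially defined categories such as schools and parks, and shows strong zero-shot generalization.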

Details

Motivation: Existing segmentation models perform well on entities defined by physical attributes (such as buildings and water bodies) but struggle with semantic categories defined by social function (such as schools and parks), so a new segmentation method that understands social semantics is needed. Method: The authors build SocioSeg, a socio-semantic segmentation dataset comprising satellite imagery, digital maps, and hierarchical pixel-level labels, and propose the SocioReasoner framework, which uses a vision-language model for cross-modal recognition and multi-stage reasoning and applies reinforcement learning to optimize the non-differentiable reasoning process. Result: Experiments show that the method outperforms current state-of-the-art models on socio-semantic segmentation and has strong zero-shot generalization. Conclusion: By introducing a socio-semantically annotated dataset and a vision-language reasoning framework, the work improves segmentation of socially defined categories in urban environments and offers a new approach to remote sensing image understanding. Abstract: As hubs of human activity, urban surfaces consist of a wealth of semantic entities. Segmenting these various entities from satellite imagery is crucial for a range of downstream applications. Current advanced segmentation models can reliably segment entities defined by physical attributes (e.g., buildings, water bodies) but still struggle with socially defined categories (e.g., schools, parks). In this work, we achieve socio-semantic segmentation by vision-language model reasoning. To facilitate this, we introduce the Urban Socio-Semantic Segmentation dataset named SocioSeg, a new resource comprising satellite imagery, digital maps, and pixel-level labels of social semantic entities organized in a hierarchical structure. Additionally, we propose a novel vision-language reasoning framework called SocioReasoner that simulates the human process of identifying and annotating social semantic entities via cross-modal recognition and multi-stage reasoning. We employ reinforcement learning to optimize this non-differentiable process and elicit the reasoning capabilities of the vision-language model. Experiments demonstrate our approach's gains over state-of-the-art models and strong zero-shot generalization. Our dataset and code are available at https://github.com/AMAP-ML/SocioReasoner.

[128] mergetune: Continued fine-tuning of vision-language models

Wenqing Wang,Da Li,Xiatian Zhu,Josef Kittler

Main category: cs.CV

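TL;DR: This paper proposes MERGETUNE, a new continued fine-tuning paradigm for vision-language models that uses linear mode connectivity to recover pretrained knowledge lost during fine-tuning without replaying pretraining data, markedly improving base-novel generalization and robust fine-tuning performance.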

Details

Motivation: Existing fine-tuning methods often cause catastrophic forgetting in vision-language models such as CLIP and cannot fully retain pretrained knowledge; this paper addresses the recovery of that knowledge after fine-tuning. Method: The paper proposes the continued fine-tuning (CFT) paradigm and the MERGETUNE method, which uses linear mode connectivity (LMC) to search the loss landscape for low-loss paths connecting the continued model to the zero-shot and fine-tuned models, and approximates the constraint with a second-order surrogate to avoid large-scale data replay. Result: MERGETUNE improves the harmonic mean of CoOp on base-novel generalization by 5.6% without extra parameters; on robust fine-tuning it surpasses ensemble baselines at lower inference cost and reaches state-of-the-art performance when ensembled with the zero-shot model. Conclusion: MERGETUNE is a general post hoc strategy that needs neither architectural changes nor data replay, effectively restores pretrained knowledge lost in fine-tuned models, and offers a new paradigm for model adaptation. Abstract: Fine-tuning vision-language models (VLMs) such as CLIP often leads to catastrophic forgetting of pretrained knowledge. Prior work primarily aims to mitigate forgetting during adaptation; however, forgetting often remains inevitable during this process. We introduce a novel paradigm, continued fine-tuning (CFT), which seeks to recover pretrained knowledge after a zero-shot model has already been adapted. We propose a simple, model-agnostic CFT strategy (named MERGETUNE) guided by linear mode connectivity (LMC), which can be applied post hoc to existing fine-tuned models without requiring architectural changes. Given a fine-tuned model, we continue fine-tuning its trainable parameters (e.g., soft prompts or linear heads) to search for a continued model which has two low-loss paths to the zero-shot (e.g., CLIP) and the fine-tuned (e.g., CoOp) solutions. By exploiting the geometry of the loss landscape, the continued model implicitly merges the two solutions, restoring pretrained knowledge lost in the fine-tuned counterpart. A challenge is that the vanilla LMC constraint requires data replay from the pretraining task. We approximate this constraint for the zero-shot model via a second-order surrogate, eliminating the need for large-scale data replay. Experiments show that MERGETUNE improves the harmonic mean of CoOp by +5.6% on base-novel generalisation without adding parameters. On robust fine-tuning evaluations, the LMC-merged model from MERGETUNE surpasses ensemble baselines with lower inference cost, achieving further gains and state-of-the-art results when ensembled with the zero-shot model. Our code is available at https://github.com/Surrey-UP-Lab/MERGETUNE.
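
Two quantities in this abstract can be made concrete with a short sketch: the harmonic mean of base-class and novel-class accuracy, and the loss along a straight line between two parameter vectors, which is what linear mode connectivity inspects. The code below is illustrative only, with hypothetical function names and a toy quadratic loss; it is not the MERGETUNE objective and does not include its second-order surrogate.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean H = 2 * b * n / (b + n) of base-class and novel-class accuracy."""
    total = base_acc + novel_acc
    return 0.0 if total == 0 else 2.0 * base_acc * novel_acc / total


def interpolate(theta_a, theta_b, alpha):
    """Point (1 - alpha) * theta_a + alpha * theta_b on the line between two parameter vectors."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(loss, theta_a, theta_b, steps=11):
    """Largest excess of the loss on the linear path over the interpolated endpoint losses.

    A value near zero means the two solutions are joined by a low-loss straight path.
    """
    loss_a, loss_b = loss(theta_a), loss(theta_b)
    barrier = 0.0
    for k in range(steps):
        alpha = k / (steps - 1)
        on_path = loss(interpolate(theta_a, theta_b, alpha))
        baseline = (1.0 - alpha) * loss_a + alpha * loss_b
        barrier = max(barrier, on_path - baseline)
    return barrier


# Toy numbers: 80% base accuracy and 60% novel accuracy (illustrative only).
h = harmonic_mean(80.0, 60.0)

# Toy convex loss: any two points are linearly connected without a barrier.
quadratic = lambda theta: sum(t * t for t in theta)
barrier = loss_barrier(quadratic, [1.0, 0.0], [0.0, 1.0])
```

The harmonic mean penalizes a model that is strong on base classes but weak on novel ones, which is why it is a natural summary for base-novel generalisation.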

[129] SatMap: Revisiting Satellite Maps as Prior for Online HD Map Construction

Kanak Mazumder,Fabian B. Flohr

Main category: cs.CV

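TL;DR: This paper proposes SatMap, an online vectorized HD map estimation method that combines satellite maps with multi-view camera observations to mitigate depth ambiguity and occlusion, and that clearly outperforms camera-only and camera-LiDAR fusion methods on the nuScenes dataset.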

Details

Motivation: Existing HD map construction methods based on onboard cameras are limited by weak depth perception and by accuracy loss under occlusion, so a more robust approach is needed. Method: SatMap fuses satellite maps, which provide lane-level semantics and texture from a bird's-eye view, as a global prior with multi-view camera input and directly predicts a vectorized HD map. Result: On the nuScenes dataset, mAP improves by 34.8% over the camera-only baseline and by 8.5% over the camera-LiDAR fusion baseline; evaluations in long-range and adverse-weather conditions further confirm the value of the satellite prior. Conclusion: Using satellite maps as a prior markedly improves the accuracy and robustness of online HD map construction, especially in complex scenes. Abstract: Online high-definition (HD) map construction is an essential part of a safe and robust end-to-end autonomous driving (AD) pipeline. Onboard camera-based approaches suffer from limited depth perception and degraded accuracy due to occlusion. In this work, we propose SatMap, an online vectorized HD map estimation method that integrates satellite maps with multi-view camera observations and directly predicts a vectorized HD map for downstream prediction and planning modules. Our method leverages lane-level semantics and texture from satellite imagery captured from a Bird's Eye View (BEV) perspective as a global prior, effectively mitigating depth ambiguity and occlusion. In our experiments on the nuScenes dataset, SatMap achieves 34.8% mAP performance improvement over the camera-only baseline and 8.5% mAP improvement over the camera-LiDAR fusion baseline. Moreover, we evaluate our model in long-range and adverse weather conditions to demonstrate the advantages of using a satellite prior map. Source code will be available at https://iv.ee.hm.edu/satmap/.

[130] BikeActions: An Open Platform and Benchmark for Cyclist-Centric VRU Action Recognition

Max A. Buettner,Kanak Mazumder,Luca Koecher,Mario Finkbeiner,Sebastian Niebler,Fabian B. Flohr

Main category: cs.CV

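TL;DR: This paper presents FUSE-Bike, the first open perception platform built around the cyclist's viewpoint, and BikeActions, a multi-modal dataset for improving behavior modeling of vulnerable road users (VRUs), and establishes the first performance benchmark for the task.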

Details

Motivation: Current autonomous driving research focuses mainly on pedestrian crossing behavior, while interactions among vulnerable road users such as cyclists in dense shared spaces are understudied, and high-quality close-range data captured from a cyclist's viewpoint is lacking. Method: The authors develop FUSE-Bike, an open perception platform equipped with LiDAR, a camera, and GNSS, to capture data from the cyclist's first-person view; build BikeActions, a multi-modal dataset with 852 samples across 5 action classes; and evaluate graph convolution and Transformer models to establish a benchmark. Result: The full dataset, data curation tools, open hardware design, and benchmark code are released; state-of-the-art models are evaluated on the public data splits, giving the first performance baselines for the task. Conclusion: FUSE-Bike and BikeActions fill the gap in cyclist-viewpoint data for VRU behavior understanding and support safe decision-making by future autonomous driving systems in complex shared spaces. Abstract: Anticipating the intentions of Vulnerable Road Users (VRUs) is a critical challenge for safe autonomous driving (AD) and mobile robotics. While current research predominantly focuses on pedestrian crossing behaviors from a vehicle's perspective, interactions within dense shared spaces remain underexplored. To bridge this gap, we introduce FUSE-Bike, the first fully open perception platform of its kind. Equipped with two LiDARs, a camera, and GNSS, it facilitates high-fidelity, close-range data capture directly from a cyclist's viewpoint. Leveraging this platform, we present BikeActions, a novel multi-modal dataset comprising 852 annotated samples across 5 distinct action classes, specifically tailored to improve VRU behavior modeling. We establish a rigorous benchmark by evaluating state-of-the-art graph convolution and transformer-based models on our publicly released data splits, establishing the first performance baselines for this challenging task. We release the full dataset together with data curation tools, the open hardware design, and the benchmark code to foster future research in VRU action understanding under https://iv.ee.hm.edu/bikeactions/.

[131] SVII-3D: Advancing Roadside Infrastructure Inventory with Decimeter-level 3D Localization and Comprehension from Sparse Street Imagery

Chong Liu,Luxuan Fu,Yang Jia,Zhen Dong,Bisheng Yang

Main category: cs.CV

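TL;DR: SVII-3D is a unified framework for high-fidelity infrastructure digitization that fuses LoRA fine-tuned open-set detection, geometry-guided refinement, and a vision-language model agent to achieve robust asset identification, precise 3D localization, and fine-grained state diagnosis from sparse imagery.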

Details

Motivation: Existing methods that use low-cost sparse imagery for digital twins and asset inventories suffer from limited robustness, inaccurate localization, and a lack of fine-grained state understanding. Method: The SVII-3D framework (1) combines LoRA fine-tuned open-set detection with a spatial-attention matching network to robustly associate observations across sparse views; (2) introduces a geometry-guided refinement mechanism to achieve precise decimeter-level localization; and (3) integrates a vision-language model agent with multi-modal prompting to diagnose equipment operating states automatically. Result: Experiments show that SVII-3D significantly improves asset identification accuracy and reduces localization error, enabling high-fidelity 3D asset reconstruction and state understanding. Conclusion: SVII-3D offers a scalable, low-cost, and efficient automated digital twin solution for smart city construction and facility lifecycle management. Abstract: The automated creation of digital twins and precise asset inventories is a critical task in smart city construction and facility lifecycle management. However, utilizing cost-effective sparse imagery remains challenging due to limited robustness, inaccurate localization, and a lack of fine-grained state understanding. To address these limitations, SVII-3D, a unified framework for holistic asset digitization, is proposed. First, LoRA fine-tuned open-set detection is fused with a spatial-attention matching network to robustly associate observations across sparse views. Second, a geometry-guided refinement mechanism is introduced to resolve structural errors, achieving precise decimeter-level 3D localization. Third, transcending static geometric mapping, a Vision-Language Model agent leveraging multi-modal prompting is incorporated to automatically diagnose fine-grained operational states. Experiments demonstrate that SVII-3D significantly improves identification accuracy and minimizes localization errors. Consequently, this framework offers a scalable, cost-effective solution for high-fidelity infrastructure digitization, effectively bridging the gap between sparse perception and automated intelligent maintenance.

[132] Enhancing the quality of gauge images captured in smoke and haze scenes through deep learning

Oscar H. Ramírez-Agudelo,Akshay N. Shewatkar,Edoardo Milana,Roland C. Aydin,Kai Franke

Main category: cs.CV

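TL;DR: This study uses two deep learning models, FFA-Net and AECR-Net, to improve the readability of analog gauge images captured in smoke and haze. The models are trained on a synthetic dataset of more than 14,000 images, and the results show that deep learning can substantially improve image quality in adverse conditions, supporting automatic gauge reading during emergency response.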

Details

Motivation: Smoke and haze reduce image visibility, which hampers infrastructure monitoring and emergency rescue. Machines need to read gauge images automatically and reliably to support decisions in critical situations. Method: Two deep learning architectures, FFA-Net and AECR-Net, are applied to gauge images degraded by haze and smoke; a synthetic dataset of more than 14,000 images is generated with the Unreal Engine and split into 80% training, 10% validation, and 10% test; performance is evaluated with the SSIM and PSNR metrics. Result: On the synthetic haze dataset, SSIM reaches about 0.98 and PSNR about 43 dB, comparable to state-of-the-art results; AECR-Net is more robust than FFA-Net; results on smoke images are weaker but still useful, because smoke is inhomogeneous and dense and therefore harder to remove. Conclusion: Deep learning models, AECR-Net in particular, can effectively improve the quality of analog gauge images captured in smoke and haze, and the enhanced images can be used for subsequent automatic reading, with potential applications in emergency scenarios. Abstract: Images captured in hazy and smoky environments suffer from reduced visibility, posing a challenge when monitoring infrastructures and hindering emergency services during critical situations. The proposed work investigates the use of deep learning models to enhance the automatic, machine-based readability of gauges in smoky environments, with accurate gauge data interpretation serving as a valuable tool for first responders. The study utilizes two deep learning architectures, FFA-Net and AECR-Net, to improve the visibility of gauge images corrupted with light to dense haze and smoke. Since benchmark datasets of analog gauge images are unavailable, a new synthetic dataset, containing over 14,000 images, was generated using the Unreal Engine. The models were trained with an 80% train, 10% validation, and 10% test split for the haze and smoke datasets, respectively. For the synthetic haze dataset, the SSIM and PSNR metrics are about 0.98 and 43 dB, respectively, comparing well to state-of-the-art results. Additionally, more robust results are obtained from AECR-Net than from FFA-Net. Although the results on the synthetic smoke dataset are poorer, the trained models still achieve interesting results. In general, images captured in the presence of smoke are more difficult to enhance given the inhomogeneity and high density of smoke. Moreover, FFA-Net and AECR-Net were designed to dehaze images, not to desmoke them. This work shows that deep learning architectures can greatly improve the quality of analog gauge images captured in smoke and haze scenes. Finally, the enhanced output images can be successfully post-processed for automatic reading of gauges.
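
For readers unfamiliar with the evaluation setup above, the sketch below shows the standard PSNR formula and a shuffled 80/10/10 index split in plain Python. It is not the authors' code: the function names, the flat-list image representation, the seed, and the 8-bit peak value of 255 are all assumptions made for illustration.

```python
import math
import random


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images (flat pixel lists)."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and have the same size")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def split_indices(n, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle range(n) reproducibly and cut it into train, validation, and test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    return indices[:n_train], indices[n_train:n_train + n_val], indices[n_train + n_val:]


# Toy example: every pixel is off by one gray level, so the MSE is 1.
value = psnr([0, 0, 0, 0], [1, 1, 1, 1])
train_idx, val_idx, test_idx = split_indices(1000)
```

Under the 8-bit assumption, a PSNR of 43 dB corresponds to a root-mean-square error of roughly 1.8 gray levels, which gives a sense of how close the dehazed outputs are to the clean references.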

[133] Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure

Luxuan Fu,Chong Liu,Bisheng Yang,Zhen Dong

Main category: cs.CV

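TL;DR: This paper proposes a domain-adapted framework that turns large vision-language models (VLMs) into specialized agents for intelligent analysis of urban roadside infrastructure, combining data-efficient fine-tuning with knowledge-grounded reasoning and achieving strong results in detection and attribute recognition.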

Details

Motivation: General-purpose models struggle to capture the fine-grained attributes and engineering rules of urban roadside infrastructure, and existing VLMs tend to hallucinate and to violate industry standards when recognizing complex facility states, which makes them unreliable in practice. Method: Grounding DINO is used for open-vocabulary object localization, Qwen-VL is adapted with LoRA for semantic attribute reasoning, and a dual-modality retrieval-augmented generation (RAG) mechanism dynamically retrieves authoritative standards and visual exemplars to improve compliance and accuracy. Result: On a newly built dataset of urban roadside scenes, the framework reaches a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%. Conclusion: The framework markedly improves the reliability and accuracy of VLMs for professional infrastructure analysis and provides a robust solution for smart-city infrastructure monitoring. Abstract: Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often struggle to capture the necessary fine-grained attributes and domain rules. While Large Vision Language Models (VLMs) excel at open-world recognition, they often struggle to accurately interpret complex facility states in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.

[134] Inference-time Physics Alignment of Video Generative Models with Latent World Models

Jianhao Yuan,Xiaofeng Zhang,Felix Friedrich,Nicolas Beltran-Velez,Melissa Hall,Reyhane Askari-Hemmat,Xiaochuang Han,Nicolas Ballas,Michal Drozdzal,Adriana Romero-Soriano

Main category: cs.CV

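TL;DR: Proposes WMReward, which uses a latent world model as a reward to guide video generation at inference time and thereby improve the physical plausibility of the generated videos.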

Details

Motivation: Existing video generative models produce good-looking content but often violate basic physical laws, which limits their usefulness. The authors argue that this is not only a matter of what the model learned during pre-training but also of the inference strategy. Method: Introduces WMReward, which uses the physics prior of a latent world model (here, VJEPA-2) as a reward signal to search over and steer multiple candidate denoising trajectories at inference time. Result: Substantially improves physics plausibility across several generation settings and takes first place in the ICCV 2025 Perception Test PhysicsIQ Challenge with a score of 62.64%, outperforming the previous state of the art by 7.42%. Conclusion: Inference-time alignment with a latent world model is an effective way to improve the physical plausibility of video generation and is not tied to this specific instantiation. Abstract: State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from human preference study. Notably, in the ICCV 2025 Perception Test PhysicsIQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve physics plausibility of video generation, beyond this specific instantiation or parameterization.
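
The idea of using a reward to choose among several candidate generations can be illustrated with a best-of-N selection loop. This is only a simplified sketch: the abstract describes searching and steering denoising trajectories, which is richer than picking one finished sample, and the names `best_of_n`, `generate_candidate`, and `physics_reward` are hypothetical stand-ins rather than anything from the paper.

```python
from typing import Callable, Tuple, TypeVar

Sample = TypeVar("Sample")


def best_of_n(
    generate_candidate: Callable[[int], Sample],
    physics_reward: Callable[[Sample], float],
    n: int,
) -> Tuple[Sample, float]:
    """Draw n candidates and return the one with the highest reward.

    generate_candidate(seed) stands in for one full generation run and
    physics_reward(sample) for a world-model-based plausibility score;
    both are assumptions made for this illustration.
    """
    candidates = [generate_candidate(seed) for seed in range(n)]
    scores = [physics_reward(sample) for sample in candidates]
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index], scores[best_index]


# Toy usage: "samples" are plain numbers and the reward prefers values near 1.0.
def toy_generator(seed: int) -> float:
    return seed * 0.5


def toy_reward(sample: float) -> float:
    return -abs(sample - 1.0)


best_sample, best_score = best_of_n(toy_generator, toy_reward, n=5)
```

Spending more compute here simply means raising `n`, which is one simple way to read the abstract's remark about scaling test-time compute.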

[135] DeepUrban: Interaction-Aware Trajectory Prediction and Planning for Automated Driving by Aerial Imagery

Constantin Selzer,Fabian B. Flohr

Main category: cs.CV

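TL;DR: Introduces DeepUrban, a new drone dataset focused on dense urban traffic scenes, intended to strengthen trajectory prediction and planning benchmarks for automated driving.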

Details

Motivation: Existing autonomous driving benchmarks contain too few dense-traffic scenarios, which limits the understanding and modeling of complex interactions among road users. Method: In collaboration with the industrial partner DeepScenario, the authors capture high-resolution drone images over urban intersections at about 100 meters altitude, extract 3D traffic objects, and enrich them with map and scene information to build DeepUrban; they then evaluate SOTA prediction and planning methods, including in combination with nuScenes. Result: Adding DeepUrban to nuScenes improves vehicle prediction and planning accuracy by up to 44.1% / 44.3% on the ADE / FDE metrics; the authors also report experiments on generalization. Conclusion: DeepUrban helps fill the gap in dense urban traffic data and improves prediction and planning models, making it valuable for both research and applications. Abstract: The efficacy of autonomous driving systems hinges critically on robust prediction and planning capabilities. However, current benchmarks are impeded by a notable scarcity of scenarios featuring dense traffic, which is essential for understanding and modeling complex interactions among road users. To address this gap, we collaborated with our industrial partner, DeepScenario, to develop DeepUrban, a new drone dataset designed to enhance trajectory prediction and planning benchmarks focusing on dense urban settings. DeepUrban provides a rich collection of 3D traffic objects, extracted from high-resolution images captured over urban intersections at approximately 100 meters altitude. The dataset is further enriched with comprehensive map and scene information to support advanced modeling and simulation tasks. We evaluate state-of-the-art (SOTA) prediction and planning methods, and conduct experiments on generalization capabilities. Our findings demonstrate that adding DeepUrban to nuScenes can boost the accuracy of vehicle predictions and planning, achieving improvements up to 44.1% / 44.3% on the ADE / FDE metrics. Website: https://iv.ee.hm.edu/deepurban
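
The abstract reports gains on ADE and FDE without defining them. Under the usual definitions (assumed here, not quoted from the paper), ADE is the mean Euclidean distance between predicted and ground-truth positions over all timesteps, and FDE is that distance at the final timestep. A minimal sketch for a single 2D trajectory, with the hypothetical helper name `ade_fde`:

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def ade_fde(predicted: Sequence[Point], ground_truth: Sequence[Point]) -> Tuple[float, float]:
    """Return (ADE, FDE) for one trajectory of equal-length point sequences."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Per-timestep Euclidean displacement between prediction and ground truth.
    displacements = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    ade = sum(displacements) / len(displacements)
    fde = displacements[-1]
    return ade, fde


# Toy usage: the prediction drifts away from the ground truth over time.
predicted_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
true_path = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(predicted_path, true_path)  # displacements are 0, 1, 2
```

Benchmarks typically average these values over many agents and scenes, and often over the best of several predicted modes; that aggregation is omitted here.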

[136] Jordan-Segmentable Masks: A Topology-Aware definition for characterizing Binary Image Segmentation

Serena Grazia De Benedictis,Amedeo Altavilla,Nicoletta Del Buono

Main category: cs.CV

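TL;DR: Proposes a topology-aware way to evaluate image segmentation, based on the Jordan Curve Theorem and digital topology, that uses Betti numbers to verify the structural coherence of a segmentation mask and to ensure the image is split into two connected regions.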

Details

Motivation: Existing segmentation metrics struggle to capture the structural and topological consistency of a result; especially in applications such as medical imaging, small boundary errors or fragmented predictions can still receive high scores for implausible segmentations. Method: Building on the Jordan Curve Theorem and digital topology, the paper defines the notion of a "Jordan-segmentable mask", extracts a 4-curve candidate from the mask, and verifies its topological validity with Betti numbers from homology theory (β₀ = β₁ = 1). Result: Provides a mathematically rigorous, unsupervised criterion for assessing the structural coherence of a segmentation, able to flag topological errors that conventional metrics may miss. Conclusion: By combining digital Jordan theory with homological invariants, the method offers a more topologically sound way to evaluate segmentation, particularly where topological correctness must be preserved. Abstract: Image segmentation plays a central role in computer vision. However, widely used evaluation metrics, whether pixel-wise, region-based, or boundary-focused, often struggle to capture the structural and topological coherence of a segmentation. In many practical scenarios, such as medical imaging or object delineation, small inaccuracies in boundary, holes, or fragmented predictions can result in high metric scores, despite the fact that the resulting masks fail to preserve the object's global shape or connectivity. This highlights a limitation of conventional metrics: they are unable to assess whether a predicted segmentation partitions the image into meaningful interior and exterior regions. In this work, we introduce a topology-aware notion of segmentation based on the Jordan Curve Theorem, and adapted for use in digital planes. We define the concept of a \emph{Jordan-segmentable mask}, which is a binary segmentation whose structure ensures a topological separation of the image domain into two connected components. We analyze segmentation masks through the lens of digital topology and homology theory, extracting a $4$-curve candidate from the mask, verifying its topological validity using Betti numbers. A mask is considered Jordan-segmentable when this candidate forms a digital 4-curve with $β_0 = β_1 = 1$, or equivalently when its complement splits into exactly two $8$-connected components. This framework provides a mathematically rigorous, unsupervised criterion with which to assess the structural coherence of segmentation masks. 
By combining digital Jordan theory and homological invariants, our approach provides a valuable alternative to standard evaluation metrics, especially in applications where topological correctness must be preserved.
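
The complement criterion stated in the abstract (a 4-connected candidate whose complement splits into exactly two 8-connected components) can be checked with plain flood fill. The sketch below takes the curve candidate as given, since the abstract does not say how it is extracted, and it does not verify that the candidate is a simple curve; `count_components` and `passes_complement_check` are hypothetical names, not the authors' code.

```python
from collections import deque

N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_components(cells, neighbors):
    """Count connected components of a set of (row, col) cells via BFS."""
    remaining = set(cells)
    count = 0
    while remaining:
        queue = deque([remaining.pop()])
        count += 1
        while queue:
            row, col = queue.popleft()
            for d_row, d_col in neighbors:
                nxt = (row + d_row, col + d_col)
                if nxt in remaining:
                    remaining.remove(nxt)
                    queue.append(nxt)
    return count


def passes_complement_check(rows):
    """rows: list of equal-length strings, '1' marks a curve-candidate pixel.

    True when the candidate is one 4-connected piece and the remaining
    pixels form exactly two 8-connected components.
    """
    height, width = len(rows), len(rows[0])
    all_cells = {(r, c) for r in range(height) for c in range(width)}
    curve = {(r, c) for (r, c) in all_cells if rows[r][c] == "1"}
    complement = all_cells - curve
    return count_components(curve, N4) == 1 and count_components(complement, N8) == 2


closed_ring = ["00000", "01110", "01010", "01110", "00000"]
broken_ring = ["00000", "01010", "01010", "01110", "00000"]  # one pixel removed
```

In `closed_ring` the center pixel is cut off from the outer border, so the complement has two components; in `broken_ring` the gap reconnects interior and exterior, and the check fails.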

[137] Adversarial Evasion Attacks on Computer Vision using SHAP Values

Frank Mollard,Marcus Becker,Florian Roehrbein

Main category: cs.CV

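TL;DR: Proposes a white-box attack based on SHAP values for adversarial evasion attacks on computer vision models and finds evidence that it is more robust than FGSM in gradient-hiding scenarios.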

Details

Motivation: To expose how vulnerable deep learning models are to adversarial examples, especially attacks that are imperceptible to humans, new attack methods are needed for assessing model security. Method: Uses SHAP values to quantify the importance of individual input features to the model output at inference time, generates adversarial examples from them, and compares the attack with the Fast Gradient Sign Method (FGSM). Result: The SHAP attack is more robust at inducing misclassifications and remains effective in gradient-hiding scenarios. Conclusion: SHAP values can be used to build more robust adversarial attacks, which points to a security risk when a model's explanatory features are exploited. Abstract: The paper introduces a white-box attack on computer vision models using SHAP values. It demonstrates how adversarial evasion attacks can compromise the performance of deep learning models by reducing output confidence or inducing misclassifications. Such attacks are particularly insidious as they can deceive the perception of an algorithm while eluding human perception due to their imperceptibility to the human eye. The proposed attack leverages SHAP values to quantify the significance of individual inputs to the output at the inference stage. A comparison is drawn between the SHAP attack and the well-known Fast Gradient Sign Method. We find evidence that SHAP attacks are more robust in generating misclassifications particularly in gradient hiding scenarios.

[138] Action100M: A Large-scale Video Action Dataset

Delong Chen,Tejaswi Kasarla,Yejin Bang,Mustafa Shukor,Willy Chung,Jade Yu,Allen Bolourchi,Theo Moutakanni,Pascale Fung

Main category: cs.CV

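TL;DR: Presents Action100M, a large-scale, open-vocabulary video action dataset built from 1.2 million Internet instructional videos. It contains on the order of 100 million temporally localized action segments with rich multi-level annotations, is generated by a fully automated pipeline, and is used to improve the zero-shot action recognition performance of vision-language models.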

Details

Motivation: Advancing machine intelligence in the physical world requires the ability to infer physical actions from visual observations, yet existing datasets are limited in scale and open-vocabulary coverage, so a larger and broader video action dataset is needed. Method: Proposes a fully automated data pipeline that first performs hierarchical temporal segmentation using V-JEPA 2 embeddings, then produces multi-level frame and segment captions organized as a Tree-of-Captions, and finally aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (action, actor, detailed captions, and so on). Result: Builds Action100M from 1.2 million instructional videos (14.6 years of footage), yielding about 100 million temporally localized action segments; training VL-JEPA on the dataset shows consistent data-scaling improvements and strong zero-shot performance on several action recognition benchmarks. Conclusion: Action100M provides a new foundation dataset for video understanding and world modeling research, shows that high-quality video action data can be built automatically at scale, and advances open-vocabulary action recognition. Abstract: Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world. Achieving this requires large-scale, open-vocabulary video action datasets that span broad domains. We introduce Action100M, a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration), yielding O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions. Action100M is generated by a fully automated pipeline that (i) performs hierarchical temporal segmentation using V-JEPA 2 embeddings, (ii) produces multi-level frame and segment captions organized as a Tree-of-Captions, and (iii) aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (brief/detailed action, actor, brief/detailed caption). Training VL-JEPA on Action100M demonstrates consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks, establishing Action100M as a new foundation for scalable research in video understanding and world modeling.

[139] RSATalker: Realistic Socially-Aware Talking Head Generation for Multi-Turn Conversation

Peng Chen,Xiaobao Wei,Yi Yang,Naiming Yao,Hui Chen,Feng Tian

Main category: cs.CV

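TL;DR: Presents RSATalker, the first framework based on 3D Gaussian Splatting (3DGS) for talking head generation that is both realistic and socially aware, with support for multi-turn conversation and modeling of interpersonal relationships.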

Details

Motivation: Existing talking head generation methods fall short in realism, efficiency, or social relationship modeling: mesh-based 3D methods lack realistic textures, large-model-based 2D methods are computationally expensive, and 3DGS-based methods handle only a single speaker and ignore social relationships. Method: First drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to mesh facets to render high-fidelity 2D videos; a socially-aware module encodes interpersonal relationships (blood vs. non-blood, equal vs. unequal) into high-level embeddings through a learnable query mechanism. The authors use a three-stage training paradigm and build the RSATalker dataset of speech-mesh-image triplets annotated with social relationships. Result: Experiments show that RSATalker reaches state-of-the-art performance in both realism and social awareness while offering efficient rendering and multi-turn two-person dialogue. Conclusion: RSATalker is the first to apply 3DGS to socially aware two-person talking head generation, balancing realism, efficiency, and social interaction modeling and advancing social conversation systems in virtual reality. Abstract: Talking head generation is increasingly important in virtual reality (VR), especially for social scenarios involving multi-turn conversation. Existing approaches face notable limitations: mesh-based 3D methods can model dual-person dialogue but lack realistic textures, while large-model-based 2D methods produce natural appearances but incur prohibitive computational costs. Recently, 3D Gaussian Splatting (3DGS) based methods achieve efficient and realistic rendering but remain speaker-only and ignore social relationships. We introduce RSATalker, the first framework that leverages 3DGS for realistic and socially-aware talking head generation with support for multi-turn conversation. Our method first drives mesh-based 3D facial motion from speech, then binds 3D Gaussians to mesh facets to render high-fidelity 2D avatar videos. To capture interpersonal dynamics, we propose a socially-aware module that encodes social relationships, including blood and non-blood as well as equal and unequal, into high-level embeddings through a learnable query mechanism. We design a three-stage training paradigm and construct the RSATalker dataset with speech-mesh-image triplets annotated with social relationships. Extensive experiments demonstrate that RSATalker achieves state-of-the-art performance in both realism and social awareness. The code and dataset will be released.

[140] Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Christopher Clark,Jieyu Zhang,Zixian Ma,Jae Sung Park,Mohammadreza Salehi,Rohun Tripathi,Sangho Lee,Zhongzheng Ren,Chris Dongjoo Kim,Yinuo Yang,Vincent Shao,Yue Yang,Weikai Huang,Ziqi Gao,Taira Anderson,Jianrui Zhang,Jitesh Jain,George Stoica,Winson Han,Ali Farhadi,Ranjay Krishna

Main category: cs.CV

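TL;DR: Molmo2 is a family of open video-language models that is state of the art among models with open weights and data and shows strong pixel-level grounding. Built on newly collected high-quality multimodal datasets and an efficient training recipe, it surpasses existing open models and some proprietary models on video understanding and grounding tasks.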

Details

Motivation: The strongest video-language models today are proprietary; open models either rely on data generated by proprietary models or do not disclose their training details, and most models (including proprietary ones) cannot perform pixel-level grounding such as pointing or tracking, which holds back downstream applications. Method: Presents the Molmo2 model family and builds 7 new video datasets and 2 multi-image datasets (covering detailed captions, free-form Q&A, tracking with complex queries, and video pointing), all collected without closed VLMs; training uses an efficient packing and message-tree encoding scheme, together with bi-directional attention on vision tokens and a novel token-weight strategy that improve performance. Result: The best 8B model outperforms comparable open models on short-video understanding, counting, and captioning and is competitive on long videos; on video grounding it significantly outperforms open-weight models such as Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models such as Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing, 56.2 vs 41.1 J&F on video tracking). Conclusion: Molmo2 gives the open-source community a high-performing, fully transparent foundation for video-language modeling, advances pixel-level grounding, and shows that high-quality multimodal systems can be built without relying on closed models. Abstract: Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding -- either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show bi-directional attention on vision tokens and a novel token-weight strategy improves performance. 
Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long videos. On video grounding, Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).

[141] CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos

Chengfeng Zhao,Jiazhi Shu,Yubo Zhao,Tianyu Huang,Jiahao Lu,Zekai Gu,Chengwei Ren,Zhiyang Dou,Qing Shuai,Yuan Liu

Main category: cs.CV

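TL;DR: Presents CoMoVi, a framework that couples two video diffusion models to generate 3D human motions and 2D human videos synchronously, using a 2D human motion representation and 3D-2D cross attention so that the two generation processes reinforce each other.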

Details

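Motivation: Generating 3D human motions and 2D human videos is intrinsically linked: 3D motion provides a structural prior that keeps videos plausible and consistent, while pre-trained video models offer strong generalization for motion generation, so the two processes should be coupled. Method: Proposes the CoMoVi framework, a dual-branch diffusion model that generates 3D motion and 2D video synchronously within a single denoising loop; it introduces an effective 2D human motion representation that inherits the prior of pre-trained video diffusion models and couples the branches through mutual feature interaction and 3D-2D cross attention. The authors also curate the CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations. Result: Experiments show that the method is effective on both 3D human motion generation and 2D human video generation, supporting the value of coupled generation. Conclusion: By coupling the generation of 3D motion and 2D video, CoMoVi produces more plausible, consistent, and higher-quality human motion content and suggests a new direction for multimodal human motion synthesis. Abstract: In this paper, we find that the generation of 3D human motions and 2D human videos is intrinsically coupled. 3D motions provide the structural prior for plausibility and consistency in videos, while pre-trained video models offer strong generalization capabilities for motions, which necessitate coupling their generation processes. Based on this, we present CoMoVi, a co-generative framework that couples two video diffusion models (VDMs) to generate 3D human motions and videos synchronously within a single diffusion denoising loop. To achieve this, we first propose an effective 2D human motion representation that can inherit the powerful prior of pre-trained VDMs. Then, we design a dual-branch diffusion model to couple human motion and video generation process with mutual feature interaction and 3D-2D cross attentions. Moreover, we curate CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations, covering diverse and challenging human motions. Extensive experiments demonstrate the effectiveness of our method in both 3D human motion and video generation tasks.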

[142] CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning

Darshan Singh,Arsha Nagrani,Kawshik Manikantan,Harman Singh,Dinesh Tewari,Tobias Weyand,Cordelia Schmid,Anelia Angelova,Shachi Dave

Main category: cs.CV

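TL;DR: Presents CURVE, a benchmark for multicultural and multilingual video understanding that contains real cultural videos from 18 locales with complex questions and reasoning steps annotated in native languages, and that exposes the weaknesses of current video LLMs in cross-cultural visual understanding.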

Details

Motivation: Existing video understanding benchmarks rely mainly on Western-centric data and English content, which introduces cultural bias and leaves no fair assessment of visual reasoning in diverse cultural contexts. Method: Builds CURVE, a multicultural video benchmark with high-quality, entirely human-generated annotations from 18 global locales; all questions, answers, and multi-step reasoning are written in native languages, and the reasoning traces are used to construct evidence graphs for fine-grained error analysis. Result: Experiments show that state-of-the-art video LLMs perform far below human level on CURVE, with errors stemming mainly from the visual perception of cultural elements. Conclusion: Cross-cultural video understanding requires a deeply situated understanding of visual cultural context, and CURVE is an important tool for more equitable and inclusive video AI research. Abstract: Recent advancements in video models have shown tremendous progress, particularly in long video understanding. However, current benchmarks predominantly feature western-centric data and English as the dominant language, introducing significant biases in evaluation. To address this, we introduce CURVE (Cultural Understanding and Reasoning in Video Evaluation), a challenging benchmark for multicultural and multilingual video reasoning. CURVE comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Unlike prior work that relies on automatic translations, CURVE provides complex questions, answers, and multi-step reasoning steps, all crafted in native languages. Making progress on CURVE requires a deeply situated understanding of visual cultural context. Furthermore, we leverage CURVE's reasoning traces to construct evidence-based graphs and propose a novel iterative strategy using these graphs to identify fine-grained errors in reasoning. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy, with errors primarily stemming from the visual perception of cultural elements. CURVE will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva-cultural

[143] A continental-scale dataset of ground beetles with high-resolution images and validated morphological trait measurements

S M Rayeed,Mridul Khurana,Alyson East,Isadora E. Fluck,Elizabeth G. Campolongo,Samuel Stevens,Iuliia Zarubiieva,Scott C. Lowe,Michael W. Denslow,Evan D. Donoso,Jiaman Wu,Michelle Ramirez,Benjamin Baiser,Charles V. Stewart,Paula Mabee,Tanya Berger-Wolf,Anuj Karpatne,Hilmar Lapp,Robert P. Guralnick,Graham W. Taylor,Sydne Record

Main category: cs.CV

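TL;DR: Uses high-resolution imaging to digitize more than 13,200 ground beetle specimens from the US National Ecological Observatory Network (NEON), helping fill the gap in invertebrate trait databases and achieving sub-millimeter precision in digital trait extraction in support of AI-driven biodiversity monitoring.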

Details

Motivation: Global trait databases are heavily biased toward vertebrates and plants, which limits ecological analyses of high-diversity invertebrates such as ground beetles (Carabidae); meanwhile, NEON's large carabid collection exists mainly as physical specimens and is hard to use in large-scale research. Method: Digitizes more than 13,200 NEON carabid specimens from 30 sites across the continental US and Hawaii through high-resolution imaging, digitally measures the elytra length and width of each specimen to build a dataset suitable for AI-based automated trait extraction, and validates the measurements against manual ones. Result: Digital trait extraction reaches sub-millimeter precision, and validation indicates the data are reliable for ecological and computational studies; the resulting multimodal dataset substantially improves the accessibility and analytical potential of the specimens. Conclusion: The work eases the under-representation of invertebrates in trait databases, provides a foundation for AI-driven species identification and trait-based research, and supports progress in biodiversity monitoring and conservation. Abstract: Despite the ecological significance of invertebrates, global trait databases remain heavily biased toward vertebrates and plants, limiting comprehensive ecological analyses of high-diversity groups like ground beetles. Ground beetles (Coleoptera: Carabidae) serve as critical bioindicators of ecosystem health, providing valuable insights into biodiversity shifts driven by environmental changes. While the National Ecological Observatory Network (NEON) maintains an extensive collection of carabid specimens from across the United States, these primarily exist as physical collections, restricting widespread research access and large-scale analysis. To address these gaps, we present a multimodal dataset digitizing over 13,200 NEON carabids from 30 sites spanning the continental US and Hawaii through high-resolution imaging, enabling broader access and computational analysis. The dataset includes digitally measured elytra length and width of each specimen, establishing a foundation for automated trait extraction using AI. Validated against manual measurements, our digital trait extraction achieves sub-millimeter precision, ensuring reliability for ecological and computational studies. By addressing invertebrate under-representation in trait databases, this work supports AI-driven tools for automated species identification and trait-based research, fostering advancements in biodiversity monitoring and conservation.

[144] See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection

Amir Mallak,Erfan Aasi,Shiva Sreeram,Tsun-Hsuan Wang,Daniela Rus,Alaa Maalouf

Main category: cs.CV

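TL;DR: Proposes Stochastic-Patch-Selection (SPS), a method for improving the robustness and generalization of end-to-end autonomous driving policies in out-of-distribution (OOD) scenarios. By randomly masking a fraction of patch features, SPS reduces the overfitting caused by redundant information, pushes the policy to rely on more invariant features, and outperforms the prior state of the art in both performance and efficiency.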

Details

Motivation: Patch features from foundation models are highly redundant (for example, self-attention makes their information overlap), so training a policy directly on them tends to overfit spurious correlations and hurts generalization in out-of-distribution (OOD) scenarios; a mechanism is needed to reduce the influence of this redundancy on decisions. Method: Proposes Stochastic-Patch-Selection (SPS): for every frame, a random fraction of patch descriptors is masked (while the spatial layout of the rest is preserved) and only the remaining patches are fed to the policy model. This forces the policy to decide from different but still complete views of the scene and to learn features that do not depend on which specific tokens survive. Result: Across all OOD scenarios SPS outperforms the prior state of the art, with a 6.2% average improvement and up to 20.4% in closed-loop simulation, while being 2.4× faster; 8 of the 9 systems trained in the ablations surpass prior SOTA; the learned policy transfers to a physical, real-world car without tuning. Conclusion: By introducing random patch masking, SPS mitigates the overfitting caused by redundant patch features, improves the robustness, generalization, and efficiency of driving policies, and demonstrates real-world transferability. Abstract: Recent advances in end-to-end autonomous driving show that policies trained on patch-aligned features extracted from foundation models generalize better to Out-of-Distribution (OOD). We hypothesize that due to the self-attention mechanism, each patch feature implicitly embeds/contains information from all other patches, represented in a different way and intensity, making these descriptors highly redundant. We quantify redundancy in such (BLIP2) features via PCA and cross-patch similarity: $90$% of variance is captured by $17/64$ principal components, and strong inter-token correlations are pervasive. Training on such overlapping information leads the policy to overfit spurious correlations, hurting OOD robustness. We present Stochastic-Patch-Selection (SPS), a simple yet effective approach for learning policies that are more robust, generalizable, and efficient. For every frame, SPS randomly masks a fraction of patch descriptors, not feeding them to the policy model, while preserving the spatial layout of the remaining patches. Thus, the policy is provided with different stochastic but complete views of the (same) scene: every random subset of patches acts like a different, yet still sensible, coherent projection of the world. The policy thus bases its decisions on features that are invariant to which specific tokens survive. 
Extensive experiments confirm that across all OOD scenarios, our method outperforms the state of the art (SOTA), achieving a $6.2$% average improvement and up to $20.4$% in closed-loop simulations, while being $2.4\times$ faster. We conduct ablations over masking rates and patch-feature reorganization, training and evaluating 9 systems, with 8 of them surpassing prior SOTA. Finally, we show that the same learned policy transfers to a physical, real-world car without any tuning.
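
The per-frame masking step described in the abstract can be sketched as random subset selection over a patch grid. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `stochastic_patch_selection`, the dictionary layout keyed by `(row, col)`, and the 4×4 grid are all hypothetical, and "preserving the spatial layout" is interpreted here as keeping each surviving patch paired with its original grid position.

```python
import random
from typing import Dict, List, Sequence, Tuple

Position = Tuple[int, int]


def stochastic_patch_selection(
    patches: Dict[Position, Sequence[float]],
    mask_rate: float,
    rng: random.Random,
) -> List[Tuple[Position, Sequence[float]]]:
    """Randomly drop a fraction of patch descriptors for one frame.

    Returns the surviving (position, descriptor) pairs in raster order,
    so a downstream policy still knows where each patch came from.
    """
    if not 0.0 <= mask_rate < 1.0:
        raise ValueError("mask_rate must be in [0, 1)")
    positions = sorted(patches)
    n_masked = int(round(mask_rate * len(positions)))
    kept_positions = sorted(rng.sample(positions, len(positions) - n_masked))
    return [(pos, patches[pos]) for pos in kept_positions]


# Toy usage: a 4x4 grid of 2-D descriptors, masking a quarter of the patches.
frame = {(r, c): [float(r), float(c)] for r in range(4) for c in range(4)}
kept = stochastic_patch_selection(frame, mask_rate=0.25, rng=random.Random(0))
```

Calling the function again with a fresh random state gives a different subset of the same scene, which is the source of the stochastic views the abstract refers to. Dropping the masked patches outright, rather than zeroing them, also shortens the input sequence, which is one plausible reading of the reported speed-up.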

[145] From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion

Cheng Chen,Yuyu Guo,Pengpeng Zeng,Jingkuan Song,Peng Di,Hang Yu,Lianli Gao

Main category: cs.CV

TL;DR: 本文提出了一种名为Cross-Layer Injection (CLI) 的新框架，通过建立视觉与语言模型之间的动态多对多连接，克服了现有视觉-语言模型中因单向、静态连接导致的视觉特征瓶颈问题。

Details

Motivation: 现有的视觉-语言模型（VLMs）仅将视觉编码器的输出连接到大语言模型（LLM）输入，这种静态架构限制了LLM与分层视觉知识的全面对齐，难以融合局部细节与全局语义进行连贯推理。 Method: 提出Cross-Layer Injection (CLI) 框架，包含两个轻量级组件：自适应多投影（AMP）模块用于融合不同视觉层的特征，以及自适应门控融合（AGF）机制使LLM能根据解码上下文选择性注入最相关的视觉信息。 Result: 在LLaVA-OneVision和LLaVA-1.5上集成CLI后，在18个多样化基准测试中均表现出显著性能提升。 Conclusion: CLI是一种可扩展的范式，通过为LLM提供按需访问完整视觉层次结构的能力，实现了更深层次的多模态理解。 Abstract: Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection that links only the output of the vision encoder to the input of the large language model (LLM). This static architecture fundamentally limits the ability of LLMs to achieve comprehensive alignment with hierarchical visual knowledge, compromising their capacity to accurately integrate local details with global semantics into coherent reasoning. To resolve this, we introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities. CLI consists of two synergistic, parameter-efficient components: an Adaptive Multi-Projection (AMP) module that harmonizes features from diverse vision layers, and an Adaptive Gating Fusion (AGF) mechanism that empowers the LLM to selectively inject the most relevant visual information based on its real-time decoding context. We validate the effectiveness and versatility of CLI by integrating it into LLaVA-OneVision and LLaVA-1.5. Extensive experiments on 18 diverse benchmarks demonstrate significant performance improvements, establishing CLI as a scalable paradigm that unlocks deeper multimodal understanding by granting LLMs on-demand access to the full visual hierarchy.

[146] Alterbute: Editing Intrinsic Attributes of Objects in Images

Tal Reiss,Daniel Winter,Matan Cohen,Alex Rav-Acha,Yael Pritch,Ariel Shamir,Yedid Hoshen

Main category: cs.CV

TL;DR: 本文提出了一种基于扩散模型的图像对象属性编辑方法Alterbute，能够在保持对象身份和场景上下文的同时修改其颜色、纹理、材质甚至形状等内在属性。

Details

Motivation: 现有方法在编辑对象内在属性时难以兼顾身份保持与属性变化，或因无监督先验导致身份丢失，或因监督过强限制了合理的变化空间。本文旨在实现既灵活又身份保持的属性编辑。 Method: 提出一种松弛训练目标，结合身份参考图像、文本描述目标属性、背景图像和对象掩码进行条件生成；推理时复用原始背景和掩码以限制外在变化。引入视觉命名实体（VNEs）作为细粒度身份类别，并利用视觉语言模型从大规模数据中自动提取VNE标签和属性描述，实现可扩展的身份保持监督。 Result: Alterbute在对象内在属性编辑任务上优于现有方法，尤其在保持身份一致性方面表现突出。 Conclusion: Alterbute通过松弛训练与VNE监督机制，有效实现了对象内在属性的灵活编辑同时保持其身份，推动了图像编辑中身份保持与属性可控性的平衡发展。 Abstract: We introduce Alterbute, a diffusion-based method for editing an object's intrinsic attributes in an image. We allow changing color, texture, material, and even the shape of an object, while preserving its perceived identity and scene context. Existing approaches either rely on unsupervised priors that often fail to preserve identity or use overly restrictive supervision that prevents meaningful intrinsic variations. Our method relies on: (i) a relaxed training objective that allows the model to change both intrinsic and extrinsic attributes conditioned on an identity reference image, a textual prompt describing the target intrinsic attributes, and a background image and object mask defining the extrinsic context. At inference, we restrict extrinsic changes by reusing the original background and object mask, thereby ensuring that only the desired intrinsic attributes are altered; (ii) Visual Named Entities (VNEs) - fine-grained visual identity categories (e.g., ''Porsche 911 Carrera'') that group objects sharing identity-defining features while allowing variation in intrinsic attributes. We use a vision-language model to automatically extract VNE labels and intrinsic attribute descriptions from a large public image dataset, enabling scalable, identity-preserving supervision. Alterbute outperforms existing methods on identity-preserving object intrinsic attribute editing.

[147] WildRayZer: Self-supervised Large View Synthesis in Dynamic Environments

Xuweiyi Chen,Wentao Zhou,Zezhou Cheng

Main category: cs.CV

TL;DR: 提出WildRayZer，一种用于动态环境中新视角合成的自监督框架，通过分析-合成测试分离静态与动态内容，显著提升合成质量。

Details

Motivation: 动态场景中相机和物体均移动，破坏了传统静态NVS模型依赖的多视图一致性，导致伪影、几何幻觉和位姿估计不稳定。 Method: 采用分析-合成测试：用仅相机的静态渲染器解释刚性结构，利用其残差生成伪运动掩码，蒸馏出运动估计器，并用于掩码输入和门控损失梯度，使监督聚焦于背景补全。 Result: 在自建数据集D-RE10K和D-RE10K-iPhone上实验表明，WildRayZer在瞬态区域去除和全帧NVS质量上优于优化和前馈基线方法。 Conclusion: WildRayZer通过残差驱动的运动感知机制，在无需显式动态建模的情况下实现了高质量的动态场景新视角合成。 Abstract: We present WildRayZer, a self-supervised framework for novel view synthesis (NVS) in dynamic environments where both the camera and objects move. Dynamic content breaks the multi-view consistency that static NVS models rely on, leading to ghosting, hallucinated geometry, and unstable pose estimation. WildRayZer addresses this by performing an analysis-by-synthesis test: a camera-only static renderer explains rigid structure, and its residuals reveal transient regions. From these residuals, we construct pseudo motion masks, distill a motion estimator, and use it to mask input tokens and gate loss gradients so supervision focuses on cross-view background completion. To enable large-scale training and evaluation, we curate Dynamic RealEstate10K (D-RE10K), a real-world dataset of 15K casually captured dynamic sequences, and D-RE10K-iPhone, a paired transient and clean benchmark for sparse-view transient-aware NVS. Experiments show that WildRayZer consistently outperforms optimization-based and feed-forward baselines in both transient-region removal and full-frame NVS quality with a single feed-forward pass.

Table of Contents

cs.CL

[1] LLM-Driven Preference Data Synthesis for Proactive Prediction of the Next User Utterance in Human-Machine Dialogue

[2] Evaluating Novelty in AI-Generated Research Plans Using Multi-Workflow LLM Pipelines

[3] Introducing Axlerod: An LLM-based Chatbot for Assisting Independent Insurance Agents

[4] Opportunities and Challenges of Natural Language Processing for Low-Resource Senegalese Languages in Social Science Research

[5] SALP-CG: Standard-Aligned LLM Pipeline for Classifying and Grading Large Volumes of Online Conversational Health Data

[6] StatLLaMA: A multi-stage training framework for building a domain-optimized statistical language model

[7] Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

[8] Uncertainty-Aware Dynamic Knowledge Graphs for Reliable Question Answering

[9] Cross-Platform Evaluation of Large Language Model Safety in Pediatric Consultations: Evolution of Adversarial Robustness and the Scale Paradox

[10] ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language

[11] SagaScale: A Realistic, Scalable, and High-Quality Long-Context Benchmark Built from Full-Length Novels

[12] Syntactic Framing Fragility: An Audit of Robustness in LLM Ethical Decisions

[13] Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation

[14] Forgetting as a Feature: Cognitive Alignment of Large Language Models

[15] SciNets: Graph-Constrained Multi-Hop Reasoning for Scientific Literature Synthesis

[16] Eliminating Agentic Workflow for Introduction Generation with Parametric Stage Tokens

[17] Enhancing Business Analytics through Hybrid Summarization of Financial Reports

[18] Clinical Document Metadata Extraction: A Scoping Review

[19] Geometric Patterns of Meaning: A PHATE Manifold Analysis of Multi-lingual Embeddings

[20] Benchmarking Cross-Lingual Semantic Alignment in Multilingual Embeddings

[21] Closing the Data Loop: Using OpenDataArena to Engineer Superior Training Datasets

[22] From Detection to Diagnosis: Advancing Hallucination Analysis with Automated Data Synthesis

[23] Stable and Explainable Personality Trait Evaluation in Large Language Models with Internal Activations

[24] Bears, all bears, and some bears. Language Constraints on Language Models' Inductive Inferences

[25] MedRedFlag: Investigating how LLMs Redirect Misconceptions in Real-World Health Communication

[26] OUTLINEFORGE: Hierarchical Reinforcement Learning with Explicit States for Scientific Writing

[27] Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL

[28] Clozing the Gap: Exploring Why Language Model Surprisal Outperforms Cloze Surprisal

[29] Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations

[30] Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG

[31] SocraticKG: Knowledge Graph Construction via QA-Driven Fact Extraction

[32] EHRNavigator: A Multi-Agent System for Patient-Level Clinical Question Answering over Heterogeneous Electronic Health Records

[33] EmplifAI: a Fine-grained Dataset for Japanese Empathetic Medical Dialogues in 28 Emotion Labels

[34] Long-Chain Reasoning Distillation via Adaptive Prefix Alignment

[35] Deriving Character Logic from Storyline as Codified Decision Trees

[36] Is MT Ready for the Next Crisis or Pandemic?

[37] CALM-IT: Generating Realistic Long-Form Motivational Interviewing Dialogues with Dual-Actor Conversational Dynamics Tracking

[38] SIN-Bench: Tracing Native Evidence Chains in Long-Context Multimodal Scientific Interleaved Literature

[39] Skill-Aware Data Selection and Fine-Tuning for Data-Efficient Reasoning Distillation

[40] Role-Playing Agents Driven by Large Language Models: Current Status, Challenges, and Future Trends

[41] ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

[42] What Gets Activated: Uncovering Domain and Driver Experts in MoE Language Models

[43] Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

[44] AWED-FiNER: Agents, Web applications, and Expert Detectors for Fine-grained Named Entity Recognition across 36 Languages for 6.6 Billion Speakers

[45] Credit C-GPT: A Domain-Specialized Large Language Model for Conversational Understanding in Vietnamese Debt Collection

[46] HOMURA: Taming the Sand-Glass for Time-Constrained LLM Translation via Reinforcement Learning

[47] HUMANLLM: Benchmarking and Reinforcing LLM Anthropomorphism via Human Cognitive Patterns

[48] One Instruction Does Not Fit All: How Well Do Embeddings Align Personas and Instructions in Low-Resource Indian Languages?

[49] GeoSteer: Faithful Chain-of-Thought Steering via Latent Manifold Gradients

[50] Loop as a Bridge: Can Looped Transformers Truly Link Representation Space and Natural Language Outputs?

[51] coTherapist: A Behavior-Aligned Small Language Model to Support Mental Healthcare Experts

[52] Untangling Input Language from Reasoning Language: A Diagnostic Framework for Cross-Lingual Moral Alignment in LLMs

[53] Measuring Affinity between Attention-Head Weight Subspaces via the Projection Kernel

[54] MoST: Mixing Speech and Text with Modality-Aware Mixture of Experts

[55] The Straight and Narrow: Do LLMs Possess an Internal Moral Path?

[56] Multilinguality as Sense Adaptation

[57] ADVOSYNTH: A Synthetic Multi-Advocate Dataset for Speaker Identification in Courtroom Scenarios

[58] Boundary-Aware NL2SQL: Integrating Reliability through Hybrid Reward and Data Synthesis

[59] An Efficient Long-Context Ranking Architecture With Calibrated LLM Distillation: Application to Person-Job Fit

[60] OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding

[61] Training-Trajectory-Aware Token Selection

[62] Unlocking Implicit Experience: Synthesizing Tool-Use Trajectories from Text

[63] The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models

[64] INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects

[65] TF3-RO-50M: Training Compact Romanian Language Models from Scratch on Synthetic Moral Microfiction

[66] Are Language Models Models?

[67] SurgGoal: Rethinking Surgical Planning Evaluation via Goal-Satisfiability

[68] Contextual StereoSet: Stress-Testing Bias Alignment Robustness in Large Language Models

[69] DR-Arena: an Automated Evaluation Framework for Deep Research Agents

[70] AEQ-Bench: Measuring Empathy of Omni-Modal Large Models

[71] PERM: Psychology-grounded Empathetic Reward Modeling for Large Language Models

[72] Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure

[73] Form and Meaning in Intrinsic Multilingual Evaluations

[74] Influential Training Data Retrieval for Explaining Verbalized Confidence of LLMs

[75] Detecting Winning Arguments with Large Language Models and Persuasion Strategies

[76] LIBERTy: A Causal Framework for Benchmarking Concept-Based Explanations of LLMs with Structural Counterfactuals

[77] Grounding Agent Memory in Contextual Intent

[78] MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching