cs.CL

[1] Enhancing Foundation Models in Transaction Understanding with LLM-based Sentence Embeddings

Xiran Fan, Zhimeng Jiang, Chin-Chia Michael Yeh, Yuzhong Chen, Yingtong Dou, Menghai Pan, Yan Zheng

Main category: cs.CL

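TL;DR: Proposes a hybrid framework that uses LLM-generated embeddings as semantic initializations for lightweight transaction models, combining multi-source data fusion with a one-word constraint principle to improve performance on transaction understanding tasks.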

Motivation: Existing transaction analysis models use index-based representations for merchant categorical fields, which loses semantic information. LLMs offer better semantic understanding, but their computational overhead makes them hard to use in real-time financial settings.

Method: Uses LLM-generated embeddings as the initial semantic representations of a lightweight model, introduces multi-source data fusion to enrich merchant features, and proposes a "one-word constraint" principle to keep embeddings consistent across LLM architectures, while improving data quality through noise filtering and context-aware enrichment.

Result: Experiments on large-scale transaction datasets show that the method significantly outperforms existing methods on multiple transaction understanding tasks.

Conclusion: The proposed framework preserves semantic information while remaining computationally efficient, offering an interpretable, high-performing solution for practical deployment.

Abstract: The ubiquity of payment networks generates vast transactional data encoding rich consumer and merchant behavioral patterns. Recent foundation models for transaction analysis process tabular data sequentially but rely on index-based representations for categorical merchant fields, causing substantial semantic information loss by converting rich textual data into discrete tokens. While Large Language Models (LLMs) can address this limitation through superior semantic understanding, their computational overhead challenges real-time financial deployment. We introduce a hybrid framework that uses LLM-generated embeddings as semantic initializations for lightweight transaction models, balancing interpretability with operational efficiency. Our approach employs multi-source data fusion to enrich merchant categorical fields and a one-word constraint principle for consistent embedding generation across LLM architectures. We systematically address data quality through noise filtering and context-aware enrichment. Experiments on large-scale transaction datasets demonstrate significant performance improvements across multiple transaction understanding tasks.
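The idea of using text embeddings as "semantic initializations" for an index-based model can be sketched as follows. This is only an illustration, not the authors' implementation: the function name `init_embedding_table`, the toy vectors, and the random fallback for categories without a text embedding are all assumptions.

```python
import random


def init_embedding_table(vocab, llm_vectors, dim, seed=0):
    """Build an index -> vector table, seeding rows from text embeddings.

    vocab:       list of category names; position in the list is the index.
    llm_vectors: dict mapping a category name to a precomputed embedding
                 (hypothetical stand-in for LLM sentence embeddings).
    """
    rng = random.Random(seed)
    table = []
    for name in vocab:
        vec = llm_vectors.get(name)
        if vec is None:
            # Assumption: categories with no text embedding fall back
            # to a small random initialization.
            vec = [rng.uniform(-0.01, 0.01) for _ in range(dim)]
        if len(vec) != dim:
            raise ValueError(
                f"embedding for {name!r} has length {len(vec)}, expected {dim}"
            )
        table.append(list(vec))
    return table


# Toy, made-up vectors; real ones would come from an embedding model.
vocab = ["grocery stores", "gas stations", "unknown"]
llm_vectors = {
    "grocery stores": [0.2, 0.1, -0.3],
    "gas stations": [0.0, 0.4, 0.1],
}
table = init_embedding_table(vocab, llm_vectors, dim=3)
```

The lightweight model would then start training from `table` rather than from random rows, so the LLM is only needed offline, when the table is built.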

[2] The Table of Media Bias Elements: A sentence-level taxonomy of media bias types and propaganda techniques

Tim Menzner, Jochen L. Leidner

Main category: cs.CL

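TL;DR: Proposes a fine-grained, sentence-level taxonomy of media bias and propaganda comprising 38 elementary bias types in six functional families, supported by empirical analysis and interdisciplinary theory.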

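Motivation: Existing work focuses on where outlets stand on the left-right political spectrum and overlooks that bias is conveyed through concrete linguistic devices, so a finer-grained analysis is needed that shifts attention from stance to how partiality is expressed in language.

Method: Combining close reading, interdisciplinary theory, and pilot annotation over 26,464 sentences from newsroom corpora, user submissions, and the authors' own browsing, the authors build a two-tier taxonomy, then run a quantitative survey of a random sample and compare the schema with existing taxonomies.

Result: A taxonomy of 38 bias types, visualised as a "table of media-bias elements", with a definition, examples, cognitive and societal drivers, and recognition guidance for each type. A survey of a random 155-sentence sample illustrates prevalence differences, and the comparison with existing taxonomies shows coverage gains and reduced ambiguity.

Conclusion: The taxonomy identifies sentence-level media bias more precisely than political-spectrum analysis and offers natural language processing and communication science a more explanatory and practical tool.

Abstract: Public debates about "left-" or "right-wing" news overlook the fact that bias is usually conveyed by concrete linguistic manoeuvres that transcend any single political spectrum. We therefore shift the focus from where an outlet allegedly stands to how partiality is expressed in individual sentences. Drawing on 26,464 sentences collected from newsroom corpora, user submissions and our own browsing, we iteratively combine close-reading, interdisciplinary theory and pilot annotation to derive a fine-grained, sentence-level taxonomy of media bias and propaganda. The result is a two-tier schema comprising 38 elementary bias types, arranged in six functional families and visualised as a "table of media-bias elements". For each type we supply a definition, real-world examples, cognitive and societal drivers, and guidance for recognition. A quantitative survey of a random 155-sentence sample illustrates prevalence differences, while a cross-walk to the best-known NLP and communication-science taxonomies reveals substantial coverage gains and reduced ambiguity.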

[3] Lost in Execution: On the Multilingual Robustness of Tool Calling in Large Language Models

Zheng Luo, T Pranav Kutralingam, Ogochukwu N Okoani, Wanpeng Xu, Hua Wei, Xiyang Hu

Main category: cs.CL

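TL;DR: Introduces the MLCL benchmark to systematically evaluate LLM tool calling in multilingual settings, finds parameter value language mismatch to be a dominant failure mode, and measures how much several inference-time strategies improve performance.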

Motivation: The robustness of LLM tool calling under multilingual user interactions, especially in non-English contexts, remains unclear.

Method: Proposes the MLCL diagnostic benchmark, systematically evaluates Chinese, Hindi, and the low-resource language Igbo with fine-grained error analysis, and assesses several inference-time system strategies.

Result: Even when models correctly understand intent and select the right tool, parameter value language mismatch remains a dominant cause of failure. The tested strategies substantially reduce language-induced execution errors but cannot fully recover English-level performance.

Conclusion: Multilingual tool calling poses significant challenges, particularly because parameter values must follow language-invariant execution conventions, and current methods remain limited on low-resource languages.

Abstract: Large Language Models (LLMs) are increasingly deployed as agents that invoke external tools through structured function calls. While recent work reports strong tool-calling performance under standard English-centric evaluations, the robustness of tool calling under multilingual user interactions remains underexplored. In this work, we introduce MLCL, a diagnostic benchmark, and conduct a systematic evaluation of multilingual tool calling across Chinese, Hindi, and the low-resource language Igbo. Through fine-grained error analysis, we show that many failures occur despite correct intent understanding and tool selection. We identify parameter value language mismatch as a dominant failure mode, where models generate semantically appropriate parameter values in the user's language, violating language-invariant execution conventions. We further evaluate several inference-time system strategies and find that while these strategies substantially reduce language-induced execution errors, none of them can fully recover English-level performance.

[4] Same Claim, Different Judgment: Benchmarking Scenario-Induced Bias in Multilingual Financial Misinformation Detection

Zhiwei Liu, Yupen Cao, Yuechen Jiang, Mohsinul Kabir, Polydoros Giannouris, Chen Xu, Ziyang Xu, Tianlei Zhu, Tariquzzaman Faisal, Triantafillos Papadopoulos, Yan Wang, Lingfei Qian, Xueqing Peng, Zhuohan Xie, Ye Yuan, Saeed Almheiri, Abdulrazzaq Alnajjar, Mingbin Chen, Harry Stuart, Paul Thompson, Prayag Tiwari, Alejandro Lopez-Lira, Xue Liu, Jimin Huang, Sophia Ananiadou

Main category: cs.CL

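TL;DR: Proposes MFMD-Scen, a comprehensive benchmark for evaluating behavioral biases of large language models (LLMs) in multilingual financial misinformation detection. It combines complex economic scenarios built with financial experts and a multilingual dataset, systematically evaluates 22 mainstream LLMs, and finds pronounced behavioral biases in both commercial and open-source models.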

Motivation: Existing research mainly studies LLM bias in simple, general-purpose settings and gives little attention to behavioral bias in complex real-world financial environments or in high-risk, context-sensitive, multilingual financial misinformation detection.

Method: In collaboration with financial experts, the authors design three types of complex financial scenarios: role- and personality-based, role- and region-based, and role-based scenarios incorporating ethnicity and religious beliefs. They build a multilingual financial misinformation dataset covering English, Chinese, Greek, and Bengali, and combine the scenarios with misinformation claims to form the MFMD-Scen benchmark, which is used to evaluate the behavioral biases of 22 mainstream LLMs.

Result: Both commercial and open-source LLMs show pronounced behavioral biases when handling multilingual financial misinformation, and the biases vary across languages and scenarios.

Conclusion: MFMD-Scen provides an effective tool for evaluating LLM behavioral bias in complex financial environments, exposes potential problems with current models in high-risk financial applications, and underlines the need for further work on mitigating such biases.

Abstract: Large language models (LLMs) have been widely applied across various domains of finance. Since their training data are largely derived from human-authored corpora, LLMs may inherit a range of human biases. Behavioral biases can lead to instability and uncertainty in decision-making, particularly when processing financial information. However, existing research on LLM bias has mainly focused on direct questioning or simplified, general-purpose settings, with limited consideration of the complex real-world financial environments and high-risk, context-sensitive, multilingual financial misinformation detection (MFMD) tasks. In this work, we propose MFMD-Scen, a comprehensive benchmark for evaluating behavioral biases of LLMs in MFMD across diverse economic scenarios. In collaboration with financial experts, we construct three types of complex financial scenarios: (i) role- and personality-based, (ii) role- and region-based, and (iii) role-based scenarios incorporating ethnicity and religious beliefs. We further develop a multilingual financial misinformation dataset covering English, Chinese, Greek, and Bengali. By integrating these scenarios with misinformation claims, MFMD-Scen enables a systematic evaluation of 22 mainstream LLMs. Our findings reveal that pronounced behavioral biases persist across both commercial and open-source models. This project will be available at https://github.com/lzw108/FMD.

[5] Glitter: Visualizing Lexical Surprisal for Readability in Administrative Texts

Jan Černý, Ivana Kvapilíková, Silvie Cinková

Main category: cs.CL

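TL;DR: Proposes an information-entropy-based method for estimating text readability, together with a visualization framework and an open-source tool.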

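Motivation: Estimate the readability of text by measuring its information entropy, in particular to improve the clarity of administrative or bureaucratic texts.

Method: Approximates the information entropy of text with multiple language models and builds a visualization framework to display the result.

Result: Provides readability estimates for text and releases a libre software toolset named Glitter.

Conclusion: The method can help improve the readability and comprehensibility of complex texts and has practical value.

Abstract: This work investigates how measuring the information entropy of text can be used to estimate its readability. We propose a visualization framework that can be used to approximate the information entropy of text using multiple language models and visualize the result. The end goal is to use this method to estimate and improve the readability and clarity of administrative or bureaucratic texts. Our toolset is available as libre software at https://github.com/ufal/Glitter.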
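For readers unfamiliar with the quantity named in the title, the surprisal of a token is conventionally defined as the negative log probability a language model assigns to it given its context. The sketch below shows only that definition; the token probabilities are made-up numbers, and nothing here is taken from the Glitter code base.

```python
import math


def surprisal_bits(prob):
    """Surprisal of one token in bits: -log2 p(token | context)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def mean_surprisal(probs):
    """Average surprisal over a token sequence, a rough difficulty score."""
    return sum(surprisal_bits(p) for p in probs) / len(probs)


# Hypothetical per-token probabilities, as a language model might assign
# them to the words of a bureaucratic sentence.
tokens = ["the", "applicant", "shall", "remit"]
probs = [0.5, 0.25, 0.125, 0.125]
per_token = [surprisal_bits(p) for p in probs]
```

A visualization could then color each token by its entry in `per_token`, so that words the model finds unexpected stand out.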

[6] Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions

Minda Zhao, Yilun Du, Mengyu Wang

Main category: cs.CL

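TL;DR: Presents the first large-scale statistical audit of native probabilistic sampling in frontier large language models (LLMs). Sampling fidelity is severely lacking across protocols and distributions, worsens as distributional complexity and sample count grow, and causes systematic bias in downstream tasks, indicating that current LLMs lack a reliable internal sampler and need external tools where statistical rigor is required.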

Motivation: As LLMs move from chat interfaces to key components of stochastic pipelines such as educational assessment and synthetic data construction, faithfully sampling from a specified probability distribution has become a functional requirement, yet it is unclear how frontier models actually perform on this task.

Method: A dual-protocol design, Batch Generation and Independent Requests, is used to draw N=1000 samples from each of 15 distributions on 11 models. Statistical tests assess agreement between the samples and the target distribution, and the study analyzes failure modes and their effect on downstream tasks.

Result: Batch generation reaches a median pass rate of only 13%, while under independent requests 10 of the 11 models pass none of the distributions. Sampling fidelity degrades monotonically with distributional complexity and with the number of samples N, and causes systematic bias in downstream tasks such as MCQ generation and text-to-image prompt synthesis.

Conclusion: Current LLMs lack an effective internal sampling mechanism, so applications that need statistical guarantees must rely on external sampling tools.

Abstract: As large language models (LLMs) transition from chat interfaces to integral components of stochastic pipelines across domains like educational assessment and synthetic data construction, the ability to faithfully sample from specified probability distributions has become a functional requirement rather than a theoretical curiosity. We present the first large-scale, statistically powered audit of native probabilistic sampling in frontier LLMs, benchmarking 11 models across 15 distributions. To disentangle failure modes, we employ a dual-protocol design: Batch Generation, where a model produces N=1000 samples within one response, and Independent Requests, comprising N=1000 stateless calls. We observe a sharp protocol asymmetry: batch generation achieves only modest statistical validity, with a 13% median pass rate, while independent requests collapse almost entirely, with 10 of 11 models passing none of the distributions. Beyond this asymmetry, we reveal that sampling fidelity degrades monotonically with distributional complexity and worsens as the requested sampling horizon N increases. Finally, we demonstrate the propagation of these failures into downstream tasks: models fail to enforce uniform answer-position constraints in MCQ generation and systematically violate demographic targets in attribute-constrained text-to-image prompt synthesis. These findings indicate that current LLMs lack a functional internal sampler, necessitating the use of external tools for applications requiring statistical guarantees.
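The listing does not say which statistical tests the authors use, so the following is only a sketch of what a pass/fail check on sampled output could look like: a Pearson chi-square goodness-of-fit test against a fair six-sided die. The sample lists are synthetic, and the critical value 11.070 is the standard 5% threshold for 5 degrees of freedom.

```python
from collections import Counter

# Chi-square critical value for df = 5 at significance level 0.05.
CHI2_CRIT_DF5_ALPHA05 = 11.070


def chi_square_uniform(samples, categories):
    """Pearson chi-square statistic of samples against a uniform target.

    Values outside `categories` are ignored by this sketch; a stricter
    audit would count them as failures.
    """
    counts = Counter(samples)
    expected = len(samples) / len(categories)
    return sum((counts.get(c, 0) - expected) ** 2 / expected for c in categories)


def passes_uniform_test(samples, categories, critical=CHI2_CRIT_DF5_ALPHA05):
    """True if the uniform hypothesis is NOT rejected at the given threshold."""
    return chi_square_uniform(samples, categories) <= critical


faces = [1, 2, 3, 4, 5, 6]
balanced = faces * 100                        # 600 rolls, perfectly even
skewed = [6] * 300 + [1, 2, 3, 4, 5] * 60     # 600 rolls, half of them sixes
```

Here `passes_uniform_test(balanced, faces)` accepts the sample and `passes_uniform_test(skewed, faces)` rejects it; in an audit of the kind described, the sample list would be parsed from model responses instead.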

[7] Tracing Moral Foundations in Large Language Models

Chenxiao Yu, Bowen Yi, Farzan Karimi-Malekabadi, Suhaib Abdurahman, Jinyi Ye, Shrikanth Narayanan, Yue Zhao, Morteza Dehghani

Main category: cs.CL

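TL;DR: Using Moral Foundations Theory (MFT), the study analyzes two instruction-tuned large language models (LLMs) and finds that moral concepts are represented in a layered, distributed, and partly disentangled way and causally affect outputs through internal representations, suggesting that pluralistic moral structure in LLMs may emerge as a latent pattern from the statistical regularities of language.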

Motivation: Determine whether the moral judgments LLMs produce stem from an internal conceptual structure or from superficial imitation.

Method: A multi-level approach combining layer-wise analysis of MFT concept representations and their alignment with human moral perceptions, pretrained sparse autoencoders (SAEs) to identify sparse features that support moral concepts, and causal steering interventions using dense MFT vectors and sparse SAE features.

Result: Both models represent and distinguish moral foundations in a structured, layer-dependent way that agrees with human judgments. SAE features show clear semantic links to specific foundations, and steering along dense vectors or sparse features shifts model behavior predictably, demonstrating a causal link to moral outputs.

Conclusion: Moral concepts in LLMs are distributed, layered, and partly disentangled, which suggests that pluralistic moral structure can emerge in the model as a latent pattern from the statistical regularities of language.

Abstract: Large language models (LLMs) often produce human-like moral judgments, but it is unclear whether this reflects an internal conceptual structure or superficial "moral mimicry." Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded, organized, and expressed within two instruction-tuned LLMs: Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct. We employ a multi-level approach combining (i) layer-wise analysis of MFT concept representations and their alignment with human moral perceptions, (ii) pretrained sparse autoencoders (SAEs) over the residual stream to identify sparse features that support moral concepts, and (iii) causal steering interventions using dense MFT vectors and sparse SAE features. We find that both models represent and distinguish moral foundations in a structured, layer-dependent way that aligns with human judgments. At a finer scale, SAE features show clear semantic links to specific foundations, suggesting partially disentangled mechanisms within shared representations. Finally, steering along either dense vectors or sparse features produces predictable shifts in foundation-relevant behavior, demonstrating a causal connection between internal representations and moral outputs. Together, our results provide mechanistic evidence that moral concepts in LLMs are distributed, layered, and partly disentangled, suggesting that pluralistic moral structure can emerge as a latent pattern from the statistical regularities of language alone.
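A steering intervention of the kind mentioned in step (iii) is commonly implemented by adding a scaled direction vector to a hidden state. The sketch below shows that generic operation on plain Python lists; the unit normalization, the scale `alpha`, and the toy numbers are assumptions, and the abstract does not specify how the authors apply their vectors.

```python
import math


def steer(hidden, direction, alpha):
    """Return hidden + alpha * direction / ||direction||.

    hidden:    one residual-stream activation (list of floats).
    direction: a concept vector, e.g. for one moral foundation (assumed given).
    alpha:     steering strength; its sign selects the push direction.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]


# Toy 3-dimensional example; real activations have thousands of dimensions.
hidden = [1.0, 0.0, 2.0]
direction = [3.0, 0.0, 4.0]   # norm 5
steered = steer(hidden, direction, alpha=5.0)
```

In practice this update would run inside a forward hook at a chosen layer, and the effect would be judged by how the generated text changes.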

[8] Do LLMs Need Inherent Reasoning Before Reinforcement Learning? A Study in Korean Self-Correction

Hongjin Kim, Jaewook Lee, Kiyoung Lee, Jong-hun Shin, Soojong Lim, Oh-Woog Kwon

Main category: cs.CL

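TL;DR: Investigates how reinforcement learning and neuron-level tuning can improve LLM reasoning in low-resource languages such as Korean, and finds that the key is aligning the model's internal reasoning process rather than injecting new linguistic knowledge.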

Motivation: LLMs perform well in high-resource languages such as English, but their reasoning and self-correction abilities are limited in low-resource languages such as Korean, and effective ways to improve them are needed.

Method: Introduces a self-correction code-switching dataset and combines reinforcement learning with fine-tuning of Korean-specific neurons, particularly in early layers, to align the model's internal reasoning process.

Result: The approach significantly improves Korean performance on mathematical reasoning and self-correction tasks, showing that reinforcement learning alone is insufficient and must be paired with an internal alignment mechanism.

Conclusion: The key to better multilingual reasoning is eliciting and aligning the reasoning abilities the model already has, in particular through internal translation and neuron-level tuning that align reasoning across languages.

Abstract: Large Language Models (LLMs) demonstrate strong reasoning and self-correction abilities in high-resource languages like English, but their performance remains limited in low-resource languages such as Korean. In this study, we investigate whether reinforcement learning (RL) can enhance Korean reasoning abilities to a degree comparable to English. Our findings reveal that RL alone yields limited improvements when applied to models lacking inherent Korean reasoning capabilities. To address this, we explore several fine-tuning strategies and show that aligning the model's internal reasoning processes with Korean inputs, particularly by tuning Korean-specific neurons in early layers, is key to unlocking RL's effectiveness. We introduce a self-correction code-switching dataset to facilitate this alignment and observe significant performance gains in both mathematical reasoning and self-correction tasks. Ultimately, we conclude that the crucial factor in multilingual reasoning enhancement is not injecting new linguistic knowledge, but effectively eliciting and aligning existing reasoning capabilities. Our study provides a new perspective on how internal translation and neuron-level tuning contribute to multilingual reasoning alignment in LLMs.

[9] Towards Valid Student Simulation with Large Language Models

Zhihao Yuan, Yunze Xiao, Ming Li, Weihao Xuan, Richard Tong, Mona Diab, Tom Mitchell

Main category: cs.CL

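TL;DR: Presents a conceptual and methodological framework for large language model (LLM) based student simulation, identifies the "competence paradox", and introduces an Epistemic State Specification (ESS) and a Goal-by-Environment framework to make simulations more faithful and valid.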

Motivation: Address the "competence paradox" that arises when LLMs simulate learners: a broadly capable model struggles to realistically emulate the behavior of a learner with only partial knowledge.

Method: Reframes student simulation as a constrained generation problem and introduces an explicit Epistemic State Specification (ESS) and a Goal-by-Environment framework to specify what a learner can access, how errors are structured, and how learner state evolves.

Result: Formalizes the key design dimensions of student simulation systems, synthesizes prior work, and sets out open challenges concerning validity, evaluation, and ethical risks.

Conclusion: Epistemic fidelity matters more than surface realism and is a prerequisite for using LLMs as scientific and pedagogical instruments.

Abstract: This paper presents a conceptual and methodological framework for large language model (LLM) based student simulation in educational settings. The authors identify a core failure mode, termed the "competence paradox", in which broadly capable LLMs are asked to emulate partially knowledgeable learners, leading to unrealistic error patterns and learning dynamics. To address this, the paper reframes student simulation as a constrained generation problem governed by an explicit Epistemic State Specification (ESS), which defines what a simulated learner can access, how errors are structured, and how learner state evolves over time. The work further introduces a Goal-by-Environment framework to situate simulated student systems according to behavioral objectives and deployment contexts. Rather than proposing a new system or benchmark, the paper synthesizes prior literature, formalizes key design dimensions, and articulates open challenges related to validity, evaluation, and ethical risks. Overall, the paper argues for epistemic fidelity over surface realism as a prerequisite for using LLM-based simulated students as reliable scientific and pedagogical instruments.

[10] The Facade of Truth: Uncovering and Mitigating LLM Susceptibility to Deceptive Evidence

Herun Wan, Jiaying Wu, Minnan Luo, Fanxiao Li, Zhi Zeng, Min-Yen Kan

Main category: cs.CL

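TL;DR: Introduces MisBelief, a framework that generates misleading evidence to test how stable the factual beliefs of large language models (LLMs) are when faced with sophisticated, hard-to-falsify information, and finds existing models highly susceptible. The authors then propose Deceptive Intent Shielding (DIS), which gives an early warning by inferring the deceptive intent behind evidence and effectively mitigates belief shifts.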

Motivation: For LLMs to hold reliable factual beliefs when assisting human decisions, their vulnerability to subtle, sophisticated misleading information must be examined. Prior work focuses on resistance to explicit misinformation and overlooks more covert, logically plausible deceptive reasoning.

Method: Proposes the MisBelief framework, in which multi-role LLMs collaborate over multiple rounds to generate progressively refined, logically coherent but factually wrong evidence. The authors build a dataset of 4,800 instances across three difficulty levels, evaluate 7 representative LLMs, and design Deceptive Intent Shielding (DIS), which intervenes early by inferring the deceptive intent in evidence so that models judge evidence more cautiously.

Result: Although LLMs resist direct misinformation to some degree, they are highly sensitive to the refined misleading evidence generated by MisBelief, with belief scores in falsehoods rising by 93.0% on average. Adding DIS effectively suppresses this belief shift and prompts more cautious evidence evaluation.

Conclusion: LLMs are seriously vulnerable to carefully constructed, logically consistent but deceptive evidence, and fact checking alone does not guarantee reliability. DIS offers a new line of defense by identifying deceptive intent, and highlights the need for trustworthy AI systems to reason about the source and motive of evidence.

Abstract: To reliably assist human decision-making, LLMs must maintain factual internal beliefs against misleading injections. While current models resist explicit misinformation, we uncover a fundamental vulnerability to sophisticated, hard-to-falsify evidence. To systematically probe this weakness, we introduce MisBelief, a framework that generates misleading evidence via collaborative, multi-round interactions among multi-role LLMs. This process mimics subtle, defeasible reasoning and progressive refinement to create logically persuasive yet factually deceptive claims. Using MisBelief, we generate 4,800 instances across three difficulty levels to evaluate 7 representative LLMs. Results indicate that while models are robust to direct misinformation, they are highly sensitive to this refined evidence: belief scores in falsehoods increase by an average of 93.0%, fundamentally compromising downstream recommendations. To address this, we propose Deceptive Intent Shielding (DIS), a governance mechanism that provides an early warning signal by inferring the deceptive intent behind evidence. Empirical results demonstrate that DIS consistently mitigates belief shifts and promotes more cautious evidence evaluation.

[11] MemBuilder: Reinforcing LLMs for Long-Term Memory Construction via Attributed Dense Rewards

Zhiyu Shen, Ziming Wu, Fuming Lai, Shaobing Lian, Yanghui Rao

Main category: cs.CL

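TL;DR: MemBuilder is a reinforcement learning framework that improves the consistency of large models in long-term dialogue through multi-dimensional memory construction and a dense reward mechanism.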

Motivation: Existing retrieval mechanisms struggle to capture how historical states evolve over time, and current memory-augmented frameworks either rely on static prompting or suffer from sparse rewards, so models have difficulty staying consistent in long dialogues.

Method: Proposes the MemBuilder framework, which uses synthetic session-level question generation to provide dense intermediate rewards and introduces contribution-aware gradient weighting that adjusts policy updates according to each component's impact on the downstream task.

Result: MemBuilder enables a 4B-parameter model to outperform state-of-the-art closed-source baselines on several long-term dialogue benchmarks and shows strong generalization.

Conclusion: Through dense rewards and multi-dimensional memory attribution, MemBuilder effectively addresses consistency in long-term dialogue and offers a new approach to training more capable dialogue systems.

Abstract: Maintaining consistency in long-term dialogues remains a fundamental challenge for LLMs, as standard retrieval mechanisms often fail to capture the temporal evolution of historical states. While memory-augmented frameworks offer a structured alternative, current systems rely on static prompting of closed-source models or suffer from ineffective training paradigms with sparse rewards. We introduce MemBuilder, a reinforcement learning framework that trains models to orchestrate multi-dimensional memory construction with attributed dense rewards. MemBuilder addresses two key challenges: (1) Sparse Trajectory-Level Rewards: we employ synthetic session-level question generation to provide dense intermediate rewards across extended trajectories; and (2) Multi-Dimensional Memory Attribution: we introduce contribution-aware gradient weighting that scales policy updates based on each component's downstream impact. Experimental results show that MemBuilder enables a 4B-parameter model to outperform state-of-the-art closed-source baselines, exhibiting strong generalization across long-term dialogue benchmarks.

[12] FlashMem: Distilling Intrinsic Latent Memory via Computation Reuse

Yubo Hou, Zhisheng Chen, Tao Wan, Zengchang Qin

Main category: cs.CL

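TL;DR: FlashMem is a framework that distills intrinsic memory by reusing internal states from the reasoning process, addressing the inability of large language models to retain dynamic context.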

Motivation: The stateless architecture of large language models cannot preserve dynamic context, so agents must repeatedly reprocess history to sustain long-horizon autonomy, and existing latent memory methods are inefficient because the memory module is architecturally separated from the reasoning backbone.

Method: Exploiting the property that internal representations uniquely encode the input trajectory, FlashMem treats the last hidden state as a sufficient statistic for the interaction history, synthesizes memory directly with a Consolidator that shares the KV cache, and uses an attention-entropy-based Cognitive Monitor to trigger consolidation adaptively.

Result: Experiments show that FlashMem matches the performance of heavy baselines while reducing inference latency by 5 times.

Conclusion: FlashMem effectively bridges the gap between efficient inference and persistent cognition and offers a new approach to memory in large models.

Abstract: The stateless architecture of Large Language Models inherently lacks the mechanism to preserve dynamic context, compelling agents to redundantly reprocess history to maintain long-horizon autonomy. While latent memory offers a solution, current approaches are hindered by architectural segregation, relying on auxiliary encoders that decouple memory from the reasoning backbone. We propose FlashMem, a framework that distills intrinsic memory directly from transient reasoning states via computation reuse. Leveraging the property that internal representations uniquely encode input trajectories, FlashMem identifies the last hidden state as a sufficient statistic for the interaction history. This enables a Shared-KV Consolidator to synthesize memory by attending directly to the backbone's frozen cache, eliminating redundant re-parameterization. Furthermore, a parameter-free Cognitive Monitor leverages attention entropy to adaptively trigger consolidation only when high epistemic uncertainty is detected. Experiments demonstrate that FlashMem matches the performance of heavy baselines while reducing inference latency by 5 times, effectively bridging the gap between efficiency and persistent cognition.
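The Cognitive Monitor is described only as using attention entropy to trigger consolidation. One plausible reading, sketched below with assumed names and an arbitrary threshold, is to compute the Shannon entropy of an attention distribution and fire when it exceeds a cutoff; how FlashMem aggregates heads and layers, and what threshold it uses, is not stated in the listing.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    weights: non-negative attention weights over past positions, summing to 1.
    """
    return -sum(w * math.log(w) for w in weights if w > 0.0)


def should_consolidate(weights, threshold):
    """Trigger consolidation when attention is diffuse (high entropy).

    The threshold is a hypothetical tuning knob, not a value from the paper.
    """
    return attention_entropy(weights) > threshold


peaked = [1.0, 0.0, 0.0, 0.0]        # model attends to a single position
diffuse = [0.25, 0.25, 0.25, 0.25]   # attention spread evenly
```

With a threshold of 1.0 nat, the peaked distribution (entropy 0) does not trigger, while the uniform one (entropy ln 4, about 1.39) does.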

[13] CHisAgent: A Multi-Agent Framework for Event Taxonomy Construction in Ancient Chinese Cultural Systems

Xuemei Tang, Chengxi Yan, Jinghang Gu, Chu-Ren Huang

Main category: cs.CL

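TL;DR: Proposes CHisAgent, a multi-agent large language model framework for building a taxonomy of ancient Chinese historical events, using three specialized roles to improve the taxonomy's coherence and coverage.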

Motivation: Large language models have limited ability in historical and cultural reasoning outside English, and building historical taxonomies by hand is costly and hard to scale.

Method: Taxonomy construction is decomposed into three roles: a bottom-up Inducer that derives an initial hierarchy from raw text, a top-down Expander that adds intermediate concepts using LLM world knowledge, and an evidence-guided Enricher that integrates external structured resources to ensure faithfulness.

Result: Using the Twenty-Four Histories, the authors construct an event taxonomy of ancient China covering politics, military, diplomacy, and social life. It shows better structural coherence and coverage in both reference-free and reference-based evaluations and supports cross-cultural alignment.

Conclusion: CHisAgent can automatically build high-quality, domain-aware historical taxonomies and strengthens large models' understanding and reasoning in Chinese historical contexts.

Abstract: Despite strong performance on many tasks, large language models (LLMs) show limited ability in historical and cultural reasoning, particularly in non-English contexts such as Chinese history. Taxonomic structures offer an effective mechanism to organize historical knowledge and improve understanding. However, manual taxonomy construction is costly and difficult to scale. Therefore, we propose CHisAgent, a multi-agent LLM framework for historical taxonomy construction in ancient Chinese contexts. CHisAgent decomposes taxonomy construction into three role-specialized stages: a bottom-up Inducer that derives an initial hierarchy from raw historical corpora, a top-down Expander that introduces missing intermediate concepts using LLM world knowledge, and an evidence-guided Enricher that integrates external structured historical resources to ensure faithfulness. Using the Twenty-Four Histories, we construct a large-scale, domain-aware event taxonomy covering politics, military, diplomacy, and social life in ancient China. Extensive reference-free and reference-based evaluations demonstrate improved structural coherence and coverage, while further analysis shows that the resulting taxonomy supports cross-cultural alignment.

[14] Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism

Yuhao Shen, Tianyu Liu, Junyi Shen, Jinyang Wu, Quan Kong, Li Huan, Cong Wang

Main category: cs.CL

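TL;DR: Proposes Double, a parallel speculative decoding framework that combines retrieval with a synchronous mechanism to overcome the theoretical speedup ceiling and wasted computation of conventional speculative decoding, delivering training-free, lossless acceleration.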

Motivation: Speculative Decoding (SD) and its parallel variant (PSD) are limited by a theoretical speedup ceiling and by the wasted computation and pipeline stalls caused by mid-sequence token rejections, which makes further inference speedups difficult.

Method: Proposes the Double framework with a synchronous mechanism: the draft model runs iterative retrieval speculations to break the speedup limit, while the target model performs authoritative retrieval to produce multi-token guidance and avoid rollback, improving both precision and efficiency. The whole process is training-free and leaves the generated output unchanged.

Result: Achieves a 5.3× speedup on LLaMA3.3-70B and 2.8× on Qwen3-32B, significantly outperforming EAGLE-3, which requires extensive training.

Conclusion: Through cooperating double retrieval, Double resolves the tension between precision and efficiency in speculative decoding and achieves state-of-the-art inference acceleration without training.

Abstract: Parallel Speculative Decoding (PSD) accelerates traditional Speculative Decoding (SD) by overlapping draft generation with verification. However, it remains hampered by two fundamental challenges: (1) a theoretical speedup ceiling dictated by the speed ratio between the draft and target models, and (2) high computational waste and pipeline stall due to mid-sequence token rejections of early errors. To address these limitations, we introduce Double (Double Retrieval Speculative Parallelism). By bridging the gap between SD and PSD, our framework resolves the Retrieval Precision-Efficiency Dilemma through a novel synchronous mechanism. Specifically, we enable the draft model to execute iterative retrieval speculations to break the theoretical speedup limits; to alleviate rejections without rollback, the target model performs authoritative retrieval to generate multi-token guidance. Double is entirely training-free and lossless. Extensive experiments demonstrate state-of-the-art speedup of 5.3× on LLaMA3.3-70B and 2.8× on Qwen3-32B, significantly outperforming the advanced method EAGLE-3 that requires extensive model training.
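As background for the terms above, here is a minimal sketch of one step of plain greedy speculative decoding, the baseline that PSD and Double build on. It is not the Double algorithm: there is no retrieval and no overlap of drafting with verification, and the two "models" are toy functions invented for the example.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One draft-then-verify step of greedy speculative decoding.

    draft_next / target_next: callables mapping a token list to the next
    token (stand-ins for a small and a large model). Returns the tokens
    to append; they always match what the target alone would produce.
    """
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2) The target verifies the proposal. A real system scores all k
    #    positions in one batched forward pass; here it is a simple loop.
    accepted = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # Mid-sequence rejection: keep the target's token and discard
            # the rest of the draft (the wasted work noted in the abstract).
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3) Every draft token was accepted, so the target adds one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def target_next(seq):
    """Toy target model: counts upward modulo 5."""
    return len(seq) % 5


def draft_next(seq):
    """Toy draft model: agrees with the target except at every 4th position."""
    return 9 if len(seq) % 4 == 3 else len(seq) % 5
```

Starting from an empty prefix with `k=3`, all three draft tokens are accepted and the step yields four tokens; starting from `[0, 1]` with `k=4`, the second draft token is rejected and only two tokens are produced, which illustrates the cost of early mismatches.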

[15] Closing the Modality Reasoning Gap for Speech Large Language Models

Chaoren Wang, Heng Lu, Xueyao Zhang, Shujie Liu, Yan Lu, Jinyu Li, Zhizheng Wu

Main category: cs.CL

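TL;DR: Proposes TARS, a reinforcement learning framework that aligns text-conditioned and speech-conditioned reasoning trajectories through an asymmetric reward design, effectively narrowing the modality reasoning gap in speech large language models.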

Motivation: Speech large language models reason markedly worse on speech input than on text input. This modality reasoning gap is attributed to representational drift across Transformer layers and behavior deviations in long-chain reasoning.

Method: Proposes the TARS framework, which uses reinforcement learning with two dense, complementary signals, representation alignment (layer-wise hidden-state similarity) and behavior alignment (semantic consistency between generated outputs and reference text completions), and aligns text and speech reasoning trajectories through an asymmetric reward.

Result: On challenging reasoning benchmarks such as MMSU and OBQA, TARS significantly narrows the modality reasoning gap and achieves state-of-the-art performance among 7B-scale speech LLMs.

Conclusion: By jointly optimizing representation and behavior alignment, TARS improves the reasoning ability of speech large language models and offers a new approach to narrowing cross-modal reasoning gaps.

Abstract: Although speech large language models have achieved notable progress, a substantial modality reasoning gap remains: their reasoning performance on speech inputs is markedly weaker than on text. This gap could be associated with representational drift across Transformer layers and behavior deviations in long-chain reasoning. To address this issue, we introduce TARS, a reinforcement-learning framework that aligns text-conditioned and speech-conditioned trajectories through an asymmetric reward design. The framework employs two dense and complementary signals: representation alignment, which measures layer-wise hidden-state similarity between speech- and text-conditioned trajectories, and behavior alignment, which evaluates semantic consistency between generated outputs and reference text completions. Experiments on challenging reasoning benchmarks, including MMSU and OBQA, show that our approach significantly narrows the modality reasoning gap and achieves state-of-the-art performance among 7B-scale Speech LLMs.
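The representation-alignment signal is described as layer-wise hidden-state similarity. A simple way to picture it, under the assumptions that each layer has already been pooled to one vector per input and that cosine similarity is the measure (the listing specifies neither), is to average per-layer cosine similarities:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def layerwise_alignment(speech_layers, text_layers):
    """Mean cosine similarity across layers.

    speech_layers / text_layers: one pooled hidden-state vector per layer,
    from the speech-conditioned and text-conditioned runs of the same
    prompt (the pooling step is an assumption of this sketch).
    """
    if len(speech_layers) != len(text_layers):
        raise ValueError("both runs must expose the same number of layers")
    sims = [cosine(s, t) for s, t in zip(speech_layers, text_layers)]
    return sum(sims) / len(sims)


# Toy 2-layer example: layer 0 agrees, layer 1 has drifted to be orthogonal.
speech = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [1.0, 0.0]]
score = layerwise_alignment(speech, text)
```

A score near 1 would mean the speech run tracks the text run at every layer; a reward built on such a score is dense because it is available for every sample, not only when the final answer is correct.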

[16] Can Large Language Models Differentiate Harmful from Argumentative Essays? Steps Toward Ethical Essay Scoring

Hongjin Kim, Jeonghyun Kang, Harksoo Kim

Main category: cs.CL

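TL;DR: Introduces the Harmful Essay Detection (HED) benchmark to evaluate how automated essay scoring systems and large language models handle essays containing harmful content such as racism and gender bias. Existing models generally ignore the ethical dimension of content and tend to score harmful views as strong argumentation, so greater ethical sensitivity is needed.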

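Motivation: Automated essay scoring systems and large language models often ignore ethical and moral problems in essays and may give high scores to essays containing harmful views. This is a serious risk, so models' ability to recognize harmful content needs to be systematically evaluated and improved.

Method: Builds HED, a dataset of harmful essays on sensitive topics such as racism and gender bias, and uses it to test how well various large language models distinguish harmful content from legitimate argument, analyzing the ethical gaps in their scoring behavior.

Result: (1) Current large language models struggle to distinguish harmful content from argumentative essays; (2) existing automated essay scoring models and large language models generally do not consider the ethical implications of content when scoring, so harmful essays are overrated.

Conclusion: Automated essay scoring systems need mechanisms that are sensitive to ethical content, and future work should develop scoring models with better moral judgment to keep educational assessment safe and fair.

Abstract: This study addresses critical gaps in Automated Essay Scoring (AES) systems and Large Language Models (LLMs) with regard to their ability to effectively identify and score harmful essays. Despite advancements in AES technology, current models often overlook ethically and morally problematic elements within essays, erroneously assigning high scores to essays that may propagate harmful opinions. In this study, we introduce the Harmful Essay Detection (HED) benchmark, which includes essays integrating sensitive topics such as racism and gender bias, to test the efficacy of various LLMs in recognizing and scoring harmful content. Our findings reveal that: (1) LLMs require further enhancement to accurately distinguish between harmful and argumentative essays, and (2) both current AES models and LLMs fail to consider the ethical dimensions of content during scoring. The study underscores the need for developing more robust AES systems that are sensitive to the ethical implications of the content they are scoring.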

[17] Generation-Based and Emotion-Reflected Memory Update: Creating the KEEM Dataset for Better Long-Term Conversation

Jeonghyun Kang, Hongjin Kim, Harksoo Kim

Main category: cs.CL

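TL;DR: Introduces the KEEM dataset for generation-based memory updates in long-term conversational systems, integrating facts, emotion, and causal relationships to improve understanding and empathy.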

Motivation: Existing memory update methods suffer from information conflicts and have difficulty tracking the user's current state accurately, so a more effective mechanism for dynamically integrating memory is needed.

Method: Proposes the KEEM dataset, which uses a generation-based approach to dynamically build integrated memories containing factual information, emotional context, and causal relationships.

Result: KEEM preserves essential information while incorporating emotional and causal structure, enabling more coherent and empathetic dialogue responses.

Conclusion: KEEM offers a better memory update approach for long-term conversational systems and improves how well a system understands and responds to the user's state in open-domain conversation.

Abstract: In this work, we introduce the Keep Emotional and Essential Memory (KEEM) dataset, a novel generation-based dataset designed to enhance memory updates in long-term conversational systems. Unlike existing approaches that rely on simple accumulation or operation-based methods, which often result in information conflicts and difficulties in accurately tracking a user's current state, KEEM dynamically generates integrative memories. This process not only preserves essential factual information but also incorporates emotional context and causal relationships, enabling a more nuanced understanding of user interactions. By seamlessly updating a system's memory with both emotional and essential data, our approach promotes deeper empathy and enhances the system's ability to respond meaningfully in open-domain conversations.

[18] ReasonAny: Incorporating Reasoning Capability to Any Model via Simple and Effective Model Merging

Junyao Yang, Chen Qian, Dongrui Liu, Wen Shen, Yong Liu, Jing Shao

Main category: cs.CL

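TL;DR: Proposes ReasonAny, a model merging framework that uses Contrastive Gradient Identification to prevent the collapse of reasoning ability and domain performance when merging domain-specialized models, achieving effective "Reasoning + X" capability synthesis.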

Motivation: When existing model merging methods add long chain-of-thought reasoning to domain-specialized models, they tend to weaken reasoning depth and degrade domain performance. The authors aim to resolve this collapse between reasoning and domain capability.

Method: Proposes the ReasonAny framework. Building on the counter-intuitive finding that reasoning ability mainly resides in parameter regions with low gradient sensitivity, it uses Contrastive Gradient Identification to separate and preserve reasoning-related and domain-related parameters, enabling non-destructive merging.

Result: Experiments across safety, biomedicine, and finance show that ReasonAny significantly outperforms state-of-the-art baselines, improving domain performance while retaining strong reasoning ability.

Conclusion: ReasonAny offers an efficient, training-free way to build models that combine deep reasoning with specialized domain knowledge, advancing "Reasoning + X" models.

Abstract: Large Reasoning Models (LRMs) with long chain-of-thought reasoning have recently achieved remarkable success. Yet, equipping domain-specialized models with such reasoning capabilities, referred to as "Reasoning + X", remains a significant challenge. While model merging offers a promising training-free solution, existing methods often suffer from a destructive performance collapse: they tend to both weaken reasoning depth and compromise domain-specific utility. Interestingly, we identify a counter-intuitive phenomenon underlying this failure: reasoning ability predominantly resides in parameter regions with low gradient sensitivity, contrary to the common assumption that domain capabilities correspond to high-magnitude parameters. Motivated by this insight, we propose ReasonAny, a novel merging framework that resolves the reasoning-domain performance collapse through Contrastive Gradient Identification. Experiments across safety, biomedicine, and finance domains show that ReasonAny effectively synthesizes "Reasoning + X" capabilities, significantly outperforming state-of-the-art baselines while retaining robust reasoning performance.

[19] Can large language models interpret unstructured chat data on dynamic group decision-making processes? Evidence on joint destination choice

Sung-Yoo Lim,Koki Sato,Kiyoshi Takami,Giancarlos Parady,Eui-Jin Kim

Main category: cs.CL

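TL;DR: This study explores the potential of large language models (LLMs) to automatically extract the factors behind joint eating-out decisions from Japanese group chat data. It proposes a prompting framework, inspired by the knowledge acquisition process, that converts unstructured dialogue into structured decision data. Comparison with human annotation shows that LLMs reliably capture explicit factors but remain limited in identifying implicit sociocultural factors.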

Details Motivation: Traditional travel surveys struggle to observe the complex decision-making behind group social activities. Emerging unstructured chat data offers a new way to understand these processes, but analyzing it depends on time-consuming manual annotation to capture explicit and implicit factors shaped by social and cultural norms, so automated methods are needed to improve efficiency and broaden application. Method: The study designs a step-by-step prompting framework that guides the LLM to extract, in sequence, the group-level restaurant choice set and outcome, each individual's preferences for the alternatives, and the attributes driving those preferences, converting group chat content into structured tabular data. LLM output accuracy is assessed through quantitative comparison with a human-annotated ground-truth dataset and a qualitative error analysis. Result: The LLM reliably identifies explicit decision factors (such as explicitly mentioned restaurant names and decisions) but performs poorly on implicit factors (such as unstated social norms, power relations, or cultural conventions); the error analysis reveals the model's limitations in specific contexts. Conclusion: LLMs show potential for processing non-traditional social data and can serve as an aid to speed up annotation, but human oversight is still needed where complex social context is involved; future work should combine human and machine effort for more accurate analysis of decision-making processes. Abstract: Social activities result from complex joint activity-travel decisions between group members. While observing the decision-making process of these activities is difficult via traditional travel surveys, the advent of new types of data, such as unstructured chat data, can help shed some light on these complex processes. However, interpreting these decision-making processes requires inferring both explicit and implicit factors. This typically involves the labor-intensive task of manually annotating dialogues to capture context-dependent meanings shaped by social and cultural norms. This study evaluates the potential of Large Language Models (LLMs) to automate and complement human annotation in interpreting decision-making processes from group chats, using data on joint eating-out activities in Japan as a case study. We designed a prompting framework inspired by the knowledge acquisition process, which sequentially extracts key decision-making factors, including the group-level restaurant choice set and outcome, individual preferences for each alternative, and the specific attributes driving those preferences. This structured process guides the LLM to interpret group chat data, converting unstructured dialogues into structured tabular data describing decision-making factors. To evaluate LLM-driven outputs, we conduct a quantitative analysis using a human-annotated ground truth dataset and a qualitative error analysis to examine model limitations. Results show that while the LLM reliably captures explicit decision-making factors, it struggles to identify nuanced implicit factors that human annotators readily identified. We pinpoint specific contexts when LLM-based extraction can be trusted versus when human oversight remains essential. These findings highlight both the potential and limitations of LLM-based analysis for incorporating non-traditional data sources on social activities.

[20] ACR: Adaptive Context Refactoring via Context Refactoring Operators for Multi-Turn Dialogue

Jiawei Shen,Jia Zhu,Hanghui Guo,Weijie Shi,Yue Cui,Qingyu Niu,Guoqing Ma,Yidan Liang,Jingjiang Liu,Yiling Wang,Shimin Di,Jiajie Xu

Main category: cs.CL

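TL;DR: Proposes ACR, an adaptive context refactoring framework that addresses the contextual inertia and state drift large language models exhibit in multi-turn dialogue.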

Details Motivation: Large language models struggle to stay consistent with earlier content in multi-turn dialogue and are prone to contextual inertia and state drift. Method: Designs a library of context refactoring operators together with a teacher-guided self-evolving training paradigm to dynamically monitor and reshape the interaction history. Result: Significantly outperforms existing baselines on multi-turn dialogue tasks while reducing token consumption. Conclusion: The ACR framework effectively mitigates contextual inertia and state drift and improves model performance in long conversations. Abstract: Large Language Models (LLMs) have shown remarkable performance in multi-turn dialogue. However, in multi-turn dialogue, models still struggle to stay aligned with what has been established earlier, follow dependencies across many turns, and avoid drifting into incorrect facts as the interaction grows longer. Existing approaches primarily focus on extending the context window, introducing external memory, or applying context compression, yet these methods still face limitations such as contextual inertia and state drift. To address these challenges, we propose the Adaptive Context Refactoring (ACR) Framework, which dynamically monitors and reshapes the interaction history to actively mitigate contextual inertia and state drift. ACR is built on a library of context refactoring operators and a teacher-guided self-evolving training paradigm that learns when to intervene and how to refactor, thereby decoupling context management from the reasoning process. Extensive experiments on multi-turn dialogue demonstrate that our method significantly outperforms existing baselines while reducing token consumption.

Nguyen Minh Phuong,Ha-Thanh Nguyen,May Myo Zin,Ken Satoh

Main category: cs.CL

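TL;DR: Proposes a simple, effective method that uses large language models (LLMs) for data augmentation in legal-domain information extraction, reducing manual annotation effort, improving system robustness, and generalizing well.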

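Details Motivation: Reduce the dependence of legal-domain information extraction on large amounts of manually annotated data and lower annotation cost. Method: Builds a data augmentation pipeline with large language models (LLMs) that automatically generates high-quality training data. Result: Significantly reduces manual annotation effort while improving the performance and robustness of information extraction systems. Conclusion: The method is effective in the legal domain and is also general enough to extend to other natural language processing tasks. Abstract: In this paper, we propose a pipeline leveraging Large Language Models (LLMs) for data augmentation in Information Extraction tasks within the legal domain. The proposed method is both simple and effective, significantly reducing the manual effort required for data annotation while enhancing the robustness of Information Extraction systems. Furthermore, the method is generalizable, making it applicable to various Natural Language Processing (NLP) tasks beyond the legal domain.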

[22] Text Detoxification in isiXhosa and Yorùbá: A Cross-Lingual Machine Learning Approach for Low-Resource African Languages

Abayomi O. Agbeyangi

Main category: cs.CL

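TL;DR: This study proposes a lightweight, interpretable hybrid method for text detoxification in the low-resource African languages isiXhosa and Yorùbá, combining TF-IDF with logistic regression for toxicity detection and a lexicon- and token-guided rewriting strategy to achieve efficient text style transfer.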

Details Motivation: African languages lack effective tools for automatically mitigating toxic language, and text detoxification for low-resource languages in particular remains a research gap. Method: Uses a TF-IDF and logistic regression model for toxicity detection and a controlled lexicon- and token-guided rewriting mechanism; builds a parallel corpus covering idioms, diacritics, and code switching for training and evaluation. Result: Toxicity detection reaches stratified K-fold accuracies of 61-72% on isiXhosa and 72-86% on Yorùbá, with ROC-AUC up to 0.88; the rewriting module detoxifies all detected toxic sentences while preserving 100% of non-toxic sentences. Conclusion: Combining interpretable machine learning with rule-based edits gives low-resource languages an efficient, scalable, and culturally adaptive detoxification approach and sets a new benchmark for text style transfer in African languages. Abstract: Toxic language is one of the major barriers to safe online participation, yet robust mitigation tools are scarce for African languages. This study addresses this critical gap by investigating automatic text detoxification (toxic to neutral rewriting) for two low-resource African languages, isiXhosa and Yorùbá. The work contributes a novel, pragmatic hybrid methodology: a lightweight, interpretable TF-IDF and Logistic Regression model for transparent toxicity detection, and a controlled lexicon- and token-guided rewriting component. A parallel corpus of toxic to neutral rewrites, which captures idiomatic usage, diacritics, and code switching, was developed to train and evaluate the model. The detection component achieved stratified K-fold accuracies of 61-72% (isiXhosa) and 72-86% (Yorùbá), with per-language ROC-AUCs up to 0.88. The rewriting component successfully detoxified all detected toxic sentences while preserving 100% of non-toxic sentences. These results demonstrate that scalable, interpretable machine learning detectors combined with rule-based edits offer a competitive and resource-efficient solution for culturally adaptive safety tooling, setting a new benchmark for low-resource Text Style Transfer (TST) in African languages.

[23] GIFT: Games as Informal Training for Generalizable LLMs

Nuoyan Lyu,Bingbing Xu,Weihao Meng,Yige Yuan,Yang Zhang,Zhiyong Huang,Tat-Seng Chua,Huawei Shen

Main category: cs.CL

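TL;DR: This paper proposes games as the primary environment for informal learning in large language models (LLMs). A nested training framework addresses performance degradation in multi-task learning, and the intrinsic reward signals of games are used to improve generalization and general abilities such as strategic and social reasoning.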

Details Motivation: LLMs perform well on formal tasks but still fall short on the "practical wisdom" that characterizes human cognition, such as strategic creativity and social reasoning, mainly because they lack informal learning driven by interactive feedback. Method: Proposes treating games as an informal learning environment and designs a Nested Training Framework that uses sequential task composition to enforce an explicit "AND" objective, combined with GRPO-based reinforcement learning on Matrix Games, TicTacToe, and Who's the Spy. Result: Experiments show the framework avoids multi-task interference and significantly improves generalization on ability-oriented benchmarks across tasks. Conclusion: Games can serve as an effective informal learning environment for cultivating general intelligence in LLMs, and the nested training framework offers a workable path to learning multiple abilities jointly, helping narrow the gap between LLMs and human practical wisdom. Abstract: While Large Language Models (LLMs) have achieved remarkable success in formal learning tasks such as mathematics and code generation, they still struggle with the "practical wisdom" and generalizable intelligence, such as strategic creativity and social reasoning, that characterize human cognition. This gap arises from a lack of informal learning, which thrives on interactive feedback rather than goal-oriented instruction. In this paper, we propose treating Games as a primary environment for LLM informal learning, leveraging their intrinsic reward signals and abstracted complexity to cultivate diverse competencies. To address the performance degradation observed in multi-task learning, we introduce a Nested Training Framework. Unlike naive task mixing optimizing an implicit "OR" objective, our framework employs sequential task composition to enforce an explicit "AND" objective, compelling the model to master multiple abilities simultaneously to achieve maximal rewards. Using GRPO-based reinforcement learning across Matrix Games, TicTacToe, and Who's the Spy games, we demonstrate that integrating game-based informal learning not only prevents task interference but also significantly bolsters the model's generalization across broad ability-oriented benchmarks. The framework and implementation are publicly available.

[24] Multilingual Amnesia: On the Transferability of Unlearning in Multilingual LLMs

Alireza Dehghanpour Farashah,Aditi Khandelwal,Marylou Fauchard,Zhuan Shi,Negar Rostamzadeh,Golnoosh Farnadi

Main category: cs.CL

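TL;DR: This study examines unlearning in multilingual large language models. Using the Aya-Expanse 8B model, it extends data-unlearning and concept-unlearning benchmarks to ten languages and finds that unlearning is more stable in high-resource languages and that syntactic similarity is the main predictor of cross-lingual unlearning behavior.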

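Details Motivation: Multilingual large language models face safety and fairness challenges across linguistic contexts. Existing unlearning research focuses mainly on English and does not account for multilingual complexity, especially cross-lingual knowledge transfer and the propagation of bias. Method: Runs experiments on the Aya-Expanse 8B model under two settings, data unlearning and concept unlearning. Benchmarks for factual knowledge and stereotypes are extended by translation to ten languages representing different language families and resource levels, and the effect of linguistic distance (for example syntactic and semantic) on unlearning is analyzed. Result: Unlearning is more stable in high-resource languages; cross-lingual transfer effects are asymmetric; syntactic similarity is the strongest predictor of cross-lingual unlearning behavior. Conclusion: Multilingual unlearning must account for structural similarity between languages, especially syntactic proximity. Future work should design more robust unlearning strategies for low-resource languages to improve model safety and fairness. Abstract: As multilingual large language models become more widely used, ensuring their safety and fairness across diverse linguistic contexts presents unique challenges. While existing research on machine unlearning has primarily focused on monolingual settings, typically English, multilingual environments introduce additional complexities due to cross-lingual knowledge transfer and biases embedded in both pretraining and fine-tuning data. In this work, we study multilingual unlearning using the Aya-Expanse 8B model under two settings: (1) data unlearning and (2) concept unlearning. We extend benchmarks for factual knowledge and stereotypes to ten languages through translation: English, French, Arabic, Japanese, Russian, Farsi, Korean, Hindi, Hebrew, and Indonesian. These languages span five language families and a wide range of resource levels. Our experiments show that unlearning in high-resource languages is generally more stable, with asymmetric transfer effects observed between typologically related languages. Furthermore, our analysis of linguistic distances indicates that syntactic similarity is the strongest predictor of cross-lingual unlearning behavior.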

[25] A Framework for Personalized Persuasiveness Prediction via Context-Aware User Profiling

Sejun Park,Yoonah Park,Jongwon Lim,Yohan Jo

Main category: cs.CL

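TL;DR: Proposes a context-aware user profiling framework that improves persuasiveness prediction models by generating optimal queries and summarizing a user's history.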

Details Motivation: Existing methods do not systematically use a persuadee's past activity to improve persuasiveness prediction; this paper aims to fill that gap. Method: Designs a trainable framework with a query generator and a profiler. The query generator retrieves persuasion-relevant records from the user's history, and the profiler summarizes those records into a personalized profile that informs the prediction model. Result: Experiments on the ChangeMyView dataset show the method outperforms existing approaches across multiple predictor models, with F1 gains of up to 13.77 percentage points; analysis also shows that effective user profiles depend on context and on the predictor. Conclusion: Task-oriented, context-dependent user profiling is essential for personalized persuasiveness prediction. Abstract: Estimating the persuasiveness of messages is critical in various applications, from recommender systems to safety assessment of LLMs. While it is imperative to consider the target persuadee's characteristics, such as their values, experiences, and reasoning styles, there is currently no established systematic framework to optimize leveraging a persuadee's past activities (e.g., conversations) to the benefit of a persuasiveness prediction model. To address this problem, we propose a context-aware user profiling framework with two trainable components: a query generator that generates optimal queries to retrieve persuasion-relevant records from a user's history, and a profiler that summarizes these records into a profile to effectively inform the persuasiveness prediction model. Our evaluation on the ChangeMyView Reddit dataset shows consistent improvements over existing methods across multiple predictor models, with gains of up to +13.77%p in F1 score. Further analysis shows that effective user profiles are context-dependent and predictor-specific, rather than relying on static attributes or surface-level similarity. Together, these results highlight the importance of task-oriented, context-dependent user profiling for personalized persuasiveness prediction.

[26] Stephanie2: Thinking, Waiting, and Making Decisions Like Humans in Step-by-Step AI Social Chat

Hao Yang,Hongyuan Lu,Dingkang Yang,Wenliang Yang,Peng Sun,Xiaochuan Zhang,Jun Xiao,Kefan He,Wai Lam,Yang Liu,Xinhua Zeng

Main category: cs.CL

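TL;DR: Stephanie2 is a new step-wise decision-making dialogue agent that uses active waiting and message-pace adaptation to make AI chat in instant messaging noticeably more natural and engaging.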

Details Motivation: Existing AI chat systems lack an active waiting mechanism and pace their messages unnaturally, so they fail to reproduce the feel of human social chat. Method: Proposes Stephanie2, which introduces an active waiting decision mechanism, models latency as the sum of thinking time and typing time, and uses a time-window-based dual-agent system to generate pseudo dialogue histories for evaluation. Result: Experiments show that Stephanie2 clearly outperforms Stephanie1 on metrics such as naturalness and engagement and achieves a higher pass rate on the role identification Turing test. Conclusion: Through fine-grained step-wise control and pacing modeling, Stephanie2 produces more human-like conversational behavior and suggests a new direction for the design of AI dialogue systems. Abstract: Instant-messaging human social chat typically progresses through a sequence of short messages. Existing step-by-step AI chatting systems typically split a one-shot generation into multiple messages and send them sequentially, but they lack an active waiting mechanism and exhibit unnatural message pacing. In order to address these issues, we propose Stephanie2, a novel next-generation step-wise decision-making dialogue agent. With active waiting and message-pace adaptation, Stephanie2 explicitly decides at each step whether to send or wait, and models latency as the sum of thinking time and typing time to achieve more natural pacing. We further introduce a time-window-based dual-agent dialogue system to generate pseudo dialogue histories for human and automatic evaluations. Experiments show that Stephanie2 clearly outperforms Stephanie1 on metrics such as naturalness and engagement, and achieves a higher pass rate on human evaluation with the role identification Turing test.
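
The latency model mentioned in the abstract (delay as the sum of thinking time and typing time) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `message_latency`, the fixed `thinking_seconds` input, and the constant typing speed `chars_per_second` are all assumptions introduced here.

```python
def message_latency(text: str, thinking_seconds: float, chars_per_second: float) -> float:
    """Return a send delay in seconds for one chat message.

    Hypothetical sketch: latency = thinking time + typing time, where typing
    time is approximated as message length divided by a constant typing speed.
    The abstract does not specify how either term is estimated.
    """
    if thinking_seconds < 0:
        raise ValueError("thinking_seconds must be non-negative")
    if chars_per_second <= 0:
        raise ValueError("chars_per_second must be positive")
    typing_seconds = len(text) / chars_per_second
    return thinking_seconds + typing_seconds


# Example: a 10-character message, 1.5 s of thinking, typed at 5 characters per second.
delay = message_latency("see you :)", thinking_seconds=1.5, chars_per_second=5.0)
```

A real agent would presumably estimate both terms from context (message difficulty, user pace) instead of taking them as constants; the sketch only shows the additive structure.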

[27] Afri-MCQA: Multimodal Cultural Question Answering for African Languages

Atnafu Lambebo Tonja,Srija Anand,Emilio Villa-Cueva,Israel Abebe Azime,Jesujoba Oluwadara Alabi,Muhidin A. Mohamed,Debela Desalegn Yadeta,Negasi Haile Abadi,Abigail Oppong,Nnaemeka Casmir Obiefuna,Idris Abdulmumin,Naome A Etori,Eric Peter Wairagala,Kanda Patrick Tshinu,Imanigirimbabazi Emmanuel,Gabofetswe Malema,Alham Fikri Aji,David Ifeoluwa Adelani,Thamar Solorio

Main category: cs.CL

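TL;DR: This paper introduces Afri-MCQA, the first multilingual cultural question-answering benchmark covering 15 African languages and 7.5k Q&A pairs, aiming to improve the representation of African languages in AI research.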

Details Motivation: African languages are severely neglected in AI research, and there is no benchmark for evaluating multilingual cultural and cross-modal understanding. Method: Builds a parallel English-African language Q&A dataset spanning text and speech modalities, created by native speakers, benchmarks large language models on it, and designs control experiments to separate linguistic competence from cultural knowledge. Result: Open-weight models perform very poorly on African-language and speech tasks, with near-zero accuracy on open-ended visual question answering in particular; there is a significant performance gap between native languages and English. Conclusion: The results point to the need for speech-first approaches, culturally grounded pretraining, and cross-lingual cultural transfer to support more inclusive multimodal AI. Abstract: Africa is home to over one-third of the world's languages, yet remains underrepresented in AI research. We introduce Afri-MCQA, the first Multilingual Cultural Question-Answering benchmark covering 7.5k Q&A pairs across 15 African languages from 12 countries. The benchmark offers parallel English-African language Q&A pairs across text and speech modalities and was entirely created by native speakers. Benchmarking large language models (LLMs) on Afri-MCQA shows that open-weight models perform poorly across evaluated cultures, with near-zero accuracy on open-ended VQA when queried in native language or speech. To evaluate linguistic competence, we include control experiments meant to assess this specific aspect separate from cultural knowledge, and we observe significant performance gaps between native languages and English for both text and speech. These findings underscore the need for speech-first approaches, culturally grounded pretraining, and cross-lingual cultural transfer. To support more inclusive multimodal AI development in African languages, we release our Afri-MCQA under academic license or CC BY-NC 4.0 on HuggingFace (https://huggingface.co/datasets/Atnafu/Afri-MCQA).

[28] Multimodal In-context Learning for ASR of Low-resource Languages

Zhaolin Li,Jan Niehues

Main category: cs.CL

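TL;DR: This paper studies how well multimodal in-context learning (MICL) works in speech LLMs for unseen low-resource languages and proposes an ASR system that combines a stronger acoustic model with a speech LLM, using MICL to improve recognition performance.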

Details Motivation: Because labeled data is scarce, current automatic speech recognition systems cover few languages, and existing in-context learning work focuses mainly on high-resource languages and text-only settings, leaving low-resource languages and multimodal setups understudied. Method: Runs experiments with two speech LLMs, Phi-4 and Qwen3-Omni, on three endangered languages to study the effect of multimodal in-context learning (MICL) and how cross-lingual transfer affects MICL efficiency; analyzes attention patterns to interpret how MICL works; and proposes an ASR system that uses MICL to select among acoustic hypotheses. Result: MICL effectively supports multimodal learning of unseen languages, and cross-lingual transfer significantly improves it; attention analysis shows layer-dependent preferences between audio and text context with an overall bias toward text; prompt-based ASR with speech LLMs performs poorly on unseen languages, but ASR performance improves significantly when MICL is combined with a strong acoustic model. Conclusion: MICL effectively helps speech recognition for low-resource languages, and cross-lingual transfer can substitute for target-language training data, offering a viable new path for low-resource ASR. Abstract: Automatic speech recognition (ASR) still covers only a small fraction of the world's languages, mainly due to supervised data scarcity. In-context learning (ICL) with large language models (LLMs) addresses this problem, but prior work largely focuses on high-resource languages covered during training and text-only settings. This paper investigates whether speech LLMs can learn unseen languages with multimodal ICL (MICL), and how this learning can be used to improve ASR. We conduct experiments with two speech LLMs, Phi-4 and Qwen3-Omni, on three diverse endangered languages. Firstly, we find that MICL is effective for unseen languages, leveraging both speech and text modalities. We further show that cross-lingual transfer learning improves MICL efficiency on target languages without training on them. Moreover, we analyze attention patterns to interpret MICL mechanisms, and we observe layer-dependent preferences between audio and text context, with an overall bias towards text. Finally, we show that prompt-based ASR with speech LLMs performs poorly on unseen languages, motivating a simple ASR system that combines a stronger acoustic model with a speech LLM via MICL-based selection of acoustic hypotheses. Results show that MICL consistently improves ASR performance, and that cross-lingual transfer learning matches or outperforms corpus-trained language models without using target-language data. Our code is publicly available.

[29] Visualising Information Flow in Word Embeddings with Diffusion Tensor Imaging

Thomas Fabian

Main category: cs.CL

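TL;DR: Proposes a new method based on diffusion tensor imaging (DTI) for analyzing and visualizing information flow in natural language expressions, revealing context-dependent representation mechanisms in large language models.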

Details Motivation: Existing methods look only at isolated word embeddings and ignore how words are used in context, which makes it hard to fully understand how large language models represent natural language expressions. Method: Applies diffusion tensor imaging (DTI) to word embeddings, tracks information flow through the layers of a large language model, and visualizes information paths across a whole natural language expression. Result: DTI reveals patterns of information flow between word embeddings, distinguishes information flow across tasks (such as pronoun resolution and metaphor detection), and can be used to compare model structures and identify under-utilized layers that could be pruned. Conclusion: The method extends analysis beyond isolated word embeddings and improves the interpretability of large language models on actual natural language expressions. Abstract: Understanding how large language models (LLMs) represent natural language is a central challenge in natural language processing (NLP) research. Many existing methods extract word embeddings from an LLM, visualise the embedding space via point-plots, and compare the relative positions of certain words. However, this approach only considers single words and not whole natural language expressions, and thus disregards the context in which a word is used. Here we present a novel tool for analysing and visualising information flow in natural language expressions by applying diffusion tensor imaging (DTI) to word embeddings. We find that DTI reveals how information flows between word embeddings. Tracking information flows within the layers of an LLM allows for comparing different model structures and revealing opportunities for pruning an LLM's under-utilised layers. Furthermore, our model reveals differences in information flows for tasks like pronoun resolution and metaphor detection. Our results show that our model permits novel insights into how LLMs represent actual natural language expressions, extending the comparison of isolated word embeddings and improving the interpretability of NLP models.

[30] Analysing Differences in Persuasive Language in LLM-Generated Text: Uncovering Stereotypical Gender Patterns

Amalie Brogaard Pauli,Maria Barrett,Max Müller-Eberstein,Isabelle Augenstein,Ira Assent

Main category: cs.CL

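TL;DR: This study proposes an evaluation framework for analyzing how recipient gender, sender intent, and output language affect the persuasive language that large language models generate, and finds that the models broadly exhibit language bias consistent with gender stereotypes.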

Details Motivation: Understand how user instructions affect the persuasive language generated by large language models, in particular whether bias appears when targeting different groups (such as genders). Method: Evaluates 13 large language models and 16 languages with pairwise prompt instructions, analyzing 19 categories of persuasive language using an LLM-as-judge setup grounded in social psychology and communication science. Result: All models show significant gender differences in the persuasive language they generate, and these differences match gender stereotypes documented in social psychology and linguistics. Conclusion: Large language models systematically produce gender-biased persuasive language in interpersonal communication tasks, which warrants attention and correction. Abstract: Large language models (LLMs) are increasingly used for everyday communication tasks, including drafting interpersonal messages intended to influence and persuade. Prior work has shown that LLMs can successfully persuade humans and amplify persuasive language. It is therefore essential to understand how user instructions affect the generation of persuasive language, and to understand whether the generated persuasive language differs, for example, when targeting different groups. In this work, we propose a framework for evaluating how persuasive language generation is affected by recipient gender, sender intent, or output language. We evaluate 13 LLMs and 16 languages using pairwise prompt instructions. We evaluate model responses on 19 categories of persuasive language using an LLM-as-judge setup grounded in social psychology and communication science. Our results reveal significant gender differences in the persuasive language generated across all models. These patterns reflect biases consistent with gender-stereotypical linguistic tendencies documented in social psychology and sociolinguistics.

[31] AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor

Shu Yang,Jingyu Hu,Tong Li,Hanqi Yan,Wenxuan Wang,Di Wang

Main category: cs.CL

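TL;DR: AutoMonitor-Bench is the first benchmark to systematically evaluate the reliability of LLM-based misbehavior monitors across diverse tasks and failure modes. It reveals a trade-off between miss rate and false alarm rate and uses large-scale training to explore how far monitoring performance can be improved.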

Details Motivation: Existing LLM-based misbehavior monitors lack systematic evaluation, making it hard to measure their reliability across tasks and on implicit misbehavior, so a standardized benchmark is needed to advance the field. Method: Builds AutoMonitor-Bench with 3,010 annotated samples covering question answering, code generation, and reasoning, pairing misbehavior with benign instances; uses Miss Rate (MR) and False Alarm Rate (FAR) as evaluation metrics; fine-tunes Qwen3-4B-Instruction on a 153,581-sample dataset to study the effect of training strategy. Result: 12 proprietary and 10 open-source LLMs show substantial variation in monitoring performance and a consistent trade-off between MR and FAR; models trained on known, simple misbehavior detect unseen or more implicit misbehavior only to a limited extent. Conclusion: Reliable, scalable misbehavior monitoring remains challenging, and future work should develop task-aware monitor design and training strategies. Abstract: We introduce AutoMonitor-Bench, the first benchmark designed to systematically evaluate the reliability of LLM-based misbehavior monitors across diverse tasks and failure modes. AutoMonitor-Bench consists of 3,010 carefully annotated test samples spanning question answering, code generation, and reasoning, with paired misbehavior and benign instances. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. Evaluating 12 proprietary and 10 open-source LLMs, we observe substantial variability in monitoring performance and a consistent trade-off between MR and FAR, revealing an inherent safety-utility tension. To further explore the limits of monitor reliability, we construct a large-scale training corpus of 153,581 samples and fine-tune Qwen3-4B-Instruction to investigate whether training on known, relatively easy-to-construct misbehavior datasets improves monitoring performance on unseen and more implicit misbehaviors. Our results highlight the challenges of reliable, scalable misbehavior monitoring and motivate future work on task-aware design and training strategies for LLM-based monitors.
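
The two metrics named above can be illustrated with a short Python sketch. The abstract describes Miss Rate (MR) as capturing failures to detect misbehavior and False Alarm Rate (FAR) as capturing oversensitivity to benign behavior; the exact formulas below (missed misbehavior over all misbehavior, flagged benign over all benign) are the conventional reading and an assumption here, as is the function name `miss_and_false_alarm_rates`.

```python
from typing import Sequence, Tuple


def miss_and_false_alarm_rates(
    is_misbehavior: Sequence[bool], flagged: Sequence[bool]
) -> Tuple[float, float]:
    """Compute (MR, FAR) for a monitor's decisions.

    is_misbehavior[i] is the ground-truth label of sample i;
    flagged[i] is True when the monitor raised an alarm on sample i.
    Assumed definitions (not confirmed by the abstract):
      MR  = missed misbehavior samples / all misbehavior samples
      FAR = flagged benign samples     / all benign samples
    """
    if len(is_misbehavior) != len(flagged):
        raise ValueError("labels and monitor decisions must have equal length")
    n_bad = sum(1 for y in is_misbehavior if y)
    n_benign = len(is_misbehavior) - n_bad
    missed = sum(1 for y, f in zip(is_misbehavior, flagged) if y and not f)
    false_alarms = sum(1 for y, f in zip(is_misbehavior, flagged) if not y and f)
    mr = missed / n_bad if n_bad else 0.0
    far = false_alarms / n_benign if n_benign else 0.0
    return mr, far


# Toy example: two misbehavior samples (one missed), two benign (one flagged).
mr, far = miss_and_false_alarm_rates(
    [True, True, False, False], [True, False, True, False]
)
```

Under these definitions a monitor that flags everything has MR = 0 and FAR = 1, and one that flags nothing has MR = 1 and FAR = 0, which is one way to see the trade-off the abstract reports.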

[32] One Script Instead of Hundreds? On Pretraining Romanized Encoder Language Models

Benedikt Ebing,Lennart Keller,Goran Glavaš

Main category: cs.CL

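TL;DR: This paper studies the effect of using romanized text when pretraining multilingual models. Performance loss is negligible for languages with segmental scripts, while languages with morphosyllabic scripts (such as Chinese and Japanese) suffer degradation that higher-fidelity romanization mitigates but cannot fully recover. Romanization also improves encoding efficiency for segmental scripts at low cost.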

Details Motivation: Examine whether romanization is a good representation choice for pretraining general-purpose multilingual models, and in particular whether the associated information loss hurts performance for high-resource languages. Method: Pretrains encoder language models from scratch on romanized and original text for six typologically diverse high-resource languages, using two romanizers with different fidelity profiles to investigate potential sources of degradation. Result: Performance loss is negligible for languages with segmental scripts, whereas languages with morphosyllabic scripts (such as Chinese and Japanese) suffer degradation that higher-fidelity romanization mitigates but cannot fully recover. There is no evidence that increased subword overlap induces negative interference. Romanization also improves encoding efficiency for segmental scripts. Conclusion: Romanization improves encoding efficiency for segmental-script languages at low cost, but morphosyllabic-script languages remain a challenge and need further research to reduce the performance loss. Abstract: Exposing latent lexical overlap, script romanization has emerged as an effective strategy for improving cross-lingual transfer (XLT) in multilingual language models (mLMs). Most prior work, however, focused on setups that favor romanization the most: (1) transfer from high-resource Latin-script to low-resource non-Latin-script languages and/or (2) between genealogically closely related languages with different scripts. It thus remains unclear whether romanization is a good representation choice for pretraining general-purpose mLMs, or, more precisely, if information loss associated with romanization harms performance for high-resource languages. We address this gap by pretraining encoder LMs from scratch on both romanized and original texts for six typologically diverse high-resource languages, investigating two potential sources of degradation: (i) loss of script-specific information and (ii) negative cross-lingual interference from increased vocabulary overlap. Using two romanizers with different fidelity profiles, we observe negligible performance loss for languages with segmental scripts, whereas languages with morphosyllabic scripts (Chinese and Japanese) suffer degradation that higher-fidelity romanization mitigates but cannot fully recover. Importantly, comparing monolingual LMs with their mLM counterpart, we find no evidence that increased subword overlap induces negative interference. We further show that romanization improves encoding efficiency (i.e., fertility) for segmental scripts at a negligible performance cost.
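
The abstract equates encoding efficiency with fertility. A common reading of tokenizer fertility is the average number of subword tokens produced per word, and the sketch below computes it under that assumption. The names `fertility` and `chunk3_tokenizer` are hypothetical, and the toy tokenizer (fixed three-character chunks) merely stands in for a real subword tokenizer; the paper's exact measurement setup is not given in the abstract.

```python
from typing import Callable, Iterable, List


def fertility(words: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens per word (lower means more efficient).

    Assumed definition: total tokens over all words divided by word count.
    """
    word_list = list(words)
    if not word_list:
        raise ValueError("need at least one word")
    total_tokens = sum(len(tokenize(word)) for word in word_list)
    return total_tokens / len(word_list)


def chunk3_tokenizer(word: str) -> List[str]:
    """Toy stand-in tokenizer: split a word into chunks of up to 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


# "romanization" -> 4 chunks, "is" -> 1, "cheap" -> 2, so fertility = 7 / 3.
score = fertility(["romanization", "is", "cheap"], chunk3_tokenizer)
```

Comparing this score for the same corpus in its original script and in romanized form, with each model's own tokenizer, would be one way to reproduce the kind of efficiency comparison the abstract describes.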

[33] Simplify-This: A Comparative Analysis of Prompt-Based and Fine-Tuned LLMs

Eilam Cohen,Itamar Bul,Danielle Inbar,Omri Loewenbach

Main category: cs.CL

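TL;DR: This paper compares fine-tuning and prompt engineering for text simplification with encoder-decoder large language models. Fine-tuned models are better at structural simplification, prompting scores higher on semantic similarity but tends to copy the input, and human evaluation prefers fine-tuned outputs overall.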

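Details Motivation: Explore the trade-off between fine-tuning and prompt engineering for text simplification with large language models, filling the gap in systematic comparison of the two across evaluation dimensions. Method: Applies both fine-tuning and prompt engineering to encoder-decoder large language models on multiple benchmark datasets for text simplification and compares them using a range of automatic and human evaluation metrics. Result: Fine-tuned models perform better on structural simplification; prompting scores higher on semantic similarity but tends to copy input content; human evaluation favors fine-tuned outputs. Conclusion: For text simplification, prompting preserves higher semantic similarity, but fine-tuning is stronger on overall quality and structural simplification, and the results support preferring fine-tuning when resources allow. Abstract: Large language models (LLMs) enable strong text generation, and in general there is a practical tradeoff between fine-tuning and prompt engineering. We introduce Simplify-This, a comparative study evaluating both paradigms for text simplification with encoder-decoder LLMs across multiple benchmarks, using a range of evaluation metrics. Fine-tuned models consistently deliver stronger structural simplification, whereas prompting often attains higher semantic similarity scores yet tends to copy inputs. A human evaluation favors fine-tuned outputs overall. We release code, a cleaned derivative dataset used in our study, checkpoints of fine-tuned models, and prompt templates to facilitate reproducibility and future work.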

[34] EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis

Xiaoshuai Song,Haofei Chang,Guanting Dong,Yutao Zhu,Zhicheng Dou,Ji-Rong Wen

Main category: cs.CL

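TL;DR: This paper proposes EnvScaler, a framework that automatically generates scalable tool-interaction environments through programmatic synthesis to improve the ability of large language models to solve tasks in complex environments.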

Details Motivation: Training large language models as agents for real environments is held back by a lack of rich, varied tool-interaction sandboxes; among existing options, simulated environments are prone to hallucination and manually built ones are hard to scale. Method: EnvScaler has two components: SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation; ScenGenerator generates multiple task scenarios and rule-based trajectory validation functions for each environment. Result: EnvScaler is used to synthesize 191 environments and about 7,000 scenarios, which are applied to supervised fine-tuning and reinforcement learning for Qwen3 series models; results show significant gains in complex environments involving multi-turn, multi-tool interaction. Conclusion: EnvScaler effectively automates the construction of high-quality, scalable tool-interaction environments and significantly strengthens the agentic abilities of large language models in complex real-world environments. Abstract: Large language models (LLMs) are expected to be trained to act as agents in various real-world environments, but this process relies on rich and varied tool-interaction sandboxes. However, access to real systems is often restricted; LLM-simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scalable tool-interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule-based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs' ability to solve tasks in complex environments involving multi-turn, multi-tool interactions. We release our code and data at https://github.com/RUC-NLPIR/EnvScaler.

[35] LLMs as Science Journalists: Supporting Early-stage Researchers in Communicating Their Science to the Public

Milad Alshomary,Grace Li,Anubhav Jangra,Yufang Hou,Kathleen McKeown,Smaranda Muresan

Main category: cs.CL

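TL;DR: Proposes a framework for training LLMs to act as science journalists, helping early-stage researchers communicate their work to the public more effectively.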

Details Motivation: Existing large models are not well suited to helping researchers communicate scientific findings to the public and pay too little attention to questions such as societal impact. Method: Designs a training framework that lets an LLM emulate the role of a science journalist and evaluates it in conversations with both simulated and human researchers. Result: Experiments show that LLM journalists trained with the framework ask more relevant questions and prompt researchers to articulate the societal significance of their work; in the user study, most participants preferred this model. Conclusion: The framework effectively improves the ability of LLMs to assist with public science communication and helps early-stage researchers build communication skills. Abstract: The scientific community needs tools that help early-stage researchers effectively communicate their findings and innovations to the public. Although existing general-purpose Large Language Models (LLMs) can assist in this endeavor, they are not optimally aligned for it. To address this, we propose a framework for training LLMs to emulate the role of a science journalist that can be used by early-stage researchers to learn how to properly communicate their papers to the general public. We evaluate the usefulness of our trained LLM Journalists in leading conversations with both simulated and human researchers. Our experiments indicate that LLMs trained using our framework ask more relevant questions that address the societal impact of research, prompting researchers to clarify and elaborate on their findings. In the user study, the majority of participants who interacted with our trained LLM Journalist appreciated it more than interacting with general-purpose LLMs.

[36] Peek2: A Regex-free implementation of pretokenizers for Byte-level BPE

Liu Zai

Main category: cs.CL

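TL;DR: Peek2 is a new pretokenization algorithm that serves as a drop-in replacement for the cl100k-like pretokenizers used in GPT-3, LLaMa-3, and Qwen-2.5, with better performance and safety.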

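Details Motivation: Traditional regex-based pretokenization is a bottleneck for both performance and safety, so a more efficient and safer solution is needed. Method: Proposes Peek2, a new algorithm that uses no regular expressions, runs on the CPU, and has stable linear complexity O(n). Result: Peek2 delivers a 1.11x improvement in overall throughput across the entire byte-level BPE encoding process and produces presegmentation results identical to those of the original regex-based pretokenizer. Conclusion: Peek2 is an efficient, safe pretokenization algorithm that improves processing speed while matching existing results, making it suitable as a drop-in replacement for the pretokenization step of existing models. Abstract: Pretokenization is a crucial, sequential pass in Byte-level BPE tokenizers. Our proposed new implementation, Peek2, serves as a drop-in replacement for cl100k-like pretokenizers used in GPT-3, LLaMa-3, and Qwen-2.5. Designed with performance and safety in mind, Peek2 is Regex-free and delivers a $1.11\times$ improvement in overall throughput across the entire Byte-level BPE encoding process. This algorithm runs entirely on the CPU, has stable linear complexity $O(n)$, and provides presegmentation results identical to those of the original Regex-based pretokenizer.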

[37] Left, Right, or Center? Evaluating LLM Framing in News Classification and Generation

Molly Kennedy,Ali Parker,Yihong Liu,Hinrich Schütze

Main category: cs.CL

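TL;DR: The study finds that nine state-of-the-art large language models show a pervasive tilt toward centrist ideology when generating text, termed "center-collapse". Grok 4 is the most ideologically expressive generator, while Claude Sonnet 4.5 and Llama 3.1 achieve the strongest bias-rating performance among commercial and open-weight models, respectively.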

Details Motivation: As LLM-based summarization and text generation are used more widely in news reporting, there is concern that subtle wording choices could shape public understanding of political issues, so this study examines whether these models exhibit political framing bias. Method: The researchers first compare few-shot ideology predictions against LEFT/CENTER/RIGHT labels, then generate "steered" summaries under FAITHFUL, CENTRIST, LEFT, and RIGHT prompts and score all outputs with a single fixed ideology evaluator. Result: Pervasive ideological center-collapse appears in both article-level ratings and generated text, indicating a systematic tendency toward centrist framing. Models differ: Grok 4 is the most ideologically expressive generator, while Claude Sonnet 4.5 and Llama 3.1 achieve the best bias-rating performance among commercial and open-weight models, respectively. Conclusion: Although LLMs are widely used for text generation, they show a clear centrist tendency when handling political content, which may limit the expression of diverse viewpoints and suggests that developers should pay closer attention to fairness and diversity in algorithm design. Abstract: Large Language Model (LLM) based summarization and text generation are increasingly used for producing and rewriting text, raising concerns about political framing in journalism where subtle wording choices can shape interpretation. Across nine state-of-the-art LLMs, we study political framing by testing whether LLMs' classification-based bias signals align with framing behavior in their generated summaries. We first compare few-shot ideology predictions against LEFT/CENTER/RIGHT labels. We then generate "steered" summaries under FAITHFUL, CENTRIST, LEFT, and RIGHT prompts, and score all outputs using a single fixed ideology evaluator. We find pervasive ideological center-collapse in both article-level ratings and generated text, indicating a systematic tendency toward centrist framing. Among evaluated models, Grok 4 is by far the most ideologically expressive generator, while Claude Sonnet 4.5 and Llama 3.1 achieve the strongest bias-rating performance among commercial and open-weight models, respectively.

[38] Semantic NLP Pipelines for Interoperable Patient Digital Twins from Unstructured EHRs

Rafael Brens,Yuqiao Meng,Luoxi Tang,Zhaohan Xi

Main category: cs.CL

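TL;DR: Proposes a semantic NLP pipeline that converts unstructured electronic health records into FHIR-compliant patient digital twin representations, improving interoperability and schema completeness.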

Details Motivation: Generating interoperable patient digital twins from unstructured electronic health records is difficult because clinical documentation varies widely and standardized mappings are lacking. Method: Named entity recognition extracts clinical concepts, concept normalization maps them to SNOMED-CT or ICD-10, and relation extraction builds structured associations between conditions, medications, and observations; the output is a FHIR-compliant digital twin representation. Result: Evaluated on the MIMIC-IV Clinical Database Demo, the pipeline reaches high F1-scores for entity and relation extraction and improves schema completeness and interoperability over baseline methods. Conclusion: The semantic NLP-driven pipeline effectively turns free-text EHR notes into standardized, interoperable digital twin representations, supporting personalized medicine and clinical decision support. Abstract: Digital twins -- virtual replicas of physical entities -- are gaining traction in healthcare for personalized monitoring, predictive modeling, and clinical decision support. However, generating interoperable patient digital twins from unstructured electronic health records (EHRs) remains challenging due to variability in clinical documentation and lack of standardized mappings. This paper presents a semantic NLP-driven pipeline that transforms free-text EHR notes into FHIR-compliant digital twin representations. The pipeline leverages named entity recognition (NER) to extract clinical concepts, concept normalization to map entities to SNOMED-CT or ICD-10, and relation extraction to capture structured associations between conditions, medications, and observations. Evaluation on MIMIC-IV Clinical Database Demo with validation against MIMIC-IV-on-FHIR reference mappings demonstrates high F1-scores for entity and relation extraction, with improved schema completeness and interoperability compared to baseline methods.
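To make the extract, normalize, and emit-FHIR sequence in the abstract concrete, here is a minimal sketch. It is not the authors' pipeline: the keyword matcher stands in for a real NER model, `SNOMED_LOOKUP` is a hypothetical two-entry table standing in for a concept normalizer (the codes are illustrative and should be checked against a terminology server), and relation extraction is omitted.

```python
# Hypothetical lookup table standing in for a real concept normalizer.
# Codes are illustrative; verify them against a SNOMED CT terminology server.
SNOMED_LOOKUP = {
    "type 2 diabetes": ("44054006", "Diabetes mellitus type 2"),
    "hypertension": ("38341003", "Hypertensive disorder"),
}


def extract_conditions(note: str) -> list[str]:
    """Toy NER step: return lookup terms that occur in the note (case-insensitive)."""
    lowered = note.lower()
    return [term for term in SNOMED_LOOKUP if term in lowered]


def to_fhir_condition(term: str, patient_id: str) -> dict:
    """Normalize one extracted term and wrap it as a FHIR Condition resource."""
    code, display = SNOMED_LOOKUP[term]
    return {
        "resourceType": "Condition",
        "subject": {"reference": f"Patient/{patient_id}"},
        "code": {
            "coding": [
                {"system": "http://snomed.info/sct", "code": code, "display": display}
            ],
            "text": term,
        },
    }


def note_to_bundle(note: str, patient_id: str) -> dict:
    """Run extraction and normalization, then collect the resources in a Bundle."""
    entries = [
        {"resource": to_fhir_condition(term, patient_id)}
        for term in extract_conditions(note)
    ]
    return {"resourceType": "Bundle", "type": "collection", "entry": entries}
```

Calling `note_to_bundle("Pt has hypertension and Type 2 diabetes.", "123")` yields a Bundle with two Condition entries, each carrying a SNOMED CT coding and a reference to `Patient/123`.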

[39] Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs

Sandeep Mishra,Devichand Budagam,Anubhab Mandal,Bishal Santra,Pawan Goyal,Manish Gupta

Main category: cs.CL

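TL;DR: Introduces the Multimodal Auto-Completion (MAC) task, which uses visual and textual context to predict upcoming characters in live chats more accurately than text-only methods, and presents the Router-Suggest framework to balance efficiency and quality.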

Details Motivation: In settings with shared visual context (such as digital assistants and healthcare consultations), text-only auto-completion struggles to capture user intent, so multimodal methods that use visual cues are needed to improve prediction accuracy and user experience. Method: Defines the Multimodal Auto-Completion (MAC) task, builds benchmark datasets from MMDialog and ImageChat, evaluates vision-language models (VLMs) against text-only models, and designs the Router-Suggest framework to select a model dynamically, balancing accuracy and inference efficiency. Result: In a user study, VLMs significantly outperform text-only models on user satisfaction, typing effort saved, and completion quality; Router-Suggest achieves a 2.3x to 10x speedup over the best-performing VLM. Conclusion: Multimodal context is essential for auto-completion and enables smarter interaction that better matches user needs; future assistants should incorporate multimodal understanding. Abstract: Real-time multimodal auto-completion is essential for digital assistants, chatbots, design tools, and healthcare consultations, where user inputs rely on shared visual context. We introduce Multimodal Auto-Completion (MAC), a task that predicts upcoming characters in live chats using partially typed text and visual cues. Unlike traditional text-only auto-completion (TAC), MAC grounds predictions in multimodal context to better capture user intent. To enable this task, we adapt MMDialog and ImageChat to create benchmark datasets. We evaluate leading vision-language models (VLMs) against strong textual baselines, highlighting trade-offs in accuracy and efficiency. We present Router-Suggest, a router framework that dynamically selects between textual models and VLMs based on dialog context, along with a lightweight variant for resource-constrained environments. Router-Suggest achieves a 2.3x to 10x speedup over the best-performing VLM. A user study shows that VLMs significantly excel over textual models on user satisfaction, notably saving user typing effort and improving the quality of completions in multi-turn conversations. These findings underscore the need for multimodal context in auto-completions, leading to smarter, user-aware assistants.
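The routing idea (send a request to the cheaper textual model unless the dialog context appears to need the image) can be sketched as below. This is only an illustration: the abstract does not say how Router-Suggest makes its decision, so the keyword heuristic, the `VISUAL_CUES` list, and the function names are all assumptions standing in for whatever router the authors use.

```python
from typing import Callable

# Hypothetical cue words suggesting the user is referring to the shared image.
VISUAL_CUES = ("this", "that", "picture", "image", "photo", "color", "looks")


def needs_visual_context(prefix: str, has_image: bool) -> bool:
    """Heuristic stand-in for a router: look for image-referring words."""
    if not has_image:
        return False
    words = (w.strip(".,?!") for w in prefix.lower().split())
    return any(w in VISUAL_CUES for w in words)


def route(
    prefix: str,
    has_image: bool,
    text_model: Callable[[str], str],
    vlm: Callable[[str], str],
) -> str:
    """Complete the prefix with the VLM only when visual grounding seems needed."""
    model = vlm if needs_visual_context(prefix, has_image) else text_model
    return model(prefix)
```

The speedup in such a design comes from how often the router can avoid the VLM, so the quality of the routing decision determines the trade-off between latency and completion quality.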

[40] CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning

Alexandra Dragomir,Florin Brad,Radu Tudor Ionescu

Main category: cs.CL

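TL;DR: Proposes CLewR, a curriculum learning strategy with restarts for improving LLMs on multilingual machine translation. Repeating the easy-to-hard curriculum several times during training mitigates catastrophic forgetting of easy examples and yields gains across several models and preference optimization methods.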

Details Motivation: Existing preference optimization methods for multilingual machine translation ignore the order in which training samples are presented, which can cause the model to forget easy examples; the paper explores the role of curriculum learning in this setting. Method: Integrates curriculum learning into state-of-the-art preference optimization algorithms and proposes a curriculum learning strategy with restarts (CLewR), which runs the easy-to-hard ordering multiple times during training. Result: CLewR is validated on several model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization techniques, consistently improving translation performance. Conclusion: Curriculum learning improves zero-shot multilingual machine translation, and the restart mechanism in CLewR effectively mitigates catastrophic forgetting, making it a general and effective training strategy. Abstract: Large language models (LLMs) have demonstrated competitive performance in zero-shot multilingual machine translation (MT). Some follow-up works further improved MT performance via preference optimization, but they leave a key aspect largely underexplored: the order in which data samples are given during training. We address this topic by integrating curriculum learning into various state-of-the-art preference optimization algorithms to boost MT performance. We introduce a novel curriculum learning strategy with restarts (CLewR), which reiterates easy-to-hard curriculum multiple times during training to effectively mitigate the catastrophic forgetting of easy examples. We demonstrate consistent gains across several model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization techniques. We publicly release our code at https://github.com/alexandra-dragomir/CLewR.
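The core scheduling idea, an easy-to-hard ordering that is replayed several times, can be sketched as follows. The difficulty function and the number of restarts are placeholders: the abstract does not say how the authors score difficulty or how restarts are spaced, so treat this as an assumption-laden illustration rather than the released implementation.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def curriculum_with_restarts(
    samples: Sequence[T],
    difficulty: Callable[[T], float],
    num_restarts: int,
) -> list[T]:
    """Return a training order that repeats the easy-to-hard pass num_restarts times.

    `difficulty` is a hypothetical scoring function (lower means easier).
    """
    ordered = sorted(samples, key=difficulty)  # one easy-to-hard pass
    schedule: list[T] = []
    for _ in range(num_restarts):
        schedule.extend(ordered)  # each restart begins again with the easy samples
    return schedule
```

Because every restart begins with the easiest samples, the model revisits them after it has seen the hard ones, which is the mechanism the abstract credits with mitigating forgetting.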

[41] What do the metrics mean? A critical analysis of the use of Automated Evaluation Metrics in Interpreting

Jonathan Downie,Joss Moorkens

Main category: cs.CL

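TL;DR: Examines whether current automated quality evaluation metrics are suitable for authentic interpreting, and argues that they are not viable on their own because they cannot account for the communicative context.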

Details Motivation: As interpreting technologies develop, fast and efficient ways to assess interpreting quality are in high demand, but it is unclear whether existing automated metrics suit authentic interpreting practice. Method: Analyzes recently proposed automated methods for measuring interpreting quality and discusses their suitability for both human and machine interpreting. Result: Existing automated metrics do not take the communicative context into account and therefore cannot measure interpreting quality effectively when used on their own. Conclusion: The context in which interpreting takes place is central to quality assessment; automated metrics must be combined with contextual factors to become useful evaluation tools. Abstract: With the growth of interpreting technologies, from remote interpreting and Computer-Aided Interpreting to automated speech translation and interpreting avatars, there is now a high demand for ways to quickly and efficiently measure the quality of any interpreting delivered. A range of approaches to fulfil the need for quick and efficient quality measurement have been proposed, each involving some measure of automation. This article examines these recently-proposed quality measurement methods and will discuss their suitability for measuring the quality of authentic interpreting practice, whether delivered by humans or machines, concluding that automatic metrics as currently proposed cannot take into account the communicative context and thus are not viable measures of the quality of any interpreting provision when used on their own. Across all attempts to measure or even categorise quality in Interpreting Studies, the contexts in which interpreting takes place have become fundamental to the final analysis.

[42] FACTUM: Mechanistic Detection of Citation Hallucination in Long-Form RAG

Maxime Dassen,Rebecca Kotula,Kenton Murray,Andrew Yates,Dawn Lawrie,Efsun Kayi,James Mayfield,Kevin Duh

Main category: cs.CL

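TL;DR: Introduces FACTUM, a framework of four mechanistic scores for analyzing the causes of citation hallucination in retrieval-augmented generation (RAG). The signature of a correct citation changes with model scale, challenging the view that hallucination stems only from over-reliance on parametric knowledge.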

Details Motivation: Existing work usually attributes citation hallucination to over-reliance on the model's parametric knowledge, but this explanation is too simple. The paper investigates the underlying internal mechanisms and how the signature of a correct citation varies with model scale. Method: Proposes FACTUM, a framework of four quantifiable scores that measure the contributions of the model's attention and FFN pathways and the alignment between them, and uses it to analyze how information flow and integration differ across model sizes. Result: Correct citations show a stronger contribution from parametric knowledge and greater use of the attention sink for information synthesis. The pattern changes with scale: the smaller model (Llama-3.2-3B) is marked by higher pathway alignment, while the larger model (Llama-3.1-8B) shows lower alignment with more orthogonal contributions. FACTUM outperforms state-of-the-art baselines by up to 37.5% in AUC. Conclusion: Citation hallucination results from a complex, scale-dependent interplay of internal mechanisms, and RAG reliability should be understood from a dynamic, fine-grained mechanistic perspective. Abstract: Retrieval-Augmented Generation (RAG) models are critically undermined by citation hallucinations, a deceptive failure where a model confidently cites a source that fails to support its claim. Existing work often attributes hallucination to a simple over-reliance on the model's parametric knowledge. We challenge this view and introduce FACTUM (Framework for Attesting Citation Trustworthiness via Underlying Mechanisms), a framework of four mechanistic scores measuring the distinct contributions of a model's attention and FFN pathways, and the alignment between them. Our analysis reveals two consistent signatures of correct citation: a significantly stronger contribution from the model's parametric knowledge and greater use of the attention sink for information synthesis. Crucially, we find the signature of a correct citation is not static but evolves with model scale. For example, the signature of a correct citation for the Llama-3.2-3B model is marked by higher pathway alignment, whereas for the Llama-3.1-8B model, it is characterized by lower alignment, where pathways contribute more distinct, orthogonal information. By capturing this complex, evolving signature, FACTUM outperforms state-of-the-art baselines by up to 37.5% in AUC. Our findings reframe citation hallucination as a complex, scale-dependent interplay between internal mechanisms, paving the way for more nuanced and reliable RAG systems.

[43] Continual-learning for Modelling Low-Resource Languages from Large Language Models

Santosh Srinath K,Mudit Somani,Varun Reddy Padala,Prajna Devi Upadhyay,Abhijit Das

Main category: cs.CL

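TL;DR: Proposes a continual learning strategy that combines parts-of-speech (POS)-based code-switching with a replay adapter to mitigate catastrophic forgetting when training small language models from large ones in multilingual settings.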

Details Motivation: In multilingual settings, building small language models for low-resource languages by adapting large language models suffers from catastrophic forgetting. Method: A continual learning strategy that uses parts-of-speech (POS)-based code-switching together with a replay adapter. Result: Experiments on visual question answering and language modelling indicate that the proposed method mitigates catastrophic forgetting and improves performance. Conclusion: The method reduces catastrophic forgetting when training multilingual small language models and has practical potential. Abstract: Modelling a language model for a multi-lingual scenario includes several potential challenges, among which catastrophic forgetting is the major challenge. For example, small language models (SLM) built for low-resource languages by adapting large language models (LLMs) pose the challenge of catastrophic forgetting. This work proposes to employ a continual learning strategy using parts-of-speech (POS)-based code-switching along with a replay adapter strategy to mitigate the identified gap of catastrophic forgetting while training SLM from LLM. Experiments conducted on vision language tasks such as visual question answering and language modelling task exhibits the success of the proposed architecture.

[44] iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models

Meghana Sunil,Manikandarajan Venmathimaran,Muthu Subash Kavitha

Main category: cs.CL

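TL;DR: Proposes iReasoner, a framework that lets large multimodal models improve themselves without supervision by eliciting chain-of-thought and rewarding its internal agreement.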

Details Motivation: Existing self-evolving frameworks mainly reward final outcomes and leave the intermediate reasoning weakly constrained, even though it matters for visually grounded decision making. Method: A Proposer-Solver loop over unlabeled images explicitly elicits chain-of-thought (CoT) and augments outcome-level rewards with a trajectory-aware signal defined over intermediate reasoning steps. Result: Under fully unsupervised post-training, iReasoner built on Qwen2.5-VL-7B gains up to +2.1 points across multiple multimodal reasoning benchmarks. Conclusion: iReasoner offers a feasible path toward reasoning-aware self-improvement of LMMs in purely unsupervised settings. Abstract: Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM's implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer--Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to $+2.1$ points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings.

[45] Gender Bias in LLMs: Preliminary Evidence from Shared Parenting Scenario in Czech Family Law

Jakub Harasta,Matej Vasina,Martin Kornel,Tomas Foltynek

Main category: cs.CL

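TL;DR: Investigates whether large language models (LLMs) exhibit gender bias in a family law setting. Using a divorce scenario grounded in Czech law, the study tests the shared-parenting ratios proposed by four leading models under varying conditions and finds gender-related differences in the outputs of some models, pointing to risks when the public relies on LLMs for legal help.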

Details Motivation: Because access to justice is limited for many people, laypersons increasingly rely on LLMs for legal self-help, but these models may be biased and may distort users' expectations; their fairness in sensitive legal scenarios therefore needs to be assessed. Method: An expert-designed divorce scenario grounded in Czech family law is presented in two versions, one with gendered names and one with neutral labels, to GPT-5 nano, Claude Haiku 4.5, Gemini 2.5 Flash, and Llama 3.3 in a zero-shot setting. Nine legally relevant factors are varied to test their influence on the shared-parenting ratios the models propose. Result: Preliminary results show differences across models, and some models exhibit gender-dependent patterns in the shared-parenting ratios they propose. Conclusion: Leading LLMs may exhibit gender bias on family law questions, which highlights the risk of laypeople relying on such tools for legal guidance and the need for more systematic and rigorous evaluation of model behavior in sensitive legal contexts. Abstract: Access to justice remains limited for many people, leading laypersons to increasingly rely on Large Language Models (LLMs) for legal self-help. Laypeople use these tools intuitively, which may lead them to form expectations based on incomplete, incorrect, or biased outputs. This study examines whether leading LLMs exhibit gender bias in their responses to a realistic family law scenario. We present an expert-designed divorce scenario grounded in Czech family law and evaluate four state-of-the-art LLMs GPT-5 nano, Claude Haiku 4.5, Gemini 2.5 Flash, and Llama 3.3 in a fully zero-shot interaction. We deploy two versions of the scenario, one with gendered names and one with neutral labels, to establish a baseline for comparison. We further introduce nine legally relevant factors that vary the factual circumstances of the case and test whether these variations influence the models' proposed shared-parenting ratios. Our preliminary results highlight differences across models and suggest gender-dependent patterns in the outcomes generated by some systems. The findings underscore both the risks associated with laypeople's reliance on LLMs for legal guidance and the need for more robust evaluation of model behavior in sensitive legal contexts. We present exploratory and descriptive evidence intended to identify systematic asymmetries rather than to establish causal effects.

[46] An Empirical Study on Preference Tuning Generalization and Diversity Under Domain Shift

Constantinos Karouzos,Xingwei Tan,Nikolaos Aletras

Main category: cs.CL

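TL;DR: Studies how well preference-tuned alignment generalizes under domain shift, comparing five popular alignment objectives and several adaptation strategies, and finds that pseudo-labeling-based adaptation substantially reduces the performance degradation caused by domain shift.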

Details Motivation: Preference-tuned models degrade outside their training domain, but how much different adaptation strategies mitigate this is unclear, so a systematic study of alignment generalization under domain shift is needed. Method: Compares five popular alignment objectives and evaluates adaptation strategies, including target-domain supervised fine-tuning and pseudo-labeling, across summarization and question-answering helpfulness tasks. Result: Alignment objectives differ systematically in how they generalize under domain shift, and pseudo-labeling-based adaptation substantially reduces domain-shift degradation. Conclusion: Adaptation strategies such as pseudo-labeling improve the generalization of preference-tuned models to new domains and offer practical guidance for building more robust aligned models. Abstract: Preference tuning aligns pretrained language models to human judgments of quality, helpfulness, or safety by optimizing over explicit preference signals rather than likelihood alone. Prior work has shown that preference-tuning degrades performance and reduces helpfulness when evaluated outside the training domain. However, the extent to which adaptation strategies mitigate this domain shift remains unexplored. We address this challenge by conducting a comprehensive and systematic study of alignment generalization under domain shift. We compare five popular alignment objectives and various adaptation strategies from source to target, including target-domain supervised fine-tuning and pseudo-labeling, across summarization and question-answering helpfulness tasks. Our findings reveal systematic differences in generalization across alignment objectives under domain shift. We show that adaptation strategies based on pseudo-labeling can substantially reduce domain-shift degradation

Zihang Tian,Rui Li,Jingsen Zhang,Xiaohe Bo,Wei Huo,Xu Chen

Main category: cs.CL

TL;DR: Proposes HAPS, a hierarchical LLM routing framework that jointly searches over model architectures and parameters to improve task performance.

Details Motivation: Existing LLM routing methods typically focus on selecting a model architecture and overlook parameter settings, which are critical for performance. Method: HAPS uses a high-level router to select among candidate LLM architectures and a low-level router to search for the best parameters for the selected architecture. A parameter generation network shares parameters between the two routers, and training uses a reward-augmented objective. Result: Experiments on two commonly used benchmarks show that HAPS consistently outperforms strong routing baselines. Conclusion: By jointly optimizing architecture and parameter selection, HAPS improves LLM routing performance. Abstract: Large language model (LLM) routing aims to exploit the specialized strengths of different LLMs for diverse tasks. However, existing approaches typically focus on selecting LLM architectures while overlooking parameter settings, which are critical for task performance. In this paper, we introduce HAPS, a hierarchical LLM routing framework that jointly searches over model architectures and parameters. Specifically, we use a high-level router to select among candidate LLM architectures, and then search for the optimal parameters for the selected architectures based on a low-level router. We design a parameter generation network to share parameters between the two routers to mutually enhance their capabilities. In the training process, we design a reward-augmented objective to effectively optimize our framework. Experiments on two commonly used benchmarks show that HAPS consistently outperforms strong routing baselines. We have released our code at https://github.com/zihangtian/HAPS.

[48] Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency

Haoming Xu,Ningyuan Zhao,Yunzhi Yao,Weihong Xu,Hongru Wang,Xinle Deng,Shumin Deng,Jeff Z. Pan,Huajun Chen,Ningyu Zhang

Main category: cs.CL

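TL;DR: Proposes Neighbor-Consistency Belief (NCB) as a new measure of how robust an LLM's beliefs are, and introduces Structure-Aware Training (SAT) to make the model's knowledge more stable under contextual interference.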

Details Motivation: Existing evaluations rely on point-wise confidence (such as Self-Consistency), which does not reveal how brittle a model's beliefs are under contextual perturbation and therefore cannot support reliable real-world deployment. Method: Proposes Neighbor-Consistency Belief (NCB), a structural measure of response coherence across a conceptual neighborhood; designs a cognitive stress-testing protocol that probes output stability under contextual interference; and proposes Structure-Aware Training (SAT) to optimize context-invariant belief structure. Result: Experiments show that performance on high-NCB data is more resistant to interference, and SAT reduces long-tail knowledge brittleness by approximately 30%. Conclusion: Structural belief evaluation and training can markedly improve the robustness and reliability of LLM beliefs in real-world deployment. Abstract: As Large Language Models (LLMs) are increasingly deployed in real-world settings, correctness alone is insufficient. Reliable deployment requires maintaining truthful beliefs under contextual perturbations. Existing evaluations largely rely on point-wise confidence like Self-Consistency, which can mask brittle belief. We show that even facts answered with perfect self-consistency can rapidly collapse under mild contextual interference. To address this gap, we propose Neighbor-Consistency Belief (NCB), a structural measure of belief robustness that evaluates response coherence across a conceptual neighborhood. To validate the efficiency of NCB, we introduce a new cognitive stress-testing protocol that probes outputs stability under contextual interference. Experiments across multiple LLMs show that the performance of high-NCB data is relatively more resistant to interference. Finally, we present Structure-Aware Training (SAT), which optimizes context-invariant belief structure and reduces long-tail knowledge brittleness by approximately 30%. Code will be available at https://github.com/zjunlp/belief.
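The contrast between point-wise confidence and neighborhood consistency can be illustrated with a toy score. The abstract does not give the NCB formula, so the function below is only an assumption: it measures the fraction of related "neighbor" questions a model answers correctly, with the neighborhood and gold answers supplied by the caller.

```python
def neighbor_consistency(answers: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of neighbor questions answered correctly (hypothetical NCB proxy).

    `gold` maps each question in the conceptual neighborhood to its reference
    answer; `answers` maps questions to the model's answers.
    """
    if not gold:
        raise ValueError("the neighborhood must contain at least one question")
    correct = sum(
        1
        for question, reference in gold.items()
        if answers.get(question, "").strip().lower() == reference.strip().lower()
    )
    return correct / len(gold)
```

A fact that the model answers correctly every time in isolation can still score low here if the model gets the surrounding facts wrong, which is the kind of brittleness the abstract argues point-wise confidence can mask.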

[49] Pantagruel: Unified Self-Supervised Encoders for French Text and Speech

Phuong-Hang Le,Valentin Pelloin,Arnault Chatelain,Maryem Bouziane,Mohammed Ghennai,Qianwen Guan,Kirill Milintsevich,Salima Mdhaffar,Aidan Mannion,Nils Defauw,Shuyue Gu,Alexandre Audibert,Marco Dinarelli,Yannick Estève,Lorraine Goeuriot,Steffen Lalande,Nicolas Hervé,Maximin Coavoux,François Portet,Étienne Ollion,Marie Candito,Maxime Peyrard,Solange Rossato,Benjamin Lecouteux,Aurélie Nardy,Gilles Sérasset,Vincent Segonne,Solène Evain,Diandra Fabre,Didier Schwab

Main category: cs.CL

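TL;DR: Pantagruel is a new family of self-supervised encoder models for French text and speech. By learning contextualized representations in the feature space, the models match or outperform existing baselines on a range of downstream tasks.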

Details Motivation: To better model linguistic and acoustic regularities in French text and speech, and to support multimodal understanding under a unified architecture. Method: Pre-trains with a self-supervised objective in the feature space, using large-scale French text corpora (such as Wikipedia and OSCAR) and speech corpora (such as MultilingualLibriSpeech and INA-100k) to train separate encoders that share an architecture. Result: On multiple downstream tasks from standard French benchmarks such as FLUE and LeBenchmark, Pantagruel models match or outperform strong baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0. Conclusion: Feature-space self-supervised learning is effective for French representation learning, and Pantagruel provides a strong foundation for multimodal speech-text understanding. Abstract: We release Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l'Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel across a broad range of downstream tasks spanning both modalities, including those from the standard French benchmarks such as FLUE or LeBenchmark. Across these tasks, Pantagruel models show competitive or superior performance compared to strong French baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while maintaining a shared architecture that can seamlessly handle either speech or text inputs. These results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding.

[50] Can We Predict Before Executing Machine Learning Agents?

Jingsheng Zheng,Jintian Zhang,Yujie Luo,Yuren Mao,Yunjun Gao,Lun Du,Huajun Chen,Ningyu Zhang

Main category: cs.CL

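TL;DR: Proposes FOREAGENT, an autonomous machine learning agent that replaces the usual Generate-Execute-Feedback paradigm with a Predict-then-Verify loop. An LLM predicts which candidate solution is preferable from data-centric evidence, giving a 6x acceleration in convergence while surpassing execution-based baselines by 6%.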

Details Motivation: Existing autonomous scientific discovery agents are constrained by the Generate-Execute-Feedback paradigm: evaluating a hypothesis requires expensive physical execution, which creates a severe execution bottleneck. Method: Formalizes the Data-centric Solution Preference task, builds a corpus of 18,438 pairwise comparisons, has LLMs predict preferences when primed with a Verified Data Analysis Report, and implements a Predict-then-Verify loop in FOREAGENT. Result: Primed LLMs reach 61.5% prediction accuracy with robust confidence calibration; FOREAGENT converges 6x faster and surpasses execution-based baselines by 6%. Conclusion: Internalizing execution priors and predicting before executing substantially eases the execution bottleneck and improves both the efficiency and the performance of scientific discovery agents, supporting the potential of data-centric reasoning in autonomous agents. Abstract: Autonomous machine learning agents have revolutionized scientific discovery, yet they remain constrained by a Generate-Execute-Feedback paradigm. Previous approaches suffer from a severe Execution Bottleneck, as hypothesis evaluation relies strictly on expensive physical execution. To bypass these physical constraints, we internalize execution priors to substitute costly runtime checks with instantaneous predictive reasoning, drawing inspiration from World Models. In this work, we formalize the task of Data-centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons. We demonstrate that LLMs exhibit significant predictive capabilities when primed with a Verified Data Analysis Report, achieving 61.5% accuracy and robust confidence calibration. Finally, we instantiate this framework in FOREAGENT, an agent that employs a Predict-then-Verify loop, achieving a 6x acceleration in convergence while surpassing execution-based baselines by +6%. Our code and dataset will be publicly available soon at https://github.com/zjunlp/predict-before-execute.
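A Predict-then-Verify loop can be sketched in a few lines: rank candidates with a cheap predictor, then spend real execution only on the top few. This is a generic illustration under stated assumptions, not FOREAGENT itself. `predict_score` stands in for the LLM preference predictor, `execute` for the expensive training run, and `top_k` is a hypothetical budget parameter.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def predict_then_verify(
    candidates: Sequence[T],
    predict_score: Callable[[T], float],
    execute: Callable[[T], float],
    top_k: int = 1,
) -> tuple[T, float, int]:
    """Rank candidates by predicted score, then execute only the top_k of them.

    Returns the best verified candidate, its measured score, and the number
    of expensive executions that were actually run.
    """
    if not candidates or top_k < 1:
        raise ValueError("need at least one candidate and top_k >= 1")
    ranked = sorted(candidates, key=predict_score, reverse=True)
    verified = [(execute(candidate), candidate) for candidate in ranked[:top_k]]
    best_score, best = max(verified, key=lambda pair: pair[0])
    return best, best_score, len(verified)
```

With a predictor that is right only somewhat more often than chance (the abstract reports 61.5% pairwise accuracy), verifying more than one candidate is a reasonable hedge against a wrong top prediction.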

[51] Distilling Feedback into Memory-as-a-Tool

Víctor Gallego

Main category: cs.CL

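TL;DR: Proposes a framework that amortizes the cost of inference-time reasoning by converting transient critiques into retrievable guidelines, using a file-based memory system and agent-controlled tool calls.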

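Details Motivation: To reduce the inference cost of test-time refinement pipelines while preserving their performance. Method: Uses a file-based memory system and agent-controlled tool calls to convert transient critiques into retrievable guidelines. Result: On Rubric Feedback Bench, a new dataset, the augmented LLMs rapidly match the performance of test-time refinement pipelines while drastically reducing inference cost. Conclusion: The method keeps performance high while substantially cutting inference overhead, offering a practical route to efficient inference. Abstract: We propose a framework that amortizes the cost of inference-time reasoning by converting transient critiques into retrievable guidelines, through a file-based memory system and agent-controlled tool calls. We evaluate this method on the Rubric Feedback Bench, a novel dataset for rubric-based learning. Experiments demonstrate that our augmented LLMs rapidly match the performance of test-time refinement pipelines while drastically reducing inference cost.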
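A file-based memory exposed as two tools (one to store a distilled guideline, one to retrieve guidelines for a topic) might look like the sketch below. The class name, the one-file-per-topic layout, and the Markdown bullet format are all assumptions made for illustration; the abstract does not describe the actual file structure or tool interface.

```python
import tempfile
from pathlib import Path


class GuidelineMemory:
    """Hypothetical file-backed memory: one Markdown file of guidelines per topic."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def write_guideline(self, topic: str, guideline: str) -> None:
        """Tool 1: append a distilled critique as a reusable guideline.

        `topic` is used as a file name, so it must be filename-safe.
        """
        with (self.root / f"{topic}.md").open("a", encoding="utf-8") as f:
            f.write(f"- {guideline}\n")

    def read_guidelines(self, topic: str) -> list[str]:
        """Tool 2: return every stored guideline for a topic (empty if none)."""
        path = self.root / f"{topic}.md"
        if not path.exists():
            return []
        lines = path.read_text(encoding="utf-8").splitlines()
        return [line[2:].strip() for line in lines if line.startswith("- ")]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        memory = GuidelineMemory(Path(tmp) / "memory")
        memory.write_guideline("essay", "State the thesis in the first paragraph.")
        print(memory.read_guidelines("essay"))
```

The cost saving comes from reading a stored guideline before generating, instead of running a fresh critique-and-revise cycle for every new input.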

[52] The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning

Qiguang Chen,Yantao Du,Ziniu Li,Jinhao Liu,Songyao Duan,Jiarui Guo,Minghao Liu,Jiaheng Liu,Tong Yang,Ge Zhang,Libo Qin,Wanxiang Che,Wenhao Huang

Main category: cs.CL

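TL;DR: Proposes a molecular-structure analogy for learning long chain-of-thought (Long CoT) reasoning, identifies stable structures in effective reasoning trajectories, and introduces Mole-Syn to improve performance on complex reasoning tasks and the stability of reinforcement learning.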

Details Motivation: LLMs struggle to learn Long CoT reasoning by imitating humans or non-Long-CoT LLMs. The paper investigates what makes Long CoT trajectories learnable and addresses the resulting training instability. Method: By analyzing distilled reasoning trajectories, the authors identify three molecule-like interaction types (Deep-Reasoning, Self-Reflection, Self-Exploration), introduce the concept of Effective Semantic Isomers, and design Mole-Syn, a method based on a distribution-transfer graph that guides the synthesis of effective Long CoT structures. Result: Only "bonds" that promote fast entropy convergence support stable Long CoT learning, while structural competition impairs training; Mole-Syn improves reasoning performance and RL stability across benchmarks. Conclusion: Modeling reasoning as a system with stable molecule-like structure helps in understanding and optimizing Long CoT learning, and Mole-Syn offers a new route to building effective, stable reasoning trajectories. Abstract: Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning from human or non-Long-CoT LLMs imitation. To understand this, we propose that effective and learnable Long CoT trajectories feature stable molecular-like structures in unified view, which are formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals these structures emerge from Long CoT fine-tuning, not keyword imitation. We introduce Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides synthesis of effective Long CoT structures, boosting performance and RL stability across benchmarks.

[53] Don't Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks

Elias Lumer,Faheem Nizar,Akshaya Jangiti,Kevin Frank,Anmol Gulati,Mandar Phadate,Vamse Kumar Subbiah

Main category: cs.CL

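TL;DR: Studies how prompt caching affects cost and latency in multi-turn LLM agent tasks, evaluates caching strategies across three major LLM providers, and offers best practices for improving efficiency.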

Details Motivation: LLM agents are widely used for multi-turn tasks, yet the effect of prompt caching on the cost and latency of such workloads is underexplored; quantitative analysis and comparisons of caching strategies are missing. Method: The authors evaluate prompt caching on the DeepResearchBench benchmark across OpenAI, Anthropic, and Google, comparing three strategies (full context caching, system prompt only caching, and caching that excludes dynamic tool results) and measuring API cost and time to first token (TTFT). Result: Prompt caching reduces API cost by 45-80% and improves time to first token by 13-31%. Deliberate control of cache blocks (placing dynamic content at the end of the system prompt, avoiding dynamic function calling, excluding dynamic tool results) gives more consistent benefits than naive full-context caching, which can even increase latency. Caching behavior differs across providers. Conclusion: Prompt caching offers substantial cost and latency benefits for multi-turn agentic tasks, but the gains depend on a deliberate caching design; the paper gives practical guidance for production deployments. Abstract: Recent advancements in Large Language Model (LLM) agents have enabled complex multi-turn agentic tasks requiring extensive tool calling, where conversations can span dozens of API calls with increasingly large context windows. However, although major LLM providers offer prompt caching to reduce cost and latency, its benefits for agentic workloads remain underexplored in the research literature. To our knowledge, no prior work quantifies these cost savings or compares caching strategies for multi-turn agentic tasks. We present a comprehensive evaluation of prompt caching across three major LLM providers (OpenAI, Anthropic, and Google) and compare three caching strategies, including full context caching, system prompt only caching, and caching that excludes dynamic tool results. We evaluate on DeepResearchBench, a multi-turn agentic benchmark where agents autonomously execute real-world web search tool calls to answer complex research questions, measuring both API cost and time to first token (TTFT) across over 500 agent sessions with 10,000-token system prompts. Our results demonstrate that prompt caching reduces API costs by 45-80% and improves time to first token by 13-31% across providers. We find that strategic prompt cache block control, such as placing dynamic content at the end of the system prompt, avoiding dynamic traditional function calling, and excluding dynamic tool results, provides more consistent benefits than naive full-context caching, which can paradoxically increase latency. Our analysis reveals nuanced variations in caching behavior across providers, and we provide practical guidance for implementing prompt caching in production agentic systems.
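The advice to place dynamic content at the end of the system prompt follows from caches matching on a shared prompt prefix: anything that changes early in the prompt invalidates everything after it. The sketch below illustrates this with characters as a rough proxy for tokens. `build_prompt` and `shared_prefix_len` are hypothetical helpers, and real providers differ in block granularity and minimum cacheable length.

```python
import os


def shared_prefix_len(a: str, b: str) -> int:
    """Length of the common leading substring, a proxy for the cacheable prefix."""
    return len(os.path.commonprefix([a, b]))


def build_prompt(static_instructions: str, dynamic_context: str,
                 dynamic_last: bool = True) -> str:
    """Assemble a system prompt with the dynamic part either last or first."""
    if dynamic_last:
        parts = [static_instructions, dynamic_context]
    else:
        parts = [dynamic_context, static_instructions]
    return "\n\n".join(parts)


if __name__ == "__main__":
    static = "You are a research agent. Follow the tool-use policy. " * 20
    for dynamic_last in (True, False):
        first = build_prompt(static, "Current time: 10:00", dynamic_last)
        second = build_prompt(static, "Current time: 10:05", dynamic_last)
        print(dynamic_last, shared_prefix_len(first, second), "of", len(first))
```

With the dynamic line last, consecutive requests share almost the whole prompt as a prefix; with it first, the shared prefix ends at the first changed character of the timestamp.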

[54] Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards

Jiajie Zhang,Xin Lv,Ling Feng,Lei Hou,Juanzi Li

Main category: cs.CL

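TL;DR: Proposes Citation-aware Rubric Rewards (CaRR), a fine-grained reward framework, and the corresponding training method C-GRPO to improve the reasoning comprehensiveness, factual accuracy, and evidence connectivity of LLM-based deep search agents.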

Details Motivation: Existing reinforcement learning methods rely on binary outcome rewards, which cannot assess an agent's reasoning process and lead to shortcut exploitation and hallucination. A finer-grained reward is needed to guide deep search agents toward verifiable, evidence-grounded multi-step reasoning. Method: CaRR decomposes complex questions into verifiable single-hop rubrics and requires agents to explicitly identify hidden entities, support them with correct citations, and build complete evidence chains. C-GRPO combines CaRR with outcome rewards for training. Result: C-GRPO outperforms outcome-reward baselines on multiple deep search benchmarks, discourages shortcut behavior, promotes comprehensive evidence-grounded reasoning, and generalizes well to open-ended deep research tasks. Conclusion: Through fine-grained, citation-aware rewards, CaRR and C-GRPO improve the reasoning quality and reliability of deep search agents and offer an effective path toward trustworthy AI search systems. Abstract: Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents. However, existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents' reasoning process, and often lead to undesirable behaviors such as shortcut exploitation and hallucinations. To address these limitations, we propose Citation-aware Rubric Rewards (CaRR), a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. CaRR decomposes complex questions into verifiable single-hop rubrics and requires agents to satisfy these rubrics by explicitly identifying hidden entities, supporting them with correct citations, and constructing complete evidence chains that link to the predicted answer. We further introduce Citation-aware Group Relative Policy Optimization (C-GRPO), which combines CaRR and outcome rewards for training robust deep search agents. Experiments show that C-GRPO consistently outperforms standard outcome-based RL baselines across multiple deep search benchmarks. Our analysis also validates that C-GRPO effectively discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and exhibits strong generalization to open-ended deep research tasks. Our code and data are available at https://github.com/THUDM/CaRR.
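To see why a rubric term helps where a binary outcome reward does not, consider the toy reward below. The additive form and the `rubric_weight` value are assumptions chosen for illustration; the abstract says only that C-GRPO combines CaRR with outcome rewards, not how.

```python
def rubric_reward(rubric_results: list[bool]) -> float:
    """Fraction of single-hop rubrics the trajectory satisfied (0.0 if none given)."""
    if not rubric_results:
        return 0.0
    return sum(rubric_results) / len(rubric_results)


def combined_reward(
    answer_correct: bool,
    rubric_results: list[bool],
    rubric_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Outcome reward plus a weighted rubric term (assumed additive combination)."""
    outcome = 1.0 if answer_correct else 0.0
    return outcome + rubric_weight * rubric_reward(rubric_results)
```

Under a purely binary reward, a trajectory that guesses the right answer and one that supports every hop with a correct citation both score 1.0. Here the first scores 1.0 and the second 1.5, so the policy gradient can tell them apart.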

[55] AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling for LLMs

Chengming Cui,Tianxin Wei,Ziyi Chen,Ruizhong Qiu,Zhichen Zeng,Zhining Liu,Xuying Ning,Duo Zhou,Jingrui He

Main category: cs.CL

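TL;DR: Proposes AdaFuse, an adaptive ensemble decoding framework that dynamically selects semantically appropriate fusion units, addressing the lack of flexibility of existing ensemble methods during generation.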

Details Motivation: Existing ensemble methods rely on a fixed fusion granularity, which cannot adapt to the generation characteristics of different tasks or adjust mid-generation. Method: An uncertainty-based criterion decides at each decoding step whether to ensemble. In low-confidence states, a diversity-aware scaling strategy explores candidate continuations, so that adaptive ensembling and test-time scaling reinforce each other. Result: On open-domain question answering, arithmetic reasoning, and machine translation, AdaFuse outperforms strong ensemble baselines with an average relative improvement of 6.88%. Conclusion: By adjusting fusion behavior dynamically, AdaFuse combines the strengths of multiple LLMs and improves generation quality and robustness. Abstract: Large language models (LLMs) exhibit complementary strengths arising from differences in pretraining data, model architectures, and decoding behaviors. Inference-time ensembling provides a practical way to combine these capabilities without retraining. However, existing ensemble approaches suffer from fundamental limitations. Most rely on fixed fusion granularity, which lacks the flexibility required for mid-generation adaptation and fails to adapt to different generation characteristics across tasks. To address these challenges, we propose AdaFuse, an adaptive ensemble decoding framework that dynamically selects semantically appropriate fusion units during generation. Rather than committing to a fixed granularity, AdaFuse adjusts fusion behavior on the fly based on the decoding context, with words serving as basic building blocks for alignment. To be specific, we introduce an uncertainty-based criterion to decide whether to apply ensembling at each decoding step. Under confident decoding states, the model continues generation directly. In less certain states, AdaFuse invokes a diversity-aware scaling strategy to explore alternative candidate continuations and inform ensemble decisions. This design establishes a synergistic interaction between adaptive ensembling and test-time scaling, where ensemble decisions guide targeted exploration, and the resulting diversity in turn strengthens ensemble quality. Experiments on open-domain question answering, arithmetic reasoning, and machine translation demonstrate that AdaFuse consistently outperforms strong ensemble baselines, achieving an average relative improvement of 6.88%. The code is available at https://github.com/CCM0111/AdaFuse.
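The uncertainty gate described in the abstract (continue directly when confident, ensemble when not) can be sketched at the level of a single decoding step. This simplified version works on next-token distributions and averages probabilities across models; AdaFuse itself aligns on word-level units and adds a diversity-aware exploration step, neither of which is modeled here. The threshold value and the averaging rule are assumptions.

```python
def next_token(
    primary: dict[str, float],
    others: list[dict[str, float]],
    confidence_threshold: float = 0.7,  # hypothetical gate value
) -> tuple[str, bool]:
    """Pick the next token, ensembling only when the primary model is unsure.

    Each distribution maps candidate tokens to probabilities. Returns the
    chosen token and a flag saying whether ensembling was applied.
    """
    top = max(primary, key=primary.get)
    if primary[top] >= confidence_threshold:
        return top, False  # confident state: continue generation directly

    # Uncertain state: average the distributions of all models.
    models = [primary, *others]
    vocab = set().union(*models)
    fused = {
        tok: sum(dist.get(tok, 0.0) for dist in models) / len(models)
        for tok in vocab
    }
    return max(fused, key=fused.get), True
```

Gating keeps the extra cost of querying the other models confined to the steps where the primary model is uncertain, which is where a second opinion is most likely to change the output.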

cs.CV [Back]

[56] Bi-Orthogonal Factor Decomposition for Vision Transformers

Fenil R. Doshi,Thomas Fel,Talia Konkle,George Alvarez

Main category: cs.CV

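TL;DR: Proposes Bi-orthogonal Factor Decomposition (BFD), a framework for analyzing how attention in Vision Transformers passes information between tokens. Attention is found to operate mainly through content, and the analysis clarifies the roles that positional and semantic factors play in token communication.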

Details Motivation: There is currently no systematic understanding of what information attention mechanisms exchange between tokens; in particular, the roles of positional and content factors cannot be told apart. Method: The BFD framework first uses an ANOVA-based method to decompose token activations into orthogonal positional and content factors, then applies SVD to the query-key interaction matrix QK^T to extract bi-orthogonal modes that reveal how information is passed. Result: 1) Attention is driven mainly by content: content-content interactions dominate the energy, followed by content-position coupling; 2) attention heads specialize functionally, differentiating into distinct types of operators; 3) DINOv2's intermediate layers simultaneously preserve positional structure and enrich semantic content, which yields stronger shape processing. Conclusion: BFD reveals the concrete mechanism by which tokens interact through attention in Vision Transformers and clarifies the roles of positional and semantic factors in their communication, offering a practical new perspective on attention. Abstract: Self-attention is the central computational primitive of Vision Transformers, yet we lack a principled understanding of what information attention mechanisms exchange between tokens. Attention maps describe where weight mass concentrates; they do not reveal whether queries and keys trade position, content, or both. We introduce Bi-orthogonal Factor Decomposition (BFD), a two-stage analytical framework: first, an ANOVA-based decomposition statistically disentangles token activations into orthogonal positional and content factors; second, SVD of the query-key interaction matrix QK^T exposes bi-orthogonal modes that reveal how these factors mediate communication. After validating proper isolation of position and content, we apply BFD to state-of-the-art vision models and uncover three phenomena. (i) Attention operates primarily through content. Content-content interactions dominate attention energy, followed by content-position coupling. DINOv2 allocates more energy to content-position than supervised models and distributes computation across a richer mode spectrum. (ii) Attention mechanisms exhibit specialization: heads differentiate into content-content, content-position, and position-position operators, while singular modes within heads show analogous specialization. (iii) DINOv2's superior holistic shape processing emerges from intermediate layers that simultaneously preserve positional structure while contextually enriching semantic content. Overall, BFD exposes how tokens interact through attention and which informational factors - positional or semantic - mediate their communication, yielding practical insights into vision transformer mechanisms.
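
The first BFD stage can be illustrated with a minimal sketch. The abstract does not give the exact estimator, so the two-way split below (grand mean, per-position mean across images, residual) is an assumption, and `decompose` is a hypothetical name. The second stage, the SVD of QK^T, is not shown.

```python
def decompose(acts):
    """Split activations into grand mean, positional factor, and content factor.

    acts[n][p] is a list of d floats: the activation of image n at position p.
    Assumed ANOVA-style split; the paper's exact estimator may differ.
    """
    n_img, n_pos, dim = len(acts), len(acts[0]), len(acts[0][0])
    # Grand mean over all images and positions.
    mu = [
        sum(acts[n][p][k] for n in range(n_img) for p in range(n_pos)) / (n_img * n_pos)
        for k in range(dim)
    ]
    # Positional factor: what a position contributes on average across images.
    pos = [
        [sum(acts[n][p][k] for n in range(n_img)) / n_img - mu[k] for k in range(dim)]
        for p in range(n_pos)
    ]
    # Content factor: the image-specific residual at each position.
    content = [
        [[acts[n][p][k] - mu[k] - pos[p][k] for k in range(dim)] for p in range(n_pos)]
        for n in range(n_img)
    ]
    return mu, pos, content
```

By construction the content residuals average to zero across images at every position, so the positional and content factors are orthogonal when summed over the dataset.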

[57] Coding the Visual World: From Image to Simulation Using Vision Language Models

Sagi Eppel

Main category: cs.CV

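TL;DR: This paper proposes the Im2Sim method, which evaluates the visual understanding of Vision Language Models (VLMs) on complex systems (such as clouds, cities, and vegetation) by having them generate executable simulation code from natural images. Experiments show that current leading VLMs perform well at high-level abstract understanding but have limited ability to reproduce details and low-level patterns, revealing an asymmetry between deep understanding and fine-grained perception.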

Details Motivation: Visual understanding can be viewed as the ability to build a representative model of the system depicted in an image. The study explores whether VLMs can recognize and simulate complex systems in images, so as to assess genuine visual understanding rather than mere image recognition or description. Method: Im2Sim gives a VLM a real-world image of a system (e.g., a city or clouds) and asks it to describe the system and generate runnable simulation code; the code is executed to produce a synthetic image, which is compared with the original. The tests cover a variety of complex emergent systems, including physical, ecological, and urban ones. Result: Leading VLMs such as GPT and Gemini can understand and model complex, multi-component systems across multiple domains and levels of abstraction, showing strong high-level semantic understanding, but they perform poorly at reproducing image details and low-level structural patterns. Conclusion: VLMs have deep, high-level visual understanding and can build abstract models of systems, but they lack precise perception of visual details, showing an asymmetry of strong high-level understanding and weak low-level perception. Abstract: The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work explores the capacity of Vision Language Models (VLMs) to recognize and simulate the systems and mechanisms depicted in images using the Im2Sim methodology. The VLM is given a natural image of a real-world system (e.g., cities, clouds, vegetation) and is tasked with describing the system and writing code that simulates and generates it. This generative code is then executed to produce a synthetic image, which is compared against the original. This approach is tested on various complex emergent systems, ranging from physical systems (waves, lights, clouds) to vegetation, cities, materials, and geological formations. Through analysis of the models and images generated by the VLMs, we examine their understanding of the systems in images. The results show that leading VLMs (GPT, Gemini) demonstrate the capacity to understand and model complex, multi-component systems across multiple layers of abstraction and a wide range of domains. At the same time, the VLMs exhibit limited ability to replicate fine details and low-level arrangements of patterns in the image. These findings reveal an interesting asymmetry: VLMs combine high-level, deep visual understanding of images with limited perception of fine details.

[58] STResNet & STYOLO : A New Family of Compact Classification and Object Detection Models for MCUs

Sudhakar Sah,Ravish Kumar

Main category: cs.CV

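TL;DR: This paper proposes two families of lightweight neural network models, STResNet for image classification and STYOLO for object detection, which balance accuracy, efficiency, and memory footprint for resource-constrained devices.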

Details Motivation: Existing lightweight models usually trade accuracy for lower latency, which limits their use on microcontroller and neural processing unit devices, so a better balance is needed. Method: The STResNet (Nano to Tiny variants) and STYOLO model series are designed with joint optimization for edge hardware and validated on the ImageNet-1K and MS COCO datasets. Result: STResNetMilli reaches 70.0% Top-1 accuracy with only 3 million parameters, outperforming MobileNetV1 and ShuffleNetV2; STYOLOMicro and STYOLOMilli achieve 30.5% and 33.6% mAP on MS COCO, respectively, surpassing YOLOv5n and YOLOX Nano. Conclusion: The proposed models substantially improve accuracy while keeping parameter counts very low, offering a better option for efficient deployment on edge devices. Abstract: Recent advancements in lightweight neural networks have significantly improved the efficiency of deploying deep learning models on edge hardware. However, most existing architectures still trade accuracy for latency, which limits their applicability on microcontroller and neural processing unit based devices. In this work, we introduce two new model families, STResNet for image classification and STYOLO for object detection, jointly optimized for accuracy, efficiency, and memory footprint on resource constrained platforms. The proposed STResNet series, ranging from Nano to Tiny variants, achieves competitive ImageNet 1K accuracy within a four million parameter budget. Specifically, STResNetMilli attains 70.0 percent Top 1 accuracy with only three million parameters, outperforming MobileNetV1 and ShuffleNetV2 at comparable computational complexity. For object detection, STYOLOMicro and STYOLOMilli achieve 30.5 percent and 33.6 percent mean average precision, respectively, on the MS COCO dataset, surpassing YOLOv5n and YOLOX Nano in both accuracy and efficiency. Furthermore, when STResNetMilli is used as a backbone with the Ultralytics training environment.

[59] MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments

Svitlana Morkva,Maximum Wilder-Smith,Michael Oechsle,Alessio Tonioni,Marco Hutter,Vaishakh Patil

Main category: cs.CV

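TL;DR: MOSAIC-GS is a Gaussian Splatting based method for reconstructing dynamic scenes from monocular video. By fusing multiple geometric cues with Poly-Fourier curve motion modeling, it achieves efficient, high-quality separation and reconstruction of static and dynamic scene content.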

Details Motivation: Monocular dynamic scene reconstruction is ambiguous because it lacks multi-view constraints, making accurate recovery of geometry and temporal coherence difficult; this work aims to reduce reliance on visual appearance and improve the stability and efficiency of reconstruction. Method: Multiple geometric cues (depth, optical flow, dynamic segmentation, and point tracking) are used, together with rigidity-based motion constraints, to estimate preliminary 3D scene dynamics during initialization; the scene is decomposed into static and dynamic parts, and each dynamic Gaussian's trajectory is represented by a time-dependent Poly-Fourier curve for parameter-efficient modeling of non-rigid motion. Result: Optimization and rendering are substantially faster than with existing methods, while reconstruction quality on standard monocular dynamic scene benchmarks stays on par with the state of the art. Conclusion: By explicitly exploiting multiple geometric cues and a compact motion representation, MOSAIC-GS strikes a good balance between efficiency and quality and suits high-fidelity, real-time dynamic scene reconstruction. Abstract: We present MOSAIC-GS, a novel, fully explicit, and computationally efficient approach for high-fidelity dynamic scene reconstruction from monocular videos using Gaussian Splatting. Monocular reconstruction is inherently ill-posed due to the lack of sufficient multiview constraints, making accurate recovery of object geometry and temporal coherence particularly challenging. To address this, we leverage multiple geometric cues, such as depth, optical flow, dynamic object segmentation, and point tracking. Combined with rigidity-based motion constraints, these cues allow us to estimate preliminary 3D scene dynamics during an initialization stage. Recovering scene dynamics prior to the photometric optimization reduces reliance on motion inference from visual appearance alone, which is often ambiguous in monocular settings. To enable compact representations, fast training, and real-time rendering while supporting non-rigid deformations, the scene is decomposed into static and dynamic components. Each Gaussian in the dynamic part of the scene is assigned a trajectory represented as a time-dependent Poly-Fourier curve for parameter-efficient motion encoding. We demonstrate that MOSAIC-GS achieves substantially faster optimization and rendering compared to existing methods, while maintaining reconstruction quality on par with state-of-the-art approaches across standard monocular dynamic scene benchmarks.
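
A rough sketch of how a Poly-Fourier trajectory might be evaluated follows. The abstract does not state the functional form; the version below (a polynomial in time plus a truncated Fourier series, applied per coordinate) is an assumption, and the names `poly_fourier` and `gaussian_center` are hypothetical.

```python
import math


def poly_fourier(t, poly, sin_coef, cos_coef):
    """Evaluate an assumed Poly-Fourier curve at normalized time t in [0, 1].

    poly:     polynomial coefficients a_0, a_1, ... (a_k multiplies t**k)
    sin_coef: Fourier sine coefficients for frequencies 1, 2, ...
    cos_coef: Fourier cosine coefficients for the same frequencies
    """
    value = sum(a * t**k for k, a in enumerate(poly))
    for m, (b, c) in enumerate(zip(sin_coef, cos_coef), start=1):
        angle = 2 * math.pi * m * t
        value += b * math.sin(angle) + c * math.cos(angle)
    return value


def gaussian_center(t, params):
    """Center of one dynamic Gaussian at time t.

    params holds one (poly, sin_coef, cos_coef) triple per axis (x, y, z).
    """
    return tuple(poly_fourier(t, *axis) for axis in params)
```

With a handful of coefficients per axis, each Gaussian gets a smooth trajectory over the whole clip instead of one stored position per frame, which is one way to read the "parameter-efficient" claim.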

[60] Ensemble of radiomics and ConvNeXt for breast cancer diagnosis

Jorge Alberto Garza-Abdala,Gerardo Alejandro Fumagal-González,Beatriz A. Bosques-Palomo,Mario Alexis Monsivais Molina,Daly Avedano,Servando Cardona-Huerta,José Gerardo Tamez-Pena

Main category: cs.CV

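TL;DR: This study evaluates the detection performance of radiomics, deep learning (DL), and ensemble methods on breast cancer screening mammograms and finds that the ensemble approach significantly improves diagnostic performance.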

Details Motivation: Early diagnosis of breast cancer is crucial for improving survival. Radiomics and deep learning show promise, but their performance still needs systematic evaluation, especially generalization across datasets. Method: Using two independent datasets (RSNA and TecSalud), a ConvNeXtV1-small DL model and radiomics models are trained separately, and the ensemble model is built and calibrated with a consistent methodology. Result: The ensemble reaches an AUC of 0.87, better than DL alone (0.83) and radiomics alone (0.80), and is stable in cross-dataset validation. Conclusion: An ensemble that combines deep learning with radiomics significantly improves breast cancer detection in mammograms and has potential for clinical use. Abstract: Early diagnosis of breast cancer is crucial for improving survival rates. Radiomics and deep learning (DL) have shown significant potential in assisting radiologists with early cancer detection. This paper aims to critically assess the performance of radiomics, DL, and ensemble techniques in detecting cancer from screening mammograms. Two independent datasets were used: the RSNA 2023 Breast Cancer Detection Challenge (11,913 patients) and a Mexican cohort from the TecSalud dataset (19,400 patients). The ConvNeXtV1-small DL model was trained on the RSNA dataset and validated on the TecSalud dataset, while radiomics models were developed using the TecSalud dataset and validated with a leave-one-year-out approach. The ensemble method consistently combined and calibrated predictions using the same methodology. Results showed that the ensemble approach achieved the highest area under the curve (AUC) of 0.87, compared to 0.83 for ConvNeXtV1-small and 0.80 for radiomics. In conclusion, ensemble methods combining DL and radiomics predictions significantly enhance breast cancer diagnosis from mammograms.
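
To make the AUC comparison concrete, here is a small sketch of a rank-based AUC and a two-model ensemble. The abstract does not say how the predictions are combined or calibrated, so plain averaging is an assumption here, and the toy scores are illustrative only, not data from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve as the probability that a positive case
    outscores a negative one (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_ensemble(probs_a, probs_b):
    """Assumed combination rule: the mean of two models' probabilities."""
    return [(a + b) / 2 for a, b in zip(probs_a, probs_b)]


# Toy example: each model misranks a different pair of cases.
labels = [0, 0, 1, 1]
dl_probs = [0.1, 0.6, 0.5, 0.9]
radiomics_probs = [0.4, 0.2, 0.7, 0.3]
ensemble_probs = average_ensemble(dl_probs, radiomics_probs)
```

In this toy case each model alone scores an AUC of 0.75 while the average scores 1.0, because the two models make different mistakes. That complementarity is the usual argument for ensembling.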

[61] EdgeLDR: Quaternion Low-Displacement Rank Neural Networks for Edge-Efficient Deep Learning

Vladimir Frants,Sos Agaian,Karen Panetta

Main category: cs.CV

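TL;DR: This paper proposes EdgeLDR, an efficient neural network framework that combines quaternion channel mixing with block-circulant structure and uses FFT acceleration to implement linear and convolutional layers with low memory and compute cost, targeting compression of deep learning models on edge devices.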

Details Motivation: Deployment of deep neural networks on edge devices is limited by the memory traffic and compute cost of dense linear operations; existing methods fall short in either parameter efficiency or computational efficiency, so a solution that optimizes both is needed. Method: The EdgeLDR framework adopts a quaternion block-circulant matrix structure and uses the complex adjoint representation to enable fast FFT-based computation; corresponding linear and convolutional layer implementations are provided. Result: FFT acceleration clearly outperforms the naive implementation, and latency stays stable as block size grows; on CIFAR, SVHN, and hyperspectral image classification, high parameter compression ratios are achieved with competitive accuracy. Conclusion: EdgeLDR effectively combines the parameter efficiency of quaternion neural networks with the computational efficiency of structured matrices, providing a high-performance model compression solution for edge devices. Abstract: Deploying deep neural networks on edge devices is often limited by the memory traffic and compute cost of dense linear operators. While quaternion neural networks improve parameter efficiency by coupling multiple channels through Hamilton products, they typically retain unstructured dense weights; conversely, structured matrices enable fast computation but are usually applied in the real domain. This paper introduces EdgeLDR, a practical framework for quaternion block-circulant linear and convolutional layers that combines quaternion channel mixing with block-circulant parameter structure and enables FFT-based evaluation through the complex adjoint representation. We present reference implementations of EdgeLDR layers and compare FFT-based computation against a naive spatial-domain realization of quaternion circulant products. FFT evaluation yields large empirical speedups over the naive implementation and keeps latency stable as block size increases, making larger compression factors computationally viable. We further integrate EdgeLDR layers into compact CNN and Transformer backbones and evaluate accuracy-compression trade-offs on 32x32 RGB classification (CIFAR-10/100, SVHN) and hyperspectral image classification (Houston 2013, Pavia University), reporting parameter counts and CPU/GPU latency. The results show that EdgeLDR layers provide significant compression with competitive accuracy.
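
The FFT-versus-naive comparison rests on a standard identity: multiplying by a circulant matrix is a circular convolution, which the FFT diagonalizes. The sketch below shows this for a single real-valued circulant block only. It is not the paper's quaternion layer, and the function names are hypothetical.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two.
    With invert=True the result is unnormalized (divide by n afterwards)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out


def circulant_matvec_naive(c, x):
    """O(n^2) product with the circulant matrix whose first column is c."""
    n = len(c)
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def circulant_matvec_fft(c, x):
    """O(n log n) product: y = IFFT(FFT(c) * FFT(x))."""
    n = len(c)
    spectrum = [p * q for p, q in zip(fft(c), fft(x))]
    return [v.real / n for v in fft(spectrum, invert=True)]
```

A block of size n stores n parameters instead of n^2, so larger blocks compress more. The naive product still costs O(n^2) per block while the FFT path costs O(n log n), consistent with the abstract's point that FFT evaluation makes larger compression factors viable.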

[62] Sketch&Patch++: Efficient Structure-Aware 3D Gaussian Representation

Yuang Shi,Simone Gasparini,Géraldine Morin,Wei Tsang Ooi

Main category: cs.CV

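TL;DR: This paper proposes a layered representation that divides Gaussians into Sketch Gaussians and Patch Gaussians for efficient progressive streaming of 3D scenes, substantially improving compression efficiency while maintaining high quality.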

Details Motivation: Inspired by the artistic practice of sketching outlines before filling in areas of color, the work seeks a semantic separation within 3D Gaussian Splatting (3DGS) to support structure-aware, efficient representation and transmission. Method: A hierarchical categorization framework based on multi-criteria density-based clustering and adaptive quality-driven refinement operates directly on the 3DGS representation, dividing Gaussians into Sketch Gaussians that capture high-frequency boundaries and Patch Gaussians that represent low-frequency smooth regions, giving adaptive categorization without external geometric priors. Result: Validated on diverse indoor and outdoor scenes; compared with uniform pruning, PSNR improves by up to 1.74 dB, SSIM by 6.7%, and LPIPS by 41.4%; indoor scenes maintain visual quality with only 0.5% of the original model size. Conclusion: This structure-aware hybrid representation enables efficient storage, adaptive streaming, and high-quality rendering, and suits 3D content delivery in bandwidth-constrained and resource-constrained settings. Abstract: We observe that Gaussians exhibit distinct roles and characteristics analogous to traditional artistic techniques -- like how artists first sketch outlines before filling in broader areas with color, some Gaussians capture high-frequency features such as edges and contours, while others represent broader, smoother regions analogous to brush strokes that add volume and depth. Based on this observation, we propose a hybrid representation that categorizes Gaussians into (i) Sketch Gaussians, which represent high-frequency, boundary-defining features, and (ii) Patch Gaussians, which cover low-frequency, smooth regions. This semantic separation naturally enables layered progressive streaming, where the compact Sketch Gaussians establish the structural skeleton before Patch Gaussians incrementally refine volumetric detail. In this work, we extend our previous method to arbitrary 3D scenes by proposing a novel hierarchical adaptive categorization framework that operates directly on the 3DGS representation. Our approach employs multi-criteria density-based clustering, combined with adaptive quality-driven refinement. This method eliminates dependency on external 3D line primitives while ensuring optimal parametric encoding effectiveness. Our comprehensive evaluation across diverse scenes, including both man-made and natural environments, demonstrates that our method achieves up to 1.74 dB improvement in PSNR, 6.7% in SSIM, and 41.4% in LPIPS at equivalent model sizes compared to uniform pruning baselines. For indoor scenes, our method can maintain visual quality with only 0.5% of the original model size. This structure-aware representation enables efficient storage, adaptive streaming, and rendering of high-fidelity 3D content across bandwidth-constrained networks and resource-limited devices.
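
For readers less familiar with the headline metric, the sketch below computes PSNR under its standard definition. The function name and the assumption that images are nested lists of floats in [0, 1] are illustrative choices, not taken from the paper.

```python
import math


def psnr(reference, rendered, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE).

    Both images are 2D lists of pixel intensities in [0, max_val].
    """
    ref = [v for row in reference for v in row]
    out = [v for row in rendered for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_val**2 / mse)
```

Because PSNR is logarithmic, a gain of 1.74 dB corresponds to an MSE ratio of 10^0.174, about 1.49, that is, roughly one third less squared error at the same model size.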

[63] Multi-task Cross-modal Learning for Chest X-ray Image Retrieval

Zhaohui Liang,Sivaramakrishnan Rajaraman,Niccolo Marini,Zhiyun Xue,Sameer Antani

Main category: cs.CV

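TL;DR: Proposes a multi-task learning framework for fine-tuning BiomedCLIP to improve chest X-ray image-text retrieval; combining three loss functions, it delivers more balanced and clinically meaningful results on retrieval tasks.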

Details Motivation: CLIP and BiomedCLIP are not optimized for fine-grained medical retrieval tasks (such as retrieving radiology reports with chest X-ray images), so their performance in clinical settings needs improvement. Method: Built on the BiomedCLIP architecture, a lightweight MLP projection head is added and fine-tuned with a multi-task composite loss comprising binary cross-entropy loss, supervised contrastive loss, and CLIP loss. Result: The fine-tuned model outperforms pretrained BiomedCLIP and general-purpose CLIP on both image-to-text and text-to-image retrieval, and t-SNE visualizations show clearer semantic clustering of normal and abnormal cases. Conclusion: Domain-adaptive multi-task learning effectively improves biomedical cross-modal retrieval and enhances diagnostic sensitivity. Abstract: CLIP and BiomedCLIP are examples of vision-language foundation models and offer strong cross-modal embeddings; however, they are not optimized for fine-grained medical retrieval tasks, such as retrieving clinically relevant radiology reports using chest X-ray (CXR) image queries. To address this shortcoming, we propose a multi-task learning framework to fine-tune BiomedCLIP and evaluate improvements to CXR image-text retrieval. Using BiomedCLIP as the backbone, we incorporate a lightweight MLP projector head trained with a multi-task composite loss function that includes: (1) a binary cross-entropy loss to distinguish normal from abnormal CXR studies, (2) a supervised contrastive loss to reinforce intra-class consistency, and (3) a CLIP loss to maintain cross-modal alignment. Experimental results demonstrate that the fine-tuned model achieves more balanced and clinically meaningful performance across both image-to-text and text-to-image retrieval tasks compared to the pretrained BiomedCLIP and general-purpose CLIP models. Furthermore, t-SNE visualizations reveal clearer semantic clustering of normal and abnormal cases, demonstrating the model's enhanced diagnostic sensitivity. These findings highlight the value of domain-adaptive, multi-task learning for advancing cross-modal retrieval in biomedical applications.
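
A minimal sketch of two of the three loss terms and their combination follows. The abstract does not give the weights or the temperature, so `weights` and `temperature` below are hypothetical defaults, and the supervised contrastive term is passed in as a precomputed number rather than implemented.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def clip_loss(sim, temperature=0.07):
    """Symmetric contrastive (CLIP-style) loss.

    sim[i][j] is the cosine similarity between image i and text j;
    matching pairs sit on the diagonal.
    """
    n = len(sim)
    logits = [[v / temperature for v in row] for row in sim]
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (image_to_text + text_to_image) / 2


def bce_loss(prob, label):
    """Binary cross-entropy for one normal (0) vs abnormal (1) prediction."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


def composite_loss(l_bce, l_supcon, l_clip, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; equal weights are an assumption."""
    return weights[0] * l_bce + weights[1] * l_supcon + weights[2] * l_clip
```

The CLIP term keeps image and report embeddings aligned, while the classification and supervised contrastive terms pull studies with the same normal/abnormal label together.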

[64] Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization

Yuxiang Ji,Yong Wang,Ziyu Ma,Yiming Hu,Hailang Huang,Xuecai Hu,Guanhua Chen,Liaoni Wu,Xiangxiang Chu

Main category: cs.CV

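TL;DR: This paper proposes a new image geolocalization method that gives the model a "Thinking with Map" ability and applies a two-stage optimization strategy (agentic reinforcement learning followed by parallel test-time scaling), substantially improving localization accuracy. It also builds MAPBench, a new benchmark of real-world images, for evaluation.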

Details Motivation: Existing image geolocalization methods based on large vision-language models overlook the map-use strategy humans commonly rely on and lack an effective spatial reasoning mechanism, which limits localization accuracy. Method: The "Thinking with Map" framework builds an agent-in-the-map loop and uses two-stage optimization: agentic reinforcement learning first improves sampling efficiency, then parallel test-time scaling (TTS) explores multiple candidate paths to refine the final prediction. Result: On the newly built real-world benchmark MAPBench, Acc@500m rises from 8.0% to 22.1% compared with Gemini-3-Pro in Google Search/Map grounded mode, and the method outperforms existing open- and closed-source models. Conclusion: Introducing maps as a reasoning tool in vision-language models substantially improves image geolocalization, and the proposed agentic learning and test-time scaling strategies strengthen the model's spatial decision making. Abstract: The image geolocalization task aims to predict the location where an image was taken anywhere on Earth using visual clues. Existing large vision-language model (LVLM) approaches leverage world knowledge, chain-of-thought reasoning, and agentic capabilities, but overlook a common strategy used by humans -- using maps. In this work, we first equip the model with the Thinking with Map ability and formulate it as an agent-in-the-map loop. We develop a two-stage optimization scheme for it, including agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS). The RL strengthens the agentic capability of the model to improve sampling efficiency, and the parallel TTS enables the model to explore multiple candidate paths before making the final prediction, which is crucial for geolocalization. To evaluate our method on up-to-date and in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images. Experimental results show that our method outperforms existing open- and closed-source models on most metrics, specifically improving Acc@500m from 8.0% to 22.1% compared to Gemini-3-Pro with Google Search/Map grounded mode.
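
Acc@500m is the fraction of predictions that land within 500 m of the true location. The sketch below uses the haversine great-circle distance; the benchmark's exact distance formula is not stated in the abstract, so that choice and the function names are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def acc_at(preds, truths, threshold_m=500.0):
    """Fraction of predicted (lat, lon) pairs within threshold_m of the truth."""
    hits = sum(haversine_m(*p, *t) <= threshold_m for p, t in zip(preds, truths))
    return hits / len(preds)
```

At the equator 0.001 degrees of longitude is about 111 m, so a 500 m threshold tolerates errors of only a few thousandths of a degree, which helps explain why the reported accuracies are low in absolute terms.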

[65] TAPM-Net: Trajectory-Aware Perturbation Modeling for Infrared Small Target Detection

Hongyang Xie,Hongyang He,Victor Sanchez

Main category: cs.CV

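TL;DR: Proposes TAPM-Net, which improves infrared small target detection by modeling the propagation trajectories of target-induced perturbations in feature space.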

Details Motivation: Existing CNN and ViT models lack a mechanism for tracing the directional, layer-wise perturbation paths that small targets trigger in feature space, making it hard to separate signal from complex background noise. Method: A PGM module builds perturbation energy fields and extracts gradient-following feature trajectories; a Mamba-based TASB module models dynamic propagation along the trajectories, with velocity-constrained diffusion and semantically aligned fusion. Result: State-of-the-art detection performance on the NUAA-SIRST and IRSTD-1K datasets, surpassing existing attention-based methods. Conclusion: By explicitly modeling the feature perturbation trajectories of small targets, TAPM-Net achieves efficient, anisotropic state transitions and improves detection accuracy at low computational cost. Abstract: Infrared small target detection (ISTD) remains a long-standing challenge due to weak signal contrast, limited spatial extent, and cluttered backgrounds. Despite performance improvements from convolutional neural networks (CNNs) and Vision Transformers (ViTs), current models lack a mechanism to trace how small targets trigger directional, layer-wise perturbations in the feature space, which is an essential cue for distinguishing signal from structured noise in infrared scenes. To address this limitation, we propose the Trajectory-Aware Mamba Propagation Network (TAPM-Net), which explicitly models the spatial diffusion behavior of target-induced feature disturbances. TAPM-Net is built upon two novel components: a Perturbation-guided Path Module (PGM) and a Trajectory-Aware State Block (TASB). The PGM constructs perturbation energy fields from multi-level features and extracts gradient-following feature trajectories that reflect the directionality of local responses. The resulting feature trajectories are fed into the TASB, a Mamba-based state-space unit that models dynamic propagation along each trajectory while incorporating velocity-constrained diffusion and semantically aligned feature fusion. Unlike existing attention-based methods, TAPM-Net enables anisotropic, context-sensitive state transitions along spatial trajectories while maintaining global coherence at low computational cost. Experiments on NUAA-SIRST and IRSTD-1K demonstrate that TAPM-Net achieves state-of-the-art performance in ISTD.

[66] ROAP: A Reading-Order and Attention-Prior Pipeline for Optimizing Layout Transformers in Key Information Extraction

Tingwei Xie,Jinxin He,Yonghong Song

Main category: cs.CV

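TL;DR: This paper proposes ROAP, a lightweight, architecture-agnostic pipeline that optimizes attention distributions in layout-aware Transformers by modeling reading order and suppressing visual noise, improving understanding of complex documents.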

Details Motivation: Existing multimodal Transformers are limited on visually rich documents by the lack of explicit modeling of logical reading order and by visual tokens interfering with attention on textual semantics. Method: ROAP first uses an Adaptive-XY-Gap Tree (AXG-Tree) to extract hierarchical reading sequences, then injects them into the attention mechanism through a Reading-Order-Aware Relative Position Bias (RO-RPB), and adds a Textual-Token Sub-block Attention Prior (TT-Prior) to adaptively suppress visual noise and strengthen fine-grained text-text interactions. Result: Experiments on the FUNSD and CORD benchmarks show that ROAP consistently improves mainstream backbones such as LayoutLMv3 and GeoLayoutLM. Conclusion: Explicitly modeling reading logic and regulating interference between modalities are critical for robust document understanding, and ROAP offers a scalable solution for complex layout analysis. Abstract: The efficacy of Multimodal Transformers in visually-rich document understanding (VrDU) is critically constrained by two inherent limitations: the lack of explicit modeling for logical reading order and the interference of visual tokens that dilutes attention on textual semantics. To address these challenges, this paper presents ROAP, a lightweight and architecture-agnostic pipeline designed to optimize attention distributions in Layout Transformers without altering their pre-trained backbones. The proposed pipeline first employs an Adaptive-XY-Gap Tree (AXG-Tree) to robustly extract hierarchical reading sequences from complex layouts. These sequences are then integrated into the attention mechanism via a Reading-Order-Aware Relative Position Bias (RO-RPB). Furthermore, a Textual-Token Sub-block Attention Prior (TT-Prior) is introduced to adaptively suppress visual noise and enhance fine-grained text-text interactions. Extensive experiments on the FUNSD and CORD benchmarks demonstrate that ROAP consistently improves the performance of representative backbones, including LayoutLMv3 and GeoLayoutLM. These findings confirm that explicitly modeling reading logic and regulating modality interference are critical for robust document understanding, offering a scalable solution for complex layout analysis. The implementation code will be released at https://github.com/KevinYuLei/ROAP.

[67] Multi-Image Super Resolution Framework for Detection and Analysis of Plant Roots

Shubham Agarwal,Ofek Nourian,Michael Sidorov,Sharon Chemweno,Ofer Hadar,Naftali Lazarovitch,Jhonathan E. Ephrath

Main category: cs.CV

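TL;DR: Proposes a deep learning-based Multi-Image Super Resolution (MISR) framework, combined with an underground imaging system, that improves the visibility and detail of plant roots and enables more accurate root trait analysis.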

Details Motivation: Occlusion, varying soil moisture, and low contrast in underground environments make it hard for conventional vision methods to image plant roots clearly, which holds back root research. Method: A new underground imaging system captures multiple overlapping views of roots, and a synthetic dataset simulating realistic underground conditions is built; a deep learning-based MISR framework exploits the spatial redundancy across views to reconstruct high-resolution images. Result: In quantitative evaluation the method outperforms existing super-resolution methods, lowering the BRISQUE score by 2.3% with an unchanged CLIP-IQA score; the improved image quality and structural fidelity allow more accurate estimation of key traits such as root hair count and density. Conclusion: The framework offers an effective solution for automatic imaging and trait quantification of underground plant roots, with value for agricultural and ecological research. Abstract: Understanding plant root systems is critical for advancing research in soil-plant interactions, nutrient uptake, and overall plant health. However, accurate imaging of roots in subterranean environments remains a persistent challenge due to adverse conditions such as occlusion, varying soil moisture, and inherently low contrast, which limit the effectiveness of conventional vision-based approaches. In this work, we propose a novel underground imaging system that captures multiple overlapping views of plant roots and integrates a deep learning-based Multi-Image Super Resolution (MISR) framework designed to enhance root visibility and detail. To train and evaluate our approach, we construct a synthetic dataset that simulates realistic underground imaging scenarios, incorporating key environmental factors that affect image quality. Our proposed MISR algorithm leverages spatial redundancy across views to reconstruct high-resolution images with improved structural fidelity and visual clarity. Quantitative evaluations show that our approach outperforms state-of-the-art super resolution baselines, achieving a 2.3 percent reduction in BRISQUE, indicating improved image quality with the same CLIP-IQA score, thereby enabling enhanced phenotypic analysis of root systems. This, in turn, facilitates accurate estimation of critical root traits, including root hair count and root hair density. The proposed framework presents a promising direction for robust automatic underground plant root imaging and trait quantification for agricultural and ecological research.

[68] Hippocampal Atrophy Patterns Across the Alzheimer's Disease Spectrum: A Voxel-Based Morphometry Analysis

Trishna Niraula

Main category: cs.CV

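TL;DR: Using CAT12/SPM12 voxel-based morphometry on ADNI data, the study finds marked hippocampal atrophy in Alzheimer's disease (AD) patients and moderate ability of hippocampal volume to predict conversion from MCI to AD, but no significant effect of APOE4 genotype on hippocampal volume.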

Details Motivation: To examine patterns of gray matter atrophy in Alzheimer's disease (AD) and mild cognitive impairment (MCI), particularly changes in medial temporal structures, and to assess their potential as biomarkers of disease progression. Method: Voxel-based morphometry (VBM) with CAT12/SPM12 is applied to baseline T1-weighted MRI scans from 249 ADNI participants; a general linear model tests the effect of diagnostic group on gray matter volume, with age and total intracranial volume as covariates, a voxel-level threshold of p < 0.001, and cluster-level FWE correction (p < 0.05). Result: The AD group shows significant hippocampal atrophy relative to the CN and MCI groups (Cohen's d = 2.03 and 1.61); hippocampal volume has moderate predictive power for conversion from MCI to AD (AUC = 0.66); stratification by APOE4 status reveals no significant genetic effect on cross-sectional hippocampal volume. Conclusion: Medial temporal degeneration is a key feature of AD progression, and hippocampal volume is a potential predictive biomarker, while APOE4 shows no significant effect on hippocampal volume in this sample. Abstract: Alzheimer's disease (AD) and mild cognitive impairment (MCI) are associated with progressive gray matter loss, particularly in medial temporal structures. In this study, CAT12/SPM12 voxel-based morphometry was applied to baseline T1-weighted MRI scans from 249 ADNI participants (CN = 90, MCI = 129, AD = 30). Gray matter volume was analyzed using a general linear model, with the diagnostic group as primary predictor and age and total intracranial volume as covariates. Statistical maps were thresholded at p < 0.001 (voxelwise) and corrected for multiple comparisons at the cluster level using family-wise error (FWE) correction (p < 0.05). Significant hippocampal atrophy was observed in AD relative to CN and MCI (Cohen's d = 2.03 and 1.61, respectively). Hippocampal volume demonstrated moderate predictive value for conversion from MCI to AD (AUC = 0.66). Stratification by APOE4 status did not reveal significant genetic effects on cross-sectional hippocampal volume. These results support medial temporal degeneration as a key feature of AD progression and provide insights into predictive biomarkers and genetic influences.
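
The effect sizes above can be read with the usual pooled-standard-deviation form of Cohen's d, sketched below. The abstract does not say which variant was used or whether volumes were adjusted for covariates first, so this form is an assumption, and the sample values in the test are made up.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    n1, n2 = len(group_a), len(group_b)
    pooled_var = ((n1 - 1) * variance(group_a) + (n2 - 1) * variance(group_b)) / (n1 + n2 - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)
```

By the common rule of thumb, d = 0.8 already counts as a large effect. A value of 2.03 means the CN and AD group means sit about two pooled standard deviations apart.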

[69] MMViR: A Multi-Modal and Multi-Granularity Representation for Long-range Video Understanding

Zizhong Li,Haopeng Zhang,Jiawei Zhang

Main category: cs.CV

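TL;DR: This paper proposes MMViR, a multi-modal, multi-granularity structured representation that improves both the effectiveness and the efficiency of long video understanding.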

Details Motivation: Existing multimodal large models face high computational cost and redundant or fragmented information when processing long videos, and struggle to capture complex events and long-range dependencies. Method: The video is segmented at identified key turning points, and a three-level description structure combining global narrative with fine-grained visual detail is built, enabling efficient query-based retrieval and generalization across scenarios. Result: In extensive evaluation on three tasks (question answering, summarization, and retrieval), MMViR improves hour-long video understanding by 19.67% over the prior best method while cutting processing latency to 45.4% of the original. Conclusion: Through structured representation, MMViR effectively balances performance and efficiency in long video understanding, with good generality and practical potential. Abstract: Long videos, ranging from minutes to hours, present significant challenges for current Multi-modal Large Language Models (MLLMs) due to their complex events, diverse scenes, and long-range dependencies. Direct encoding of such videos is computationally too expensive, while simple video-to-text conversion often results in redundant or fragmented content. To address these limitations, we introduce MMViR, a novel multi-modal, multi-grained structured representation for long video understanding. MMViR identifies key turning points to segment the video and constructs a three-level description that couples global narratives with fine-grained visual details. This design supports efficient query-based retrieval and generalizes well across various scenarios. Extensive evaluations across three tasks, including QA, summarization, and retrieval, show that MMViR outperforms the prior strongest method, achieving a 19.67% improvement in hour-long video understanding while reducing processing latency to 45.4% of the original.

[70] Prompt-Free SAM-Based Multi-Task Framework for Breast Ultrasound Lesion Segmentation and Classification

Samuel E. Johnny,Bernes L. Atabonfack,Israel Alagbe,Assane Gueye

Main category: cs.CV

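TL;DR: Proposes a SAM-based multi-task deep learning framework for lesion segmentation and diagnostic classification in breast ultrasound images; it is prompt-free and fully supervised, using SAM vision encoder features for segmentation and mask-guided attention to improve classification.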

Details Motivation: Accurate tumor segmentation and classification in breast ultrasound images remain challenging because of low contrast, speckle noise, and diverse lesion morphology. Method: A multi-task deep learning framework jointly performs lesion segmentation and diagnostic classification; high-dimensional features from the SAM vision encoder are decoded by a lightweight convolutional head or a UNet-style decoder for pixel-wise segmentation, and the classification branch uses mask-guided attention to focus on lesion-relevant features and suppress background interference. Result: On the PRECISE 2025 dataset the method reaches a Dice Similarity Coefficient of 0.887 and a classification accuracy of 92.3%, ranking among the top entries on the challenge leaderboard. Conclusion: SAM-based representations combined with segmentation-guided learning significantly improve lesion delineation and diagnostic prediction in breast ultrasound images. Abstract: Accurate tumor segmentation and classification in breast ultrasound (BUS) imaging remain challenging due to low contrast, speckle noise, and diverse lesion morphology. This study presents a multi-task deep learning framework that jointly performs lesion segmentation and diagnostic classification using embeddings from the Segment Anything Model (SAM) vision encoder. Unlike prompt-based SAM variants, our approach employs a prompt-free, fully supervised adaptation where high-dimensional SAM features are decoded through either a lightweight convolutional head or a UNet-inspired decoder for pixel-wise segmentation. The classification branch is enhanced via mask-guided attention, allowing the model to focus on lesion-relevant features while suppressing background artifacts. Experiments on the PRECISE 2025 breast ultrasound dataset, split per class into 80 percent training and 20 percent testing, show that the proposed method achieves a Dice Similarity Coefficient (DSC) of 0.887 and an accuracy of 92.3 percent, ranking among the top entries on the PRECISE challenge leaderboard. These results demonstrate that SAM-based representations, when coupled with segmentation-guided learning, significantly improve both lesion delineation and diagnostic prediction in breast ultrasound imaging.
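
The Dice Similarity Coefficient reported above follows the standard definition 2|A ∩ B| / (|A| + |B|). The sketch below assumes binary masks stored as nested lists; the smoothing term `eps` is a common convention, not something the abstract specifies.

```python
def dice(pred_mask, true_mask, eps=1e-8):
    """Dice Similarity Coefficient between two binary masks (2D lists of 0/1).

    eps keeps the ratio defined when both masks are empty.
    """
    intersection = sum(
        p * t
        for pred_row, true_row in zip(pred_mask, true_mask)
        for p, t in zip(pred_row, true_row)
    )
    total = sum(map(sum, pred_mask)) + sum(map(sum, true_mask))
    return (2 * intersection + eps) / (total + eps)
```

A DSC of 0.887 means the overlap is about 89% of the average size of the predicted and reference lesion masks.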

[71] Enabling Stroke-Level Structural Analysis of Hieroglyphic Scripts without Language-Specific Priors

Fuwen Luo,Zihao Wan,Ziyue Wang,Yaluo Liu,Pau Tong Lin Xu,Xuanjia Qiao,Xiaolong Wang,Peng Li,Yang Liu

Main category: cs.CV

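TL;DR: This paper proposes HieroSA, a new framework that automatically extracts stroke-level structures of hieroglyphic characters from character bitmaps without hand-annotated data and generalizes across languages.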

Details Motivation: Existing large language models and multimodal large language models are insensitive to the internal structure of hieroglyphic characters, and current structural analysis methods are usually script-specific and labor-intensive. Method: The HieroSA framework converts images of modern logographic and ancient hieroglyphic characters into line-segment representations in a normalized coordinate space, so that multimodal large language models can understand character-internal structure. Result: Experiments show that HieroSA effectively captures character-internal structure and semantics without language-specific priors and generalizes well across multiple writing systems. Conclusion: HieroSA provides a general, interpretable new tool for glyph analysis of hieroglyphic scripts and may advance cross-lingual research on and understanding of ancient writing. Abstract: Hieroglyphs, as logographic writing systems, encode rich semantic and cultural information within their internal structural composition. Yet, current advanced Large Language Models (LLMs) and Multimodal LLMs (MLLMs) usually remain structurally blind to this information. LLMs process characters as textual tokens, while MLLMs additionally view them as raw pixel grids. Both fall short of modeling the underlying logic of character strokes. Furthermore, existing structural analysis methods are often script-specific and labor-intensive. In this paper, we propose Hieroglyphic Stroke Analyzer (HieroSA), a novel and generalizable framework that enables MLLMs to automatically derive stroke-level structures from character bitmaps without handcrafted data. It transforms character images from modern logographic and ancient hieroglyphic scripts into explicit, interpretable line-segment representations in a normalized coordinate space, allowing for cross-lingual generalization. Extensive experiments demonstrate that HieroSA effectively captures character-internal structures and semantics, bypassing the need for language-specific priors. Experimental results highlight the potential of our work as a graphematics analysis tool for a deeper understanding of hieroglyphic scripts. View our code at https://github.com/THUNLP-MT/HieroSA.

[72] GaussianSwap: Animatable Video Face Swapping with 3D Gaussian Splatting

Xuan Cheng,Jiahao Rao,Chengyang Li,Wenhao Wang,Weilin Chen,Lvqing Yang

Main category: cs.CV

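TL;DR: This paper proposes GaussianSwap, a novel video face swapping framework that uses 3D Gaussian Splatting to build a face avatar from the target video and transfers identity from a source image, producing high-fidelity, dynamically controllable face swaps.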

Details Motivation: Results from conventional pixel-based video face swapping lack a structured representation and cannot be animated or manipulated interactively, which limits their range of applications. Method: The framework first preprocesses the target video to extract FLAME parameters, camera poses, and segmentation masks, then rigs 3D Gaussian splats to the FLAME model across frames for dynamic facial control; it also proposes a compound identity embedding built from three state-of-the-art face recognition models for avatar finetuning, and finally renders the swapped avatar onto the background frames to produce the output video. Result: Experiments show that GaussianSwap performs strongly in identity preservation, visual clarity, and temporal consistency, and supports interactive applications that were previously hard to realize. Conclusion: GaussianSwap marks a shift from conventional pixel-level generation to structured, high-fidelity avatar generation, opening new possibilities for video face swapping. Abstract: We introduce GaussianSwap, a novel video face swapping framework that constructs a 3D Gaussian Splatting based face avatar from a target video while transferring identity from a source image to the avatar. Conventional video swapping frameworks are limited to generating facial representations in pixel-based formats. The resulting swapped faces exist merely as a set of unstructured pixels without any capacity for animation or interactive manipulation. Our work introduces a paradigm shift from conventional pixel-based video generation to the creation of high-fidelity avatars with swapped faces. The framework first preprocesses the target video to extract FLAME parameters, camera poses and segmentation masks, and then rigs 3D Gaussian splats to the FLAME model across frames, enabling dynamic facial control. To ensure identity preservation, we propose a compound identity embedding constructed from three state-of-the-art face recognition models for avatar finetuning. Finally, we render the face-swapped avatar on the background frames to obtain the face-swapped video. Experimental results demonstrate that GaussianSwap achieves superior identity preservation, visual clarity and temporal consistency, while enabling previously unattainable interactive applications.

[73] SAS-VPReID: A Scale-Adaptive Framework with Shape Priors for Video-based Person Re-Identification at Extreme Far Distances

Qiwei Yang,Pingping Zhang,Yuhao Wang,Zijing Gong

Main category: cs.CV

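TL;DR: Proposes SAS-VPReID, a scale-adaptive framework for video-based person re-identification with three modules (a memory-enhanced visual backbone, multi-granularity temporal modeling, and prior-regularized shape dynamics), which performs strongly on the VReID-XFD benchmark.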

Details Motivation: Video-based person re-identification at far distances faces low resolution, large viewpoint changes, and heavy appearance noise, and existing methods struggle to extract discriminative features. Method: The SAS-VPReID framework comprises: 1) a Memory-Enhanced Visual Backbone (MEVB) based on CLIP and multi-proxy memory; 2) Multi-Granularity Temporal Modeling (MGTM) to capture motion cues across scales; 3) Prior-Regularized Shape Dynamics (PRSD) to capture body structure dynamics. Result: Experiments on the VReID-XFD benchmark confirm the effectiveness of each module, and the full framework ranks first on the challenge leaderboard. Conclusion: By combining a memory mechanism, multi-granularity temporal modeling, and shape priors, SAS-VPReID effectively improves video-based person re-identification at far distances. Abstract: Video-based Person Re-IDentification (VPReID) aims to retrieve the same person from videos captured by non-overlapping cameras. At extreme far distances, VPReID is highly challenging due to severe resolution degradation, drastic viewpoint variation and inevitable appearance noise. To address these issues, we propose a Scale-Adaptive framework with Shape Priors for VPReID, named SAS-VPReID. The framework is built upon three complementary modules. First, we deploy a Memory-Enhanced Visual Backbone (MEVB) to extract discriminative feature representations, which leverages the CLIP vision encoder and multi-proxy memory. Second, we propose Multi-Granularity Temporal Modeling (MGTM) to construct sequences at multiple temporal granularities and adaptively emphasize motion cues across scales. Third, we incorporate Prior-Regularized Shape Dynamics (PRSD) to capture body structure dynamics. With these modules, our framework can obtain more discriminative feature representations. Experiments on the VReID-XFD benchmark demonstrate the effectiveness of each module and our final framework ranks first on the VReID-XFD challenge leaderboard. The source code is available at https://github.com/YangQiWei3/SAS-VPReID.

[74] DIFF-MF: A Difference-Driven Channel-Spatial State Space Model for Multi-Modal Image Fusion

Yiming Sun,Zifan Ye,Qinghua Hu,Pengfei Zhu

Main category: cs.CV

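TL;DR: Proposes DIFF-MF, a difference-driven channel-spatial state space model for multi-modal image fusion that uses feature discrepancy maps between modalities to guide feature extraction, achieving better fusion across both channel and spatial dimensions.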

Details Motivation: Existing methods struggle to balance infrared intensity against visible-light detail, tending to let one dominate while the other is weakened, so a fusion method is needed that preserves both thermal target salience and visible structural detail. Method: The proposed DIFF-MF model uses feature discrepancy maps between modalities to guide feature extraction; in the channel dimension, a channel-exchange module combined with cross-attention dual state space modeling strengthens channel interaction; in the spatial dimension, cross-modal state space scanning achieves comprehensive spatial fusion. Result: Experiments on driving-scenario and low-altitude UAV datasets show that the method outperforms existing approaches in both visual quality and quantitative metrics. Conclusion: DIFF-MF effectively integrates complementary multi-modal features and improves fused image quality while maintaining linear computational complexity, making it well suited to multi-modal image fusion. Abstract: Multi-modal image fusion aims to integrate complementary information from multiple source images to produce high-quality fused images with enriched content. Although existing approaches based on state space models have achieved satisfactory performance with high computational efficiency, they tend to either over-prioritize infrared intensity at the cost of visible details, or conversely, preserve visible structure while diminishing thermal target salience. To overcome these challenges, we propose DIFF-MF, a novel difference-driven channel-spatial state space model for multi-modal image fusion. Our approach leverages feature discrepancy maps between modalities to guide feature extraction, followed by a fusion process across both channel and spatial dimensions. In the channel dimension, a channel-exchange module enhances channel-wise interaction through cross-attention dual state space modeling, enabling adaptive feature reweighting. In the spatial dimension, a spatial-exchange module employs cross-modal state space scanning to achieve comprehensive spatial fusion. By efficiently capturing global dependencies while maintaining linear computational complexity, DIFF-MF effectively integrates complementary multi-modal features. Experimental results on driving-scenario and low-altitude UAV datasets demonstrate that our method outperforms existing approaches in both visual quality and quantitative evaluation.

[75] MoGen: A Unified Collaborative Framework for Controllable Multi-Object Image Generation

Yanfeng Li,Yue Sun,Keren Fu,Sio-Kei Im,Xiaoming Liu,Guangtao Zhai,Xiaohong Liu,Tao Tan

Main category: cs.CV

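TL;DR: Proposes MoGen, a user-friendly multi-object image generation method that uses a Regional Semantic Anchor (RSA) module and an Adaptive Multi-modal Guidance (AMG) module to align language descriptions precisely with image regions and to provide dynamic fine-grained control.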

Details Motivation: Existing methods suffer from inconsistent object counts and attribute aliasing when aligning language descriptions with image semantics, and their reliance on external control signals makes input formats rigid and hard to adapt to different user needs. Method: A Regional Semantic Anchor (RSA) module aligns text phrases with image regions, and an Adaptive Multi-modal Guidance (AMG) module adaptively fuses multiple control signals to provide flexible guidance from structured intent. Result: Experiments show that MoGen significantly outperforms existing methods in generation quality, object-count consistency, and fine-grained control, while offering better accessibility and control flexibility. Conclusion: MoGen effectively addresses semantic alignment and control flexibility in multi-object image generation, giving users a more practical and compatible generation framework. Abstract: Existing multi-object image generation methods face difficulties in achieving precise alignment between localized image generation regions and their corresponding semantics based on language descriptions, frequently resulting in inconsistent object quantities and attribute aliasing. To mitigate this limitation, mainstream approaches typically rely on external control signals to explicitly constrain the spatial layout, local semantics, and visual attributes of images. However, this strong dependency makes the input format rigid, rendering it incompatible with the heterogeneous resource conditions of users and diverse constraint requirements. To address these challenges, we propose MoGen, a user-friendly multi-object image generation method. First, we design a Regional Semantic Anchor (RSA) module that precisely anchors phrase units in language descriptions to their corresponding image regions during the generation process, enabling text-to-image generation that follows quantity specifications for multiple objects. Building upon this foundation, we further introduce an Adaptive Multi-modal Guidance (AMG) module, which adaptively parses and integrates various combinations of multi-source control signals to formulate corresponding structured intent. This intent subsequently guides selective constraints on scene layouts and object attributes, achieving dynamic fine-grained control. Experimental results demonstrate that MoGen significantly outperforms existing methods in generation quality, quantity consistency, and fine-grained control, while exhibiting superior accessibility and control flexibility. Code is available at: https://github.com/Tear-kitty/MoGen/tree/master.

[76] VIB-Probe: Detecting and Mitigating Hallucinations in Vision-Language Models via Variational Information Bottleneck

Feiran Zhang,Yixin Wu,Zhenghua Wang,Xiaohua Wang,Changze Lv,Xuanjing Huang,Xiaoqing Zheng

Main category: cs.CV

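TL;DR: Proposes VIB-Probe, a framework that uses Variational Information Bottleneck theory to extract discriminative patterns from internal attention heads, effectively detecting and mitigating hallucinations in vision-language models.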

Details Motivation: Existing hallucination detection methods mostly rely on output logits or external tools and overlook the model's internal mechanisms, while directly probing internal attention states is difficult because visual-linguistic syntax is entangled with noise. Method: VIB-Probe is built on Variational Information Bottleneck (VIB) theory: it filters out semantic nuisances via the information bottleneck principle, extracts discriminative patterns across layers and heads, uses the probe's gradients to identify attention heads with causal influence on hallucinations, and then applies an inference-time intervention strategy. Result: Experiments on multiple benchmarks show that VIB-Probe significantly outperforms existing baselines in both hallucination detection and mitigation. Conclusion: VIB-Probe effectively uncovers and exploits the truthful-generation signals in the internal attention heads of VLMs, offering a new approach to understanding and mitigating multimodal hallucinations. Abstract: Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal tasks, but remain susceptible to hallucinations, where generated text deviates from the underlying visual content. Existing hallucination detection methods primarily rely on output logits or external verification tools, often overlooking the models' internal mechanisms. In this work, we investigate the outputs of internal attention heads, postulating that specific heads carry the primary signals for truthful generation. However, directly probing these high-dimensional states is challenging due to the entanglement of visual-linguistic syntax and noise. To address this, we propose VIB-Probe, a novel hallucination detection and mitigation framework leveraging the Variational Information Bottleneck (VIB) theory. Our method extracts discriminative patterns across layers and heads while filtering out semantic nuisances through the information bottleneck principle. Furthermore, by leveraging the gradients of our VIB probe, we identify attention heads with strong causal influence on hallucinations and introduce an inference-time intervention strategy for hallucination mitigation. Extensive experiments across diverse benchmarks demonstrate that VIB-Probe significantly outperforms existing baselines in both settings. Our code will be made publicly available.

[77] One Language-Free Foundation Model Is Enough for Universal Vision Anomaly Detection

Bin-Bin Gao,Chengjie Wang

Main category: cs.CV

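TL;DR: Proposes UniADet, an extremely simple, general, and efficient framework for universal visual anomaly detection that decouples classification from segmentation and decouples cross-level features, learning only a small number of independent weights. It substantially outperforms existing methods in zero-shot and few-shot settings, surpasses fully supervised methods for the first time, and is validated on 14 real-world anomaly detection benchmarks.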

Details Motivation: Existing universal anomaly detection methods based on vision-language models depend on complex prompt engineering, adaptation modules, and training strategies, which limits their flexibility and generality; a simpler, more general solution is needed. Method: The paper finds that the language encoder serves only to generate decision weights for anomaly classification and segmentation and can therefore be removed; it proposes fully decoupling the classification and segmentation tasks as well as features at different levels, learning independent lightweight weights for each task and level to enable simple, efficient inference. Result: With only 0.002M learnable parameters, UniADet substantially outperforms existing zero-shot and few-shot methods on 14 real-world anomaly detection benchmarks spanning industrial and medical domains, and surpasses fully supervised methods for the first time. Conclusion: UniADet achieves efficient, general anomaly detection through a minimalist design, demonstrates the effectiveness of the decoupling strategy, and offers a new direction for anomaly detection built on vision-language models, with good extensibility and application prospects. Abstract: Universal visual anomaly detection (AD) aims to identify anomaly images and segment anomaly regions in open and dynamic scenarios, following zero- and few-shot paradigms without any dataset-specific fine-tuning. Recent approaches have made significant progress through the wide use of visual-language foundation models. However, current methods often struggle with complex prompt engineering, elaborate adaptation modules, and challenging training strategies, ultimately limiting their flexibility and generality. To address these issues, this paper rethinks the fundamental mechanism behind visual-language models for AD and presents an embarrassingly simple, general, and effective framework for Universal vision Anomaly Detection (UniADet). Specifically, we first find that the language encoder is used to derive decision weights for anomaly classification and segmentation, and then demonstrate that it is unnecessary for universal AD. Second, we propose an embarrassingly simple method to completely decouple classification and segmentation, and decouple cross-level features, i.e., learning independent weights for different tasks and hierarchical features. UniADet is highly simple (learning only decoupled weights), parameter-efficient (only 0.002M learnable parameters), general (adapting to a variety of foundation models), and effective (surpassing state-of-the-art zero-/few-shot methods by a large margin and even full-shot AD methods for the first time) on 14 real-world AD benchmarks covering both industrial and medical domains. We will make the code and model of UniADet available at https://github.com/gaobb/UniADet.

[78] Semi-Supervised Facial Expression Recognition based on Dynamic Threshold and Negative Learning

Zhongpeng Cai,Jun Yu,Wei Xu,Tianyu Liu,Jianqing Sun,Jiaen Liang

Main category: cs.CV

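TL;DR: Proposes a semi-supervised facial expression recognition algorithm based on Dynamic Threshold Adjustment (DTA) and Selective Negative Learning (SNL) that achieves state-of-the-art performance on the RAF-DB and AffectNet datasets by making effective use of both labeled and unlabeled data.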

Details Motivation: Acquiring large amounts of labeled facial expression data is costly, so a semi-supervised algorithm is needed that makes full use of both labeled and unlabeled data. Method: Local attention enhancement and random dropout of feature maps are introduced during feature extraction, and dynamic threshold adjustment is combined with selective negative learning to mine useful information from the complementary labels of low-confidence unlabeled samples. Result: The method reaches state-of-the-art performance on the RAF-DB and AffectNet datasets and surpasses fully supervised methods even without using the entire dataset. Conclusion: The proposed DTA and SNL methods effectively improve semi-supervised facial expression recognition and show strong practicality and generalization. Abstract: Facial expression recognition is a key task in human-computer interaction and affective computing. However, acquiring a large amount of labeled facial expression data is often costly. Therefore, it is particularly important to design a semi-supervised facial expression recognition algorithm that makes full use of both labeled and unlabeled data. In this paper, we propose a semi-supervised facial expression recognition algorithm based on Dynamic Threshold Adjustment (DTA) and Selective Negative Learning (SNL). First, we design strategies for local attention enhancement and random dropout of feature maps during feature extraction, which strengthen the representation of local features while ensuring the model does not overfit to any specific local area. Furthermore, we introduce a dynamic thresholding method suited to the semi-supervised learning framework for facial expression recognition, and, through a selective negative learning strategy, make full use of low-confidence unlabeled samples by mining useful expression information from complementary labels. We achieve state-of-the-art performance on the RAF-DB and AffectNet datasets. Our method surpasses fully supervised methods even without using the entire dataset, which demonstrates the effectiveness of our approach.
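The abstract does not give the loss or the threshold schedule, so the following is only a minimal sketch of the general idea of negative learning on complementary labels. The function names, the fixed `pos_threshold` and `neg_threshold` parameters, and the averaging over complementary labels are all illustrative assumptions, not the authors' implementation. Confident samples receive an ordinary pseudo-label; low-confidence samples instead contribute "this is not class k" signals for classes the model considers very unlikely.

```python
import math


def split_by_confidence(probs, pos_threshold, neg_threshold):
    """Return (pseudo_label, complementary_labels) for one unlabeled sample.

    Hypothetical helper: thresholds are passed in, so any dynamic schedule
    (not specified in the abstract) can be applied by the caller.
    """
    top = max(range(len(probs)), key=lambda k: probs[k])
    if probs[top] >= pos_threshold:
        return top, []  # confident: use a normal positive pseudo-label
    # Not confident: keep only classes the model is fairly sure are wrong.
    return None, [k for k, p in enumerate(probs) if p <= neg_threshold]


def negative_learning_loss(probs, complementary_labels, eps=1e-12):
    """Mean of -log(1 - p_k) over complementary labels k (0.0 if none)."""
    if not complementary_labels:
        return 0.0
    total = -sum(math.log(max(1.0 - probs[k], eps)) for k in complementary_labels)
    return total / len(complementary_labels)


if __name__ == "__main__":
    probs = [0.40, 0.35, 0.20, 0.05]  # low-confidence prediction
    label, negatives = split_by_confidence(probs, pos_threshold=0.9, neg_threshold=0.1)
    print(label, negatives, negative_learning_loss(probs, negatives))
```

Minimizing `-log(1 - p_k)` pushes probability away from class `k` without committing to any single positive class, which is why low-confidence samples can still carry a training signal.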

[79] What's Left Unsaid? Detecting and Correcting Misleading Omissions in Multimodal News Previews

Fanxiao Li,Jiaying Wu,Tingchao Fu,Dayang Li,Herun Wan,Wei Zhou,Min-Yen Kan

Main category: cs.CV

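TL;DR: Proposes a multi-stage pipeline for building the MM-Misleading benchmark, which targets misleadingness caused by omitted information in social-media news previews, and introduces OMGuard, which improves the ability of multimodal large models to detect and correct such cases through interpretation-aware fine-tuning and rationale-guided content correction.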

Details Motivation: Covert misleadingness caused by omitted information is widespread in social-media news previews, yet it is hard to detect, has broad impact, and has received little research attention, so systematic methods for detection and correction are needed. Method: A multi-stage pipeline separates and simulates preview-based and context-based understanding to build the MM-Misleading benchmark; open-source multimodal large models are evaluated on it, and OMGuard is proposed, combining interpretation-aware fine-tuning with rationale-guided headline rewriting. Result: Experiments show that OMGuard lifts an 8B-parameter model's detection accuracy to match a 235B LVLM and markedly improves end-to-end correction of misleading content; analysis finds that misleadingness mainly stems from local narrative shifts (such as missing background) and reveals the limits of text-only correction in image-driven scenarios. Conclusion: Omission-induced misleadingness deserves attention, visual information is essential to correction, and OMGuard offers an effective path to improving multimodal models on this problem. Abstract: Even when factually correct, social-media news previews (image-headline pairs) can induce interpretation drift: by selectively omitting crucial context, they lead readers to form judgments that diverge from what the full article conveys. This covert harm is harder to detect than explicit misinformation yet remains underexplored. To address this gap, we develop a multi-stage pipeline that disentangles and simulates preview-based versus context-based understanding, enabling construction of the MM-Misleading benchmark. Using this benchmark, we systematically evaluate open-source LVLMs and uncover pronounced blind spots in omission-based misleadingness detection. We further propose OMGuard, which integrates (1) Interpretation-Aware Fine-Tuning, which improves multimodal misleadingness detection, and (2) Rationale-Guided Misleading Content Correction, which uses explicit rationales to guide headline rewriting and reduce misleading impressions. Experiments show that OMGuard lifts an 8B model's detection accuracy to match a 235B LVLM and delivers markedly stronger end-to-end correction. Further analysis reveals that misleadingness typically stems from local narrative shifts (e.g., missing background) rather than global frame changes, and identifies image-driven scenarios where text-only correction fails, highlighting the necessity of visual interventions.

[80] Towards Generalized Multi-Image Editing for Unified Multimodal Models

Pengcheng Xu,Peng Tang,Donghao Luo,Xiaobin Hu,Weichu Cui,Qingdong He,Zhennan Chen,Jiangning Zhang,Charles Ling,Boyu Wang

Main category: cs.CV

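TL;DR: Proposes a scalable multi-image editing framework for Unified Multimodal Models (UMMs) that introduces learnable latent separators and sinusoidal index encoding to improve visual consistency and semantic disentanglement across multiple input images, and builds a high-fidelity benchmark for validation.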

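Details Motivation: Existing unified multimodal models struggle to maintain visual consistency and to resolve cross-image visual cues accurately when given multiple input images, especially when the number of inputs varies. Method: Two algorithmic innovations are proposed: 1) learnable latent separators that explicitly distinguish each reference image in the latent space, enabling disentangled conditioning; and 2) sinusoidal index encoding that assigns the visual tokens of the same image a continuous sinusoidal index embedding, providing explicit image identity while supporting generalization and extrapolation to a variable number of inputs. A high-fidelity benchmark is also built using an inverse dataset construction method. Result: Experiments show that the method clearly outperforms prior baselines on diverse multi-image editing tasks, with gains in semantic consistency, visual fidelity, and cross-image integration, validating its advantages in consistency and generalization. Conclusion: The proposed framework effectively addresses identity confusion and poor generalization of UMMs in multi-image settings; by modeling image identity explicitly, it enables high-quality, consistent editing with a variable number of input images. Abstract: Unified Multimodal Models (UMMs) integrate multimodal understanding and generation, yet they are limited in maintaining visual consistency and disambiguating visual cues when referencing details across multiple input images. In this work, we propose a scalable multi-image editing framework for UMMs that explicitly distinguishes image identities and generalizes to variable input counts. Algorithmically, we introduce two innovations: 1) The learnable latent separators explicitly differentiate each reference image in the latent space, enabling accurate and disentangled conditioning. 2) The sinusoidal index encoding assigns visual tokens from the same image a continuous sinusoidal index embedding, which provides explicit image identity while allowing generalization and extrapolation to a variable number of inputs. To facilitate training and evaluation, we establish a high-fidelity benchmark using an inverse dataset construction methodology to guarantee artifact-free, achievable outputs. Experiments show clear improvements in semantic consistency, visual fidelity, and cross-image integration over prior baselines on diverse multi-image editing tasks, validating our advantages in consistency and generalization ability.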
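The exact form of the paper's sinusoidal index encoding is not given in the abstract. As a rough illustration, the sketch below applies the standard Transformer-style sinusoidal formula to an image index rather than a token position; the function name, the `base` constant, and the sin/cos interleaving are assumptions. Every visual token from image `i` would receive the same vector, and because the encoding is a closed-form function of the index, it is defined for indices beyond those seen in training.

```python
import math


def sinusoidal_index_embedding(image_index, dim, base=10000.0):
    """Closed-form sinusoidal embedding of an image index (hypothetical form).

    Interleaves sin and cos at geometrically spaced frequencies, as in the
    standard Transformer positional encoding, but keyed on the image index.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    embedding = []
    for i in range(dim // 2):
        freq = base ** (-2.0 * i / dim)
        embedding.append(math.sin(image_index * freq))
        embedding.append(math.cos(image_index * freq))
    return embedding


if __name__ == "__main__":
    # Tokens of the same image share one embedding; different images differ.
    for idx in range(3):
        print(idx, [round(v, 4) for v in sinusoidal_index_embedding(idx, dim=4)])
```

Since no lookup table is involved, an index such as 7 is encoded by the same formula even if training only ever used up to three reference images, which is one plausible reading of the extrapolation claim.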

[81] Orient Anything V2: Unifying Orientation and Rotation Understanding

Zehan Wang,Ziang Zhang,Jiayang Xu,Jialei Wang,Tianyu Pang,Chao Du,HengShuang Zhao,Zhou Zhao

Main category: cs.CV

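TL;DR: Orient Anything V2 is a foundation model for unified understanding of object 3D orientation and rotation from single or paired images; compared with V1, it adds support for diverse rotational symmetries and can directly estimate relative rotations.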

Details Motivation: Existing methods are limited when handling objects with complex rotational symmetries and generalize poorly to new categories; a unified, scalable framework is needed to estimate object orientation accurately. Method: Four innovations are proposed: scalable 3D assets synthesized by generative models, an efficient model-in-the-loop annotation system, a symmetry-aware periodic distribution fitting objective, and a multi-frame architecture that predicts relative rotations. Result: The model achieves state-of-the-art zero-shot performance on 11 widely used benchmarks covering orientation estimation, 6DoF pose estimation, and symmetry recognition. Conclusion: Orient Anything V2 significantly improves the generality and applicability of orientation estimation and generalizes strongly across diverse downstream tasks. Abstract: This work presents Orient Anything V2, an enhanced foundation model for unified understanding of object 3D orientation and rotation from single or paired images. Building upon Orient Anything V1, which defines orientation via a single unique front face, V2 extends this capability to handle objects with diverse rotational symmetries and directly estimate relative rotations. These improvements are enabled by four key innovations: 1) Scalable 3D assets synthesized by generative models, ensuring broad category coverage and balanced data distribution; 2) An efficient, model-in-the-loop annotation system that robustly identifies 0 to N valid front faces for each object; 3) A symmetry-aware, periodic distribution fitting objective that captures all plausible front-facing orientations, effectively modeling object rotational symmetry; 4) A multi-frame architecture that directly predicts relative object rotations. Extensive experiments show that Orient Anything V2 achieves state-of-the-art zero-shot performance on orientation estimation, 6DoF pose estimation, and object symmetry recognition across 11 widely used benchmarks. The model demonstrates strong generalization, significantly broadening the applicability of orientation estimation in diverse downstream tasks.

[82] Generalizable and Adaptive Continual Learning Framework for AI-generated Image Detection

Hanyi Wang,Jun Lan,Yaoyu Kang,Huijia Zhu,Weiqiang Wang,Zhuosheng Zhang,Shilin Wang

Main category: cs.CV

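TL;DR: Proposes a three-stage domain continual learning framework for detecting continually evolving AI-generated images, which mitigates catastrophic forgetting through parameter-efficient fine-tuning, a data augmentation chain, and K-FAC, and further improves performance using Linear Mode Connectivity.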

Details Motivation: Existing detection methods generalize poorly to unseen generative models, and the rapid evolution of generative techniques means detection models easily become ineffective without adaptability. Method: Stage one uses a parameter-efficient fine-tuning strategy to build a transferable offline detection model; stage two introduces a data augmentation chain of progressively increasing complexity and a K-FAC approximation of the Hessian to mitigate catastrophic forgetting; stage three uses a linear interpolation strategy based on Linear Mode Connectivity to merge what the models have in common. Result: A chronologically ordered benchmark of 27 generative models (up to August 2024) is built; the initial offline detector exceeds the leading baseline by +5.51% mAP, and the continual learning strategy reaches an average accuracy of 92.20%, outperforming current state-of-the-art methods. Conclusion: The proposed framework adapts continually and effectively to new generative models in realistic settings, significantly improving detection performance and robustness. Abstract: The malicious misuse and widespread dissemination of AI-generated images pose a significant threat to the authenticity of online information. Current detection methods often struggle to generalize to unseen generative models, and the rapid evolution of generative techniques continuously exacerbates this challenge. Without adaptability, detection models risk becoming ineffective in real-world applications. To address this critical issue, we propose a novel three-stage domain continual learning framework designed for continuous adaptation to evolving generative models. In the first stage, we employ a strategic parameter-efficient fine-tuning approach to develop a transferable offline detection model with strong generalization capabilities. Building upon this foundation, the second stage integrates unseen data streams into a continual learning process. To efficiently learn from limited samples of novel generative models and mitigate overfitting, we design a data augmentation chain with progressively increasing complexity. Furthermore, we leverage the Kronecker-Factored Approximate Curvature (K-FAC) method to approximate the Hessian and alleviate catastrophic forgetting. Finally, the third stage utilizes a linear interpolation strategy based on Linear Mode Connectivity, effectively capturing commonalities across diverse generative models and further enhancing overall performance. We establish a comprehensive benchmark of 27 generative models, including GANs, deepfakes, and diffusion models, chronologically structured up to August 2024 to simulate real-world scenarios. Extensive experiments demonstrate that our initial offline detectors surpass the leading baseline by +5.51% in terms of mean average precision. Our continual learning strategy achieves an average accuracy of 92.20%, outperforming state-of-the-art methods.
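The third stage is described only as "a linear interpolation strategy based on Linear Mode Connectivity". The sketch below shows the basic operation that phrase usually refers to: a convex combination of two parameter sets with matching shapes. The dictionary layout, the function name, and the choice of `alpha` are assumptions for illustration; how the paper selects the models to merge or the interpolation coefficient is not stated in the abstract.

```python
def interpolate_weights(theta_a, theta_b, alpha):
    """Return (1 - alpha) * theta_a + alpha * theta_b, parameter by parameter.

    theta_a and theta_b map parameter names to flat lists of floats and must
    share the same keys and lengths (a simplified stand-in for a state dict).
    """
    if theta_a.keys() != theta_b.keys():
        raise ValueError("parameter names must match")
    merged = {}
    for name in theta_a:
        a, b = theta_a[name], theta_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]
    return merged


if __name__ == "__main__":
    old_detector = {"head.weight": [1.0, 2.0], "head.bias": [0.0]}
    new_detector = {"head.weight": [3.0, 6.0], "head.bias": [1.0]}
    print(interpolate_weights(old_detector, new_detector, alpha=0.5))
```

Interpolation of this kind is only expected to help when the two models lie in a connected low-loss region, which is more likely when both are fine-tuned from the same initialization, as in a continual learning pipeline.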

[83] GS-DMSR: Dynamic Sensitive Multi-scale Manifold Enhancement for Accelerated High-Quality 3D Gaussian Splatting

Nengbo Lu,Minghua Pan,Shaohua Sun,Yizhou Liang

Main category: cs.CV

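TL;DR: Proposes GS-DMSR, a 3D dynamic scene reconstruction method that uses adaptive gradient focusing and a multi-scale manifold enhancement module to speed up model convergence while improving rendering quality for complex dynamic scenes.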

Details Motivation: When reconstructing 3D scenes with complex dynamic motion at high precision, conventional methods struggle to balance convergence speed against rendering quality; a method is needed that efficiently distinguishes the motion significance of Gaussians and optimizes them differently. Method: GS-DMSR quantitatively analyzes the dynamic evolution of Gaussian attributes to achieve adaptive gradient focusing, identifies differences in the motion significance of Gaussians, and applies differentiated optimization strategies; it also introduces a multi-scale manifold enhancement module that jointly optimizes an implicit nonlinear decoder and an explicit deformation field to model complex deforming scenes more efficiently. Result: Experiments show the method reaches up to 96 FPS on synthetic datasets and significantly reduces storage overhead and training time while maintaining high-quality rendering. Conclusion: GS-DMSR effectively balances convergence speed and rendering quality in 3D dynamic scene reconstruction, offering a new solution for efficient modeling of complex dynamic scenes. Abstract: In 3D dynamic scene reconstruction, balancing model convergence rate against rendering quality has long been a critical challenge, particularly in high-precision modeling of scenes with complex dynamic motions. To tackle this issue, this study proposes the GS-DMSR method. By quantitatively analyzing the dynamic evolution of Gaussian attributes, the method achieves adaptive gradient focusing, enabling it to dynamically identify significant differences in the motion states of Gaussian models. It then applies differentiated optimization strategies to Gaussian models with varying degrees of significance, thereby significantly improving the model convergence rate. Additionally, this research integrates a multi-scale manifold enhancement module, which leverages the collaborative optimization of an implicit nonlinear decoder and an explicit deformation field to enhance the modeling efficiency for complex deformation scenes. Experimental results demonstrate that this method achieves a frame rate of up to 96 FPS on synthetic datasets, while effectively reducing both storage overhead and training time. Our code and data are available at https://anonymous.4open.science/r/GS-DMSR-2212.

[84] Quantifying and Inducing Shape Bias in CNNs via Max-Pool Dilation

Takito Sawada,Akinori Iwata,Masahiro Okuda

Main category: cs.CV

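TL;DR: Proposes a data-driven metric that quantifies a dataset's shape-texture balance, and strengthens shape bias by adjusting max-pooling dilation, improving classification accuracy on shape-dominant datasets.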

Details Motivation: Convolutional neural networks (CNNs) have a strong texture bias, which degrades performance on shape-dominant data such as illustrations and sketches. Existing methods lack a quantitative metric for identifying which datasets actually benefit from shape-biased models. Method: The shape-texture balance of a dataset is quantified by computing the Structural Similarity Index (SSIM) between each image's luminance channel and its L0-smoothed version; an efficient adaptation method then strengthens shape bias by modifying the dilation parameter of max-pooling while keeping convolutional weights frozen. Result: Experiments show the method consistently improves classification accuracy on shape-dominant datasets, especially in low-data regimes, and requires training only the final classification layer. Conclusion: The proposed metric and adaptation method effectively improve CNN performance on shape-dominant tasks while remaining computationally efficient and practical. Abstract: Convolutional Neural Networks (CNNs) are known to exhibit a strong texture bias, favoring local patterns over global shape information, a tendency inherent to their convolutional architecture. While this bias is beneficial for texture-rich natural images, it often degrades performance on shape-dominant data such as illustrations and sketches. Although prior work has proposed shape-biased models to mitigate this issue, these approaches lack a quantitative metric for identifying which datasets would actually benefit from such modifications. To address this gap, we propose a data-driven metric that quantifies the shape-texture balance of a dataset by computing the Structural Similarity Index (SSIM) between each image's luminance channel and its L0-smoothed counterpart. Building on this metric, we further introduce a computationally efficient adaptation method that promotes shape bias by modifying the dilation of max-pooling operations while keeping convolutional weights frozen. Experimental results show that this approach consistently improves classification accuracy on shape-dominant datasets, particularly in low-data regimes where full fine-tuning is impractical, requiring training only the final classification layer.
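To make "modifying the dilation of max-pooling" concrete, here is a minimal pure-Python sketch of 2D max pooling with a dilation parameter, written without padding for brevity. It illustrates the operation only and is not the authors' code; the function name and defaults are assumptions. With dilation `d`, the `kernel x kernel` taps are spaced `d` pixels apart, so each output looks at a wider neighborhood while adding no learnable parameters.

```python
def max_pool2d_dilated(image, kernel=2, stride=1, dilation=1):
    """Max pooling over a 2D list of numbers with dilated sampling taps.

    Each window covers span = dilation * (kernel - 1) + 1 pixels per side,
    but only kernel * kernel of those pixels are actually compared.
    """
    height, width = len(image), len(image[0])
    span = dilation * (kernel - 1) + 1
    pooled = []
    for top in range(0, height - span + 1, stride):
        row = []
        for left in range(0, width - span + 1, stride):
            row.append(max(
                image[top + i * dilation][left + j * dilation]
                for i in range(kernel)
                for j in range(kernel)
            ))
        pooled.append(row)
    return pooled


if __name__ == "__main__":
    image = [
        [1, 2, 3],
        [4, 100, 6],
        [7, 8, 9],
    ]
    print(max_pool2d_dilated(image, dilation=1))  # [[100, 100], [100, 100]]
    print(max_pool2d_dilated(image, dilation=2))  # [[9]]
```

In the example, the undilated pool is dominated by the single bright center pixel, while the dilated pool samples only the four corners and skips it. This is one intuition for why dilation could de-emphasize fine local texture in favor of coarser structure, though the paper's actual mechanism and layer choices are not detailed in the abstract.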

[85] SceneAlign: Aligning Multimodal Reasoning to Scene Graphs in Complex Visual Scenes

Chuhan Wang,Xintong Li,Jennifer Yuntong Zhang,Junda Wu,Chengkai Huang,Lina Yao,Julian McAuley,Jingbo Shang

Main category: cs.CV

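TL;DR: Proposes SceneAlign, a framework that uses scene graphs for controllable structural interventions and constructs hard negative rationales to improve the reasoning faithfulness of multimodal large models in complex visual scenes.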

Details Motivation: Existing preference-based methods fail to address unfaithful reasoning in multimodal large models, which in complex visual scenes rely on language priors and neglect visual grounding. Method: Scene graphs are used to identify reasoning-critical nodes, and four perturbation strategies that mimic typical grounding failures generate hard negative rationales that are linguistically plausible but visually incorrect, which are then used for Direct Preference Optimization. Result: Across seven visual reasoning benchmarks, SceneAlign consistently improves answer accuracy and reasoning faithfulness. Conclusion: Introducing a structured, visually grounded alignment mechanism effectively strengthens fine-grained, faithful reasoning of multimodal models in complex scenes. Abstract: Multimodal large language models often struggle with faithful reasoning in complex visual scenes, where intricate entities and relations require precise visual grounding at each step. This reasoning unfaithfulness frequently manifests as hallucinated entities, mis-grounded relations, skipped steps, and over-specified reasoning. Existing preference-based approaches, typically relying on textual perturbations or answer-conditioned rationales, fail to address this challenge as they allow models to exploit language priors to bypass visual grounding. To address this, we propose SceneAlign, a framework that leverages scene graphs as structured visual information to perform controllable structural interventions. By identifying reasoning-critical nodes and perturbing them through four targeted strategies that mimic typical grounding failures, SceneAlign constructs hard negative rationales that remain linguistically plausible but are grounded in inaccurate visual facts. These contrastive pairs are used in Direct Preference Optimization to steer models toward fine-grained, structure-faithful reasoning. Across seven visual reasoning benchmarks, SceneAlign consistently improves answer accuracy and reasoning faithfulness, highlighting the effectiveness of grounding-aware alignment for multimodal reasoning.
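For readers unfamiliar with the objective the contrastive pairs feed into, the sketch below computes the standard Direct Preference Optimization loss for one (chosen, rejected) pair from sequence log-probabilities. This is the generic DPO formula, not anything specific to SceneAlign; the argument names and the default `beta` are assumptions. Here "chosen" would be the faithful rationale and "rejected" the scene-graph-perturbed hard negative.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))),
    where each term is a summed sequence log-probability.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy prefers the faithful rationale more than the reference does.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
    # Policy prefers the perturbed rationale: the loss is larger.
    print(dpo_loss(-3.0, -1.0, -2.0, -2.0))
```

Because the negatives stay linguistically plausible, the model cannot lower this loss through language priors alone, which appears to be the motivation for building the pairs from scene-graph perturbations.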

[86] Learning Geometric Invariance for Gait Recognition

Zengbin Wang,Junjie Li,Saihui Hou,Xu Liu,Chunshui Cao,Yongzhen Huang,Muyi Sun,Siye Wang,Man Zhang

Main category: cs.CV

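TL;DR: Proposes ${\mathcal{RRS}}$-Gait, a new gait recognition framework that models three geometric transformations (reflection, rotation, and scaling) to learn identity-invariant features under cross-view and cross-clothing conditions, significantly improving gait recognition under varied conditions.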

Details Motivation: Most existing gait recognition methods implicitly learn features shared across conditions and lack explicit modeling of the inherent relations between gait conditions, which is especially challenging in cross-view and cross-clothing scenarios. Method: Differences between gait conditions are treated as combinations of geometric transformations (reflection, rotation, scaling), and the ${\mathcal{RRS}}$-Gait framework is designed accordingly: it adaptively adjusts convolution kernels to achieve approximate feature equivariance, then applies global pooling for invariance learning. Result: The method is validated on four mainstream gait datasets (Gait3D, GREW, CCPG, SUSTech1K) and performs strongly under a variety of complex conditions. Conclusion: Explicitly modeling geometric transformations effectively improves the robustness and generalization of gait recognition and offers a new approach to cross-condition gait recognition. Abstract: The goal of gait recognition is to extract identity-invariant features of an individual under various gait conditions, e.g., cross-view and cross-clothing. Most gait models strive to implicitly learn the common traits across different gait conditions in a data-driven manner to pull different gait conditions closer for recognition. However, relatively few studies have explicitly explored the inherent relations between different gait conditions. For this purpose, we attempt to establish connections among different gait conditions and propose a new perspective on gait recognition: variations across gait conditions can be approximately viewed as a combination of geometric transformations. In this case, all we need is to determine the types of geometric transformations and achieve geometric invariance; identity invariance then follows naturally. As an initial attempt, we explore three common geometric transformations (i.e., Reflect, Rotate, and Scale) and design a $\mathcal{R}$eflect-$\mathcal{R}$otate-$\mathcal{S}$cale invariance learning framework, named ${\mathcal{RRS}}$-Gait. Specifically, it first flexibly adjusts the convolution kernel based on the specific geometric transformations to achieve approximate feature equivariance. Then these three equivariant-aware features are respectively fed into a global pooling operation for final invariance-aware learning. Extensive experiments on four popular gait datasets (Gait3D, GREW, CCPG, SUSTech1K) show superior performance across various gait conditions.

[87] LatentVLA: Efficient Vision-Language Models for Autonomous Driving via Latent Action Prediction

Chengen Xie,Bin Sun,Tianyu Li,Junjie Wu,Zhihui Hao,XianPeng Lang,Hongyang Li

Main category: cs.CV

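TL;DR: Proposes LatentVLA, a framework that uses self-supervised latent action prediction and knowledge distillation to train efficient, robust end-to-end autonomous driving models without language annotations, achieving state-of-the-art performance on NAVSIM and strong zero-shot generalization on nuScenes.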

Details Motivation: Existing VLA models depend on language annotations, suffer from linguistic bias, and are inefficient at inference, making them ill-suited to long-tail scenarios and real-time deployment. Method: LatentVLA learns driving representations through self-supervised latent action prediction, avoiding language annotations, and uses knowledge distillation to transfer the generalization ability of VLA models to lightweight vision networks, improving inference efficiency. Result: The method reaches a PDMS score of 92.4 on the NAVSIM benchmark, a new state of the art, and shows strong zero-shot generalization on nuScenes. Conclusion: LatentVLA eliminates linguistic bias, reduces annotation burden, and combines high performance with real-time efficiency, offering a more practical VLA training paradigm for end-to-end autonomous driving. Abstract: End-to-end autonomous driving models trained on large-scale datasets perform well in common scenarios but struggle with rare, long-tail situations due to limited scenario diversity. Recent Vision-Language-Action (VLA) models leverage broad knowledge from pre-trained vision-language models to address this limitation, yet face critical challenges: (1) numerical imprecision in trajectory prediction due to discrete tokenization, (2) heavy reliance on language annotations that introduce linguistic bias and annotation burden, and (3) computational inefficiency from multi-step chain-of-thought reasoning that hinders real-time deployment. We propose LatentVLA, a novel framework that employs self-supervised latent action prediction to train VLA models without language annotations, eliminating linguistic bias while learning rich driving representations from unlabeled trajectory data. Through knowledge distillation, LatentVLA transfers the generalization capabilities of VLA models to efficient vision-based networks, achieving both robust performance and real-time efficiency. LatentVLA establishes a new state-of-the-art on the NAVSIM benchmark with a PDMS score of 92.4 and demonstrates strong zero-shot generalization on the nuScenes benchmark.

[88] Compressing image encoders via latent distillation

Caroline Mazini Rodrigues,Nicolas Keriven,Thomas Maugey

Main category: cs.CV

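TL;DR: Proposes a simplified knowledge distillation method that compresses deep-learning image compression models by shrinking their encoders, reducing compute requirements while preserving reconstruction quality.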

Details Motivation: Deep-learning image compression models are typically complex and resource-hungry, making them hard to use in hardware-constrained environments. Method: A simplified knowledge distillation strategy approximates the latent space of the original model with less data and shorter training, turning a heavyweight encoder into a lightweight one. Result: The method is validated on two different architectures; the lightweight encoders preserve reconstruction quality and statistical fidelity better than encoders trained with the original loss. Conclusion: The method makes deep-learning image compression models more suitable for resource-constrained environments and has good practical value. Abstract: Deep learning models for image compression often face practical limitations in hardware-constrained applications. Although these models achieve high-quality reconstructions, they are typically complex, heavyweight, and require substantial training data and computational resources. We propose a methodology to partially compress these networks by reducing the size of their encoders. Our approach uses a simplified knowledge distillation strategy to approximate the latent space of the original models with less data and shorter training, yielding lightweight encoders from heavyweight ones. We evaluate the resulting lightweight encoders across two different architectures on the image compression task. Experiments show that our method preserves reconstruction quality and statistical fidelity better than training lightweight encoders with the original loss, making it practical for resource-limited environments.

[89] SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving

Jingyu Li,Junjie Wu,Dongnan Hu,Xiangkai Huang,Bin Sun,Zhihui Hao,Xianpeng Lang,Xiatian Zhu,Li Zhang

Main category: cs.CV

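TL;DR: SGDrive is a new end-to-end autonomous driving framework based on vision-language models (VLMs) that builds a scene-agent-goal hierarchy to strengthen the VLM's spatial-temporal representations for driving, achieving state-of-the-art performance among camera-only methods on the NAVSIM benchmark.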

Details Motivation: Existing general-purpose vision-language models lack driving-specific 3D spatio-temporal reasoning and struggle to build structured spatial-temporal representations, which limits their effectiveness for trajectory planning in autonomous driving. Method: The SGDrive framework, built on a pre-trained VLM, decomposes driving understanding into three levels (scene, agent, goal) that mirror human driving cognition, explicitly introducing a hierarchy of driving knowledge to strengthen representation learning. Result: Experiments on the NAVSIM benchmark show that SGDrive outperforms existing camera-only methods on both PDMS and EPDMS, demonstrating the effectiveness of hierarchical knowledge structures for adapting general-purpose VLMs to autonomous driving. Conclusion: By introducing a driving-specific hierarchical knowledge structure, SGDrive compensates for the lack of structured spatio-temporal representations in general-purpose VLMs and improves trajectory planning, confirming the value of structured knowledge in specialized domains. Abstract: Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatial-temporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning. To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatial-temporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning. Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods on both PDMS and EPDMS, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.

[90] SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More

Muye Huang,Lingling Zhang,Yifei Li,Yaqiang Wu,Jun Liu

Main category: cs.CV

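TL;DR: SketchVL is a new multimodal large language model trained with FinePO, a reinforcement learning algorithm for fine-grained credit assignment. It draws markers for its intermediate reasoning steps on the image and uses a Fine-grained Process Reward Model (FinePRM) to score each step, substantially improving performance on chart understanding, natural image, and mathematical reasoning tasks.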

Details Motivation: Existing MLLMs face a credit assignment problem on complex visual reasoning tasks such as chart understanding: conventional trajectory-level advantage estimation cannot distinguish correct from incorrect reasoning steps within a single response. Method: Proposes SketchVL, which draws markers for intermediate reasoning steps on the image and feeds the annotated image back to itself to form a multi-step reasoning process; designs the FinePO algorithm, which uses a Fine-grained Process Reward Model (FinePRM) to score each drawing action for fine-grained credit assignment. Result: Experiments show that SketchVL improves on its base model by 7.23% on average across chart datasets, natural image datasets, and mathematical reasoning tasks, and that the model's behavior aligns with FinePRM's scores. Conclusion: FinePO and the SketchVL framework offer a new direction for training multimodal models with strong reasoning ability, and confirm the value of fine-grained reinforcement signals for complex reasoning tasks. Abstract: Charts are high-density visual carriers of complex data and a medium for information extraction and analysis. Due to the need for precise and complex visual reasoning, automated chart understanding poses a significant challenge to existing Multimodal Large Language Models (MLLMs). Many MLLMs trained with reinforcement learning (RL) face the challenge of credit assignment. Their advantage estimation, typically performed at the trajectory level, cannot distinguish between correct and incorrect reasoning steps within a single generated response. To address this limitation, we introduce SketchVL, a novel MLLM that is optimized with FinePO, a new RL algorithm designed for fine-grained credit assignment within each trajectory. SketchVL's methodology involves drawing its intermediate reasoning steps as markers on the image and feeding the annotated image back to itself, creating a robust, multi-step reasoning process. During training, the FinePO algorithm leverages a Fine-grained Process Reward Model (FinePRM) to score each drawing action within a trajectory, thereby precisely assigning credit for each step. This mechanism allows FinePO to more strongly reward correct tokens when a trajectory is globally successful, and more heavily penalize incorrect tokens when the trajectory is globally suboptimal, thus achieving fine-grained reinforcement signals. Experiments show that SketchVL learns to align its step-level behavior with the FinePRM, achieving an average performance gain of 7.23% over its base model across chart datasets, natural image datasets, and mathematics, providing a promising new direction for training powerful reasoning models.
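The asymmetric weighting described in the abstract (correct steps rewarded more in successful trajectories, incorrect steps penalized more in unsuccessful ones) can be illustrated with a toy sketch. This is not the FinePO update rule, which the abstract does not spell out. The function name `step_advantages`, the score range [0, 1], and the linear weights `0.5 + score` and `1.5 - score` are all assumptions chosen only to show the idea.

```python
def step_advantages(trajectory_advantage, step_scores):
    """Spread one trajectory-level advantage over individual steps.

    step_scores are hypothetical process-reward scores in [0, 1],
    where 1 means the step looks correct. The weighting is an
    illustrative assumption, not the rule used in the paper.
    """
    advantages = []
    for score in step_scores:
        if trajectory_advantage >= 0:
            # Successful trajectory: correct steps receive the larger reward.
            weight = 0.5 + score
        else:
            # Unsuccessful trajectory: incorrect steps receive the larger penalty.
            weight = 1.5 - score
        advantages.append(trajectory_advantage * weight)
    return advantages


if __name__ == "__main__":
    print(step_advantages(1.0, [1.0, 0.0]))
    print(step_advantages(-1.0, [1.0, 0.0]))
```

With a single trajectory-level advantage every step would receive the same signal; here the per-step score only redistributes it, so the sign of the update still follows the overall outcome.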

[91] Rotate Your Character: Revisiting Video Diffusion Models for High-Quality 3D Character Generation

Jin Wang,Jianxiang Lu,Comi Chen,Guangzheng Xu,Haoyu Yang,Peng Chen,Na Zhang,Yifan Xu,Longhuang Wu,Shuai Shao,Qinglin Lu,Ping Luo

Main category: cs.CV

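TL;DR: Proposes RCM, a diffusion-based image-to-video framework for high-quality 3D character generation and novel view synthesis that supports complex poses, high-resolution output, controllable viewpoints, and multi-view input.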

Details Motivation: Existing methods struggle to achieve consistent, high-quality novel view synthesis under complex poses and self-occlusion, and lack flexibility and detail fidelity. Method: Proposes the RCM framework, which transfers characters in arbitrarily complex poses into a canonical pose, uses a diffusion model to generate high-resolution orbital videos, and supports multi-view conditioning and controllable observation from different initial camera poses. Result: Generates high-quality orbital videos at 1024x1024 resolution, outperforms existing state-of-the-art methods in novel view synthesis and 3D character generation, and supports up to 4 input views. Conclusion: RCM performs strongly in handling complex poses, viewpoint control, and multi-input support, markedly improving the quality and practicality of 3D character reconstruction and novel view synthesis. Abstract: Generating high-quality 3D characters from single images remains a significant challenge in digital content creation, particularly due to complex body poses and self-occlusion. In this paper, we present RCM (Rotate your Character Model), an advanced image-to-video diffusion framework tailored for high-quality novel view synthesis (NVS) and 3D character generation. Compared to existing diffusion-based approaches, RCM offers several key advantages: (1) transferring characters with any complex poses into a canonical pose, enabling consistent novel view synthesis across the entire viewing orbit, (2) high-resolution orbital video generation at 1024x1024 resolution, (3) controllable observation positions given different initial camera poses, and (4) multi-view conditioning supporting up to 4 input images, accommodating diverse user scenarios. Extensive experiments demonstrate that RCM outperforms state-of-the-art methods in both novel view synthesis and 3D generation quality.

[92] TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment

Jin Wang,Jianxiang Lu,Guangzheng Xu,Comi Chen,Haoyu Yang,Linqing Wang,Peng Chen,Mingtao Chen,Zhichao Hu,Longhuang Wu,Shuai Shao,Qinglin Lu,Ping Luo

Main category: cs.CV

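TL;DR: Proposes TAGRPO, a robust post-training framework for image-to-video (I2V) generation models that introduces a contrastive-learning-inspired GRPO loss and a memory bank of rollout videos, and that significantly outperforms existing methods in reward optimization and generation diversity.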

Details Motivation: Directly applying existing Group Relative Policy Optimization (GRPO) methods to I2V models often fails to yield consistent reward improvements, so a more effective post-training strategy is needed. Method: Proposes the TAGRPO framework, which uses rollout videos generated from the same initial noise as optimization guidance and designs a new GRPO loss on intermediate latents that pulls toward high-reward trajectories and away from low-reward ones; a memory bank of rollout videos is added to increase diversity and reduce computational overhead. Result: TAGRPO significantly outperforms DanceGRPO on I2V generation, achieving higher reward scores and more stable training. Conclusion: TAGRPO provides a simple yet effective post-training scheme for I2V models, combining ideas from contrastive learning with a memory mechanism to improve generation quality and training efficiency. Abstract: Recent studies have demonstrated the efficacy of integrating Group Relative Policy Optimization (GRPO) into flow matching models, particularly for text-to-image and text-to-video generation. However, we find that directly applying these techniques to image-to-video (I2V) models often fails to yield consistent reward improvements. To address this limitation, we present TAGRPO, a robust post-training framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization. Leveraging this insight, we propose a novel GRPO loss applied to intermediate latents, encouraging direct alignment with high-reward trajectories while maximizing distance from low-reward counterparts. Furthermore, we introduce a memory bank for rollout videos to enhance diversity and reduce computational overhead. Despite its simplicity, TAGRPO achieves significant improvements over DanceGRPO in I2V generation.
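As a rough illustration of "alignment with high-reward trajectories while maximizing distance from low-reward counterparts", the sketch below combines the standard group-relative advantage (reward minus group mean, divided by group standard deviation) with a squared-distance term on latents. The exact TAGRPO loss is not given in the abstract, so `alignment_loss`, its squared-distance form, and the flat list-of-floats latents are assumptions for illustration only.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment_loss(current_latent, rollout_latents, rewards):
    """Hypothetical contrastive-style loss on intermediate latents.

    Rollouts with positive advantage contribute +distance (minimizing the
    loss pulls the current latent toward them); rollouts with negative
    advantage contribute -distance (minimizing pushes it away).
    """
    advantages = group_relative_advantages(rewards)
    terms = [
        adv * squared_distance(current_latent, latent)
        for adv, latent in zip(advantages, rollout_latents)
    ]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    rollouts = [[1.0, 0.0], [-1.0, 0.0]]  # toy latents from the same initial noise
    rewards = [1.0, 0.0]
    print(alignment_loss([0.9, 0.0], rollouts, rewards))   # near the high-reward rollout
    print(alignment_loss([-0.9, 0.0], rollouts, rewards))  # near the low-reward rollout
```

A latent close to the high-reward rollout gets the lower loss, which is the direction the abstract describes; a practical version would also need to bound the repulsive term, which this sketch leaves out.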

[93] FeatureSLAM: Feature-enriched 3D gaussian splatting SLAM in real time

Christopher Thirgood,Oscar Mendez,Erin Ling,Jon Storey,Simon Hadfield

Main category: cs.CV

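TL;DR: Proposes FeatureSLAM, a real-time semantic SLAM system based on 3D Gaussian Splatting that aligns dense feature rasterization with a visual foundation model, achieving high-fidelity, semantically rich mapping and stable tracking while supporting open-set segmentation and free-viewpoint semantic masking at real-time speed.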

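Details Motivation: Existing semantic SLAM systems mostly rely on pre-defined class labels, which limits flexibility for downstream tasks, and high-fidelity mapping is hard to reconcile with real-time performance. The paper aims to combine a recent 3D representation (3DGS) with dense semantic features to build a unified SLAM framework that is real-time, accurate, and open-set capable. Method: Proposes FeatureSLAM, which integrates dense semantic features from a visual foundation model into the 3D Gaussian Splatting rendering pipeline, performing efficient camera tracking and feature-enriched mapping alongside novel-view synthesis. Feature alignment and joint optimization improve pose estimation and map quality and support free-viewpoint, open-vocabulary semantic segmentation. Result: On standard datasets it achieves real-time tracking speed on par with state-of-the-art systems, with 9% lower pose error and 8% higher mapping accuracy; semantic and language masking results are comparable to offline 3DGS models, and new tasks such as open-set, free-viewpoint segmentation are supported. Conclusion: Embedding dense semantic features in the 3DGS rendering pipeline improves tracking stability and map fidelity and also unlocks open-semantics downstream applications, confirming the dual benefit of real-time feature-embedded SLAM. Abstract: We present a real-time tracking SLAM system that unifies efficient camera tracking with photorealistic feature-enriched mapping using 3D Gaussian Splatting (3DGS). Our main contribution is integrating dense feature rasterization into the novel-view synthesis, aligned with a visual foundation model. This yields strong semantics, going beyond basic RGB-D input, aiding both tracking and mapping accuracy. Unlike previous semantic SLAM approaches (which embed pre-defined class labels), FeatureSLAM enables entirely new downstream tasks via free-viewpoint, open-set segmentation. Across standard benchmarks, our method achieves real-time tracking on par with state-of-the-art systems while improving tracking stability and map fidelity without prohibitive compute. Quantitatively, we obtain 9% lower pose error and 8% higher mapping accuracy compared to recent fixed-set SLAM baselines. Our results confirm that real-time feature-embedded SLAM is not only valuable for enabling new downstream applications; it also improves the performance of the underlying tracking and mapping subsystems, providing semantic and language masking results that are on par with offline 3DGS models, alongside state-of-the-art tracking, depth and RGB rendering.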

[94] ViTNT-FIQA: Training-Free Face Image Quality Assessment with Vision Transformers

Guray Ozgur,Eduarda Caldeira,Tahar Chettaoui,Jan Niklas Kolf,Marco Huber,Naser Damer,Fadi Boutros

Main category: cs.CV

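TL;DR: Proposes ViTNT-FIQA, a training-free face image quality assessment method that measures image quality by the stability of patch embedding evolution across intermediate Vision Transformer blocks; it needs only a single forward pass and is efficient and broadly applicable.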

Details Motivation: Existing face image quality assessment methods mostly rely on final-layer representations or require multiple forward passes or backpropagation; an efficient, training-free solution is lacking. Method: Takes patch embeddings from intermediate Vision Transformer blocks, computes Euclidean distances between L2-normalized embeddings of consecutive blocks, and aggregates them into an image-level quality score that measures the stability of feature evolution. Result: Validated on eight benchmark datasets, with performance competitive with state-of-the-art methods while requiring only a single forward pass and no backpropagation or model modification. Conclusion: ViTNT-FIQA is an efficient, training-free face image quality assessment method applicable to any pre-trained ViT-based face recognition model. Abstract: Face Image Quality Assessment (FIQA) is essential for reliable face recognition systems. Current approaches primarily exploit only final-layer representations, while training-free methods require multiple forward passes or backpropagation. We propose ViTNT-FIQA, a training-free approach that measures the stability of patch embedding evolution across intermediate Vision Transformer (ViT) blocks. We demonstrate that high-quality face images exhibit stable feature refinement trajectories across blocks, while degraded images show erratic transformations. Our method computes Euclidean distances between L2-normalized patch embeddings from consecutive transformer blocks and aggregates them into image-level quality scores. We empirically validate this correlation on a quality-labeled synthetic dataset with controlled degradation levels. Unlike existing training-free approaches, ViTNT-FIQA requires only a single forward pass without backpropagation or architectural modifications. Through extensive evaluation on eight benchmarks (LFW, AgeDB-30, CFP-FP, CALFW, Adience, CPLFW, XQLFW, IJB-C), we show that ViTNT-FIQA achieves competitive performance with state-of-the-art methods while maintaining computational efficiency and immediate applicability to any pre-trained ViT-based face recognition model.
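The scoring step described above (Euclidean distances between L2-normalized patch embeddings of consecutive blocks, aggregated per image) can be sketched in plain Python. The abstract does not say how the distances are aggregated, so the mean over all blocks and patches, the sign convention (higher score means more stable), and the name `stability_score` are assumptions.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def stability_score(block_embeddings):
    """block_embeddings[b][p] is the embedding of patch p after block b.

    Returns the negated mean distance between consecutive blocks, so that
    embeddings that change little across blocks score higher. The mean and
    the negation are illustrative assumptions.
    """
    distances = []
    for prev_block, next_block in zip(block_embeddings, block_embeddings[1:]):
        for prev_patch, next_patch in zip(prev_block, next_block):
            distances.append(
                euclidean(l2_normalize(prev_patch), l2_normalize(next_patch))
            )
    return -sum(distances) / len(distances)


if __name__ == "__main__":
    stable = [[[1.0, 0.0], [0.0, 1.0]]] * 3
    erratic = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[0.0, 1.0], [1.0, 0.0]],
        [[1.0, 0.0], [0.0, 1.0]],
    ]
    print(stability_score(stable), stability_score(erratic))
```

In practice the per-block embeddings would come from one forward pass of a ViT with intermediate outputs captured, for example through forward hooks; that part is omitted here.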

[95] FlyPose: Towards Robust Human Pose Estimation From Aerial Views

Hassaan Farooq,Marvin Brenner,Peter Stütz

Main category: cs.CV

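TL;DR: Proposes FlyPose, a lightweight human pose estimation pipeline for UAV aerial imagery; multi-dataset training significantly improves detection and pose estimation accuracy across several test sets, and the system runs in real time onboard a UAV.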

Details Motivation: UAVs operating near humans face low resolution, steep viewing angles, and occlusion, which make accurate human pose perception difficult for existing methods, so a solution suited to aerial viewpoints and real-time constraints is needed. Method: Proposes FlyPose, a top-down pose estimation pipeline trained jointly on multiple datasets to improve person detection and keypoint localization in aerial imagery, with low-latency inference on a Jetson Orin AGX. Result: Significant gains on several datasets: person detection improves by 6.8 mAP on average, 2D pose estimation on UAV-Human improves by 16.3 mAP, inference latency is about 20 milliseconds, and the system was deployed on a quadrotor UAV for flight experiments. Conclusion: FlyPose addresses human pose estimation from aerial viewpoints with high accuracy and low latency, making it suitable for real UAV applications; the released FlyPose-104 dataset provides a useful benchmark for future research. Abstract: Unmanned Aerial Vehicles (UAVs) are increasingly deployed in close proximity to humans for applications such as parcel delivery, traffic monitoring, disaster response and infrastructure inspections. Ensuring safe and reliable operation in these human-populated environments demands accurate perception of human poses and actions from an aerial viewpoint. This perspective challenges existing methods with low resolution, steep viewing angles and (self-)occlusion, especially if the application demands real-time feasible models. We train and deploy FlyPose, a lightweight top-down human pose estimation pipeline for aerial imagery. Through multi-dataset training, we achieve an average improvement of 6.8 mAP in person detection across the test sets of Manipal-UAV, VisDrone, HIT-UAV as well as our custom dataset. For 2D human pose estimation we report an improvement of 16.3 mAP on the challenging UAV-Human dataset. FlyPose runs with an inference latency of ~20 milliseconds including preprocessing on a Jetson Orin AGX Developer Kit and is deployed onboard a quadrotor UAV during flight experiments. We also publish FlyPose-104, a small but challenging aerial human pose estimation dataset that includes manual annotations from difficult aerial perspectives: https://github.com/farooqhassaan/FlyPose.

[96] Adaptive Disentangled Representation Learning for Incomplete Multi-View Multi-Label Classification

Quanjiang Li,Zhiming Liu,Tianxiang Xu,Tingjin Luo,Chenping Hou

Main category: cs.CV

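TL;DR: Proposes Adaptive Disentangled Representation Learning (ADRL) for multi-view multi-label learning with missing features and incomplete annotations; robust view completion, representation disentanglement, and label semantics modeling yield significant performance gains.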

Details Motivation: Multi-view multi-label learning often suffers from missing features and incomplete annotations because of data acquisition difficulties and annotation cost, and existing methods are limited in feature recovery, representation disentanglement, and label semantics modeling. Method: ADRL achieves robust view completion by propagating neighborhood-aware feature-level affinity across modalities, and strengthens reconstruction with a stochastic masking strategy; a mutual-information-based objective promotes consistency among shared representations and suppresses information overlap between view-specific representations and other modalities; independent interactions between label embeddings and view representations perform prototype-specific feature selection, and generated pseudo-labels guide discriminative view fusion. Result: Extensive experiments on public datasets and real-world applications show that ADRL achieves superior performance on multi-view multi-label learning tasks. Conclusion: ADRL effectively handles multi-view multi-label learning with missing features and incomplete annotations, with advantages in representation learning and label semantics modeling, and has both theoretical support and practical value. Abstract: Multi-view multi-label learning frequently suffers from simultaneous feature absence and incomplete annotations, due to challenges in data acquisition and cost-intensive supervision. To tackle this complex yet highly practical problem while overcoming the existing limitations of feature recovery, representation disentanglement, and label semantics modeling, we propose an Adaptive Disentangled Representation Learning method (ADRL). ADRL achieves robust view completion by propagating feature-level affinity across modalities with neighborhood awareness, and reinforces reconstruction effectiveness by leveraging a stochastic masking strategy. Through disseminating category-level association across label distributions, ADRL refines distribution parameters for capturing interdependent label prototypes. Besides, we formulate a mutual-information-based objective to promote consistency among shared representations and suppress information overlap between view-specific representations and other modalities. Theoretically, we derive tractable bounds to train the dual-channel network. Moreover, ADRL performs prototype-specific feature selection by enabling independent interactions between label embeddings and view representations, accompanied by the generation of pseudo-labels for each category. The structural characteristics of the pseudo-label space are then exploited to guide a discriminative trade-off during view fusion. Finally, extensive experiments on public datasets and real-world applications demonstrate the superior performance of ADRL.

[97] SceneFoundry: Generating Interactive Infinite 3D Worlds

ChunTeng Chen,YiChen Hsu,YiWen Liu,WeiFang Sun,TsaiChing Ni,ChunYi Lee,Min Sun,YuanFu Yang

Main category: cs.CV

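TL;DR: SceneFoundry is a language-guided diffusion framework that generates large-scale 3D indoor environments with functionally articulated furniture and semantically diverse layouts from natural language, for use in robot training.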

Details Motivation: Existing generative methods fail to capture the functional complexity of real indoor environments containing movable parts, which holds back robot learning and embodied intelligence. Method: An LLM module controls floor layout generation, and diffusion-based posterior sampling populates the scene with articulated objects from large-scale 3D repositories; differentiable guidance functions ensure physical usability. Result: Experiments show that the framework generates diverse 3D environments that are structurally valid, semantically coherent, and functionally interactive. Conclusion: SceneFoundry supports scalable embodied AI research and advances robot training in realistic interactive environments. Abstract: The ability to automatically generate large-scale, interactive, and physically realistic 3D environments is crucial for advancing robotic learning and embodied intelligence. However, existing generative approaches often fail to capture the functional complexity of real-world interiors, particularly those containing articulated objects with movable parts essential for manipulation and navigation. This paper presents SceneFoundry, a language-guided diffusion framework that generates apartment-scale 3D worlds with functionally articulated furniture and semantically diverse layouts for robotic training. From natural language prompts, an LLM module controls floor layout generation, while diffusion-based posterior sampling efficiently populates the scene with articulated assets from large-scale 3D repositories. To ensure physical usability, SceneFoundry employs differentiable guidance functions to regulate object quantity, prevent articulation collisions, and maintain sufficient walkable space for robotic navigation. Extensive experiments demonstrate that our framework generates structurally valid, semantically coherent, and functionally interactive environments across diverse scene types and conditions, enabling scalable embodied AI research.

[98] Boosting Latent Diffusion Models via Disentangled Representation Alignment

John Page,Xuesong Niu,Kai Wu,Kun Gai

Main category: cs.CV

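TL;DR: Proposes Send-VAE, a variational autoencoder explicitly optimized for semantically disentangled representation learning by aligning with the semantic hierarchy of vision foundation models, significantly improving image generation quality and training efficiency.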

Details Motivation: Prior work uses the same representation alignment target for both VAEs and LDMs, overlooking their fundamentally different representational needs: LDMs need to retain high-level semantics, whereas VAEs should focus on attribute-level semantic disentanglement. Method: Proposes Send-VAE, which uses a non-linear mapper network to align the VAE latent space with the semantic hierarchy of pre-trained vision foundation models, strengthening semantic disentanglement while preserving generation performance. Result: Semantic disentanglement is verified by linear probing; when used to train flow-based transformers (SiTs), Send-VAE reaches FID scores of 1.21 (with classifier-free guidance) and 1.75 (without) on ImageNet 256x256, a state-of-the-art result, and significantly speeds up training. Conclusion: With an alignment strategy designed specifically for semantic disentanglement in VAEs, Send-VAE outperforms conventional approaches and offers a new route to more efficient, higher-quality latent diffusion models. Abstract: Latent Diffusion Models (LDMs) generate high-quality images by operating in a compressed latent space, typically obtained through image tokenizers such as Variational Autoencoders (VAEs). In pursuit of a generation-friendly VAE, recent studies have explored leveraging Vision Foundation Models (VFMs) as representation alignment targets for VAEs, mirroring the approach commonly adopted for LDMs. Although this yields certain performance gains, using the same alignment target for both VAEs and LDMs overlooks their fundamentally different representational requirements. We advocate that while LDMs benefit from latents retaining high-level semantic concepts, VAEs should excel in semantic disentanglement, enabling encoding of attribute-level information in a structured way. To address this, we propose the Semantic disentangled VAE (Send-VAE), explicitly optimized for disentangled representation learning through aligning its latent space with the semantic hierarchy of pre-trained VFMs. Our approach employs a non-linear mapper network to transform VAE latents, aligning them with VFMs to bridge the gap between attribute-level disentanglement and high-level semantics, facilitating effective guidance for VAE learning. We evaluate semantic disentanglement via linear probing on attribute prediction tasks, showing strong correlation with improved generation performance. Finally, using Send-VAE, we train flow-based transformers (SiTs); experiments show Send-VAE significantly speeds up training and achieves a state-of-the-art FID of 1.21 with and 1.75 without classifier-free guidance on ImageNet 256x256.

[99] GeoSurDepth: Spatial Geometry-Consistent Self-Supervised Depth Estimation for Surround-View Cameras

Weimin Liu,Wenjun Wang,Joshua H. Meng

Main category: cs.CV

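TL;DR: Proposes GeoSurDepth, a self-supervised framework for surround-view depth estimation that exploits geometry consistency; with pseudo geometry priors, multi-view photometric supervision, and an adaptive motion learning strategy, it achieves state-of-the-art performance on DDAD and nuScenes.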

Details Motivation: Existing surround-view depth estimation methods rarely make explicit use of the rich geometric structure in monocular and multi-view settings and rely mainly on photometric consistency, which leaves depth estimation insufficiently robust. Method: Proposes the GeoSurDepth framework: 1) foundation models serve as a pseudo geometry prior and feature enhancement tool, maintaining surface normal consistency in 3D space and object- and texture-consistent depth estimation in 2D; 2) a new view synthesis pipeline achieves 2D-3D lifting through spatial warping and adds photometric supervision across temporal, spatial, and spatial-temporal contexts; 3) an adaptive joint motion learning strategy strengthens the use of spatial geometry cues to improve motion reasoning. Result: Extensive experiments on DDAD and nuScenes show that GeoSurDepth achieves state-of-the-art performance in self-supervised surround-view depth estimation. Conclusion: Fully exploiting geometry consistency is essential for robust self-supervised multi-view depth estimation; GeoSurDepth integrates geometric, photometric, and motion cues to improve 3D scene understanding. Abstract: Accurate surround-view depth estimation provides a competitive alternative to laser-based sensors and is essential for 3D scene understanding in autonomous driving. While prior studies have proposed various approaches that primarily focus on enforcing cross-view constraints at the photometric level, few explicitly exploit the rich geometric structure inherent in both monocular and surround-view settings. In this work, we propose GeoSurDepth, a framework that leverages geometry consistency as the primary cue for surround-view depth estimation. Concretely, we utilize foundation models as a pseudo geometry prior and feature representation enhancement tool to guide the network to maintain surface normal consistency in spatial 3D space and regularize object- and texture-consistent depth estimation in 2D. In addition, we introduce a novel view synthesis pipeline where 2D-3D lifting is achieved with dense depth reconstructed via spatial warping, encouraging additional photometric supervision across temporal, spatial, and spatial-temporal contexts, and compensating for the limitations of single-view image reconstruction. Finally, a newly proposed adaptive joint motion learning strategy enables the network to adaptively emphasize informative spatial geometry cues for improved motion reasoning. Extensive experiments on DDAD and nuScenes demonstrate that GeoSurDepth achieves state-of-the-art performance, validating the effectiveness of our approach. Our framework highlights the importance of exploiting geometry coherence and consistency for robust self-supervised multi-view depth estimation.

[100] Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals

Nate Gillman,Yinghua Zhou,Zitian Tang,Evan Luo,Arjan Chakravarthy,Daksh Aggarwal,Michael Freeman,Charles Herrmann,Chen Sun

Main category: cs.CV

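TL;DR: Proposes Goal Force, a framework that specifies video generation goals through force vectors and intermediate dynamics, allowing the model to generalize zero-shot to complex real-world scenes and act as an implicit neural physics simulator.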

Details Motivation: Specifying goals in video generation is difficult: text is too abstract and target images are hard to specify for dynamic tasks, so a control scheme closer to physical intuition is needed. Method: Proposes the Goal Force framework, which takes explicit force vectors and intermediate dynamics as goal inputs; a video generation model is trained on a synthetic dataset of causal primitives (such as elastic collisions and falling dominos) to learn how forces propagate through time and space. Result: Although trained only on simple physics data, the model generalizes zero-shot to complex real-world scenarios (such as tool manipulation and multi-object causal chains), showing simulation ability similar to a physics engine. Conclusion: Grounding video generation in fundamental physical interactions lets the model act as an implicit neural physics simulator without an external engine, supporting precise, physics-aware planning. Abstract: Recent advancements in video generation have enabled the development of "world models" capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives (such as elastic collisions and falling dominos), teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, including tool manipulation and multi-object causal chains. Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics-aware planning without reliance on external engines. We release all datasets, code, model weights, and interactive video demos at our project page.

[101] Kidney Cancer Detection Using 3D-Based Latent Diffusion Models

Jen Dusseljee,Sarah de Boer,Alessa Hering

Main category: cs.CV

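TL;DR: Proposes a weakly supervised kidney anomaly detection method based on 3D latent diffusion models, using DDPM, DDIM, and VQ-GAN to analyze abdominal CT at the volume level with only case-level pseudo-labels.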

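Details Motivation: Existing methods mostly process slices individually and depend on dense annotations, which makes them hard to scale to 3D volumes; the paper explores a generative approach to 3D anomaly detection that does not need large amounts of fine-grained annotation. Method: Combines DDPM, DDIM, and VQ-GAN into a 3D latent diffusion model that operates directly on image volumes and is trained with weak supervision from case-level pseudo-labels. Result: Feasibility is demonstrated on contrast-enhanced abdominal CT; performance does not yet match fully supervised baselines, but the results point to directions for improving reconstruction fidelity and lesion localization. Conclusion: The method is an important step toward annotation-efficient, generative modeling of complex abdominal anatomy and shows the potential of 3D latent diffusion for weakly supervised anomaly detection. Abstract: In this work, we present a novel latent diffusion-based pipeline for 3D kidney anomaly detection on contrast-enhanced abdominal CT. The method combines Denoising Diffusion Probabilistic Models (DDPMs), Denoising Diffusion Implicit Models (DDIMs), and Vector-Quantized Generative Adversarial Networks (VQ-GANs). Unlike prior slice-wise approaches, our method operates directly on an image volume and leverages weak supervision with only case-level pseudo-labels. We benchmark our approach against state-of-the-art supervised segmentation and detection models. This study demonstrates the feasibility and promise of 3D latent diffusion for weakly supervised anomaly detection. While the current results do not yet match supervised baselines, they reveal key directions for improving reconstruction fidelity and lesion localization. Our findings provide an important step toward annotation-efficient, generative modeling of complex abdominal anatomy.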

[102] LayerGS: Decomposition and Inpainting of Layered 3D Human Avatars via 2D Gaussian Splatting

Yinghan Xu,John Dingliana

Main category: cs.CV

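TL;DR: Proposes a new framework that decomposes arbitrarily posed humans into animatable multi-layered 3D avatars with separate body and garments, using a 2D Gaussian representation and a diffusion model to inpaint hidden regions, enabling high-quality rendering and virtual try-on.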

Details Motivation: Conventional single-layer reconstruction locks clothing to one identity, while prior multi-layer methods struggle with occluded regions, so a better decomposition and reconstruction scheme is needed. Method: Represents the geometry of each layer as a set of 2D Gaussians and inpaints occluded regions with a pretrained 2D diffusion model via score-distillation sampling (SDS); a three-stage training strategy first performs coarse single-layer reconstruction, then jointly optimizes inner- and outer-layer details. Result: Experiments on the 4D-Dress and Thuman2.0 datasets show better rendering quality and layer decomposition and recomposition than the previous state of the art, with realistic virtual try-on under novel viewpoints and poses. Conclusion: The method resolves occlusion when separating garments from the body, improves the quality and practicality of multi-layer 3D human modeling, and advances high-fidelity virtual human assets for immersive applications. Abstract: We propose a novel framework for decomposing arbitrarily posed humans into animatable multi-layered 3D human avatars, separating the body and garments. Conventional single-layer reconstruction methods lock clothing to one identity, while prior multi-layer approaches struggle with occluded regions. We overcome both limitations by encoding each layer as a set of 2D Gaussians for accurate geometry and photorealistic rendering, and inpainting hidden regions with a pretrained 2D diffusion model via score-distillation sampling (SDS). Our three-stage training strategy first reconstructs the coarse canonical garment via single-layer reconstruction, followed by multi-layer training to jointly recover the inner-layer body and outer-layer garment details. Experiments on two 3D human benchmark datasets (4D-Dress, Thuman2.0) show that our approach achieves better rendering quality and layer decomposition and recomposition than the previous state-of-the-art, enabling realistic virtual try-on under novel viewpoints and poses, and advancing practical creation of high-fidelity 3D human assets for immersive applications. Our code is available at https://github.com/RockyXu66/LayerGS

[103] Bidirectional Channel-selective Semantic Interaction for Semi-Supervised Medical Segmentation

Kaiwen Huang,Yizhe Zhang,Yi Zhou,Tianyang Xu,Tao Zhou

Main category: cs.CV

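TL;DR: Proposes a Bidirectional Channel-selective Semantic Interaction (BCSI) framework for semi-supervised medical image segmentation; a semantic-spatial perturbation mechanism and a channel-selective router improve model stability and feature interaction, outperforming existing methods on several 3D medical datasets.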

Details Motivation: Existing semi-supervised methods suffer from error accumulation and structural complexity, and neglect the interaction between labeled and unlabeled data streams, which limits their performance in medical image segmentation. Method: Proposes the BCSI framework, comprising a Semantic-Spatial Perturbation (SSP) mechanism for strong augmentation and pseudo-label learning, a Channel-selective Router (CR) that dynamically selects relevant channels to reduce noise, and a Bidirectional Channel-wise Interaction (BCI) strategy that enhances the semantic representation of important channels. Result: Validated on several 3D medical image segmentation benchmarks, where the method outperforms existing semi-supervised approaches. Conclusion: BCSI improves semi-supervised medical image segmentation by increasing model robustness and optimizing feature interaction across data streams, offering a better solution when labeled data is limited. Abstract: Semi-supervised medical image segmentation is an effective method for addressing scenarios with limited labeled data. Existing methods mainly rely on frameworks such as mean teacher and dual-stream consistency learning. These approaches often face issues like error accumulation and model structural complexity, while also neglecting the interaction between labeled and unlabeled data streams. To overcome these challenges, we propose a Bidirectional Channel-selective Semantic Interaction (BCSI) framework for semi-supervised medical image segmentation. First, we propose a Semantic-Spatial Perturbation (SSP) mechanism, which disturbs the data using two strong augmentation operations and leverages unsupervised learning with pseudo-labels from weak augmentations. Additionally, we employ consistency on the predictions from the two strong augmentations to further improve model stability and robustness. Second, to reduce noise during the interaction between labeled and unlabeled data, we propose a Channel-selective Router (CR) component, which dynamically selects the most relevant channels for information exchange. This mechanism ensures that only highly relevant features are activated, minimizing unnecessary interference. Finally, the Bidirectional Channel-wise Interaction (BCI) strategy is employed to supplement additional semantic information and enhance the representation of important channels. Experimental results on multiple benchmarking 3D medical datasets demonstrate that the proposed method outperforms existing semi-supervised approaches.

[104] Phase4DFD: Multi-Domain Phase-Aware Attention for Deepfake Detection

Zhen-Xin Lin,Shang-Kuan Chen

Main category: cs.CV

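TL;DR: Proposes Phase4DFD, a new deepfake detection framework that explicitly models the interaction between phase and magnitude in the frequency domain and uses phase discontinuities to guide an attention mechanism, improving detection performance.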

Details Motivation: Existing deepfake detectors mostly rely on spectral magnitude and overlook phase; because synthetic images often introduce phase discontinuities, exploring phase information can reveal subtler forgery traces. Method: Proposes the Phase4DFD framework, which combines FFT magnitude and local binary patterns (LBP) as multi-domain input and designs a phase-aware attention module that, before the backbone, uses phase information to focus on suspicious frequency patterns; a BNext-M backbone extracts features, with optional channel-spatial attention to refine semantic features. Result: Experiments on CIFAKE and DFFD show that Phase4DFD outperforms existing spatial-domain and frequency-domain detectors at low computational cost; ablations confirm that phase modeling provides information complementary to magnitude. Conclusion: Explicitly modeling phase is important for deepfake detection; Phase4DFD fuses multi-domain features through phase-aware attention for efficient, accurate detection. Abstract: Recent deepfake detection methods have increasingly explored frequency-domain representations to reveal manipulation artifacts that are difficult to detect in the spatial domain. However, most existing approaches rely primarily on spectral magnitude, implicitly underexploring the role of phase information. In this work, we propose Phase4DFD, a phase-aware frequency-domain deepfake detection framework that explicitly models phase-magnitude interactions via a learnable attention mechanism. Our approach augments standard RGB input with Fast Fourier Transform (FFT) magnitude and local binary pattern (LBP) representations to expose subtle synthesis artifacts that remain indistinguishable under spatial analysis alone. Crucially, we introduce an input-level phase-aware attention module that uses phase discontinuities commonly introduced by synthetic generation to guide the model toward frequency patterns that are most indicative of manipulation before backbone feature extraction. The attended multi-domain representation is processed by an efficient BNext-M backbone, with optional channel-spatial attention applied for semantic feature refinement. Extensive experiments on the CIFAKE and DFFD datasets demonstrate that our proposed model Phase4DFD outperforms state-of-the-art spatial and frequency-based detectors while maintaining low computational overhead. Comprehensive ablation studies further confirm that explicit phase modeling provides complementary and non-redundant information beyond magnitude-only frequency representations.
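Why phase carries information that magnitude-only features miss can be seen with a tiny discrete Fourier transform. The sketch below is a naive 2D DFT in pure Python (a real pipeline would use an FFT library); it is not Phase4DFD's attention module, and the names `dft2` and `magnitude_and_phase` are illustrative. It shows the shift property: moving a pixel leaves the magnitude spectrum unchanged and alters only the phase.

```python
import cmath
import math


def dft2(image):
    """Naive 2D discrete Fourier transform of a small grayscale image."""
    height, width = len(image), len(image[0])
    spectrum = [[0j] * width for _ in range(height)]
    for u in range(height):
        for v in range(width):
            total = 0j
            for y in range(height):
                for x in range(width):
                    angle = -2.0 * math.pi * (u * y / height + v * x / width)
                    total += image[y][x] * cmath.exp(1j * angle)
            spectrum[u][v] = total
    return spectrum


def magnitude_and_phase(image):
    """Split the spectrum into a magnitude map and a phase map."""
    spectrum = dft2(image)
    magnitude = [[abs(c) for c in row] for row in spectrum]
    phase = [[cmath.phase(c) for c in row] for row in spectrum]
    return magnitude, phase


if __name__ == "__main__":
    original = [[1.0, 0.0], [0.0, 0.0]]
    shifted = [[0.0, 1.0], [0.0, 0.0]]  # same pixel moved one column
    print(magnitude_and_phase(original))
    print(magnitude_and_phase(shifted))
```

Both toy images produce the same magnitude map, so a detector fed only magnitudes cannot tell them apart, while the phase maps differ. That is the general property behind treating phase as a complementary input.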

[105] Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens

Yohann Perron,Vladyslav Sydorov,Christophe Pottier,Loic Landrieu

Main category: cs.CV

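TL;DR: Proposes a multi-scale reasoning method for vision transformers that processes images at local and global scales in parallel and passes features between them through learnable relay tokens, preserving detail while adding global awareness and significantly improving ultra-high-resolution image segmentation.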

Details Motivation: For ultra-high-resolution images, existing methods either lose global context (sliding window) or lose detail (downsampling), and cannot preserve both fine local detail and global structure. Method: Processes the image in parallel at a local scale (high resolution, small crops) and a global scale (low resolution, large crops), aggregating and propagating features between the two branches through a small set of learnable relay tokens; compatible with standard transformer architectures (such as ViT and Swin) and adds fewer than 2% parameters. Result: Consistent gains on three ultra-high-resolution segmentation benchmarks (Archaeoscape, URUR, Gleason) and on Cityscapes, with up to 15% relative mIoU improvement. Conclusion: The method balances local detail preservation with global context modeling, improves multi-scale image segmentation at very low additional cost, and applies to a range of transformer backbones. Abstract: Current approaches for segmenting ultra-high-resolution images either slide a window, thereby discarding global context, or downsample and lose fine detail. We propose a simple yet effective method that brings explicit multi-scale reasoning to vision transformers, simultaneously preserving local details and global awareness. Concretely, we process each image in parallel at a local scale (high resolution, small crops) and a global scale (low resolution, large crops), and aggregate and propagate features between the two branches with a small set of learnable relay tokens. The design plugs directly into standard transformer backbones (e.g., ViT and Swin) and adds fewer than 2% parameters. Extensive experiments on three ultra-high-resolution segmentation benchmarks, Archaeoscape, URUR, and Gleason, and on the conventional Cityscapes dataset show consistent gains, with up to 15% relative mIoU improvement. Code and pretrained models are available at https://archaeoscape.ai/work/relay-tokens/ .

[106] Performance of a Deep Learning-Based Segmentation Model for Pancreatic Tumors on Public Endoscopic Ultrasound Datasets

Pankaj Gupta,Priya Mudgil,Niharika Dutta,Kartik Bose,Nitish Kumar,Anupam Kumar,Jimil Shah,Vaneet Jearth,Jayanta Samanta,Vishal Sharma,Harshal Mandavdhare,Surinder Rana,Saroj K Sinha,Usha Dutta

Main category: cs.CV

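TL;DR: Develops a Vision Transformer-based deep learning model for segmenting pancreatic tumors in endoscopic ultrasound (EUS) images; it performs well on large-scale data, but multi-center generalization and false positives leave room for improvement.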

Details Motivation: EUS-based diagnosis of pancreatic cancer depends on operator experience and is therefore subjective, so an objective, automated tumor segmentation method is needed to improve diagnostic accuracy and consistency. Method: The USFM framework with a Vision Transformer backbone is trained and validated on 17,367 EUS images with 5-fold cross-validation and tested on an independent external dataset of 350 images; images are converted to grayscale, cropped, and resized to 512x512 pixels. Result: In 5-fold cross-validation the mean DSC was 0.651, IoU 0.579, sensitivity 69.8%, and specificity 98.8%; on the external validation set the DSC was 0.657 (95% CI: 0.634-0.769), IoU 0.614, sensitivity 71.8%, and specificity 97.7%, but 9.7% of cases showed erroneous multiple predictions. Conclusion: The Vision Transformer-based model performs well for pancreatic tumor segmentation in EUS images and has potential as a clinical aid, but because of data heterogeneity and limited external validation it still needs further refinement, standardization, and validation in prospective studies. Abstract: Background: Pancreatic cancer is one of the most aggressive cancers, with poor survival rates. Endoscopic ultrasound (EUS) is a key diagnostic modality, but its effectiveness is constrained by operator subjectivity. This study evaluates a Vision Transformer-based deep learning segmentation model for pancreatic tumors. Methods: A segmentation model using the USFM framework with a Vision Transformer backbone was trained and validated with 17,367 EUS images (from two public datasets) in 5-fold cross-validation. The model was tested on an independent dataset of 350 EUS images from another public dataset, manually segmented by radiologists. Preprocessing included grayscale conversion, cropping, and resizing to 512x512 pixels. Metrics included Dice similarity coefficient (DSC), intersection over union (IoU), sensitivity, specificity, and accuracy. Results: In 5-fold cross-validation, the model achieved a mean DSC of 0.651 +/- 0.738, IoU of 0.579 +/- 0.658, sensitivity of 69.8%, specificity of 98.8%, and accuracy of 97.5%. For the external validation set, the model achieved a DSC of 0.657 (95% CI: 0.634-0.769), IoU of 0.614 (95% CI: 0.590-0.689), sensitivity of 71.8%, and specificity of 97.7%. Results were consistent, but 9.7% of cases exhibited erroneous multiple predictions. Conclusions: The Vision Transformer-based model demonstrated strong performance for pancreatic tumor segmentation in EUS images. However, dataset heterogeneity and limited external validation highlight the need for further refinement, standardization, and prospective studies.
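
The metrics named in this abstract (DSC, IoU, sensitivity, specificity, accuracy) all derive from the same four pixel counts. The following is a minimal sketch, assuming binary masks flattened to 0/1 sequences; the function name `segmentation_metrics` and the toy masks are illustrative and are not taken from the paper.

```python
def segmentation_metrics(pred, truth):
    """Compute overlap metrics for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")

    # Confusion counts over all pixels.
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    tn = len(pred) - tp - fp - fn

    def ratio(num, den):
        # Undefined ratios (for example, an empty ground truth) become NaN.
        return num / den if den else float("nan")

    return {
        "dsc": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(pred)),
    }


# Toy masks (hypothetical data): 2 true positives, 1 false positive, 1 false negative.
pred = [1, 1, 1, 0, 0, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0, 0, 0]
metrics = segmentation_metrics(pred, truth)
```

For a single pair of masks, DSC and IoU are tied by DSC = 2 * IoU / (1 + IoU). Values averaged over many images, as reported in the abstract, need not satisfy that identity.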

[107] Context-Aware Decoding for Faithful Vision-Language Generation

Mehrdad Fazli,Bowen Wei,Ziwei Zhu

Main category: cs.CV

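TL;DR: This paper proposes CEI, a lightweight, training-free method that uses the context embedding as a grounding signal to reduce hallucinations during generation in large vision-language models (LVLMs). It builds on a Logit Lens finding that truthful and hallucinated tokens accumulate probability at different rates across layers.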

Details Motivation: Large vision-language models tend to hallucinate content inconsistent with the visual input in open-ended tasks, which undermines their reliability and calls for effective mitigation. Method: The Logit Lens is used to analyze generation dynamics across the decoder layers of LVLMs, revealing a commitment-depth gap; Context Embedding Injection (CEI) then uses the hidden state of the last input token as a guidance signal for visual fidelity. Result: On the CHAIR, AMBER, and MMHal-Bench benchmarks, CEI outperforms state-of-the-art baselines on three LVLMs, and its dynamic variant achieves the lowest overall hallucination rate. Conclusion: By exposing a key dynamic of the generation mechanism and introducing a simple, efficient intervention, this work offers a new training-free approach to mitigating LVLM hallucinations. Abstract: Hallucinations, generating responses inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs), especially in open-ended tasks such as image captioning and visual reasoning. In this work, we probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy. Employing the Logit Lens, we examine how LVLMs construct next-token distributions across decoder layers, uncovering a pronounced commitment-depth gap: truthful tokens accumulate probability mass on their final candidates earlier than hallucinatory ones. Drawing on this discovery, we introduce Context Embedding Injection (CEI), a lightweight method that harnesses the hidden state of the last input token (the context embedding) as a grounding signal to maintain visual fidelity throughout decoding and curb hallucinations. Evaluated on the CHAIR, AMBER, and MMHal-Bench benchmarks (with a maximum token length of 512), CEI outperforms state-of-the-art baselines across three LVLMs, with its dynamic variant yielding the lowest overall hallucination rates. By integrating novel mechanistic insights with a scalable intervention, this work advances the mitigation of hallucinations in LVLMs.
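
The Logit Lens idea referenced above can be illustrated on toy data: each layer's hidden state is projected through the unembedding matrix to obtain an intermediate next-token distribution. The sketch below is illustrative only. The 0.5 threshold, the definition of `commitment_depth`, and all numbers are assumptions rather than the paper's procedure, and the final normalization layer that real models apply before unembedding is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden_states, unembed):
    """Project each layer's hidden state through the unembedding matrix.

    hidden_states: one vector of length d per layer.
    unembed: one row of length d per vocabulary entry.
    Returns one next-token distribution per layer.
    """
    dists = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        dists.append(softmax(logits))
    return dists


def commitment_depth(dists, threshold=0.5):
    """First layer at which the finally chosen token reaches the threshold.

    This definition is an assumption made for illustration.
    """
    final = dists[-1]
    token = max(range(len(final)), key=final.__getitem__)
    for layer, dist in enumerate(dists):
        if dist[token] >= threshold:
            return layer
    return None


# Toy unembedding matrix: 3 vocabulary entries, hidden size 2 (hypothetical).
unembed = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

# Two hypothetical tokens: one commits to its final candidate early, one late.
early = [[0.1, 0.0], [3.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
late = [[0.0, 0.1], [0.5, 0.4], [0.6, 0.5], [5.0, 0.0]]

early_depth = commitment_depth(logit_lens(early, unembed))
late_depth = commitment_depth(logit_lens(late, unembed))
```

In this toy setup the first sequence crosses the threshold at layer 1 and the second only at layer 3, which mirrors the kind of gap the abstract describes between truthful and hallucinatory tokens.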

[108] WaveRNet: Wavelet-Guided Frequency Learning for Multi-Source Domain-Generalized Retinal Vessel Segmentation

Chanchan Wang,Yuanfang Wang,Qing Xu,Guanxin Chen

Main category: cs.CV

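TL;DR: Proposes WaveRNet, a wavelet-guided frequency learning framework for robust multi-source domain-generalized retinal vessel segmentation, which improves cross-domain performance through frequency-domain modeling and hierarchical refinement.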

Details Motivation: Existing SAM-based methods ignore frequency-domain information and fail to preserve fine vessel structures, so they generalize poorly under illumination and contrast variations. Method: Three modules are designed: a Spectral-guided Domain Modulator (SDM) separates illumination-robust low-frequency features from high-frequency vessel boundaries; a Frequency-Adaptive Domain Fusion (FADF) module performs soft-weighted fusion at test time based on wavelet similarity; and a Hierarchical Mask-Prompt Refiner (HMPR) recovers detail from coarse to fine. Result: Evaluated under the Leave-One-Domain-Out protocol on four public datasets, WaveRNet achieves state-of-the-art domain generalization performance. Conclusion: By introducing wavelet-based frequency analysis and a hierarchical refinement mechanism, WaveRNet improves the robustness and accuracy of retinal vessel segmentation in cross-domain settings. Abstract: Domain-generalized retinal vessel segmentation is critical for automated ophthalmic diagnosis, yet faces significant challenges from domain shift induced by non-uniform illumination and varying contrast, compounded by the difficulty of preserving fine vessel structures. While the Segment Anything Model (SAM) exhibits remarkable zero-shot capabilities, existing SAM-based methods rely on simple adapter fine-tuning while overlooking frequency-domain information that encodes domain-invariant features, resulting in degraded generalization under illumination and contrast variations. Furthermore, SAM's direct upsampling inevitably loses fine vessel details. To address these limitations, we propose WaveRNet, a wavelet-guided frequency learning framework for robust multi-source domain-generalized retinal vessel segmentation. Specifically, we devise a Spectral-guided Domain Modulator (SDM) that integrates wavelet decomposition with learnable domain tokens, enabling the separation of illumination-robust low-frequency structures from high-frequency vessel boundaries while facilitating domain-specific feature generation. Furthermore, we introduce a Frequency-Adaptive Domain Fusion (FADF) module that performs intelligent test-time domain selection through wavelet-based frequency similarity and soft-weighted fusion. Finally, we present a Hierarchical Mask-Prompt Refiner (HMPR) that overcomes SAM's upsampling limitation through coarse-to-fine refinement with long-range dependency modeling. Extensive experiments under the Leave-One-Domain-Out protocol on four public retinal datasets demonstrate that WaveRNet achieves state-of-the-art generalization performance. The source code is available at https://github.com/Chanchan-Wang/WaveRNet.
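
The wavelet decomposition this abstract relies on splits an image into one low-frequency band and several high-frequency detail bands. Below is a minimal sketch of a single-level 2D Haar transform, assuming a grayscale image stored as a list of rows with even dimensions. The abstract does not state which wavelet WaveRNet uses, so Haar is an assumption chosen for simplicity, and the function name `haar_dwt2` is illustrative.

```python
def haar_dwt2(image):
    """Single-level orthonormal 2D Haar transform of a list-of-rows image.

    Returns four half-resolution bands: the low-frequency approximation and
    three detail bands (differences across columns, across rows, and diagonal).
    """
    rows, cols = len(image), len(image[0])
    if rows % 2 or cols % 2:
        raise ValueError("image dimensions must be even")

    approx, col_detail, row_detail, diag_detail = [], [], [], []
    for r in range(0, rows, 2):
        ap, cd, rd, dd = [], [], [], []
        for c in range(0, cols, 2):
            # One 2x2 block: top-left, top-right, bottom-left, bottom-right.
            tl, tr = image[r][c], image[r][c + 1]
            bl, br = image[r + 1][c], image[r + 1][c + 1]
            ap.append((tl + tr + bl + br) / 2)  # smooth, low-frequency content
            cd.append((tl - tr + bl - br) / 2)  # change across columns
            rd.append((tl + tr - bl - br) / 2)  # change across rows
            dd.append((tl - tr - bl + br) / 2)  # diagonal change
        approx.append(ap)
        col_detail.append(cd)
        row_detail.append(rd)
        diag_detail.append(dd)
    return approx, col_detail, row_detail, diag_detail


# Toy 2x4 image (hypothetical): a sharp vertical edge on the left, a flat region on the right.
image = [
    [1, 5, 2, 2],
    [1, 5, 2, 2],
]
approx, col_detail, row_detail, diag_detail = haar_dwt2(image)
```

The flat region produces zero detail coefficients, while the vertical edge appears only in the across-columns band. Thin structures such as vessel boundaries concentrate in the detail bands in the same way.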

[109] VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction

Longbin Ji,Xiaoxiong Liu,Junyuan Shang,Shuohuan Wang,Yu Sun,Hua Wu,Haifeng Wang

Main category: cs.CV

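TL;DR: This paper proposes VideoAR, the first large-scale Visual Autoregressive (VAR) framework for video generation. It combines multi-scale next-frame prediction with autoregressive modeling and achieves performance comparable to diffusion models while keeping inference efficient.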

Details Motivation: Existing video generation methods rely mainly on diffusion or flow-matching models, which produce good results but are computationally expensive and hard to scale, so a more efficient and scalable autoregressive method is needed for high-quality video generation. Method: The VideoAR framework uses a 3D multi-scale tokenizer to decouple spatial and temporal dependencies and combines intra-frame VAR modeling with causal next-frame prediction; it introduces Multi-scale Temporal RoPE, Cross-Frame Error Correction, and a Random Frame Mask strategy, and uses a multi-stage pretraining pipeline to build up spatio-temporal modeling capacity progressively. Result: On UCF-101, FVD improves from 99.5 to 88.6 with over 10x fewer inference steps; the VBench score is 81.74, close to that of diffusion models an order of magnitude larger. Conclusion: VideoAR substantially narrows the performance gap between autoregressive and diffusion models for video generation and offers an efficient, scalable, and temporally consistent paradigm. Abstract: Recent advances in video generation have been dominated by diffusion and flow-matching models, which produce high-quality results but remain computationally intensive and difficult to scale. In this work, we introduce VideoAR, the first large-scale Visual Autoregressive (VAR) framework for video generation that combines multi-scale next-frame prediction with autoregressive modeling. VideoAR disentangles spatial and temporal dependencies by integrating intra-frame VAR modeling with causal next-frame prediction, supported by a 3D multi-scale tokenizer that efficiently encodes spatio-temporal dynamics. To improve long-term consistency, we propose Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask, which collectively mitigate error propagation and stabilize temporal coherence. Our multi-stage pretraining pipeline progressively aligns spatial and temporal learning across increasing resolutions and durations. Empirically, VideoAR achieves new state-of-the-art results among autoregressive models, improving FVD on UCF-101 from 99.5 to 88.6 while reducing inference steps by over 10x, and reaching a VBench score of 81.74, competitive with diffusion-based models an order of magnitude larger. These results demonstrate that VideoAR narrows the performance gap between autoregressive and diffusion paradigms, offering a scalable, efficient, and temporally consistent foundation for future video generation research.

[110] Adaptive Conditional Contrast-Agnostic Deformable Image Registration with Uncertainty Estimation

Yinsong Wang,Xinzhe Luo,Siyi Du,Chen Qin

Main category: cs.CV

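TL;DR: Proposes AC-CAR, an adaptive conditional contrast-agnostic deformable image registration framework that generalizes to arbitrary imaging contrasts unseen during training and offers higher registration accuracy and reliability.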

Details Motivation: Conventional methods rely on iterative optimization and are slow, while existing learning-based methods generalize poorly and work only for contrasts seen during training. Method: A random convolution-based contrast augmentation scheme and an adaptive conditional feature modulator (ACFM) are combined with contrast-invariant latent regularization and a variance network to achieve contrast-agnostic registration with uncertainty estimation. Result: Experiments show that AC-CAR outperforms baseline methods in registration accuracy and generalizes well to unseen contrasts. Conclusion: AC-CAR is an efficient, general, and reliable framework for multi-contrast medical image registration with potential for practical use. Abstract: Deformable multi-contrast image registration is a challenging yet crucial task due to the complex, non-linear intensity relationships across different imaging contrasts. Conventional registration methods typically rely on iterative optimization of the deformation field, which is time-consuming. Although recent learning-based approaches enable fast and accurate registration during inference, their generalizability remains limited to the specific contrasts observed during training. In this work, we propose an adaptive conditional contrast-agnostic deformable image registration framework (AC-CAR) based on a random convolution-based contrast augmentation scheme. AC-CAR can generalize to arbitrary imaging contrasts without observing them during training. To encourage contrast-invariant feature learning, we propose an adaptive conditional feature modulator (ACFM) that adaptively modulates the features and the contrast-invariant latent regularization to enforce the consistency of the learned feature across different imaging contrasts. Additionally, we enable our framework to provide contrast-agnostic registration uncertainty by integrating a variance network that leverages the contrast-agnostic registration encoder to improve the trustworthiness and reliability of AC-CAR. Experimental results demonstrate that AC-CAR outperforms baseline methods in registration accuracy and exhibits superior generalization to unseen imaging contrasts. Code is available at https://github.com/Yinsong0510/AC-CAR.
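
The general idea behind random convolution-based contrast augmentation can be sketched as follows: a freshly sampled kernel remaps local intensities while leaving the image geometry untouched, so a network sees many synthetic "contrasts" of the same anatomy. This is only an illustration of the general technique. The kernel size, Gaussian sampling, zero padding, and min-max rescaling are assumptions, and the name `random_conv_augment` is hypothetical; none of it is taken from the AC-CAR code.

```python
import random


def random_conv_augment(image, kernel_size=3, seed=None):
    """Filter a list-of-rows grayscale image with a randomly sampled kernel.

    The result keeps the input size (zero padding) and is rescaled to [0, 1].
    """
    rng = random.Random(seed)
    k = kernel_size
    pad = k // 2
    kernel = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(k)]
    rows, cols = len(image), len(image[0])

    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    # Pixels outside the image are treated as zero.
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += kernel[i][j] * image[rr][cc]
            out[r][c] = acc

    # Min-max rescale so every augmented image shares the same intensity range.
    lo = min(min(row) for row in out)
    hi = max(max(row) for row in out)
    if hi == lo:
        return [[0.0] * cols for _ in range(rows)]
    return [[(v - lo) / (hi - lo) for v in row] for row in out]


# Toy 4x4 image (hypothetical data) augmented with two different random kernels.
image = [
    [0, 1, 2, 3],
    [1, 2, 3, 4],
    [2, 3, 4, 5],
    [3, 4, 5, 6],
]
view_a = random_conv_augment(image, seed=0)
view_b = random_conv_augment(image, seed=1)
```

Both views share the pixel grid of the input but differ in intensity pattern, which is what lets a registration network train on contrasts it would otherwise never observe.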

[111] Deepfake detectors are DUMB: A benchmark to assess adversarial training robustness under transferability constraints

Adrian Serrano,Erwan Umlil,Ronan Thomas

Main category: cs.CV

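TL;DR: This study extends the DUMB and DUMBer methodology to evaluate the robustness of five state-of-the-art deepfake detectors under adversarial attacks. Adversarial training strengthens robustness on in-distribution data but can degrade it in cross-dataset settings, which underlines the need for case-aware defense strategies designed for real-world scenarios.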

Details Motivation: Deepfake detection systems deployed in the real world face adversaries who degrade model performance with imperceptible perturbations, and the effectiveness of existing adversarial training under limited attacker knowledge and mismatched data distributions remains unclear, so a robustness evaluation closer to real conditions is needed. Method: Using the extended DUMB and DUMBer framework under cross-dataset configurations and transferability constraints, five detectors (RECCE, SRM, XCeption, UCF, SPSL) are evaluated against three attacks (PGD, FGSM, FPBA), analyzing both attacker and defender behavior across data distribution mismatch scenarios. Result: Adversarial training improves detector robustness on in-distribution data, but in cross-dataset settings its effect depends on the strategy and in some cases reduces robustness; different detector and attack combinations behave very differently. Conclusion: Adversarial training is not effective in every real-world scenario; deployed deepfake detection systems should choose defense strategies that fit the specific threat model and data distribution, which highlights the importance of case-aware defenses. Abstract: Deepfake detection systems deployed in real-world environments are subject to adversaries capable of crafting imperceptible perturbations that degrade model performance. While adversarial training is a widely adopted defense, its effectiveness under realistic conditions, where attackers operate with limited knowledge and mismatched data distributions, remains underexplored. In this work, we extend the DUMB (Dataset soUrces, Model architecture and Balance) and DUMBer methodology to deepfake detection. We evaluate detectors' robustness against adversarial attacks under transferability constraints and cross-dataset configuration to extract real-world insights. Our study spans five state-of-the-art detectors (RECCE, SRM, XCeption, UCF, SPSL), three attacks (PGD, FGSM, FPBA), and two datasets (FaceForensics++ and Celeb-DF-V2). We analyze both attacker and defender perspectives, mapping results to mismatch scenarios. Experiments show that adversarial training strategies reinforce robustness in the in-distribution cases but can also degrade it under cross-dataset configuration depending on the strategy adopted. These findings highlight the need for case-aware defense strategies in real-world applications exposed to adversarial attacks.
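
FGSM, one of the three attacks listed above, perturbs an input by a small step in the direction of the sign of the loss gradient. The sketch below illustrates it in a transfer setting: the perturbation is crafted on a surrogate model the attacker controls and then evaluated on a different target model. Both "detectors" here are toy logistic regressions with made-up weights, standing in for the deep networks used in the study; all names and numbers are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability that x is fake under a toy logistic-regression detector."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def fgsm(weights, bias, x, label, eps):
    """One FGSM step against a logistic-regression detector.

    For binary cross-entropy, the gradient of the loss with respect to the
    input is (p - label) * weights. Each feature moves by eps in the sign of
    that gradient, and the result is clipped to the valid range [0, 1].
    """
    p = predict(weights, bias, x)
    grad = [(p - label) * w for w in weights]
    signs = [(g > 0) - (g < 0) for g in grad]
    return [min(1.0, max(0.0, v + eps * s)) for v, s in zip(x, signs)]


# Hypothetical surrogate (known to the attacker) and target (deployed) detectors.
surrogate_w, surrogate_b = [2.0, -1.0, 0.5], -0.2
target_w, target_b = [1.5, -0.5, 1.0], -0.1

x = [0.6, 0.2, 0.4]  # toy feature vector of a fake sample (label 1)
x_adv = fgsm(surrogate_w, surrogate_b, x, label=1, eps=0.1)

p_target_clean = predict(target_w, target_b, x)
p_target_adv = predict(target_w, target_b, x_adv)
```

Here the perturbation crafted on the surrogate also lowers the target's confidence that the sample is fake, from about 0.75 to about 0.69. How well such perturbations transfer between real detectors and datasets is the question the benchmark investigates.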