cs.CL

[1] LLM-Augmented Knowledge Base Construction For Root Cause Analysis

Nguyen Phuc Tran,Brigitte Jaumard,Oscar Delgado,Tristan Glatard,Karthikeyan Premkumar,Kun Ni

Main category: cs.CL

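TL;DR: This study evaluates three large language model (LLM) approaches (fine-tuning, RAG, and a hybrid approach) for building a root cause analysis (RCA) knowledge base from support tickets, and reports on a real industrial dataset that the resulting knowledge base helps accelerate RCA tasks and improve network resilience.

Motivation: Despite redundancy and failover mechanisms, communications networks struggle to guarantee 99.999% reliability, so rapid and accurate root cause analysis (RCA) is needed when failures occur to restore service and prevent future outages.

Method: Evaluates three LLM approaches (Fine-Tuning, Retrieval-Augmented Generation (RAG), and a Hybrid approach) for constructing an RCA knowledge base from support tickets, and compares their performance using a range of lexical and semantic similarity metrics.

Result: Experiments on a real industrial dataset show that the constructed knowledge base provides a good starting point for accelerating RCA tasks and improving network resilience.

Conclusion: All three LLM approaches are effective; the hybrid approach may combine the strengths of fine-tuning and RAG and shows promise for RCA knowledge base construction.

Abstract: Communications networks now form the backbone of our digital world, with fast and reliable connectivity. However, even with appropriate redundancy and failover mechanisms, it is difficult to guarantee "five 9s" (99.999%) reliability, requiring rapid and accurate root cause analysis (RCA) during outages. In the event of an outage, rapid and accurate RCA becomes essential to restore service and prevent future disruptions. This study evaluates three Large Language Model (LLM) methodologies - Fine-Tuning, RAG, and a Hybrid approach - for constructing a Root Cause Analysis (RCA) Knowledge Base from support tickets. We compare their performance using a comprehensive suite of lexical and semantic similarity metrics. Our experiments on a real industrial dataset demonstrate that the generated knowledge base provides an excellent starting point for accelerating RCA tasks and improving network resilience.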

[2] The Stepwise Informativeness Assumption: Why are Entropy Dynamics and Reasoning Correlated in LLMs?

Mar Gonzàlez I Català,Haitz Sáez de Ocáriz Borde,George D. Montañez,Pietro Liò

Main category: cs.CL

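TL;DR: This paper proposes the Stepwise Informativeness Assumption (SIA) to explain why the internal predictive entropy dynamics of large language models correlate strongly with external answer correctness: during generation, the model progressively accumulates information about the true answer through answer-informative prefixes. The assumption arises from maximum-likelihood training, is reinforced by fine-tuning and reinforcement learning, and is tested empirically across multiple models and datasets.

Motivation: To explain the principle behind the empirical observation that internal predictive entropy in LLMs (measured at multiple representation levels) correlates robustly with final answer correctness.

Method: Proposes the Stepwise Informativeness Assumption (SIA), which formalizes the property that reasoning prefixes accumulate answer-relevant information in expectation as generation proceeds; argues that SIA follows from maximum-likelihood optimization and analyzes how fine-tuning and reinforcement-learning pipelines reinforce it; then derives observable signatures in conditional answer entropy dynamics and tests them on multiple benchmarks and open-weight LLMs.

Result: SIA is shown to emerge naturally from maximum-likelihood training on human reasoning traces and to be reinforced by standard training pipelines; experiments show that correct reasoning traces exhibit characteristic conditional answer entropy patterns, consistently across models (Gemma-2, LLaMA-3.2, Qwen-2.5, and others) and tasks (GSM8K, ARC, SVAMP).

Conclusion: The strong correlation between internal entropy dynamics and external correctness is not accidental; it stems from a reasoning mechanism, learned during training, that progressively accumulates answer information. SIA offers a theoretical basis and a testable diagnostic for understanding LLM reasoning.

Abstract: Recent work uses entropy-based signals at multiple representation levels to study reasoning in large language models, but the field remains largely empirical. A central unresolved puzzle is why internal entropy dynamics, defined under the predictive distribution of a model, correlate so robustly with external correctness given by the ground-truth answer. In this paper, we argue that this correlation arises because autoregressive models reason correctly when they accumulate information about the true answer via answer-informative prefixes. We formalize this intuition via the Stepwise Informativeness Assumption (SIA), which states that reasoning prefixes accumulate answer-relevant information in expectation as generation progresses. We show that SIA naturally emerges from maximum-likelihood optimization on human reasoning traces and is reinforced by standard fine-tuning and reinforcement-learning pipelines. We then derive observable signatures of SIA linking conditional answer entropy dynamics to correctness. We empirically test SIA across multiple reasoning benchmarks (GSM8K, ARC, SVAMP) and a diverse set of open-weight LLMs (Gemma-2, LLaMA-3.2, Qwen-2.5, DeepSeek and Olmo variants), showing that training induces it and that correct traces exhibit characteristic conditional answer entropy patterns.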
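
To make the notion of conditional answer entropy in entry [2] concrete, the following is a minimal sketch, not the authors' procedure. It assumes that at each reasoning step we already have a probability distribution over candidate answers given the prefix so far; how that distribution is obtained from a model is outside the sketch. The function names and the toy distributions are hypothetical. If prefixes accumulate answer-relevant information, one would expect the entropy to tend downward along a trace, which the last helper checks on a single toy example.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_trajectory(step_distributions):
    """Entropy of the answer distribution after each reasoning prefix."""
    return [entropy(dist) for dist in step_distributions]

def is_nonincreasing(values, tol=1e-9):
    """True if each value is no larger than the previous one (up to tol)."""
    return all(b <= a + tol for a, b in zip(values, values[1:]))

# Hypothetical answer distributions over four candidates after 0, 1, and 2 steps.
steps = [
    [0.25, 0.25, 0.25, 0.25],    # no information yet: uniform
    [0.50, 0.25, 0.125, 0.125],  # one step in: mass shifts to one candidate
    [0.90, 0.05, 0.03, 0.02],    # two steps in: nearly decided
]
trajectory = entropy_trajectory(steps)
```

The assumption in the abstract is stated in expectation, so a single trace need not be monotone; the check above is only an illustration of the quantity being tracked.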

[3] Depression Detection at the Point of Care: Automated Analysis of Linguistic Signals from Routine Primary Care Encounters

Feng Chen,Manas Bedmutha,Janice Sabin,Andrea Hartzler,Nadir Weibel,Trevor Cohen

Main category: cs.CL

TL;DR: Using 1,108 audio-recorded primary care encounters, this study compares several natural language processing methods for automated depression detection. Zero-shot GPT-OSS performs best, combined dyadic transcripts outperform single-speaker ones, and meaningful in-the-moment detection is achievable from only the first 128 patient tokens.

Motivation: Depression is often missed in primary care, and the spread of digital scribing technology creates a new opportunity to detect it automatically from real patient-provider dialogue.

Method: Based on 1,108 audio-recorded encounters from the Establishing Focus study (253 depressed and 855 non-depressed by PHQ-9), compares four methods: Sentence-BERT + logistic regression (LR), LIWC + LR, ModernBERT, and zero-shot GPT-OSS; analyzes how single-speaker versus dyadic transcripts and different token lengths affect performance.

Result: GPT-OSS performs best (AUPRC=0.510, AUROC=0.774); LIWC+LR is competitive among the supervised models (AUPRC=0.500, AUROC=0.742); dyadic transcripts outperform single-speaker ones; the first 128 patient tokens alone reach AUPRC=0.356 and AUROC=0.675.

Conclusion: Passively collected clinical audio can serve as a low-burden complement to existing screening workflows and support in-the-moment clinical decisions.

Abstract: Depression is underdiagnosed in primary care, yet timely identification remains critical. Recorded clinical encounters, increasingly common with digital scribing technologies, present an opportunity to detect depression from naturalistic dialogue. We investigated automated depression detection from 1,108 audio-recorded primary care encounters in the Establishing Focus study, with depression defined by PHQ-9 (n=253 depressed, n=855 non-depressed). We compared three supervised approaches, Sentence-BERT + Logistic Regression (LR), LIWC+LR and ModernBERT, against a zero-shot GPT-OSS. GPT-OSS achieved the strongest performance (AUPRC=0.510, AUROC=0.774), with LIWC+LR competitive among supervised models (AUPRC=0.500, AUROC=0.742). Combined dyadic transcripts outperformed single-speaker configurations, with providers linguistically mirroring patients in depression encounters, an additive signal not captured by either speaker alone. Meaningful detection is achievable from the first 128 patient tokens (AUPRC=0.356, AUROC=0.675), supporting in-the-moment clinical decision support. These findings argue for passively collected clinical audio as a low-burden complement to existing screening workflows.

[4] Hallucination as output-boundary misclassification: a composite abstention architecture for language models

Angelina Hintsanen

Main category: cs.CL

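TL;DR: This paper proposes a composite intervention that combines instruction-based refusal with a structural abstention gate to reduce hallucination in large language models. A support deficit score is computed from three black-box signals (self-consistency, paraphrase stability, and citation coverage), and output is blocked when the score exceeds a threshold. Experiments show that neither mechanism alone is sufficient, while the combination achieves high accuracy with low hallucination.

Motivation: Large language models often produce unsupported claims; the author frames this as a misclassification at the output boundary, where internally generated content is emitted as if it were grounded in evidence.

Method: Proposes a composite intervention architecture that pairs instruction-based refusal prompting with a structural abstention gate; the gate computes a support deficit score St from three black-box signals, namely self-consistency (At), paraphrase stability (Pt), and citation coverage (Ct), and blocks output when St exceeds a threshold.

Result: In a controlled evaluation across 50 items, five epistemic regimes, and three models, the composite architecture achieved high overall accuracy with low hallucination; a supplementary 100-item no-context stress test derived from TruthfulQA showed that structural gating provides a capability-independent abstention floor.

Conclusion: Instruction-based refusal and structural gating have complementary failure modes; neither alone controls hallucination adequately, and combining them is the more robust option.

Abstract: Large language models often produce unsupported claims. We frame this as a misclassification error at the output boundary, where internally generated completions are emitted as if they were grounded in evidence. This motivates a composite intervention that combines instruction-based refusal with a structural abstention gate. The gate computes a support deficit score, St, from three black-box signals: self-consistency (At), paraphrase stability (Pt), and citation coverage (Ct), and blocks output when St exceeds a threshold. In a controlled evaluation across 50 items, five epistemic regimes, and three models, neither mechanism alone was sufficient. Instruction-only prompting reduced hallucination sharply, but still showed over-cautious abstention on answerable items and residual hallucination for GPT-3.5-turbo. The structural gate preserved answerable accuracy across models but missed confident confabulation on conflicting-evidence items. The composite architecture achieved high overall accuracy with low hallucination, while also inheriting some over-abstention from the instruction component. A supplementary 100-item no-context stress test derived from TruthfulQA showed that structural gating provides a capability-independent abstention floor. Overall, instruction-based refusal and structural gating show complementary failure modes, which suggests that effective hallucination control benefits from combining both mechanisms.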
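
The structural gate in entry [4] can be illustrated with a short sketch. The abstract names the three signals and the thresholding rule but does not give the formula for St, so the aggregation below (an equal-weight mean of the three shortfalls, each signal assumed to lie in [0, 1]) and the threshold value 0.5 are assumptions made for illustration, as are all identifiers.

```python
from dataclasses import dataclass

ABSTAIN = "[abstain]"  # hypothetical marker for a blocked output

@dataclass
class Signals:
    self_consistency: float      # A_t, assumed in [0, 1]
    paraphrase_stability: float  # P_t, assumed in [0, 1]
    citation_coverage: float     # C_t, assumed in [0, 1]

def support_deficit(s, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Assumed form of S_t: weighted mean of how far each signal falls short of 1."""
    wa, wp, wc = weights
    return (wa * (1 - s.self_consistency)
            + wp * (1 - s.paraphrase_stability)
            + wc * (1 - s.citation_coverage))

def gate(answer, s, threshold=0.5):
    """Block the answer when the support deficit exceeds the threshold."""
    if support_deficit(s) > threshold:
        return ABSTAIN
    return answer

well_supported = gate("Paris", Signals(0.9, 0.9, 0.8))
poorly_supported = gate("Atlantis", Signals(0.2, 0.3, 0.0))
```

Because the gate only reads black-box signals, it does not depend on the model being willing to refuse, which is one way to read the abstract's "capability-independent abstention floor".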

[5] Consistency-Guided Decoding with Proof-Driven Disambiguation for Three-Way Logical Question Answering

Tianyi Huang,Ming Hou,Jiaheng Su,Yutong Zhang,Ziling Zhang

Main category: cs.CL

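TL;DR: This paper proposes CGD-PD, which improves the accuracy and decisiveness of large language models on three-way logical question answering through mechanically negated queries, consistency projection, and proof-driven disambiguation.

Motivation: Modern large language models show two recurring errors in three-way logic QA: negation inconsistency (answers to H and ¬H that violate the deterministic label mapping) and epistemic Unknown (predicting Unknown even when the premises entail one side).

Method: CGD-PD is a lightweight test-time layer that (a) queries a single three-way classifier on both the hypothesis H and its mechanically negated form; (b) projects the two outputs onto a negation-consistent decision when possible; and (c) when Unknown arises, uses targeted binary entailment probes to resolve it selectively, needing only 4-5 model calls on average.

Result: On the first-order-logic fields of the FOLIO benchmark, CGD-PD yields consistent gains across frontier LLMs, with relative accuracy improvements of up to 16% over the base model, while also reducing Unknown predictions.

Conclusion: CGD-PD mitigates key failure modes in three-way logical reasoning at very low overhead, supporting test-time intervention as a way to improve logical robustness.

Abstract: Three-way logical question answering (QA) assigns $True/False/Unknown$ to a hypothesis $H$ given a premise set $S$. While modern large language models (LLMs) can be accurate on isolated examples, we identify two recurring failure modes in 3-way logic QA: (i) negation inconsistency, where answers to $H$ and $\neg H$ violate the deterministic label mapping, and (ii) epistemic $Unknown$, where the model predicts $Unknown$ due to uncertainty or instability even when $S$ entails one side. We present CGD-PD, a lightweight test-time layer that (a) queries a single 3-way classifier on both $H$ and a mechanically negated form of $H$, (b) projects the pair onto a negation-consistent decision when possible, and (c) invokes a proof-driven disambiguation step that uses targeted binary entailment probes to selectively resolve $Unknown$ outcomes, requiring only an average of 4-5 model calls. On the FOLIO benchmark's first-order-logic fields, CGD-PD yields consistent gains across frontier LLMs, with relative improvements in accuracy of up to 16% over the base model, while also reducing $Unknown$ predictions.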
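
The negation-consistency idea in entry [5] rests on a deterministic label mapping: if H is True then ¬H is False, if H is False then ¬H is True, and Unknown maps to Unknown. The sketch below checks that mapping and applies one plausible projection rule. The abstract does not spell out the paper's projection, so the tie-breaking choices here (trust the definite label when the other is Unknown; fall back to Unknown on a direct contradiction, leaving it for a later disambiguation step) are assumptions, and the function names are hypothetical.

```python
# Deterministic label mapping between a hypothesis H and its negation.
NEGATE = {"True": "False", "False": "True", "Unknown": "Unknown"}

def is_consistent(label_h, label_not_h):
    """True when the label for not-H is exactly the negation of the label for H."""
    return NEGATE[label_h] == label_not_h

def project(label_h, label_not_h):
    """Map a (label for H, label for not-H) pair to a single label for H.

    Assumed rule, for illustration only:
    - consistent pair: keep the label for H;
    - exactly one side Unknown: adopt what the definite side implies for H;
    - both definite but contradictory: return Unknown.
    """
    if is_consistent(label_h, label_not_h):
        return label_h
    if label_h == "Unknown":
        return NEGATE[label_not_h]
    if label_not_h == "Unknown":
        return label_h
    return "Unknown"

examples = {
    ("True", "False"): project("True", "False"),
    ("Unknown", "True"): project("Unknown", "True"),
    ("True", "Unknown"): project("True", "Unknown"),
    ("True", "True"): project("True", "True"),
}
```

In a full pipeline the remaining Unknown outcomes would be handed to the binary entailment probes described in the abstract; that step is not sketched here.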

[6] Temporally Phenotyping GLP-1RA Case Reports with Large Language Models: A Textual Time Series Corpus and Risk Modeling

Sayantan Kumar,Jeremy C. Weiss

Main category: cs.CL

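TL;DR: This paper builds a textual time-series corpus of 136 single-patient type 2 diabetes case reports involving GLP-1 receptor agonists and uses large language models (LLMs) to extract clinical events and their timings automatically, reporting high event coverage and temporal accuracy. A downstream time-to-event analysis on the extracted timelines suggests lower risk of respiratory sequelae among GLP-1 users.

Motivation: Temporal information about clinical events in type 2 diabetes case reports is mostly expressed as unstructured text that is hard to use in longitudinal modeling, so a reusable time-series corpus and automated extraction methods are needed.

Method: Builds a textual time-series corpus of 136 PubMed Open Access case reports, with clinical events and their most probable reference times annotated as the gold standard; evaluates several LLMs on event recovery and timing extraction; and runs a time-to-event analysis on the extracted results.

Result: GPT5 achieved event coverage of 0.871 and temporal sequencing accuracy of 0.843 for symptoms; the downstream analysis suggested lower risk of respiratory sequelae among GLP-1 users than non-users (HR=0.259, p<0.05).

Conclusion: The work provides a high-quality corpus and an effective LLM extraction approach for structuring temporal information in clinical case reports, supporting more reliable longitudinal clinical research and real-world evidence generation.

Abstract: Type 2 diabetes case reports describe complex clinical courses, but their timelines are often expressed in language that is difficult to reuse in longitudinal modeling. To address this gap, we developed a textual time-series corpus of 136 PubMed Open Access single-patient case reports involving glucagon-like peptide 1 receptor agonists, with clinical events associated with their most probable reference times. We evaluated automated LLM timeline extraction against gold-standard timelines annotated by clinical domain experts, assessing how well systems recovered clinical events and their timings. The best-performing LLM produced high event coverage (GPT5; 0.871) and reliable temporal sequencing across symptoms (GPT5; 0.843), diagnoses, treatments, laboratory tests, and outcomes. As a downstream demonstration, time-to-event analyses in diabetes suggested lower risk of respiratory sequelae among GLP-1 users versus non-users (HR=0.259, p<0.05), consistent with prior reports of improved respiratory outcomes. Temporal annotations and code will be released upon acceptance.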

[7] Emergent decentralized regulation in a purely synthetic society

Md Motaleb Hossen Manik,Ge Wang

Main category: cs.CL

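TL;DR: This paper studies whether purely synthetic AI agents can spontaneously develop self-regulated social dynamics without centralized design or human intervention. Analyzing 39,026 posts and 5,712 comments by OpenClaw agents on Moltbook, the authors propose a Directive Intensity (DI) metric for directive language and identify four types of responsive comments. Higher-DI posts are more likely to draw corrective feedback, a pattern that holds in both the statistical models and the within-thread event analysis.

Motivation: To examine whether a purely synthetic society of autonomous AI agents can spontaneously produce self-regulated social dynamics with neither human intervention nor centralized design.

Method: Using observational data from 14,490 OpenClaw agents on Moltbook (39,026 posts, 5,712 comments), builds a lexicon-based Directive Intensity (DI) metric to quantify directive language; classifies responsive comments into four types; and uses a mixed-effects logistic regression with post-level random intercepts, plus event-aligned within-thread text analysis, to test the relationship between DI and corrective signaling.

Result: Directive content is common (DI>0 in 18.4% of posts); higher DI is associated with a higher probability of corrective replies, and this positive association holds across binned estimates, the mixed model, and the event-aligned analysis; the first corrective response is often followed by further negative feedback.

Conclusion: A purely synthetic, agent-only society can produce endogenous corrective signaling whose strength is positively linked to the intensity of directive proposals, suggesting that AI agents have a rudimentary capacity for self-organized social regulation.

Abstract: As autonomous AI agents increasingly inhabit online environments and extensively interact, a key question is whether synthetic collectives exhibit self-regulated social dynamics with neither human intervention nor centralized design. We study OpenClaw agents on Moltbook, an agent-only social network, using an observational archive of 39,026 posts and 5,712 comments authored by 14,490 agents. We quantify action-inducing language with Directive Intensity (DI), a transparent, lexicon-based proxy for directive and instructional phrasing that does not measure moral valence, intent, or execution outcomes. We classify responsive comments into four types: Affirmation, Corrective Signaling, Adverse Reaction, and Neutral Interaction. Directive content is common (DI>0 in 18.4% of posts). More importantly, corrective signaling scales with DI: posts with higher DI exhibit higher corrective reply probability, visible in stable binned estimates with Wilson confidence intervals. To address comment nesting within posts, we fit a post-level random intercept mixed-effects logistic model and find that the positive DI association persists. Event-aligned within-thread analysis of comment text provides additional evidence consistent with negative feedback after the first corrective response. In general, these results suggest that a purely synthetic, agent-only society can exhibit endogenous corrective signaling with a strength positively linked to the intensity of directive proposals.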
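
Entry [7] reports binned corrective-reply probabilities with Wilson confidence intervals. As a reference for readers unfamiliar with that interval, here is a minimal sketch of the standard Wilson score interval for a binomial proportion. The function name is hypothetical and the counts in the example are invented, not taken from the paper.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Invented example: 10 corrective replies among 100 comments in one DI bin.
low, high = wilson_interval(10, 100)
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves sensibly for small bins or proportions near 0, which is presumably why it suits sparse high-DI bins.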

[8] Beyond Facts: Benchmarking Distributional Reading Comprehension in Large Language Models

Pei-Fu Guo,Ya-An Tsai,Chun-Chia Hsu,Kai-Xin Chen,Yun-Da Tsai,Kai-Wei Chang,Nanyun Peng,Mi-Yen Yeh,Shou-De Lin

Main category: cs.CL

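TL;DR: This paper introduces Text2DistBench, a reading comprehension benchmark for evaluating the ability of large language models (LLMs) to infer distributional knowledge from natural language. It is built from real YouTube comments and is constructed by a fully automated, continuously updated pipeline. Experiments show uneven performance on distributional comprehension, with both real capabilities and clear limitations.

Motivation: Existing reading comprehension benchmarks mostly focus on factual questions answerable by localizing evidence, while real-world tasks often require understanding distributional information across collections of text (such as population-level trends and preferences); evaluation tools for this are lacking.

Method: Builds Text2DistBench by automatically collecting YouTube comments and metadata about movie and music entities, designing distributional questions (such as estimating sentiment proportions and ranking the most frequent topics), and implementing a fully automated, continuously updated construction pipeline.

Result: Multiple LLMs substantially outperform random baselines on the benchmark, but performance varies widely across distribution types (for example, proportion estimation versus ranking) and characteristics (for example, sparsity and skew).

Conclusion: Text2DistBench exposes the capability boundaries of current LLMs in distributional reading comprehension and offers a practical, scalable testbed for future research.

Abstract: While most reading comprehension benchmarks for LLMs focus on factual information that can be answered by localizing specific textual evidence, many real-world tasks require understanding distributional information, such as population-level trends and preferences expressed across collections of text. We introduce Text2DistBench, a reading comprehension benchmark for evaluating LLMs' ability to infer distributional knowledge from natural language. Built from real-world YouTube comments about movie and music entities, the benchmark provides models with entity metadata and associated comments, and requires them to answer distributional questions, such as estimating the proportions of positive and negative comments, or identifying the most and second most frequent topics discussed among viewers. To support reliable and long-term evaluation, the construction pipeline of Text2DistBench is fully automated and continuously updated to incorporate newly emerging entities over time. Experiments across multiple LLMs show that while models substantially outperform random baselines, performance varies widely across different distribution types and characteristics. These findings highlight both the capabilities and limitations of current LLMs in distributional reading comprehension and demonstrate the value of Text2DistBench as a practical and scalable testbed for future research.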

[9] Cross-Lingual Transfer and Parameter-Efficient Adaptation in the Turkic Language Family: A Theoretical Framework for Low-Resource Language Models

O. Ibrahimzade,K. Tabasaransky

Main category: cs.CL

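TL;DR: This paper proposes a theoretical framework for studying cross-lingual transfer and parameter-efficient adaptation of multilingual large language models within the Turkic language family (Azerbaijani, Kazakh, Uzbek, Turkmen, and Gagauz), and introduces the Turkic Transfer Coefficient (TTC) to quantify transfer potential between languages.

Motivation: Most multilingual LLMs are trained predominantly on high-resource languages, leaving languages such as the Turkic family, which have many speakers but scarce digital resources, underrepresented and under-evaluated; Turkic languages share strong typological and morphological similarity while differing greatly in available resources, making them a natural setting for studying multilingual adaptation.

Method: Combines ideas from multilingual representation learning and parameter-efficient fine-tuning (such as LoRA) to build a conceptual scaling model of how model capacity, adaptation data size, and adaptation-module expressivity affect performance; proposes the Turkic Transfer Coefficient (TTC), which formalizes transfer potential between languages by combining morphological similarity, lexical overlap, syntactic structure, and script compatibility.

Result: Establishes a theoretical framework for cross-lingual transfer and parameter-efficient adaptation in the Turkic family; defines the Turkic Transfer Coefficient (TTC); and highlights both how typological similarity can aid efficient transfer and the structural limits of parameter-efficient methods in extremely low-resource scenarios.

Conclusion: Typological similarity can enable efficient cross-lingual transfer in multilingual LLMs, but parameter-efficient adaptation has structural limits at very low resource levels; the framework offers theoretical guidance for adapting LLMs to low-resource Turkic languages.

Abstract: Large language models (LLMs) have transformed natural language processing, yet their capabilities remain uneven across languages. Most multilingual models are trained primarily on high-resource languages, leaving many languages with large speaker populations underrepresented in both training data and evaluation benchmarks. This imbalance is particularly visible in the Turkic language family. This paper proposes a theoretical framework for studying cross-lingual transfer and parameter-efficient adaptation of multilingual LLMs within the Turkic language family, focusing on Azerbaijani, Kazakh, Uzbek, Turkmen, and Gagauz. These languages share substantial typological and morphological similarity while differing greatly in available digital resources, making them a natural setting for analyzing multilingual adaptation strategies. We integrate insights from multilingual representation learning and parameter-efficient fine-tuning techniques such as Low-Rank Adaptation (LoRA) to develop a conceptual scaling model describing how adaptation performance depends on model capacity, adaptation data size, and the expressivity of adaptation modules. To formalize transfer potential between related languages, we introduce the Turkic Transfer Coefficient (TTC), a theoretical measure incorporating morphological similarity, lexical overlap, syntactic structure, and script compatibility across Turkic languages. The framework highlights how typological similarity can enable efficient multilingual transfer while also identifying structural limits of parameter-efficient adaptation in extremely low-resource scenarios.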

[10] SensorPersona: An LLM-Empowered System for Continual Persona Extraction from Longitudinal Mobile Sensor Streams

Bufang Yang,Lilin Xu,Yixuan Li,Kaiwei Liu,Xiaofan Jiang,Zhenyu Yan

Main category: cs.CL

TL;DR: SensorPersona is an LLM-based system that continuously infers stable user personas from multimodal longitudinal sensor data (such as phone sensors). Using context encoding, hierarchical reasoning, and dynamic updating, it outperforms existing methods on persona extraction recall, agent response quality, and user satisfaction.

Motivation: Existing persona methods for LLM agents rely on chat histories, which capture only what users choose to disclose and not their actual behavior in the physical world, so the resulting personas are incomplete.

Method: SensorPersona has three parts: (1) person-oriented sensor context encoding; (2) hierarchical persona reasoning that integrates intra- and inter-episode reasoning (covering physical patterns, psychosocial traits, and life experiences); and (3) clustering-aware incremental verification with temporal evidence-aware updating.

Result: On a self-collected dataset of 1,580 hours of sensor data from 20 participants across 17 cities on 3 continents, SensorPersona achieves up to 31.4% higher persona extraction recall, an 85.7% win rate for persona-aware agent responses, and notable improvements in user satisfaction.

Conclusion: Unobtrusively collected multimodal sensor streams allow more comprehensive, stable, and dynamic persona modeling, offering a new paradigm for personalized LLM agents.

Abstract: Personalization is essential for Large Language Model (LLM)-based agents to adapt to users' preferences and improve response quality and task performance. However, most existing approaches infer personas from chat histories, which capture only self-disclosed information rather than users' everyday behaviors in the physical world, limiting the ability to infer comprehensive user personas. In this work, we introduce SensorPersona, an LLM-empowered system that continuously infers stable user personas from multimodal longitudinal sensor streams unobtrusively collected from users' mobile devices. SensorPersona first performs person-oriented context encoding on continuous sensor streams to enrich the semantics of sensor contexts. It then employs hierarchical persona reasoning that integrates intra- and inter-episode reasoning to infer personas spanning physical patterns, psychosocial traits, and life experiences. Finally, it employs clustering-aware incremental verification and temporal evidence-aware updating to adapt to evolving personas. We evaluate SensorPersona on a self-collected dataset containing 1,580 hours of sensor data from 20 participants, collected over up to 3 months across 17 cities on 3 continents. Results show that SensorPersona achieves up to 31.4% higher recall in persona extraction, an 85.7% win rate in persona-aware agent responses, and notable improvements in user satisfaction compared to state-of-the-art baselines.

[11] Tool-MCoT: Tool Augmented Multimodal Chain-of-Thought for Content Safety Moderation

Shutong Zhang,Dylan Zhou,Yinxiao Liu,Yang Yang,Huiwen Luo,Wenfei Zou

Main category: cs.CL

TL;DR: This paper proposes Tool-MCoT, a small language model (SLM) fine-tuned on tool-augmented chain-of-thought data for content safety moderation, balancing moderation accuracy against inference efficiency.

Motivation: Large language models (LLMs) are effective for content moderation, but their high computational cost and latency make large-scale deployment difficult; a lightweight, efficient alternative with good performance is needed.

Method: Proposes Tool-MCoT, an SLM fine-tuned to work with an external tool framework and trained on tool-augmented chain-of-thought samples generated by an LLM.

Result: The fine-tuned SLM achieves significant performance gains on content moderation and learns to call external tools selectively, preserving accuracy while reducing inference cost.

Conclusion: Tool-MCoT suggests that small models can learn effective moderation through tool-augmented chain-of-thought training, offering a practical path toward low-latency, efficient content safety systems.

Abstract: The growth of online platforms and user content requires strong content moderation systems that can handle complex inputs from various media types. While large language models (LLMs) are effective, their high computational cost and latency present significant challenges for scalable deployment. To address this, we introduce Tool-MCoT, a small language model (SLM) fine-tuned for content safety moderation leveraging an external framework. By training our model on tool-augmented chain-of-thought data generated by an LLM, we demonstrate that the SLM can learn to effectively utilize these tools to improve its reasoning and decision-making. Our experiments show that the fine-tuned SLM achieves significant performance gains. Furthermore, we show that the model can learn to use these tools selectively, achieving a balance between moderation accuracy and inference efficiency by calling tools only when necessary.

[12] A Comparative Study of Demonstration Selection for Practical Large Language Models-based Next POI Prediction

Ryo Nishida,Masayuki Kawarada,Tatsuya Ishigaki,Hiroya Takamura,Masaki Onishi

Main category: cs.CL

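TL;DR: This paper studies demonstration selection strategies for next point-of-interest (POI) prediction with large language models (LLMs) and finds that simple heuristics based on geography, time, and sequence outperform more complex embedding-based selection and, without any fine-tuning, can even surpass some fine-tuned models.

Motivation: In-context learning (ICL) shows promise for POI prediction, but its effectiveness depends heavily on which demonstrations are selected; prior work lacks a systematic comparison of selection strategies, and simple heuristics in particular have not been evaluated.

Method: Evaluates a range of demonstration selection strategies in a unified setup (random, embedding similarity, task-specific methods, and heuristics such as geographical proximity, temporal ordering, and sequential patterns), comparing prediction accuracy and computational cost on three real-world POI datasets.

Result: Simple geographic, temporal, and sequential heuristics consistently outperform embedding-based and other complex methods in both prediction accuracy and computational cost; in some scenarios, ICL with heuristically selected demonstrations even outperforms existing fine-tuned models.

Conclusion: For ICL-based POI prediction with LLMs, simple, interpretable, low-cost heuristic demonstration selection is more practical and effective, challenging the common assumption that more complex is better and offering clear guidance for real-world deployment.

Abstract: This paper investigates demonstration selection strategies for predicting a user's next point-of-interest (POI) using large language models (LLMs), aiming to accurately forecast a user's subsequent location based on historical check-in data. While in-context learning (ICL) with LLMs has recently gained attention as a promising alternative to traditional supervised approaches, the effectiveness of ICL significantly depends on the selected demonstration. Although previous studies have examined methods such as random selection, embedding-based selection, and task-specific selection, there remains a lack of comprehensive comparative analysis among these strategies. To bridge this gap and clarify the best practices for real-world applications, we comprehensively evaluate existing demonstration selection methods alongside simpler heuristic approaches such as geographical proximity, temporal ordering, and sequential patterns. Extensive experiments conducted on three real-world datasets indicate that these heuristic methods consistently outperform more complex and computationally demanding embedding-based methods, both in terms of computational cost and prediction accuracy. Notably, in certain scenarios, LLMs using demonstrations selected by these simpler heuristic methods even outperform existing fine-tuned models, without requiring further training. Our source code is available at: https://github.com/ryonsd/DS-LLM4POI.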
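
One of the heuristics named in entry [12], geographical proximity, can be sketched as follows. This is an illustration of the general idea and not the code in the authors' repository: the record layout (dictionaries with `name`, `lat`, `lon`), the function names, and the sample coordinates are all hypothetical. Candidate demonstrations are ranked by great-circle (haversine) distance to the user's most recent check-in, and the k nearest are kept.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def select_nearest(query, candidates, k=2):
    """Return the k candidate demonstrations closest to the query location."""
    q_lat, q_lon = query
    return sorted(
        candidates,
        key=lambda c: haversine_km(q_lat, q_lon, c["lat"], c["lon"]),
    )[:k]

# Hypothetical candidate pool; coordinates are approximate.
pool = [
    {"name": "Osaka", "lat": 34.69, "lon": 135.50},
    {"name": "Shinjuku", "lat": 35.69, "lon": 139.70},
    {"name": "Yokohama", "lat": 35.44, "lon": 139.64},
]
last_check_in = (35.68, 139.77)  # roughly central Tokyo
demos = select_nearest(last_check_in, pool, k=2)
```

A selector like this needs no embedding model and runs in time linear in the pool size (plus a sort), which is consistent with the abstract's point about computational cost.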

[13] Extracting Breast Cancer Phenotypes from Clinical Notes: Comparing LLMs with Classical Ontology Methods

Abdullah Bin Faiz,Arbaz Khan Shehzad,Asad Afzal,Momin Tariq,Muhammad Siddiqi,Muhammad Usamah Shahid,Maryam Noor Awan,Muddassar Farooq

Main category: cs.CL

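TL;DR: This paper presents a large language model (LLM)-based framework for extracting breast cancer phenotypes from unstructured provider notes in oncology electronic medical records (EMRs), and reports accuracy comparable to classical ontology-based methods along with easier adaptation to other cancer types.

Motivation: Much key clinical information in oncology EMRs (such as chemotherapy outcomes, biomarkers, and tumor location and growth patterns) sits in unstructured provider notes, because clinicians prefer natural language to structured fields; efficient methods are needed to extract structured knowledge from these notes.

Method: Builds and applies an LLM-based information extraction framework to extract breast cancer phenotypes from provider notes, and compares it with earlier work that paired a knowledge-driven annotation system with the NCIt Ontology Annotator.

Result: The LLM framework reaches accuracy comparable to classical ontology-based methods on breast cancer phenotype extraction and, according to the authors, can be readily fine-tuned for other cancer types and diseases.

Conclusion: An LLM-based information extraction framework is a viable alternative for processing unstructured oncology clinical text, keeping accuracy comparable while improving adaptability.

Abstract: A significant amount of data held in Oncology Electronic Medical Records (EMRs) is contained in unstructured provider notes, including but not limited to the chemotherapy (or cancer treatment) outcome, different biomarkers, and the location, size, and growth pattern of a patient's tumor. Clinical studies show that the majority of oncologists are comfortable providing these valuable insights in their notes in natural language rather than in the relevant structured fields of an EMR. The major contribution of this research is to report an LLM-based framework to process provider notes and extract the valuable medical knowledge and phenotypes mentioned above, with a focus on the domain of oncology. In this paper, we focus on extracting phenotypes related to breast cancer using our LLM framework, and then compare its performance with earlier works that used a knowledge-driven annotation system paired with the NCIt Ontology Annotator. The results of the study show that an LLM-based information extraction framework can be easily adapted to extract phenotypes with an accuracy that is comparable to the classical ontology-based methods. Moreover, once trained, such frameworks could be easily fine-tuned to cater for other cancer types and diseases.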

[14] TelcoAgent-Bench: A Multilingual Benchmark for Telecom AI Agents

Lina Bariah,Brahim Mefgouda,Farbod Tavakkoli,Enrique Molero,Louis Powell,Merouane Debbah

Main category: cs.CL

TL;DR: This paper introduces TelcoAgent-Bench and TelcoAgent-Metrics, a telecom-specific benchmarking framework for multilingual large language model (LLM) agents that evaluates semantic understanding, alignment with structured troubleshooting flows, and stability across scenario variations.

Motivation: Deploying LLM agents in telecom networks raises new challenges in intent recognition, tool execution, and resolution generation, alongside multilingual support and operational constraints, so a dedicated evaluation framework is needed.

Method: Builds TelcoAgent-Bench, a multilingual (English and Arabic) telecom benchmark, and designs TelcoAgent-Metrics, a metric suite covering four dimensions: intent recognition, ordered tool execution, resolution correctness, and stability across scenario variations.

Result: Experiments show that recent instruct-tuned models understand telecom problems reasonably well but struggle to follow the required troubleshooting steps consistently and to behave stably across variations of the same scenario; the gap is more pronounced in unconstrained and bilingual settings.

Conclusion: The framework exposes key shortcomings in the operational consistency and reliability of current telecom LLM agents and provides a quantitative basis for improving their robustness in real network environments.

Abstract: The integration of large language model (LLM) agents into telecom networks introduces new challenges related to intent recognition, tool execution, and resolution generation, while taking into consideration different operational constraints. In this paper, we introduce TelcoAgent-Bench and TelcoAgent-Metrics, a telecom-specific benchmarking framework for evaluating multilingual telecom LLM agents. The proposed framework assesses semantic understanding as well as process-level alignment with structured troubleshooting flows and stability across repeated scenario variations. Our contribution includes a structured suite of metrics that assess intent recognition, ordered tool execution, resolution correctness, and stability across scenario variations, with the aim of quantifying the reliability and operational consistency of LLM agents in telecom environments. The framework is designed to operate in both English and Arabic, to address the need for multilingual agent deployment in operational network environments. Our experimental results show that although recent instruct-tuned models can understand telecom problems in a reasonable way, they usually struggle to consistently follow the required troubleshooting steps and to maintain stable behavior when exposed to different variations of the same scenario. This performance gap becomes more pronounced in unconstrained and bilingual settings.

[15] Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook

Jaehyeok Lee,Xiaoyuan Yi,Jing Yao,Hyunjin Hwang,Roy Ka-Wei Lee,Xing Xie,JinYeong Bak

Main category: cs.CL

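TL;DR: This paper proposes DOVE, a distributional evaluation framework that addresses the C³ challenge of existing cultural value alignment benchmarks. It builds a value codebook via rate-distortion variational optimization and measures alignment with unbalanced optimal transport.

Motivation: Existing benchmarks face the Construct-Composition-Context (C³) challenge: they rely on discriminative multiple-choice formats, overlook subcultural heterogeneity, and do not match real-world open-ended generation, making it hard to assess accurately how well LLMs align with cultural value orientations.

Method: Proposes the DOVE distributional evaluation framework: (1) from 10K documents, uses a rate-distortion variational optimization objective to build a compact value codebook that maps text into a structured value space and filters semantic noise; (2) uses unbalanced optimal transport to measure alignment between the value distributions of human-written and LLM-generated text, capturing sub-group diversity.

Result: Experiments on 12 LLMs show that DOVE has higher predictive validity (a 31.56% correlation with downstream tasks) and high reliability (with as few as 500 samples per culture).

Conclusion: DOVE addresses the C³ challenge and offers a more realistic, fine-grained, and scalable evaluation paradigm for LLM cultural value alignment.

Abstract: As LLMs are globally deployed, aligning their cultural value orientations is critical for safety and user engagement. However, existing benchmarks face the Construct-Composition-Context ($C^3$) challenge: they rely on discriminative, multiple-choice formats that probe value knowledge rather than true orientations, overlook subcultural heterogeneity, and are mismatched with real-world open-ended generation. We introduce DOVE, a distributional evaluation framework that directly compares human-written text distributions with LLM-generated outputs. DOVE utilizes a rate-distortion variational optimization objective to construct a compact value-codebook from 10K documents, mapping text into a structured value space to filter semantic noise. Alignment is measured using unbalanced optimal transport, capturing intra-cultural distributional structures and sub-group diversity. Experiments across 12 LLMs show that DOVE achieves superior predictive validity, attaining a 31.56% correlation with downstream tasks, while maintaining high reliability with as few as 500 samples per culture.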

[16] Illocutionary Explanation Planning for Source-Faithful Explanations in Retrieval-Augmented Language Models

Francesco Sovrano,Alberto Bacchelli

Main category: cs.CL

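TL;DR: This paper studies whether retrieval-augmented generation (RAG) can produce traceable natural language explanations that stay faithful to textbook sources in programming education. It proposes illocutionary macro-planning, a design principle based on Achinstein's illocutionary theory of explanation, and instantiates it as chain-of-illocution prompting (CoI). Experiments show that CoI yields statistically significant gains in source adherence without harming user satisfaction.

Motivation: Natural language explanations generated by LLMs are persuasive but hard to verify; XAI emphasizes faithfulness and traceability, yet baseline RAG systems still show low source adherence in programming education.

Method: In a programming education RAG setting with three textbooks as authoritative evidence, benchmarks six LLMs on 90 Stack Overflow questions; proposes illocutionary macro-planning, based on Achinstein's illocutionary theory of explanation, and instantiates it as chain-of-illocution prompting (CoI), which expands a query into implicit explanatory questions that drive retrieval.

Result: Non-RAG models have a median source adherence of 0%, and baseline RAG reaches 22-40%; CoI improves source adherence by up to 63%, though absolute adherence remains moderate and gains are weak or non-significant for some models; a user study indicates that CoI does not harm satisfaction, relevance, or perceived correctness.

Conclusion: Illocutionary macro-planning is a useful design principle and CoI a workable instantiation: it improves the source faithfulness of RAG explanations in an educational setting without reducing user acceptance.

Abstract: Natural language explanations produced by large language models (LLMs) are often persuasive, but not necessarily scrutable: users cannot easily verify whether the claims in an explanation are supported by evidence. In XAI, this motivates a focus on faithfulness and traceability, i.e., the extent to which an explanation's claims can be grounded in, and traced back to, an explicit source. We study these desiderata in retrieval-augmented generation (RAG) for programming education, where textbooks provide authoritative evidence. We benchmark six LLMs on 90 Stack Overflow questions grounded in three programming textbooks and quantify source faithfulness via source adherence metrics. We find that non-RAG models have median source adherence of 0%, while baseline RAG systems still exhibit low median adherence (22-40%, depending on the model). Motivated by Achinstein's illocutionary theory of explanation, we introduce illocutionary macro-planning as a descriptive design principle for source-faithful explanations and instantiate it with chain-of-illocution prompting (CoI), which expands a query into implicit explanatory questions that drive retrieval. Across models, CoI yields statistically significant gains (up to 63%) in source adherence, although absolute adherence remains moderate and the gains are weak or non-significant for some models. A user study with 165 retained participants (220 recruited) indicates that these gains do not harm satisfaction, relevance, or perceived correctness.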

[17] Invisible Influences: Investigating Implicit Intersectional Biases through Persona Engineering in Large Language Models

Nandini Arimanda,Achyuth Mukund,Sakthi Balan Muthiah,Rajesh Sharma

Main category: cs.CL

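TL;DR: This paper proposes BADx, a metric that quantifies how implicit bias in large language models is dynamically amplified under different social personas, combines it with LIME for explainability analysis, and validates it on five mainstream LLMs.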

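Details Motivation: Existing bias detection methods (such as CEAT and I-WEAT) rely on static embeddings and struggle to capture how bias shifts when a model adopts different social personas, a limitation that is especially clear for intersectional bias and persona-driven settings. Method: The authors propose the BADx metric, which has three components: differential bias scores (BAD) based on CEAT/I-WEAT/I-SEAT, a Persona Sensitivity Index (PSI), and volatility (standard deviation), combined with LIME-based local explainability analysis. Evaluation covers two tasks: Task 1 establishes static baselines, and Task 2 tests five SOTA LLMs under six persona frames (marginalized and advantaged groups). Result: LLMs respond to persona context very differently: GPT-4o shows high sensitivity and high volatility; DeepSeek-R1 suppresses bias but with erratic volatility; LLaMA-4 is the most stable, with low amplification; Claude 4.0 Sonnet shows balanced modulation; Gemma-3n E4B has the lowest volatility with moderate amplification. BADx reveals context-sensitive bias more effectively than static methods. Conclusion: BADx is a scalable, interpretable new metric that systematically identifies persona-triggered, dynamic, implicit intersectional bias in LLMs and offers an evaluation paradigm for bias audits that is closer to real-world use. Abstract: Large Language Models (LLMs) excel at human-like language generation but often embed and amplify implicit, intersectional biases, especially under persona-driven contexts. Existing bias audits rely on static, embedding-based tests (CEAT, I-WEAT, I-SEAT) that quantify absolute association strengths. We show that they have limitations in capturing dynamic shifts when models adopt social roles. We address this gap by introducing the Bias Amplification Differential and Explainability Score (BADx): a novel, scalable metric that measures persona-induced bias amplification and integrates local explainability insights. BADx comprises three components: differential bias scores (BAD, based on CEAT, I-WEAT, I-SEAT), Persona Sensitivity Index (PSI), and Volatility (Standard Deviation), augmented by LIME-based analysis for emphasizing explainability. The study is carried out as two tasks. Task 1 establishes static bias baselines, and Task 2 applies six persona frames (marginalized and structurally advantaged) to measure BADx, PSI, and volatility. This is studied across five state-of-the-art LLMs (GPT-4o, DeepSeek-R1, LLaMA-4, Claude 4.0 Sonnet and Gemma-3n E4B). Results show persona context significantly modulates bias. 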
GPT-4o exhibits high sensitivity and volatility; DeepSeek-R1 suppresses bias but with erratic volatility; LLaMA-4 maintains low volatility and a stable bias profile with limited amplification; Claude 4.0 Sonnet achieves balanced modulation; and Gemma-3n E4B attains the lowest volatility with moderate amplification. BADx performs better than static methods by revealing context-sensitive biases overlooked in static methods. Our unified method offers a systematic way to detect dynamic implicit intersectional bias in five popular LLMs.
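
The abstract names three BADx components but does not give their formulas. As a rough sketch only, the code below assumes that the differential score is the persona-conditioned score minus the static baseline, that the sensitivity index is the mean absolute differential, and that volatility is the population standard deviation across personas. The function name, the persona labels, and the numbers are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev

def persona_bias_summary(baseline, persona_scores):
    """Summarise persona-induced shifts relative to a static baseline score.

    All three definitions below are illustrative assumptions, not the
    paper's formulas.
    """
    # Differential per persona: persona-conditioned score minus static baseline.
    differentials = {p: s - baseline for p, s in persona_scores.items()}
    # Assumed sensitivity index: mean absolute differential across personas.
    sensitivity = mean(abs(d) for d in differentials.values())
    # Volatility: population standard deviation of the persona-conditioned scores.
    volatility = pstdev(list(persona_scores.values()))
    return differentials, sensitivity, volatility

# Hypothetical bias scores for one model under three persona frames.
scores = {"persona_a": 0.30, "persona_b": 0.10, "persona_c": 0.20}
diffs, psi, vol = persona_bias_summary(0.20, scores)
```

Under these assumptions, a model can have a small average shift but high volatility, which is the kind of contrast the abstract reports between models.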

[18] Unsupervised Neural Network for Automated Classification of Surgical Urgency Levels in Medical Transcriptions

Sadaf Tabatabaee,Sarah S. Lam

Main category: cs.CL

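TL;DR: This paper proposes a framework for automatically classifying surgical urgency, based on BioClinicalBERT and unsupervised clustering (DEC outperforms K-means). With expert validation (Modified Delphi Method) and a combined BiLSTM-BioClinicalBERT model, it achieves accurate, generalizable three-class (immediate/urgent/elective) classification, addresses the scarcity of labeled data, and supports real-time surgical prioritization.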

Details Motivation: Classifying surgical urgency is essential for optimizing patient care and allocating healthcare resources, but it is often limited by scarce labeled data; an automated method is needed that does not require large-scale annotation, is clinically trustworthy, and scales. Method: Surgical texts are first converted into semantic embeddings with BioClinicalBERT and then clustered without supervision using both K-means and Deep Embedding Clustering (DEC). After experts validate and calibrate the clusters through the Modified Delphi Method, a supervised classification network combining BiLSTM layers with BioClinicalBERT embeddings is built and evaluated with cross-validation and accuracy, precision, recall, and F1-score. Result: DEC clusters better than K-means; the final classifier performs robustly on all metrics, generalizes well, and keeps high accuracy and clinical usefulness on unseen data. Conclusion: This hybrid unsupervised and semi-supervised framework reduces dependence on annotation and provides an interpretable, deployable solution for real-time surgical prioritization, improving healthcare system efficiency and patient outcomes. Abstract: Efficient classification of surgical procedures by urgency is paramount to optimize patient care and resource allocation within healthcare systems. This study introduces an unsupervised neural network approach to automatically categorize surgical transcriptions into three urgency levels: immediate, urgent, and elective. Leveraging BioClinicalBERT, a domain-specific language model, surgical transcripts are transformed into high-dimensional embeddings that capture their semantic nuances. These embeddings are subsequently clustered using both K-means and Deep Embedding Clustering (DEC) algorithms; DEC demonstrates superior performance in forming cohesive and well-separated clusters. To ensure clinical relevance and accuracy, the clustering results undergo validation through the Modified Delphi Method, which involves expert review and refinement. Following validation, a neural network that integrates Bidirectional Long Short-Term Memory (BiLSTM) layers with BioClinicalBERT embeddings is developed for classification tasks. The model is rigorously evaluated using cross-validation and metrics such as accuracy, precision, recall, and F1-score, on which it achieves robust performance and demonstrates strong generalization to unseen data. This unsupervised framework not only addresses the challenge of limited labeled data but also provides a scalable and reliable solution for real-time surgical prioritization, which ultimately enhances operational efficiency and patient outcomes in dynamic medical environments.

[19] Blending Human and LLM Expertise to Detect Hallucinations and Omissions in Mental Health Chatbot Responses

Khizar Hussain,Bradley A. Malin,Zhijun Yin,Susannah Leigh Rose,Murat Kantarcioglu

Main category: cs.CL

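TL;DR: This paper proposes a framework that combines human expertise with large language models (LLMs) to detect hallucinations and omissions produced by LLM chatbots in mental health counseling, significantly outperforming existing LLM-as-a-judge methods.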

Details Motivation: Existing LLM-as-a-judge methods perform poorly in high-risk healthcare settings such as mental health services (only 52% accuracy, with near-zero recall for some methods) because they fail to capture the subtle linguistic and therapeutic patterns that domain experts recognize. Method: The authors build a framework that integrates human expert knowledge with LLMs to extract interpretable, domain-driven features along five dimensions: logical consistency, entity verification, factual accuracy, linguistic uncertainty, and professional appropriateness. Traditional machine learning models (e.g., SVM, RF) are then trained on these features as detectors. Result: On a newly built human-annotated dataset and a public dataset, hallucination detection reaches F1 scores of 0.717 and 0.849 respectively, and omission detection reaches 0.59-0.64 F1; the approach significantly outperforms pure LLM judging and is more interpretable and reliable. Conclusion: In high-risk mental health applications, feature engineering informed by domain experts combined with traditional ML is more reliable, transparent, and effective than black-box LLM judging. Abstract: As LLM-powered chatbots are increasingly deployed in mental health services, detecting hallucinations and omissions has become critical for user safety. However, state-of-the-art LLM-as-a-judge methods often fail in high-risk healthcare contexts, where subtle errors can have serious consequences. We show that leading LLM judges achieve only 52% accuracy on mental health counseling data, with some hallucination detection approaches exhibiting near-zero recall. We identify the root cause as LLMs' inability to capture nuanced linguistic and therapeutic patterns recognized by domain experts. To address this, we propose a framework that integrates human expertise with LLMs to extract interpretable, domain-informed features across five analytical dimensions: logical consistency, entity verification, factual accuracy, linguistic uncertainty, and professional appropriateness. Experiments on a public mental health dataset and a new human-annotated dataset show that traditional machine learning models trained on these features achieve 0.717 F1 on our custom dataset and 0.849 F1 on a public benchmark for hallucination detection, with 0.59-0.64 F1 for omission detection across both datasets. Our results demonstrate that combining domain expertise with automated methods yields more reliable and transparent evaluation than black-box LLM judging in high-stakes mental health applications.

[20] STDec: Spatio-Temporal Stability Guided Decoding for dLLMs

Yuzhe Chen,Jiale Cao,Xuyang Liu,Jin Xie,Aiping Yang,Yanwei Pang

Main category: cs.CL

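TL;DR: This paper proposes STDec, a spatio-temporal stability guided decoding method that improves the decoding efficiency and stability of diffusion large language models (dLLMs). It is training-free and compatible with cache-based acceleration, and it substantially raises throughput on multiple benchmarks (e.g., up to 14.17x speedup on MBPP) while preserving task performance.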

Details Motivation: Most existing dLLM decoders use a global confidence threshold and do not explicitly model local context or the temporal consistency of predicted token IDs across denoising steps, which limits efficiency and stability. Method: STDec combines spatial-aware decoding (a token-adaptive threshold generated dynamically by aggregating the decoded states of neighboring tokens) with temporal-aware decoding (a relaxed threshold for tokens whose predicted IDs stay consistent across steps). It requires no training and is compatible with cache-based acceleration. Result: On textual reasoning and multimodal understanding benchmarks, STDec substantially improves throughput while keeping comparable task performance; on MBPP with LLaDA it reaches up to 14.17x speedup with a comparable score. Conclusion: By exploiting the spatio-temporal stability inherent in dLLM decoding, STDec offers a simple, efficient, plug-and-play decoding optimization and a new direction for deploying diffusion language models in practice. Abstract: Diffusion Large Language Models (dLLMs) have progressed rapidly and are viewed as a promising alternative to the autoregressive paradigm. However, most dLLM decoders still adopt a global confidence threshold, and do not explicitly model local context from neighboring decoded states or temporal consistency of predicted token IDs across steps. To address this issue, we propose a simple spatio-temporal stability guided decoding approach, named STDec. We observe strong spatio-temporal stability in dLLM decoding: newly decoded tokens tend to lie near decoded neighbors, and their predicted IDs often remain consistent across several denoising steps. Inspired by this stability, our STDec includes spatial-aware decoding and temporal-aware decoding. The spatial-aware decoding dynamically generates the token-adaptive threshold by aggregating the decoded states of nearby tokens. The temporal-aware decoding relaxes the decoding thresholds for tokens whose predicted token IDs remain consistent over denoising steps. Our STDec is training-free and remains compatible with cache-based acceleration methods. Across textual reasoning and multimodal understanding benchmarks, STDec substantially improves throughput while maintaining comparable task performance. Notably, on MBPP with LLaDA, STDec achieves up to 14.17x speedup with a comparable score. Homepage: https://yzchen02.github.io/STDec.
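
The two threshold adjustments described for STDec can be illustrated with a toy example. This is only a sketch under stated assumptions: the linear relief terms, the window size, and every parameter name (`base`, `spatial_relief`, `temporal_relief`, `stable_steps`) are hypothetical choices made for illustration, and the abstract does not specify how STDec actually aggregates neighbor states.

```python
def adaptive_thresholds(decoded, pred_history, base=0.9, window=2,
                        spatial_relief=0.2, temporal_relief=0.1, stable_steps=3):
    """Return one confidence threshold per position (illustrative only)."""
    n = len(decoded)
    thresholds = []
    for i in range(n):
        # Spatial part: lower the threshold in proportion to the fraction
        # of already-decoded neighbours inside the window.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neighbours = [decoded[j] for j in range(lo, hi) if j != i]
        frac = sum(neighbours) / len(neighbours) if neighbours else 0.0
        t = base - spatial_relief * frac
        # Temporal part: relax further if the predicted token ID has been
        # identical over the last `stable_steps` denoising steps.
        recent = pred_history[i][-stable_steps:]
        if len(recent) == stable_steps and len(set(recent)) == 1:
            t -= temporal_relief
        thresholds.append(t)
    return thresholds

def decode_step(confidences, decoded, pred_history, **kwargs):
    """Indices of still-masked positions whose confidence clears their threshold."""
    th = adaptive_thresholds(decoded, pred_history, **kwargs)
    return [i for i, (c, t) in enumerate(zip(confidences, th))
            if not decoded[i] and c >= t]

# Hypothetical state: two decoded tokens followed by three masked ones.
decoded = [True, True, False, False, False]
pred_history = [[5, 5, 5], [7, 7, 7], [3, 3, 3], [1, 2, 1], [4, 9, 4]]
confidences = [0.99, 0.99, 0.75, 0.75, 0.75]
newly_decoded = decode_step(confidences, decoded, pred_history)
```

In this toy state only position 2 is released: it sits next to decoded tokens and its prediction has been stable, so its threshold drops below 0.75, while positions 3 and 4 stay masked under a global-like threshold.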

[21] Severity-Aware Weighted Loss for Arabic Medical Text Generation

Ahmed Alansary,Molham Mohamed,Ali Hamdi

Main category: cs.CL

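TL;DR: This paper proposes a clinical-severity-aware weighted loss for fine-tuning Arabic large language models on medical question answering. Without changing the model architecture, it uses soft severity probabilities to dynamically scale token-level losses and significantly improves several Arabic LLMs on the MAQA dataset.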

Details Motivation: Traditional fine-tuning objectives treat all medical cases alike and ignore differences in clinical severity, even though errors in severe cases carry higher clinical risk; targeted optimization is needed. Method: The authors propose a severity-aware weighted loss based on soft severity probabilities that dynamically scales token-level loss contributions; severity labels and probabilities are obtained automatically from a fine-tuned AraBERT classifier and are used only in the loss computation. Result: On the MAQA dataset, AraGPT2-Base, AraGPT2-Medium, and Qwen2.5-0.5B improve from 54.04%, 59.16%, and 57.83% to 66.14%, 67.18%, and 66.86% respectively, peaking at 67.18%, a gain of up to 12.10% over non-fine-tuned baselines. Conclusion: Severity-aware fine-tuning is architecture-agnostic and robust, consistently improves a range of Arabic LLMs on medical text generation, and offers a safer and more reliable training paradigm for high-risk medical AI applications. Abstract: Large language models have shown strong potential for Arabic medical text generation; however, traditional fine-tuning objectives treat all medical cases uniformly, ignoring differences in clinical severity. This limitation is particularly critical in healthcare settings, where errors in severe cases carry higher clinical risk. In this work, we propose a severity-aware weighted loss for fine-tuning Arabic language models on medical complaint-response data. The method relies on soft severity probabilities to dynamically scale token-level loss contributions during optimization, thereby prioritizing clinically critical interactions without modifying model architectures. Experiments are conducted using the MAQA dataset, which provides Arabic medical complaints and trusted human responses. Severity labels and probabilistic scores are automatically derived using a fine-tuned AraBERT-based classifier and incorporated exclusively at the loss level. The proposed approach is evaluated across ten Arabic large language models of varying architectures and parameter scales. While standard cross-entropy fine-tuning yields only modest improvements, severity-aware optimization consistently achieves larger gains. Using a balanced weighting configuration, performance improves from 54.04% to 66.14% for AraGPT2-Base, from 59.16% to 67.18% for AraGPT2-Medium, and from 57.83% to 66.86% for Qwen2.5-0.5B, with peak performance reaching 67.18%. 
Overall, severity-aware fine-tuning delivers improvements of up to 12.10% over non-fine-tuned baselines, demonstrating robust and architecture-consistent gains.
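
To make the idea of scaling token-level loss by soft severity probabilities concrete, here is a minimal sketch. It assumes three severity classes and a per-example weight equal to the expected value of hypothetical class weights; the abstract specifies neither the number of classes nor the weighting formula, so `class_weights` and the function name are illustrative assumptions.

```python
import math

def severity_weighted_loss(token_log_probs, severity_probs,
                           class_weights=(1.0, 1.5, 2.0)):
    """Mean token-level negative log-likelihood scaled by expected severity weight.

    token_log_probs: log-probability the model assigns to each reference token.
    severity_probs:  soft probabilities over severity classes for this example.
    class_weights:   hypothetical weight per severity class (an assumption).
    """
    # Expected weight under the soft severity distribution.
    weight = sum(p * w for p, w in zip(severity_probs, class_weights))
    # Standard token-level cross-entropy terms.
    token_losses = [-lp for lp in token_log_probs]
    return weight * sum(token_losses) / len(token_losses)

# Same model predictions, different severity estimates.
log_probs = [math.log(0.5), math.log(0.5)]
mild_loss = severity_weighted_loss(log_probs, (1.0, 0.0, 0.0))
critical_loss = severity_weighted_loss(log_probs, (0.0, 0.0, 1.0))
```

Because the weight only multiplies the loss, the model architecture is untouched, which matches the abstract's statement that severity is incorporated exclusively at the loss level.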

[22] In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads

Charlotte Pouw,Hosein Mohebbi,Afra Alishahi,Willem Zuidema

Main category: cs.CL

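TL;DR: This paper studies in-context learning (ICL) in the speech domain, focusing on the TTS task. Speaking rate strongly affects ICL performance and is mimicked by the model, whereas pitch range and intensity have little effect; induction heads are shown to play a key causal role in speech ICL.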

Details Motivation: In-Context Learning (ICL) has been widely studied in text-only language models but remains underexplored in the speech domain; this paper aims to fill that gap by examining how linguistic and acoustic features affect ICL in speech language models. Method: Using the Text-to-Speech (TTS) task as an entry point, ICL is analyzed in terms of task inference accuracy (generating the correct spoken content) and the degree of acoustic mimicry (speaking rate, pitch, intensity); ablation experiments on induction heads test their causal role. Result: Speaking rate strongly affects ICL performance and is consistently mimicked; pitch range and intensity have little effect on performance and are mimicked inconsistently; ablating the top-k induction heads completely removes the model's ICL ability. Conclusion: Speech ICL is driven by specific acoustic features (speaking rate in particular) and depends on the induction-head mechanism; its behavior closely mirrors text ICL, suggesting that ICL shares properties across modalities. Abstract: In-Context Learning (ICL) has been extensively studied in text-only Language Models, but remains largely unexplored in the speech domain. Here, we investigate how linguistic and acoustic features affect ICL in Speech Language Models. We focus on the Text-to-Speech (TTS) task, which allows us to analyze ICL from two angles: (1) how accurately the model infers the task from the demonstrations (i.e., generating the correct spoken content), and (2) to what extent the model mimics the acoustic characteristics of the demonstration speech in its output. We find that speaking rate strongly affects ICL performance and is also mimicked in the output, whereas pitch range and intensity have little impact on performance and are not consistently reproduced. Finally, we investigate the role of induction heads in speech-based ICL and show that these heads play a causal role: ablating the top-k induction heads completely removes the model's ICL ability, mirroring findings from text-based ICL.

[23] A Severity-Based Curriculum Learning Strategy for Arabic Medical Text Generation

Ahmed Alansary,Molham Mohamed,Ali Hamdi

Main category: cs.CL

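TL;DR: This paper proposes a severity-based curriculum learning strategy for Arabic medical text generation that trains the model in stages ordered by symptom severity (mild, moderate, critical), significantly improving generation performance.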

Details Motivation: Existing Arabic medical text generation methods ignore differences in the clinical severity of samples, which makes it hard for models to handle high-risk or complex cases accurately. Method: The authors design a severity-based curriculum learning strategy that splits the dataset into three stages by symptom severity (mild, then moderate, then critical) and gradually introduces more severe samples during fine-tuning; a rule-based method developed in the study annotates a MAQA subset with the three severity levels. Result: On the MAQA subset, the method improves on baseline models by about +4% to +7% and on conventional fine-tuning by about +3% to +6%, with stable gains. Conclusion: Severity-aware curriculum learning effectively improves how Arabic medical text generation models handle complex and high-risk cases and offers a new direction for low-resource medical NLP. Abstract: Arabic medical text generation is increasingly needed to help users interpret symptoms and access general health guidance in their native language. Nevertheless, many existing methods assume uniform importance across training samples, overlooking differences in clinical severity. This simplification can hinder the model's ability to properly capture complex or high-risk cases. To overcome this issue, this work introduces a Severity-based Curriculum Learning Strategy for Arabic Medical Text Generation, where the training process is structured to move gradually from less severe to more critical medical conditions. The approach divides the dataset into ordered stages based on severity and incrementally exposes the model to more challenging cases during fine-tuning, allowing it to first learn basic medical patterns before addressing more complex scenarios. The proposed method is evaluated on a subset of the Medical Arabic Question Answering (MAQA) dataset, which includes Arabic medical questions describing symptoms alongside corresponding responses. In addition, the dataset is annotated with three severity levels (Mild, Moderate, and Critical) using a rule-based method developed in this study. The results demonstrate that incorporating severity-aware curriculum learning leads to consistent performance improvements across all tested models, with gains of around +4% to +7% over baseline models and +3% to +6% compared with conventional fine-tuning approaches.
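
The staged exposure described above can be sketched as a simple data-scheduling step. The abstract does not say whether earlier stages are replayed when harder ones are introduced, so the `cumulative` flag, the `severity` field name, and the sample records below are all assumptions made for illustration.

```python
SEVERITY_ORDER = ["Mild", "Moderate", "Critical"]

def curriculum_stages(samples, cumulative=True):
    """Split samples into ordered fine-tuning stages by severity label.

    With cumulative=True each stage also replays the easier samples seen so
    far; with cumulative=False each stage contains one severity level only.
    Which variant the paper uses is not stated in the abstract.
    """
    stages, seen = [], []
    for level in SEVERITY_ORDER:
        current = [s for s in samples if s["severity"] == level]
        if cumulative:
            seen = seen + current
            stages.append(list(seen))
        else:
            stages.append(current)
    return stages

# Hypothetical annotated records (ids only; text omitted).
data = [
    {"id": 1, "severity": "Critical"},
    {"id": 2, "severity": "Mild"},
    {"id": 3, "severity": "Moderate"},
]
stages = curriculum_stages(data)
```

Each element of `stages` would then be passed to one fine-tuning phase in order, so the model sees mild cases first and critical cases last.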

[24] The Illusion of Superposition? A Principled Analysis of Latent Thinking in Language Models

Michael Rizvi-Martel,Guillaume Rabusseau,Marius Mosbach

Main category: cs.CL

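TL;DR: This paper examines whether latent chain-of-thought reasoning (Latent CoT) actually uses superposition. Only models trained from scratch show signs of superposition; training-free and fine-tuned models favor shortcut solutions, with superposition either collapsing or going unused.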

Details Motivation: Latent chain-of-thought reasoning in continuous space theoretically supports superposition (holding several candidate solutions in a single representation), but it is unclear whether language models actually exploit this when reasoning. Method: Experiments cover three regimes: training-free (latent thoughts built as convex combinations of token embeddings), fine-tuned (a base model adapted to produce latent thoughts), and from-scratch (a model trained entirely with latent thoughts); internal representations are analyzed with Logit Lens and entity-level probing. Result: Only models trained from scratch show evidence of using superposition; in the training-free and fine-tuned regimes superposition collapses or goes unused and models adopt shortcut solutions instead. The causes include a pretraining bias toward committing to a token in the last layers and a strong effect of model capacity on which solutions are favored. Conclusion: Superposition does not arise automatically; it depends on the training regime and model capacity, and only models trained from scratch with sufficient capacity are likely to use superposition effectively for latent chain-of-thought reasoning. Abstract: Latent reasoning via continuous chain-of-thoughts (Latent CoT) has emerged as a promising alternative to discrete CoT reasoning. Operating in continuous space increases expressivity and has been hypothesized to enable superposition: the ability to maintain multiple candidate solutions simultaneously within a single representation. Despite theoretical arguments, it remains unclear whether language models actually leverage superposition when reasoning using latent CoTs. We investigate this question across three regimes: a training-free regime that constructs latent thoughts as convex combinations of token embeddings, a fine-tuned regime where a base model is adapted to produce latent thoughts, and a from-scratch regime where a model is trained entirely with latent thoughts to solve a given task. Using Logit Lens and entity-level probing to analyze internal representations, we find that only models trained from scratch exhibit signs of using superposition. In the training-free and fine-tuned regimes, we find that the superposition either collapses or is not used at all, with models discovering shortcut solutions instead. We argue that this is due to two complementary phenomena: (i) pretraining on natural language data biases models to commit to a token in the last layers, and (ii) capacity has a huge effect on which solutions a model favors. Together, our results offer a unified explanation for when and why superposition arises in continuous chain-of-thought reasoning, and identify the conditions under which it collapses.
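
A convex combination of token embeddings, as used in the training-free regime, can be written in a few lines. The sketch below assumes the mixing weights come from a softmax over next-token logits; the abstract only states that the combination is convex, so the source of the weights, the function names, and the tiny two-dimensional embeddings are illustrative assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax; the outputs are non-negative and sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def latent_thought(logits, embeddings):
    """Convex combination of token embeddings weighted by softmax(logits)."""
    weights = softmax(logits)
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings))
            for d in range(dim)]

# Hypothetical vocabulary of two tokens with 2-dimensional embeddings.
emb = [[1.0, 0.0], [0.0, 1.0]]
mixed = latent_thought([0.0, 0.0], emb)      # equal weight on both tokens
peaked = latent_thought([10.0, -10.0], emb)  # almost all weight on token 0
```

The `mixed` vector keeps both candidates alive in one representation, while `peaked` is close to a single token embedding, which is the kind of collapse the paper reports for training-free and fine-tuned models.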

[25] Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning

Navan Preet Singh,Xiaokun Wang,Anurag Garikipati,Madalina Ciobanu,Qingqing Mao,Ritankar Das

Main category: cs.CL

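TL;DR: This paper proposes a multi-stage optimization strategy combining reinforcement learning (RL) and supervised fine-tuning (SFT) to improve the pedagogical knowledge of large language models. The resulting EduQwen models set new SOTA results on several education benchmarks and surpass larger closed-source models.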

Details Motivation: To raise the domain expertise of open-source large language models in education so that they match or surpass large general-purpose closed-source models while staying transparent, customizable, and cost-effective. Method: A three-stage optimization strategy: (1) RL optimization with progressive difficulty training, a focus on hard examples, and extended reasoning rollouts; (2) SFT on high-quality training data synthesized by the RL-trained model with difficulty-weighted sampling; (3) an optional second round of RL optimization. The models are built on a dense Qwen3-32B backbone. Result: The EduQwen 32B models set new SOTA results on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark and the interactive Pedagogy Benchmark Leaderboard, clearly surpassing larger proprietary systems such as Gemini-3 Pro. Conclusion: Domain-specialized optimization for education can turn mid-sized open-source LLMs into genuine domain experts, combining performance, controllability, and deployability. Abstract: We present an innovative multi-stage optimization strategy combining reinforcement learning (RL) and supervised fine-tuning (SFT) to enhance the pedagogical knowledge of large language models (LLMs), as illustrated by EduQwen 32B-RL1, EduQwen 32B-SFT, and an optional third-stage model EduQwen 32B-SFT-RL2: (1) RL optimization that implements progressive difficulty training, focuses on challenging examples, and employs extended reasoning rollouts; (2) a subsequent SFT phase that leverages the RL-trained model to synthesize high-quality training data with difficulty-weighted sampling; and (3) an optional second round of RL optimization. EduQwen 32B-RL1, EduQwen 32B-SFT, and EduQwen 32B-SFT-RL2 are an application-driven family of open-source pedagogical LLMs built on a dense Qwen3-32B backbone. These models achieve high enough accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark to establish new state-of-the-art (SOTA) results across the interactive Pedagogy Benchmark Leaderboard and surpass significantly larger proprietary systems such as the previous benchmark leader Gemini-3 Pro. These dense 32-billion-parameter models demonstrate that domain-specialized optimization can transform mid-sized open-source LLMs into true pedagogical domain experts that outperform much larger general-purpose systems, while preserving the transparency, customizability, and cost-efficiency required for responsible educational AI deployment.

[26] ART: Attention Replacement Technique to Improve Factuality in LLMs

Ziqin Luo,Yihao Quan,Xiaofeng Zhang,Xiaosong Yuan,Chen Shen

Main category: cs.CL

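TL;DR: This paper finds that uniform attention in the shallow layers of large language models contributes to hallucination and proposes the training-free Attention Replacement Technique (ART), which replaces it with local attention and significantly reduces hallucinations.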

Details Motivation: Although many methods for mitigating LLM hallucination exist, the relationship between attention patterns and hallucination has not been fully explored. Method: The authors analyze the distribution of attention scores across layers and attention heads, find uniform attention in shallow layers, and propose the training-free Attention Replacement Technique (ART), which replaces shallow-layer uniform attention with local attention. Result: ART significantly reduces hallucinations across multiple LLM architectures without fine-tuning or extra training data, showing strong effectiveness and generalizability. Conclusion: Uniform attention in shallow layers is an important contributor to hallucination, and simply replacing it with local attention mitigates the problem without any training overhead. Abstract: Hallucination in large language models (LLMs) continues to be a significant issue, particularly in tasks like question answering, where models often generate plausible yet incorrect or irrelevant information. Although various methods have been proposed to mitigate hallucinations, the relationship between attention patterns and hallucinations has not been fully explored. In this paper, we analyze the distribution of attention scores across each layer and attention head of LLMs, revealing a common and intriguing phenomenon: shallow layers of LLMs primarily rely on uniform attention patterns, where the model distributes its attention evenly across the entire sequence. This uniform attention pattern can lead to hallucinations, as the model fails to focus on the most relevant information. To mitigate this issue, we propose a training-free method called Attention Replacement Technique (ART), which replaces these uniform attention patterns in the shallow layers with local attention patterns. This change directs the model to focus more on the relevant contexts, thus reducing hallucinations. Through extensive experiments, ART demonstrates significant reductions in hallucinations across multiple LLM architectures, proving its effectiveness and generalizability without requiring fine-tuning or additional training data.
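
As a toy illustration of swapping a uniform attention row for a local one, the sketch below flags a row as "uniform" when its entropy is close to the maximum and then replaces it with equal weights over the most recent tokens. The entropy test, the `uniformity` cutoff, the window size, and the equal local weights are all assumptions for illustration; the abstract does not describe how ART detects uniform patterns or how it builds the local pattern.

```python
import math

def row_entropy(row):
    """Shannon entropy (in nats) of one attention row."""
    return -sum(p * math.log(p) for p in row if p > 0)

def replace_uniform_with_local(attn_row, query_pos, window=2, uniformity=0.9):
    """Replace a near-uniform causal attention row with a local-window row.

    attn_row:  attention weights of one query over all positions.
    query_pos: index of the query token (it may attend to 0..query_pos).
    """
    n = query_pos + 1
    visible = attn_row[:n]
    # A row counts as uniform if its entropy is near the maximum log(n).
    if n <= 1 or row_entropy(visible) < uniformity * math.log(n):
        return list(attn_row)  # already focused: leave unchanged
    # Local pattern: equal weight on the last `window` visible tokens.
    start = max(0, n - window)
    local = [0.0] * len(attn_row)
    for j in range(start, n):
        local[j] = 1.0 / (n - start)
    return local

uniform_row = [0.25, 0.25, 0.25, 0.25]
focused_row = [0.97, 0.01, 0.01, 0.01]
replaced = replace_uniform_with_local(uniform_row, query_pos=3)
kept = replace_uniform_with_local(focused_row, query_pos=3)
```

In a real model this substitution would be applied only to selected shallow-layer heads at inference time, which is what makes the approach training-free.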

[27] FMI@SU ToxHabits: Evaluating LLMs Performance on Toxic Habit Extraction in Spanish Clinical Texts

Sylvia Vassileva,Ivan Koychev,Svetla Boytcheva

Main category: cs.CL

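TL;DR: This paper presents a method for recognizing toxic-habit named entities in Spanish clinical texts. In subtask 1 of the ToxHabits Shared Task, few-shot prompting with GPT-4.1 achieved an F1 score of 0.65.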

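Details Motivation: To address the need to recognize and classify mentions of substance use and abuse in Spanish clinical texts, the team took part in subtask 1 of the ToxHabits Shared Task. Method: The authors explored several ways of applying large language models (LLMs), including zero-shot prompting, few-shot prompting, and prompt optimization, and ultimately chose few-shot prompting with GPT-4.1. Result: The method reaches an F1 score of 0.65 on the test set, indicating promise for named entity recognition in languages other than English. Conclusion: Few-shot prompting with GPT-4.1 performed best for recognizing toxic-habit named entities in Spanish, supporting the feasibility of LLMs for processing medical text in low-resource languages. Abstract: The paper presents an approach for the recognition of toxic habits named entities in Spanish clinical texts. The approach was developed for the ToxHabits Shared Task. Our team participated in subtask 1, which aims to detect substance use and abuse mentions in clinical case reports and classify them into four categories (Tobacco, Alcohol, Cannabis, and Drug). We explored various methods of utilizing LLMs for the task, including zero-shot, few-shot, and prompt optimization, and found that GPT-4.1's few-shot prompting performed the best in our experiments. Our method achieved an F1 score of 0.65 on the test set, demonstrating a promising result for recognizing named entities in languages other than English.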

[28] Attention Flows: Tracing LLM Conceptual Engagement via Story Summaries

Rebecca M. M. Hicke,Sil Hamilton,David Mimno,Ross Deans Kristensen-McLachlan

Main category: cs.CL

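TL;DR: By comparing how humans and large language models (LLMs) summarize novels, this paper assesses the weaknesses of LLMs in long-form narrative understanding. LLMs tend to over-attend to the end of a text, and their patterns of conceptual engagement differ markedly from those of humans. The study builds a chapter-to-summary alignment dataset for 150 novels and releases it to support future research.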

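Details Motivation: Although LLM context lengths keep growing, the ability to integrate information across long texts has not kept pace. Novel summarization reflects what an author judges to be narratively important, which makes it a suitable test of whether models have human-like conceptual understanding of texts. Method: The authors build a dataset of 150 novels with human-written summaries and manually align summary sentences with the chapters they refer to. Nine SOTA LLMs generate summaries of the same novels, which are aligned in the same way. Human and model summaries are compared in style and in the distribution of narrative focus (such as chapter position preference), and human narrative engagement is further related to model attention mechanisms. Result: (1) The chapter-to-summary alignment task is itself difficult, confirming the complexity of novel summarization. (2) LLM summaries differ from human ones in both style and narrative focus, with a marked bias toward the end of the text. (3) Model attention mechanisms are misaligned with human narrative engagement, which helps explain degraded narrative comprehension. Conclusion: Current LLMs still have fundamental limitations in long-form narrative understanding, and their summaries do not reproduce the human conceptual grasp of narrative structure; chapter-level attention modeling and alignment with narrative importance should be key directions for improvement. The alignment dataset has been released. Abstract: Although LLM context lengths have grown, there is evidence that their ability to integrate information across long-form texts has not kept pace. We evaluate one such understanding task: generating summaries of novels. When human authors of summaries compress a story, they reveal what they consider narratively important. Therefore, by comparing human and LLM-authored summaries, we can assess whether models mirror human patterns of conceptual engagement with texts. To measure conceptual engagement, we align sentences from 150 human-written novel summaries with the specific chapters they reference. We demonstrate the difficulty of this alignment task, which indicates the complexity of summarization as a task. We then generate and align additional summaries by nine state-of-the-art LLMs for each of the 150 reference texts. Comparing the human and model-authored summaries, we find both stylistic differences between the texts and differences in how humans and LLMs distribute their focus throughout a narrative, with models emphasizing the ends of texts. Comparing human narrative engagement with model attention mechanisms suggests explanations for degraded narrative comprehension and targets for future development. We release our dataset to support future research.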

[29] State-of-the-Art Arabic Language Modeling with Sparse MoE Fine-Tuning and Chain-of-Thought Distillation

Navan Preet Singh,Anurag Garikipati,Ahmed Abulkhair,Jyani Akshay Jagdishbhai,Atul Yaduvanshi,Amarendra Chaudhary,Madalina Ciobanu,Qingqing Mao,Ritankar Das

Main category: cs.CL

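TL;DR: This paper presents Arabic-DeepSeek-R1, an open-source Arabic-specific large language model that uses a sparse MoE architecture and a four-phase chain-of-thought distillation strategy. It achieves SOTA across all seven benchmarks of the Open Arabic LLM Leaderboard (OALL) and, for the first time, surpasses GPT-5.1 on multiple tasks, indicating that the Arabic performance gap stems mainly from a lack of targeted optimization rather than architectural limits.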

Details Motivation: To address digital equity for under-represented languages such as Arabic and close their performance gap in the current LLM ecosystem. Method: Building on a sparse MoE backbone, the authors design a four-phase CoT distillation scheme that incorporates Arabic linguistic verification and regional ethical norms, and construct a contamination-controlled 80/20 Arabic-English bilingual training set of 372M tokens. Result: The model achieves the highest average score across all seven OALL benchmarks (including MadinahQA, AraTrust, AlGhafa, and ALRAGE) and clearly surpasses GPT-5.1 and the previous OALL leader on several metrics. Conclusion: A culturally adapted sparse MoE model with an efficient distillation strategy can reach SOTA performance for a low-resource language without industrial-scale pretraining, providing a reproducible framework for sovereign, domain-specific language technology. Abstract: This paper introduces Arabic-DeepSeek-R1, an application-driven open-source Arabic LLM that leverages a sparse MoE backbone to address the digital equity gap for under-represented languages, and establishes a new SOTA across the entire Open Arabic LLM Leaderboard (OALL). Our four-phase CoT distillation scheme integrates Arabic-specific linguistic verification and regional ethical norms into a 372M-token, contamination-controlled 80/20 Arabic-English training mixture. Arabic-DeepSeek-R1 achieves the highest average score across the seven-benchmark OALL suite while establishing SOTA or near-SOTA results, including dominant results on grammar-focused MadinahQA (surpassing both GPT-5.1 and the OALL leader by substantial margins), safety-oriented AraTrust, multi-ability AlGhafa, and retrieval-augmented ALRAGE. Our results indicate that the combination of sparse MoE architecture, culturally-informed CoT distillation with explicit Arabic linguistic checks, and strategic bilingual data curation enables an open-source adapted model to systematically outperform the proprietary frontier system GPT-5.1 on the majority of benchmarks evaluating comprehensive language-specific tasks: the first such demonstration for Arabic LLMs. These findings indicate that much of Arabic's performance deficit in current LLM ecosystems stems from under-specialization rather than architectural limitations, and that parameter-efficient adaptation of open reasoning models can yield breakthrough SOTA performance without industrial-scale pretraining costs. 
Arabic-DeepSeek-R1 establishes a validated and replicable framework for sovereign and domain-specific language technologies, demonstrating that strategic, culturally-grounded adaptation of sparse MoE backbones offers a viable and cost-effective pathway to achieving record-breaking performance across standardized benchmarks for low-resource languages.

[30] When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don't

Jonathan Nemitz,Carsten Eickhoff,Junyi Jessy Li,Kyle Mahowald,Michal Golovanevsky,William Rudman

Main category: cs.CL

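TL;DR: This paper introduces the Graded Color Attribution (GCA) dataset to test whether vision-language models (VLMs) follow the decision rules they themselves state through introspection in a color attribution task. VLMs frequently violate their own rules while humans largely stay faithful to theirs, revealing that VLM introspective self-knowledge is miscalibrated.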

Details Motivation: Understanding when VLMs behave unexpectedly, whether they can reliably predict their own behavior, and whether they follow their own introspective reasoning is central to trustworthy deployment. Method: The authors build GCA, a controlled benchmark with three types of line drawings (world-knowledge recolorings, counterfactual recolorings, and shapes with no color priors), ask VLMs and human participants to set a color-labeling threshold, and check whether their later decisions follow that threshold rule. Result: VLMs such as GPT-5-mini violate their own introspective rules in about 60% of cases on objects with strong color priors. Humans tend to overestimate color coverage but remain faithful to their stated rules overall. VLMs estimate color coverage accurately yet blatantly contradict their own reasoning in their final outputs. World-knowledge priors systematically reduce the rule faithfulness of VLMs in a way that does not mirror human cognition. Conclusion: VLM reasoning failures are not driven by task difficulty alone but stem from systematically miscalibrated introspective self-knowledge, which poses a serious challenge for deployment in high-stakes settings. Abstract: Understanding when Vision-Language Models (VLMs) will behave unexpectedly, whether models can reliably predict their own behavior, and if models adhere to their introspective reasoning are central challenges for trustworthy deployment. To study this, we introduce the Graded Color Attribution (GCA) dataset, a controlled benchmark designed to elicit decision rules and evaluate participant faithfulness to these rules. GCA consists of line drawings that vary pixel-level color coverage across three conditions: world-knowledge recolorings, counterfactual recolorings, and shapes with no color priors. Using GCA, both VLMs and human participants establish a threshold: the minimum percentage of pixels of a given color an object must have to receive that color label. We then compare these rules with their subsequent color attribution decisions. Our findings reveal that models systematically violate their own introspective rules. For example, GPT-5-mini violates its stated introspection rules in nearly 60% of cases on objects with strong color priors. Human participants remain faithful to their stated rules, with any apparent violations being explained by a well-documented tendency to overestimate color coverage. In contrast, we find that VLMs are excellent estimators of color coverage, yet blatantly contradict their own reasoning in their final responses. Across all models and strategies for eliciting introspective rules, world-knowledge priors systematically degrade faithfulness in ways that do not mirror human cognition. 
Our findings challenge the view that VLM reasoning failures are difficulty-driven and suggest that VLM introspective self-knowledge is miscalibrated, with direct implications for high-stakes deployment.
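
The faithfulness check described above compares a stated threshold with later labeling decisions. A minimal sketch follows, assuming each trial records the true pixel coverage and whether the participant applied the color label; the trial format, the function name, and the numbers are hypothetical and are not drawn from the GCA dataset.

```python
def violation_rate(threshold_pct, trials):
    """Fraction of decisions that contradict a stated coverage threshold.

    threshold_pct: minimum percentage of colored pixels the participant said
                   is required before applying the color label.
    trials:        list of (coverage_pct, applied_label) pairs.
    """
    violations = sum(
        1 for coverage, applied in trials
        # The rule predicts a label exactly when coverage meets the threshold.
        if (coverage >= threshold_pct) != applied
    )
    return violations / len(trials)

# Hypothetical participant who stated a 50% threshold.
trials = [(10, True), (60, True), (40, False), (30, True)]
rate = violation_rate(50, trials)
```

Here two of the four decisions label an object below the stated threshold, which is the pattern the abstract reports for VLMs on objects with strong color priors.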

[31] Team Fusion@ SU@ BC8 SympTEMIST track: transformer-based approach for symptom recognition and linking

Georgi Grazhdanski,Sylvia Vassileva,Ivan Koychev,Svetla Boytcheva

Main category: cs.CL

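TL;DR: This paper presents a Transformer-based approach to the SympTEMIST named entity recognition (NER) and entity linking (EL) tasks. NER fine-tunes a RoBERTa model with BiLSTM and CRF layers, and EL generates candidate entities with SapBERT and matches them against a knowledge base by cosine similarity.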

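Details Motivation: To solve the named entity recognition and entity linking problems of the SympTEMIST task and improve the recognition and normalization of symptoms and related entities in medical text. Method: For NER, a RoBERTa-based model with added BiLSTM and CRF layers is fine-tuned on an augmented training set; for EL, the cross-lingual SapBERT XLMR-Large generates candidate entities, and cosine similarity against knowledge base entries is computed. Result: Experiments show that the choice of knowledge base has the largest effect on model accuracy. Conclusion: Knowledge base quality is the key factor in entity linking performance, and combining a language model with a sequence labeling structure effectively improves NER. Abstract: This paper presents a transformer-based approach to solving the SympTEMIST named entity recognition (NER) and entity linking (EL) tasks. For NER, we fine-tune a RoBERTa-based token-level classifier with BiLSTM and CRF layers on an augmented train set. Entity linking is performed by generating candidates using the cross-lingual SapBERT XLMR-Large, and calculating cosine similarity against a knowledge base. The choice of knowledge base proves to have the highest impact on model accuracy.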
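
The cosine-similarity matching step of entity linking can be sketched independently of the embedding model. The vectors below are small made-up stand-ins for SapBERT embeddings, and the knowledge base codes are hypothetical placeholders; only the similarity ranking itself reflects what the abstract describes.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def link_entity(mention_vec, knowledge_base):
    """Return the knowledge base code whose embedding is closest to the mention."""
    return max(knowledge_base,
               key=lambda code: cosine(mention_vec, knowledge_base[code]))

# Hypothetical 3-dimensional embeddings keyed by placeholder concept codes.
kb = {
    "CODE_FEVER": [0.9, 0.1, 0.0],
    "CODE_COUGH": [0.1, 0.9, 0.1],
}
best = link_entity([0.8, 0.2, 0.0], kb)
```

Since the ranking depends entirely on which entries the knowledge base contains, a sketch like this also makes it easy to see why the choice of knowledge base can dominate accuracy.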

[32] Learning to Interrupt in Language-based Multi-agent Communication

Danqing Wang,Da Yin,Ruta Desai,Lei Li,Asli Celikyilmaz,Ansong Ni

Main category: cs.CL

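TL;DR: This paper proposes HANDRAISER, an interruptible multi-agent communication framework in which the listener interrupts the speaker at an appropriate moment to cut redundant information and computation. Interruption points are learned from estimates of future reward and cost. Across several multi-agent tasks the method reduces communication cost by 32.2% while maintaining or improving task performance, and it generalizes across tasks and agents.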

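Details Motivation: Multi-agent communication based on large language models suffers from verbose outputs, context overload, and high computational cost. Existing compression methods work only from the speaker side and struggle to adapt to different listeners and identify key information. Inspired by how human listeners ask questions or interrupt, the authors let the listener set the pace of communication. Method: The interruptible communication framework HANDRAISER allows the listener to interrupt during a conversation. Because LLMs tend to be overconfident and interrupt too early, the authors propose a learning method that predicts the best interruption point from estimated future reward and cost; the interruption policy is trained by combining prompt engineering with reinforcement learning. Result: In 2-agent text Pictionary, 3-agent meeting scheduling, and 3-agent debate, HANDRAISER reduces communication cost by 32.2% relative to the baseline with comparable or better task performance; the learned interruption behavior transfers to different agents and new tasks. Conclusion: Listener-driven interruptible communication is an efficient, generalizable paradigm for multi-agent collaboration; HANDRAISER shows that human-inspired interaction mechanisms such as timely interruption can substantially improve the communication efficiency and practicality of LLM agent systems. Abstract: Multi-agent systems using large language models (LLMs) have demonstrated impressive capabilities across various domains. However, current agent communication suffers from verbose outputs that overload context and increase computational costs. Although existing approaches focus on compressing the message from the speaker side, they struggle to adapt to different listeners and identify relevant information. An effective strategy in human communication is to allow the listener to interrupt and express their opinion or ask for clarification. Motivated by this, we propose an interruptible communication framework that allows the agent who is listening to interrupt the current speaker. Through prompting experiments, we find that current LLMs are often overconfident and interrupt before receiving enough information. Therefore, we propose a learning method that predicts the appropriate interruption points based on the estimated future reward and cost. We evaluate our framework across various multi-agent scenarios, including 2-agent text Pictionary games, 3-agent meeting scheduling, and 3-agent debate. The experimental results show that our HANDRAISER can reduce the communication cost by 32.2% compared to the baseline with comparable or superior task performance. This learned interruption behavior can also be generalized to different agents and tasks.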
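
The abstract says interruption points are predicted from estimated future reward and cost but gives no decision rule. One simple reading, shown below purely as an assumption, is to interrupt at the first step where the estimated reward of continuing to listen no longer exceeds its estimated cost. The function name and the per-step estimates are hypothetical; in the paper these estimates would come from a learned predictor rather than fixed lists.

```python
def first_interrupt_point(est_future_reward, est_future_cost):
    """Index of the first step where continuing to listen is not worthwhile.

    est_future_reward[t]: estimated task reward gained by listening past step t.
    est_future_cost[t]:   estimated communication cost of listening past step t.
    Returns None if listening stays worthwhile for the whole message.
    """
    for t, (reward, cost) in enumerate(zip(est_future_reward, est_future_cost)):
        if reward - cost <= 0:
            return t
    return None

# Hypothetical estimates: most of the useful information arrives early.
rewards = [0.8, 0.5, 0.1, 0.05]
costs = [0.2, 0.2, 0.2, 0.2]
stop_at = first_interrupt_point(rewards, costs)
```

A rule of this shape also shows the failure mode noted in the abstract: if an overconfident model underestimates future reward, it interrupts before it has received enough information.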

[33] Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection

Afroza Nowshin,Prithweeraj Acharjee Porag,Haziq Jeelani,Fayeq Jeelani Syed

Main category: cs.CL

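TL;DR: This paper proposes a steerable, context-aware machine translation framework for Arabic dialects. A Rule-Based Data Augmentation (RBDA) pipeline builds a parallel corpus covering eight dialects, and an mT5 model is fine-tuned with lightweight metadata tags to give explicit control over the target dialect and register. Experiments show markedly better dialectal fidelity than mainstream models such as NLLB but lower BLEU scores, exposing the limits of standard metrics for dialect translation.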

Details Motivation: Existing Arabic machine translation systems handle dialectal diversity poorly, often homogenizing dialectal input into Modern Standard Arabic (MSA), and give users little control over the target dialect. Method: A Rule-Based Data Augmentation (RBDA) pipeline expands a 3,000-sentence seed corpus into a balanced 57,000-sentence parallel corpus covering eight regional dialects; an mT5-base model is then fine-tuned conditioned on lightweight metadata tags, enabling controllable generation across dialects and social registers. Result: Compared with the high-resource baseline NLLB (BLEU 13.75, biased toward MSA), the proposed model scores lower on BLEU (8.19) but is rated far higher on dialectal alignment and cultural authenticity (LLM-assisted score 4.80/5 vs. 1.0/5). Conclusion: Standard MT metrics such as BLEU do not reflect the true quality of dialect translation, so evaluation practices better suited to linguistic diversity are needed; controllable, explicit modeling of dialectal variation is an effective route to higher fidelity in dialectal Arabic MT. Abstract: Current Machine Translation (MT) systems for Arabic often struggle to account for dialectal diversity, frequently homogenizing dialectal inputs into Modern Standard Arabic (MSA) and offering limited user control over the target vernacular. In this work, we propose a context-aware and steerable framework for dialectal Arabic MT that explicitly models regional and sociolinguistic variation. Our primary technical contribution is a Rule-Based Data Augmentation (RBDA) pipeline that expands a 3,000-sentence seed corpus into a balanced 57,000-sentence parallel dataset, covering eight regional varieties (e.g., Egyptian, Levantine, and Gulf). By fine-tuning an mT5-base model conditioned on lightweight metadata tags, our approach enables controllable generation across dialects and social registers in the translation output. Through a combination of automatic evaluation and qualitative analysis, we observe an apparent accuracy-fidelity trade-off: high-resource baselines such as NLLB (No Language Left Behind) achieve higher aggregate BLEU scores (13.75) by defaulting toward the MSA mean, while exhibiting limited dialectal specificity. In contrast, our model achieves lower BLEU scores (8.19) but produces outputs that align more closely with the intended regional varieties. Supporting qualitative evaluation, including an LLM-assisted cultural authenticity analysis, suggests improved dialectal alignment compared to baseline systems (4.80/5 vs. 1.0/5). These findings highlight the limitations of standard MT metrics for dialect-sensitive tasks and motivate the need for evaluation practices that better reflect linguistic diversity in Arabic MT.

[34] Multi-objective Evolutionary Merging Enables Efficient Reasoning Models

Mario Iacobelli,Adrian Robert Minut,Tommaso Mencattini,Donato Crisostomi,Andrea Santilli,Iacopo Masi,Emanuele Rodolà

Main category: cs.CL

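TL;DR: This paper proposes Evo-L2S, a framework that casts Long-to-Short (L2S) reasoning, the compression of long reasoning chains into short ones, as a multi-objective optimization problem. Using evolutionary model merging and entropy-based subset sampling, it shortens reasoning traces by more than 50% while preserving or even improving mathematical reasoning accuracy.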

Details Motivation: Existing training-free model merging methods for Long-to-Short (L2S) reasoning rely on scalarized arithmetic with fixed hyperparameters, which is brittle and forces suboptimal compromises; a more flexible method that can trade off accuracy against output length is needed. Method: Evo-L2S formulates L2S as a multi-objective optimization problem, merges models with an evolutionary algorithm, and uses entropy-based subset sampling to reduce the cost of fitness evaluation for large models. Result: Across 1.5B, 7B, and 14B models and six mathematical reasoning benchmarks, reasoning traces shrink by more than 50% while accuracy is preserved or improved. Conclusion: Through multi-objective evolutionary merging, Evo-L2S reaches a Pareto-optimal balance between accuracy and reasoning length, offering a new paradigm for efficient reasoning. Abstract: Reasoning models have demonstrated remarkable capabilities in solving complex problems by leveraging long chains of thought. However, this more deliberate reasoning comes with substantial computational overhead at inference time. The Long-to-Short (L2S) reasoning problem seeks to maintain high accuracy using fewer tokens, but current training-free model merging approaches rely on scalarized, fixed-hyperparameter arithmetic methods that are highly brittle and force suboptimal compromises. To address this gap, we introduce Evo-L2S, a novel framework that formulates L2S reasoning as a multi-objective optimization challenge. By leveraging evolutionary model merging, Evo-L2S explicitly optimizes the trade-off between accuracy and output length to produce a robust Pareto front of merged models. To make this search computationally tractable for large language models, we propose an entropy-based subset sampling technique that drastically reduces the overhead of fitness estimation. Comprehensive experiments across 1.5B, 7B, and 14B parameter scales on six mathematical reasoning benchmarks demonstrate that Evo-L2S can reduce the length of generated reasoning traces by over 50% while preserving, or even improving, the problem-solving accuracy of the original reasoning models.
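
The notion of a Pareto front over accuracy and output length can be made concrete with a few lines of Python. This is a generic dominance filter, not the evolutionary search used by Evo-L2S; the candidate names and scores are made up for illustration.

```python
from typing import List, Tuple

# (name, accuracy, mean output tokens); all values below are hypothetical.
Candidate = Tuple[str, float, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no other candidate beats on both objectives.

    Higher accuracy is better; fewer output tokens is better.
    """
    front = []
    for name, acc, length in candidates:
        dominated = any(
            o_acc >= acc and o_len <= length and (o_acc > acc or o_len < length)
            for _, o_acc, o_len in candidates
        )
        if not dominated:
            front.append((name, acc, length))
    return sorted(front, key=lambda c: c[2])  # shortest outputs first


merged_models = [
    ("merge_a", 0.80, 4000.0),
    ("merge_b", 0.79, 1800.0),
    ("merge_c", 0.75, 2500.0),  # dominated by merge_b: less accurate and longer
    ("merge_d", 0.70, 900.0),
]
front = pareto_front(merged_models)
```

A scalarized objective would return a single point for one fixed weighting, whereas the front keeps every non-dominated trade-off so the accuracy/length balance can be chosen afterwards.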

[35] DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling

Shicheng Liu,Yucheng Jiang,Sajid Farook,Camila Nicollier Sanchez,David Fernando Castro Pena,Monica S. Lam

Main category: cs.CL

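TL;DR: This paper presents DataSTORM, an LLM-based agentic system that autonomously conducts deep research over large-scale structured databases and internet sources. It discovers insights through hypothesis-driven analysis, quantitative reasoning, and narrative construction, and clearly outperforms existing methods on both InsightBench and an ACLED-based dataset.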

Details Motivation: Existing LLM agent research mainly targets unstructured web data, while deep research over large-scale structured databases still lacks effective methods; such research needs hypothesis generation, quantitative reasoning, and analytical narrative. Method: Grounded in Exploratory Data Analysis and Data Storytelling, DataSTORM reframes research over structured data as a thesis-driven analytical process: it discovers candidate theses from data, validates them through iterative cross-source investigation, and develops them into coherent analytical narratives; the system combines structured querying, multi-source validation, and narrative generation. Result: On InsightBench, DataSTORM achieves a 19.4% relative improvement in insight-level recall and 7.2% in summary-level score; on a new dataset built on ACLED, a real-world complex database, it outperforms proprietary systems such as ChatGPT Deep Research on both automated metrics and human evaluations. Conclusion: DataSTORM shows that a thesis-driven paradigm with cross-source validation and narrative construction can support deep research by LLM agents over structured data, opening a new path for data-intensive AI research. Abstract: Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis. However, existing approaches primarily focus on unstructured web data, while the challenges of conducting deep research over large-scale structured databases remain relatively underexplored. Unlike web-based research, effective data-centric research requires more than retrieval and summarization and demands iterative hypothesis generation, quantitative reasoning over structured schemas, and convergence toward a coherent analytical narrative. In this paper, we present DataSTORM, an LLM-based agentic system capable of autonomously conducting research across both large-scale structured databases and internet sources. Grounded in principles from Exploratory Data Analysis and Data Storytelling, DataSTORM reframes deep research over structured data as a thesis-driven analytical process: discovering candidate theses from data, validating them through iterative cross-source investigation, and developing them into coherent analytical narratives. We evaluate DataSTORM on InsightBench, where it achieves a new state-of-the-art result with a 19.4% relative improvement in insight-level recall and 7.2% in summary-level score. We further introduce a new dataset built on ACLED, a real-world complex database, and demonstrate that DataSTORM outperforms proprietary systems such as ChatGPT Deep Research across both automated metrics and human evaluations.

[36] ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs

Zhipin Wang,Christoph Leiter,Christian Frey,Mohamed Hesham Ibrahim Abdalla,Josif Grabocka,Steffen Eger

Main category: cs.CL

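TL;DR: This paper introduces ValueGround, a benchmark for evaluating how well multimodal large language models (MLLMs) make culture-conditioned value judgments in visual scenes. Accuracy drops markedly when response options are presented as images rather than text, indicating that cross-modal transfer of cultural value judgments remains limited.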

Details Motivation: Existing evaluations of cultural values are almost entirely text-based and cannot test whether models make culture-conditioned judgments when options are visualized, so a benchmark for visual value grounding is needed. Method: ValueGround is built from World Values Survey (WVS) questions and uses minimally contrastive image pairs to represent opposing options while controlling irrelevant variation; given only a country, a question, and an image pair, the model must choose the image that matches the country's value tendency, without the original option texts. Result: Across six MLLMs and 13 countries, average accuracy is 72.8% in the text-only setting and falls to 65.8% in the visualized setting; option-image alignment accuracy reaches 92.8%, yet all models remain prone to prediction reversals. Conclusion: Current MLLMs show clear limits in cross-modal transfer of cultural values, and ValueGround provides a controlled testbed for studying the problem systematically. Abstract: Cultural values are expressed not only through language but also through visual scenes and everyday social practices. Yet existing evaluations of cultural values in language models are almost entirely text-only, making it unclear whether models can ground culture-conditioned judgments when response options are visualized. We introduce ValueGround, a benchmark for evaluating culture-conditioned visual value grounding in multimodal large language models (MLLMs). Built from World Values Survey (WVS) questions, ValueGround uses minimally contrastive image pairs to represent opposing response options while controlling irrelevant variation. Given a country, a question, and an image pair, a model must choose the image that best matches the country's value tendency without access to the original response-option texts. Across six MLLMs and 13 countries, average accuracy drops from 72.8% in the text-only setting to 65.8% when options are visualized, despite 92.8% accuracy on option-image alignment. Stronger models are more robust, but all remain prone to prediction reversals. Our benchmark provides a controlled testbed for studying cross-modal transfer of culture-conditioned value judgments.

[37] Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR

Thibault Bañeras-Roux,Sergio Burdisso,Esaú Villatoro-Tello,Dairazalia Sánchez-Cortés,Shiran Liu,Severin Baroudi,Shashi Kumar,Hasindri Watawana,Manjunath K E,Kadri Hacioglu,Petr Motlicek,Andreas Stolcke

Main category: cs.CL

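TL;DR: This paper shows that in LLM-based ASR systems, mixed batching (MB) of text data with a small amount of speech (for example 10% of the target-domain speech, under 4 hours) substantially narrows the modality gap between the speech encoder and the large language model, matching or exceeding conventional fine-tuning on the full paired dataset for domain adaptation.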

Details Motivation: LLM-based ASR systems can adapt to a domain with text-only data, but the noisy representations produced by the speech side do not match the distribution the LLM was trained on, creating a modality gap; the paper asks whether a small amount of speech is enough to mitigate it. Method: Three adaptation strategies are compared: text-only adaptation, fully paired speech-text adaptation, and mixed batching (MB), in which training batches contain both speech-text pairs and text-only samples; performance is evaluated in in-domain and out-of-domain settings. Result: With only 10% of the target-domain speech (under 4 hours), MB reaches word error rates (WER) comparable to or better than conventional ASR fine-tuning on the full paired data; small amounts of speech provide a strong modality-alignment signal. Conclusion: A small amount of speech effectively bridges the modality gap in LLM-based ASR, and mixed batching is an efficient, low-resource paradigm for domain adaptation. Abstract: Conventional end-to-end automatic speech recognition (ASR) systems rely on paired speech-text data for domain adaptation. Recent LLM-based ASR architectures connect a speech encoder to a large language model via a projection module, enabling adaptation with text-only data. However, this introduces a modality gap, as the LLM is not exposed to the noisy representations produced by the speech projector. We investigate whether small amounts of speech can mitigate this mismatch. We compare three strategies: text-only adaptation, paired speech-text adaptation, and mixed batching (MB), which combines both. Experiments in in-domain and out-of-domain settings show that even limited speech consistently improves performance. Notably, MB using only 10% of the target-domain speech (less than 4 hours) achieves word error rates comparable to, or better than, conventional ASR fine-tuning with the full dataset, indicating that small amounts of speech provide a strong modality-alignment signal.
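
A rough sketch of how mixed batches might be assembled follows. The abstract only says that MB combines paired and text-only data; the per-batch speech fraction, the resampling of the scarce paired set, and all names here are assumptions made for illustration.

```python
import random
from typing import Dict, List


def mixed_batches(
    paired: List[str],       # identifiers of speech-text pairs (scarce)
    text_only: List[str],    # identifiers of text-only samples (plentiful)
    batch_size: int,
    speech_fraction: float,  # hypothetical knob, not taken from the paper
    seed: int = 0,
) -> List[List[Dict[str, str]]]:
    """Build batches that each mix speech-text pairs with text-only samples."""
    n_speech = max(1, round(batch_size * speech_fraction))
    n_text = batch_size - n_speech
    if n_text < 1:
        raise ValueError("speech_fraction leaves no room for text-only samples")
    rng = random.Random(seed)
    pool = list(text_only)
    rng.shuffle(pool)
    batches = []
    for start in range(0, len(pool) - n_text + 1, n_text):
        # The paired set is small, so it is resampled with replacement.
        speech_part = [rng.choice(paired) for _ in range(n_speech)]
        batch = [{"kind": "speech", "item": s} for s in speech_part]
        batch += [{"kind": "text", "item": t} for t in pool[start:start + n_text]]
        rng.shuffle(batch)
        batches.append(batch)
    return batches


pairs = [f"utt{i}" for i in range(4)]
texts = [f"sent{i}" for i in range(12)]
batches = mixed_batches(pairs, texts, batch_size=4, speech_fraction=0.25)
```

Each of the resulting four batches holds one speech-text pair and three text-only sentences, so the model sees projected speech representations in every step even though paired data is scarce.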

[38] MedConclusion: A Benchmark for Biomedical Conclusion Generation from Structured Abstracts

Weiyue Li,Ruizhi Qian,Yi Li,Yongce Li,Yunfan Long,Jiahui Cai,Yan Luo,Mengyu Wang

Main category: cs.CL

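TL;DR: This paper introduces MedConclusion, a large-scale dataset of 5.7 million PubMed structured abstracts for biomedical conclusion generation, designed to evaluate whether large language models can infer scientific conclusions from structured biomedical evidence.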

Details Motivation: Resources for testing whether large language models can infer scientific conclusions from structured biomedical evidence remain limited, so a dedicated dataset is needed. Method: MedConclusion pairs the non-conclusion sections of each abstract with the original author-written conclusion and adds journal-level metadata (such as biomedical category and SJR) for subgroup analysis; diverse LLMs are evaluated under conclusion and summary prompting settings and scored with both reference-based metrics and LLM-as-a-judge. Result: Conclusion writing behaves differently from summary writing; current automatic metrics barely separate strong models; and judge identity substantially shifts absolute scores. Conclusion: MedConclusion is a reusable data resource for studying evidence-to-conclusion reasoning and can help advance large language models on biomedical reasoning tasks. Abstract: Large language models (LLMs) are widely explored for reasoning-intensive research tasks, yet resources for testing whether they can infer scientific conclusions from structured biomedical evidence remain limited. We introduce MedConclusion, a large-scale dataset of 5.7M PubMed structured abstracts for biomedical conclusion generation. Each instance pairs the non-conclusion sections of an abstract with the original author-written conclusion, providing naturally occurring supervision for evidence-to-conclusion reasoning. MedConclusion also includes journal-level metadata such as biomedical category and SJR, enabling subgroup analysis across biomedical domains. As an initial study, we evaluate diverse LLMs under conclusion and summary prompting settings and score outputs with both reference-based metrics and LLM-as-a-judge. We find that conclusion writing is behaviorally distinct from summary writing, strong models remain closely clustered under current automatic metrics, and judge identity can substantially shift absolute scores. MedConclusion provides a reusable data resource for studying scientific evidence-to-conclusion reasoning. Our code and data are available at: https://github.com/Harvard-AI-and-Robotics-Lab/MedConclusion.

[39] Fine-tuning Whisper for Pashto ASR: strategies and scale

Hanif Rahman

Main category: cs.CL

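TL;DR: This paper studies how to improve Whisper for Pashto speech recognition. Off-the-shelf Whisper performs extremely poorly on Pashto (WER above 100%). Comparing several fine-tuning strategies on CommonVoice Pashto, the authors find that vanilla full fine-tuning works best; after scaling to other model sizes and a larger dataset, whisper-small offers the best cost-performance trade-off at 113 hours of data, and online data augmentation brings a further clear gain.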

Details Motivation: Pashto is one of the largest language collections in CommonVoice yet is absent from Whisper's pre-training corpus, so off-the-shelf Whisper fails completely on Pashto (it outputs the wrong script and WER exceeds 100%); an effective fine-tuning recipe is needed. Method: On CommonVoice Pashto v20 and v24, four fine-tuning strategies are compared: vanilla full fine-tuning, LoRA (rank 64), frozen-encoder (2 of 6 layers), and multistage Urdu-to-Pashto transfer; vanilla fine-tuning is extended to whisper-small and whisper-large-v3-turbo; online data augmentation is added; and an error analysis is carried out. Result: On CV20, vanilla fine-tuning brings whisper-base to 21.22% WER, well ahead of LoRA (by 33.36 pp), frozen-encoder (by 14.76 pp), and Urdu transfer (by 44.56 pp); freezing encoder layers fails because layer-function separation does not hold at this depth and trainable capacity is reduced; Urdu transfer fails because of an unreliable checkpoint, phonological mismatch, and insufficient training; on CV24 (113 h), whisper-small reaches 24.89% WER and whisper-large-v3-turbo 23.37%, with diminishing returns; online augmentation yields a 7.25 pp gain; the dominant errors are word-final gender suffix confusion (-ay vs. -a) and retroflex substitutions involving /t͡s/. Conclusion: Vanilla full fine-tuning is the most effective way to adapt Whisper to Pashto; whisper-small balances performance and efficiency at 113 hours and is the practical optimum; online augmentation clearly improves robustness; the error analysis exposes language-specific challenges that can guide future modeling; all models and code are released. Abstract: Pashto is absent from Whisper's pre-training corpus despite being one of CommonVoice's largest language collections, leaving off-the-shelf models unusable: all Whisper sizes output Arabic, Dari, or Urdu script on Pashto audio, achieving word error rates above 100%. We compare four fine-tuning strategies for whisper-base on CommonVoice Pashto v20: vanilla full fine-tuning, LoRA (rank 64), frozen-encoder (2/6 layers), and multistage Urdu-to-Pashto transfer. We extend vanilla fine-tuning to whisper-small and whisper-large-v3-turbo on CommonVoice Pashto v24 (113 hours). Vanilla fine-tuning achieves WER 21.22% on CV20, outperforming LoRA by 33.36 pp, frozen-encoder by 14.76 pp, and Urdu transfer by 44.56 pp. Frozen-encoder fine-tuning degrades performance on whisper-base (6 encoder layers): layer-function separation does not hold at this depth, and freezing removes a third of trainable capacity. Urdu-to-Pashto transfer fails due to an unverified intermediate checkpoint, phonological mismatch, and insufficient training. On CV24, whisper-small achieves WER 24.89% (2.24 pp over whisper-base at 3.3x parameters); whisper-large-v3-turbo achieves 23.37% (a further 1.52 pp). Diminishing returns indicate whisper-small is the practical optimum at 113 hours. Online augmentation provides 7.25 pp WER benefit over matched training. Error analysis identifies word-final suffix confusion (masculine -ay vs. feminine -a) and retroflex substitutions involving the Pashto-unique consonant /ts/ as dominant failure modes. Fine-tuned checkpoints and evaluation scripts are released on HuggingFace.
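
A word error rate above 100% can look like a mistake, but it follows from the standard definition WER = (substitutions + deletions + insertions) / reference length: insertions are counted in the numerator but not in the denominator. The self-contained sketch below computes WER with a word-level edit distance; the placeholder tokens are illustrative and are not Pashto data.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# Every word is wrong and the hypothesis is longer than the reference:
# 2 substitutions + 1 insertion over 2 reference words.
score = wer("a b", "x y z")
```

Here `score` is 1.5, i.e. 150% WER, which is the regime reported for off-the-shelf Whisper on Pashto audio when it emits text in the wrong script.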

[40] Does a Global Perspective Help Prune Sparse MoEs Elegantly?

Zeliang Zhang,Nikhil Ghosh,Jiani Liu,Bin Yu,Xiaodong Liu

Main category: cs.CL

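TL;DR: This paper proposes GRAPE, a global redundancy-aware expert pruning strategy for sparse Mixture-of-Experts (MoE) language models that substantially reduces memory overhead while preserving performance.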

Details Motivation: MoE models improve efficiency, but their large number of expert parameters still consumes substantial memory; conventional pruning allocates budgets uniformly across layers and overlooks the heterogeneous cross-layer redundancy of MoEs. Method: GRAPE dynamically allocates each layer's pruning budget based on a global redundancy analysis, coordinating pruning across layers. Result: Evaluated on Mixtral-8x7B, Mixtral-8x22B, DeepSeek-MoE, Qwen-MoE, and GPT-OSS, GRAPE improves average accuracy over the strongest local baseline by 1.40% on average under the same pruning budget, with gains of up to 2.45%. Conclusion: By allocating budgets dynamically with global redundancy awareness, GRAPE eases the memory pressure of sparse MoE models and outperforms uniform or local pruning strategies. Abstract: Empirical scaling laws for language models have encouraged the development of ever-larger LLMs, despite their growing computational and memory costs. Sparse Mixture-of-Experts (MoEs) offer a promising alternative by activating only a subset of experts per forward pass, improving efficiency without sacrificing performance. However, the large number of expert parameters still leads to substantial memory consumption. Existing pruning methods typically allocate budgets uniformly across layers, overlooking the heterogeneous redundancy that arises in sparse MoEs. We propose GRAPE (Global Redundancy-Aware Pruning of Experts), a global pruning strategy that dynamically allocates pruning budgets based on cross-layer redundancy. Experiments on Mixtral-8x7B, Mixtral-8x22B, DeepSeek-MoE, Qwen-MoE, and GPT-OSS show that, under the same pruning budget, GRAPE consistently achieves the best average performance. On the three main models reported in the paper, it improves average accuracy over the strongest local baseline by 1.40% on average across pruning settings, with gains of up to 2.45%.
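
The difference between a uniform and a global pruning budget can be sketched as follows. This is not GRAPE's actual procedure: the per-expert redundancy scores, the greedy ranking, and the `min_keep` safeguard are hypothetical stand-ins chosen only to show how a global view shifts budget toward more redundant layers.

```python
from typing import Dict, List


def uniform_allocation(scores: Dict[int, List[float]], total: int) -> Dict[int, int]:
    """Prune (almost) the same number of experts in every layer."""
    layers = sorted(scores)
    per_layer, remainder = divmod(total, len(layers))
    return {layer: per_layer + (1 if i < remainder else 0)
            for i, layer in enumerate(layers)}


def global_allocation(scores: Dict[int, List[float]], total: int,
                      min_keep: int = 1) -> Dict[int, int]:
    """Rank all experts across layers by redundancy and prune the top ones."""
    ranked = sorted(
        ((s, layer) for layer, layer_scores in scores.items() for s in layer_scores),
        reverse=True,
    )
    budget = {layer: 0 for layer in scores}
    for _, layer in ranked:
        if sum(budget.values()) == total:
            break
        if len(scores[layer]) - budget[layer] > min_keep:  # never empty a layer
            budget[layer] += 1
    return budget


# Hypothetical redundancy scores (higher = more redundant), four experts per layer.
redundancy = {
    0: [0.9, 0.8, 0.7, 0.1],
    1: [0.3, 0.2, 0.2, 0.1],
    2: [0.6, 0.5, 0.1, 0.1],
}
uniform = uniform_allocation(redundancy, total=6)
global_ = global_allocation(redundancy, total=6)
```

With these toy scores the uniform scheme removes two experts per layer, while the global scheme removes three from layer 0, one from layer 1, and two from layer 2 under the same total budget.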

[41] The Illusion of Stochasticity in LLMs

Xiangming Gu,Soham De,Michalis Titsias,Larisa Markeeva,Petar Veličković,Razvan Pascanu

Main category: cs.CL

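TL;DR: This paper shows that large language models (LLMs) acting as agents have a fundamental flaw in reliable stochastic sampling: they can map a provided random seed to a target distribution, but they cannot sample correctly from a target distribution directly.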

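Details Motivation: LLM agents frequently need to sample from distributions inferred from data, but there is no reliable mapping between their internal probability estimates and their actual outputs, whereas standard RL agents rely on external sampling mechanisms; this difference is a key failure point. Method: A rigorous empirical analysis across multiple model families, model sizes, prompting styles, and distributions systematically evaluates the stochastic sampling ability of LLMs. Result: Current frontier LLMs can turn a provided random seed into a target distribution, but they are badly miscalibrated when asked to sample directly from a specified distribution without an external seed. Conclusion: Reliable stochastic sampling is a foundational problem for LLM agents; existing models are systematically deficient at it and need new mechanisms or architectural support. Abstract: In this work, we demonstrate that reliable stochastic sampling is a fundamental yet unfulfilled requirement for Large Language Models (LLMs) operating as agents. Agentic systems are frequently required to sample from distributions, often inferred from observed data, a process which needs to be emulated by the LLM. This leads to a distinct failure point: while standard RL agents rely on external sampling mechanisms, LLMs fail to map their internal probability estimates to their stochastic outputs. Through rigorous empirical analysis across multiple model families, model sizes, prompting styles, and distributions, we demonstrate the extent of this failure. Crucially, we show that while powerful frontier models can convert provided random seeds to target distributions, their ability to sample directly from specific distributions is fundamentally flawed.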
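
For reference, this is what an external sampling mechanism does when it converts a provided random number into a draw from a target distribution: standard inverse-transform sampling over a categorical distribution. The outcome labels and probabilities below are illustrative assumptions; the paper's prompts and distributions are not reproduced here.

```python
import random
from bisect import bisect_right
from itertools import accumulate
from typing import List, Sequence


def sample_from_uniform(u: float, outcomes: Sequence[str],
                        probs: Sequence[float]) -> str:
    """Map a uniform draw u in [0, 1) to an outcome with the given probabilities."""
    if not 0.0 <= u < 1.0:
        raise ValueError("u must lie in [0, 1)")
    cumulative: List[float] = list(accumulate(probs))
    index = bisect_right(cumulative, u)
    return outcomes[min(index, len(outcomes) - 1)]  # guard against rounding


outcomes = ["A", "B", "C"]
probs = [0.5, 0.3, 0.2]

# The "seed" is supplied from outside, as with a standard RL agent's sampler.
rng = random.Random(0)
draws = [sample_from_uniform(rng.random(), outcomes, probs) for _ in range(10_000)]
freq_a = draws.count("A") / len(draws)
```

The empirical frequency of "A" should land close to 0.5. The paper's finding is that frontier models can emulate this seed-to-outcome mapping when the random number is supplied, but fail to produce calibrated draws when they must supply the randomness themselves.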

[42] CCD-CBT: Multi-Agent Therapeutic Interaction for CBT Guided by Cognitive Conceptualization Diagram

Chang Liu,Changsheng Ma,Yongfeng Tao,Bin Hu,Minqiang Yang

Main category: cs.CL

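TL;DR: This paper proposes CCD-CBT, a multi-agent framework that simulates real CBT counseling through a dynamically reconstructed Cognitive Conceptualization Diagram (CCD) and information-asymmetric interaction, and builds the synthetic dataset CCDCHAT; experiments show improved counseling fidelity and positive-affect enhancement.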

Details Motivation: Existing LLM simulations of CBT counseling mostly rely on static cognitive profiles and an omniscient single agent, which fails to reflect the dynamic, information-asymmetric nature of real therapy. Method: The CCD-CBT multi-agent framework 1) has a Control Agent dynamically update the Cognitive Conceptualization Diagram (CCD) and 2) has the Therapist Agent interact under information asymmetry, reasoning from inferred client states; the synthetic multi-turn CBT dialogue dataset CCDCHAT is generated under this framework. Result: In evaluations with clinical scales and expert therapists, models fine-tuned on CCDCHAT outperform strong baselines in both counseling fidelity and positive-affect enhancement; ablations confirm the necessity of dynamic CCD guidance and the asymmetric agent design. Conclusion: CCD-CBT offers a new paradigm for building theory-grounded, clinically plausible conversational agents for psychological support. Abstract: Large language models show potential for scalable mental-health support by simulating Cognitive Behavioral Therapy (CBT) counselors. However, existing methods often rely on static cognitive profiles and omniscient single-agent simulation, failing to capture the dynamic, information-asymmetric nature of real therapy. We introduce CCD-CBT, a multi-agent framework that shifts CBT simulation along two axes: 1) from a static to a dynamically reconstructed Cognitive Conceptualization Diagram (CCD), updated by a dedicated Control Agent, and 2) from omniscient to information-asymmetric interaction, where the Therapist Agent must reason from inferred client states. We release CCDCHAT, a synthetic multi-turn CBT dataset generated under this framework. Evaluations with clinical scales and expert therapists show that models fine-tuned on CCDCHAT outperform strong baselines in both counseling fidelity and positive-affect enhancement, with ablations confirming the necessity of dynamic CCD guidance and asymmetric agent design. Our work offers a new paradigm for building theory-grounded, clinically plausible conversational agents.

[43] To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs

Zohaib Khan,Mustafa Dogan,Ifeoma Okoh,Pouya Sadeghi,Siddhartha Shrestha,Sergius Justus Nyah,Mahmoud O. Mokhiamar,Michael J. Ryan,Tarek Naous

Main category: cs.CL

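TL;DR: This paper studies how large language models (LLMs) generate and spread misinformation across languages and countries. It builds GlobalLies, a multilingual parallel dataset of 440 prompt templates and 6,867 entities covering 8 languages and 195 countries, and finds that LLMs are substantially more likely to generate misinformation in lower-resource languages and for countries with a lower Human Development Index (HDI). Existing safeguards, such as input safety classifiers and retrieval-augmented fact-checking, protect unevenly across languages and regions. The authors release GlobalLies to support research on mitigating global misinformation.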

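Details Motivation: The strong text-generation ability of large language models lowers the barrier for malicious actors to create and spread false information, yet their misinformation generation behavior across languages and countries has not been studied systematically. Method: The authors build the multilingual parallel dataset GlobalLies (440 prompt templates, 6,867 entities, 8 languages, 195 countries); combine human annotation with large-scale LLM-as-a-judge evaluation over hundreds of thousands of generations from state-of-the-art models to analyze behavior across languages and countries; and assess the effectiveness and regional variation of existing mitigation strategies (input safety classifiers and retrieval-augmented fact-checking). Result: The tendency of LLMs to generate misinformation varies systematically by geography and language: propagation is substantially higher in lower-resource languages and for lower-HDI countries; current safety classifiers show cross-lingual performance gaps; and retrieval-augmented fact-checking is inconsistent because information availability differs by region. Conclusion: Global misinformation risk is structurally unequal, and new mitigation strategies must account for both language fairness and regional information availability; GlobalLies provides a key benchmark and resource for that work. Abstract: Misinformation is on the rise, and the strong writing capabilities of LLMs lower the barrier for malicious actors to produce and disseminate false information. We study how LLMs behave when prompted to spread misinformation across languages and target countries, and introduce GlobalLies, a multilingual parallel dataset of 440 misinformation generation prompt templates and 6,867 entities, spanning 8 languages and 195 countries. Using both human annotations and large-scale LLM-as-a-judge evaluations across hundreds of thousands of generations from state-of-the-art models, we show that misinformation generation varies systematically based on the country being discussed. Propagation of lies by LLMs is substantially higher in many lower-resource languages and for countries with a lower Human Development Index (HDI). We find that existing mitigation strategies provide uneven protection: input safety classifiers exhibit cross-lingual gaps, and retrieval-augmented fact-checking remains inconsistent across regions due to unequal information availability. We release GlobalLies for research purposes, aiming to support the development of mitigation strategies to reduce the spread of global misinformation: https://github.com/zohaib-khan5040/globallies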

[44] LLM-based Schema-Guided Extraction and Validation of Missing-Person Intelligence from Heterogeneous Data Sources

Joshua Castillo,Ravi Mukkamala

Main category: cs.CL

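TL;DR: This paper presents the Guardian Parser Pack, an AI-driven parsing and normalization pipeline that converts multi-source investigative documents (such as missing-person reports and child-safety material) into a unified, schema-compliant representation for operational review and spatial modeling. The system combines OCR, rule-based parsing, schema-first harmonization, and LLM-assisted extraction with validator-guided repair and shared geocoding. The LLM pathway is far more accurate than the deterministic one (F1 0.8664 vs. 0.2578) but slower; the deterministic pathway is faster and still keeps high field completeness (93.23% vs. 96.97%). The results support careful use of probabilistic AI in high-stakes investigations, provided it sits inside a schema-first, auditable architecture.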

Details Motivation: Missing-person and child-safety investigations depend on heterogeneous documents (structured forms, bulletin-style posters, narrative web profiles) whose layout, terminology, and data quality vary widely, which impedes rapid triage, large-scale analysis, and search planning. Method: The Guardian Parser Pack combines (i) multi-engine PDF text extraction with OCR fallback; (ii) rule-based source identification with source-specific parsers; (iii) schema-first harmonization and validation; and (iv) an optional LLM-assisted extraction pathway with validator-guided repair and a shared geocoding service. Result: On 75 manually aligned cases, the LLM pathway reaches F1 0.8664 versus 0.2578 for the deterministic pathway; across 517 records, key-field completeness is 96.97% for the LLM pathway versus 93.23% for the deterministic one; the LLM pathway averages 3.95 s per record, far slower than the deterministic pathway's 0.03 s; all LLM outputs passed initial schema validation, so validator-guided repair did not contribute to the gains. Conclusion: In high-stakes investigative settings, probabilistic AI such as LLMs should be embedded in a schema-first, auditable, controlled parsing pipeline that balances accuracy with reliability, rather than used in isolation. Abstract: Missing-person and child-safety investigations rely on heterogeneous case documents, including structured forms, bulletin-style posters, and narrative web profiles. Variations in layout, terminology, and data quality impede rapid triage, large-scale analysis, and search-planning workflows. This paper introduces the Guardian Parser Pack, an AI-driven parsing and normalization pipeline that transforms multi-source investigative documents into a unified, schema-compliant representation suitable for operational review and downstream spatial modeling. The proposed system integrates (i) multi-engine PDF text extraction with Optical Character Recognition (OCR) fallback, (ii) rule-based source identification with source-specific parsers, (iii) schema-first harmonization and validation, and (iv) an optional Large Language Model (LLM)-assisted extraction pathway incorporating validator-guided repair and shared geocoding services. We present the system architecture, key implementation decisions, and output design, and evaluate performance using both gold-aligned extraction metrics and corpus-level operational indicators. On a manually aligned subset of 75 cases, the LLM-assisted pathway achieved substantially higher extraction quality than the deterministic comparator (F1 = 0.8664 vs. 0.2578), while across 517 parsed records per pathway it also improved aggregate key-field completeness (96.97% vs. 93.23%). The deterministic pathway remained much faster (mean runtime 0.03 s/record vs. 3.95 s/record for the LLM pathway). In the evaluated run, all LLM outputs passed initial schema validation, so validator-guided repair functioned as a built-in safeguard rather than a contributor to the observed gains. These results support controlled use of probabilistic AI within a schema-first, auditable pipeline for high-stakes investigative settings.

[45] Scoring Edit Impact in Grammatical Error Correction via Embedded Association Graphs

Qiyuan Xiao,Xiaoman Wang,Yunshi Lan

Main category: cs.CL

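TL;DR: This paper proposes a new GEC task, Scoring Edit Impact, which automatically estimates the importance of the edits produced by a grammatical error correction system using an embedded association graph and perplexity-based scoring.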

Details Motivation: Existing GEC evaluation struggles when a sentence admits several valid corrections and when application scenarios differ in their needs; human meta-evaluation works but is hard to scale to large datasets. Method: The authors build a scoring framework on an embedded association graph that models latent dependencies and syntactic relatedness among edits, then use perplexity to quantify each edit's contribution to sentence fluency. Result: Experiments on 4 GEC datasets, 4 languages, and 4 GEC systems show the method consistently outperforms a range of baselines; analysis confirms that the embedded association graph captures cross-linguistic structural dependencies among edits. Conclusion: Automatic scoring of edit impact is an important direction for making GEC systems more interpretable and practical, and the proposed framework generalizes across languages and has potential for real deployment. Abstract: A Grammatical Error Correction (GEC) system produces a sequence of edits to correct an erroneous sentence. The quality of these edits is typically evaluated against human annotations. However, a sentence may admit multiple valid corrections, and existing evaluation settings do not fully accommodate diverse application scenarios. Recent meta-evaluation approaches rely on human judgments across multiple references, but they are difficult to scale to large datasets. In this paper, we propose a new task, Scoring Edit Impact in GEC, which aims to automatically estimate the importance of edits produced by a GEC system. To address this task, we introduce a scoring framework based on an embedded association graph. The graph captures latent dependencies among edits and groups syntactically related edits into coherent units. We then perform perplexity-based scoring to estimate each edit's contribution to sentence fluency. Experiments across 4 GEC datasets, 4 languages, and 4 GEC systems demonstrate that our method consistently outperforms a range of baselines. Further analysis shows that the embedded association graph effectively captures cross-linguistic structural dependencies among edits.

[46] Scientific Knowledge-driven Decoding Constraints Improving the Reliability of LLMs

Maotian Ma,Zheni Zeng,Zhenghao Liu,Yukun Yan

Main category: cs.CL

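TL;DR: This paper proposes SciDC, a method that constrains large language model (LLM) generation by converting subject-specific knowledge into multi-layered, standardized rules, markedly reducing hallucination and improving accuracy by 12% on average across several scientific tasks.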

Details Motivation: Large language models have strong knowledge reserves and task-solving ability but hallucinate severely; existing methods do not make effective use of highly condensed scientific theories and rules to guide model behavior. Method: The SciDC framework uses strong LLMs to automatically convert flexible domain knowledge into multi-layered, standardized constraint rules, yielding an extensible mechanism for constraining generation. Result: On scientific tasks including industrial formulation design, clinical tumor diagnosis, and retrosynthesis planning, accuracy improves by 12% on average over vanilla generation, supporting the method's effectiveness at reducing hallucination and improving reliability and domain fit. Conclusion: SciDC offers a workable path for constraining LLMs with structured scientific knowledge and points to the potential of LLMs to summarize condensed knowledge automatically to accelerate research. Abstract: Large language models (LLMs) have shown strong knowledge reserves and task-solving capabilities, but still face the challenge of severe hallucination, hindering their practical application. Though scientific theories and rules can efficiently direct the behaviors of human manipulators, LLMs still do not utilize this highly condensed knowledge sufficiently through training or prompting. To address this issue, we propose SciDC, an LLM generation method that integrates subject-specific knowledge with strong constraints. By adopting strong LLMs to automatically convert flexible knowledge into multi-layered, standardized rules, we build an extensible framework to effectively constrain the model generation on domain tasks. Experiments on scientific tasks including industrial formulation design, clinical tumor diagnosis and retrosynthesis planning, consistently demonstrate the effectiveness of our method, achieving a 12% accuracy improvement on average compared with vanilla generation. We further discuss the potential of LLMs in automatically and inductively summarizing highly condensed knowledge, looking ahead to practical solutions for accelerating the overall scientific research process. All the code of this paper is available at https://github.com/Maotian-Ma/SciDC.

[47] The Detection-Extraction Gap: Models Know the Answer Before They Can Say It

Hanyang Wang,Mingxuan Zhu

Main category: cs.CL

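TL;DR: This paper finds that modern reasoning models keep generating large amounts of redundant text after the answer is already determined (52–88% of chain-of-thought tokens fall into this category), names the phenomenon the "detection-extraction gap", and designs Black-box Adaptive Early Exit (BAEE), which cuts generation length by 70–78% while improving accuracy by 1–5 percentage points.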

Details Motivation: Modern reasoning models generate a great deal of redundant text: the answer is recoverable from a partial prefix early on, yet the model keeps emitting tokens, wasting computation and risking that a correct answer is overwritten. Method: The authors analyze chain-of-thought generation across multiple models and benchmarks to define and quantify the detection-extraction gap; based on the distributional difference between free continuation and forced extraction, they propose Black-box Adaptive Early Exit (BAEE), which uses free continuations for both answer detection and answer extraction. Result: BAEE reduces serial generation length by 70–78% on all tested models while improving accuracy by 1–5 percentage points; for thinking-mode models, early exit prevents the answer from being overwritten by later generation, with gains of up to 5.8 percentage points; a cost-optimized variant achieves a 68–73% reduction at a median of 9 API calls. Conclusion: An answer being recoverable does not make it extractable: the model state carries the answer, but standard decoding fails to use it; BAEE closes this gap with black-box free continuations, offering a new paradigm for efficient and reliable reasoning. Abstract: Modern reasoning models continue generating long after the answer is already determined. Across five model configurations, two families, and three benchmarks, we find that 52–88% of chain-of-thought tokens are produced after the answer is recoverable from a partial prefix. This post-commitment generation reveals a structural phenomenon: the detection-extraction gap. Free continuations from early prefixes recover the correct answer even at 10% of the trace, while forced extraction fails on 42% of these cases. The answer is recoverable from the model state, yet prompt-conditioned decoding fails to extract it. We formalize this mismatch via a total-variation bound between free and forced continuation distributions, yielding quantitative estimates of suffix-induced shift. Exploiting this asymmetry, we propose Black-box Adaptive Early Exit (BAEE), which uses free continuations for both detection and extraction, truncating 70–78% of serial generation while improving accuracy by 1–5 pp across all models. For thinking-mode models, early exit prevents post-commitment overwriting, yielding gains of up to 5.8 pp; a cost-optimized variant achieves 68–73% reduction at a median of 9 API calls. Code is available at https://github.com/EdWangLoDaSc/know2say.
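
One way to picture an early-exit rule built on free continuations is sketched below. It is a simplified illustration, not the BAEE algorithm: the checkpoint schedule, the agreement threshold, and the `sample_answers` callable (standing in for black-box API calls that freely continue a prefix and return final answers) are all assumptions.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence, Tuple


def early_exit(
    trace: Sequence[str],
    sample_answers: Callable[[Sequence[str], int], List[str]],  # hypothetical API wrapper
    checkpoints: Sequence[float] = (0.1, 0.25, 0.5, 0.75),      # hypothetical schedule
    n_samples: int = 5,
    agreement: float = 0.8,
) -> Tuple[Optional[str], int]:
    """Stop at the first prefix whose free continuations agree on one answer.

    Returns (answer, tokens_used); answer is None if no checkpoint agrees.
    """
    for fraction in checkpoints:
        cut = max(1, int(len(trace) * fraction))
        answers = sample_answers(trace[:cut], n_samples)
        answer, votes = Counter(answers).most_common(1)[0]
        if votes / len(answers) >= agreement:
            return answer, cut  # detection and extraction share the same samples
    return None, len(trace)


# Toy stand-in for the model: continuations disagree until three tokens are seen.
def toy_sampler(prefix: Sequence[str], n: int) -> List[str]:
    if len(prefix) >= 3:
        return ["42"] * n
    return ["?", "41", "42", "7", "42"][:n]


result = early_exit(["tok"] * 20, toy_sampler)
```

In the toy run the continuations disagree at 10% of the trace and agree at 25%, so the function returns the answer after 5 of 20 tokens.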

[48] DiffuMask: Diffusion Language Model for Token-level Prompt Pruning

Caleb Zheng,Jyotika Singh,Fang Tu,Weiyi Sun,Sujeeth Bharadwaj,Yassine Benajiba,Sujith Ravi,Eli Shlizerman,Dan Roth

Main category: cs.CL

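TL;DR: This paper presents DiffuMask, a diffusion-based prompt compression framework that uses hierarchical shot-level and token-level pruning signals for fast, parallel prompt pruning, cutting prompt length by up to 80% while maintaining or improving the reasoning accuracy of large language models.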

Details Motivation: In-Context Learning and Chain-of-Thought prompting improve LLM reasoning but often produce long, costly prompts containing redundant information; existing pruning-based compression methods remove tokens sequentially and are computationally expensive. Method: DiffuMask is a diffusion-based framework that integrates shot-level and token-level pruning signals and prunes multiple tokens in parallel through iterative mask prediction. Result: DiffuMask substantially accelerates compression, offers tunable control over retained content, maintains or improves accuracy in in-domain, out-of-domain, and cross-model settings, and achieves up to 80% prompt length reduction. Conclusion: DiffuMask is a generalizable, controllable prompt compression framework that helps make in-context learning in large language models faster and more reliable. Abstract: In-Context Learning and Chain-of-Thought prompting improve reasoning in large language models (LLMs). These typically come at the cost of longer, more expensive prompts that may contain redundant information. Prompt compression based on pruning offers a practical solution, yet existing methods rely on sequential token removal which is computationally intensive. We present DiffuMask, a diffusion-based framework integrating hierarchical shot-level and token-level pruning signals, that enables rapid and parallel prompt pruning via iterative mask prediction. DiffuMask substantially accelerates the compression process via masking multiple tokens in each denoising step. It offers tunable control over retained content, preserving essential reasoning context and achieving up to 80% prompt length reduction. Meanwhile, it maintains or improves accuracy across in-domain, out-of-domain, and cross-model settings. Our results show that DiffuMask provides a generalizable and controllable framework for prompt compression, facilitating faster and more reliable in-context reasoning in LLMs.

[49] Feedback Adaptation for Retrieval-Augmented Generation

Jihwan Bang,Seunghan Yang,Kyuhong Shim,Simyung Chang,Juntae Lee,Sungha Choi

Main category: cs.CL

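TL;DR: This paper introduces feedback adaptation as a new evaluation setting for RAG systems, defines two measurable axes, correction lag and post-feedback performance, and proposes PatchRAG, a method that needs no retraining and delivers immediate correction with strong generalization.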

Details Motivation: 现有RAG评估忽略系统在实际部署中接收用户/专家反馈后的动态适应能力,缺乏对反馈如何影响后续查询行为的刻画。 Method: 定义反馈适应问题,提出‘修正滞后’和‘反馈后性能’两个评估维度;设计PatchRAG——一种仅在推理时注入反馈、无需重训练的轻量方法。 Result: 实验表明训练式方法存在修正延迟与适应可靠性之间的权衡;PatchRAG实现即时修正且在语义相关查询上表现优异。 Conclusion: 反馈适应是RAG在交互场景中一个被忽视但关键的行为维度,需纳入系统评估与设计考量。 Abstract: Retrieval-Augmented Generation (RAG) systems are typically evaluated under static assumptions, despite being frequently corrected through user or expert feedback in deployment. Existing evaluation protocols focus on overall accuracy and fail to capture how systems adapt after feedback is introduced. We introduce feedback adaptation as a problem setting for RAG systems, which asks how effectively and how quickly corrective feedback propagates to future queries. To make this behavior measurable, we propose two evaluation axes: correction lag, which captures the delay between feedback provision and behavioral change, and post-feedback performance, which measures reliability on semantically related queries after feedback. Using these metrics, we show that training-based approaches exhibit a trade-off between delayed correction and reliable adaptation. We further propose PatchRAG, a minimal inference-time instantiation that incorporates feedback without retraining, demonstrating immediate correction and strong post-feedback generalization under the proposed evaluation. Our results highlight feedback adaptation as a previously overlooked dimension of RAG system behavior in interactive settings.
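
One possible way to operationalize the two evaluation axes is sketched below. The abstract does not give formulas, so measuring lag as a count of post-feedback queries and performance as plain accuracy are assumptions of this example, as are the function names.

```python
from typing import List, Optional


def correction_lag(correct: List[bool]) -> Optional[int]:
    """Number of post-feedback queries answered before the first corrected one.

    `correct[i]` says whether the i-th query issued after feedback reflected
    the correction. Returns None if the system never adapts (assumed convention).
    """
    for i, ok in enumerate(correct):
        if ok:
            return i
    return None


def post_feedback_performance(correct: List[bool]) -> float:
    """Accuracy on semantically related queries issued after feedback."""
    return sum(correct) / len(correct) if correct else 0.0
```

Under this reading, an inference-time method that applies feedback immediately would have a lag of 0, while a method that needs a retraining cycle would show a positive lag.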

[50] A Parameter-Efficient Transfer Learning Approach through Multitask Prompt Distillation and Decomposition for Clinical NLP

Cheng Peng,Mengxian Lyu,Ziyi Chen,Yonghui Wu

Main category: cs.CL

TL;DR: This paper proposes a multitask prompt distillation and decomposition framework that learns a single shared metaprompt from 21 clinical NLP source tasks and adapts it to new tasks with very few trainable parameters (<0.05%), outperforming LoRA and single-task prompt tuning on 10 datasets spanning five clinical NLP task types.

Details Motivation: Existing prompt-based fine-tuning methods learn a separate prompt for each clinical NLP task, which creates large compute and storage overhead at deployment scale. Method: Proposes a multitask prompt distillation and decomposition framework that distills a shared metaprompt from 21 clinical source tasks and transfers it to unseen target tasks through lightweight adaptation; evaluates on five task types, 10 datasets, and three backbone models (LLaMA 3.1 8B, Meditron3 8B, gpt-oss 20B). Result: Improves over LoRA by 1.5–1.7% and over single-task prompt tuning by 6.1–6.6%; gpt-oss 20B performs best on clinical reasoning tasks; the framework shows strong zero-shot and few-shot transfer. Conclusion: A shared metaprompt greatly reduces parameter count and resource use while improving cross-task generalization and the deployment efficiency of clinical NLP systems. Abstract: Existing prompt-based fine-tuning methods typically learn task-specific prompts independently, imposing significant computing and storage overhead at scale when deploying multiple clinical natural language processing (NLP) systems. We present a multitask prompt distillation and decomposition framework that learns a single shared metaprompt from 21 diverse clinical source tasks and adapts it to unseen target tasks with fewer than 0.05% trainable parameters. Evaluated across five clinical NLP task types (named entity recognition, relation extraction, question answering, natural language inference, and summarization) on 10 held-out target datasets using three backbone models (LLaMA 3.1 8B, Meditron3 8B, gpt-oss 20B), our framework consistently outperforms LoRA by 1.5–1.7% despite using orders of magnitude fewer parameters, and exceeds single-task prompt tuning by 6.1–6.6%. The gpt-oss 20B model achieves the highest overall performance, particularly on clinical reasoning tasks. The strong zero- and few-shot performance demonstrates better transferability of the shared prompt representation.

[51] A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM

Bo Wang,Jing Ma,Hongzhan Lin,Zhiwei Yang,Ruichao Yang,Yuan Tian,Yi Chang

Main category: cs.CL

TL;DR: This paper proposes G-Defense, a framework that builds a claim-centered graph and uses RAG to retrieve evidence and generate competing explanations, enabling fine-grained, explainable fake news detection that relies only on unverified reports.

Details Motivation: Existing explainable fake news detection methods are inefficient on breaking news, and externally retrieved reports may introduce inaccuracies; in addition, every aspect of a claim needs a comprehensible explanation to help the public verify it. Method: Proposes the graph-enhanced defense framework (G-Defense): 1) decompose the news claim into sub-claims and model their dependencies to build a claim-centered graph; 2) for each sub-claim, use RAG to retrieve salient evidence and generate competing explanations; 3) apply a graph-based, defense-like inference module to assess overall veracity; 4) prompt an LLM to generate an intuitive explanation graph. Result: G-Defense achieves state-of-the-art performance in both veracity detection and explanation quality. Conclusion: Relying only on unverified reports, G-Defense delivers fine-grained, explainable, and accurate fake news detection while balancing efficiency and trustworthy explanation generation. Abstract: Explainable fake news detection aims to assess the veracity of news claims while providing human-friendly explanations. Existing methods incorporating investigative journalism are often inefficient and struggle with breaking news. Recent advances in large language models (LLMs) enable leveraging externally retrieved reports as evidence for detection and explanation generation, but unverified reports may introduce inaccuracies. Moreover, effective explainable fake news detection should provide a comprehensible explanation for all aspects of a claim to assist the public in verifying its accuracy. To address these challenges, we propose a graph-enhanced defense framework (G-Defense) that provides fine-grained explanations based solely on unverified reports. Specifically, we construct a claim-centered graph by decomposing the news claim into several sub-claims and modeling their dependency relationships. For each sub-claim, we use the retrieval-augmented generation (RAG) technique to retrieve salient evidence and generate competing explanations. We then introduce a defense-like inference module based on the graph to assess the overall veracity. Finally, we prompt an LLM to generate an intuitive explanation graph. Experimental results demonstrate that G-Defense achieves state-of-the-art performance in both veracity detection and the quality of its explanations.

[52] Between Century and Poet: Graph-Based Lexical Semantic Change in Persian Poetry

Kourosh Shahnazari,Seyed Moein Ayyoubzadeh,Mohammadali Keshtparvar

Main category: cs.CL

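TL;DR: This study combines aligned Word2Vec spaces with graph-based neighborhood analysis to examine how word meaning in Persian poetry changes historically and relationally, arguing that semantic change is the rewiring of local semantic graphs (neighbors gained and lost, shifting bridge roles, movement across communities) rather than vector displacement alone.

Details Motivation: Traditional models of semantic change, such as vector displacement, struggle to capture how word meaning in Persian poetry depends on literary tradition, contextual networks, and individual poets' styles; a computational method closer to literary practice is needed. Method: Uses Word2Vec embedding spaces aligned across centuries and poets, supplemented by graph-based neighborhood analysis focused on neighbor changes, bridge roles, and community membership; analyzes twenty target words, anchored by five reference terms (Earth, Night, two wine terms, and Heart), together with surrounding affective, Sufi, and related concepts. Result: Words follow distinct patterns: Night is time-sensitive, Earth is poet-sensitive, and Heart is semantically stable but mobile in graph role; the two wine terms show probe sensitivity (one broad and diffuse, the other narrow and stable); a lexical audit confirms that the corpus contains historically driven terms, poet-specific usages, and sparsely attested mystical vocabulary. Conclusion: Semantic change in Persian poetry is better described as the rewiring of local semantic graphs than as abstract drift; the method gives Digital Humanities an approach that combines local structure with literary interpretability (persistence, migration, mediation, selective transformation). Abstract: Meaning in Persian poetry is both historical and relational. Words persist through literary tradition while shifting their force through changing constellations of neighbors, rhetorical frames, and poetic voices. This study examines that process using aligned Word2Vec spaces combined with graph-based neighborhood analysis across centuries and major poets. Rather than modeling semantic change as vector displacement alone, it treats lexical history as the rewiring of local semantic graphs: the gain and loss of neighbors, shifts in bridge roles, and movement across communities. The analysis centers on twenty target words, anchored by five recurrent reference terms: Earth, Night, two wine terms, and Heart. Surrounding them are affective, courtly, elemental, and Sufi concepts such as Love, Sorrow, Dervish, King, Annihilation, and Truth. These words exhibit distinct patterns of change. Night is more time-sensitive, Earth more poet-sensitive, and Heart shows continuity despite graph-role mobility. The two wine terms highlight probe sensitivity: one is broad and semantically diffuse, while the other is narrower and more stable. A lexical audit confirms that the corpus contains historically driven terms, poet-specific usages, and sparsely attested mystical vocabulary requiring caution. Overall, semantic change in Persian poetry is better captured as neighborhood rewiring than as abstract drift. For Digital Humanities, this approach restores local structure to computational analysis and supports interpretations closer to literary practice: persistence, migration, mediation, and selective transformation.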
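
The idea of measuring change as neighbor gain and loss, rather than as a single displacement value, can be sketched as follows. This is only a minimal illustration: the function name, the use of Jaccard overlap, and the English placeholder neighbors are assumptions, and the paper's bridge-role and community analyses are not covered.

```python
from typing import Dict, Set, Union


def neighbor_rewiring(old: Set[str], new: Set[str]) -> Dict[str, Union[Set[str], float]]:
    """Compare a word's nearest-neighbor sets from two periods (or two poets)."""
    union = old | new
    return {
        "gained": new - old,   # neighbors that appear only in the later space
        "lost": old - new,     # neighbors that drop out
        "kept": old & new,     # stable core of the neighborhood
        # Jaccard overlap: 1.0 means an unchanged neighborhood.
        "jaccard": len(old & new) / len(union) if union else 1.0,
    }


# Placeholder neighbor sets for one target word in two centuries (invented data).
report = neighbor_rewiring({"moon", "dark", "sleep"}, {"moon", "dark", "union"})
```

A word can keep a high embedding similarity across periods and still show low overlap here, which is the kind of local restructuring the abstract describes.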

[53] ChemVLR: Prioritizing Reasoning in Perception for Chemical Vision-Language Understanding

Xuanle Zhao,Xinyuan Cai,Xiang Cheng,Xiuyi Chen,Bo Xu

Main category: cs.CL

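TL;DR: This paper proposes ChemVLR, a vision-language model for chemical visual understanding that embeds interpretable reasoning in the perception process; it identifies fine-grained chemical descriptors such as functional groups, builds its dataset through cross-modality reverse engineering, and uses a three-stage training framework to reach SOTA performance.

Details Motivation: Existing chemical vision-language models are mostly optimized for direct visual question answering and lack the ability to reason about reaction mechanisms, which makes them "black boxes" that do not exploit the inherent reasoning strengths of LLMs. Method: Proposes ChemVLR: 1) fine-grained visual analysis that explicitly identifies chemical descriptors such as functional groups; 2) a cross-modality reverse-engineering strategy with a rigorous filtering pipeline to build a 760k-sample reasoning-and-captioning dataset; 3) a three-stage training framework that systematically builds perception and reasoning capacity. Result: Achieves SOTA performance on molecular and reaction tasks, surpassing leading proprietary models and domain-specific open-source baselines; ablation studies validate the training strategy and data construction design. Conclusion: Integrating fine-grained perception guided by explicit chemical knowledge with structured reasoning in the VLM architecture is key to improving the interpretability and performance of chemical visual understanding models; ChemVLR offers a new approach to explainable AI in chemistry. Abstract: While Vision-Language Models (VLMs) have demonstrated significant potential in chemical visual understanding, current models are predominantly optimized for direct visual question-answering tasks. This paradigm often results in "black-box" systems that fail to utilize the inherent capability of Large Language Models (LLMs) to infer underlying reaction mechanisms. In this work, we introduce ChemVLR, a chemical VLM designed to prioritize reasoning within the perception process. Unlike conventional chemical VLMs, ChemVLR analyzes visual inputs in a fine-grained manner by explicitly identifying granular chemical descriptors, such as functional groups, prior to generating answers. This approach ensures the production of explicit and interpretable reasoning paths for complex visual chemical problems. To facilitate this methodology, we implement a cross-modality reverse-engineering strategy, combined with a rigorous filtering pipeline, to curate a large-scale reasoning-and-captioning dataset comprising 760k high-quality samples across molecular and reaction tasks. Furthermore, we adopt a three-stage training framework that systematically builds model perception and reasoning capacity. Experiments demonstrate that ChemVLR achieves state-of-the-art (SOTA) performance, surpassing both leading proprietary models and domain-specific open-source baselines. We also provide comprehensive ablation studies to validate our training strategy and data generation designs. Code and model weights will be available at https://github.com/xxlllz/ChemVLR.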

[54] Adaptive Prompt Structure Factorization: A Framework for Self-Discovering and Optimizing Compositional Prompt Programs

Haoyue Liu,Zhichao Wang,Yongxin Guo,Haoran Shou,Xiaoying Tang

Main category: cs.CL

TL;DR: This paper proposes aPSF, a framework in which an Architect model discovers task-specific prompt structures as semantic factors and then applies interventional, single-factor updates, improving LLM reasoning performance while sharply reducing optimization cost.

Details Motivation: Most existing API-only prompt optimizers iteratively edit the whole prompt, which couples components, makes credit assignment difficult, limits controllability, and wastes tokens. Method: Proposes Adaptive Prompt Structure Factorization (aPSF), which uses an Architect model to discover semantic factors and performs single-factor updates via interventional factor-level scoring and error-guided factor selection. Result: Outperforms strong baselines (including principle-aware optimizers) on multiple advanced reasoning benchmarks, with average accuracy gains of up to 2.16 percentage points; on MultiArith, optimization token cost drops by 45–87% and peak validation performance is reached in 1 step. Conclusion: aPSF makes API-only prompt optimization more controllable, efficient, and interpretable, offering a new approach to reliable LLM reasoning. Abstract: Automated prompt optimization is crucial for eliciting reliable reasoning from large language models (LLMs), yet most API-only prompt optimizers iteratively edit monolithic prompts, coupling components and obscuring credit assignment, limiting controllability, and wasting tokens. We propose Adaptive Prompt Structure Factorization (aPSF), an API-only framework (prompt-in/text-out; no access to model internals) that uses an Architect model to discover task-specific prompt structures as semantic factors. aPSF then performs interventional, single-factor updates: interventional factor-level scoring estimates each factor's marginal contribution via validation-performance changes, and error-guided factor selection routes updates to the current dominant failure source for more sample-efficient optimization. Across multiple advanced reasoning benchmarks, aPSF outperforms strong baselines including principle-aware optimizers, improving accuracy by up to 2.16 percentage points on average, and reduces optimization token cost by 45–87% on MultiArith while reaching peak validation performance in 1 step.
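
A minimal sketch of factor-level scoring is shown below: each factor's marginal contribution is estimated as the drop in a validation score when that factor is removed. Using ablation as the intervention is an assumption of this example, and `factor_contributions` and the `evaluate` callback are hypothetical names rather than the paper's API.

```python
from typing import Callable, Dict


def factor_contributions(
    factors: Dict[str, str],
    evaluate: Callable[[Dict[str, str]], float],
) -> Dict[str, float]:
    """Estimate each prompt factor's marginal contribution by ablation.

    `factors` maps a factor name (e.g. "role", "format") to its prompt text;
    `evaluate` is an assumed black-box that scores a prompt assembled from
    the given factors on a validation set.
    """
    base = evaluate(factors)
    contributions = {}
    for name in factors:
        ablated = {k: v for k, v in factors.items() if k != name}
        # Positive value: removing this factor hurts validation performance.
        contributions[name] = base - evaluate(ablated)
    return contributions
```

An optimizer could then rewrite only the factor with the lowest or most negative contribution, leaving the rest of the prompt untouched.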

[55] TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving

Xinkai Zhang,Jingtao Zhan,Yiqun Liu,Qingyao Ai

Main category: cs.CL

TL;DR: This paper introduces the Trial-and-Error Collection (TEC) platform and dataset, which record complete human behavior trajectories and error reflections during trial-and-error, filling the gap in high-quality human trial-and-error data for AI systems and showing that humans substantially outperform LLMs on trial-and-error tasks.

Details Motivation: Existing trial-and-error AI methods mostly rely on simple hand-designed heuristics and yield limited gains; the core bottleneck is the lack of fine-grained data on real human trial-and-error behavior. Method: Builds the TEC data annotation platform, which records users' complete trajectories across multiple trials and their reflections after receiving error feedback; collects 5,370 trial trajectories with reflections from 46 participants on 58 tasks, covering 41,229 webpages. Result: Humans achieve substantially higher accuracy than current LLMs on trial-and-error tasks; the TEC platform and dataset are publicly available. Conclusion: TEC provides a data foundation for modeling and understanding human trial-and-error behavior and may support more robust, adaptive trial-and-error capabilities in AI. Abstract: Trial-and-error is a fundamental strategy for humans to solve complex problems and a necessary capability for Artificial Intelligence (AI) systems operating in real-world environments. Although several trial-and-error AI techniques have recently been proposed, most of them rely on simple heuristics designed by researchers and achieve limited performance gains. The core issue is the absence of appropriate data: current models cannot learn from detailed records of how humans actually conduct trial-and-error in practice. To address this gap, we introduce a data annotation platform and a corresponding dataset, termed Trial-and-Error Collection (TEC). The platform records users' complete trajectories across multiple trials and collects their reflections after receiving error feedback. Using this platform, we record the problem-solving processes of 46 participants on 58 tasks, resulting in 5,370 trial trajectories along with error reflections across 41,229 webpages. With this dataset, we observe that humans achieve substantially higher accuracy compared to LLMs, which demonstrates that humans are more effective in trial-and-error than LLMs. We believe that the TEC platform and dataset provide a valuable foundation for understanding human trial-and-error behavior and for developing more capable AI systems. Platform and dataset are publicly available.

[56] SQLStructEval: Structural Evaluation of LLM Text-to-SQL Generation

Yixi Zhou,Fan Zhang,Zhiqiao Guo,Yu Chen,Haipeng Zhang,Preslav Nakov,Zhuohan Xie

Main category: cs.CL

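TL;DR: This paper proposes SQLStructEval, a framework that analyzes the structural reliability of LLM-generated SQL through canonical abstract syntax trees (ASTs); it finds that current models are sensitive to surface-level input changes and produce structurally diverse queries, and that compile-style structured generation improves both execution accuracy and structural consistency.

Details Motivation: LLMs perform strongly on Text-to-SQL, but the structural reliability of the SQL they generate is unclear and needs systematic evaluation. Method: Proposes SQLStructEval, which analyzes SQL structure via canonical AST representations; compares the structural responses of different LLMs to paraphrases and schema presentation changes on the Spider benchmark; introduces a compile-style structured generation pipeline as an intervention. Result: Correct LLM-generated SQL often shows high structural variance; surface-level input perturbations (such as paraphrases and schema order) significantly affect output structure; the structured generation pipeline improves execution accuracy and structural consistency together. Conclusion: Structural reliability is an important and overlooked dimension for evaluating LLM program generation; structured modeling, such as a compile-style pipeline, is an effective way to improve it. Abstract: Despite strong performance on Text-to-SQL benchmarks, it remains unclear whether LLM-generated SQL programs are structurally reliable. In this work, we investigate the structural behavior of LLM-generated SQL queries and introduce SQLStructEval, a framework for analyzing program structures through canonical abstract syntax tree (AST) representations. Our experiments on the Spider benchmark show that modern LLMs often produce structurally diverse queries for the same input, even when execution results are correct, and that such variance is frequently triggered by surface-level input changes such as paraphrases or schema presentation. We further show that generating queries in a structured space via a compile-style pipeline can improve both execution accuracy and structural consistency. These findings suggest that structural reliability is a critical yet overlooked dimension for evaluating LLM-based program generation systems. Our code is available at https://anonymous.4open.science/r/StructEval-2435.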
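
To show what comparing queries through a canonical AST can mean, the toy sketch below normalizes a hand-built expression tree so that two superficially different predicates compare equal. The tuple encoding, the `COMMUTATIVE` set, and the normalization rules are assumptions made for illustration; they are not SQLStructEval's actual representation, and no real SQL parser is involved.

```python
from typing import Tuple, Union

Node = Union[str, int, Tuple]

# Operators whose operand order does not change meaning (assumed for this toy).
COMMUTATIVE = {"AND", "OR", "="}


def canonical(node: Node) -> Node:
    """Return a canonical form of a toy AST given as nested tuples (op, *children)."""
    if not isinstance(node, tuple):
        # Toy rule: every leaf is treated as a case-insensitive identifier.
        return str(node).lower()
    op, *children = node
    kids = [canonical(child) for child in children]
    if op in COMMUTATIVE:
        kids.sort(key=repr)  # fixed order for order-insensitive operators
    return (op, *kids)


# WHERE a = 1 AND B = 2   versus   WHERE 2 = b AND 1 = A
q1 = ("AND", ("=", "a", 1), ("=", "B", 2))
q2 = ("AND", ("=", 2, "b"), ("=", 1, "A"))
same_structure = canonical(q1) == canonical(q2)
```

Structural consistency across paraphrased inputs could then be measured as the share of generated queries whose canonical forms match.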

[57] Luwen Technical Report

Yiquan Wu,Yuhang Liu,Yifei Liu,Ang Li,Siying Zhou,Kun Kuang

Main category: cs.CL

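TL;DR: This paper presents Luwen, an open-source Chinese legal language model built on the Baichuan foundation model through continual pre-training, supervised fine-tuning, and retrieval-augmented generation, and shows that it outperforms strong baselines on five legal tasks.

Details Motivation: Applying large language models to the legal domain is difficult because of specialized terminology, complex reasoning, and rapidly changing legal knowledge, so a domain-adapted model is needed. Method: Builds on the Baichuan model with three techniques: continual pre-training on a large-scale legal corpus, supervised fine-tuning on carefully curated legal instruction data, and retrieval-augmented generation over a comprehensive legal knowledge base. Result: Luwen outperforms several strong baselines on legal judgment prediction, judicial examination, legal text summarization, law article question answering, and judicial decision reasoning. Conclusion: The approach effectively adapts a general-purpose LLM to the Chinese legal domain, and Luwen, as an open-source model, supports legal AI research and applications. Abstract: Large language models have demonstrated remarkable capabilities across a wide range of natural language processing tasks, yet their application in the legal domain remains challenging due to the specialized terminology, complex reasoning requirements, and rapidly evolving legal knowledge involved. In this paper, we present Luwen, an open-source Chinese legal language model built upon the Baichuan foundation model through three key techniques: continual pre-training on a large-scale legal corpus, supervised fine-tuning with carefully curated legal instruction data, and retrieval-augmented generation integrated with a comprehensive legal knowledge base. We evaluate Luwen on five representative legal tasks spanning both prediction and generation settings, including legal judgment prediction, judicial examination, legal text summarization, law article question answering, and judicial decision reasoning. Experimental results show that Luwen outperforms several strong baselines, demonstrating the effectiveness of our approach in adapting general-purpose language models to the legal domain.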

[58] StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference

Zhirui Chen,Peiyang Liu,Ling Shao

Main category: cs.CL

TL;DR: This paper proposes StructKV, a structure-aware KV cache compression framework that addresses the KV cache memory bottleneck in long-context LLM inference through Global In-Degree Centrality, Dynamic Pivot Detection, and Structural Propagation and Decoupling.

Details Motivation: As LLM context windows grow to millions of tokens, the linear growth of the Key-Value (KV) cache creates severe memory and bandwidth bottlenecks; existing compression methods rely on local saliency at a single layer and tend to discard tokens that act as global information hubs across network depth but appear dormant at that layer. Method: Proposes StructKV with three components: 1) Global In-Degree Centrality, which aggregates attention patterns across network depth to identify global information hubs; 2) Dynamic Pivot Detection, which uses information-theoretic metrics to adaptively locate the best layer for compression; 3) Structural Propagation and Decoupling, which separates the computational budget from the memory storage budget. Result: On the LongBench and RULER benchmarks, StructKV preserves long-range dependencies and retrieval robustness. Conclusion: Through structure awareness, StructKV overcomes the tendency of conventional KV compression to ignore tokens' global roles, easing the memory bottleneck of long-context inference without sacrificing performance. Abstract: As Large Language Models (LLMs) scale to support context windows exceeding one million tokens, the linear growth of Key-Value (KV) cache imposes severe memory capacity and bandwidth bottlenecks, constraining the efficiency of long-context inference. Existing compression approaches typically prioritize tokens based on local saliency metrics to decouple prefill computation from decoding memory. However, these methods often rely on local saliency snapshots at a specific layer, thereby systematically discarding tokens that act as global information hubs across the network depth but appear temporarily dormant at the specific layer selected for pruning. To address this limitation, we propose StructKV, a structure-aware KV cache compression framework that introduces three core innovations: First, Global In-Degree Centrality aggregates attention patterns across the network depth to identify global information hubs. Second, Dynamic Pivot Detection utilizes information-theoretic metrics to adaptively locate the optimal layer for compression. Finally, Structural Propagation and Decoupling separates the computational budget from the memory storage budget. Experimental results on the LongBench and RULER benchmarks demonstrate that StructKV effectively preserves long-range dependencies and retrieval robustness.
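
The contrast between a single-layer saliency snapshot and a depth-aggregated score can be illustrated with a small sketch. Summing the attention each token receives over all layers and queries is one plausible reading of "in-degree centrality"; the exact formula, the function names, and the toy attention values are assumptions of this example.

```python
from typing import List

Attention = List[List[List[float]]]  # [layer][query][key] attention weights


def global_in_degree(attn: Attention) -> List[float]:
    """Total attention each key token receives, summed over layers and queries."""
    scores = [0.0] * len(attn[0][0])
    for layer in attn:
        for row in layer:
            for key_idx, weight in enumerate(row):
                scores[key_idx] += weight
    return scores


def keep_top_k(scores: List[float], k: int) -> List[int]:
    """Indices of the k highest-scoring tokens, returned in positional order."""
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    return sorted(ranked[:k])


# Toy data: token 2 is ignored in layer 0 but heavily attended in layer 1.
layer0 = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
layer1 = [[0.2, 0.0, 0.8], [0.1, 0.0, 0.9]]
kept_global = keep_top_k(global_in_degree([layer0, layer1]), 2)
kept_layer0_only = keep_top_k(global_in_degree([layer0]), 2)
```

With these toy values, the layer-0 snapshot would evict token 2 even though it becomes a hub in the deeper layer, while the aggregated score keeps it.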

[59] Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents

Heng Zhou,Zelin Tan,Zhemeng Zhang,Yutao Fan,Yibing Lin,Li Kang,Xiufeng Song,Rui Li,Songtao Huang,Ao Yu,Yuchen Fan,Yanxu Chen,Kaixin Xu,Xiaohong Liu,Yiran Qin,Philip Torr,Chen Zhang,Zhenfei Yin

Main category: cs.CL

TL;DR: This paper studies how different reasoning paradigms (such as CoT and ReAct) affect LLM performance and finds that no single paradigm is best on all tasks; it proposes a lightweight embedding-based router that picks the most suitable paradigm before each task, raising average accuracy and outperforming both fixed paradigms and zero-shot self-routing.

Details Motivation: Determine whether LLM performance gains come from the model itself or from the reasoning paradigm wrapped around it, and test whether paradigms are complementary across tasks. Method: Systematically compares six reasoning paradigms (Direct, CoT, ReAct, Plan-Execute, Reflection, ReCode) across four frontier LLMs and ten benchmarks, about 18,000 runs in total; then proposes and evaluates a lightweight embedding-based routing mechanism (select-then-solve) alongside zero-shot self-routing. Result: Paradigm performance is highly task-dependent, and oracle per-task selection beats the best fixed paradigm by 17.1pp; the proposed router raises average accuracy from 47.6% to 53.1%, exceeding the best fixed paradigm (50.3%) by 2.8pp and recovering up to 37% of the oracle gap; zero-shot self-routing works only for GPT-5 (67.1%) and fails for the other models. Conclusion: Choosing a reasoning paradigm should be a per-task decision made by a learned router, not a fixed architectural choice. Abstract: When an LLM-based agent improves on a task, is the gain from the model itself or from the reasoning paradigm wrapped around it? We study this question by comparing six inference-time paradigms, namely Direct, CoT, ReAct, Plan-Execute, Reflection, and ReCode, across four frontier LLMs and ten benchmarks, yielding roughly 18,000 runs. We find that reasoning structure helps dramatically on some tasks but hurts on others: ReAct improves over Direct by 44pp on GAIA, while CoT degrades performance by 15pp on HumanEval. No single paradigm dominates, and oracle per-task selection beats the best fixed paradigm by 17.1pp on average. Motivated by this complementarity, we propose a select-then-solve approach: before answering each task, a lightweight embedding-based router selects the most suitable paradigm. Across four models, the router improves average accuracy from 47.6% to 53.1%, outperforming the best fixed paradigm at 50.3% by 2.8pp and recovering up to 37% of the oracle gap. In contrast, zero-shot self-routing only works for GPT-5 at 67.1% and fails for weaker models, all trailing the learned router. Our results argue that reasoning paradigm selection should be a per-task decision made by a learned router, not a fixed architectural choice.
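
A select-then-solve step can be sketched as a nearest-neighbor lookup over task embeddings. The abstract only says the router is lightweight and embedding-based, so the nearest-neighbor rule, the cosine metric, the function names, and the two-dimensional toy vectors are all assumptions of this example.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(task_vec: Sequence[float],
          labeled: List[Tuple[Sequence[float], str]]) -> str:
    """Pick the paradigm that worked best on the most similar training task.

    `labeled` holds (task embedding, best paradigm) pairs collected offline.
    """
    return max(labeled, key=lambda example: cosine(task_vec, example[0]))[1]


# Toy router memory: tool-heavy tasks favored ReAct, simple ones favored Direct.
memory = [([1.0, 0.0], "ReAct"), ([0.0, 1.0], "Direct")]
choice = route([0.9, 0.1], memory)
```

The selected paradigm name would then determine which prompting scaffold wraps the model call for that task.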

[60] How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality

Minzhu Tu,Shiyu Ni,Keping Bi

Main category: cs.CL

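TL;DR: This paper systematically studies how giving an LLM judge access to the generator's reasoning chain affects judgment accuracy: weak judges are easily misled by fluent but incorrect reasoning, and strong judges can partly use the reasoning as evidence yet are still misled by reasoning that merely looks high quality; both the fluency and the factuality of reasoning chains strongly influence decisions, which underlines the need for robust judges that separate real reasoning quality from surface fluency.

Details Motivation: LLM judges are prone to surface-level biases and may lack the information needed to assess answer correctness; exposing the generator's reasoning chain provides richer information, but its actual effect on judge behavior is unclear. Method: On factual QA and mathematical reasoning benchmarks, systematically compares LLM judges of different strengths with and without reasoning-chain input, and uses controlled experiments to separate the effects of reasoning fluency and factuality on decisions. Result: Weak judges are easily swayed by fluent but incorrect reasoning into accepting wrong answers; strong judges can partly use reasoning chains as evidence but are still misled by seemingly high-quality incorrect reasoning; both fluency and factuality significantly affect judgments. Conclusion: Current LLM judges cannot reliably distinguish genuine reasoning quality from surface fluency, and more robust judging mechanisms are needed for evaluating modern reasoning models. Abstract: Large language models (LLMs) have been widely adopted as a scalable surrogate for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases. One possible reason is that these judges lack sufficient information for assessing answer correctness. With the rise of reasoning-capable models, exposing a generator's reasoning content to the judge provides richer information and is a natural candidate for improving judgment accuracy. However, its actual impact on judge behavior remains understudied. In this paper, we systematically investigate how access to reasoning chains affects LLM-based judgment across factual question answering (QA) and mathematical reasoning benchmarks. We find that weak judges are easily swayed by reasoning presence, frequently accepting incorrect answers accompanied by fluent reasoning, while strong judges can partially leverage reasoning as informative evidence. Nevertheless, even strong judges are misled by seemingly high-quality reasoning chains. Controlled experiments further reveal that both fluency and factuality of reasoning chains are critical signals driving judge decisions. These findings highlight the need for more robust LLM judges that can distinguish genuine reasoning quality from superficial fluency when evaluating modern reasoning models.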

[61] Multilingual Cognitive Impairment Detection in the Era of Foundation Models

Damar Hoogland,Boshko Koloski,Jaya Caporusso,Tine Kolenik,Ana Zwitter Vitez,Senja Pollak,Christina Manouilidou,Matthew Purver

Main category: cs.CL

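TL;DR: This paper evaluates cognitive impairment (CI) classification from speech transcripts in English, Slovene, and Korean, comparing zero-shot large language models (LLMs) with supervised tabular models, and finds that the latter perform better in small-data settings when engineered linguistic features are combined with embeddings.

Details Motivation: Achieve automatic cross-lingual cognitive impairment (CI) detection under small-data conditions and explore methods that work without large amounts of labelled data. Method: Compares zero-shot LLMs (three input settings: transcript-only, linguistic-features-only, and combined) with supervised tabular models (based on engineered linguistic features, transcript embeddings, and early or late fusion); uses leave-one-out cross-validation; adds few-shot experiments to analyze how the value of supervision depends on language. Result: Zero-shot LLMs provide competitive no-training baselines, but supervised tabular models are better overall, especially when engineered linguistic features are fused with embeddings; few-shot gains are language-dependent, with some languages benefiting substantially from a few labelled examples while others remain limited by feature representations. Conclusion: For small-data CI detection, structured linguistic signals and simple fusion-based classifiers remain strong and reliable choices; zero-shot LLM capability alone does not yet replace carefully designed features and supervised learning. Abstract: We evaluate cognitive impairment (CI) classification from transcripts of speech in English, Slovene, and Korean. We compare zero-shot large language models (LLMs) used as direct classifiers under three input settings (transcript-only, linguistic-features-only, and combined) with supervised tabular approaches trained under a leave-one-out protocol. The tabular models operate on engineered linguistic features, transcript embeddings, and early or late fusion of both modalities. Across languages, zero-shot LLMs provide competitive no-training baselines, but supervised tabular models generally perform better, particularly when engineered linguistic features are included and combined with embeddings. Few-shot experiments focusing on embeddings indicate that the value of limited supervision is language-dependent, with some languages benefiting substantially from additional labelled examples while others remain constrained without richer feature representations. Overall, the results suggest that, in small-data CI detection, structured linguistic signals and simple fusion-based classifiers remain strong and reliable choices.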

[62] TeamLLM: A Human-Like Team-Oriented Collaboration Framework for Multi-Step Contextualized Tasks

Xiangyu Wang,Jin Wu,Haoran Shi,Wei Xia,Jiarui Yu,Chanjin Zheng

Main category: cs.CL

TL;DR: This paper proposes TeamLLM, a framework that emulates human team role division and uses four roles and three-phase collaboration to improve performance on multi-step contextualized tasks, and builds the CGPST benchmark for systematic evaluation.

Details Motivation: Existing multi-LLM frameworks do not explicitly model human team role division, which leads to a single perspective and weak performance on multi-step contextualized tasks. Method: Proposes TeamLLM, a team-oriented multi-LLM collaboration framework with four differentiated roles and a three-phase collaboration mechanism; also builds the CGPST benchmark, which features contextual grounding, procedural structure, process-oriented evaluation, and multi-dimensional assessment. Result: TeamLLM substantially improves performance on CGPST, on which ten popular LLMs are evaluated at the overall, step, and dimension levels. Conclusion: Explicitly modeling human team collaboration improves performance on multi-step contextualized tasks, and CGPST provides a reliable evaluation standard for such tasks. Abstract: Recently, multi-Large Language Model (LLM) frameworks have been proposed to solve contextualized tasks. However, these frameworks do not explicitly emulate human team role division, which may lead to a single perspective, thereby weakening performance on multi-step contextualized tasks. To address this issue, we propose TeamLLM, a human-like Team-Oriented Multi-LLM Collaboration Framework. TeamLLM adopts four team roles with distinct division and employs a three-phase multi-LLM collaboration for multi-step contextualized tasks. To evaluate the effectiveness of TeamLLM on multi-step contextualized tasks, we propose Contextually-Grounded and Procedurally-Structured tasks (CGPST) and construct the CGPST benchmark. This benchmark has four core features: contextual grounding, procedural structure, process-oriented evaluation and multi-dimensional assessment. We evaluate ten popular LLMs on CGPST at overall-level, step-level, and dimension-level. Results show that TeamLLM substantially improves performance on CGPST. We release the benchmark with scenarios, full-process responses and human scores from ten LLMs. The code and data are available at https://anonymous.4open.science/r/TeamLLM-anonymous-C50E/.

Zhiyu Cao,Peifeng Li,Qiaoming Zhu

Main category: cs.CL

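TL;DR: This paper proposes Multi-Faceted Self-Consistent Preference Aligned conversational query rewriting (MSPA-CQR), which integrates feedback from rewriting, retrieval, and response to improve rewrite quality and generalization.

Details Motivation: Early conversational query rewriting (CQR) methods rewrite in isolation and ignore the feedback loop among rewriting, passage retrieval, and response generation, which limits rewrite quality. Method: Constructs self-consistent preference alignment data along three dimensions (rewriting, retrieval, and response) and proposes prefix-guided multi-faceted direct preference optimization to learn preference information from all three jointly. Result: Experiments show that MSPA-CQR is effective in both in-distribution and out-of-distribution scenarios, improving the accuracy and robustness of query rewriting. Conclusion: Multi-faceted feedback and preference alignment significantly improve the performance and generalization of CQR models and suggest a new direction for end-to-end conversational search systems. Abstract: Conversational Query Rewriting (CQR) aims to rewrite ambiguous queries to achieve more efficient conversational search. Early studies have predominantly focused on rewriting in isolation, ignoring the feedback from query rewrite, passage retrieval and response generation in the rewriting process. To address this issue, we propose Multi-Faceted Self-Consistent Preference Aligned CQR (MSPA-CQR). Specifically, we first construct self-consistent preference alignment data from three dimensions (rewriting, retrieval, and response) to generate more diverse rewritten queries. Then we propose prefix-guided multi-faceted direct preference optimization to learn preference information from three different dimensions. The experimental results show that our MSPA-CQR is effective in both in- and out-of-distribution scenarios.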

[64] Discourse Coherence and Response-Guided Context Rewriting for Multi-Party Dialogue Generation

Zhiyu Cao,Peifeng Li,Qiaoming Zhu

Main category: cs.CL

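TL;DR: This paper proposes DRCR, a framework that improves multi-party dialogue generation through dialogue context rewriting, using discourse coherence and response quality as feedback signals and a dynamic self-evolution learning method to jointly improve the rewriter and the response generator.

Details Motivation: Existing methods rely on dialogue structure information for generation, but colloquial expressions and incomplete utterances are common in multi-party dialogues, distorting structural representations and hindering comprehension. Method: Proposes DRCR, which builds preference data from discourse coherence and response quality, and a dynamic self-evolution learning method in which the rewriter and the response generator improve each other in an iterative training loop. Result: Experiments on four multi-party dialogue datasets confirm DRCR's effectiveness in generation quality. Conclusion: Dialogue context rewriting combined with two feedback signals and a self-evolution mechanism significantly improves the accuracy and coherence of multi-party dialogue generation. Abstract: Previous research on multi-party dialogue generation has predominantly leveraged structural information inherent in dialogues to directly inform the generation process. However, the prevalence of colloquial expressions and incomplete utterances in dialogues often impedes comprehension and weakens the fidelity of dialogue structure representations, which is particularly pronounced in multi-party dialogues. In this work, we propose a novel framework, DRCR (Discourse coherence and Response-guided Context Rewriting), to improve multi-party dialogue generation through dialogue context rewriting. Specifically, DRCR employs two complementary feedback signals, discourse coherence and response quality, to construct preference data for both context rewriting and response generation. Moreover, we propose a dynamic self-evolution learning method that allows the rewriter and responder to continuously enhance their capabilities through mutual interaction in an iterative training loop. Comprehensive experiments conducted on four multi-party dialogue datasets substantiate the effectiveness of DRCR.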

[65] When Is Thinking Enough? Early Exit via Sufficiency Assessment for Efficient Reasoning

Yang Xiang,Yixin Ji,Ruotao Xu,Dan Qiao,Zheming Yang,Juntao Li,Min Zhang

Main category: cs.CL

TL;DR: This paper proposes DTSR, a framework that lets large reasoning models (LRMs) exit early by dynamically assessing whether the chain-of-thought (CoT) is sufficient, substantially reducing redundant reasoning and improving efficiency.

Details Motivation: Large reasoning models (LRMs) overthink, which causes computational redundancy and lower efficiency; existing early-exit methods depend on unreliable handcrafted or empirical indicators. Method: Proposes Dynamic Thought Sufficiency in Reasoning (DTSR), which has two stages: (1) Reflection Signal Monitoring, which identifies potential cues for early exit; (2) Thought Sufficiency Check, which judges whether the current CoT is enough to derive the answer. Result: Experiments on Qwen3 models show that DTSR reduces reasoning length by 28.9%–34.9% with minimal performance loss; the paper also discusses overconfidence in LRMs and self-evaluation paradigms. Conclusion: DTSR is an effective dynamic early-exit method for mitigating overthinking in LRMs and offers practical insight for efficient reasoning. Abstract: Large reasoning models (LRMs) have achieved remarkable performance in complex reasoning tasks, driven by their powerful inference-time scaling capability. However, LRMs often suffer from overthinking, which results in substantial computational redundancy and significantly reduces efficiency. Early-exit methods aim to mitigate this issue by terminating reasoning once sufficient evidence has been generated, yet existing approaches mostly rely on handcrafted or empirical indicators that are unreliable and impractical. In this work, we introduce Dynamic Thought Sufficiency in Reasoning (DTSR), a novel framework for efficient reasoning that enables the model to dynamically assess the sufficiency of its chain-of-thought (CoT) and determine the optimal point for early exit. Inspired by human metacognition, DTSR operates in two stages: (1) Reflection Signal Monitoring, which identifies reflection signals as potential cues for early exit, and (2) Thought Sufficiency Check, which evaluates whether the current CoT is sufficient to derive the final answer. Experimental results on the Qwen3 models show that DTSR reduces reasoning length by 28.9%–34.9% with minimal performance loss, effectively mitigating overthinking. We further discuss overconfidence in LRMs and self-evaluation paradigms, providing valuable insights for early-exit reasoning.
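
The two-stage control flow can be sketched as a loop over reasoning steps: a cheap check for reflection cues runs on every step, and the more expensive sufficiency check runs only when a cue fires. The cue list, the `is_sufficient` callback, and the step-level granularity are hypothetical choices for this example; the paper's actual signals and checker are not specified in the abstract.

```python
from typing import Callable, List

# Hypothetical reflection cues; the real signal set is not given in the abstract.
REFLECTION_CUES = ("wait", "hmm", "let me double-check")


def reason_with_early_exit(
    steps: List[str],
    is_sufficient: Callable[[List[str]], bool],
) -> List[str]:
    """Keep reasoning steps until a reflection cue coincides with a sufficient CoT."""
    kept: List[str] = []
    for step in steps:
        # Stage 1: monitor for a reflection signal in the incoming step.
        has_cue = any(cue in step.lower() for cue in REFLECTION_CUES)
        # Stage 2: only then ask whether the reasoning so far already suffices.
        if has_cue and is_sufficient(kept):
            break
        kept.append(step)
    return kept
```

In practice `is_sufficient` might be a self-evaluation prompt to the same model, which is why gating it behind the cheaper first stage matters for cost.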

[66] GCoT-Decoding: Unlocking Deep Reasoning Paths for Universal Question Answering

Guanran Luo,Wentao Qiu,Zhongquan Jian,Meihong Wang,Qingqiang Wu

Main category: cs.CL

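TL;DR: This paper proposes GCoT-decoding, a general chain-of-thought decoding strategy that generates candidate paths with two-stage branching, computes confidence over separate spans, and aggregates semantically similar paths into a consensus answer, performing well on both fixed and free question answering.

Details Motivation: Existing CoT-decoding applies only to QA tasks with fixed answer sets and does not generalize to free-form QA. Method: Proposes GCoT-decoding: a two-stage branching method combining Fibonacci sampling and heuristic error backtracking generates candidate paths; each path is split into a reasoning span and an answer span to compute path confidence accurately; semantically similar paths are then aggregated to obtain a consensus answer, replacing traditional majority voting. Result: Experiments on six datasets covering fixed and free QA show that the method maintains strong performance on fixed QA and improves significantly on free QA, supporting its generality. Conclusion: GCoT-decoding is a more general and robust chain-of-thought decoding strategy that extends prompt-free CoT reasoning to a wider range of tasks. Abstract: Chain-of-Thought reasoning can enhance large language models, but it requires manually designed prompts to guide the model. Recently proposed CoT-decoding enables the model to generate CoT-style reasoning paths without prompts, but it is only applicable to problems with fixed answer sets. To address this limitation, we propose a general decoding strategy, GCoT-decoding, that extends applicability to a broader range of question-answering tasks. GCoT-decoding employs a two-stage branching method combining Fibonacci sampling and heuristic error backtracking to generate candidate decoding paths. It then splits each path into a reasoning span and an answer span to accurately compute path confidence, and finally aggregates semantically similar paths to identify a consensus answer, replacing traditional majority voting. We conduct extensive experiments on six datasets covering both fixed and free QA tasks. Our method not only maintains strong performance on fixed QA but also achieves significant improvements on free QA, demonstrating its generality.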
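
The final aggregation step can be sketched as confidence-weighted grouping of answers. String normalization stands in here for the paper's semantic-similarity matching, which is an assumption of this example, as are the function name and the toy confidence values.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def consensus_answer(
    paths: List[Tuple[str, float]],
    normalize: Callable[[str], str] = lambda s: s.strip().lower().rstrip("."),
) -> str:
    """Pick the answer group with the highest total path confidence.

    `paths` holds (answer text, path confidence) pairs. Answers that normalize
    to the same string are treated as equivalent (a crude stand-in for
    semantic similarity).
    """
    totals: Dict[str, float] = defaultdict(float)
    for answer, confidence in paths:
        totals[normalize(answer)] += confidence
    return max(totals, key=totals.get)


# Two weaker paths agreeing on one answer outweigh a single stronger path.
best = consensus_answer([("Paris", 0.4), (" paris.", 0.3), ("Lyon", 0.6)])
```

Unlike exact-match majority voting, grouping before scoring lets differently worded free-form answers support each other.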

[67] Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions

Parth Patil,Dhruv Kumar,Yash Sinha,Murari Mandal

Main category: cs.CL

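TL;DR: This paper proposes a nine-dimension algebraic complexity framework that controls each complexity factor independently (for example nesting depth and number of intermediate results) and uses automatic problem generation and verification to systematically probe the algebraic reasoning bottlenecks of large language models. Experiments find that working memory is the dominant bottleneck across scales: every model collapses at 20-30 parallel branches, pointing to a hard architectural limit. The paper also distills five minimal yet sufficient diagnostic dimensions.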

Details Motivation: Existing algebraic reasoning benchmarks cannot attribute a model failure to a specific cause (such as overly deep nesting, uncommon operators, or an overly long dependency chain), and they lack independent control and systematic analysis of each complexity factor. Method: Builds a nine-dimension algebraic complexity framework in which each dimension corresponds to a documented LLM failure mode and the dimensions are mutually orthogonal; designs a parametric pipeline for automatic problem generation and verification with no human annotation; and runs controlled experiments on seven instruction-tuned models from 8B to 235B parameters. Result: Working memory is the dominant scale-invariant bottleneck: performance of every model drops sharply at 20-30 parallel intermediate results. Five dimensions are identified as a minimal set that still covers all documented algebraic failure modes. Conclusion: The algebraic reasoning bottleneck stems from a working-memory limit inherent to the model architecture rather than from insufficient parameter count; the nine-dimension framework (reducible to five) provides a standardized tool for quantitative, interpretable evaluation and tracking of algebraic ability. Abstract: Algebraic reasoning remains one of the most informative stress tests for large language models, yet current benchmarks provide no mechanism for attributing failure to a specific cause. When a model fails an algebraic problem, a single accuracy score cannot reveal whether the expression was too deeply nested, the operator too uncommon, the intermediate state count too high, or the dependency chain too long. Prior work has studied individual failure modes in isolation, but no framework has varied each complexity factor independently under strict experimental control. No prior system has offered automatic generation and verification of problems of increasing complexity to track model progress over time. We introduce a nine-dimension algebraic complexity framework in which each factor is varied independently while all others are held fixed, with problem generation and verification handled by a parametric pipeline requiring no human annotation. Each dimension is grounded in a documented LLM failure mode and captures a structurally distinct aspect of algebraic difficulty, including expression nesting depth, simultaneous intermediate result count, sub-expression complexity, operator hardness, and dependent reasoning chain length. We evaluate seven instruction-tuned models spanning 8B to 235B parameters across all nine dimensions and find that working memory is the dominant scale-invariant bottleneck. Every model collapses between 20 and 30 parallel branches regardless of parameter count, pointing to a hard architectural constraint rather than a solvable capacity limitation. 
Our analysis further identifies a minimal yet diagnostically sufficient subset of five dimensions that together span the full space of documented algebraic failure modes, providing a complete complexity profile of a model's algebraic reasoning capacity.
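
To make the idea of a parametric pipeline concrete, here is a minimal sketch that varies a single dimension (expression nesting depth) while the generator tracks the exact answer, so that every problem is verifiable without human annotation. It is not the authors' pipeline: the operator set, operand range, and the names `make_nested` and `nesting_depth` are assumptions made for illustration.

```python
import random


def make_nested(depth, rng):
    """Return (expression, value) with exactly `depth` levels of parentheses.

    The value is computed alongside the string, so the ground truth is known
    by construction rather than obtained from a solver or an annotator.
    """
    value = rng.randint(1, 9)
    expr = str(value)
    for _ in range(depth):
        op = rng.choice("+-*")
        k = rng.randint(1, 9)
        expr = f"({expr} {op} {k})"
        value = {"+": value + k, "-": value - k, "*": value * k}[op]
    return expr, value


def nesting_depth(expr):
    """Maximum parenthesis depth of an expression string."""
    depth = best = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            best = max(best, depth)
        elif ch == ")":
            depth -= 1
    return best


# One problem per depth level, all from the same seed for reproducibility.
problems = [make_nested(d, random.Random(0)) for d in range(6)]
```

Each generated expression can be checked independently (here with Python's own evaluator on strings the generator itself produced), which mirrors the generate-then-verify loop the abstract describes for one dimension at a time.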

[68] Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning

Jia-Chen Zhang,Zheng Zhou,Yu-Jie Xiong

Main category: cs.CL

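TL;DR: This paper proposes CLoT, a new chain-of-thought framework based on a reversible hierarchical Markov chain. Through hierarchical problem decomposition, layer-wise backward verification, and a pruning strategy, it reduces error propagation and computational redundancy in long chains of thought and achieves notable gains on several mathematical reasoning benchmarks.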

Details Motivation: Long chain-of-thought (Long CoT) incurs high computational cost because of its sequence length, and KV-cache compression methods based on Markov-like structures have two shortcomings: memorylessness and a lack of backward reasoning. Method: Proposes the Cognitive Loop of Thought (CLoT) framework, which decomposes a problem hierarchically into sub-problems with dependencies, introduces at each layer a backward verification mechanism inspired by human cognition, and applies a pruning strategy that removes redundant lower-level sub-problems once higher-level sub-problems are verified. A companion backward reasoning dataset, CLoT-Instruct, is also built. Result: The method is validated on four mathematical reasoning benchmarks; on the AddSub dataset, GPT-4o-mini with CLoT reaches 99.0% accuracy, 4.1% and 2.9% higher than traditional CoT and CoT-SC, respectively. Conclusion: Through hierarchical structure, backward verification, and dynamic pruning, CLoT balances reasoning robustness with computational efficiency and addresses context loss and error accumulation in long chains of thought. Abstract: Multi-step Chain-of-Thought (CoT) has significantly advanced the mathematical reasoning capabilities of LLMs by leveraging explicit reasoning steps. However, the widespread adoption of Long CoT often results in sequence lengths that exceed manageable computational limits. While existing approaches attempt to alleviate this by reducing KV Cache redundancy via Markov chain-like structures, they introduce two critical limitations: inherent memorylessness (loss of context) and limited backward reasoning capability. To address these limitations, we propose a novel Chain-of-Thought framework based on Reversible Hierarchical Markov Chain, termed Cognitive Loop of Thought (CLoT), and a backward reasoning dataset CLoT-Instruct. In CLoT, problems are decomposed into sub-problems with hierarchical dependencies. Inspired by human cognitive processes, we introduce a backward verification mechanism at each hierarchical layer. Furthermore, we implement a pruning strategy: once higher-level sub-problems are verified, redundant lower-level sub-problems are pruned to maximize efficiency. This approach effectively mitigates error propagation and enhances reasoning robustness. Experiments on four mathematical benchmarks demonstrate the effectiveness of our method. Notably, on the AddSub dataset using GPT-4o-mini, CLoT achieves 99.0% accuracy, outperforming traditional CoT and CoT-SC by 4.1% and 2.9%, respectively.

[69] AGSC: Adaptive Granularity and Semantic Clustering for Uncertainty Quantification in Long-text Generation

Guanran Luo,Wentao Qiu,Wanru Zhao,Wenhan Lv,Zhongquan Jian,Meihong Wang,Qingqiang Wu

Main category: cs.CL

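TL;DR: This paper proposes the AGSC framework, which uses NLI neutral probabilities to distinguish irrelevance from uncertainty and GMM soft clustering to model semantic themes, yielding efficient and accurate uncertainty quantification for long-text generation.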

Details Motivation: Existing uncertainty quantification methods struggle to aggregate reliably across heterogeneous themes, overlook neutral information, and incur high computational cost from fine-grained decomposition. Method: AGSC first uses NLI neutral probabilities to identify irrelevant content and reduce computation, then applies Gaussian Mixture Model (GMM) soft clustering over semantics and assigns topic-aware weights for downstream aggregation. Result: On the BIO and LongFact datasets, AGSC achieves state-of-the-art correlation with factuality and reduces inference time by about 60% compared with full atomic decomposition. Conclusion: Through adaptive granularity and semantic clustering, AGSC improves both the accuracy and the efficiency of uncertainty quantification in long-text generation. Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in long-form generation, yet their application is hindered by the hallucination problem. While Uncertainty Quantification (UQ) is essential for assessing reliability, the complex structure makes reliable aggregation across heterogeneous themes difficult. In addition, existing methods often overlook the nuance of neutral information and suffer from the high computational cost of fine-grained decomposition. To address these challenges, we propose AGSC (Adaptive Granularity and GMM-based Semantic Clustering), a UQ framework tailored for long-form generation. AGSC first uses NLI neutral probabilities as triggers to distinguish irrelevance from uncertainty, reducing unnecessary computation. It then applies Gaussian Mixture Model (GMM) soft clustering to model latent semantic themes and assign topic-aware weights for downstream aggregation. Experiments on BIO and LongFact show that AGSC achieves state-of-the-art correlation with factuality while reducing inference time by about 60% compared to full atomic decomposition.

[70] SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization

Usman Naseem,Robert Geislinger,Juan Ren,Sarah Kohail,Rudy Garrido Veliz,P Sam Sahil,Yiran Zhang,Marco Antonio Stranisci,Idris Abdulmumin,Özge Alaçam,Cengiz Acartürk,Aisha Jabr,Saba Anwar,Abinew Ali Ayele,Elena Tutubalina,Aung Kyaw Htet,Xintong Wang,Surendrabikram Thapa,Tanmoy Chakraborty,Dheeraj Kodati,Sahar Moradizeyveh,Firoj Alam,Ye Kyaw Thu,Shantipriya Parida,Ihsan Ayyub Qazi,Lilian Wanzare,Nelson Odhiambo Onyango,Clemencia Siro,Ibrahim Said Ahmad,Adem Chanie Ali,Martin Semmann,Chris Biemann,Shamsuddeen Hassan Muhammad,Seid Muhie Yimam

Main category: cs.CL

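TL;DR: SemEval-2026 Task 9 is a shared task on online polarization detection covering 22 languages, with over 110K annotated instances and three multi-label prediction sub-tasks (presence, type, and manifestation of polarization). It attracted more than 1,000 participants worldwide and final submissions from 67 teams; the dataset is publicly available.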

Details Motivation: To address automatic identification of political and social polarization in online text in multilingual settings, and to advance the development and evaluation of cross-lingual polarization analysis methods. Method: Organizes a multilingual shared task, builds a large-scale multi-label annotated dataset, defines three levels of polarization detection sub-tasks, and collects and evaluates participating systems on the Codabench platform. Result: Received final submissions from 67 teams and 73 system description papers; reports baseline results and analyzes the best-performing and most common approaches across sub-tasks and languages. Conclusion: The task establishes a large-scale, multilingual, fine-grained benchmark for online polarization detection and provides a public dataset and evaluation setup for follow-up research. Abstract: We present SemEval-2026 Task 9, a shared task on online polarization detection, covering 22 languages and comprising over 110K annotated instances. Each data instance is multi-labeled with the presence of polarization, polarization type, and polarization manifestation. Participants were asked to predict labels in three sub-tasks: (1) detecting the presence of polarization, (2) identifying the type of polarization, and (3) recognizing the polarization manifestation. The three tasks attracted over 1,000 participants worldwide and more than 10k submissions on Codabench. We received final submissions from 67 teams and 73 system description papers. We report the baseline results and analyze the performance of the best-performing systems, highlighting the most common approaches and the most effective methods across different subtasks and languages. The dataset of this task is publicly available.

[71] Environmental, Social and Governance Sentiment Analysis on Slovene News: A Novel Dataset and Models

Paula Dodig,Boshko Koloski,Katarina Sitar Šuštar,Senja Pollak,Matthew Purver

Main category: cs.CL

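TL;DR: This paper builds the first publicly available Slovene ESG sentiment dataset and compares several models (SloBERTa, XLM-R, TabPFN, and LLMs) on sentiment classification for the three ESG aspects. LLMs perform best on the Environmental and Social aspects, fine-tuned SloBERTa performs best on Governance, and a case study illustrates analysis over a long time frame.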

Details Motivation: Reliable ESG ratings remain limited for smaller companies and emerging markets, and ESG analysis resources are lacking for lower-resource languages such as Slovene. Method: Builds an ESG sentiment dataset from the MaCoCu Slovene news collection by combining LLM-assisted filtering with human annotation; compares a monolingual model (SloBERTa), a multilingual model (XLM-R), embedding-based classifiers (TabPFN), hierarchical ensemble architectures, and several large language models on sentiment classification for the three ESG aspects. Result: Gemma3-27B reaches an F1-macro of 0.61 on Environmental, gpt-oss 20B reaches 0.45 on Social, and fine-tuned SloBERTa reaches 0.54 on Governance; gpt-oss is then used to analyze long-term ESG trends for selected companies. Conclusion: The work fills a gap in Slovene ESG analysis resources, shows that LLMs are useful for ESG sentiment detection in a lower-resource language, and offers a scalable route for ESG research on smaller companies and emerging markets. Abstract: Environmental, Social, and Governance (ESG) considerations are increasingly integral to assessing corporate performance, reputation, and long-term sustainability. Yet, reliable ESG ratings remain limited for smaller companies and emerging markets. We introduce the first publicly available Slovene ESG sentiment dataset and a suite of models for automatic ESG sentiment detection. The dataset, derived from the MaCoCu Slovene news collection, combines large language model (LLM)-assisted filtering with human annotation of company-related ESG content. We evaluate the performance of monolingual (SloBERTa) and multilingual (XLM-R) models, embedding-based classifiers (TabPFN), hierarchical ensemble architectures, and large language models. Results show that LLMs achieve the strongest performance on Environmental (Gemma3-27B, F1-macro: 0.61) and Social aspects (gpt-oss 20B, F1-macro: 0.45), while fine-tuned SloBERTa is the best model on Governance classification (F1-macro: 0.54). We then show in a small case study how the best-performing classifier (gpt-oss) can be applied to investigate ESG aspects for selected companies across a long time frame.

[72] WRAP++: Web discoveRy Amplified Pretraining

Jiang Zhou,Yunhao Wang,Xing Wu,Tinghao Yu,Feng Zhang

Main category: cs.CL

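TL;DR: WRAP++ is a new method for synthesizing pretraining data. It mines cross-document relationships from web hyperlinks (such as dual-links and co-mentions) and generates QA pairs that require cross-document reasoning, strengthening how large language models learn associated factual knowledge and substantially improving model performance.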

Details Motivation: Existing synthetic data rephrasing methods are confined to single documents and cannot capture cross-document factual relationships, leaving knowledge with limited associative context. Method: Proposes WRAP++, which uses web hyperlinks to discover high-confidence cross-document relational motifs (such as dual-links and co-mentions) and synthesizes joint QA examples for each related document pair that require the model to reason across both documents. Result: On Wikipedia, about 8.4B tokens of raw text are amplified into 80B tokens of cross-document QA data; on SimpleQA, OLMo-based models at 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches and show sustained scaling gains. Conclusion: Cross-document knowledge discovery and amplification improve LLM pretraining, and WRAP++ offers a way to build language models with stronger associative knowledge and reasoning. Abstract: Synthetic data rephrasing has emerged as a powerful technique for enhancing knowledge acquisition during large language model (LLM) pretraining. However, existing approaches operate at the single-document level, rewriting individual web pages in isolation. This confines synthesized examples to intra-document knowledge, missing cross-document relationships and leaving facts with limited associative context. We propose WRAP++ (Web discoveRy Amplified Pretraining), which amplifies the associative context of factual knowledge by discovering cross-document relationships from web hyperlinks and synthesizing joint QA over each discovered document pair. Concretely, WRAP++ discovers high-confidence relational motifs including dual-links and co-mentions, and synthesizes QA that requires reasoning across both documents. This produces relational knowledge absent from either source document alone, creating diverse entry points to the same facts. Because the number of valid entity pairs grows combinatorially, this discovery-driven synthesis also amplifies data scale far beyond single-document rewriting. Instantiating WRAP++ on Wikipedia, we amplify ~8.4B tokens of raw text into 80B tokens of cross-document QA data. On SimpleQA, OLMo-based models at both 7B and 32B scales trained with WRAP++ substantially outperform single-document approaches and exhibit sustained scaling gains, underscoring the advantage of cross-document knowledge discovery and amplification.
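
The abstract names two relational motifs, dual-links and co-mentions, without defining them. The sketch below shows one plausible reading on a tiny hyperlink graph: a dual-link is taken to be a pair of pages that link to each other, and a co-mention is a pair of pages that are both linked from the same third page. Both definitions, the `links` data layout, and the function names are assumptions for illustration, not the paper's specification.

```python
from itertools import combinations


def dual_links(links):
    """Pairs (a, b), a < b, where page a links to b and page b links back to a."""
    pairs = set()
    for a, outs in links.items():
        for b in outs:
            if a < b and a in links.get(b, set()):
                pairs.add((a, b))
    return pairs


def co_mentions(links):
    """Pairs (a, b), a < b, that are both linked from at least one common page."""
    pairs = set()
    for src, outs in links.items():
        # Ignore self-links so a page is never paired through itself.
        for a, b in combinations(sorted(outs - {src}), 2):
            pairs.add((a, b))
    return pairs


# Hypothetical graph: page name -> set of pages it links to.
links = {
    "A": {"B", "C"},
    "B": {"A"},
    "C": set(),
    "D": {"B", "C"},
}
dual = dual_links(links)
co = co_mentions(links)
```

Every pair produced this way would then be a candidate for joint QA synthesis. Because co-mentions pair up all targets of each source page, the number of candidate pairs grows much faster than the number of pages, which matches the combinatorial amplification the abstract points to.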

[73] Fast-dVLM: Efficient Block-Diffusion VLM via Direct Conversion from Autoregressive VLM

Chengyue Wu,Shiyi Lan,Yonggan Fu,Sensen Gao,Jin Wang,Jincheng Yu,Jose M. Alvarez,Pavlo Molchanov,Ping Luo,Song Han,Ligeng Zhu,Enze Xie

Main category: cs.CL

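TL;DR: This paper proposes Fast-dVLM, a vision-language model based on block-wise discrete diffusion, to address the low throughput and underused hardware parallelism of conventional autoregressive (AR) decoding on edge devices. By directly converting an AR VLM and introducing several multimodal adaptations, it achieves over 6x end-to-end inference speedup while preserving generation quality.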

Details Motivation: Existing vision-language models (VLMs) rely on autoregressive decoding, which in physical AI scenarios such as robotics and autonomous driving (especially batch-size-one edge deployment) is memory-bandwidth-bound and cannot exploit hardware parallelism. Method: Proposes Fast-dVLM, which uses block diffusion to enable KV-cache-compatible parallel decoding and speculative block decoding; compares a two-stage AR-to-diffusion strategy (text-only diffusion fine-tuning followed by multimodal training) with direct conversion and adopts the more efficient direct conversion; and introduces multimodal diffusion adaptations, including block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation. Result: Across 11 multimodal benchmarks, Fast-dVLM matches its AR counterpart in generation quality; with SGLang integration and FP8 quantization, end-to-end inference is more than 6x faster. Conclusion: Block diffusion can replace autoregressive decoding in VLMs and greatly improve edge inference efficiency without sacrificing quality; direct conversion of an AR VLM is the better route, and the proposed adaptations form a practical recipe for diffusion-based VLMs. Abstract: Vision-language models (VLMs) predominantly rely on autoregressive decoding, which generates tokens one at a time and fundamentally limits inference throughput. This limitation is especially acute in physical AI scenarios such as robotics and autonomous driving, where VLMs are deployed on edge devices at batch size one, making AR decoding memory-bandwidth-bound and leaving hardware parallelism underutilized. While block-wise discrete diffusion has shown promise for parallel text generation, extending it to VLMs remains challenging due to the need to jointly handle continuous visual representations and discrete text tokens while preserving pretrained multimodal capabilities. We present Fast-dVLM, a block-diffusion-based VLM that enables KV-cache-compatible parallel decoding and speculative block decoding for inference acceleration. We systematically compare two AR-to-diffusion conversion strategies: a two-stage approach that first adapts the LLM backbone with text-only diffusion fine-tuning before multimodal training, and a direct approach that converts the full AR VLM in one stage. Under comparable training budgets, direct conversion proves substantially more efficient by leveraging the already multimodally aligned VLM; we therefore adopt it as our recommended recipe. We introduce a suite of multimodal diffusion adaptations (block size annealing, causal context attention, auto-truncation masking, and vision efficient concatenation) that collectively enable effective block diffusion in the VLM setting. 
Extensive experiments across 11 multimodal benchmarks show Fast-dVLM matches its autoregressive counterpart in generation quality. With SGLang integration and FP8 quantization, Fast-dVLM achieves over 6x end-to-end inference speedup over the AR baseline.

[74] On the Step Length Confounding in LLM Reasoning Data Selection

Bing Wang,Rui Miao,Chen Shen,Shaotian Yan,Kaiyuan Liu,Ximing Li,Xiaosong Yuan,Sinan Fan,Jun Zhang,Jieping Ye

Main category: cs.CL

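TL;DR: This paper finds that naturalness-based selection of LLM reasoning data suffers from step length confounding: it prefers samples with longer reasoning steps rather than higher-quality ones. It proposes two debiasing methods, ASLEC-DROP and ASLEC-CASL, and shows experimentally that they mitigate the problem.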

Details Motivation: Naturalness-based data selection for LLM reasoning data has a systematic bias: it prefers samples with longer steps over genuinely higher-quality samples ("step length confounding"), which needs to be corrected. Method: Proposes two debiasing methods: ASLEC-DROP, which drops the first-token probability of each step when computing the average log probability, and ASLEC-CASL, which applies a causal debiasing regression to remove the confounding effect of first tokens. Result: Experiments across four LLMs and five evaluation benchmarks show that the proposed methods mitigate step length confounding and improve the quality of selected reasoning data. Conclusion: Naturalness-based selection is not reliable by default, and the statistical bias introduced by generation structure must be taken into account; the proposed debiasing strategies give a more robust basis for constructing high-quality reasoning data. Abstract: Large reasoning models have recently demonstrated strong performance on complex tasks that require long chain-of-thought reasoning, through supervised fine-tuning on large-scale and high-quality datasets. To construct such datasets, existing pipelines generate long reasoning data from more capable Large Language Models (LLMs) and apply manual heuristic or naturalness-based selection methods to filter high-quality samples. Despite the proven effectiveness of naturalness-based data selection, which ranks data by the average log probability assigned by LLMs, our analysis shows that, when applied to LLM reasoning datasets, it systematically prefers samples with longer reasoning steps (i.e., more tokens per step) rather than higher-quality ones, a phenomenon we term step length confounding. Through quantitative analysis, we attribute this phenomenon to low-probability first tokens in reasoning steps; longer steps dilute their influence, thereby inflating the average log probabilities. To address this issue, we propose two variant methods: ASLEC-DROP, which drops first-token probabilities when computing average log probability, and ASLEC-CASL, which applies a causal debiasing regression to remove the first tokens' confounding effect. Experiments across four LLMs and five evaluation benchmarks demonstrate the effectiveness of our approach in mitigating the step length confounding problem.
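
The dilution effect and the ASLEC-DROP correction can be illustrated with a few lines of arithmetic. The sketch below assumes each sample is already split into reasoning steps with per-token log probabilities available; the function name `avg_logprob`, the `drop_first` flag, and the toy numbers are illustrative choices, not the authors' code or data.

```python
def avg_logprob(steps, drop_first=False):
    """Average token log probability over a sample.

    steps: list of reasoning steps, each a list of token log probabilities.
    drop_first: if True, skip the first token of every step, which is the
    idea the abstract attributes to ASLEC-DROP.
    """
    kept = []
    for step in steps:
        kept.extend(step[1:] if drop_first else step)
    if not kept:
        raise ValueError("no tokens left to score")
    return sum(kept) / len(kept)


# Two toy samples with identical per-token quality after the first token.
# Only the step length differs (2 tokens per step vs. 4 tokens per step).
short_steps = [[-3.0, -0.5], [-3.0, -0.5]]
long_steps = [[-3.0, -0.5, -0.5, -0.5], [-3.0, -0.5, -0.5, -0.5]]

plain_short = avg_logprob(short_steps)               # -1.75
plain_long = avg_logprob(long_steps)                 # -1.125
dropped_short = avg_logprob(short_steps, drop_first=True)  # -0.5
dropped_long = avg_logprob(long_steps, drop_first=True)    # -0.5
```

With the plain average, the long-step sample looks more "natural" (-1.125 versus -1.75) only because its low-probability first tokens are diluted by more follow-up tokens. Once first tokens are dropped, both samples score -0.5 and the length advantage disappears.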

[75] HingeMem: Boundary Guided Long-Term Memory with Query Adaptive Retrieval for Scalable Dialogues

Yijie Zhong,Yunfan Gao,Haofen Wang

Main category: cs.CL

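TL;DR: This paper proposes HingeMem, a boundary-guided long-term memory method based on event segmentation theory. It builds an interpretable index from hyperedges triggered by boundaries in four elements (person, time, location, topic) and adds a query-adaptive retrieval mechanism. On LOCOMO it improves over baselines by about 20% in relative terms and reduces question-answering token cost by 68%.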

Details Motivation: Existing long-term memory methods for dialogue systems rely on continuous summarization or OpenIE-based graph construction paired with fixed Top-k retrieval, which adapts poorly across query categories and is computationally expensive. Method: Proposes HingeMem, which (1) follows event segmentation theory and uses changes in person, time, location, or topic as boundaries that trigger hyperedges, building an interpretable memory index, and (2) adds a query-adaptive retrieval mechanism that jointly decides what to retrieve (routing over elements) and how much to retrieve (depth control based on the estimated query type). Result: On LOCOMO, across LLM scales (from 0.6B to Qwen-Flash), HingeMem achieves about 20% relative improvement over strong baselines and reduces question-answering token cost by 68%. Conclusion: Through boundary guidance and query-adaptive retrieval, HingeMem improves the interpretability, adaptability, and efficiency of long-term memory and suits web applications that need long-running, trustworthy interaction. Abstract: Long-term memory is critical for dialogue systems that support continuous, sustainable, and personalized interactions. However, existing methods rely on continuous summarization or OpenIE-based graph construction paired with fixed Top-k retrieval, leading to limited adaptability across query categories and high computational overhead. In this paper, we propose HingeMem, a boundary-guided long-term memory that operationalizes event segmentation theory to build an interpretable indexing interface via boundary-triggered hyperedges over four elements: person, time, location, and topic. When any such element changes, HingeMem draws a boundary and writes the current segment, thereby reducing redundant operations and preserving salient context. To enable robust and efficient retrieval under diverse information needs, HingeMem introduces query-adaptive retrieval mechanisms that jointly decide (a) what to retrieve: determine the query-conditioned routing over the element-indexed memory; (b) how much to retrieve: control the retrieval depth based on the estimated query type. Extensive experiments across LLM scales (from 0.6B to production-tier models; e.g., Qwen3-0.6B to Qwen-Flash) on LOCOMO show that HingeMem achieves approximately 20% relative improvement over strong baselines without query category specification, while reducing computational cost (68% lower question-answering token cost compared to HippoRAG2). 
Beyond advancing memory modeling, HingeMem's adaptive retrieval makes it a strong fit for web applications requiring efficient and trustworthy memory over extended interactions.
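
The boundary rule in the abstract (write the current segment whenever person, time, location, or topic changes) can be sketched as below. The sketch assumes each dialogue turn already carries the four element values; how HingeMem extracts them, and how it forms hyperedges over the resulting segments, is not covered here. `Segment`, `segment_dialogue`, and the sample turns are hypothetical names and data.

```python
from dataclasses import dataclass, field

ELEMENTS = ("person", "time", "location", "topic")


@dataclass
class Segment:
    key: tuple                      # (person, time, location, topic) of the segment
    turns: list = field(default_factory=list)


def segment_dialogue(turns):
    """Group consecutive turns and start a new segment when any element changes.

    turns: iterable of dicts with the four element fields plus a "text" field.
    """
    segments = []
    for turn in turns:
        key = tuple(turn[e] for e in ELEMENTS)
        if not segments or segments[-1].key != key:
            segments.append(Segment(key))   # boundary: open a new segment
        segments[-1].turns.append(turn["text"])
    return segments


turns = [
    {"person": "Ana", "time": "Mon", "location": "home", "topic": "trip", "text": "t1"},
    {"person": "Ana", "time": "Mon", "location": "home", "topic": "trip", "text": "t2"},
    {"person": "Ana", "time": "Mon", "location": "home", "topic": "work", "text": "t3"},
    {"person": "Ana", "time": "Mon", "location": "office", "topic": "work", "text": "t4"},
]
segments = segment_dialogue(turns)
```

Because each segment is keyed by its four element values, a retrieval step can route a query to the matching elements instead of scanning every stored turn, which is the kind of element-indexed access the abstract describes.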

[76] MedDialBench: Benchmarking LLM Diagnostic Robustness under Parametric Adversarial Patient Behaviors

Xiaotian Luo,Xun Jiang,Jiangcheng Wu

Main category: cs.CL

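TL;DR: This paper introduces the MedDialBench benchmark, which models non-cooperative patient behavior along multiple dimensions with graded, controllable severity. It shows that LLM diagnostic robustness is far more fragile under information pollution (fabricated symptoms) than under information deficit (withheld symptoms), that fabrication drives clear super-additive interaction effects, and that exhaustive questioning mitigates deficit but cannot correct pollution.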

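Details Motivation: Existing medical dialogue benchmarks cannot capture the diversity and graded severity of non-cooperative patient behavior, nor do they analyze interactions between behavior dimensions, so LLM diagnostic robustness is insufficiently evaluated. Method: Builds MedDialBench, which decomposes non-cooperative patient behavior into five dimensions (Logic Consistency, Health Cognition, Expression Style, Disclosure, and Attitude), each with graded severity levels and case-specific scripts; uses a controlled factorial design to evaluate five frontier LLMs across 7,225 dialogues; and assesses interaction effects with dose-response analysis, McNemar tests, and O/E ratios. Result: Information pollution (fabricating symptoms) produces accuracy drops 1.7-3.4x larger than information deficit (withholding information) and is the only configuration that reaches statistical significance across all five models; every dimension pair involving fabrication is super-additive (O/E = 0.70-0.81), while the other pairs are purely additive (O/E ~ 1.0); exhaustive questioning mitigates withholding but cannot correct fabrication; worst-case drops range from 38.8 to 54.1 percentage points across models. Conclusion: Fabricated symptoms are the core risk to LLM diagnostic robustness, with a dominant and super-additive effect, which suggests that future work should prioritize modeling and defending against this behavior rather than treating all non-cooperative behaviors alike. Abstract: Interactive medical dialogue benchmarks have shown that LLM diagnostic accuracy degrades significantly when interacting with non-cooperative patients, yet existing approaches either apply adversarial behaviors without graded severity or case-specific grounding, or reduce patient non-cooperation to a single ungraded axis, and none analyze cross-dimension interactions. We introduce MedDialBench, a benchmark enabling controlled, dose-response characterization of how individual patient behavior dimensions affect LLM diagnostic robustness. It decomposes patient behavior into five dimensions -- Logic Consistency, Health Cognition, Expression Style, Disclosure, and Attitude -- each with graded severity levels and case-specific behavioral scripts. This controlled factorial design enables graded sensitivity analysis, dose-response profiling, and cross-dimension interaction detection. Evaluating five frontier LLMs across 7,225 dialogues (85 cases x 17 configurations x 5 models), we find a fundamental asymmetry: information pollution (fabricating symptoms) produces 1.7-3.4x larger accuracy drops than information deficit (withholding information), and fabricating is the only configuration achieving statistical significance across all five models (McNemar p < 0.05). 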
Among six dimension combinations, fabricating is the sole driver of super-additive interaction: all three fabricating-involving pairs produce O/E ratios of 0.70-0.81 (35-44% of eligible cases fail under the combination despite succeeding under each dimension alone), while all non-fabricating pairs show purely additive effects (O/E ~ 1.0). Inquiry strategy moderates deficit but not pollution: exhaustive questioning recovers withheld information, but cannot compensate for fabricated inputs. Models exhibit distinct vulnerability profiles, with worst-case drops ranging from 38.8 to 54.1 percentage points.
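
The abstract reports significance with McNemar's test, which compares paired outcomes (the same case diagnosed under a baseline and under a perturbed configuration) using only the discordant pairs. The sketch below implements the standard exact (binomial) variant; whether the authors used the exact or the chi-square form is not stated, and the counts in the example are made up for illustration.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: cases correct under the baseline but wrong under the perturbation.
    c: cases wrong under the baseline but correct under the perturbation.
    Under the null hypothesis each discordant pair is equally likely to fall
    in either group, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: 10 cases broke under the perturbation, none improved.
p_lopsided = mcnemar_exact(10, 0)
# Hypothetical counts: changes in both directions cancel out.
p_balanced = mcnemar_exact(5, 5)
```

Concordant cases (correct in both settings or wrong in both) do not enter the statistic at all, which is why the test suits per-case comparisons of one model across two patient-behavior configurations.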

[77] To Adapt or not to Adapt, Rethinking the Value of Medical Knowledge-Aware Large Language Models

Ane G. Domingo-Aldama,Iker De La Iglesia,Maitane Urruela,Aitziber Atutxa,Ander Barrena

Main category: cs.CL

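TL;DR: This paper systematically evaluates clinical large language models (LLMs) on multiple-choice tasks in English and Spanish. Clinical models do not consistently outperform general-purpose models in English, while the authors' lightweight Marmoka models perform better on the Spanish tasks. The study argues that current short-form medical benchmarks may not reflect real medical ability and that models broadly struggle with instruction following and output formatting.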

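Details Motivation: Recent studies show that domain-adapted LLMs do not consistently outperform general-purpose models on standard medical benchmarks, which calls the need for clinical specialization into question and motivates a systematic evaluation of their real advantages and limits. Method: Builds a perturbation-based evaluation benchmark covering robustness, instruction following, and adversarial sensitivity, including one-step and two-step question transformations, multi-prompt testing, and instruction-guided assessment; compares several state-of-the-art clinical models with general-purpose models (focusing on the Llama 3.1 family); and introduces Marmoka, a family of lightweight 8B-parameter clinical models for English and Spanish developed via continual domain-adaptive pretraining on medical corpora and instructions. Result: Clinical LLMs do not consistently surpass general-purpose models on English clinical tasks, even under the perturbation benchmark; Marmoka obtains better results than Llama on the Spanish subsets; all models show clear weaknesses in instruction following and strict output formatting. Conclusion: Current short-form MCQA medical benchmarks may not adequately measure real clinical ability; clinical LLMs bring only marginal and unstable gains in English; evaluation paradigms need improvement, with attention to instruction following and format control; and robust clinical LLMs for lower-resource languages such as Spanish are feasible. Abstract: BACKGROUND: Recent studies have shown that domain-adapted large language models (LLMs) do not consistently outperform general-purpose counterparts on standard medical benchmarks, raising questions about the need for specialized clinical adaptation. METHODS: We systematically compare general and clinical LLMs on a diverse set of multiple-choice clinical question answering tasks in English and Spanish. We introduce a perturbation-based evaluation benchmark that probes model robustness, instruction following, and sensitivity to adversarial variations. Our evaluation includes one-step and two-step question transformations, multi-prompt testing, and instruction-guided assessment. We analyze a range of state-of-the-art clinical models and their general-purpose counterparts, focusing on Llama 3.1-based models. Additionally, we introduce Marmoka, a family of lightweight 8B-parameter clinical LLMs for English and Spanish, developed via continual domain-adaptive pretraining on medical corpora and instructions. RESULTS: The experiments show that clinical LLMs do not consistently outperform their general-purpose counterparts on English clinical tasks, even under the proposed perturbation-based benchmark. However, for the Spanish subsets the proposed Marmoka models obtain better results compared to Llama. 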
CONCLUSIONS: Our results show that, under current short-form MCQA benchmarks, clinical LLMs offer only marginal and unstable improvements over general-purpose models in English, suggesting that existing evaluation frameworks may be insufficient to capture genuine medical expertise. We further find that both general and clinical models exhibit substantial limitations in instruction following and strict output formatting. Finally, we demonstrate that robust medical LLMs can be successfully developed for low-resource languages such as Spanish, as evidenced by the Marmoka models.

[78] Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models

Bajian Xiang,Tingwei Guo,Xuan Chen,Yang Han

Main category: cs.CL

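TL;DR: This paper proposes Affinity Pooling, a training-free, similarity-driven token merging mechanism that compresses speech representations without sacrificing semantic information and substantially lowers the inference cost of large speech language models (LSLMs).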

Details Motivation: The high token rate of LSLMs leads to long sequences and high inference cost, while deep-layer representations are heavily redundant, which calls for an efficient compression method. Method: Uses layer-wise oracle interventions to uncover a structured redundancy hierarchy, then proposes Affinity Pooling, a training-free, similarity-based token merging mechanism applied at both the input and deep layers. Result: Across three tasks, prefilling FLOPs drop by 27.48% while accuracy stays competitive; practical deployment yields up to about 1.7x memory savings and about 1.1x faster time-to-first-token on long utterances. Conclusion: Speech token representations do not need to be fully distinct, and Affinity Pooling offers a new direction for improving LSLM efficiency. Abstract: Large Speech Language Models (LSLMs) typically operate at high token rates (tokens/s) to ensure acoustic fidelity, yet this results in sequence lengths that far exceed the underlying semantic content, incurring prohibitive inference costs. In this paper, we empirically revisit the necessity of such granular token-level processing. Through layer-wise oracle interventions, we unveil a structured redundancy hierarchy: while shallow layers encode essential acoustic details, deep layers exhibit extreme redundancy, allowing for aggressive compression. Motivated by these findings, we introduce Affinity Pooling, a training-free, similarity-based token merging mechanism. By strategically applying this method at both input and deep layers, we effectively compress speech representations without compromising semantic information. Extensive evaluations across three tasks demonstrate that our approach reduces prefilling FLOPs by 27.48% while maintaining competitive accuracy. Practical deployment further confirms significant efficiency gains, yielding up to ~1.7x memory savings and ~1.1x faster time-to-first-token on long utterances. Our results challenge the necessity of fully distinct token representations, providing new perspectives on LSLM efficiency.
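
A training-free, similarity-based token merge can be sketched as follows. The abstract does not spell out the actual merging rule of Affinity Pooling, so this example makes its own assumptions: only adjacent tokens are merged, similarity is cosine similarity against the running mean of the current group, and merged groups are replaced by their mean vector. The threshold and function names are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def merge_similar(tokens, threshold=0.9):
    """Replace each run of adjacent, mutually similar vectors by its mean.

    A token joins the current run if its cosine similarity to the run's mean
    is at least `threshold`; otherwise the run is closed and a new one starts.
    """
    merged, run = [], []
    for vec in tokens:
        if run and cosine(mean_vector(run), vec) >= threshold:
            run.append(vec)
            continue
        if run:
            merged.append(mean_vector(run))
        run = [vec]
    if run:
        merged.append(mean_vector(run))
    return merged


# Five toy "speech tokens": two near-identical frames, then three more.
tokens = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]]
compressed = merge_similar(tokens)
```

Here five input vectors collapse to two, one per group of redundant frames. Since no parameters are learned, the same function could in principle be applied at any layer, which is what makes this family of methods training-free.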

[79] iTAG: Inverse Design for Natural Text Generation with Accurate Causal Graph Annotations

Wenshuo Wang,Boyu Cao,Nan Zhuang,Wei Li

Main category: cs.CL

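TL;DR: This paper proposes iTAG, which assigns real-world concepts to graph nodes before LLM text generation and refines them iteratively with chain-of-thought reasoning, achieving both high causal graph annotation accuracy and natural text. The generated data can serve as a scalable benchmark substitute for text-based causal discovery algorithms.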

Details Motivation: High-quality causally annotated text is scarce, which limits research on causal discovery from text; template-based methods trade naturalness for accuracy, while direct LLM generation cannot guarantee annotation accuracy. Method: iTAG treats causal-graph-to-text generation as an inverse problem: before LLM generation it assigns real-world concepts to the graph nodes, and it uses Chain-of-Thought (CoT) reasoning to iteratively examine and refine the concept selection so that the relations induced between concepts match the target causal graph as closely as possible. Result: iTAG shows very high annotation accuracy and text naturalness; results of evaluating text-based causal discovery algorithms on its generated data correlate strongly with results on real-world data. Conclusion: iTAG-generated data can act as an effective, scalable surrogate for real causal data in large-scale benchmarking of text-based causal discovery algorithms. Abstract: A fundamental obstacle to causal discovery from text is the lack of causally annotated text data for use as ground truth, due to high annotation costs. This motivates an important task of generating text with causal graph annotations. Early template-based generation methods sacrifice text naturalness in exchange for high causal graph annotation accuracy. Recent Large Language Model (LLM)-dependent methods directly generate natural text from target graphs through LLMs, but do not guarantee causal graph annotation accuracy. Therefore, we propose iTAG, which performs real-world concept assignment to nodes before converting causal graphs into text in existing LLM-dependent methods. iTAG frames this process as an inverse problem with the causal graph as the target, iteratively examining and refining concept selection through Chain-of-Thought (CoT) reasoning so that the induced relations between concepts are as consistent as possible with the target causal relationships described by the causal graph. iTAG demonstrates both extremely high annotation accuracy and naturalness across extensive tests, and the results of testing text-based causal discovery algorithms with the generated data show high statistical correlation with real-world data. This suggests that iTAG-generated data can serve as a practical surrogate for scalable benchmarking of text-based causal discovery algorithms.

[80] Is Biomedical Specialization Still Worth It? Insights from Domain-Adaptive Language Modelling with a New French Health Corpus

Aidan Mannion,Cécile Macaire,Armand Violle,Stéphane Ohayon,Xavier Tannier,Didier Schwab,Lorraine Goeuriot,François Portet

Main category: cs.CL

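TL;DR: This paper studies domain-adaptive pre-training (DAPT) for small to mid-sized LLMs in the French biomedical domain. It finds the efficacy of DAPT doubtful but still viable in resource-constrained settings, and shows that model merging can mitigate the loss of general capability and even improve performance on some specialized tasks.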

Details Motivation: Adapting large language models to specialized non-English domains, French biomedicine in particular, remains challenging and calls for efficient, practical domain adaptation methods. Method: Builds an open-licensed French biomedical corpus, runs continued pre-training (DAPT) experiments with causal language modeling, and conducts extensive comparative evaluations; introduces model merging to balance domain performance and general capability. Result: DAPT is ineffective in most cases but viable in smaller-scale, resource-constrained settings; model merging mitigates degradation of general capability and in some cases improves biomedical task performance. Conclusion: DAPT is not a universally effective solution, but combined with model merging and the right conditions it remains useful for lower-resource specialized domains such as French biomedicine. Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, yet their adaptation to specialized fields remains challenging, particularly for non-English languages. This study investigates domain-adaptive pre-training (DAPT) as a strategy for specializing small to mid-sized LLMs in the French biomedical domain through continued pre-training. We address two key research questions: the viability of specialized continued pre-training for domain adaptation and the relationship between domain-specific performance gains and general capability degradation. Our contributions include the release of a fully open-licensed French biomedical corpus suitable for commercial and open-source applications, the training and release of specialized French biomedical LLMs, and novel insights for DAPT implementation. Our methodology encompasses the collection and refinement of high-quality French biomedical texts, the exploration of causal language modeling approaches using DAPT, and extensive comparative evaluations. Our results cast doubt on the efficacy of DAPT, in contrast to previous works, but we highlight its viability in smaller-scale, resource-constrained scenarios under the right conditions. Findings in this paper further suggest that model merging post-DAPT is essential to mitigate generalization trade-offs, and in some cases even improves performance on specialized tasks at which the DAPT was directed.

[81] The AI Skills Shift: Mapping Skill Obsolescence, Emergence, and Transition Pathways in the LLM Era

Rudra Jadhav,Janhavi Danve

Main category: cs.CL

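TL;DR: This paper proposes the Skill Automation Feasibility Index (SAFI). It evaluates four frontier LLMs on 263 text-based tasks covering the 35 O*NET skills and combines the results with real-world AI adoption data to build an AI Impact Matrix that identifies automation risk and augmentation potential per skill. Mathematics and Programming score highest for automation feasibility, Active Listening and Reading Comprehension lowest, and most observed AI use is augmentation rather than replacement.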

Details Motivation: As large language models reshape the global labor market, empirical data is needed on which occupational skills are most susceptible to automation, to support policymaking and worker reskilling. Method: Builds the Skill Automation Feasibility Index (SAFI) by benchmarking four frontier LLMs (LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash) on 263 text-based tasks covering all 35 O*NET skills; cross-references real-world AI adoption data from the Anthropic Economic Index (756 occupations, 17,998 tasks); and proposes a four-quadrant AI Impact Matrix as an interpretive framework. Result: (1) Mathematics (73.2) and Programming (71.8) have the highest automation feasibility, while Active Listening (42.2) and Reading Comprehension (45.5) have the lowest; (2) there is a "capability-demand inversion": the skills most demanded in AI-exposed jobs are those LLMs perform worst at in the benchmark; (3) 78.7% of observed AI interactions are augmentation rather than automation; (4) the four models agree closely (3.6-point spread), suggesting that text-based automation feasibility depends more on the skill than on the model. Conclusion: SAFI offers a fine-grained empirical benchmark of LLM automation potential at the skill level; the AI Impact Matrix suggests that most occupational change points toward human-AI augmentation rather than simple replacement; all data and code are open-sourced. Abstract: As Large Language Models reshape the global labor market, policymakers and workers need empirical data on which occupational skills may be most susceptible to automation. We present the Skill Automation Feasibility Index (SAFI), benchmarking four frontier LLMs -- LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash -- across 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor's O*NET taxonomy (1,052 total model calls, 0% failure rate). Cross-referencing with real-world AI adoption data from the Anthropic Economic Index (756 occupations, 17,998 tasks), we propose an AI Impact Matrix -- an interpretive framework that positions skills along four quadrants: High Displacement Risk, Upskilling Required, AI-Augmented, and Lower Displacement Risk. Key findings: (1) Mathematics (SAFI: 73.2) and Programming (71.8) receive the highest automation feasibility scores; Active Listening (42.2) and Reading Comprehension (45.5) receive the lowest; (2) a "capability-demand inversion" where skills most demanded in AI-exposed jobs are those LLMs perform least well at in our benchmark; (3) 78.7% of observed AI interactions are augmentation, not automation; (4) all four models converge to similar skill profiles (3.6-point spread), suggesting that text-based automation feasibility may be more skill-dependent than model-dependent. 
SAFI measures LLM performance on text-based representations of skills, not full occupational execution. All data, code, and model responses are open-sourced.

[82] Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

José Pombal,Ricardo Rei,André F. T. Martins

Main category: cs.CL

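TL;DR: This paper presents the first study of self-preference bias (SPB) of LLM-as-a-judge in rubric-based evaluation. The bias persists even under entirely objective rubrics and affects model rankings and development. Experiments on IFEval and HealthBench identify factors that drive SPB (negative rubrics, extreme rubric lengths, subjective topics) and show that ensembling multiple judges mitigates SPB without eliminating it.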

Details Motivation: LLM-as-a-judge has become the dominant evaluation paradigm, but its inherent self-preference bias (SPB) distorts results and is especially harmful in settings such as recursive self-improvement; rubric-based evaluation is increasingly popular, yet SPB in that setting has not been studied systematically. Method: Quantifies SPB on two benchmarks, IFEval (objective, programmatically verifiable rubrics) and HealthBench (subjective medical rubrics); analyzes how rubric properties (positive vs. negative, length, topic) affect SPB; evaluates how well multi-judge ensembling mitigates it. Result: (1) SPB is present in rubric-based evaluation even when rubrics are entirely objective: among rubrics the generator fails, judges can be up to 50% more likely to incorrectly mark them as satisfied when the output is their own; (2) multi-judge ensembling mitigates but does not eliminate SPB; (3) on HealthBench, SPB skews model scores by up to 10 points; (4) negative rubrics, extreme rubric lengths, and subjective topics (such as emergency referrals) are particularly susceptible. Conclusion: SPB is a systematic bias in rubric-based evaluation that should be modeled and mitigated in evaluation design; otherwise it can mislead model iteration and the ranking of frontier models. Abstract: LLM-as-a-judge has become the de facto approach for evaluating LLM outputs. However, judges are known to exhibit self-preference bias (SPB): they tend to favor outputs produced by themselves or by models from their own family. This skews evaluations and, thus, hinders model development, especially in settings of recursive self-improvement. We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings. Using IFEval, a benchmark with programmatically verifiable rubrics, we show that SPB persists even when evaluation criteria are entirely objective: among rubrics where generators fail, judges can be up to 50% more likely to incorrectly mark them as satisfied when the output is their own. We also find that, similarly to other evaluation paradigms, ensembling multiple judges helps mitigate SPB, but without fully eliminating it. On HealthBench, a medical chat benchmark with subjective rubrics, we observe that SPB skews model scores by up to 10 points, a potentially decisive margin when ranking frontier models. We analyze the factors that drive SPB in this setting, finding that negative rubrics, extreme rubric lengths, and subjective topics like emergency referrals are particularly susceptible.

[83] ChunQiuTR: Time-Keyed Temporal Retrieval in Classical Chinese Annals

Yihao Wang,Zijian He,Jie Ren,Keze Wang

Main category: cs.CL

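TL;DR: This paper proposes ChunQiuTR, a time-keyed retrieval benchmark, and CTD, a time-aware dual-encoder, to address temporally exact retrieval in Classical Chinese annals, and stresses the importance of retrieval-time temporal consistency for historical RAG.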

Details Motivation: Historical research often needs the exact original record for a specific regnal month, but in Classical Chinese annals time is expressed tersely, implicitly, and in a non-Gregorian form, so retrieval can easily return results that are semantically relevant yet temporally wrong. Method: Builds ChunQiuTR, a time-keyed retrieval benchmark based on the Spring and Autumn Annals and its exegetical tradition, organizing records by month-level reign keys and adding chrono-near confounders; proposes CTD (Calendrical Temporal Dual-encoder), which combines Fourier-based absolute calendrical encoding with relative offset biasing. Result: Under time-keyed evaluation, CTD shows consistent gains over strong semantic dual-encoder baselines, supporting the role of retrieval-time temporal consistency in the faithfulness of historical RAG. Conclusion: Time-aware retrieval is a prerequisite for accurately grounded knowledge in historical RAG; ChunQiuTR and CTD provide a new benchmark and method for time-sensitive retrieval with language models. Abstract: Retrieval shapes how language models access and ground knowledge in retrieval-augmented generation (RAG). In historical research, the target is often not an arbitrary relevant passage, but the exact record for a specific regnal month, where temporal consistency matters as much as topical relevance. This is especially challenging for Classical Chinese annals, where time is expressed through terse, implicit, non-Gregorian reign phrases that must be interpreted from surrounding context, so semantically plausible evidence can still be temporally invalid. We introduce ChunQiuTR, a time-keyed retrieval benchmark built from the Spring and Autumn Annals and its exegetical tradition. ChunQiuTR organizes records by month-level reign keys and includes chrono-near confounders that mirror realistic retrieval failures. We further propose CTD (Calendrical Temporal Dual-encoder), a time-aware dual-encoder that combines Fourier-based absolute calendrical context with relative offset biasing. Experiments show consistent gains over strong semantic dual-encoder baselines under time-keyed evaluation, supporting retrieval-time temporal consistency as a key prerequisite for faithful downstream historical RAG. Our code and datasets are available at https://github.com/xbdxwyh/ChunQiuTR.
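The abstract names two ingredients for CTD, Fourier-based absolute calendrical context and relative offset biasing, without giving formulas. The following is only a minimal illustrative sketch of what such components could look like. The function names, the choice of periods, the month-index representation, and the linear penalty are all assumptions made here and are not taken from the paper.

```python
import math

def fourier_time_features(month_index, periods=(12.0, 120.0, 1200.0)):
    """Encode an absolute month index as sin/cos pairs at several periods.

    `periods` (in months) is a hypothetical choice for illustration only.
    """
    feats = []
    for p in periods:
        angle = 2.0 * math.pi * month_index / p
        feats.extend([math.sin(angle), math.cos(angle)])
    return feats

def offset_bias(query_month, record_month, scale=1.0):
    """Hypothetical relative-offset bias: penalize records far from the query month."""
    return -scale * abs(query_month - record_month)

def time_aware_score(semantic_score, query_month, record_month, scale=1.0):
    """Add the offset bias to a semantic similarity score (assumed additive)."""
    return semantic_score + offset_bias(query_month, record_month, scale)
```

In this sketch a chrono-near confounder with high semantic similarity still loses to the exact-month record once the offset penalty is added, which is the kind of behavior the benchmark is designed to test.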

[84] Continuous Interpretive Steering for Scalar Diversity

Ye-eun Cho

Main category: cs.CL

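TL;DR: This paper proposes Continuous Interpretive Steering (CIS) and the GraSD dataset to evaluate how sensitive LLMs are to graded pragmatic inference (such as differences in scalar implicature strength), and finds that graded activation steering recovers the scalar diversity already encoded in the models' representations.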

Details Motivation: Existing evaluations of pragmatic inference in LLMs mostly rely on prompt manipulations, which cannot capture the intrinsically graded nature of such inference (for example, stronger or weaker scalar implicatures); a finer-grained probing method is needed. Method: Proposes Continuous Interpretive Steering (CIS), which treats activation-level steering strength as a continuous variable; builds a new dataset, GraSD, that encodes graded scalar diversity; compares uniform and graded activation steering on four LLMs. Result: Uniform activation steering increases pragmatic interpretations overall but flattens the differences between scalar items, whereas graded activation steering produces differentiated interpretive shifts aligned with scalar diversity grades, indicating that this graded sensitivity is encoded in the representation space. Conclusion: Together, CIS and GraSD provide a framework for evaluating graded pragmatic sensitivity in LLMs and show that fine-grained pragmatic knowledge in the representation space can be recovered through systematic intervention. Abstract: Pragmatic inference is inherently graded. Different lexical items give rise to pragmatic enrichment to different degrees. Scalar implicature exemplifies this property through scalar diversity, where implicature strength varies across scalar items. However, evaluations of pragmatic inference in large language models (LLMs) often rely on prompt-based manipulations. Beyond prompt-level effects, this study introduces Continuous Interpretive Steering (CIS), a method that probes graded pragmatic interpretation by treating activation-level steering strength as a continuous experimental variable. To support this analysis, this study introduces a new dataset, GraSD, which encodes graded scalar diversity. Experiments on four LLMs show that uniform activation steering increases pragmatic interpretations globally but collapses item-level variation, whereas graded activation steering yields differentiated interpretive shifts aligned with scalar diversity grades. It indicates that graded sensitivity is encoded in the representation space and can be systematically recovered through controlled intervention. Together, CIS and GraSD provide a principled framework for evaluating graded pragmatic sensitivity in LLMs.

[85] DTCRS: Dynamic Tree Construction for Recursive Summarization

Guanran Luo,Zhongquan Jian,Wentao Qiu,Meihong Wang,Qingqiang Wu

Main category: cs.CL

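TL;DR: This paper proposes DTCRS, which dynamically builds summary trees based on document structure and query semantics, reducing redundant summary nodes and improving both relevance and efficiency in question answering.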

Details Motivation: Existing recursive summarization methods produce redundant summary nodes, are slow to build, and do not suit every question type, which limits RAG systems on question answering that involves multi-step reasoning. Method: DTCRS decides from the question type whether a summary tree is needed at all, then uses the embeddings of sub-questions as initial cluster centers for clustering document chunks, yielding structure-aware, query-driven summary tree construction. Result: Significantly reduces summary tree construction time and achieves substantial improvements on three QA tasks; also shows how the usefulness of recursive summarization varies across question types. Conclusion: By building summary trees dynamically from structure and query semantics, DTCRS improves the accuracy and efficiency of RAG systems on complex QA tasks and offers insights on question-type suitability for future work. Abstract: Retrieval-Augmented Generation (RAG) mitigates the hallucination problem of Large Language Models (LLMs) by incorporating external knowledge. Recursive summarization constructs a hierarchical summary tree by clustering text chunks, integrating information from multiple parts of a document to provide evidence for abstractive questions involving multi-step reasoning. However, summary trees often contain a large number of redundant summary nodes, which not only increase construction time but may also negatively impact question answering. Moreover, recursive summarization is not suitable for all types of questions. We introduce DTCRS, a method that dynamically generates summary trees based on document structure and query semantics. DTCRS determines whether a summary tree is necessary by analyzing the question type. It then decomposes the question and uses the embeddings of sub-questions as initial cluster centers, reducing redundant summaries while improving the relevance between summaries and the question. Our approach significantly reduces summary tree construction time and achieves substantial improvements across three QA tasks. Additionally, we investigate the applicability of recursive summarization to different question types, providing valuable insights for future research.
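The idea of using sub-question embeddings as initial cluster centers can be sketched with a plain k-means loop whose centers are seeded rather than chosen at random. This is a minimal sketch under assumptions: the abstract does not say which clustering algorithm DTCRS uses, so k-means, the squared Euclidean distance, and the name `seeded_kmeans` are illustrative choices, and embeddings are represented as plain lists of floats.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def seeded_kmeans(chunks, seeds, iters=10):
    """Cluster chunk embeddings, starting from sub-question embeddings as centers.

    chunks: list of chunk embedding vectors
    seeds:  list of sub-question embedding vectors (one cluster per sub-question)
    Returns (assignments, centers).
    """
    centers = [list(s) for s in seeds]
    assign = []
    for _ in range(iters):
        # Assign every chunk to its nearest current center.
        assign = [
            min(range(len(centers)), key=lambda k: sq_dist(c, centers[k]))
            for c in chunks
        ]
        # Move each center to the mean of its members; keep empty clusters in place.
        for k in range(len(centers)):
            members = [c for c, a in zip(chunks, assign) if a == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers
```

Seeding this way fixes the number of clusters to the number of sub-questions and ties each cluster to one sub-question, which is one plausible reading of how the method reduces summaries unrelated to the query.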

[86] Corpora deduplication or duplication in Natural Language Processing of few resourced languages ? A case of study: The Mexico's Nahuatl

Juan-José Guzman-Landa,Juan-Manuel Torres-Moreno,Graham Ranger,Miguel Figueroa-Saavedra,Martha-Lorena Avendaño-Garrido,Elvys Linhares-Pontes,Luis-Gil Moreno-Jiménez

Main category: cs.CL

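TL;DR: This paper asks whether data duplication (incremental corpus expansion) can improve NLP performance for languages with limited computational resources (π-languages). Taking Nawatl as a case study, it applies incremental duplication to the small π-yalli corpus, trains static word embeddings, and evaluates them on a sentence-level semantic similarity task. The method yields a moderate improvement and, to the authors' knowledge, has not been used before.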

Details Motivation: For low-resource languages that lack large training corpora (such as Nawatl), explore a low-cost, practical data augmentation strategy to ease corpus scarcity. Method: Expands the small Nawatl corpus π-yalli with the incremental duplication technique, trains static word embeddings on the expanded corpus, and evaluates them on a sentence-level semantic similarity task. Result: Compared with the unexpanded corpus, incremental duplication brings a moderate improvement in performance; to the authors' knowledge, the technique has not yet been reported in the NLP literature. Conclusion: For low-resource NLP, controlled data duplication (incremental duplication in particular) is a promising corpus expansion strategy that can improve static embedding quality and merits further study. Abstract: In this article, we seek to answer the following question: could data duplication be useful in Natural Language Processing (NLP) for languages with limited computational resources? In this type of languages (or π-languages), corpora available for training Large Language Models are virtually non-existent. In particular, we will study the impact of corpora expansion in Nawatl, an agglutinative and polysynthetic π-language spoken by over 2 million people, with a large number of dialectal varieties. The aim is to expand the new π-yalli corpus, which contains a limited number of Nawatl texts, by duplicating it in a controlled way. In our experiments, we will use the incremental duplication technique. The aim is to learn embeddings that are well-suited to NLP tasks. Thus, static embeddings were trained and evaluated in a sentence-level semantic similarity task. Our results show a moderate improvement in performance when using incremental duplication compared to the results obtained using only the corpus without expansion. Furthermore, to our knowledge, this technique has not yet been used in the literature.

[87] MARS: Enabling Autoregressive Models Multi-Token Generation

Ziqi Jin,Lei Wang,Ziwei Luo,Aixin Sun

Main category: cs.CL

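TL;DR: MARS is a lightweight fine-tuning method that lets an autoregressive language model predict multiple tokens in a single forward pass, with no architectural changes or extra parameters, preserving the original performance while increasing generation throughput and inference speed.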

Details Motivation: Autoregressive language models generate text one token at a time, which is inefficient even when consecutive tokens are highly predictable; existing acceleration methods (such as speculative decoding and multi-head prediction) need an extra model or structural changes, which complicates deployment. Method: Proposes MARS (Mask AutoRegreSsion), which only continues fine-tuning on existing instruction data and uses a masking scheme to teach the model multi-token prediction; also designs a block-level KV caching strategy and supports real-time speed adjustment through confidence thresholding. Result: At one token per step, MARS matches or exceeds the baseline; when multiple tokens per step are allowed, it maintains accuracy with 1.5-1.7x throughput; block-level KV caching gives up to 1.71x wall-clock speedup; the speed-quality tradeoff can be adjusted at runtime. Conclusion: MARS is an efficient, drop-in acceleration approach for AR models that balances accuracy, throughput, and deployment flexibility. Abstract: Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while achieving 1.5-1.7x throughput. We further develop a block-level KV caching strategy for batch inference, achieving up to 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.
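Confidence thresholding as a speed-quality knob can be illustrated with a small acceptance rule. The abstract does not spell out the exact rule MARS uses, so the following is a hedged sketch: it assumes the model proposes a block of tokens with one confidence value each, always keeps the first token, and accepts the following tokens until one falls below the threshold. The function name `accept_prefix` is hypothetical.

```python
def accept_prefix(tokens, confidences, threshold):
    """Accept a prefix of the proposed tokens for one decoding step.

    The first token is always accepted so decoding makes progress
    (equivalent to ordinary one-token-per-step generation). Later
    tokens are accepted only while their confidence stays at or
    above `threshold`; the rest are discarded and re-predicted.
    """
    accepted = [tokens[0]]
    for tok, conf in zip(tokens[1:], confidences[1:]):
        if conf < threshold:
            break
        accepted.append(tok)
    return accepted
```

Under this rule, lowering the threshold accepts more tokens per forward pass (higher throughput, more risk), while a threshold above any attainable confidence degenerates to standard autoregressive decoding.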

[88] Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models

Md Motaleb Hossen Manik,Ge Wang

Main category: cs.CL

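TL;DR: This paper presents a controlled empirical benchmark of seven reasoning-oriented instruction-tuned language models spanning dense and MoE designs, measuring accuracy, latency, VRAM usage, and compute across four reasoning tasks and three prompting strategies. Sparse activation alone does not guarantee the best practical accuracy-efficiency tradeoff; results depend on the combination of architecture, prompting, and task.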

Details Motivation: Examine whether MoE language models actually offer a better quality-efficiency tradeoff than dense models under realistic inference constraints (memory, latency, compute), rather than relying on the theoretical advantage of sparse activation. Method: Evaluates seven recent reasoning-oriented models from the Gemma, Phi, and Qwen3 families on four benchmarks (ARC-Challenge, GSM8K, Math Level 1-3, TruthfulQA MC1) under three prompting strategies (zero-shot, chain-of-thought, few-shot chain-of-thought); covers 8,400 model-dataset-prompt evaluations, recording accuracy, latency, peak VRAM, and a FLOPs-per-token proxy, and releases a reproducible benchmark pipeline with statistical analyses. Result: Gemma-4-E4B with few-shot chain-of-thought achieves the best weighted accuracy (0.675) at 14.9 GB VRAM; Gemma-4-26B-A4B is close in accuracy (0.663) but needs 48.1 GB; performance varies by task (Gemma is strongest on ARC and Math, Phi on TruthfulQA), and GSM8K is the most prompt-sensitive (for example, Phi-4-reasoning drops to 0.11 under few-shot CoT). Conclusion: Sparse activation does not automatically bring better practical deployment value; the real accuracy-efficiency tradeoff is determined jointly by architecture, prompting protocol, and task composition, and calls for end-to-end evaluation under real resource constraints. Abstract: Mixture-of-experts (MoE) language models are often expected to offer better quality-efficiency tradeoffs than dense models because only a subset of parameters is activated per token, but the practical value of that advantage depends on end-to-end behavior under realistic inference constraints. We present a controlled empirical benchmark of seven recent reasoning-oriented instruction-tuned models spanning dense and MoE designs, namely Gemma-4-E2B, Gemma-4-E4B, Gemma-4-26B-A4B, Phi-4-mini-reasoning, Phi-4-reasoning, Qwen3-8B, and Qwen3-30B-A3B, evaluated on four benchmarks -- ARC-Challenge, GSM8K, Math Level 1-3, and TruthfulQA MC1 -- under three prompting strategies: zero-shot, chain-of-thought, and few-shot chain-of-thought. The study covers 8,400 total model-dataset-prompt evaluations and records accuracy, latency, peak GPU memory usage (VRAM), and an approximate floating-point operations (FLOPs)-per-token proxy. Across the weighted multi-task summary, Gemma-4-E4B with few-shot chain-of-thought achieved the best overall result, reaching weighted accuracy 0.675 with mean VRAM 14.9 GB, while Gemma-4-26B-A4B was close in accuracy at 0.663 but substantially more memory intensive at 48.1 GB.
At the task level, Gemma models dominated ARC and Math, Phi models were strongest on TruthfulQA, and GSM8K showed the largest prompt sensitivity, including a sharp drop for Phi-4-reasoning from 0.67 under chain-of-thought to 0.11 under few-shot chain-of-thought. These results show that sparse activation alone does not guarantee the best practical operating point: observed accuracy-efficiency tradeoffs depend jointly on architecture, prompting protocol, and task composition. We release a reproducible benchmark pipeline, aggregated results, and paired statistical analyses to support deployment-oriented evaluation of reasoning LLMs under real resource constraints.

[89] ReDAct: Uncertainty-Aware Deferral for LLM Agents

Dzianis Piatrashyn,Nikita Kotelevskii,Kirill Grishchenkov,Nikita Glazkov,Ivan Nasonov,Ilya Makarov,Timothy Baldwin,Preslav Nakov,Roman Vashurin,Maxim Panov

Main category: cs.CL

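TL;DR: This paper proposes ReDAct, a framework that switches from a small model to a large one whenever the small model's predictive uncertainty is high, balancing inference cost against decision reliability, and validates it in text-based embodied environments.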

Details Motivation: LLM agents in sequential decision-making are prone to errors caused by hallucination, and a single mistake can irreversibly degrade the whole trajectory; larger models hallucinate less but cost significantly more at inference, so reliability has to be traded off against cost. Method: Proposes ReDAct (Reason-Defer-Act): the agent is equipped with a small and a large LLM; the small model runs by default, and when its predictive uncertainty exceeds a calibrated threshold the decision is deferred to the large model. Result: Experiments in text-based embodied environments such as ALFWorld and MiniGrid show that deferring only about 15% of decisions to the large model can match the quality of using it exclusively while significantly reducing inference cost. Conclusion: ReDAct mitigates the harm of LLM hallucinations in sequential decision-making, keeping quality high while improving inference efficiency, and offers a direction for low-cost, reliable agent design. Abstract: Recently, LLM-based agents have become increasingly popular across many applications, including complex sequential decision-making problems. However, they inherit the tendency of LLMs to hallucinate, leading to incorrect decisions. In sequential settings, even a single mistake can irreversibly degrade the trajectory, making hallucinations an even bigger problem. Although larger LLMs hallucinate less, they incur a significantly higher per-token cost. In this paper, we address this tradeoff by proposing ReDAct (Reason-Defer-Act). In ReDAct, an agent is equipped with two LLMs: a small, cheap model used by default, and a large, more reliable but expensive model. When the predictive uncertainty of the small model exceeds a calibrated threshold, the decision is deferred to the large model. We evaluate our approach in text-based embodied environments such as ALFWorld and MiniGrid and show that deferring only about 15% of decisions to the large model can match the quality of using it exclusively, while significantly reducing inference costs.
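The deferral rule described above can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: the abstract does not say which uncertainty measure is used, so Shannon entropy over the small model's action distribution stands in for it, and `choose_action` and the `large_model` callable are hypothetical names.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def choose_action(small_probs, actions, threshold, large_model):
    """Pick an action, deferring to the large model when uncertainty is high.

    small_probs: the small model's probabilities over `actions`
    threshold:   calibrated uncertainty threshold (assumed given)
    large_model: zero-argument callable returning the large model's action
    Returns (action, deferred_flag).
    """
    if entropy(small_probs) > threshold:
        return large_model(), True
    best = max(range(len(actions)), key=lambda i: small_probs[i])
    return actions[best], False
```

The fraction of steps where the flag is true corresponds to the deferral rate; the abstract reports that a rate of roughly 15% was enough to match the large model's quality in their environments.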

[90] Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

Xuanbo Su,Wenhao Hu,Le Zhan,Yanqi Yang,Leo Huang

Main category: cs.CL

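TL;DR: This paper proposes SalesLLM, a bilingual (ZH/EN) benchmark for sales dialogue that includes realistic scenario data, an automatic evaluation pipeline, and a user model, CustomerLM, improving both the realism of sales dialogue simulation and the reliability of its evaluation.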

Details Motivation: Existing dialogue benchmarks rarely measure deal progression and final outcomes, while sales dialogue is multi-turn, goal-directed, and shaped by asymmetric incentives; a dedicated benchmark is needed to evaluate and improve LLMs on sales tasks. Method: Builds the SalesLLM benchmark, covering Financial Services and Consumer Goods with 30,074 scripted configurations and 1,805 multi-turn scenarios; designs an automatic evaluation pipeline (an LLM-based rater plus fine-tuned BERT classifiers); trains a user model, CustomerLM (SFT + DPO), on 8,000 crowdworker-involved sales conversations to reduce role inversion. Result: SalesLLM scores correlate strongly with expert human ratings (Pearson r=0.98); CustomerLM reduces role inversion from 17.44% (GPT-4o) to 8.8%; 15 mainstream LLMs vary substantially on the benchmark, with the best competitive with human-level performance and the weakest below it. Conclusion: SalesLLM is a scalable, high-fidelity, outcome-oriented sales dialogue benchmark that provides a reliable tool for developing and evaluating sales agents. Abstract: Sales dialogues require multi-turn, goal-directed persuasion under asymmetric incentives, which makes them a challenging setting for large language models (LLMs). Yet existing dialogue benchmarks rarely measure deal progression and outcomes. We introduce SalesLLM, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with controllable difficulty and personas. We propose a fully automatic evaluation pipeline that combines (i) an LLM-based rater for sales-process progress, and (ii) fine-tuned BERT classifiers for end-of-dialogue buying intent. To improve simulation fidelity, we train a user model, CustomerLM, with SFT and DPO on 8,000 crowdworker-involved sales conversations, reducing role inversion from 17.44% (GPT-4o) to 8.8%. SalesLLM scores correlate strongly with expert human ratings (Pearson r=0.98). Experiments across 15 mainstream LLMs reveal substantial variability: top-performance LLMs are competitive with human-level performance while the less capable ones are worse than human. SalesLLM serves as a scalable benchmark for developing and evaluating outcome-oriented sales agents.

[91] IndoBERT-Sentiment: Context-Conditioned Sentiment Classification for Indonesian Text

Muhammad Apriandito Arya Saputra,Andry Alamsyah,Dian Puteri Ramadhani,Thomhert Suprapto Siadari,Hanif Fakhrurroja

Main category: cs.CL

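TL;DR: This paper proposes IndoBERT-Sentiment, a context-aware sentiment analysis model for Indonesian that adds topical context to improve classification accuracy and clearly outperforms existing general-purpose models on several metrics.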

Details Motivation: Existing Indonesian sentiment analysis models ignore the topical context of a text, yet that context often determines sentiment polarity, which leads to misclassification. Method: Builds a context-conditioned sentiment classifier on IndoBERT Large that takes the topic and the text as joint input; trains on a dataset of 31,360 context-text pairs covering 188 topics. Result: Achieves an F1 macro of 0.856 and accuracy of 88.1% on the test set, outperforming the best baseline by 35.6 F1 points. Conclusion: Modeling topical context improves Indonesian sentiment analysis and in particular corrects systematic errors made by context-free models. Abstract: Existing Indonesian sentiment analysis models classify text in isolation, ignoring the topical context that often determines whether a statement is positive, negative, or neutral. We introduce IndoBERT-Sentiment, a context-conditioned sentiment classifier that takes both a topical context and a text as input, producing sentiment predictions grounded in the topic being discussed. Built on IndoBERT Large (335M parameters) and trained on 31,360 context-text pairs labeled across 188 topics, the model achieves an F1 macro of 0.856 and accuracy of 88.1%. In a head-to-head evaluation against three widely used general-purpose Indonesian sentiment models on the same test set, IndoBERT-Sentiment outperforms the best baseline by 35.6 F1 points. We show that context-conditioning, previously demonstrated for relevancy classification, transfers effectively to sentiment analysis and enables the model to correctly classify texts that are systematically misclassified by context-free approaches.

[92] SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA)

Liang-Chih Yu,Jonas Becker,Shamsuddeen Hassan Muhammad,Idris Abdulmumin,Lung-Hao Lee,Ying-Lung Lin,Jin Wang,Jan Philip Wahle,Terry Ruas,Natalia Loukachevitch,Alexander Panchenko,Ilseyar Alimova,Lilian Wanzare,Nelson Odhiambo,Bela Gipp,Kai-Wei Chang,Saif M. Mohammad

Main category: cs.CL

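TL;DR: This paper describes the SemEval-2026 shared task on DimABSA and DimStance, which extends traditional aspect-based sentiment analysis (ABSA) and stance analysis to continuous valence-arousal (VA) dimensions and introduces a continuous F1 (cF1) evaluation metric.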

Details Motivation: Traditional ABSA uses discrete polarity labels, which cannot capture fine-grained, continuous variation in sentiment; existing work also focuses on consumer reviews and offers little support for modeling stance and sentiment in public-issue discourse (such as political, energy, and climate issues). Method: Introduces two new tasks, Dimensional Aspect-Based Sentiment Analysis (DimABSA) and Dimensional Stance Analysis (DimStance), both modeled in the VA space; defines three structured subtasks (regression, triplet extraction, quadruplet extraction) and a new evaluation metric, cF1; runs a large-scale international evaluation. Result: Attracted more than 400 participants, with 112 final submissions and 42 system description papers; reports baseline results and analyzes the top-performing systems and their key design choices. Conclusion: Unifying ABSA and stance analysis in a continuous VA space is a feasible and promising direction; cF1 offers a way to jointly evaluate regression and structured extraction; all resources are publicly released to advance the field. Abstract: We present the SemEval-2026 shared task on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which improves traditional ABSA by modeling sentiment along valence-arousal (VA) dimensions rather than using categorical polarity labels. To extend ABSA beyond consumer reviews to public-issue discourse (e.g., political, energy, and climate issues), we introduce an additional task, Dimensional Stance Analysis (DimStance), which treats stance targets as aspects and reformulates stance detection as regression in the VA space. The task consists of two tracks: Track A (DimABSA) and Track B (DimStance). Track A includes three subtasks: (1) dimensional aspect sentiment regression, (2) dimensional aspect sentiment triplet extraction, and (3) dimensional aspect sentiment quadruplet extraction, while Track B includes only the regression subtask for stance targets. We also introduce a continuous F1 (cF1) metric to jointly evaluate structured extraction and VA regression. The task attracted more than 400 participants, resulting in 112 final submissions and 42 system description papers. We report baseline results, discuss top-performing systems, and analyze key design choices to provide insights into dimensional sentiment analysis at the aspect and stance-target levels. All resources are available on our GitHub repository.

[93] Is Cross-Lingual Transfer in Bilingual Models Human-Like? A Study with Overlapping Word Forms in Dutch and English

Iza Škrjanec,Irene Elisabeth Winther,Vera Demberg,Stefan L. Frank

Main category: cs.CL

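TL;DR: This paper studies whether bilingual language models show cross-lingual activation patterns during reading similar to those of human bilinguals, and finds that the outcome depends heavily on how vocabulary is shared: human-like patterns are reproduced only when cognates alone share embeddings.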

Details Motivation: Examine whether bilingual language models can simulate the cross-lingual activation patterns that human bilinguals show in reading for cognates (friends) and interlingual homographs (false friends), in order to assess their plausibility as models of bilingual cognition. Method: Trains Dutch-English causal Transformers under four vocabulary-sharing conditions, evaluates them on psycholinguistic stimuli through surprisal and embedding similarity analyses, and uses regression analyses to examine the effects of frequency and form-meaning consistency. Result: The models largely maintain language separation; cross-lingual effects arise only when embeddings are shared, and in that case both cognates and false friends show facilitation (rather than the interference seen in humans); regression shows that the effect is mainly driven by word frequency; the qualitative human pattern is reproduced only when just cognates share embeddings. Conclusion: Bilingual language models capture cross-lingual activation to some extent, but their alignment with human processing depends critically on how lexical overlap is encoded, which may limit their explanatory power as cognitive models of bilingual reading. Abstract: Bilingual speakers show cross-lingual activation during reading, especially for words with shared surface form. Cognates (friends) typically lead to facilitation, whereas interlingual homographs (false friends) cause interference or no effect. We examine whether cross-lingual activation in bilingual language models mirrors these patterns. We train Dutch-English causal Transformers under four vocabulary-sharing conditions that manipulate whether (false) friends receive shared or language-specific embeddings. Using psycholinguistic stimuli from bilingual reading studies, we evaluate the models through surprisal and embedding similarity analyses. The models largely maintain language separation, and cross-lingual effects arise primarily when embeddings are shared. In these cases, both friends and false friends show facilitation relative to controls. Regression analyses reveal that these effects are mainly driven by frequency rather than consistency in form-meaning mapping. Only when just friends share embeddings are the qualitative patterns of bilinguals reproduced. Overall, bilingual language models capture some cross-linguistic activation effects. However, their alignment with human processing seems to critically depend on how lexical overlap is encoded, possibly limiting their explanatory adequacy as models of bilingual reading.

[94] Multilingual Embedding Probes Fail to Generalize Across Learner Corpora

Laurits Lyngbaek,Ross Deans Kristensen-McLachlan

Main category: cs.CL

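TL;DR: This paper investigates whether multilingual embedding models encode a language-general representation of proficiency by training several probes on hidden layers of the Qwen3-Embedding models to predict CEFR proficiency levels. The probes perform well in-distribution, but performance collapses in the cross-corpus (out-of-distribution) setting, indicating that current embeddings mainly capture corpus-specific features rather than an abstract, transferable proficiency dimension.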

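Details Motivation: Test whether multilingual embedding models implicitly encode a language-independent, generalizable representation of proficiency that could support representation-based adaptive language technology. Method: Trains linear and non-linear probes on hidden states from each layer of Qwen3-Embedding (0.6B/4B/8B) to predict CEFR proficiency levels across nine corpora and seven languages; compares five probing architectures against a baseline built on surface-level text features; uses both in-distribution and cross-corpus evaluation, plus residual analysis. Result: In-distribution, probes reach QWK ≈ 0.7, substantially above the baseline, with middle layers performing best; in cross-corpus evaluation all probes collapse, and residual analysis shows that they converge toward predicting uniformly distributed labels, indicating that the learned mappings depend on corpus-specific factors (topic, language, task type, rating methodology) rather than general proficiency. Conclusion: Current multilingual embedding models do not straightforwardly encode language-general proficiency; proficiency-adaptive technology built on such embeddings must handle distribution shift carefully and should explore ways to disentangle corpus-specific bias from genuine ability signals. Abstract: Do multilingual embedding models encode a language-general representation of proficiency? We investigate this by training linear and non-linear probes on hidden-state activations from Qwen3-Embedding (0.6B, 4B, 8B) to predict CEFR proficiency levels from learner texts across nine corpora and seven languages. We compare five probing architectures against a baseline trained on surface-level text features. Under in-distribution evaluation, probes achieve strong performance (QWK ≈ 0.7), substantially outperforming the surface baseline, with middle layers consistently yielding the best predictions. However, in cross-corpus evaluation performance collapses across all probe types and model sizes. Residual analysis reveals that out-of-distribution probes converge towards predicting uniformly distributed labels, indicating that the learned mappings capture corpus-specific distributional properties (topic, language, task type, rating methodology) rather than an abstract, transferable proficiency dimension. These results suggest that current multilingual embeddings do not straightforwardly encode language-general proficiency, with implications for representation-based approaches to proficiency-adaptive language technology.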
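The QWK figure quoted above is quadratic weighted kappa, a standard agreement measure for ordinal labels. A minimal pure-Python sketch of the usual definition follows, assuming proficiency levels are coded as integers 0..n_classes-1 (for example, six CEFR levels mapped to 0-5, which is an assumption about the encoding and not something the abstract states).

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Quadratic weighted kappa for integer labels in range(n_classes).

    1.0 means perfect agreement, 0.0 chance-level agreement, and negative
    values worse-than-chance. Undefined (division by zero) if both label
    lists contain a single identical class.
    """
    n = len(y_true)
    # Observed confusion matrix: rows are true labels, columns predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_true[i] * hist_pred[j] / n
    return 1.0 - num / den
```

Because the penalty grows with the squared distance between levels, predicting B1 for a B2 text costs far less than predicting A1, which is why the metric suits ordinal scales such as CEFR.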

[95] STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems

Hongru Ji,Yuyin Fan,Meng Zhao,Xianghua Li,Lianwei Wu,Chao Gao

Main category: cs.CL

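TL;DR: This paper proposes STRIDE-ED, a framework that models empathetic dialogue through strategy-grounded, interpretable, deep reasoning, combined with strategy-aware data construction and a two-stage training paradigm, and reports consistent gains in empathetic response generation.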

Details Motivation: Existing empathetic dialogue models are limited by the lack of a unified empathy strategy framework, explicit multi-stage reasoning, and high-quality strategy-annotated data, and therefore struggle to model the complex cognitive and decision-making process involved. Method: Proposes the STRIDE-ED framework: (1) structured reasoning grounded in empathy strategies; (2) a strategy-aware data refinement pipeline that combines LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling; (3) a two-stage training paradigm of supervised fine-tuning followed by multi-objective reinforcement learning. Result: Effective across a range of open-source LLMs, outperforming existing methods on both automatic metrics and human evaluation. Conclusion: STRIDE-ED treats empathetic dialogue as a strategy-conditioned, multi-stage reasoning task, improving interpretability, strategy alignment, and generation quality. Abstract: Empathetic dialogue requires not only recognizing a user's emotional state but also making strategy-aware, context-sensitive decisions throughout response generation. However, the lack of a comprehensive empathy strategy framework, explicit task-aligned multi-stage reasoning, and high-quality strategy-aware data fundamentally limits existing approaches, preventing them from effectively modeling empathetic dialogue as a complex, multi-stage cognitive and decision-making process. To address these challenges, we propose STRIDE-ED, a STRategy-grounded, Interpretable, and DEep reasoning framework that models Empathetic Dialogue through structured, strategy-conditioned reasoning. To support effective learning, we develop a strategy-aware data refinement pipeline integrating LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to construct high-quality training data aligned with empathetic strategies. Furthermore, we adopt a two-stage training paradigm that combines supervised fine-tuning with multi-objective reinforcement learning to better align model behaviors with target emotions, empathetic strategies, and response formats. Extensive experiments demonstrate that STRIDE-ED generalizes across diverse open-source LLMs and consistently outperforms existing methods on both automatic metrics and human evaluations.

[96] The Impact of Steering Large Language Models with Persona Vectors in Educational Applications

Yongchao Wu,Aron Henriksson

Main category: cs.CL

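TL;DR: This paper studies the effect of activation-steered persona vectors in educational settings (short-answer generation and automated scoring). Steering lowers answer quality, especially on open-ended English Language Arts (ELA) tasks; on the scoring side it causes predictable valence-aligned calibration shifts, with ELA tasks and the Mixture-of-Experts model being more sensitive.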

Details Motivation: Examine the practical effects and potential risks of activation-based personalization in educational applications (short-answer generation and automated scoring), where no systematic study exists. Method: On the ASAP-SAS benchmark, applies activation-steered persona vectors for seven character traits to three LLMs spanning two architectures, and measures the effect on short-answer generation and automated scoring. Result: Persona steering lowers answer quality overall, with interpretive and argumentative tasks up to 11x more sensitive than factual science prompts; scoring shows valence-aligned shifts (for example, "evil" scorers grade more harshly), ELA tasks are 2.5-3x more susceptible than science tasks, and the MoE model shows roughly 6x larger calibration shifts than the dense models. Conclusion: The effects of activation-steered personas in education depend strongly on task and architecture, so deployment requires task-aware and architecture-aware calibration. Abstract: Activation-based steering can personalize large language models at inference time, but its effects in educational settings remain unclear. We study persona vectors for seven character traits in short-answer generation and automated scoring on the ASAP-SAS benchmark across three models spanning two architectures. Persona steering lowers answer quality overall, with much larger effects on open-ended English Language Arts (ELA) prompts than on factual science prompts; interpretive and argumentative tasks are up to 11x more sensitive. On the scoring side, we observe predictable valence-aligned calibration shifts: evil and impolite scorers grade more harshly, while good and optimistic scorers grade more leniently. ELA tasks are 2.5-3x more susceptible to scorer personalization than science tasks, and the Mixture-of-Experts model shows roughly 6x larger calibration shifts than the dense models. To our knowledge, this is the first study to systematically examine the effects of activation-steered persona traits in educational generation and scoring, and the results highlight the need for task-aware and architecture-aware calibration when deploying steered models in educational settings.

[97] Yale-DM-Lab at ArchEHR-QA 2026: Deterministic Grounding and Multi-Pass Evidence Alignment for EHR Question Answering

Elyas Irankhah,Samah Fodeh

Main category: cs.CL

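TL;DR: This paper describes the Yale-DM-Lab system for the ArchEHR-QA 2026 shared task, which uses multi-model pipelines and ensembling across the four subtasks and reports its development-set results.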

Details Motivation: Address the difficulty of understanding and answering patient-authored questions about hospitalization records, improving clinician-interpreted reformulation, evidence retrieval, answer generation, and evidence-answer alignment. Method: ST1 uses a dual-model pipeline with Claude Sonnet 4 and GPT-4o for question reformulation; ST2-ST4 use ensembles of Azure-hosted models (o3, GPT-5.2, GPT-5.1, and DeepSeek-R1) combined with few-shot prompting and voting; ST4 additionally supplies the full clinician answer paragraph as prompt context for alignment. Result: On the development set, ST4 reaches 88.81 micro F1, ST2 65.72 macro F1, ST3 34.01, and ST1 33.05; model diversity and ensemble voting improve over single-model baselines, and alignment accuracy is found to be limited mainly by reasoning. Conclusion: Multi-model ensembling and added context improve performance across subtasks, but evidence-answer alignment remains constrained by reasoning ability. Abstract: We describe the Yale-DM-Lab system for the ArchEHR-QA 2026 shared task. The task studies patient-authored questions about hospitalization records and contains four subtasks (ST): clinician-interpreted question reformulation, evidence sentence identification, answer generation, and evidence-answer alignment. ST1 uses a dual-model pipeline with Claude Sonnet 4 and GPT-4o to reformulate patient questions into clinician-interpreted questions. ST2-ST4 rely on Azure-hosted model ensembles (o3, GPT-5.2, GPT-5.1, and DeepSeek-R1) combined with few-shot prompting and voting strategies. Our experiments show three main findings. First, model diversity and ensemble voting consistently improve performance compared to single-model baselines. Second, the full clinician answer paragraph is provided as additional prompt context for evidence alignment. Third, results on the development set show that alignment accuracy is mainly limited by reasoning. The best scores on the development set reach 88.81 micro F1 on ST4, 65.72 macro F1 on ST2, 34.01 on ST3, and 33.05 on ST1.

[98] Are Non-English Papers Reviewed Fairly? Language-of-Study Bias in NLP Peer Reviews

Ehsan Barkhordar,Abdulfattah Safa,Verena Blaschke,Erika Lombart,Marie-Catherine de Marneffe,Gözde Gül Şahin

Main category: cs.CL

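TL;DR: This paper presents the first systematic study of language-of-study (LoS) bias in NLP peer review. It builds the human-annotated dataset LOBSTER and a detection method, and finds that non-English papers face higher bias rates, with "demanding unjustified cross-lingual generalization" as the most dominant form of negative bias.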

Details Motivation: Language-of-study (LoS) bias exists in peer review but has not been clearly defined or systematically studied; prior work folds it into broad categories of weak or unconstructive reviews. Method: Builds the human-annotated dataset LOBSTER, proposes an LoS bias detection method (87.37 macro F1), and analyzes 15,645 reviews quantitatively and qualitatively to identify bias subcategories. Result: Non-English papers face substantially higher bias rates than English-only papers; negative bias consistently outweighs positive bias; "demanding unjustified cross-lingual generalization" is the most dominant negative subcategory. Conclusion: LoS bias is a real and significant problem that calls for targeted intervention in reviewing guidelines, reviewer training, and reviewing tools; all resources are publicly released to support fairer reviewing practices. Abstract: Peer review plays a central role in the NLP publication process, but is susceptible to various biases. Here, we study language-of-study (LoS) bias: the tendency for reviewers to evaluate a paper differently based on the language(s) it studies, rather than its scientific merit. Despite being explicitly flagged in reviewing guidelines, such biases are poorly understood. Prior work treats such comments as part of broader categories of weak or unconstructive reviews without defining them as a distinct form of bias. We present the first systematic characterization of LoS bias, distinguishing negative and positive forms, and introduce the human-annotated dataset LOBSTER (Language-Of-study Bias in ScienTific pEer Review) and a method achieving 87.37 macro F1 for detection. We analyze 15,645 reviews to estimate how negative and positive biases differ with respect to the LoS, and find that non-English papers face substantially higher bias rates than English-only ones, with negative bias consistently outweighing positive bias. Finally, we identify four subcategories of negative bias, and find that demanding unjustified cross-lingual generalization is the most dominant form. We publicly release all resources to support work on fairer reviewing practices in NLP and beyond.

[99] Language Bias under Conflicting Information in Multilingual LLMs

Robert Östling,Murathan Kurfalı

Main category: cs.CL

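TL;DR: This paper studies whether multilingual large language models (LLMs) show language-preference bias when handling conflicting information provided in different languages. All tested models tend to ignore the conflict and confidently pick a single answer, with a systematic bias against Russian and a preference for Chinese (especially at long context lengths). The pattern appears in models trained both inside and outside mainland China, but is stronger in the former.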

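Details Motivation: To investigate whether LLMs show language-related bias when integrating conflicting information provided in different languages, extending the conflicting needles in a haystack paradigm to multilingual settings. Method: Extends the conflicting needles in a haystack paradigm to a multilingual setting, using naturalistic news data in five languages (including Russian and Chinese) to systematically evaluate multilingual LLMs of various sizes (including GPT-5.2). Result: All tested models (including GPT-5.2) ignore the cross-language conflict in the large majority of cases and confidently output a single answer; there is a general bias against Russian and, at long context lengths, a preference for Chinese; the pattern is consistent between models trained inside and outside mainland China, but stronger in models trained in mainland China. Conclusion: Multilingual LLMs show a stable language-preference bias across models, indicating that their reasoning is affected by language rather than semantic content alone; the bias may stem from training data distribution or differences in alignment strategy, and deserves attention in the design of fair and robust multilingual AI. Abstract: Large Language Models (LLMs) have been shown to contain biases in the process of integrating conflicting information when answering questions. Here we ask whether such biases also exist with respect to which language is used for each conflicting piece of information. To answer this question, we extend the conflicting needles in a haystack paradigm to a multilingual setting and perform a comprehensive set of evaluations with naturalistic news domain data in five different languages, for a range of multilingual LLMs of different sizes. We find that all LLMs tested, including GPT-5.2, ignore the conflict and confidently assert only one of the possible answers in the large majority of cases. Furthermore, there is a consistent bias across models in which languages are preferred, with a general bias against Russian and, for the longest context lengths, in favor of Chinese. Both of these patterns are consistent between models trained inside and outside of mainland China, though somewhat stronger in the former category.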

[100] Dynamic Context Evolution for Scalable Synthetic Data Generation

Ryan Lingo,Rajeev Chhajer

Main category: cs.CL

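TL;DR: This paper proposes the Dynamic Context Evolution (DCE) framework, which uses three mechanisms (verbalized tail sampling, semantic memory, and adaptive prompt evolution) to mitigate cross-batch mode collapse, the loss of diversity that arises when large language models generate independently across many batches. It significantly improves output diversity across multiple tasks and models without fine-tuning.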

Details Motivation: When prompted independently across many batches, large language models are prone to cross-batch mode collapse (rising repetition and falling diversity); current practice relies on ad hoc deduplication and seed rotation and lacks a systematic solution. Method: Proposes the Dynamic Context Evolution (DCE) framework, comprising: (1) verbalized tail sampling (the model self-assesses its ideas and obvious ones are filtered out); (2) semantic memory (an embedding index for cross-batch deduplication); (3) adaptive prompt evolution (dynamically rebuilding the prompt from memory state and diversity strategies). Result: Experiments across three domains and two model families show that DCE achieves 0.0% ± 0.0% mode collapse (versus 5.6% ± 2.0% for the baseline), a more stable HDBSCAN cluster count (17-18 vs. 2-17), and a cost of only about $0.50 per 1,000 candidates, with no fine-tuning or custom architectures required. Conclusion: DCE is a lightweight, general, API-native diversity-enhancement framework whose components must work together to be effective, offering the first principled solution for batch generation settings. Abstract: Large language models produce repetitive output when prompted independently across many batches, a phenomenon we term cross-batch mode collapse: the progressive loss of output diversity when a language model is prompted repeatedly without access to its prior generations. Practitioners have long mitigated this with ad hoc deduplication and seed rotation, but no principled framework exists. We introduce Dynamic Context Evolution (DCE), comprising three mechanisms: (1) verbalized tail sampling (the model labels each idea with a guess about how obvious it is, and obvious ideas are discarded), which filters high-probability candidates via model self-assessment; (2) semantic memory, which maintains a persistent embedding index to reject near-duplicates across batches; and (3) adaptive prompt evolution, which reconstructs the generation prompt each batch using memory state and rotating diversity strategies. In experiments across three domains (sustainable packaging concepts, educational exam questions, and creative writing prompts) and two model families (gpt-5-mini and claude-haiku-4-5), a component ablation across 2-3 random seeds per method shows that DCE achieves 0.0 +/- 0.0% collapse versus 5.6 +/- 2.0% for naive prompting, while producing 17-18 HDBSCAN clusters per seed versus naive's volatile 2-17, indicating reliably richer conceptual structure.
These results are validated with an independent embedding model (all-MiniLM-L6-v2) and hold across sensitivity sweeps of the VTS threshold tau and dedup threshold delta. Deduplication and prompt evolution are individually insufficient but jointly effective, at approximately $0.50 per 1,000 candidates using only standard API calls, with no fine-tuning or custom architectures required.
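
As a rough illustration of the semantic memory mechanism described in the abstract (a persistent embedding index that rejects near-duplicates across batches), the following is a minimal sketch. The class name `SemanticMemory`, the cosine-similarity test, and the toy 2-D vectors are assumptions made for illustration; the abstract names a dedup threshold delta but does not give the similarity measure or index structure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

class SemanticMemory:
    """Hypothetical persistent index: keeps every admitted embedding across batches."""

    def __init__(self, delta=0.9):
        self.delta = delta    # dedup threshold (assumed to be a cosine similarity)
        self.vectors = []     # embeddings admitted so far

    def admit(self, vec):
        """Reject vec if it is a near-duplicate of anything already stored."""
        if any(cosine(vec, kept) >= self.delta for kept in self.vectors):
            return False
        self.vectors.append(vec)
        return True

# Toy embeddings standing in for the output of a real embedding model.
memory = SemanticMemory(delta=0.9)
batch_1 = [[1.0, 0.0], [0.0, 1.0]]
batch_2 = [[0.99, 0.05], [0.7, 0.7]]   # first is a near-duplicate of batch_1[0]
accepted = [memory.admit(v) for v in batch_1 + batch_2]
print(accepted)
```

A linear scan is enough for a sketch; at scale one would presumably swap in an approximate nearest-neighbor index, which the abstract does not discuss.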

[101] Agent-Driven Corpus Linguistics: A Framework for Autonomous Linguistic Discovery

Jia Yu,Weiwei Yu,Pengfei Xiao,Fukun Xing

Main category: cs.CL

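TL;DR: This paper proposes a new paradigm called "Agent-Driven Corpus Linguistics", in which a large language model (LLM) connected to a corpus query engine automates the full cycle of hypothesis generation, corpus querying, result interpretation, and iterative analysis, while ensuring that every finding is backed by verifiable corpus evidence.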

Details Motivation: Traditional corpus linguistics relies on researchers to formulate hypotheses, build queries, and interpret results by hand, which is time-consuming and requires specialized skills; this paper aims to lower the technical barrier and improve analytical efficiency and empirical grounding. Method: Builds an LLM agent connected through a structured tool-use interface (Model Context Protocol, MCP) to a CQP-indexed Gutenberg corpus (5 million tokens), enabling a closed loop of multi-round hypothesis, query, interpretation, and refinement; also runs a baseline experiment and an external validity check (replicating two published studies). Result: The agent identified a diachronic relay chain of English intensifiers (so+ADJ > very > really), three pathways of semantic change, and register-sensitive distributions; the baseline experiment shows that corpus grounding contributes quantification and falsifiability; replications on the CLMET corpus agree closely with the original studies. Conclusion: Agent-Driven Corpus Linguistics does not replace traditional corpus methods but adds a new dimension concerning who conducts the inquiry; it can produce empirically grounded findings at machine speed, broadening the accessibility and empirical rigor of corpus linguistics. Abstract: Corpus linguistics has traditionally relied on human researchers to formulate hypotheses, construct queries, and interpret results - a process demanding specialized technical skills and considerable time. We propose Agent-Driven Corpus Linguistics, an approach in which a large language model (LLM), connected to a corpus query engine via a structured tool-use interface, takes over the investigative cycle: generating hypotheses, querying the corpus, interpreting results, and refining analysis across multiple rounds. The human researcher sets direction and evaluates final output. Unlike unconstrained LLM generation, every finding is anchored in verifiable corpus evidence. We treat this not as a replacement for the corpus-based/corpus-driven distinction but as a complementary dimension: it concerns who conducts the inquiry, not the epistemological relationship between theory and data. We demonstrate the framework by linking an LLM agent to a CQP-indexed Gutenberg corpus (5 million tokens) via the Model Context Protocol (MCP). Given only "investigate English intensifiers," the agent identified a diachronic relay chain (so+ADJ > very > really), three pathways of semantic change (delexicalization, polarity fixation, metaphorical constraint), and register-sensitive distributions. A controlled baseline experiment shows that corpus grounding contributes quantification and falsifiability that the model cannot produce from training data alone.
To test external validity, the agent replicated two published studies on the CLMET corpus (40 million tokens) - Claridge (2025) and De Smet (2013) - with close quantitative agreement. Agent-driven corpus research can thus produce empirically grounded findings at machine speed, lowering the technical barrier for a broader range of researchers.

[102] LaScA: Language-Conditioned Scalable Modelling of Affective Dynamics

Kosmas Pinitas,Ilias Maglogiannis

Main category: cs.CL

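TL;DR: This paper proposes a new framework that uses language models (LMs) as semantic context conditioners over handcrafted affect descriptors to predict changes in Valence and Arousal, balancing interpretability and predictive performance.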

Details Motivation: Existing affect prediction methods based on deep neural embeddings lack interpretability and are hard to refine with expert input, while interpretable, high-performing affect modelling in unconstrained environments remains a challenge. Method: Extracts interpretable facial geometry and acoustic features and converts them into natural-language descriptions of their affective meaning; a pretrained language model processes these descriptions to produce semantic context embeddings that serve as high-level priors over affective dynamics and guide the modelling of Valence and Arousal changes. Result: On the Aff-Wild2 and SEWA datasets, the method significantly outperforms handcrafted-only and end-to-end deep-embedding baselines on both Valence and Arousal change prediction. Conclusion: Semantic conditioning enables interpretable affect modelling without sacrificing predictive performance, offering a transparent and efficient alternative to end-to-end black-box architectures. Abstract: Predicting affect in unconstrained environments remains a fundamental challenge in human-centered AI. While deep neural embeddings dominate contemporary approaches, they often lack interpretability and limit expert-driven refinement. We propose a novel framework that uses Language Models (LMs) as semantic context conditioners over handcrafted affect descriptors to model changes in Valence and Arousal. Our approach begins with interpretable facial geometry and acoustic features derived from structured domain knowledge. These features are transformed into symbolic natural-language descriptions encoding their affective implications. A pretrained LM processes these descriptions to generate semantic context embeddings that act as high-level priors over affective dynamics. Unlike end-to-end black-box pipelines, our framework preserves feature transparency while leveraging the contextual abstraction capabilities of LMs. We evaluate the proposed method on the Aff-Wild2 and SEWA datasets for affect change prediction. Experimental results show consistent improvements in accuracy for both Valence and Arousal compared to handcrafted-only and deep-embedding baselines. Our findings demonstrate that semantic conditioning enables interpretable affect modelling without sacrificing predictive performance, offering a transparent and computationally efficient alternative to fully end-to-end architectures.

[103] Efficient Learned Data Compression via Dual-Stream Feature Decoupling

Huidong Ma,Xinyan Shi,Hui Sun,Xiaofei Yue,Xiaoguang Liu,Gang Wang,Wentong Cai

Main category: cs.CL

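TL;DR: This paper proposes a Dual-Stream Multi-Scale Decoupler and a Hierarchical Gated Refiner, combined with a Concurrent Stream-Parallel Pipeline, to address the difficulty of balancing probability modeling precision with system efficiency in learned data compression; it reaches SOTA in compression ratio, throughput, latency, and memory usage.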

Details Motivation: Existing single-stream architectures struggle to model micro-syntactic and macro-semantic features at the same time, and deep serial stacking causes high latency; heterogeneous systems are limited by device speed mismatches, and serial processing caps throughput under Amdahl's Law. Method: Proposes a Dual-Stream Multi-Scale Decoupler (separating local and global contexts, replacing deep serial processing with shallow parallel streams), a Hierarchical Gated Refiner (adaptive feature refinement and precise probability modeling), and a Concurrent Stream-Parallel Pipeline (full-pipeline parallelism). Result: Achieves SOTA compression ratio and throughput while maintaining the lowest latency and memory usage. Conclusion: Decoupled modeling and a parallelized design can overcome the trade-off between precision and efficiency in learned data compression. Abstract: While Learned Data Compression (LDC) has achieved superior compression ratios, balancing precise probability modeling with system efficiency remains challenging. Crucially, uniform single-stream architectures struggle to simultaneously capture micro-syntactic and macro-semantic features, necessitating deep serial stacking that exacerbates latency. Compounding this, heterogeneous systems are constrained by device speed mismatches, where throughput is capped by Amdahl's Law due to serial processing. To this end, we propose a Dual-Stream Multi-Scale Decoupler that disentangles local and global contexts to replace deep serial processing with shallow parallel streams, and incorporate a Hierarchical Gated Refiner for adaptive feature refinement and precise probability modeling. Furthermore, we design a Concurrent Stream-Parallel Pipeline, which overcomes systemic bottlenecks to achieve full-pipeline parallelism. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both compression ratio and throughput, while maintaining the lowest latency and memory usage. The code is available at https://github.com/huidong-ma/FADE.

[104] Joint Optimization of Reasoning and Dual-Memory for Self-Learning Diagnostic Agent

Bingxuan Li,Simo Du,Yue Guo

Main category: cs.CL

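TL;DR: This paper proposes SEA, a self-learning diagnostic agent with a cognitively inspired dual-memory module. A reinforcement training framework jointly optimizes reasoning and memory management; SEA significantly outperforms baselines on both standard and long-horizon clinical reasoning tasks, and expert evaluation finds its generated rules clinically correct and useful.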

Details Motivation: Most existing LLM-based diagnostic agents handle cases independently, making it hard to reuse experience and adapt continually, which limits improvements in clinical reasoning. Method: Proposes the SEA diagnostic agent, which includes a cognitively inspired dual-memory module, and designs a dedicated reinforcement training framework to jointly optimize reasoning and memory management. Result: Reaches 92.46% accuracy on the MedCaseReasoning dataset (+19.6%), and on the long-horizon ER-Reason task attains a final accuracy of 0.7214 with an Acc@100 improvement of +0.35; expert evaluation confirms that the extracted rules are clinically correct, useful, and trustworthy. Conclusion: By turning clinical experience into reusable knowledge, SEA improves both diagnostic reasoning and continual learning. Abstract: Clinical expertise improves not only by acquiring medical knowledge, but by accumulating experience that yields reusable diagnostic patterns. Recent LLM-based diagnostic agents have shown promising progress in clinical reasoning for decision support. However, most approaches treat cases independently, limiting experience reuse and continual adaptation. We propose SEA, a self-learning diagnostic agent with a cognitively inspired dual-memory module. We design a reinforcement training framework tailored to our agent for joint optimization of reasoning and memory management. We evaluate SEA in two complementary settings. In the standard evaluation on the MedCaseReasoning dataset, SEA achieves 92.46% accuracy, outperforming the strongest baseline by +19.6%, demonstrating the benefit of jointly optimizing reasoning and memory. In the long-horizon setting on the ER-Reason dataset, SEA attains the best final accuracy (0.7214) and the largest improvement (+0.35 Acc@100), while baseline methods show limited or unstable gains. Expert evaluation further indicates that rules consolidated from SEA show strong clinical correctness, usefulness and trust, suggesting that the induced rules in the dual-memory module are reliable and practically meaningful. Overall, SEA improves both diagnostic reasoning ability and continual learning by effectively transforming experience into reusable knowledge.

[105] ClickGuard: A Trustworthy Adaptive Fusion Framework for Clickbait Detection

Chhavi Dhiman,Naman Chawla,Riya Dhami,Gaurav Kumar,Ganesh Naik

Main category: cs.CL

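TL;DR: This paper proposes ClickGuard, a framework that combines BERT embeddings with structural features, fuses them dynamically with SSAFB, and models them with a CNN-BiLSTM, achieving accurate (96.93%), interpretable, and robust clickbait detection.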

Details Motivation: Clickbait headlines are widespread and undermine the credibility of online content through sensationalism, misleading claims, and vague language, so efficient and trustworthy detection methods are needed. Method: Proposes the ClickGuard framework, which uses a Syntactic-Semantic Adaptive Fusion Block (SSAFB) to adaptively fuse BERT semantic embeddings with syntactic structural features, and a CNN-BiLSTM to capture local patterns and long-range dependencies; LIME and Permutation Feature Importance (PFI) are used for interpretability and robustness analysis. Result: Reaches 96.93% accuracy on the test set, surpassing existing state-of-the-art methods; ablation studies validate the effectiveness of SSAFB; the model performs robustly across multiple datasets and is scalable and reliable. Conclusion: ClickGuard offers a trustworthy, adaptive, and interpretable end-to-end solution to the syntactic-semantic modelling challenge in clickbait detection, helping improve the credibility of the digital content ecosystem. Abstract: The widespread use of clickbait headlines, crafted to mislead and maximize engagement, poses a significant challenge to online credibility. These headlines employ sensationalism, misleading claims, and vague language, underscoring the need for effective detection to ensure trustworthy digital content. This paper introduces ClickGuard, a trustworthy adaptive fusion framework for clickbait detection. It combines BERT embeddings and structural features using a Syntactic-Semantic Adaptive Fusion Block (SSAFB) for dynamic integration. The framework incorporates a hybrid CNN-BiLSTM to capture patterns and dependencies. The model achieved 96.93% testing accuracy, outperforming state-of-the-art approaches. The model's trustworthiness is evaluated using LIME and Permutation Feature Importance (PFI) for interpretability and perturbation analysis. These methods assess the model's robustness and sensitivity to feature changes by measuring the average prediction variation. Ablation studies validated the SSAFB's effectiveness in optimizing feature fusion. The model demonstrated robust performance across diverse datasets, providing a scalable, reliable solution for enhancing online content credibility by addressing syntactic-semantic modelling challenges. Code of the work is available at: https://github.com/palindromeRice/ClickBait_Detection_Architecture

[106] A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering

Nusrat Sultana,Abdullah Muhammad Moosa,Kazi Afzalur Rahman,Sajal Chandra Banik

Main category: cs.CL

TL;DR: This paper systematically evaluates retrieval-augmented generation (RAG) for medical question answering on MedQA USMLE, identifying dense retrieval with query reformulation and reranking as best (60.49% accuracy), highlighting domain-specialized LMs’ superior evidence utilization and a tradeoff between retrieval effectiveness and computational cost.

Details Motivation: Purely parametric LLMs suffer from knowledge gaps and poor factual grounding in medical QA; RAG helps, but the impact of individual retrieval components remains unclear. Method: Systematic evaluation across 40 configurations on MedQA USMLE using a textbook-based knowledge corpus, varying language models, embedding models, retrieval strategies, query reformulation, and cross-encoder reranking. Result: Dense retrieval with query reformulation and reranking achieved 60.49% accuracy; domain-specialized LMs better leverage retrieved evidence; simpler dense retrieval offers strong performance–cost tradeoffs; all experiments run on a single consumer GPU. Conclusion: Retrieval augmentation significantly improves zero-shot medical QA; component choices critically affect performance and efficiency; systematic RAG evaluation is feasible with modest resources. Abstract: Large language models (LLMs) have demonstrated strong capabilities in medical question answering; however, purely parametric models often suffer from knowledge gaps and limited factual grounding. Retrieval-augmented generation (RAG) addresses this limitation by integrating external knowledge retrieval into the reasoning process. Despite increasing interest in RAG-based medical systems, the impact of individual retrieval components on performance remains insufficiently understood. This study presents a systematic evaluation of retrieval-augmented medical question answering using the MedQA USMLE benchmark and a structured textbook-based knowledge corpus. We analyze the interaction between language models, embedding models, retrieval strategies, query reformulation, and cross-encoder reranking within a unified experimental framework comprising forty configurations. Results show that retrieval augmentation significantly improves zero-shot medical question answering performance. The best-performing configuration, dense retrieval with query reformulation and reranking, achieved 60.49% accuracy.
Domain-specialized language models were also found to better utilize retrieved medical evidence than general-purpose models. The analysis further reveals a clear tradeoff between retrieval effectiveness and computational cost, with simpler dense retrieval configurations providing strong performance while maintaining higher throughput. All experiments were conducted on a single consumer-grade GPU, demonstrating that systematic evaluation of retrieval-augmented medical QA systems can be performed under modest computational resources.
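
The dense-retrieval-plus-reranking configuration named in the abstract can be sketched in a few lines. Everything concrete below is an assumption made for illustration: the toy 2-D embeddings stand in for a real embedding model, `overlap_score` is a crude word-overlap stand-in for a cross-encoder reranker, and the passages are placeholder text rather than entries from the study's textbook corpus.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def overlap_score(query, passage):
    """Stand-in reranker: number of shared lowercase words (not a real cross-encoder)."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve_then_rerank(query_vec, query_text, corpus, rerank_score, k=2):
    """Stage 1: top-k by embedding similarity. Stage 2: reorder the shortlist."""
    shortlist = sorted(corpus, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)[:k]
    return sorted(shortlist, key=lambda d: rerank_score(query_text, d["text"]), reverse=True)

# Placeholder passages with hand-picked toy embeddings.
corpus = [
    {"text": "treatment options for condition A in adults", "vec": [0.9, 0.1]},
    {"text": "condition A overview", "vec": [0.95, 0.05]},
    {"text": "nutrition basics", "vec": [0.1, 0.9]},
]
ranked = retrieve_then_rerank([1.0, 0.0], "treatment options for condition A", corpus, overlap_score)
print([d["text"] for d in ranked])
```

The sketch shows why reranking can matter: the embedding stage places the overview passage first, and the second stage promotes the passage that matches the question more closely, at the cost of an extra scoring pass over the shortlist.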

[107] Why teaching resists automation in an AI-inundated era: Human judgment, non-modular work, and the limits of delegation

Songhee Han

Main category: cs.CL

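TL;DR: This paper challenges the view that teaching is modular and procedural and can therefore be automated by AI, arguing that teaching is inherently interpretive, relational, and grounded in professional judgment, and so is hard to automate in any meaningful way.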

Details Motivation: To counter overly optimistic claims in education that AI can replace teachers, claims that overlook how inseparable teaching practice is and how complex human cognition is. Method: Theoretical analysis drawing on recent literature and empirical studies of large language models and retrieval-augmented generation systems. Result: Argues that although AI can support some bounded instructional functions, it cannot replace the interpretive, relational, and professional-judgment components that are central to teaching. Conclusion: As professional work that depends on human cognition, motivation, and social interaction, teaching fundamentally resists automation; AI can assist but cannot replace teachers' human judgment and relational accountability. Abstract: Debates about artificial intelligence (AI) in education often portray teaching as a modular and procedural job that can increasingly be automated or delegated to technology. This brief communication paper argues that such claims depend on treating teaching as more separable than it is in practice. Drawing on recent literature and empirical studies of large language models and retrieval-augmented generation systems, I argue that although AI can support some bounded functions, instructional work remains difficult to automate in meaningful ways because it is inherently interpretive, relational, and grounded in professional judgment. More fundamentally, teaching and learning are shaped by human cognition, behavior, motivation, and social interaction in ways that cannot be fully specified, predicted, or exhaustively modeled. Tasks that may appear separable in principle derive their instructional value in practice from ongoing contextual interpretation across learners, situations, and relationships. As long as educational practice relies on emergent understanding of human cognition and learning, teaching remains a form of professional work that resists automation. AI may improve access to information and support selected instructional activities, but it does not remove the need for human judgment and relational accountability that effective teaching requires.

[108] OpenSpatial: A Principled Data Engine for Empowering Spatial Intelligence

Jianhui Liu,Haoze Sun,Wenbo Li,Yanbing Zhang,Rui Yang,Zhiliang Zhu,Yijun Yang,Shenghe Zheng,Nan Jiang,Jiaxiu Jiang,Haoyang Huang,Tien-Tsin Wong,Nan Duan,Xiaojuan Qi

Main category: cs.CL

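TL;DR: This paper proposes OpenSpatial, an open-source spatial data generation engine that addresses the lack of systematic tools for producing high-quality spatial data. Using 3D bounding boxes as the basic unit, it covers five spatial tasks, yields the 3-million-sample OpenSpatial-3M dataset, and is shown to significantly improve spatial reasoning models.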

Details Motivation: Spatial intelligence research is held back by the lack of a principled, open-source, scalable system for generating high-quality spatial data, leaving the potential of spatial data underused. Method: Proposes the OpenSpatial open-source data engine, which uses 3D bounding boxes as the basic unit to build a data hierarchy covering five tasks (spatial measurement, spatial relationship, camera perception, multi-view consistency, and scene-aware reasoning), and uses the engine to generate the large-scale, high-fidelity OpenSpatial-3M dataset (3 million samples). Result: Models trained on OpenSpatial-3M reach SOTA performance on multiple spatial reasoning benchmarks, with the best model showing an average relative improvement of 19%; the paper also systematically analyzes how data attributes affect spatial perception. Conclusion: OpenSpatial and its accompanying dataset provide a solid, open, reproducible foundation for spatial intelligence research and may accelerate progress in the field. Abstract: Spatial understanding is a fundamental cornerstone of human-level intelligence. Nonetheless, current research predominantly focuses on domain-specific data production, leaving a critical void: the absence of a principled, open-source engine capable of fully unleashing the potential of high-quality spatial data. To bridge this gap, we elucidate the design principles of a robust data generation system and introduce OpenSpatial -- an open-source data engine engineered for high quality, extensive scalability, broad task diversity, and optimized efficiency. OpenSpatial adopts 3D bounding boxes as the fundamental primitive to construct a comprehensive data hierarchy across five foundational tasks: Spatial Measurement (SM), Spatial Relationship (SR), Camera Perception (CP), Multi-view Consistency (MC), and Scene-Aware Reasoning (SAR). Leveraging this scalable infrastructure, we curate OpenSpatial-3M, a large-scale dataset comprising 3 million high-fidelity samples. Extensive evaluations demonstrate that versatile models trained on our dataset achieve state-of-the-art performance across a wide spectrum of spatial reasoning benchmarks. Notably, the best-performing model exhibits a substantial average relative improvement of 19 percent. Furthermore, we provide a systematic analysis of how data attributes influence spatial perception. By open-sourcing both the engine and the 3M-scale dataset, we provide a robust foundation to accelerate future research in spatial intelligence.

[109] Evaluating In-Context Translation with Synchronous Context-Free Grammar Transduction

Jackson Petty,Jaulie Goe,Tal Linzen

Main category: cs.CL

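TL;DR: This paper studies how well large language models (LLMs) can use a formal grammar supplied in context to perform string transduction, as a proxy for low-resource machine translation. Performance drops markedly as the number of grammar rules and the sentence length grow, and models are prone to recalling the wrong words, hallucinating new words, or leaving words untranslated.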

Details Motivation: Low-resource languages lack the training data that LLM-based machine translation normally requires; this paper explores whether LLMs' ability to use in-context language descriptions (such as grammar textbooks and dictionaries) can reduce that data dependence. Method: Constructs synchronous context-free grammars (SCFGs) that model aspects of natural language grammar, morphology, and writing systems, and measures how well LLMs convert formal-language strings from the source language to the target language when given the grammar and the source sentence, while systematically varying grammar size, sentence length, and morphological and script differences. Result: (1) LLM translation accuracy drops markedly as grammar size and sentence length increase; (2) differences in morphology and script between source and target languages can strongly reduce performance; (3) the main error types are recalling the wrong target words, hallucinating new words, and leaving source words untranslated. Conclusion: Current LLMs have some capacity for grammar-based cross-language transduction, but limits in context handling and structural modeling mean they cannot yet reliably support grammar-driven translation for low-resource languages. Abstract: Low-resource languages pose a challenge for machine translation with large language models (LLMs), which require large amounts of training data. One potential way to circumvent this data dependence is to rely on LLMs' ability to use in-context descriptions of languages, like textbooks and dictionaries. To do so, LLMs must be able to infer the link between the languages' grammatical descriptions and the sentences in question. Here we isolate this skill using a formal analogue of the task: string transduction based on a formal grammar provided in-context. We construct synchronous context-free grammars which define pairs of formal languages designed to model particular aspects of natural language grammar, morphology, and written representation. Using these grammars, we measure how well LLMs can translate sentences from one formal language into another when given both the grammar and the source-language sentence. We vary the size of the grammar, the lengths of the sentences, the syntactic and morphological properties of the languages, and their written script. We note three key findings. First, LLMs' translation accuracy decreases markedly as a function of grammar size and sentence length. Second, differences in morphology and written representation between the source and target languages can strongly diminish model performance. Third, we examine the types of errors committed by models and find they are most prone to recall the wrong words from the target language vocabulary, hallucinate new words, or leave source-language words untranslated.
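
To make the notion of a synchronous context-free grammar concrete, here is a minimal sketch of a toy SCFG and a derivation that expands source and target sides in lockstep. The grammar, the invented target words (`wug`, `dax`, `zop`), and the simplifying assumption that each nonterminal appears at most once per rule side (so linking is by name) are illustrative choices, not the grammars used in the paper.

```python
# Each nonterminal maps to a list of paired rules: (source right-hand side, target right-hand side).
# Symbols that are keys of RULES are nonterminals; everything else is a terminal word.
RULES = {
    "S": [(["NP", "VP"], ["NP", "VP"])],
    "VP": [(["V", "NP"], ["NP", "V"])],        # verb-object order is flipped in the target
    "NP": [(["dog"], ["wug"]), (["cat"], ["dax"])],
    "V": [(["sees"], ["zop"])],
}

def derive(symbol, choices):
    """Expand `symbol` on both sides at once; `choices` is an iterator of rule indices."""
    src_rhs, tgt_rhs = RULES[symbol][next(choices)]
    # Expand every linked nonterminal exactly once, in source order, and reuse the
    # same expansion on the target side so the two strings stay synchronized.
    expansions = {sym: derive(sym, choices) for sym in src_rhs if sym in RULES}
    src = [tok for sym in src_rhs
           for tok in (expansions[sym][0] if sym in RULES else [sym])]
    tgt = [tok for sym in tgt_rhs
           for tok in (expansions[sym][1] if sym in RULES else [sym])]
    return src, tgt

# Rule choices in expansion order: S, NP (dog), VP, V, NP (cat).
source, target = derive("S", iter([0, 0, 0, 0, 1]))
print(" ".join(source), "->", " ".join(target))
```

An evaluation in the spirit of the abstract would show a model the grammar and the source string and compare its output against the synchronized target string; growing `RULES` or the derivation depth corresponds to the grammar-size and sentence-length axes the authors vary.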

[110] Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

Qiyao Ma,Dechen Gao,Rui Cai,Boqi Zhao,Hanchu Zhou,Junshan Zhang,Zhe Zhao

Main category: cs.CL

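TL;DR: This paper introduces Personalized RewardBench, a new benchmark for evaluating how well reward models capture personalized preferences. Existing SOTA reward models perform poorly on it (peaking at only 75.94% accuracy), and the benchmark correlates more strongly with downstream performance (such as BoN and PPO).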

Details Motivation: Existing reward model benchmarks focus on general response quality and do not effectively evaluate individual user preferences, so a dedicated benchmark for personalized preference modeling is needed. Method: Builds Personalized RewardBench by generating chosen/rejected response pairs from user-specific rubrics, with human validation confirming that the pairs differ in personal preference rather than general quality; also runs correlation experiments on downstream tasks (BoN, PPO) to validate the benchmark. Result: Existing SOTA reward models peak at only 75.94% accuracy on Personalized RewardBench; the benchmark's correlation with downstream performance under BoN and PPO is significantly higher than that of existing benchmarks. Conclusion: Personalized RewardBench is a reward model benchmark targeting personalized preferences, with high discriminative power and strong downstream predictiveness, providing a reliable evaluation tool for personalized alignment research. Abstract: Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values. While benchmarks for general response quality are prevalent, evaluating how well reward models account for individual user preferences remains an open challenge. To bridge this gap, we introduce Personalized RewardBench, a novel benchmark designed to rigorously assess reward models' capacity to model personalized preferences. We construct chosen and rejected response pairs based on strict adherence to (or violation of) user-specific rubrics, ensuring that preference distinctions are uniquely tailored to the individual. In particular, human evaluations confirm that the primary discriminative factor between pairs is strictly personal preference, with both responses maintaining high general quality (e.g., correctness, relevance and helpfulness). Extensive testing reveals that existing state-of-the-art reward models struggle significantly with personalization, peaking at an accuracy of just 75.94%. Crucially, because an effective reward model benchmark should predict a reward model's performance on downstream tasks, we conduct experiments demonstrating that our benchmark exhibits a significantly higher correlation with downstream performance in both Best-of-N (BoN) sampling and Proximal Policy Optimization (PPO) compared to existing baselines. These findings establish Personalized RewardBench as a robust and accurate proxy for evaluating reward models' performance in downstream applications.
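
The accuracy figure quoted in the abstract is naturally read as pairwise accuracy over chosen/rejected pairs, which the following minimal sketch computes. This reading is an assumption (the abstract does not spell out the metric), and the `keyword_score` function and the two example pairs are hypothetical stand-ins for a real reward model and real benchmark items.

```python
def pairwise_accuracy(pairs, score):
    """Fraction of pairs where the chosen response outscores the rejected one."""
    correct = sum(
        1 for p in pairs
        if score(p["rubric"], p["chosen"]) > score(p["rubric"], p["rejected"])
    )
    return correct / len(pairs)

def keyword_score(rubric, response):
    """Toy stand-in for a reward model: count words shared with the user's rubric."""
    return len(set(rubric.lower().split()) & set(response.lower().split()))

# Hypothetical items: each pairs a user-specific rubric with a chosen and a rejected response.
pairs = [
    {"rubric": "prefers short answers with bullet points",
     "chosen": "short answer with bullet points",
     "rejected": "a long narrative essay"},
    {"rubric": "likes formal tone",
     "chosen": "formal reply",
     "rejected": "formal tone but casual"},   # keyword matching is fooled here
]
accuracy = pairwise_accuracy(pairs, keyword_score)
print(accuracy)
```

The second pair shows how a scorer that keys on surface features can prefer the rejected response, which is the kind of failure a personalization benchmark is meant to expose.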

cs.CV [Back]

[111] CraterBench-R: Instance-Level Crater Retrieval for Planetary Scale

Jichao Fang,Lei Zhang,Michael Phillips,Wei Luo

Main category: cs.CV

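TL;DR: This paper formulates impact crater analysis as an instance-level image retrieval task, introduces the new benchmark CraterBench-R, and designs an efficient instance-token aggregation method that strikes a good balance between accuracy and storage efficiency.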

Details Motivation: Existing deep learning methods mostly treat craters as a detection problem, but real scientific workflows (catalog deduplication, cross-observation matching, morphological analog discovery) are retrieval tasks that lack suitable benchmarks and methods. Method: Builds the CraterBench-R benchmark; evaluates multiple ViT architectures, highlighting the benefits of in-domain self-supervised pretraining and multi-patch-token late-interaction matching; proposes a training-free instance-token aggregation method (K seed tokens, cluster assignment, aggregation) and a two-stage retrieval pipeline. Result: Instance-token aggregation improves mAP by 17.9 points at K=16; at K=64 it matches the accuracy of all 196 tokens with far less storage; the two-stage pipeline recovers 89-94% of full late-interaction accuracy while searching only a small candidate set. Conclusion: Instance-level retrieval better matches the needs of scientific crater analysis; a lightweight, scalable, training-free token aggregation strategy makes planetary-scale retrieval markedly more practical and efficient. Abstract: Impact craters are a cornerstone of planetary surface analysis. However, while most deep learning pipelines treat craters solely as a detection problem, critical scientific workflows such as catalog deduplication, cross-observation matching, and morphological analog discovery are inherently retrieval tasks. To address this, we formulate crater analysis as an instance-level image retrieval problem and introduce CraterBench-R, a curated benchmark featuring about 25,000 crater identities with multi-scale gallery views and manually verified queries spanning diverse scales and contexts. Our baseline evaluations across various architectures reveal that self-supervised Vision Transformers (ViTs), particularly those with in-domain pretraining, dominate the task, outperforming generic models with significantly more parameters. Furthermore, we demonstrate that retaining multiple ViT patch tokens for late-interaction matching dramatically improves accuracy over standard single-vector pooling. However, storing all tokens per image is operationally inefficient at a planetary scale. To close this efficiency gap, we propose instance-token aggregation, a scalable, training-free method that selects K seed tokens, assigns the remaining tokens to these seeds via cosine similarity, and aggregates each cluster into a single representative token. This approach yields substantial gains: at K=16, aggregation improves mAP by 17.9 points over raw token selection, and at K=64, it matches the accuracy of using all 196 tokens with significantly less storage.
Finally, we demonstrate that a practical two-stage pipeline, with single-vector shortlisting followed by instance-token reranking, recovers 89-94% of the full late-interaction accuracy while searching only a small candidate set. The benchmark is publicly available at hf.co/datasets/jfang/CraterBench-R.
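
The instance-token aggregation steps listed in the abstract (select K seed tokens, assign the remaining tokens to seeds by cosine similarity, aggregate each cluster into one representative) can be sketched as follows. Two details are assumptions made for illustration: the seeds are passed in as given indices because the abstract does not say how they are selected, and each cluster is aggregated by a plain mean.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def aggregate_tokens(tokens, seed_ids):
    """Collapse a list of patch tokens into len(seed_ids) representative tokens."""
    clusters = {s: [tokens[s]] for s in seed_ids}      # each seed starts its own cluster
    for i, tok in enumerate(tokens):
        if i in clusters:
            continue                                   # seeds are already placed
        nearest = max(seed_ids, key=lambda s: cosine(tok, tokens[s]))
        clusters[nearest].append(tok)
    # Mean-pool each cluster dimension by dimension (assumed aggregation rule).
    return [
        [sum(dim) / len(clusters[s]) for dim in zip(*clusters[s])]
        for s in seed_ids
    ]

# Four toy 2-D "patch tokens" reduced to K=2 representatives.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2], [0.2, 0.8]]
representatives = aggregate_tokens(tokens, seed_ids=[0, 1])
print(representatives)
```

In the setting the abstract describes, `tokens` would be the 196 ViT patch tokens of one image and K would be 16 or 64, so the stored representation shrinks accordingly while keeping multi-vector structure for late-interaction matching.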

[112] No-reference based automatic parameter optimization for iterative reconstruction using a novel search space aware crow search algorithm

Poorya MohammadiNasab,Ander Biguri,Philipp Steininger,Peter Keuschnigg,Lukas Lamminger,Agnieszka Lach,S M Ragib Shahriar Islam,Anna Breger,Clemens Karner,Carola-Bibiane Schönlieb,Wolfgang Birkfellner,Sepideh Hatamikia

Main category: cs.CV

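TL;DR: This paper proposes a fully automatic parameter optimization framework for CBCT iterative reconstruction algorithms that determines optimal hyperparameters without a reference image, significantly improving reconstruction quality and reducing manual tuning effort.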

Details Motivation: Iterative reconstruction can reduce radiation dose, but its performance depends heavily on manual parameter tuning, which is time-consuming and error-prone; an automatic, robust, reference-free optimization scheme is needed. Method: Proposes a modified crow search algorithm (CSA) with a superior-set-based local search, a search-space-aware global strategy, and an objective-driven balance between local and global search, plus a chaotic diagonal linear uniform initialization scheme to accelerate convergence. Result: Validated on three machines, four real datasets, and three reconstruction algorithms with high-dimensional parameter spaces; average fitness improves by 4.19%, and the no-reference quality metrics CHILL@UK and RPI_AXIS improve by 4.89% and 3.82% respectively, with better preservation of fine detail. Conclusion: The proposed fully automatic framework is effective and robust across machines and algorithms, offering a practical and reliable approach to parameter optimization for clinical CBCT iterative reconstruction. Abstract: The ability of iterative reconstruction techniques to reduce radiation exposure by using fewer projections has attracted significant attention. However, these methods typically require a precise tuning of several hyperparameters, which can have a major impact on reconstruction quality. Manually setting these parameters is time-consuming and increases the workload for human operators. In this paper, we introduce a novel fully automatic parameter optimization framework that can be applied to a wide range of Cone-beam computed tomography (CBCT) iterative reconstruction algorithms to determine optimal parameters without requiring a reference reconstruction. The proposed method incorporates a modified crow search algorithm (CSA) featuring a superior set-dependent local search mechanism, a search-space-aware global search strategy, and an objective-driven balance between local and global search. Additionally, to ensure an effective initial population, we propose a chaotic diagonal linear uniform initialization scheme that accelerates algorithm convergence. The performance of the proposed framework was evaluated on three imaging machines and four real datasets, as well as three different iterative reconstruction methods with the highest number of tunable parameters, representing the most challenging scenario. The results indicate that the proposed method could outperform manual settings and CSA, with a 4.19% improvement in average fitness and 4.89% and 3.82% improvements on CHILL@UK and RPI_AXIS, respectively, which are two benchmark no-reference learning-based quality metrics.
In addition, the qualitative results show that the proposed method preserves fine details more sharply. The overall performance of the proposed framework across the different comparison scenarios demonstrates its effectiveness and robustness.

[113] DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs

Dikshant Kukreja,Kshitij Sah,Karan Goyal,Mukesh Mohania,Vikram Goyal

Main category: cs.CV

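TL;DR: This paper introduces the DISSECT benchmark for diagnosing the gap between perception and integration in vision-language models (VLMs) on scientific image understanding. Open-source models do better when reasoning from their own generated text descriptions, while closed-source models show no such gap, indicating that integration is the key current bottleneck.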

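Details Motivation: Existing single-configuration benchmarks conflate perception with integration and cannot reveal cases where a VLM correctly identifies image content (such as "a benzene ring with an -OH group") yet fails in subsequent reasoning, the "perception-integration gap". Method: Builds the 12,000-question DISSECT diagnostic benchmark (7,000 Chemistry, 5,000 Biology) with five input modes (Vision+Text, Text-Only, Vision-Only, Human Oracle, Model Oracle), and uses performance gaps between modes to separate language-prior exploitation, visual extraction, perception fidelity, and integration effectiveness. Result: Across 18 VLMs: (1) Chemistry is harder to solve from language priors and is a stronger test of genuine visual reasoning; (2) open-source models score higher in Model Oracle mode than in Vision+Text mode, exposing a systematic integration bottleneck; (3) closed-source models show no such gap, suggesting they have largely bridged perception and integration. Conclusion: The disconnect between perception and integration is a core weakness of current VLMs; the Model Oracle protocol is a general diagnostic tool that can be applied post hoc and should become part of multimodal model evaluation. Abstract: When asked to describe a molecular diagram, a Vision-Language Model correctly identifies "a benzene ring with an -OH group." When asked to reason about the same image, it answers incorrectly. The model can see but it cannot think about what it sees. We term this the perception-integration gap: a failure where visual information is successfully extracted but lost during downstream reasoning, invisible to single-configuration benchmarks that conflate perception with integration under one accuracy number. To systematically expose such failures, we introduce DISSECT, a 12,000-question diagnostic benchmark spanning Chemistry (7,000) and Biology (5,000). Every question is evaluated under five input modes -- Vision+Text, Text-Only, Vision-Only, Human Oracle, and a novel Model Oracle in which the VLM first verbalizes the image and then reasons from its own description -- yielding diagnostic gaps that decompose performance into language-prior exploitation, visual extraction, perception fidelity, and integration effectiveness.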
Evaluating 18~VLMs, we find that: (1) Chemistry exhibits substantially lower language-prior exploitability than Biology, confirming molecular visual content as a harder test of genuine visual reasoning; (2) Open-source models consistently score higher when reasoning from their own verbalized descriptions than from raw images, exposing a systematic integration bottleneck; and (3) Closed-source models show no such gap, indicating that bridging perception and integration is the frontier separating open-source from closed-source multimodal capability. The Model Oracle protocol is both model and benchmark agnostic, applicable post-hoc to any VLM evaluation to diagnose integration failures.
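The integration gap described above can be illustrated with a small sketch. The abstract does not give the exact formulas for its diagnostic gaps, so the definition below (Model Oracle accuracy minus Vision+Text accuracy), the dictionary keys, and the accuracy values are all assumptions chosen for illustration only.

```python
# Hypothetical sketch: one plausible way to compute an "integration gap".
# The key names and accuracy numbers are made up for illustration; they are
# not taken from the DISSECT paper.

def integration_gap(acc_by_mode: dict) -> float:
    """Model Oracle accuracy minus Vision+Text accuracy (assumed definition)."""
    return acc_by_mode["model_oracle"] - acc_by_mode["vision_text"]

hypothetical_acc = {"vision_text": 0.52, "model_oracle": 0.61}
gap = integration_gap(hypothetical_acc)

# A positive gap means the model reasons better from its own description
# than from the image itself, which is the pattern reported for open-source models.
print(f"integration gap: {gap:+.2f}")
```

Under this assumed definition, a gap near zero would correspond to the behavior the abstract reports for closed-source models.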

[114] Telescope: Learnable Hyperbolic Foveation for Ultra-Long-Range Object Detection

Parker Ewen,Dmitriy Rivkin,Mario Bijelic,Felix Heide

Main category: cs.CV

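TL;DR: This paper proposes Telescope, a model that tackles small-object detection at ultra-long range (beyond 250 meters) with a novel re-sampling layer and image transformation. It achieves a 76% relative improvement in long-range mAP and targets high-speed autonomous heavy trucks.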

Details Motivation: Existing object detectors fail at ultra-long range beyond 500 meters because targets occupy only a few pixels; commercial LiDAR is limited by resolution that falls off quadratically with distance and cannot meet ultra-long-range requirements, so a more practical image-based detection approach is needed. Method: Proposes Telescope, a two-stage detection model comprising a strong detection backbone, a novel re-sampling layer, and an image transformation, designed for detecting small objects at ultra-long range. Result: Beyond 250 meters, mAP improves from 0.185 to 0.326 (a 76% relative improvement) with minimal computational overhead, and performance remains strong across all ranges. Conclusion: Telescope is an effective image-based detection approach for ultra-long-range autonomous driving, outperforming current SOTA methods in accuracy, efficiency, and generalization. Abstract: Autonomous highway driving, especially for long-haul heavy trucks, requires detecting objects at long ranges beyond 500 meters to satisfy braking distance requirements at high speeds. At long distances, vehicles and other critical objects occupy only a few pixels in high-resolution images, causing state-of-the-art object detectors to fail. This challenge is compounded by the limited effective range of commercially available LiDAR sensors, which fall short of ultra-long range thresholds because of quadratic loss of resolution with distance, making image-based detection the most practically scalable solution given commercially available sensor constraints. We introduce Telescope, a two-stage detection model designed for ultra-long range autonomous driving. Alongside a powerful detection backbone, this model contains a novel re-sampling layer and image transformation to address the fundamental challenges of detecting small, distant objects. Telescope achieves 76% relative improvement in mAP in ultra-long range detection compared to state-of-the-art methods (improving from an absolute mAP of 0.185 to 0.326 at distances beyond 250 meters), requires minimal computational overhead, and maintains strong performance across all detection ranges.
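The 76% figure follows from the two absolute mAP values quoted in the abstract. A short check, where the function name is only an illustrative choice:

```python
# Relative improvement computed from the absolute mAP values in the abstract.

def relative_improvement(baseline: float, improved: float) -> float:
    """Fractional gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline

baseline_map = 0.185   # state-of-the-art mAP beyond 250 m, per the abstract
telescope_map = 0.326  # Telescope mAP beyond 250 m, per the abstract

gain = relative_improvement(baseline_map, telescope_map)
print(f"{gain:.1%}")  # prints 76.2%
```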

[115] Evolution of Video Generative Foundations

Teng Hu,Jiangning Zhang,Hongrui Huang,Ran Yi,Zihan Su,Jieyu Weng,Zhucun Xue,Lizhuang Ma,Ming-Hsuan Yang,Dacheng Tao

Main category: cs.CV

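TL;DR: This survey of video generation systematically traces the field from early GANs through the now-dominant diffusion models to emerging autoregressive (AR) and multimodal methods, analyzing their principles, progress, and trade-offs. It also discusses the trend toward multimodal fusion and looks ahead to applications in VR/AR, education, and autonomous driving simulation.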

Details Motivation: Existing surveys of video generation mostly focus on a single technique (such as GANs or diffusion models) or a specific task (such as video editing), and lack a systematic account of the field's overall evolution, especially of autoregressive models and multimodal fusion. Method: Conducts a systematic literature review that divides development into stages by time and technical approach, compares the foundational principles, key advances, and strengths and weaknesses of each paradigm (GAN, diffusion, AR, multimodal), and summarizes emerging trends and application scenarios. Result: Builds a structured map of the main line of video generation development, clarifies the applicable scope and limitations of each model paradigm, identifies multimodal joint modeling as a key direction, and points out its potential in world models, virtual reality, and intelligent education. Conclusion: Video generation is entering a new stage characterized by multimodal perception and autoregressive temporal modeling; future research needs stronger cross-modal alignment, long-range temporal consistency modeling, and physical plausibility constraints to support more reliable and scalable world simulation. Abstract: The rapid advancement of Artificial Intelligence Generated Content (AIGC) has revolutionized video generation, enabling systems ranging from proprietary pioneers like OpenAI's Sora, Google's Veo3, and Bytedance's Seedance to powerful open-source contenders like Wan and HunyuanVideo to synthesize temporally coherent and semantically rich videos. These advancements pave the way for building "world models" that simulate real-world dynamics, with applications spanning entertainment, education, and virtual reality. However, existing reviews on video generation often focus on narrow technical fields, e.g., Generative Adversarial Networks (GAN) and diffusion models, or specific tasks (e.g., video editing), lacking a comprehensive perspective on the field's evolution, especially regarding Auto-Regressive (AR) models and integration of multimodal information. To address these gaps, this survey firstly provides a systematic review of the development of video generation technology, tracing its evolution from early GANs to dominant diffusion models, and further to emerging AR-based and multimodal techniques. We conduct an in-depth analysis of the foundational principles, key advancements, and comparative strengths/limitations. Then, we explore emerging trends in multimodal video generation, emphasizing the integration of diverse data types to enhance contextual awareness. 
Finally, by bridging historical developments and contemporary innovations, this survey offers insights to guide future research in video generation and its applications, including virtual/augmented reality, personalized education, autonomous driving simulations, digital entertainment, and advanced world models, in this rapidly evolving field. For more details, please refer to the project at https://github.com/sjtuplayer/Awesome-Video-Foundations.

[116] Evidence-Based Actor-Verifier Reasoning for Echocardiographic Agents

Peng Huang,Yiming Wang,Yineng Chen,Liangqiao Gui,Hui Guo,Bo Peng,Shu Hu,Xi Wu,Tsao Connie,Hongtu Zhu,Balakrishnan Prabhakaran,Xin Wang

Main category: cs.CV

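TL;DR: This paper proposes EchoTrust, a framework that uses an evidence-driven Actor-Verifier architecture to make reasoning by echocardiography vision-language models more trustworthy in clinical decision-making.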

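Details Motivation: Existing methods map video question answering directly to an answer, which leaves them vulnerable to template shortcuts and spurious explanations and falls short of the reliability and interpretability that high-stakes clinical applications require. Method: Proposes the EchoTrust framework, an evidence-driven Actor-Verifier architecture that generates a structured intermediate representation, which is then handled by distinct roles to give an interpretable and verifiable reasoning process. Result: Improves the robustness and trustworthiness of echocardiography VLM analysis under complex cardiac dynamics and view heterogeneity, and strengthens the reliability and interpretability of clinical decision support systems. Conclusion: EchoTrust offers a new paradigm for building trustworthy, interpretable VLM systems for medical imaging, particularly for high-stakes clinical settings. Abstract: Echocardiography plays an important role in the screening and diagnosis of cardiovascular diseases. However, automated intelligent analysis of echocardiographic data remains challenging due to complex cardiac dynamics and strong view heterogeneity. In recent years, visual language models (VLM) have opened a new avenue for building ultrasound understanding systems for clinical decision support. Nevertheless, most existing methods formulate this task as a direct mapping from video and question to answer, making them vulnerable to template shortcuts and spurious explanations. To address these issues, we propose EchoTrust, an evidence-driven Actor-Verifier framework for trustworthy reasoning in echocardiography VLM-based agents. EchoTrust produces a structured intermediate representation that is subsequently analyzed by distinct roles, enabling more reliable and interpretable decision-making for high-stakes clinical applications.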

[117] DietDelta: A Vision-Language Approach for Dietary Assessment via Before-and-After Images

Gautham Vinod,Siddeshwar Raghavan,Bruce Coburn,Fengqing Zhu

Main category: cs.CV

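TL;DR: This paper proposes a simple vision-language framework that performs food-item-level nutritional analysis from paired RGB images taken before and after eating, without depth sensing or multi-view imagery. It localizes food items via natural language prompts, estimates their weight, and uses two-stage training to predict the weight difference and thus the amount actually consumed.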

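Details Motivation: Most existing image-based dietary assessment methods rely on a single pre-consumption image, provide only coarse meal-level estimates, cannot determine what was actually consumed, and often require restrictive inputs such as depth sensing, multi-view imagery, or explicit segmentation. Method: Proposes a vision-language framework that uses paired before-and-after RGB images; localizes specific food items with natural language prompts instead of rigid segmentation masks and estimates their weight directly from a single RGB image; and uses a two-stage training strategy to predict the weight difference between paired images as an estimate of food consumption. Result: Evaluated on three public datasets, the method consistently outperforms existing approaches and establishes a strong baseline for before-and-after dietary image analysis. Conclusion: The framework removes the dependence of traditional methods on complex hardware or manual annotation, enables more flexible and accurate food-item-level nutritional analysis, and advances the practical use of dietary assessment in precision nutrition. Abstract: Accurate dietary assessment is critical for precision nutrition, yet most image-based methods rely on a single pre-consumption image and provide only coarse, meal-level estimates. These approaches cannot determine what was actually consumed and often require restrictive inputs such as depth sensing, multi-view imagery, or explicit segmentation. In this paper, we propose a simple vision-language framework for food-item-level nutritional analysis using paired before-and-after eating images. Instead of relying on rigid segmentation masks, our method leverages natural language prompts to localize specific food items and estimate their weight directly from a single RGB image. We further estimate food consumption by predicting weight differences between paired images using a two-stage training strategy. We evaluate our method on three publicly available datasets and demonstrate consistent improvements over existing approaches, establishing a strong baseline for before-and-after dietary image analysis.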

[118] MTA-Agent: An Open Recipe for Multimodal Deep Search Agents

Xiangyu Peng,Can Qin,An Yan,Xinyi Yang,Zeyuan Chen,Ran Xu,Chien-Sheng Wu

Main category: cs.CV

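TL;DR: This paper proposes a multi-hop tool-augmented vision-language agent (MTA-Agent) and uses it to build MTA-Vision-DeepSearch, a high-quality, verified multi-hop vision-language training dataset (21K examples) that substantially improves multimodal LLMs on complex multi-step reasoning and cross-modal evidence integration. The resulting 32B open-source agent reaches 54.63% average accuracy on six benchmarks, surpassing closed-source models such as GPT-5 and Gemini, and all data, trajectories, and implementation details are released.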

Details Motivation: Existing multimodal large language models (MLLMs) remain clearly limited on complex multi-step reasoning tasks that require deep search and the fusion of visual evidence with external knowledge. Method: Proposes the MTA-Agent framework, which automatically selects and calls tools to retrieve and validate evidence from visual and textual sources and generates structured multi-hop question-answer trajectories; starting from diverse VQA seed data, builds the large-scale, multi-stage-verified training dataset MTA-Vision-DeepSearch; supports training by replaying cached interactions offline to reduce cost. Result: The 32B open-source multimodal search agent trained on this data averages 54.63% on six challenging benchmarks, surpassing GPT-5 (51.86%), Gemini-2.5-Pro (50.98%), and Gemini-3-Pro (54.46%); the average number of reasoning steps rises from 2.27 to 4.28, with more systematic and persistent search strategies; training cost is substantially reduced. Conclusion: MTA-Agent provides a fully open recipe for multimodal deep search, including the dataset, training trajectories, and full implementation, supporting open and reproducible research on multimodal search agents. Abstract: Multimodal large language models (MLLMs) have demonstrated strong capabilities in visual understanding, yet they remain limited in complex, multi-step reasoning that requires deep searching and integrating visual evidence with external knowledge. In this work, we address this challenge by constructing high-quality, verified multi-hop vision-language training data for multimodal deep-search agents. We propose a Multi-hop Tool-Augmented Agent for Evidence-based QA Synthesis (MTA-Agent), which automatically selects tools and their parameters to retrieve and validate evidence from both visual and textual sources and generates structured multi-hop question-answer trajectories. Starting from diverse VQA seed datasets, our pipeline produces a large-scale training dataset, MTA-Vision-DeepSearch, containing 21K high-quality multi-hop examples. The data is filtered through a multi-stage verification process to ensure factual consistency and answer uniqueness. Using MTA-Vision-DeepSearch, a 32B open-source multimodal search agent achieves state-of-the-art performance, reaching an average of 54.63% across six challenging benchmarks, outperforming GPT-5 (51.86%), Gemini-2.5-Pro (50.98%), and Gemini-3-Pro (54.46%) under the same tool settings. We further show that training on our data improves both reasoning depth and tool-use behavior, increasing the average number of steps from 2.27 to 4.28, and leading to more systematic and persistent search strategies. 
Additionally, we demonstrate that training can be performed without real-time tool calls by replaying cached interactions, significantly reducing training cost. Importantly, we present MTA-Agent as a fully open recipe for multimodal deep search: we release the entire dataset, training trajectories, and implementation details to enable reproducibility and future research on open multimodal search agents.

[119] MorphDistill: Distilling Unified Morphological Knowledge from Pathology Foundation Models for Colorectal Cancer Survival Prediction

Hikmat Khan,Usama Sajjad,Metin N. Gurcan,Anil Parwani,Wendy L. Frankel,Wei Chen,Muhammad Khalid Khan Niazi

Main category: cs.CV

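TL;DR: This paper proposes MorphDistill, a two-stage knowledge distillation framework that consolidates knowledge from multiple pathology foundation models into a compact colorectal cancer (CRC)-specific encoder, improving five-year survival prediction.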

Details Motivation: Existing pathology foundation models often overlook organ-specific features of colorectal cancer, which limits survival prediction accuracy and calls for representation learning tailored to CRC. Method: MorphDistill uses a two-stage strategy: in Stage I, a student encoder is trained on large-scale CRC data with dimension-agnostic multi-teacher relational distillation and supervised contrastive regularization; in Stage II, attention-based multiple instance learning predicts survival from whole-slide images. Result: On the Alliance/CALGB 89803 cohort, AUC reaches 0.68 (about 8% above the best baseline), with a C-index of 0.661 and HR = 2.52; on the external TCGA cohort the C-index reaches 0.628, showing strong generalization. Conclusion: By fusing knowledge from multiple models, MorphDistill achieves task-specific representation learning and offers an efficient approach to prognostic modeling in computational pathology, with potential to extend to other tumor types. Abstract: Background: Colorectal cancer (CRC) remains a leading cause of cancer-related mortality worldwide. Accurate survival prediction is essential for treatment stratification, yet existing pathology foundation models often overlook organ-specific features critical for CRC prognostication. Methods: We propose MorphDistill, a two-stage framework that distills complementary knowledge from multiple pathology foundation models into a compact CRC-specific encoder. In Stage I, a student encoder is trained using dimension-agnostic multi-teacher relational distillation with supervised contrastive regularization on large-scale colorectal datasets. This preserves inter-sample relationships from ten foundation models without explicit feature alignment. In Stage II, the encoder extracts patch-level features from whole-slide images, which are aggregated via attention-based multiple instance learning to predict five-year survival. Results: On the Alliance/CALGB 89803 cohort (n=424, stage III CRC), MorphDistill achieves an AUC of 0.68 (SD 0.08), an approximately 8% relative improvement over the strongest baseline (AUC 0.63). It also attains a C-index of 0.661 and a hazard ratio of 2.52 (95% CI: 1.73-3.65), outperforming all baselines. On an external TCGA cohort (n=562), it achieves a C-index of 0.628, demonstrating strong generalization across datasets and robustness across clinical subgroups. Conclusion: MorphDistill enables task-specific representation learning by integrating knowledge from multiple foundation models into a unified encoder. 
This approach provides an efficient strategy for prognostic modeling in computational pathology, with potential for broader oncology applications. Further validation across additional cohorts and disease stages is warranted.

[120] Continual Visual Anomaly Detection on the Edge: Benchmark and Efficient Solutions

Manuel Barusco,Francesco Borsatti,David Petrovic,Davide Dalle Pezze,Gian Antonio Susto

Main category: cs.CV

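TL;DR: This paper presents the first continual learning benchmark for visual anomaly detection (VAD) on edge devices, and proposes the lightweight model Tiny-Dinomaly together with efficiency optimizations for PatchCore and PaDiM, combining low resource consumption with continual adaptation.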

Details Motivation: Existing VAD research has not addressed edge deployment (constrained resources) and continual learning (avoiding catastrophic forgetting) together, and studying them in isolation does not reflect the real performance trade-offs under both constraints. Method: Builds a comprehensive benchmark covering seven VAD methods and three lightweight backbones; proposes Tiny-Dinomaly, a lightweight DINO-based model; and makes targeted modifications to PatchCore and PaDiM for the continual learning setting. Result: Tiny-Dinomaly achieves a 13x smaller memory footprint and 20x lower computational cost while improving Pixel F1 by 5 percentage points; the modified PatchCore and PaDiM are markedly more efficient under continual learning. Conclusion: Benchmarks and methods that jointly consider edge efficiency and continual adaptability are essential, and approaches such as Tiny-Dinomaly offer a feasible path to real deployment. Abstract: Visual Anomaly Detection (VAD) is a critical task for many applications including industrial inspection and healthcare. While VAD has been extensively studied, two key challenges remain largely unaddressed in conjunction: edge deployment, where computational resources are severely constrained, and continual learning, where models must adapt to evolving data distributions without forgetting previously acquired knowledge. Our benchmark provides guidance for the selection of the optimal backbone and VAD method under joint efficiency and adaptability constraints, characterizing the trade-offs between memory footprint, inference cost, and detection performance. Studying these challenges in isolation is insufficient, as methods designed for one setting make assumptions that break down when the other constraint is simultaneously imposed. In this work, we propose the first comprehensive benchmark for VAD on the edge in the continual learning scenario, evaluating seven VAD models across three lightweight backbone architectures. Furthermore, we propose Tiny-Dinomaly, a lightweight adaptation of the Dinomaly model built on the DINO foundation model that achieves 13x smaller memory footprint and 20x lower computational cost while improving Pixel F1 by 5 percentage points. Finally, we introduce targeted modifications to PatchCore and PaDiM to improve their efficiency in the continual learning setting.

[121] Visual prompting reimagined: The power of the Activation Prompts

Yihua Zhang,Hongkang Li,Yuguang Yao,Aochuan Chen,Shuai Zhang,Pin-Yu Chen,Meng Wang,Sijia Liu

Main category: cs.CV

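TL;DR: This paper proposes a new visual prompting method, the activation prompt (AP), which extends universal perturbations from the input layer to activation maps in intermediate layers and thereby overcomes the performance bottleneck of conventional visual prompting (VP). Theory and experiments show that AP outperforms VP and several parameter-efficient fine-tuning baselines in accuracy and efficiency (time, parameters, memory, throughput).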

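Details Motivation: Existing visual prompting (VP) methods leave model parameters unchanged but show a clear performance gap relative to conventional fine-tuning; their theoretical mechanism and room for optimization remain unclear, so input-level prompting needs to be better understood and improved. Method: Introduces the activation prompt (AP), which applies universal perturbations to activation maps in the model's intermediate layers rather than only to the input image; relates AP to normalization tuning, examines which layers different models (CNN/ViT) prefer for prompting, and explains the preference through a theoretical analysis of global features across layers; systematically evaluates AP on 29 datasets and multiple architectures. Result: AP surpasses VP and mainstream parameter-efficient fine-tuning methods in accuracy, time, trainable parameters, memory usage, and throughput; CNNs and ViTs have different optimal prompt layers; AP is closely related to normalization tuning. Conclusion: AP is a more general and more efficient prompting paradigm than input-level visual prompting; its advantage comes from acting directly on intermediate-layer representations, with a layer preference that depends on model architecture; the work offers a new perspective and a practical tool for understanding and improving prompt learning. Abstract: Visual prompting (VP) has emerged as a popular method to repurpose pretrained vision models for adaptation to downstream tasks. Unlike conventional model fine-tuning techniques, VP introduces a universal perturbation directly into the input data to facilitate task-specific fine-tuning rather than modifying model parameters. However, there exists a noticeable performance gap between VP and conventional fine-tuning methods, highlighting an unexplored realm in theory and practice to understand and advance the input-level VP to reduce its current performance gap. Towards this end, we introduce a generalized concept, termed activation prompt (AP), which extends the scope of the input-level VP by enabling universal perturbations to be applied to activation maps within the intermediate layers of the model. By using AP to revisit the problem of VP and employing it as an analytical tool, we demonstrate the intrinsic limitations of VP in both performance and efficiency, revealing why input-level prompting may lack effectiveness compared to AP, which exhibits a model-dependent layer preference. We show that AP is closely related to normalization tuning in convolutional neural networks and vision transformers, although each model type has distinct layer preferences for prompting. We also theoretically elucidate the rationale behind such a preference by analyzing global features across layers. 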
Through extensive experiments across 29 datasets and various model architectures, we provide a comprehensive performance analysis of AP, comparing it with VP and parameter-efficient fine-tuning baselines. Our results demonstrate AP's superiority in both accuracy and efficiency, considering factors such as time, parameters, memory usage, and throughput.

[122] PhysHead: Simulation-Ready Gaussian Head Avatars

Berna Kabadayi,Vanessa Sklyarova,Wojciech Zielonka,Justus Thies,Gerard Pons-Moll

Main category: cs.CV

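TL;DR: This paper proposes PhysHead, a hybrid method that combines a layered 3D Gaussian representation with physics-driven strand-based hair modeling to produce animatable head avatars with realistic hair dynamics.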

Details Motivation: Existing head avatar methods treat hair as a rigid shell, so they cannot disentangle hair from the head or model its natural volume and dynamic behavior. Method: Proposes PhysHead, a layered 3D Gaussian-based representation (a parametric head mesh plus strand-based hair that can be physically simulated), with appearance modeled by Gaussian primitives attached to the head mesh and hair strands; uses a VLM to generate the appearance of regions occluded in the dynamic sequences. Result: In quantitative and qualitative experiments, PhysHead synthesizes physically plausible hair motion (such as wind-blown hair) while supporting expression and camera control, and compares favorably with existing baselines. Conclusion: PhysHead disentangles the head from dynamic hair, overcomes the rigid-hair limitation, and offers a new paradigm for realistic, drivable head avatars. Abstract: Realistic digital avatars require expressive and dynamic hair motion; however, most existing head avatar methods assume rigid hair movement. These methods often fail to disentangle hair from the head, representing it as a simple outer shell and failing to capture its natural volumetric behavior. In this paper, we address these limitations by introducing PhysHead, a hybrid representation for animatable head avatars with realistic hair dynamics learned from multi-view video. At the core is a 3D Gaussian-based layered representation of the head. Our approach combines a 3D parametric mesh for the head with strand-based hair, which can be directly simulated using physics engines. For the appearance model, we employ Gaussian primitives attached to both the head mesh and hair segments. This representation enables the creation of photorealistic head avatars with dynamic hair behavior, such as wind-blown motion, overcoming the constraints of rigid hair in existing methods. However, these animation capabilities also require new training schemes. In particular, we propose the use of VLM-based models to generate appearance of regions that are occluded in the dynamic training sequences. In quantitative and qualitative studies, we demonstrate the capabilities of the proposed model and compare it with existing baselines. We show that our method can synthesize physically plausible hair motion besides expression and camera control.

[123] Predicting Alzheimer's disease progression using rs-fMRI and a history-aware graph neural network

Mahdi Moghaddami,Mohammad-Reza Siadat,Austin Toma,Connor Laming,Huirong Fu

Main category: cs.CV

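TL;DR: This paper proposes a model that combines a graph neural network (GNN) with a recurrent neural network (RNN) and uses functional connectivity graphs built from rs-fMRI to predict progression between stages of cognitive impairment in Alzheimer's disease (AD), reaching 68.8% accuracy on CN-to-MCI conversions.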

Details Motivation: Alzheimer's disease has no cure, but early intervention can slow its progression; effective methods for predicting how an individual's cognitive state evolves are lacking, especially with irregular follow-up intervals and missing data. Method: Builds per-subject functional connectivity graphs from rs-fMRI and designs a GNN model with an RNN block that encodes the time gap between visits as an input feature, allowing it to handle irregular time series and missing follow-up visits. Result: The model reaches an overall accuracy of 82.9% and 68.8% accuracy on CN-to-MCI conversions, and is robust to missing visits. Conclusion: Functional connectivity graphs derived from rs-fMRI, combined with temporal GNN modeling, can effectively predict progression of cognitive impairment; the method could support early intervention and be extended to multimodal frameworks. Abstract: Alzheimer's disease (AD) is a neurodegenerative disorder that affects more than seven million people in the United States alone. AD currently has no cure, but there are ways to potentially slow its progression if caught early enough. In this study, we propose a graph neural network (GNN)-based model for predicting whether a subject will transition to a more severe stage of cognitive impairment at their next clinical visit. We consider three stages of cognitive impairment in order of severity: cognitively normal (CN), mild cognitive impairment (MCI), and AD. We use functional connectivity graphs derived from resting-state functional magnetic resonance imaging (rs-fMRI) scans of 303 subjects, each with a different number of visits. Our GNN-based model incorporates a recurrent neural network (RNN) block, enabling it to process data from the subject's entire visit history. It can also work with irregular time gaps between visits by incorporating visit distance information into our input features. Our model demonstrates robust predictive performance, even with missing visits in the subjects' visit histories. It achieves an accuracy of 82.9%, with an especially impressive accuracy of 68.8% on CN to MCI conversions - a task that poses a substantial challenge in the field. Our results highlight the effectiveness of rs-fMRI in predicting the onset of MCI or AD and, in conjunction with other modalities, could offer a viable method for enabling timely interventions to slow the progression of cognitive impairment.

[124] Hybrid ResNet-1D-BiGRU with Multi-Head Attention for Cyberattack Detection in Industrial IoT Environments

Afrah Gueriani,Hamza Kheddar,Ahmed Cherif Mazari

Main category: cs.CV

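TL;DR: This paper proposes a hybrid deep learning model combining ResNet-1D, BiGRU, and multi-head attention for intrusion detection in the Industrial IoT (IIoT), and reports high accuracy, low latency, and strong generalization on the EdgeHoTset and CICIoV2024 datasets.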

Details Motivation: Addresses the problems facing real-time intrusion detection in the Industrial IoT (IIoT): inadequate spatial-temporal feature extraction, class imbalance, and weak model generalization. Method: Builds a hybrid ResNet-1D-BiGRU-MHA model that combines one-dimensional residual convolutions, bidirectional gated recurrent units, and multi-head attention; applies SMOTE to handle class imbalance in the training data; trains and validates across the EdgeHoTset and CICIoV2024 datasets. Result: On EdgeHoTset: 98.71% accuracy, 0.0417% loss, and 0.0001 sec/instance inference latency; on CICIoV2024: 99.99% accuracy and F1-score, a loss of 0.0028, 0% false positive rate, and 0.00014 sec/instance latency; the model outperforms existing methods across the board. Conclusion: The hybrid model combines efficient spatial-temporal modeling, attention-based feature weighting, and good real-time performance and generalization, making it suitable for deployed IIoT security systems. Abstract: This study introduces a hybrid deep learning model for intrusion detection in Industrial IoT (IIoT) systems, combining ResNet-1D, BiGRU, and Multi-Head Attention (MHA) for effective spatial-temporal feature extraction and attention-based feature weighting. To address class imbalance, SMOTE was applied during training on the EdgeHoTset dataset. The model achieved 98.71% accuracy, a loss of 0.0417%, and low inference latency (0.0001 sec/instance), demonstrating strong real-time capability. To assess generalizability, the model was also tested on the CICIoV2024 dataset, where it reached 99.99% accuracy and F1-score, with a loss of 0.0028, 0% FPR, and 0.00014 sec/instance inference time. Across all metrics and datasets, the proposed model outperformed existing methods, confirming its robustness and effectiveness for real-time IoT intrusion detection.
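The abstract says SMOTE was applied to the training data but gives no settings. As a rough illustration of the core SMOTE step (interpolating between a minority sample and one of its nearest minority neighbours), here is a minimal pure-Python sketch. The function name, the value of `k`, and the toy points are assumptions for illustration and are not the authors' configuration.

```python
import random

# Minimal sketch of SMOTE's interpolation step on toy data.
# `smote_sample`, k=2, and the points below are illustrative assumptions.

def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    rng = rng or random.Random(0)
    x = rng.choice(minority)
    others = [p for p in minority if p is not x]
    # k nearest minority neighbours of x by squared Euclidean distance
    neighbours = sorted(
        others, key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p))
    )[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbour)]

minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote_sample(minority_points, k=2, rng=random.Random(42))
print(synthetic)
```

Because the synthetic point lies on the segment between two real minority samples, it stays inside the region those samples already occupy. In practice a maintained implementation would normally be used rather than a hand-written one.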

[125] DesigNet: Learning to Draw Vector Graphics as Designers Do

Tomas Guija-Valiente,Iago Suárez

Main category: cs.CV

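TL;DR: This paper proposes DesigNet, a hierarchical Transformer-VAE for SVG generation. By adding differentiable continuity and alignment self-refinement modules, it narrows the gap between AI generation and professional design tools and makes the outputs more editable and usable for designers.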

Details Motivation: Neural networks and human designers create in fundamentally different ways; for vector graphics such as SVG in particular, models lack support for common design conventions (such as axis alignment and curve continuity), so generated results are hard to bring into professional workflows. Method: Proposes DesigNet, a hierarchical Transformer-VAE over SVG sequences, with two differentiable modules: a continuity self-refinement module (supporting C⁰/G¹/C¹ continuity by adjusting Bézier control points) and an alignment self-refinement module (supporting horizontal/vertical snapping). Result: Achieves results competitive with state-of-the-art SVG generation methods, with notably higher continuity and alignment accuracy; the outputs are editable vector outlines that designers can more easily modify and integrate. Conclusion: Modeling design conventions explicitly as differentiable geometric constraints improves how well generative models fit human design practice and offers a new paradigm for AI-assisted design. Abstract: AI-driven content generation has made remarkable progress in recent years. However, neural networks and human designers operate in fundamentally different ways, making collaboration between them challenging. We address this gap for Scalable Vector Graphics (SVG) by equipping neural networks with tools commonly used by designers, such as axis alignment and explicit continuity control at command junctions. We introduce DesigNet, a hierarchical Transformer-VAE that operates directly on SVG sequences with a continuous command parameterization. Our main contributions are two differentiable modules: a continuity self-refinement module that predicts $C^0$, $G^1$, and $C^1$ continuity for each curve point and enforces it by modifying Bézier control points, and an alignment self-refinement module with snapping capabilities for horizontal or vertical lines. DesigNet produces editable outlines and achieves competitive results against state-of-the-art methods, with notably higher accuracy in continuity and alignment. These properties ensure the outputs are easier to refine and integrate into professional design workflows. Source Code: https://github.com/TomasGuija/DesigNet.
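To make "enforcing continuity by modifying Bézier control points" concrete, the sketch below shows the standard geometric condition for G¹ continuity at a junction between two cubic Bézier segments: the incoming handle, the junction point, and the outgoing handle must be collinear with the tangents pointing the same way. This is a plain geometric illustration under assumed names (`enforce_g1`, and points `a`, `p`, `b`); it is not DesigNet's differentiable module.

```python
import math

# Geometric sketch only: makes the two handles around junction p collinear
# (G1 continuity) while keeping each handle's length. Names are hypothetical.

def enforce_g1(a, p, b):
    """a: last control point of the incoming segment, p: shared junction,
    b: first control point of the outgoing segment. Returns adjusted (a, b)."""
    in_vec = (p[0] - a[0], p[1] - a[1])
    out_vec = (b[0] - p[0], b[1] - p[1])
    in_len, out_len = math.hypot(*in_vec), math.hypot(*out_vec)
    if in_len == 0 or out_len == 0:
        return a, b  # degenerate handle: nothing to align
    # Shared tangent direction: normalized sum of the two unit tangents
    dx = in_vec[0] / in_len + out_vec[0] / out_len
    dy = in_vec[1] / in_len + out_vec[1] / out_len
    norm = math.hypot(dx, dy)
    if norm == 0:
        return a, b  # tangents exactly opposite (cusp): leave unchanged
    ux, uy = dx / norm, dy / norm
    new_a = (p[0] - ux * in_len, p[1] - uy * in_len)
    new_b = (p[0] + ux * out_len, p[1] + uy * out_len)
    return new_a, new_b

a, p, b = (0.0, 0.0), (1.0, 0.0), (1.0, 2.0)  # a sharp corner at p
new_a, new_b = enforce_g1(a, p, b)
print(new_a, new_b)
```

C¹ continuity is the stricter case in which the two handles also have equal length, so it could be obtained from the same shared direction by using one common handle length.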

[126] LiftFormer: Lifting and Frame Theory Based Monocular Depth Estimation Using Depth and Edge Oriented Subspace Representation

Shuai Li,Huibin Bai,Yanbo Gao,Chong Lv,Hui Yuan,Chuankun Li,Wei Hua,Tian Xie

Main category: cs.CV

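TL;DR: This paper proposes LiftFormer, a model based on lifting theory topology that builds a depth-oriented geometric representation (DGR) subspace and an edge-aware representation (ER) subspace to map color features of a monocular image to depth values, improving monocular depth estimation.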

Details Motivation: Monocular depth estimation (MDE) is a highly ill-posed problem; existing methods struggle to model the mapping from image color features to geometric depth values accurately and are especially prone to errors near edges in the depth map. Method: Proposes the LiftFormer framework: (1) builds a depth-oriented geometric representation (DGR) subspace based on frame theory, using linearly dependent vectors aligned with depth bins to give a redundant and robust representation; (2) builds an edge-aware representation (ER) subspace to enhance local features near edges; (3) maps image spatial features into the DGR subspace, where they correspond directly to depth values. Result: Achieves SOTA performance on several widely used datasets, and ablations confirm the effectiveness of both the DGR and ER lifting modules. Conclusion: By introducing a dual-subspace design guided by lifting theory, LiftFormer mitigates the ill-posedness of MDE, improves depth prediction near edges in particular, and offers a new approach to monocular depth estimation. Abstract: Monocular depth estimation (MDE) has attracted increasing interest in the past few years, owing to its important role in 3D vision. MDE is the estimation of a depth map from a monocular image/video to represent the 3D structure of a scene, which is a highly ill-posed problem. To solve this problem, in this paper, we propose a LiftFormer based on lifting theory topology, for constructing an intermediate subspace that bridges the image color features and depth values, and a subspace that enhances the depth prediction around edges. MDE is formulated by transforming the depth value prediction problem into depth-oriented geometric representation (DGR) subspace feature representation, thus bridging the learning from color values to geometric depth values. A DGR subspace is constructed based on frame theory by using linearly dependent vectors in accordance with depth bins to provide a redundant and robust representation. The image spatial features are transformed into the DGR subspace, where these features correspond directly to the depth values. Moreover, considering that edges usually present sharp changes in a depth map and tend to be erroneously predicted, an edge-aware representation (ER) subspace is constructed, where depth features are transformed and further used to enhance the local features around edges. The experimental results demonstrate that our LiftFormer achieves state-of-the-art performance on widely used datasets, and an ablation study validates the effectiveness of both proposed lifting modules in our LiftFormer.

[127] VAMAE: Vessel-Aware Masked Autoencoders for OCT Angiography

Ilerioluwakiiye Abolade,Prince Mireku,Kelechi Chibundu,Peace Ododo,Emmanuel Idoko,Promise Omoigui,Solomon Odelola

Main category: cs.CV

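TL;DR: This paper proposes VAMAE, a vessel-aware masked autoencoding framework for OCTA images that uses anatomically informed masking and multi-target reconstruction to improve vessel segmentation, with the largest gains when labeled data is limited.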

Details Motivation: Vessel structures in OCTA images are sparse and subject to strong topological constraints, and self-supervised methods designed for natural images (such as uniform masking with pixel reconstruction) do not model vascular geometry well. Method: Proposes VAMAE: (1) anatomically informed masking based on vesselness and skeleton cues that focuses on vessel-rich regions; (2) joint reconstruction of multiple complementary targets (appearance, structural, and topological information) to strengthen learning of vascular connectivity and branching patterns. Result: On the OCTA-500 benchmark, VAMAE outperforms standard masked autoencoding baselines on vessel segmentation tasks across supervision levels, with consistent gains that are most pronounced in limited-label settings. Conclusion: Self-supervised pretraining that is aware of vessel geometry improves OCTA image analysis and suggests a new approach to modeling sparse structures in medical imaging. Abstract: Optical coherence tomography angiography (OCTA) provides non-invasive visualization of retinal microvasculature, but learning robust representations remains challenging due to sparse vessel structures and strong topological constraints. Many existing self-supervised learning approaches, including masked autoencoders, are primarily designed for dense natural images and rely on uniform masking and pixel-level reconstruction, which may inadequately capture vascular geometry. We propose VAMAE, a vessel-aware masked autoencoding framework for self-supervised pretraining on OCTA images. The approach incorporates anatomically informed masking that emphasizes vessel-rich regions using vesselness and skeleton-based cues, encouraging the model to focus on vascular connectivity and branching patterns. In addition, the pretraining objective includes reconstructing multiple complementary targets, enabling the model to capture appearance, structural, and topological information. We evaluate the proposed pretraining strategy on the OCTA-500 benchmark for several vessel segmentation tasks under varying levels of supervision. The results indicate that vessel-aware masking and multi-target reconstruction provide consistent improvements over standard masked autoencoding baselines, particularly in limited-label settings, suggesting the potential of geometry-aware self-supervised learning for OCTA analysis.

[128] Holistic Optimal Label Selection for Robust Prompt Learning under Partial Labels

Yaqi Zhao,Haoliang Sun,Yating Wang,Yongshun Gong,Yilong Yin

Main category: cs.CV

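TL;DR: This paper proposes HopS, a holistic optimal label selection method that combines local density-based filtering with a global selection objective based on optimal transport to improve prompt learning for vision-language models when only partial labels are available.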

Details Motivation: When only partial labels are available, prompt learning is limited by label ambiguity and insufficient supervision. Method: Proposes HopS, which has two complementary strategies: a local density-based filter that selects the most frequent labels from the candidate sets of nearest neighbors and uses softmax scores to identify the most plausible label; and a global selection objective based on optimal transport that maps the uniform sampling distribution to the candidate label distributions within a batch and minimizes the expected transport cost to determine the most likely label assignments. Result: Extensive experiments on eight benchmark datasets show that HopS consistently improves performance under partial supervision and outperforms all baselines. Conclusion: By combining local and global views for robust label selection, HopS demonstrates the value of holistic label selection and offers a practical solution for prompt learning under weak supervision. Abstract: Prompt learning has gained significant attention as a parameter-efficient approach for adapting large pre-trained vision-language models to downstream tasks. However, when only partial labels are available, its performance is often limited by label ambiguity and insufficient supervisory information. To address this issue, we propose Holistic Optimal Label Selection (HopS), leveraging the generalization ability of pre-trained feature encoders through two complementary strategies. First, we design a local density-based filter that selects the most frequent labels from the nearest neighbors' candidate sets and uses the softmax scores to identify the most plausible label, capturing structural regularities in the feature space. Second, we introduce a global selection objective based on optimal transport that maps the uniform sampling distribution to the candidate label distributions across a batch. By minimizing the expected transport cost, it can determine the most likely label assignments. These two strategies work together to provide robust label selection from both local and global perspectives. Extensive experiments on eight benchmark datasets show that HopS consistently improves performance under partial supervision and outperforms all baselines. These results highlight the merit of holistic label selection and offer a practical solution for prompt learning in weakly supervised settings.
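The global selection step can be sketched as an entropic optimal transport problem solved with Sinkhorn iterations. The abstract does not specify the cost function, the solver, or how the column marginal is built, so everything below is an assumption for illustration: the cost is taken as the negative log of the model's softmax probability, labels outside a sample's candidate set are forbidden, and the column marginal is the average of uniform distributions over each candidate set. The function name `select_labels` and the toy inputs are hypothetical.

```python
# Hedged sketch, not the HopS implementation: entropic optimal transport
# (Sinkhorn) restricted to each sample's candidate labels.

def select_labels(probs, candidates, eps=0.5, iters=100):
    """probs[i][j]: softmax score of class j for sample i (assumed input).
    candidates[i]: set of candidate labels for sample i.
    Returns one selected label index per sample."""
    n, k = len(probs), len(probs[0])
    # Gibbs kernel exp(-cost / eps) with cost = -log p, i.e. p ** (1 / eps);
    # entries outside the candidate set are zero (infinite cost).
    kernel = [[probs[i][j] ** (1.0 / eps) if j in candidates[i] else 0.0
               for j in range(k)] for i in range(n)]
    row = [1.0 / n] * n  # uniform distribution over samples
    col = [0.0] * k      # assumed candidate-label distribution over the batch
    for cand in candidates:
        for j in cand:
            col[j] += 1.0 / (n * len(cand))
    u, v = [1.0] * n, [1.0] * k
    for _ in range(iters):  # alternate scaling to match both marginals
        for i in range(n):
            s = sum(kernel[i][j] * v[j] for j in range(k))
            u[i] = row[i] / s if s > 0 else 0.0
        for j in range(k):
            s = sum(kernel[i][j] * u[i] for i in range(n))
            v[j] = col[j] / s if s > 0 else 0.0
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]
    # Pick, for each sample, the label that receives the most transported mass
    return [max(range(k), key=lambda j: plan[i][j]) for i in range(n)]

probs = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.3, 0.5]]
candidates = [{0, 1}, {0, 1}, {1, 2}]
labels = select_labels(probs, candidates)
print(labels)
```

The point of the batch-level formulation is that selections are coupled through the column marginal, so a label cannot absorb every ambiguous sample just because it has slightly higher scores; the local density filter described in the abstract is a separate step not shown here.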

[129] Balancing Efficiency and Restoration: Lightweight Mamba-Based Model for CT Metal Artifact Reduction

Weikai Qu,Sijun Liang,Xianfeng Li,Cheng Pan,An Yan,Ahmed Elazab,Shanzhou Niu,Dong Zeng,Xiang Wan,Changmiao Wang

Main category: cs.CV

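TL;DR: This paper proposes MARMamba, a CT metal artifact reduction method built on a streamlined UNet architecture with multi-scale Mamba (MS-Mamba) modules. It takes only the artifact-affected CT image as input, with no sinogram data, and preserves anatomical structure while remaining computationally efficient.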

Details Motivation: Existing metal artifact correction methods suffer from three problems: degradation of organ and tissue structures, dependence on sinogram data, and an imbalance between resource consumption and restoration efficiency. Method: Proposes MARMamba, a streamlined UNet architecture whose core module is multi-scale Mamba (MS-Mamba). Within MS-Mamba, a flip mamba block captures contextual information from multiple orientations, and an average maximum feed-forward network fuses critical and average features to suppress artifacts. Result: Experiments show that the model removes metal artifacts better than competing models while balancing computational cost, memory usage, and parameter count, which supports its practical value. Conclusion: MARMamba is an efficient, lightweight metal artifact correction method that needs only CT images as input and preserves anatomical structure while delivering strong performance. Abstract: In computed tomography imaging, metal implants frequently generate severe artifacts that compromise image quality and hinder diagnostic accuracy. There are three main challenges in the existing methods: the deterioration of organ and tissue structures, dependence on sinogram data, and an imbalance between resource use and restoration efficiency. Addressing these issues, we introduce MARMamba, which effectively eliminates artifacts caused by metals of different sizes while maintaining the integrity of the original anatomical structures of the image. Furthermore, this model only focuses on CT images affected by metal artifacts, thus negating the requirement for additional input data. The model is a streamlined UNet architecture, which incorporates multi-scale Mamba (MS-Mamba) as its core module. Within MS-Mamba, a flip mamba block captures comprehensive contextual information by analyzing images from multiple orientations. Subsequently, the average maximum feed-forward network integrates critical features with average features to suppress the artifacts. This combination allows MARMamba to reduce artifacts efficiently. The experimental results demonstrate that our model excels in reducing metal artifacts, offering distinct advantages over other models. It also strikes an optimal balance between computational demands, memory usage, and the number of parameters, highlighting its practical utility in the real world. The code of the presented model is available at: https://github.com/RICKand-MORTY/MARMamba.

[130] WeatherRemover: All-in-one Adverse Weather Removal with Multi-scale Feature Map Compression

Weikai Qu,Sijun Liang,Cheng Pan,Zikuan Yang,Guanchi Zhou,Xianjun Fu,Bo Liu,Changmiao Wang,Ahmed Elazab

Main category: cs.CV

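TL;DR: This paper proposes WeatherRemover, a lightweight model for removing multiple kinds of adverse weather from images. It combines a UNet structure, gating mechanisms, and a multi-scale pyramid vision Transformer to deliver high-quality restoration with substantially fewer parameters, lower compute, and lower memory usage.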

Details Motivation: Existing weather removal methods mostly target a single weather type and generalize poorly, while mainstream models are large, slow at inference, and memory-hungry, which makes them hard to use in practice. Method: Adopts a UNet-like structure, with channel-wise attention derived from CNNs and a linear spatial reduction strategy that lowers the cost of attention. Gating mechanisms embedded in the feed-forward and downsampling stages adaptively suppress redundant information, and a multi-scale pyramid vision Transformer strengthens feature representation. Result: Achieves high-quality restoration on several weather-degradation datasets while markedly reducing parameter count, FLOPs, and memory usage, with faster inference than existing mainstream multi-weather models. Conclusion: WeatherRemover balances quality and efficiency for multi-weather image restoration and is practical to deploy. Abstract: Photographs taken in adverse weather conditions often suffer from blurriness, occlusion, and low brightness due to interference from rain, snow, and fog. These issues can significantly hinder the performance of subsequent computer vision tasks, making the removal of weather effects a crucial step in image enhancement. Existing methods primarily target specific weather conditions, with only a few capable of handling multiple weather scenarios. However, mainstream approaches often overlook performance considerations, resulting in large parameter sizes, long inference times, and high memory costs. In this study, we introduce the WeatherRemover model, designed to enhance the restoration of images affected by various weather conditions while balancing performance. Our model adopts a UNet-like structure with a gating mechanism and a multi-scale pyramid vision Transformer. It employs channel-wise attention derived from convolutional neural networks to optimize feature extraction, while linear spatial reduction helps curtail the computational demands of attention. The gating mechanisms, strategically placed within the feed-forward and downsampling phases, refine the processing of information by selectively addressing redundancy and mitigating its influence on learning. This approach facilitates the adaptive selection of essential data, ensuring superior restoration and maximizing efficiency. Additionally, our lightweight model achieves an optimal balance between restoration quality, parameter efficiency, computational overhead, and memory usage, distinguishing it from other multi-weather models, thereby meeting practical application demands effectively. The source code is available at https://github.com/RICKand-MORTY/WeatherRemover.

[131] Variational Feature Compression for Model-Specific Representations

Zinan Guo,Zihan Wang,Chuan Yan,Liuhuo Wan,Ethan Ma,Guangdong Bai

Main category: cs.CV

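TL;DR: This paper proposes a feature extraction framework that uses a variational latent bottleneck and a dynamic binary mask to keep accuracy high for a designated classifier while sharply suppressing transfer of the features to unauthorized models.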

Details Motivation: When deep learning inference is deployed in shared and cloud environments, input data may be reused by unauthorized models for other tasks. Existing privacy defenses offer little control over which downstream tasks a released representation can still support. Method: Encodes inputs with a variational latent bottleneck trained without any pixel-level reconstruction loss, combined with a dynamic binary mask, based on KL divergence and gradient saliency with respect to the target model, that suppresses latent dimensions uninformative for the designated task. Training is white-box (gradient access required), while inference needs only a forward pass. Result: On CIFAR-100, the processed representations keep high accuracy for the designated classifier while the accuracy of all non-designated classifiers falls below 2%, a suppression ratio of more than 45 times. Preliminary results on CIFAR-10, Tiny ImageNet, and Pascal VOC suggest the approach extends across task settings. Conclusion: The framework limits cross-model misuse of features while protecting target-task performance, offering a new approach to input privacy control in shared environments, although robustness against adaptive adversaries still needs further evaluation. Abstract: As deep learning inference is increasingly deployed in shared and cloud-based settings, a growing concern is input repurposing, in which data submitted for one task is reused by unauthorized models for another. Existing privacy defenses largely focus on restricting data access, but provide limited control over what downstream uses a released representation can still support. We propose a feature extraction framework that suppresses cross-model transfer while preserving accuracy for a designated classifier. The framework employs a variational latent bottleneck, trained with a task-driven cross-entropy objective and KL regularization, but without any pixel-level reconstruction loss, to encode inputs into a compact latent space. A dynamic binary mask, computed from per-dimension KL divergence and gradient-based saliency with respect to the frozen target model, suppresses latent dimensions that are uninformative for the intended task. Because saliency computation requires gradient access, the encoder is trained in a white-box setting, whereas inference requires only a forward pass through the frozen target model. On CIFAR-100, the processed representations retain strong utility for the designated classifier while reducing the accuracy of all unintended classifiers to below 2%, yielding a suppression ratio exceeding 45 times relative to unintended models. Preliminary experiments on CIFAR-10, Tiny ImageNet, and Pascal VOC provide exploratory evidence that the approach extends across task settings, although further evaluation is needed to assess robustness against adaptive adversaries.
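The mask described above combines per-dimension KL divergence with gradient saliency. A minimal sketch of that idea follows, assuming a diagonal Gaussian posterior measured against a standard normal prior, for which the per-dimension KL has the closed form 0.5 * (mu^2 + sigma^2 - 1 - ln sigma^2). The ranking rule (KL times absolute saliency), the `keep_ratio` parameter, and the sample values are hypothetical illustrations and are not taken from the paper.

```python
import math

def kl_per_dimension(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ), computed separately for each latent dimension."""
    return [0.5 * (m * m + math.exp(lv) - 1.0 - lv) for m, lv in zip(mu, log_var)]

def binary_mask(kl, saliency, keep_ratio=0.5):
    """Hypothetical rule: keep the top `keep_ratio` dimensions ranked by KL * |saliency|."""
    scores = [k * abs(s) for k, s in zip(kl, saliency)]
    n_keep = max(1, int(round(keep_ratio * len(scores))))
    threshold = sorted(scores, reverse=True)[n_keep - 1]
    return [1 if sc >= threshold else 0 for sc in scores]

# Made-up posterior parameters and gradient magnitudes for four latent dimensions.
mu = [0.0, 1.5, -0.2, 2.0]
log_var = [0.0, -1.0, 0.0, -2.0]
saliency = [0.3, 0.8, 0.1, 0.5]

kl = kl_per_dimension(mu, log_var)
mask = binary_mask(kl, saliency)
print(mask)
```

A dimension whose posterior equals the prior (mu = 0, log-variance = 0) has zero KL and is masked out regardless of its saliency, which matches the intuition that such a dimension carries no input-specific information.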

[132] Controllable Generative Video Compression

Ding Ding,Daowen Li,Ying Chen,Yixin Gao,Ruixiao Dong,Kai Li,Li Li

Main category: cs.CV

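TL;DR: This paper proposes the Controllable Generative Video Compression (CGVC) paradigm, which guides non-keyframe generation with structural priors and per-frame control priors, markedly improving signal fidelity while maintaining high perceptual quality.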

Details Motivation: Existing perceptual video compression methods often sacrifice signal fidelity to gain perceptual realism, which runs against the basic goal of video compression: reproducing the visual signal faithfully. Method: Proposes the CGVC paradigm. Representative keyframes are coded to provide structural priors for the scene, and dense per-frame control priors are additionally coded to preserve the fine structure and semantics of each frame. These priors drive a controllable video generation model that reconstructs non-keyframes with temporal and content consistency. A color-distance-guided keyframe selection algorithm recovers the color information of the video more accurately. Result: Experiments show that CGVC outperforms previous perceptual video compression methods in both signal fidelity and perceptual quality. Conclusion: CGVC eases the trade-off between perceptual quality and signal fidelity and offers a new paradigm for achieving both. Abstract: Perceptual video compression adopts generative video modeling to improve perceptual realism but frequently sacrifices signal fidelity, diverging from the goal of video compression to faithfully reproduce the visual signal. To alleviate the dilemma between perception and fidelity, in this paper we propose the Controllable Generative Video Compression (CGVC) paradigm to faithfully generate details guided by multiple visual conditions. Under the paradigm, representative keyframes of the scene are coded and used to provide structural priors for non-keyframe generation. A dense per-frame control prior is additionally coded to better preserve the finer structure and semantics of each non-keyframe. Guided by these priors, non-keyframes are reconstructed by a controllable video generation model with temporal and content consistency. Furthermore, to accurately recover the color information of the video, we develop a color-distance-guided keyframe selection algorithm to adaptively choose keyframes. Experimental results show that CGVC outperforms previous perceptual video compression methods in terms of both signal fidelity and perceptual quality.

[133] GPAFormer: Graph-guided Patch Aggregation Transformer for Efficient 3D Medical Image Segmentation

Chung-Ming Lo,I-Yun Liu,Wei-Yang Lin

Main category: cs.CV

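TL;DR: This paper proposes GPAFormer, a lightweight 3D medical image segmentation network that substantially improves computational efficiency while maintaining high accuracy and suits multi-organ, multi-modality settings.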

Details Motivation: In multi-organ segmentation of 3D medical images, diverse imaging modalities, high-dimensional data, and heterogeneous anatomical structures make it hard to achieve accuracy and efficiency together. Method: Designs two core modules. The multi-scale attention-guided stacked aggregation (MASA) module uses three parallel paths and planar aggregation to strengthen multi-scale structural modeling. The mutual-aware patch graph aggregator (MPGA) module dynamically aggregates patches based on feature similarity and spatial adjacency, improving discrimination of both the interior and the boundaries of organs. Result: Achieves the highest average DSC on the four public datasets BTCV, Synapse, ACDC, and BraTS (75.70%, 81.20%, 89.32%, and 82.74%, respectively) with only 1.81 M parameters, and single-case inference on a consumer-level GPU takes less than one second. Conclusion: GPAFormer balances accuracy and efficiency in multi-organ, multi-modality 3D medical image segmentation and is particularly suited to resource-constrained and time-sensitive clinical environments. Abstract: Deep learning has been widely applied to 3D medical image segmentation tasks. However, due to the diversity of imaging modalities, the high-dimensional nature of the data, and the heterogeneity of anatomical structures, achieving both segmentation accuracy and computational efficiency in multi-organ segmentation remains a challenge. This study proposed GPAFormer, a lightweight network architecture specifically designed for 3D medical image segmentation, emphasizing efficiency while keeping high accuracy. GPAFormer incorporated two core modules: the multi-scale attention-guided stacked aggregation (MASA) and the mutual-aware patch graph aggregator (MPGA). MASA utilized three parallel paths with different receptive fields, combined through planar aggregation, to enhance the network's capability in handling structures of varying sizes. MPGA employed a graph-guided approach to dynamically aggregate regions with similar feature distributions based on inter-patch feature similarity and spatial adjacency, thereby improving the discrimination of both internal and boundary structures of organs. Experiments were performed on public whole-body CT and MRI datasets including BTCV, Synapse, ACDC, and BraTS. Compared to existing 3D segmentation networks, GPAFormer, using only 1.81 M parameters, achieved the highest overall DSC on BTCV (75.70%), Synapse (81.20%), ACDC (89.32%), and BraTS (82.74%). On a consumer-level GPU, inference for one BTCV validation case took less than one second. The results demonstrated that GPAFormer balanced accuracy and efficiency in multi-organ, multi-modality 3D segmentation tasks across various clinical scenarios, especially for resource-constrained and time-sensitive clinical environments.
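The results above are reported as DSC (Dice similarity coefficient). As a reminder of what that metric measures, here is a minimal sketch for one label over flattened label maps, using the standard definition DSC = 2|P ∩ T| / (|P| + |T|). The toy arrays are made up, and returning 1.0 when a label is absent from both prediction and ground truth is one common convention, assumed here rather than taken from the paper.

```python
def dice_coefficient(pred, target, label):
    """Dice similarity coefficient for a single label over flattened voxel label lists."""
    p = [v == label for v in pred]
    t = [v == label for v in target]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    denom = sum(p) + sum(t)
    # Assumed convention: a label absent from both maps counts as a perfect match.
    return 1.0 if denom == 0 else 2.0 * intersection / denom

# Toy flattened label maps (0 = background, 1 and 2 = two organs).
pred   = [0, 1, 1, 2, 2, 2, 0, 1]
target = [0, 1, 2, 2, 2, 0, 0, 1]

dsc_organ1 = dice_coefficient(pred, target, 1)
dsc_organ2 = dice_coefficient(pred, target, 2)
mean_dsc = (dsc_organ1 + dsc_organ2) / 2
print(dsc_organ1, dsc_organ2, mean_dsc)
```

Multi-organ benchmarks typically report the mean of the per-organ scores, which is what `mean_dsc` computes for the two toy labels.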

[134] Towards Robust Content Watermarking Against Removal and Forgery Attacks

Yifan Zhu,Yihan Wang,Xiao-Shan Gao

Main category: cs.CV

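TL;DR: This paper proposes ISTS, a new watermarking paradigm that dynamically controls the injection time and pattern of the watermark and uses two-sided detection to resist removal and forgery attacks.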

Details Motivation: Existing watermarking techniques for text-to-image diffusion models are vulnerable to adversarial attacks such as removal and forgery, so more robust watermarking schemes are needed. Method: Proposes the Instance-Specific watermarking with Two-Sided detection (ISTS) paradigm, which dynamically controls the injection time and pattern of the watermark based on the semantics of the user's prompt and adds a two-sided detection mechanism to improve robustness. Result: Experiments show that ISTS resists removal and forgery attacks significantly better than existing watermarking methods. Conclusion: ISTS provides a more secure and robust watermarking solution for copyright protection and provenance of text-to-image generated content. Abstract: Generated content has raised serious concerns about copyright protection, image provenance, and credit attribution. A potential solution for these problems is watermarking. Recently, content watermarking for text-to-image diffusion models has been studied extensively for its effective detection utility and robustness. However, these watermarking techniques are vulnerable to potential adversarial attacks, such as removal attacks and forgery attacks. In this paper, we build a novel watermarking paradigm called Instance-Specific watermarking with Two-Sided detection (ISTS) to resist removal and forgery attacks. Specifically, we introduce a strategy that dynamically controls the injection time and watermarking patterns based on the semantics of users' prompts. Furthermore, we propose a new two-sided detection approach to enhance robustness in watermark detection. Experiments have demonstrated the superiority of our watermarking against removal and forgery attacks.

[135] VDPP: Video Depth Post-Processing for Speed and Scalability

Daewon Yoon,Injun Baek,Sangyu Han,Yearim Kim,Nojun Kwak

Main category: cs.CV

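TL;DR: This paper proposes VDPP (Video Depth Post-Processing), a framework that replaces costly scene reconstruction with geometric refinement in low-resolution space. The result is efficient, accurate, RGB-free video depth post-processing that plugs into any single-image depth model and suits real-time edge deployment.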

Details Motivation: Existing end-to-end video depth estimation models perform well but are slow to adopt newly released single-image depth models, while existing post-processing methods such as NVDS still trail end-to-end systems in speed, accuracy, and RGB dependence. Method: Proposes the VDPP framework, which abandons full scene reconstruction in favor of targeted geometric refinement in low-resolution space. Dense residual learning drives the geometric representations, and an RGB-free architecture improves scalability and plug-and-play use. Result: VDPP exceeds 43.5 FPS on an NVIDIA Jetson Orin Nano, with accuracy and temporal coherence comparable to end-to-end models. It is memory-efficient, needs no RGB input, and integrates with any single-image depth estimator. Conclusion: VDPP makes video depth post-processing practical again, balancing speed, accuracy, memory efficiency, and model compatibility, and the authors present it as the most practical solution for real-time deployment on edge devices. Abstract: Video depth estimation is essential for providing 3D scene structure in applications ranging from autonomous driving to mixed reality. Although current end-to-end (E2E) models have achieved state-of-the-art performance, they function as tightly coupled systems that suffer from a significant adaptation lag whenever superior single-image depth estimators are released. To mitigate this issue, post-processing methods such as NVDS offer a modular plug-and-play alternative to incorporate any evolving image depth model without retraining. However, existing post-processing methods still struggle to match the efficiency and practicality of E2E systems due to limited speed, accuracy, and RGB reliance. In this work, we revitalize the role of post-processing by proposing VDPP (Video Depth Post-Processing), a framework that improves the speed and accuracy of post-processing methods for video depth estimation. By shifting the paradigm from computationally expensive scene reconstruction to targeted geometric refinement, VDPP operates purely on geometric refinements in low-resolution space. This design achieves exceptional speed (>43.5 FPS on NVIDIA Jetson Orin Nano) while matching the temporal coherence of E2E systems, with dense residual learning driving geometric representations rather than full reconstructions. Furthermore, our VDPP's RGB-free architecture ensures true scalability, enabling immediate integration with any evolving image depth model. Our results demonstrate that VDPP provides a superior balance of speed, accuracy, and memory efficiency, making it the most practical solution for real-time edge deployment. Our project page is at https://github.com/injun-baek/VDPP

[136] RASR: Retrieval-Augmented Semantic Reasoning for Fake News Video Detection

Hui Li,Peien Ding,Jun Li,Guoqi Ma,Zhanyu Liu,Ge Xu,Junfeng Yao,Jinsong Su

Main category: cs.CV

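TL;DR: This paper proposes Retrieval-Augmented Semantic Reasoning (RASR), a framework for multimodal fake news video detection. Through cross-instance semantic parsing, domain-guided multimodal reasoning, and multi-view feature decoupling and fusion, it significantly improves detection accuracy and cross-domain generalization.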

Details Motivation: Existing methods lack cross-instance global semantic correlations and the guidance of domain expert knowledge, so they struggle to use historical associative evidence and to handle semantic differences across domains. Method: Proposes the RASR framework, which comprises: 1) a Cross-instance Semantic Parser and Retriever (CSPR) that decomposes the video and retrieves associative evidence from a dynamic memory bank; 2) a Domain-Guided Multimodal Reasoning (DGMP) module that incorporates domain priors to drive an expert multimodal large language model in generating in-depth analysis reports; and 3) a Multi-View Feature Decoupling and Fusion (MVDFF) module that integrates multi-dimensional features through an adaptive gating mechanism. Result: Significantly outperforms state-of-the-art methods on the FakeSV and FakeTT datasets, generalizes better across domains, and improves overall detection accuracy by up to 0.93%. Conclusion: RASR mitigates the lack of cross-instance semantic correlations and the shortage of domain knowledge, offering a new approach to multimodal fake news video detection. Abstract: Multimodal fake news video detection is a crucial research direction for maintaining the credibility of online information. Existing studies primarily verify content authenticity by constructing multimodal feature fusion representations or utilizing pre-trained language models to analyze video-text consistency. However, these methods still face the following limitations: (1) lacking cross-instance global semantic correlations, making it difficult to effectively utilize historical associative evidence to verify the current video; (2) semantic discrepancies across domains hinder the transfer of general knowledge, lacking the guidance of domain-specific expert knowledge. To this end, we propose a novel Retrieval-Augmented Semantic Reasoning (RASR) framework. First, a Cross-instance Semantic Parser and Retriever (CSPR) deconstructs the video into high-level semantic primitives and retrieves relevant associative evidence from a dynamic memory bank. Subsequently, a Domain-Guided Multimodal Reasoning (DGMP) module incorporates domain priors to drive an expert multimodal large language model in generating domain-aware, in-depth analysis reports. Finally, a Multi-View Feature Decoupling and Fusion (MVDFF) module integrates multi-dimensional features through an adaptive gating mechanism to achieve robust authenticity determination. Extensive experiments on the FakeSV and FakeTT datasets demonstrate that RASR significantly outperforms state-of-the-art baselines, achieves superior cross-domain generalization, and improves the overall detection accuracy by up to 0.93%.

[137] Specializing Large Models for Oracle Bone Script Interpretation via Component-Grounded Multimodal Knowledge Augmentation

Jianing Zhang,Runan Li,Honglin Pang,Ding Xia,Zhou Zhu,Qian Zhang,Chuntao Li,Xi Yang

Main category: cs.CV

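TL;DR: This paper proposes a framework in which a vision-language model (VLM) cooperates with a large language model (LLM) agent to decipher Oracle Bone Script. Structured component analysis and knowledge-based reasoning bridge the "interpretation gap", and the authors build OB-Radix, the first expert-annotated dataset of Oracle Bone Script structure and semantics.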

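Details Motivation: Existing methods treat Oracle Bone Script decipherment as a closed-set image recognition task and ignore its structural regularity: characters are built from a small set of pictographic components with transferable meanings. As a result they generalize poorly to rare or unknown characters, leaving an "interpretation gap". Method: Proposes an agent-driven vision-language model framework. The VLM handles precise visual grounding of character images, and the LLM agent automatically runs a three-step reasoning chain of component identification, graph-based knowledge retrieval, and semantic relationship inference. The authors also build the expert-annotated dataset OB-Radix, with 1,022 character images, 1,853 fine-grained component images, and semantic explanations for 478 component types. Result: On three benchmarks covering different tasks, the framework produces more detailed and more accurate interpretations than baseline methods, supporting the value of structured reasoning and multimodal cooperation. Conclusion: Component-level structural priors combined with LLM-driven symbolic reasoning can markedly improve the interpretability and generalization of ancient script decipherment, offering a new paradigm for AI-assisted archaeological linguistics. Abstract: Deciphering ancient Chinese Oracle Bone Script (OBS) is a challenging task that offers insights into the beliefs, systems, and culture of the ancient era. Existing approaches treat decipherment as a closed-set image recognition problem, which fails to bridge the "interpretation gap": while individual characters are often unique and rare, they are composed of a limited set of recurring, pictographic components that carry transferable semantic meanings. To leverage this structural logic, we propose an agent-driven Vision-Language Model (VLM) framework that integrates a VLM for precise visual grounding with an LLM-based agent to automate a reasoning chain of component identification, graph-based knowledge retrieval, and relationship inference for linguistically accurate interpretation. To support this, we also introduce OB-Radix, an expert-annotated dataset providing structural and semantic data absent from prior corpora, comprising 1,022 character images (934 unique characters) and 1,853 fine-grained component images across 478 distinct components with verified explanations. By evaluating our system across three benchmarks of different tasks, we demonstrate that our framework yields more detailed and precise decipherments compared to baseline methods.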

[138] Improving Local Feature Matching by Entropy-inspired Scale Adaptability and Flow-endowed Local Consistency

Ke Jin,Jiming Chen,Qi Ye

Main category: cs.CV

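TL;DR: This paper proposes a new semi-dense image matching pipeline. A scale-aware matching module addresses over-exclusion at the coarse stage, while cascaded flow refinement and a gradient loss improve local consistency at the fine stage, yielding markedly more robust and accurate matching.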

Details Motivation: Existing semi-dense image matching methods over-exclude candidates at the coarse stage because of MNN matching, especially when the images differ in scale, and they ignore the local consistency of matches at the fine stage, which hurts robustness. Method: 1) Proposes a scale-aware matching module that uses information implicit in the score matrix to estimate and compensate for the scale ratio; 2) reformulates fine matching as a cascaded flow refinement problem and designs a gradient loss that encourages local consistency of the flow field. Result: The method achieves more robust and more accurate matching on several downstream tasks with negligible computational overhead. Conclusion: By improving the key mechanisms of the coarse and fine stages separately, the paper mitigates the long-standing problems of poor adaptation to scale differences and local inconsistency, offering a new approach to semi-dense matching. Abstract: Recent semi-dense image matching methods have achieved remarkable success, but two long-standing issues still impair their performance. At the coarse stage, the over-exclusion issue of their mutual nearest neighbor (MNN) matching layer makes them struggle to handle cases with scale difference between images. To this end, we comprehensively revisit the matching mechanism and make a key observation that the hint concealed in the score matrix can be exploited to indicate the scale ratio. Based on this, we propose a scale-aware matching module which is exceptionally effective but introduces negligible overhead. At the fine stage, we point out that existing methods neglect the local consistency of final matches, which undermines their robustness. To this end, rather than independently predicting the correspondence for each source pixel, we reformulate the fine stage as a cascaded flow refinement problem and introduce a novel gradient loss to encourage local consistency of the flow field. Extensive experiments demonstrate that our novel matching pipeline, with these proposed modifications, achieves robust and accurate matching performance on downstream tasks.
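The over-exclusion problem of mutual nearest neighbor (MNN) matching can be seen on a toy score matrix. The sketch below implements plain MNN selection only; it is not the paper's scale-aware module, and the scores are made up. Rows are source patches and columns are target patches. When the target image is at a coarser scale, several source patches may legitimately correspond to the same target patch, yet MNN keeps at most one of them.

```python
def mutual_nearest_neighbors(scores):
    """Return (row, col) pairs where col is the best match for row and row is the best match for col."""
    n_rows, n_cols = len(scores), len(scores[0])
    best_col = [max(range(n_cols), key=lambda j: scores[i][j]) for i in range(n_rows)]
    best_row = [max(range(n_rows), key=lambda i: scores[i][j]) for j in range(n_cols)]
    return [(i, j) for i, j in enumerate(best_col) if best_row[j] == i]

# Made-up scores: source patches 0 and 1 both look like target patch 0
# (as happens under a scale difference), source patch 2 looks like target patch 2.
scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.1, 0.7],
]

matches = mutual_nearest_neighbors(scores)
print(matches)
```

Source patch 1 is dropped even though its score of 0.8 against target patch 0 is high, because target patch 0 prefers source patch 0. This many-to-one pattern is the kind of signal in the score matrix that could hint at a scale ratio, although how the paper extracts it is not specified in the abstract.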

[139] HQF-Net: A Hybrid Quantum-Classical Multi-Scale Fusion Network for Remote Sensing Image Segmentation

Md Aminur Hossain,Ayush V. Patel,Siddhant Gole,Sanjay K. Singh,Biplab Banerjee

Main category: cs.CV

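TL;DR: This paper proposes HQF-Net, a hybrid quantum-classical multi-scale fusion network for semantic segmentation of remote sensing images. It introduces a DMCAF module, quantum-enhanced skip connections (QSkip), and a quantum MoE bottleneck, and achieves notable gains on several benchmarks.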

Details Motivation: Classical encoder-decoder architectures such as U-Net struggle to model global semantics and structured feature interactions in remote sensing semantic segmentation, so more effective multi-scale and semantic fusion mechanisms are needed. Method: Proposes HQF-Net, which 1) extracts multi-scale semantic guidance from a frozen DINOv3 ViT-L/16; 2) fuses features with a Deformable Multiscale Cross-Attention Fusion (DMCAF) module; and 3) introduces quantum-enhanced skip connections (QSkip) and a quantum MoE bottleneck (QMoE) containing local, global, and directional quantum circuits. Result: Outperforms the baselines on LandCover.ai (0.8568 mIoU, 96.87% OA), OpenEarthMap (71.82% mIoU), and SeasoNet (55.28% mIoU, 99.37% OA), and ablation experiments confirm the contribution of each component. Conclusion: Structured hybrid quantum-classical feature processing is a feasible new direction for improving remote sensing semantic segmentation within the limits of near-term quantum hardware. Abstract: Remote sensing semantic segmentation requires models that can jointly capture fine spatial details and high-level semantic context across complex scenes. While classical encoder-decoder architectures such as U-Net remain strong baselines, they often struggle to fully exploit global semantics and structured feature interactions. In this work, we propose HQF-Net, a hybrid quantum-classical multi-scale fusion network for remote sensing image segmentation. HQF-Net integrates multi-scale semantic guidance from a frozen DINOv3 ViT-L/16 backbone with a customized U-Net architecture through a Deformable Multiscale Cross-Attention Fusion (DMCAF) module. To enhance feature refinement, the framework further introduces quantum-enhanced skip connections (QSkip) and a Quantum bottleneck with Mixture-of-Experts (QMoE), which combines complementary local, global, and directional quantum circuits within an adaptive routing mechanism. Experiments on three remote sensing benchmarks show consistent improvements with the proposed design. HQF-Net achieves 0.8568 mIoU and 96.87% overall accuracy on LandCover.ai, 71.82% mIoU on OpenEarthMap, and 55.28% mIoU with 99.37% overall accuracy on SeasoNet. An architectural ablation study further confirms the contribution of each major component. These results show that structured hybrid quantum-classical feature processing is a promising direction for improving remote sensing semantic segmentation under near-term quantum constraints.

[140] Exploring 6D Object Pose Estimation with Deformation

Zhiqiang Liu,Rui Song,Duanmu Chuangqi,Jiaojiao Li,David Ferstl,Yinlin Hu

Main category: cs.CV

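TL;DR: This paper presents DeSOPE, a dataset for 6DoF (six degrees of freedom) pose estimation of deformed objects. It contains high-fidelity 3D scans of 26 object categories, 133K RGB-D frames, and 665K pose annotations, and targets the performance drop of existing methods on non-rigid or deformed objects.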

Details Motivation: Most existing 6D object pose estimation methods assume rigid or articulated objects, but real objects often deviate from their canonical shape through wear, impact, or deformation, which causes these methods to fail. A dataset and evaluation benchmark dedicated to deformed objects is therefore needed. Method: Builds the DeSOPE dataset. It collects 3D scans of 26 common object categories (one canonical state and three deformed states per category), accurately registered to the canonical mesh. It also builds an RGB-D dataset of 133K frames with 665K pose annotations produced by a semi-automatic pipeline: annotate 2D masks, initialize poses with an object pose estimation method, refine them with object-level SLAM, and verify the result manually. Result: Experiments with several 6D pose estimation methods show that performance drops sharply as deformation increases, confirming the need for robustness to deformation. Conclusion: DeSOPE provides the first large-scale, high-quality benchmark dataset for 6D pose estimation of deformed objects, exposes the limits of current methods under deformation, and encourages research on robust pose estimation for non-rigid objects. Abstract: We present DeSOPE, a large-scale dataset for 6DoF deformed objects. Most 6D object pose methods assume rigid or articulated objects, an assumption that fails in practice as objects deviate from their canonical shapes due to wear, impact, or deformation. To model this, we introduce the DeSOPE dataset, which features high-fidelity 3D scans of 26 common object categories, each captured in one canonical state and three deformed configurations, with accurate 3D registration to the canonical mesh. Additionally, it features an RGB-D dataset with 133K frames across diverse scenarios and 665K pose annotations produced via a semi-automatic pipeline. We begin by annotating 2D masks for each instance, then compute initial poses using an object pose method, refine them through an object-level SLAM system, and finally perform manual verification to produce the final annotations. We evaluate several object pose methods and find that performance drops sharply with increasing deformation, suggesting that robust handling of such deformations is critical for practical applications. The project page and dataset are available at https://desope-6d.github.io/.

[141] Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

Jiahua Chen,Qihong Tang,Weinong Wang,Qi Fan

Main category: cs.CV

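TL;DR: This paper proposes a training-free visual chain-of-thought framework that improves multimodal large language models on complex 3D spatial reasoning through explicit 3D reconstruction (a high-fidelity 3D mesh from a single image) and multi-view synthesis driven by an external knowledge base, significantly outperforming existing models.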

Details Motivation: Existing MLLMs rely on 2D visual priors and fall short on complex 3D spatial reasoning, and existing remedies are either computationally expensive or lack geometric understanding and viewpoint flexibility. Method: Proposes a training-free framework that 1) uses an MLLM for keyword extraction and multi-granularity mask generation to reconstruct a high-fidelity 3D mesh from a single image; and 2) uses an external knowledge base to iteratively compute optimal camera extrinsics and synthesize novel views, emulating human perspective-taking. Result: On major benchmarks such as 3DSRBench and Rel3D, the framework significantly outperforms specialized spatial models and general-purpose MLLMs such as GPT-5.2 and Gemini-2.5-Flash. Conclusion: Explicit 3D reconstruction combined with viewpoint-driven visual chain-of-thought reasoning can compensate for the weaknesses of MLLMs in 3D spatial understanding without additional training, making the approach both efficient and general. Abstract: Although Multimodal Large Language Models have achieved remarkable progress, they still struggle with complex 3D spatial reasoning due to the reliance on 2D visual priors. Existing approaches typically mitigate this limitation either through computationally expensive post-training procedures on limited 3D datasets or through rigid tool-calling mechanisms that lack explicit geometric understanding and viewpoint flexibility. To address these challenges, we propose a training-free framework that introduces a Visual Chain-of-Thought mechanism grounded in explicit 3D reconstruction. The proposed pipeline first reconstructs a high-fidelity 3D mesh from a single image using MLLM-guided keyword extraction and mask generation at multiple granularities. Subsequently, the framework leverages an external knowledge base to iteratively compute optimal camera extrinsic parameters and synthesize novel views, thereby emulating human perspective-taking. Extensive experiments demonstrate that the proposed approach significantly enhances spatial comprehension. Specifically, the framework outperforms specialized spatial models and general-purpose MLLMs, including GPT-5.2 and Gemini-2.5-Flash, on major benchmarks such as 3DSRBench and Rel3D.

[142] URMF: Uncertainty-aware Robust Multimodal Fusion for Multimodal Sarcasm Detection

Zhenyu Wang,Weichen Cheng,Weijia Li,Junjie Mou,Zongyou Zhao,Guoying Zhang

Main category: cs.CV

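TL;DR: This paper proposes URMF, an uncertainty-aware robust multimodal fusion framework for multimodal sarcasm detection. It models the aleatoric uncertainty of text, image, and interaction representations and dynamically adjusts each modality's contribution, improving accuracy and robustness.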

Details Motivation: Existing methods assume that all modalities are equally reliable, but in real social media the text may be ambiguous and the image may be weakly relevant or irrelevant, so deterministic fusion introduces noise and weakens robust reasoning. Method: URMF uses multi-head cross-attention to inject visual evidence into textual representations, then applies multi-head self-attention in the fused semantic space to strengthen incongruity-aware reasoning. It models the aleatoric uncertainty of text, image, and interaction-aware latent representations in a unified way, parameterizing each as a learnable Gaussian posterior, and uses the estimated uncertainty to weight the fusion dynamically. The joint training objective combines task supervision, modality prior regularization, cross-modal distribution alignment, and uncertainty-driven self-sampling contrastive learning. Result: On public MSD benchmarks, URMF consistently outperforms strong unimodal, multimodal, and MLLM-based baselines. Conclusion: Uncertainty-aware fusion can markedly improve the accuracy and robustness of multimodal sarcasm detection. Abstract: Multimodal sarcasm detection (MSD) aims to identify sarcastic intent from semantic incongruity between text and image. Although recent methods have improved MSD through cross-modal interaction and incongruity reasoning, they often assume that all modalities are equally reliable. In real-world social media, however, textual content may be ambiguous and visual content may be weakly relevant or even irrelevant, causing deterministic fusion to introduce noisy evidence and weaken robust reasoning. To address this issue, we propose Uncertainty-aware Robust Multimodal Fusion (URMF), a unified framework that explicitly models modality reliability during interaction and fusion. URMF first employs multi-head cross-attention to inject visual evidence into textual representations, followed by multi-head self-attention in the fused semantic space to enhance incongruity-aware reasoning. It then performs unified unimodal aleatoric uncertainty modeling over text, image, and interaction-aware latent representations by parameterizing each modality as a learnable Gaussian posterior. The estimated uncertainty is further used to dynamically regulate modality contributions during fusion, suppressing unreliable modalities and yielding a more robust joint representation. In addition, we design a joint training objective integrating task supervision, modality prior regularization, cross-modal distribution alignment, and uncertainty-driven self-sampling contrastive learning. Experiments on public MSD benchmarks show that URMF consistently outperforms strong unimodal, multimodal, and MLLM-based baselines, demonstrating the effectiveness of uncertainty-aware fusion for improving both accuracy and robustness.
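One simple way to let estimated uncertainty regulate modality contributions is precision (inverse-variance) weighting of per-modality Gaussian means. The sketch below shows that generic technique under stated assumptions; the abstract does not say which weighting rule URMF uses, so this should be read as an illustration of the idea rather than the paper's fusion layer. The function name and the feature values are hypothetical.

```python
def precision_weighted_fusion(means, variances):
    """Fuse per-modality feature vectors, weighting each modality by 1 / variance per dimension."""
    dim = len(means[0])
    fused = []
    for d in range(dim):
        precisions = [1.0 / var[d] for var in variances]
        total = sum(precisions)
        fused.append(sum(p * m[d] for p, m in zip(precisions, means)) / total)
    return fused

# Made-up two-dimensional features. In dimension 0 the text is confident (variance 0.1)
# and the image is not (variance 0.9); in dimension 1 both are equally uncertain.
text_mean, text_var = [1.0, 0.0], [0.1, 1.0]
image_mean, image_var = [0.0, 2.0], [0.9, 1.0]

fused = precision_weighted_fusion([text_mean, image_mean], [text_var, image_var])
print(fused)
```

In dimension 0 the fused value sits close to the confident text feature, while in dimension 1 it is the plain average, which is the behavior one would want from a fusion step that suppresses unreliable modalities.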

[143] DOC-GS: Dual-Domain Observation and Calibration for Reliable Sparse-View Gaussian Splatting

Hantang Li,Qiang Zhu,Xiandong Meng,Debin Zhao,Xiaopeng Fan

Main category: cs.CV

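TL;DR: This paper proposes the DOC-GS framework, which jointly models the reliability of Gaussian primitives through continuous depth-guided dropout in the optimization domain and the dark channel prior in the observation domain, mitigating haze-like artifacts and structural distortion in sparse-view 3D Gaussian Splatting reconstruction.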

Details Motivation: Sparse-view 3D Gaussian Splatting reconstruction is ill-posed because geometric supervision is insufficient, which leads to overfitting, structural distortion, and haze-like artifacts. Existing dropout-based methods lack a unified account of how the artifacts form, and the core challenge is that the reliability of Gaussian primitives is unobservable. Method: Proposes the Dual-domain Observation and Calibration (DOC-GS) framework. In the optimization domain, Continuous Depth-Guided Dropout (CDGD) uses the dropout probability as an explicit proxy for Gaussian reliability and imposes a smooth, depth-aware inductive bias. In the observation domain, floater artifacts are related to atmospheric scattering, the Dark Channel Prior (DCP) detects anomalous regions, and the accumulated evidence drives reliability-based geometric pruning. Result: Suppresses haze-like artifacts and structural distortion under sparse views, improves reconstruction quality and optimization stability, and achieves better visual and quantitative results than existing methods on several sparse-view datasets. Conclusion: Gaussian primitive reliability is a key latent variable in sparse-view 3DGS reconstruction, and combining an optimization-domain inductive bias with an observation-domain structural prior systematically improves robustness and fidelity. Abstract: Sparse-view reconstruction with 3D Gaussian Splatting (3DGS) is fundamentally ill-posed due to insufficient geometric supervision, often leading to severe overfitting and the emergence of structural distortions and translucent haze-like artifacts. While existing approaches attempt to alleviate this issue via dropout-based regularization, they are largely heuristic and lack a unified understanding of artifact formation. In this paper, we revisit sparse-view 3DGS reconstruction from a new perspective and identify the core challenge as the unobservability of Gaussian primitive reliability. Unreliable Gaussians are insufficiently constrained during optimization and accumulate as haze-like degradations in rendered images. Motivated by this observation, we propose a unified Dual-domain Observation and Calibration (DOC-GS) framework that models and corrects Gaussian reliability through the synergy of optimization-domain inductive bias and observation-domain evidence. Specifically, in the optimization domain, we characterize Gaussian reliability by the degree to which each primitive is constrained during training, and instantiate this signal via a Continuous Depth-Guided Dropout (CDGD) strategy, where the dropout probability serves as an explicit proxy for primitive reliability. This imposes a smooth depth-aware inductive bias to suppress weakly constrained Gaussians and improve optimization stability. In the observation domain, we establish a connection between floater artifacts and atmospheric scattering, and leverage the Dark Channel Prior (DCP) as a structural consistency cue to identify and accumulate anomalous regions. Based on cross-view aggregated evidence, we further design a reliability-driven geometric pruning strategy to remove low-confidence Gaussians.
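The Dark Channel Prior used in the observation domain has a standard definition: for each pixel, take the minimum over the color channels and then the minimum over a local patch. In haze-free regions this value tends to be near zero, while haze lifts it. The sketch below computes only that standard dark channel; how DOC-GS thresholds it, accumulates it across views, or maps it to Gaussians is not described in the abstract, so none of that is attempted here. The image values and the patch size are made up.

```python
def dark_channel(image, patch=3):
    """image: H x W nested list of (r, g, b) tuples in [0, 1]. Returns the H x W dark channel."""
    h, w = len(image), len(image[0])
    r = patch // 2
    # Per-pixel minimum over the three color channels.
    min_rgb = [[min(px) for px in row] for row in image]
    dark = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Minimum over the local patch, clipped at the image border.
            dark[y][x] = min(
                min_rgb[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
    return dark

# A uniformly gray, haze-like 3 x 3 image keeps a high dark channel everywhere.
hazy = [[(0.7, 0.7, 0.7)] * 3 for _ in range(3)]
# Adding one saturated pixel drives the dark channel of every overlapping patch to zero.
clear = [row[:] for row in hazy]
clear[1][1] = (0.9, 0.1, 0.0)

print(dark_channel(hazy))
print(dark_channel(clear))
```

Regions of a rendered view whose dark channel stays high are candidates for haze-like floaters, which is the kind of structural cue the abstract describes.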

[144] LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video

Pedro Quesado,Erkut Akdag,Yasaman Kashefbahrami,Willem Menu,Egor Bondarev

Main category: cs.CV

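TL;DR: This paper proposes LiveStre4m, a feed-forward method for real-time novel view synthesis (NVS) from unposed sparse multi-view video. A multi-view ViT reconstructs the 3D scene, a diffusion-transformer interpolation module maintains temporal consistency, and a camera pose predictor estimates intrinsics and extrinsics directly from RGB images. The method needs only two unposed synchronized video streams and runs at 0.07 s per frame (1024 × 768) for stable, live-stream NVS.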

Details Motivation: Existing novel view synthesis methods for dynamic scenes depend on ground-truth camera parameters and slow optimization (about 2.67 s), which rules them out for live streaming. Method: Proposes LiveStre4m, which comprises 1) a multi-view vision transformer for keyframe 3D scene reconstruction; 2) a diffusion-transformer interpolation module that ensures temporal consistency and stable streaming; and 3) a camera pose predictor that jointly estimates camera poses and intrinsics directly from RGB images. Result: Reaches an average reconstruction time of 0.07 s per frame at 1024 × 768 resolution, far faster than optimization-based methods, and produces temporally consistent novel-view video in real time from only two synchronized unposed inputs. Conclusion: LiveStre4m delivers real-time novel-view video streaming without camera calibration, with low latency and high stability, a key step toward deployable NVS systems. Abstract: Live-streaming Novel View Synthesis (NVS) from unposed multi-view video remains an open challenge in a wide range of applications. Existing methods for dynamic scene representation typically require ground-truth camera parameters and involve lengthy optimizations (approximately 2.67 s), which makes them unsuitable for live streaming scenarios. To address this issue, we propose a novel viewpoint video live-streaming method (LiveStre4m), a feed-forward model for real-time NVS from unposed sparse multi-view inputs. LiveStre4m introduces a multi-view vision transformer for keyframe 3D scene reconstruction coupled with a diffusion-transformer interpolation module that ensures temporal consistency and stable streaming. In addition, a Camera Pose Predictor module is proposed to efficiently estimate both poses and intrinsics directly from RGB images, removing the reliance on known camera calibration information. Our approach enables temporally consistent novel-view video streaming in real-time using as few as two synchronized unposed input streams. LiveStre4m attains an average reconstruction time of 0.07 s per frame at 1024 × 768 resolution, outperforming the optimization-based dynamic scene representation methods by orders of magnitude in runtime. These results demonstrate that LiveStre4m makes real-time NVS streaming feasible in practical settings, marking a substantial step toward deployable live novel-view synthesis systems. Code available at: https://github.com/pedro-quesado/LiveStre4m

[145] From Static to Interactive: Adapting Visual in-Context Learners for User-Driven Tasks

Carlos Schmidt,Simon Reiß

Main category: cs.CV

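TL;DR: This paper proposes Interactive DeLVM, which extends static visual in-context learning models such as DeLVM into controllable systems that accept user interactions (scribbles, clicks, boxes). It steers predictions dynamically without fine-tuning and markedly improves performance on interactive segmentation, super-resolution, and object removal.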

Details Motivation: Existing visual in-context learning models adapt quickly to new tasks from examples, but they cannot accept user interaction signals (scribbles, clicks, bounding boxes) to actively steer predictions, which limits their usefulness in real-world settings. Method: User interaction signals are encoded directly into the example input-output pairs, leaving the visual in-context learning paradigm unchanged, so the model can respond zero-shot to unseen forms of interaction and be controlled dynamically without fine-tuning. Result: Whereas SOTA visual in-context learning models fail to exploit interaction signals, Interactive DeLVM improves IoU by +7.95% on interactive segmentation, improves PSNR by +2.46 on directed super-resolution, and lowers LPIPS by 3.14% on interactive object removal. Conclusion: The work bridges the gap between static task adaptation and user-driven interaction, offering a flexible and controllable paradigm for user-centric visual in-context learning. Abstract: Visual in-context learning models are designed to adapt to new tasks by leveraging a set of example input-output pairs, enabling rapid generalization without task-specific fine-tuning. However, these models operate in a fundamentally static paradigm: while they can adapt to new tasks, they lack any mechanism to incorporate user-provided guidance signals such as scribbles, clicks, or bounding boxes to steer or refine the prediction process. This limitation is particularly restrictive in real-world applications, where users want to actively guide model predictions, e.g., by highlighting the target object for segmentation, indicating a region which should be visually altered, or isolating a specific person in a complex scene to run targeted pose estimation. In this work, we propose a simple method to transform static visual in-context learners, particularly the DeLVM approach, into highly controllable, user-driven systems, i.e., Interactive DeLVM, enabling seamless interaction through natural visual cues such as scribbles, clicks, or drawing boxes. Specifically, by encoding interactions directly into the example input-output pairs, we keep the philosophy of visual in-context learning intact: enabling users to prompt models with unseen interactions without fine-tuning and empowering them to dynamically steer model predictions with personalized interactions. Our experiments demonstrate that SOTA visual in-context learning models fail to effectively leverage interaction cues, often ignoring user guidance entirely.
In contrast, our method excels in controllable, user-guided scenarios, achieving improvements of $+7.95%$ IoU for interactive segmentation, $+2.46$ PSNR for directed super-resolution, and $-3.14%$ LPIPS for interactive object removal. With this, our work bridges the gap between rigid static task adaptation and fluid interactivity for user-centric visual in-context learning.
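
The IoU and PSNR figures quoted above follow standard definitions, which the following minimal sketch illustrates. It is not the authors' evaluation code: the function names `iou` and `psnr`, the nested-list image format, and the 8-bit `max_val` default are assumptions made for illustration.

```python
import math

def iou(pred, target):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match (an illustrative convention).
    return inter / union if union else 1.0

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images."""
    sq_err = 0.0
    count = 0
    for row_a, row_b in zip(img_a, img_b):
        for a, b in zip(row_a, row_b):
            sq_err += (a - b) ** 2
            count += 1
    mse = sq_err / count
    return math.inf if mse == 0 else 10.0 * math.log10(max_val ** 2 / mse)
```

Higher is better for both metrics, whereas LPIPS is a learned perceptual distance for which lower is better.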

[146] How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

Roberto Brusnicki,Mattia Piccinini,Johannes Betz

Main category: cs.CV

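TL;DR: This paper presents VENUSS, a framework for systematically evaluating how sensitive vision-language models (VLMs) are on sequential driving scenes, and reveals marked weaknesses of current VLMs in understanding vehicle dynamics and temporal relations.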

Details Motivation: VLMs are increasingly applied to autonomous driving tasks, but their performance on sequential driving scenes has not been systematically characterized, and the effect of input configurations in particular has not been analyzed. Method: The authors build the VENUSS evaluation framework, which extracts temporal sequences from driving videos and designs structured evaluations, then compare 25+ VLMs across 2,600+ scenarios and analyze the effect of input configurations (resolution, frame count, temporal interval, spatial layout, presentation mode). Result: Top VLMs reach only 57% accuracy on this task, below human level (65%); the models are good at static object detection but weak at understanding vehicle dynamics and temporal relations. Conclusion: VENUSS provides the first systematic sensitivity-analysis benchmark for evaluating VLMs on sequential driving understanding, highlights key capability gaps in current models, and lays groundwork for follow-up research. Abstract: Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos, and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we reveal how even top models achieve only 57% accuracy, not matching human performance under similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel with static object detection but struggle with understanding the vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations - resolution, frame count, temporal intervals, spatial layouts, and presentation modes - affect performance on sequential driving scenes. Supplementary material available at https://V3NU55.github.io

[147] FlowInOne: Unifying Multimodal Generation as Image-in, Image-out Flow Matching

Junchao Yi,Rui Zhao,Jiahao Tang,Weixian Lei,Linjie Li,Qisheng Su,Zhengyuan Yang,Lijuan Wang,Xiaofeng Zhu,Alex Jinpeng Wang

Main category: cs.CV

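TL;DR: This paper proposes FlowInOne, a framework that unifies multimodal generation as a purely visual flow: all inputs (text, layouts, editing instructions) are converted into visual prompts, giving an end-to-end image-in, image-out pipeline. Validated with the VisPrompt-5M dataset and the VP-Bench benchmark, it reaches SOTA performance on multiple tasks.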

Details Motivation: Existing text-driven multimodal generation paradigms suffer from cross-modal alignment bottlenecks, complex noise scheduling, and task-specific architectures, which makes truly unified, vision-centric generation hard to achieve. Method: FlowInOne encodes all modality inputs uniformly as visual prompts and uses a single flow matching model for image-to-image generation; the authors also build the VisPrompt-5M visual prompt dataset (5 million pairs) and the VP-Bench evaluation benchmark. Result: FlowInOne achieves state-of-the-art performance on the unified tasks of text-to-image generation, layout-guided editing, and visual instruction following, surpassing mainstream open-source and commercial models. Conclusion: The work demonstrates that a fully vision-centric generation paradigm is feasible and lays a new foundation for perception and generation to coexist in one continuous visual space. Abstract: Multimodal generation has long been dominated by text-driven pipelines where language dictates vision but cannot reason or create within it. We challenge this paradigm by asking whether all modalities, including textual descriptions, spatial layouts, and editing instructions, can be unified into a single visual representation. We present FlowInOne, a framework that reformulates multimodal generation as a purely visual flow, converting all inputs into visual prompts and enabling a clean image-in, image-out pipeline governed by a single flow matching model. This vision-centric formulation naturally eliminates cross-modal alignment bottlenecks, noise scheduling, and task-specific architectural branches, unifying text-to-image generation, layout-guided editing, and visual instruction following under one coherent paradigm. To support this, we introduce VisPrompt-5M, a large-scale dataset of 5 million visual prompt pairs spanning diverse tasks including physics-aware force dynamics and trajectory prediction, alongside VP-Bench, a rigorously curated benchmark assessing instruction faithfulness, spatial precision, visual realism, and content consistency. Extensive experiments demonstrate that FlowInOne achieves state-of-the-art performance across all unified generation tasks, surpassing both open-source models and competitive commercial systems, establishing a new foundation for fully vision-centric generative modeling where perception and creation coexist within a single continuous visual space.

[148] FlowExtract: Procedural Knowledge Extraction from Maintenance Flowcharts

Guillermo Gil de Avalle,Laura Maruster,Eric Sloot,Christos Emmanouilidis

Main category: cs.CV

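TL;DR: FlowExtract is a new method for extracting directed graphs from ISO 5807-standardized flowcharts. It separates node detection from connectivity reconstruction, combining YOLOv8, EasyOCR, and a novel edge detection technique based on arrowhead orientation and backward line tracing, and it clearly outperforms existing vision-language models on industrial troubleshooting guides.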

Details Motivation: Maintenance procedures in manufacturing facilities often exist as flowcharts in static PDFs or scanned images, and the procedural knowledge they contain is hard for modern operator support systems to use; current mainstream vision-language models struggle to recover the connection topology of such flowcharts accurately. Method: The FlowExtract pipeline first uses YOLOv8 to detect standard flowchart elements (such as nodes) and EasyOCR to extract text, then applies a new edge detection method, based on arrowhead orientation recognition and backward tracing of connecting lines, to determine the directed connections between source and target nodes. Result: On a dataset of industrial troubleshooting guides, FlowExtract achieves very high node detection accuracy and substantially outperforms vision-language model baselines on edge extraction. Conclusion: FlowExtract offers a practical route for turning static flowcharts into queryable, computable representations of procedural knowledge; the code is open source. Abstract: Maintenance procedures in manufacturing facilities are often documented as flowcharts in static PDFs or scanned images. They encode procedural knowledge essential for asset lifecycle management, yet inaccessible to modern operator support systems. Vision-language models, the dominant paradigm for image understanding, struggle to reconstruct connection topology from such diagrams. We present FlowExtract, a pipeline for extracting directed graphs from ISO 5807-standardized flowcharts. The system separates element detection from connectivity reconstruction, using YOLOv8 and EasyOCR for standard domain-aligned node detection and text extraction, combined with a novel edge detection method that analyzes arrowhead orientations and traces connecting lines backward to source nodes. Evaluated on industrial troubleshooting guides, FlowExtract achieves very high node detection and substantially outperforms vision-language model baselines on edge extraction, offering organizations a practical path toward queryable procedural knowledge representations. The implementation is available at https://github.com/guille-gil/FlowExtract.

[149] Walk the Talk: Bridging the Reasoning-Action Gap for Thinking with Images via Multimodal Agentic Policy Optimization

Wenhao Yang,Yu Xia,Jinlong Huang,Shiyin Lu,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang,Yuchen Zhou,Xiaobo Xia,Yuanyu Wan,Lijun Zhang,Tat-Seng Chua

Main category: cs.CV

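TL;DR: This paper proposes Multimodal Agentic Policy Optimization (MAPO), which requires the model to generate explicit textual descriptions of the content obtained through visual tools and combines semantic alignment with the task reward in advantage estimation. This addresses the training noise and performance degradation caused by inconsistency between reasoning and visual actions in multimodal large language models.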

Details Motivation: Existing reinforcement learning methods based on outcome rewards overlook cases where the textual reasoning looks plausible but the visual action fails, which leads to reasoning-action inconsistency, accumulating noise, and even training collapse. Method: MAPO (1) requires the model to generate explicit textual descriptions of visual tool outputs within its Multimodal Chain-of-Thought (MCoT); (2) designs a new advantage estimate that couples semantic alignment (between the descriptions and the actual observations) with the task reward; and (3) provides theoretical analysis showing that it reduces gradient variance. Result: The method outperforms existing approaches on multiple visual reasoning benchmarks, supporting its effectiveness and robustness. Conclusion: MAPO bridges the gap between linguistic reasoning and visual action in multimodal reasoning, improves execution accuracy and overall reasoning ability, and offers a new paradigm for training multimodal agentic models. Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have incentivized models to "think with images" by actively invoking visual tools during multi-turn reasoning. The common Reinforcement Learning (RL) practice of relying on outcome-based rewards ignores the fact that textual plausibility often masks executive failure, meaning that models may exhibit intuitive textual reasoning while executing imprecise or irrelevant visual actions within their agentic reasoning trajectories. This reasoning-action discrepancy introduces noise that accumulates throughout the multi-turn reasoning process, severely degrading the model's multimodal reasoning capabilities and potentially leading to training collapse. In this paper, we introduce Multimodal Agentic Policy Optimization (MAPO), bridging the gap between textual reasoning and visual actions generated by models within their Multimodal Chain-of-Thought (MCoT). Specifically, MAPO mandates the model to generate explicit textual descriptions for the visual content obtained via tool usage. We then employ a novel advantage estimation that couples the semantic alignment between these descriptions and the actual observations with the task reward. Theoretical findings are provided to justify the rationale behind MAPO, which inherently reduces the variance of gradients, and extensive experiments demonstrate that our method achieves superior performance across multiple visual reasoning benchmarks.

[150] EventFace: Event-Based Face Recognition via Structure-Driven Spatiotemporal Modeling

Qingguo Meng,Xingbo Dong,Zhe Jin,Massimo Tistarelli

Main category: cs.CV

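TL;DR: This paper proposes EventFace, a framework that models facial identity in event streams by combining spatial structure and temporal dynamics. It achieves strong recognition performance on EFace, a small-scale event face dataset built by the authors, and shows greater robustness to illumination and better privacy protection.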

Details Motivation: Event cameras are robust to illumination and privacy-friendly, but event streams lack the stable photometric appearance of RGB images, so conventional methods do not transfer directly; dedicated event face datasets are also lacking, which calls for a new dataset and a suitable structure-driven spatiotemporal identity representation. Method: The authors build the small-scale event face dataset EFace and propose EventFace, which uses LoRA to transfer structural priors from pretrained RGB face models to the event domain as a spatial basis, introduces a Motion Prompt Encoder (MPE) to explicitly encode temporal features, and fuses spatial and temporal features with a Spatiotemporal Modulator (STM). Result: On EFace the Rank-1 identification rate is 94.19% and the EER is 5.35%, better than all baselines; robustness under degraded illumination is stronger; and the learned representations have lower template reconstructability, which benefits privacy. Conclusion: Event-based face recognition should focus on structure-driven spatiotemporal identity representations shaped by rigid facial motion and individual geometry; by transferring structural priors and explicitly modeling motion dynamics, EventFace overcomes the scarcity of event data and unstable representations while balancing performance and privacy. Abstract: Event cameras offer a promising sensing modality for face recognition due to their inherent advantages in illumination robustness and privacy-friendliness. However, because event streams lack the stable photometric appearance relied upon by conventional RGB-based face recognition systems, we argue that event-based face recognition should model structure-driven spatiotemporal identity representations shaped by rigid facial motion and individual facial geometry. Since dedicated datasets for event-based face recognition remain lacking, we construct EFace, a small-scale event-based face dataset captured under rigid facial motion. To learn effectively from this limited event data, we further propose EventFace, a framework for event-based face recognition that integrates spatial structure and temporal dynamics for identity modeling. Specifically, we employ Low-Rank Adaptation (LoRA) to transfer structural facial priors from pretrained RGB face models to the event domain, thereby establishing a reliable spatial basis for identity modeling. Building on this foundation, we further introduce a Motion Prompt Encoder (MPE) to explicitly encode temporal features and a Spatiotemporal Modulator (STM) to fuse them with spatial features, thereby enhancing the representation of identity-relevant event patterns. Extensive experiments demonstrate that EventFace achieves the best performance among the evaluated baselines, with a Rank-1 identification rate of 94.19% and an equal error rate (EER) of 5.35%.
Results further indicate that EventFace exhibits stronger robustness under degraded illumination than the competing methods. In addition, the learned representations exhibit reduced template reconstructability.
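
For readers unfamiliar with the equal error rate (EER) reported above, the sketch below shows one common way to approximate it from verification scores. It is an illustration, not the authors' evaluation protocol: the function name `equal_error_rate`, the "accept if score >= threshold" convention, and the threshold sweep over observed scores are assumptions.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by trying every observed score as an accept threshold.

    genuine:  similarity scores for same-identity pairs
    impostor: similarity scores for different-identity pairs
    """
    best_gap, best_eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # impostors wrongly accepted
        frr = sum(s < t for s in genuine) / len(genuine)     # genuine pairs wrongly rejected
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest to equal.
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer
```

An EER of 5.35% therefore means that, at the operating point where false accepts and false rejects balance, each occurs for roughly 5.35% of the corresponding pairs.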

[151] Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer

Bohao Xing,Deng Li,Rong Gao,Xin Liu,Heikki Kälviäinen

Main category: cs.CV

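TL;DR: This paper proposes OG-ReG, a dual-path Transformer architecture that mimics the "glance" and "gaze" mechanisms of the human visual system, modeling coarse-grained global spatiotemporal information and fine-grained local detail respectively, to improve video understanding without sacrificing long-range temporal modeling.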

Details Motivation: Existing video Transformers mostly use factorized or window-based self-attention, which splits spatiotemporal correlations and makes it hard to model motion and long-range dependencies; inspired by the time-scale dependence and sparse temporal attention (glance and gaze) of human vision, the authors question the implicit assumption that time and space are equally important. Method: The Overall Glance and Refined Gaze (OG-ReG) Transformer has a Glance path that captures overall coarse-grained spatiotemporal features and a Gaze path that supplies fine local information; the two paths jointly model multi-scale spatiotemporal dynamics. Result: The model reaches SOTA performance on three mainstream video action recognition datasets: Kinetics-400, Something-Something v2, and Diving-48. Conclusion: The importance of temporal and spatial information depends on the time scale, and explicitly modeling complementary coarse and fine attention paths learns video representations more effectively than forcing uniform treatment of the spatiotemporal dimensions. Abstract: Recently, Transformers have made significant progress in various vision tasks. To balance computation and efficiency in video tasks, recent works heavily rely on factorized or window-based self-attention. However, these approaches split spatiotemporal correlations between regions of interest in videos, limiting the models' ability to capture motion and long-range dependencies. In this paper, we argue that, similar to the human visual system, the importance of temporal and spatial information varies across different time scales, and attention is allocated sparsely over time through glance and gaze behavior. Is equal consideration of time and space crucial for success in video tasks? Motivated by this understanding, we propose a dual-path network called the Overall Glance and Refined Gaze (OG-ReG) Transformer. The Glance path extracts coarse-grained overall spatiotemporal information, while the Gaze path supplements the Glance path by providing local details. Our model achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48, demonstrating its competitive performance. The code will be available at https://github.com/linuxsino/OG-ReG.

[152] Video-guided Machine Translation with Global Video Context

Jian Chen,JinZe Lv,Zi Long,XiangHua Fu

Main category: cs.CV

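TL;DR: This paper proposes a globally video-guided multimodal translation framework that retrieves related subtitle segments with a pretrained semantic encoder and a vector database, then combines an attention mechanism with region-aware cross-modal attention to improve translation in long-video scenarios.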

Details Motivation: Existing video-guided multimodal translation methods are limited to locally aligned video segments paired one-to-one with subtitles, so they struggle to capture global narrative context that spans multiple segments of a long video. Method: The proposed globally video-guided framework retrieves subtitles with a pretrained semantic encoder and a vector database to build a set of related video context; an attention mechanism focuses on highly relevant visual content while retaining the overall context; and a region-aware cross-modal attention strengthens semantic alignment during translation. Result: Experiments on a large-scale documentary translation dataset show that the method significantly outperforms baseline models, especially in long-video scenarios. Conclusion: The framework improves how multimodal translation understands and uses global video context and offers a new direction for long-video translation. Abstract: Video-guided Multimodal Translation (VMT) has advanced significantly in recent years. However, most existing methods rely on locally aligned video segments paired one-to-one with subtitles, limiting their ability to capture global narrative context across multiple segments in long videos. To overcome this limitation, we propose a globally video-guided multimodal translation framework that leverages a pretrained semantic encoder and vector database-based subtitle retrieval to construct a context set of video segments closely related to the target subtitle semantics. An attention mechanism is employed to focus on highly relevant visual content, while preserving the remaining video features to retain broader contextual information. Furthermore, we design a region-aware cross-modal attention mechanism to enhance semantic alignment during translation. Experiments on a large-scale documentary translation dataset demonstrate that our method significantly outperforms baseline models, highlighting its effectiveness in long-video scenarios.

[153] FedDAP: Domain-Aware Prototype Learning for Federated Learning under Domain Shift

Huy Q. Le,Loc X. Nguyen,Yu Qiao,Seong Tae Kim,Eui-Nam Huh,Choong Seon Hong

Main category: cs.CV

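TL;DR: This paper proposes FedDAP, which builds domain-aware global prototypes and enforces intra-domain feature-prototype alignment together with inter-domain separation, mitigating the performance drop that domain differences across clients cause in federated learning.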

Details Motivation: In real-world federated learning, client data often come from different domains, causing severe domain shift and degraded global model performance; existing prototype-based methods do not preserve domain information, and their alignment ignores domain differences. Method: Federated Domain-Aware Prototypes (FedDAP) aggregates the local prototypes of clients in the same domain with a similarity-weighted fusion mechanism to build domain-specific global prototypes, which then guide local training toward intra-domain alignment and inter-domain separation. Result: Experiments on three datasets (DomainNet, Office-10, and PACS) support the effectiveness of FedDAP in mitigating domain shift. Conclusion: By introducing a domain-aware prototype mechanism, FedDAP improves the generalization and robustness of federated learning in non-IID cross-domain settings. Abstract: Federated Learning (FL) enables decentralized model training across multiple clients without exposing private data, making it ideal for privacy-sensitive applications. However, in real-world FL scenarios, clients often hold data from distinct domains, leading to severe domain shift and degraded global model performance. To address this, prototype learning has emerged as a promising solution, which leverages class-wise feature representations. Yet, existing methods face two key limitations: (1) Existing prototype-based FL methods typically construct a $\textit{single global prototype}$ per class by aggregating local prototypes from all clients without preserving domain information. (2) Current feature-prototype alignment is $\textit{domain-agnostic}$, forcing clients to align with global prototypes regardless of domain origin. To address these challenges, we propose Federated Domain-Aware Prototypes (FedDAP) to construct domain-specific global prototypes by aggregating local client prototypes within the same domain using a similarity-weighted fusion mechanism. These global domain-specific prototypes are then used to guide local training by aligning local features with prototypes from the same domain, while encouraging separation from prototypes of different domains. This dual alignment enhances domain-specific learning at the local level and enables the global model to generalize across diverse domains. Finally, we conduct extensive experiments on three different datasets: DomainNet, Office-10, and PACS to demonstrate the effectiveness of our proposed framework to address the domain shift challenges.
The code is available at https://github.com/quanghuy6997/FedDAP.
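
To make the idea of similarity-weighted, per-domain prototype fusion concrete, here is a minimal sketch. The abstract does not give the fusion formula, so everything specific here is an assumption for illustration rather than the authors' method: the function name `fuse_domain_prototypes`, the softmax over cosine similarity to the group mean, the `temperature` parameter, and the premise that each client reports a known domain label.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def fuse_domain_prototypes(local_protos, temperature=1.0):
    """Fuse local prototypes into one global prototype per (domain, class).

    local_protos: list of (domain_id, class_id, vector) tuples sent by clients.
    Weights are a softmax over each prototype's similarity to its group mean
    (a hypothetical weighting chosen for this sketch).
    """
    groups = {}
    for domain, cls, vec in local_protos:
        groups.setdefault((domain, cls), []).append(vec)

    fused = {}
    for key, vecs in groups.items():
        dim = len(vecs[0])
        mean = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
        sims = [cosine(v, mean) / temperature for v in vecs]
        top = max(sims)
        exps = [math.exp(s - top) for s in sims]  # shift by max for numerical stability
        total = sum(exps)
        weights = [e / total for e in exps]
        fused[key] = [sum(w * v[i] for w, v in zip(weights, vecs)) for i in range(dim)]
    return fused
```

Keeping one prototype per (domain, class) key, rather than one per class, is what lets local training pull features toward same-domain prototypes and push them away from other domains.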

[154] Generate, Analyze, and Refine: Training-Free Sound Source Localization via MLLM Meta-Reasoning

Subin Park,Jung Uk Kim

Main category: cs.CV

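TL;DR: This paper proposes GAR, a training-free sound source localization (SSL) framework that uses the intrinsic reasoning ability of multimodal large language models (MLLMs) and improves localization in complex acoustic scenes through a three-stage generate-analyze-refine pipeline.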

Details Motivation: Existing sound source localization methods based on contrastive learning lack explicit reasoning and verification, which limits their performance in complex acoustic scenes; inspired by human meta-cognition, the authors explore training-free SSL that uses the reasoning ability of MLLMs. Method: The training-free Generation-Analysis-Refinement (GAR) framework has three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies audio-visual consistency through open-set role tagging and anchor voting; Refinement uses adaptive gating to prevent over-adjustment. Result: The method shows competitive performance on both single-source and multi-source benchmarks. Conclusion: GAR demonstrates that efficient, training-free sound source localization using the intrinsic reasoning ability of MLLMs is feasible, and it offers a new direction for multimodal perception. Abstract: The sound source localization (SSL) task aims to identify the locations of sound-emitting objects by leveraging correlations between audio and visual modalities. Most existing SSL methods rely on contrastive learning-based feature matching, but lack explicit reasoning and verification, limiting their effectiveness in complex acoustic scenes. Inspired by human meta-cognitive processes, we propose a training-free SSL framework that exploits the intrinsic reasoning capabilities of Multimodal Large Language Models (MLLMs). Our Generation-Analysis-Refinement (GAR) pipeline consists of three stages: Generation produces initial bounding boxes and audio classifications; Analysis quantifies Audio-Visual Consistency via open-set role tagging and anchor voting; and Refinement applies adaptive gating to prevent unnecessary adjustments. Extensive experiments on single-source and multi-source benchmarks demonstrate competitive performance. The source code is available at https://github.com/VisualAIKHU/GAR-SSL.

[155] RePL: Pseudo-label Refinement for Semi-supervised LiDAR Semantic Segmentation

Donghyeon Kwon,Taegyu Park,Suha Kwak

Main category: cs.CV

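TL;DR: This paper proposes RePL, a framework that identifies and corrects likely errors in pseudo-labels through masked reconstruction and, with a dedicated training strategy, improves pseudo-label quality in semi-supervised LiDAR semantic segmentation. Theory and experiments both support its effectiveness, and it reaches SOTA on nuScenes-lidarseg and SemanticKITTI.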

Details Motivation: Semi-supervised LiDAR semantic segmentation often suffers from error propagation and confirmation bias caused by noisy pseudo-labels. Method: The RePL framework uses masked reconstruction to identify and correct pseudo-label errors and adds a dedicated training strategy; the authors also give a theoretical analysis of when pseudo-label refinement is beneficial. Result: On the nuScenes-lidarseg and SemanticKITTI datasets, RePL substantially improves pseudo-label quality and achieves SOTA performance in LiDAR semantic segmentation. Conclusion: RePL mitigates pseudo-label noise, its theoretical condition is mild and easily satisfied, and it is an effective way to improve semi-supervised LiDAR semantic segmentation. Abstract: Semi-supervised learning for LiDAR semantic segmentation often suffers from error propagation and confirmation bias caused by noisy pseudo-labels. To tackle this chronic issue, we introduce RePL, a novel framework that enhances pseudo-label quality by identifying and correcting potential errors in pseudo-labels through masked reconstruction, along with a dedicated training strategy. We also provide a theoretical analysis demonstrating the condition under which the pseudo-label refinement is beneficial, and empirically confirm that the condition is mild and clearly met by RePL. Extensive evaluations on the nuScenes-lidarseg and SemanticKITTI datasets show that RePL substantially improves pseudo-label quality and, as a result, achieves the state of the art in LiDAR semantic segmentation.

[156] VGGT-SLAM++

Avilasha Mandal,Rajesh Kumar,Sudarshan Sunil Harithas,Chetan Arora

Main category: cs.CV

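TL;DR: VGGT-SLAM++ is a complete SLAM system built on the Visual Geometry Grounded Transformer (VGGT). By adding DEM-assisted graph construction, visual place recognition driven by DINOv2 embeddings, and high-cadence local bundle adjustment, it substantially reduces short-term pose drift and improves the accuracy and efficiency of large-scale mapping.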

Details Motivation: Existing Transformer-based SLAM systems such as VGGT-SLAM rely on sparse loop closures or global Sim(3) constraints, which permits short-horizon pose drift; the goal is to strengthen local geometric consistency and real-time optimization. Method: Visual odometry fuses the VGGT feed-forward output with a Sim(3) solver; a Digital Elevation Map (DEM) is built per submap, partitioned into patches, and embedded with DINOv2 to construct a covisibility graph; within the covisibility window, VPR retrieves spatial neighbors and triggers high-cadence local bundle adjustment (LBA). Result: The system reaches SOTA accuracy on standard SLAM benchmarks, substantially reduces short-term drift, speeds up graph optimization convergence, and maintains global consistency with compact DEM tiles and sublinear retrieval. Conclusion: Through geometry-aware multimodal representations and efficient spatial retrieval, VGGT-SLAM++ achieves accurate, robust large-scale visual SLAM while keeping memory bounded. Abstract: We introduce VGGT-SLAM++, a complete visual SLAM system that leverages the geometry-rich outputs of the Visual Geometry Grounded Transformer (VGGT). The system comprises a visual odometry (front-end) fusing the VGGT feed-forward transformer and a Sim(3) solution, a Digital Elevation Map (DEM)-based graph construction module, and a back-end that jointly enable accurate large-scale mapping with bounded memory. While prior transformer-based SLAM pipelines such as VGGT-SLAM rely primarily on sparse loop closures or global Sim(3) manifold constraints - allowing short-horizon pose drift - VGGT-SLAM++ restores high-cadence local bundle adjustment (LBA) through a spatially corrective back-end. For each VGGT submap, we construct a dense planar-canonical DEM, partition it into patches, and compute their DINOv2 embeddings to integrate the submap into a covisibility graph. Spatial neighbors are retrieved using a Visual Place Recognition (VPR) module within the covisibility window, triggering frequent local optimization that stabilizes trajectories. Across standard SLAM benchmarks, VGGT-SLAM++ achieves state-of-the-art accuracy, substantially reducing short-term drift, accelerating graph convergence, and maintaining global consistency with compact DEM tiles and sublinear retrieval.

[157] CloudMamba: An Uncertainty-Guided Dual-Scale Mamba Network for Cloud Detection in Remote Sensing Imagery

Jiajun Yang,Keyan Chen,Zhengxia Zou,Zhenwei Shi

Main category: cs.CV

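TL;DR: This paper proposes CloudMamba, which uses an uncertainty-guided two-stage strategy to resolve ambiguity in thin-cloud regions and a dual-scale CNN-Mamba hybrid network to segment fragmented clouds and cloud boundaries efficiently and accurately.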

Details Motivation: Existing single-stage cloud detection methods are ambiguous and uncertain in thin-cloud regions and struggle with fragmented clouds and boundary details. Method: The authors propose an uncertainty-guided two-stage cloud detection strategy with an embedded uncertainty estimation module and a second-stage refinement segmentation, and design a dual-scale Mamba network based on a CNN-Mamba hybrid architecture that combines linear computational complexity with multi-scale feature modeling. Result: On the GF1_WHU and Levir_CS datasets, the method surpasses existing approaches on multiple segmentation accuracy metrics while remaining efficient and interpretable. Conclusion: With its two-stage strategy and dual-scale Mamba structure, CloudMamba improves the accuracy and robustness of cloud detection for thin clouds, fragmented clouds, and boundary details while remaining computationally efficient. Abstract: Cloud detection in remote sensing imagery is a fundamental, critical, and highly challenging problem. Existing deep learning-based cloud detection methods generally formulate it as a single-stage pixel-wise binary segmentation task with one forward pass. However, such single-stage approaches exhibit ambiguity and uncertainty in thin-cloud regions and struggle to accurately handle fragmented clouds and boundary details. In this paper, we propose a novel deep learning framework termed CloudMamba. To address the ambiguity in thin-cloud regions, we introduce an uncertainty-guided two-stage cloud detection strategy. An embedded uncertainty estimation module is proposed to automatically quantify the confidence of thin-cloud segmentation, and a second-stage refinement segmentation is introduced to improve the accuracy in low-confidence hard regions. To better handle fragmented clouds and fine-grained boundary details, we design a dual-scale Mamba network based on a CNN-Mamba hybrid architecture. Compared with Transformer-based models with quadratic computational complexity, the proposed method maintains linear computational complexity while effectively capturing both large-scale structural characteristics and small-scale boundary details of clouds, enabling accurate delineation of overall cloud morphology and precise boundary segmentation. Extensive experiments conducted on the GF1_WHU and Levir_CS public datasets demonstrate that the proposed method outperforms existing approaches across multiple segmentation accuracy metrics, while offering high efficiency and process transparency. Our code is available at https://github.com/jayoungo/CloudMamba.
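
The two-stage idea of refining only low-confidence pixels can be sketched as follows. This is not CloudMamba's learned uncertainty module: per-pixel binary entropy is swapped in here as a simple stand-in uncertainty measure, and `two_stage_mask`, `refine_fn`, and the `threshold` value are hypothetical names and settings chosen for illustration.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 0 when fully confident."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def two_stage_mask(probs, refine_fn, threshold=0.5):
    """Binarize first-stage cloud probabilities, refining only uncertain pixels.

    probs:     nested list of first-stage cloud probabilities in [0, 1]
    refine_fn: hypothetical second-stage model, called as refine_fn(row, col, p)
               and returning a refined probability for that pixel
    threshold: entropy above which a pixel counts as low-confidence
    """
    mask = []
    for r, row in enumerate(probs):
        mask_row = []
        for c, p in enumerate(row):
            if binary_entropy(p) > threshold:
                p = refine_fn(r, c, p)  # second pass only where stage one is unsure
            mask_row.append(1 if p >= 0.5 else 0)
        mask.append(mask_row)
    return mask
```

Confident pixels keep their first-stage decision, so the second stage spends its capacity on the ambiguous thin-cloud regions the abstract describes.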

[158] Vision-Language Model-Guided Deep Unrolling Enables Personalized, Fast MRI

Fangmao Ju,Yuzhu He,Zhiwen Xue,Chunfeng Lian,Jianhua Ma

Main category: cs.CV

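TL;DR: This paper proposes PASS, a framework in which a vision-language model (VLM) guides a deep unrolling network to deliver task-oriented, personalized, anomaly-aware fast MRI, markedly improving image quality and downstream diagnostic performance.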

Details Motivation: Traditional accelerated MRI methods optimize only generic image quality and do not adapt to specific clinical tasks; long MRI acquisition times also limit clinical efficiency. Method: PASS comprises (1) a deep unrolled reconstruction network built on a physics-based MRI model; (2) a patient-specific k-space sampling module; and (3) an anomaly-aware prior extracted from a pretrained VLM that jointly guides sampling and reconstruction. Result: PASS obtains better image quality across diverse anatomies, contrasts, anomaly types, and acceleration factors, and markedly improves downstream tasks such as fine-grained anomaly detection, localization, and diagnosis. Conclusion: Combining the high-level clinical reasoning of a VLM with an interpretable, physics-driven network enables task-oriented, personalized fast MRI and moves intelligent imaging closer to clinical use. Abstract: Magnetic Resonance Imaging (MRI) is a cornerstone in medicine and healthcare but suffers from long acquisition times. Traditional accelerated MRI methods optimize for generic image quality, lacking adaptability for specific clinical tasks. To address this, we introduce PASS (Personalized, Anomaly-aware Sampling and reconStruction), an intelligent MRI framework that leverages a Vision-Language Model (VLM) to guide a deep unrolling network for task-oriented, fast imaging. PASS dynamically personalizes the imaging pipeline through three core contributions: (1) a deep unrolled reconstruction network derived from a physics-based MRI model; (2) a sampling module that generates patient-specific $k$-space trajectories; and (3) an anomaly-aware prior, extracted from a pretrained VLM, which steers both sampling and reconstruction toward clinically relevant regions. By integrating the high-level clinical reasoning of a VLM with an interpretable, physics-aware network, PASS achieves superior image quality across diverse anatomies, contrasts, anomalies, and acceleration factors. This enhancement directly translates to improvements in downstream diagnostic tasks, including fine-grained anomaly detection, localization, and diagnosis.

[159] Physical Adversarial Attacks on AI Surveillance Systems: Detection, Tracking, and Visible--Infrared Evasion

Miguel A. DelaCruz,Patricia Mae Santos,Rafael T. Navarro

Main category: cs.CV

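TL;DR: This paper reviews research on physical adversarial attacks from the standpoint of deployed surveillance systems, emphasizing temporal persistence, multimodal sensing, realism of the attack carrier, and system-level objectives, and proposes a four-part taxonomy to organize existing work.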

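Details Motivation: Most existing research on physical adversarial attacks is based on isolated image benchmarks, whereas real surveillance systems involve person detection, multi-object tracking, visible-infrared sensing, and the practical form of the attack carrier, so the field needs to be re-examined from a deployment-oriented viewpoint. Method: The paper is a survey with a taxonomy: it builds a four-part classification covering temporal persistence, sensing modality, carrier realism, and system-level objective, and uses it to organize and analyze recent representative work on multi-object tracking, dual-modal visible-infrared evasion, and controllable clothing. Result: The review exposes shortcomings in current evaluation practice, including distance robustness, camera-pipeline variation, missing identity-level metrics, and the lack of activation-aware testing, and argues that surveillance robustness must be evaluated as a system problem that evolves over time, spans sensors, and is shaped by real physical constraints. Conclusion: Isolated per-frame benchmarks alone cannot reliably assess the robustness of surveillance systems; adversarial attacks must be placed in the context of the full surveillance system and analyzed with the temporal dimension, multimodal inputs, and physical feasibility taken together. Abstract: Physical adversarial attacks are increasingly studied in settings that resemble deployed surveillance systems rather than isolated image benchmarks. In these settings, person detection, multi-object tracking, visible--infrared sensing, and the practical form of the attack carrier all matter at once. This changes how the literature should be read. A perturbation that suppresses a detector in one frame may have limited practical effect if identity is recovered over time; an RGB-only result may say little about night-time systems that rely on visible and thermal inputs together; and a conspicuous patch can imply a different threat model from a wearable or selectively activated carrier. This paper reviews physical attacks from that surveillance-oriented viewpoint. Rather than attempting a complete catalogue of all physical attacks in computer vision, we focus on the technical questions that become central in surveillance: temporal persistence, sensing modality, carrier realism, and system-level objective. We organize prior work through a four-part taxonomy and discuss how recent results on multi-object tracking, dual-modal visible--infrared evasion, and controllable clothing reflect a broader change in the field. We also summarize evaluation practices and unresolved gaps, including distance robustness, camera-pipeline variation, identity-level metrics, and activation-aware testing.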
The resulting picture is that surveillance robustness cannot be judged reliably from isolated per-frame benchmarks alone; it has to be examined as a system problem unfolding over time, across sensors, and under realistic physical deployment constraints.

[160] RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

Dewei Zhou,You Li,Zongxin Yang,Yi Yang

Main category: cs.CV

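TL;DR: This paper proposes RefineAnything, a method for fine-grained refinement of local image regions. Using a Focus-and-Refine strategy and a Boundary Consistency Loss, it markedly improves detail quality in the edited region while keeping the background strictly unchanged.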

Details Motivation: Existing image generation and editing models often collapse on local details (such as text, logos, and thin structures) and struggle to refine small target regions while keeping the background consistent. Method: RefineAnything (1) is built on a multimodal diffusion architecture; (2) uses a Focus-and-Refine strategy (crop, focus, refine, blended paste-back) to reallocate the resolution budget; (3) adds a boundary-aware Boundary Consistency Loss to reduce seam artifacts; and (4) comes with the Refine-30K dataset and the RefineEval benchmark. Result: On RefineEval the model clearly outperforms existing methods, reconstructing the edited region with high fidelity and preserving the background almost perfectly. Conclusion: RefineAnything offers a practical, efficient, and robust paradigm for high-precision, region-controllable local image refinement. Abstract: We introduce region-specific image refinement as a dedicated problem setting: given an input image and a user-specified region (e.g., a scribble mask or a bounding box), the goal is to restore fine-grained details while keeping all non-edited pixels strictly unchanged. Despite rapid progress in image generation, modern models still frequently suffer from local detail collapse (e.g., distorted text, logos, and thin structures). Existing instruction-driven editing models emphasize coarse-grained semantic edits and often either overlook subtle local defects or inadvertently change the background, especially when the region of interest occupies only a small portion of a fixed-resolution input. We present RefineAnything, a multimodal diffusion-based refinement model that supports both reference-based and reference-free refinement. Building on a counter-intuitive observation that crop-and-resize can substantially improve local reconstruction under a fixed VAE input resolution, we propose Focus-and-Refine, a region-focused refinement-and-paste-back strategy that improves refinement effectiveness and efficiency by reallocating the resolution budget to the target region, while a blended-mask paste-back guarantees strict background preservation. We further introduce a boundary-aware Boundary Consistency Loss to reduce seam artifacts and improve paste-back naturalness. To support this new setting, we construct Refine-30K (20K reference-based and 10K reference-free samples) and introduce RefineEval, a benchmark that evaluates both edited-region fidelity and background consistency.
On RefineEval, RefineAnything achieves strong improvements over competitive baselines and near-perfect background preservation, establishing a practical solution for high-precision local refinement. Project Page: https://limuloo.github.io/RefineAnything/.
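
The crop, refine, and paste-back loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `focus_and_refine`, `refine_fn`, and the linear `feather` ramp are hypothetical names and choices, and the resize of the crop to the model's input resolution is omitted for brevity. The point it shows is that pixels outside the box are copied through untouched, so background preservation holds by construction.

```python
import numpy as np

def focus_and_refine(image, box, refine_fn, feather=4):
    """Crop a region, refine it, and paste it back with a feathered mask.

    box is (y0, x0, y1, x1); refine_fn is a stand-in for the refinement
    model (hypothetical) and must return an array shaped like its input.
    """
    y0, x0, y1, x1 = box
    crop = image[y0:y1, x0:x1].astype(np.float64)
    refined = refine_fn(crop)

    # Blend weight: 0 on the crop border, rising linearly to 1 inside.
    h, w = crop.shape[:2]
    dist_y = np.minimum(np.arange(h), np.arange(h)[::-1])[:, None]
    dist_x = np.minimum(np.arange(w), np.arange(w)[::-1])[None, :]
    alpha = np.clip(np.minimum(dist_y, dist_x) / float(feather), 0.0, 1.0)
    if crop.ndim == 3:
        alpha = alpha[..., None]

    out = image.astype(np.float64).copy()
    # Only pixels inside the box are written; everything else is untouched.
    out[y0:y1, x0:x1] = alpha * refined + (1.0 - alpha) * crop
    return out
```

A real pipeline would upsample the crop before calling the model and downsample the result before blending; that is where the resolution budget is reallocated.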

[161] SCT-MOT: Enhancing Air-to-Air Multiple UAVs Tracking with Swarm-Coupled Motion and Trajectory Guidance

Zhaochen Chu,Tao Song,Ren Jin,Shaoming He,Defu Lin,Siqing Cheng

Main category: cs.CV

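TL;DR: This paper proposes the SCT-MOT framework, which improves the accuracy and robustness of air-to-air tracking of small UAV swarms through swarm-coupled motion modeling and trajectory-guided feature fusion.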

Details Motivation: Existing methods model each target independently and ignore swarm-level motion dependencies; they also integrate motion prediction and appearance representation only weakly, which makes it hard to maintain coherent trajectories and reliable associations in cluttered scenes with weak visual cues. Method: Proposes the SCT-MOT framework with two core modules: 1) a Swarm Motion-Aware Trajectory Prediction (SMTP) module that jointly models historical trajectories and posture-aware appearance features at the swarm level; 2) a Trajectory-Guided Spatio-Temporal Feature Fusion (TG-STFF) module that aligns predicted positions with historical visual cues and deeply integrates them with current-frame features. Result: Experiments on three public datasets (AIRMOT, MOT-FLY, and UAVSwarm) show that the SMTP module improves IDF1 by 1.21% over EqMotion, and that SCT-MOT as a whole clearly outperforms state-of-the-art trackers on multiple metrics. Conclusion: By introducing swarm-level motion modeling and trajectory-guided feature fusion, SCT-MOT addresses swarm tracking under small targets, weak cues, and strongly coupled motion, improving tracking accuracy and robustness. Abstract: Air-to-air tracking of swarm UAVs presents significant challenges due to the complex nonlinear group motion and weak visual cues for small objects, which often cause detection failures, trajectory fragmentation, and identity switches. Although existing methods have attempted to improve performance by incorporating trajectory prediction, they model each object independently, neglecting the swarm-level motion dependencies. Their limited integration between motion prediction and appearance representation also weakens the spatio-temporal consistency required for tracking in visually ambiguous and cluttered environments, making it difficult to maintain coherent trajectories and reliable associations. To address these challenges, we propose SCT-MOT, a tracking framework that integrates Swarm-Coupled motion modeling and Trajectory-guided feature fusion. First, we develop a Swarm Motion-Aware Trajectory Prediction (SMTP) module that jointly models historical trajectories and posture-aware appearance features from a swarm-level perspective, enabling more accurate forecasting of the nonlinear, coupled group trajectories. Second, we design a Trajectory-Guided Spatio-Temporal Feature Fusion (TG-STFF) module that aligns predicted positions with historical visual cues and deeply integrates them with current frame features, enhancing temporal consistency and spatial discriminability for weak objects. 
Extensive experiments on three public air-to-air swarm UAV tracking datasets, including AIRMOT, MOT-FLY, and UAVSwarm, demonstrate that SMTP achieves more accurate trajectory forecasts and yields a 1.21% IDF1 improvement over the state-of-the-art trajectory prediction module EqMotion when integrated into the same MOT framework. Overall, our SCT-MOT consistently achieves superior accuracy and robustness compared to state-of-the-art trackers across multiple metrics under complex swarm scenarios.

[162] Time-driven Survival Analysis from FDG-PET/CT in Non-Small Cell Lung Cancer

Sambit Tarai,Ashish Chauhan,Elin Lundström,Johan Öfverstedt,Therese Sjöholm,Veronica Sanchez Rodriguez,Håkan Ahlström,Joel Kullberg

Main category: cs.CV

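TL;DR: This paper proposes a deep regression framework that combines FDG-PET/CT imaging with a time input for time-dependent prediction of overall survival (OS) in patients with non-small cell lung cancer, and shows on the U-CAN cohort that it outperforms an imaging-only baseline.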

Details Motivation: Better patient prognostics and personalized treatment planning call for automated, accurate, time-dependent survival prediction from medical images. Method: Uses a ResNet-50 backbone to extract features from tissue-wise FDG-PET/CT projection images and combines them with a scalar time-horizon input (in days) to model OS probability as a function of time; developed on the U-CAN cohort (n=556) and compared with an imaging-only baseline on the test set (n=292). Result: Adding the temporal input improves AUC by 4.3%; the clinical + IDP feature model performs strongly; the ensemble of imaging and clinical + IDP models reaches the best AUC of 0.788; the method supports risk stratification; saliency heat maps highlight tumor regions as key to the prediction. Conclusion: The framework provides automated, time-dependent OS prediction and shows that combining imaging with structured clinical data can markedly improve survival prediction. Abstract: Purpose: Automated medical image-based prediction of clinical outcomes, such as overall survival (OS), has great potential in improving patient prognostics and personalized treatment planning. We developed a deep regression framework using tissue-wise FDG-PET/CT projections as input, along with a temporal input representing a scalar time horizon (in days) to predict OS in patients with Non-Small Cell Lung Cancer (NSCLC). Methods: The proposed framework employed a ResNet-50 backbone to process input images and generate corresponding image embeddings. The embeddings were then combined with temporal data to produce OS probabilities as a function of time, effectively parameterizing the predictions based on time. The overall framework was developed using the U-CAN cohort (n = 556) and evaluated by comparing with a baseline method on the test set (n = 292). The baseline utilized the ResNet-50 architecture, processing only the images as input and providing OS predictions at pre-specified intervals, such as 2- or 5-year. Results: The incorporation of temporal data with image embeddings demonstrated an advantage in predicting OS, outperforming the baseline method with an improvement in AUC of 4.3%. The proposed model using clinical + IDP features achieved strong performance, and an ensemble of imaging and clinical + IDP models achieved the best overall performance (0.788), highlighting the complementary value of multimodal inputs. The proposed method also enabled risk stratification of patients into distinct categories (high vs low risk). Heat maps from the saliency analysis highlighted tumor regions as key structures for the prediction. 
Conclusion: Our method provides an automated framework for predicting OS as a function of time and demonstrates the potential of combining imaging and tabular data for improved survival prediction.
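
The idea of conditioning one prediction head on a scalar time horizon, rather than training separate heads for fixed intervals, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function name `survival_probability`, the linear-plus-sigmoid head, and the `time_scale` normalization are not taken from the paper, which uses a ResNet-50 embedding and does not specify the head in the abstract.

```python
import numpy as np

def survival_probability(embedding, horizon_days, weights, bias, time_scale=365.0):
    """Toy time-conditioned head: P(alive at horizon) from embedding + time.

    The horizon is appended to the image embedding as one extra feature, so
    a single set of weights can be queried at any time point.
    """
    features = np.concatenate([embedding, [horizon_days / time_scale]])
    logit = features @ weights + bias
    return 1.0 / (1.0 + np.exp(-logit))
```

With a negative weight on the time feature, the predicted survival probability falls as the horizon grows, which gives a curve over time from one model instead of a few fixed-interval outputs.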

[163] Energy-Regularized Spatial Masking: A Novel Approach to Enhancing Robustness and Interpretability in Vision Models

Tom Devynck,Bilal Faye,Djamel Bouchaffra,Nadjib Lazaar,Hanane Azzag,Mustapha Lebbah

Main category: cs.CV

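TL;DR: This paper proposes Energy-Regularized Spatial Masking (ERSM), a new framework that formulates feature selection as a differentiable energy minimization problem and achieves input-adaptive sparsity, robustness, and interpretability without loss of accuracy.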

Details Motivation: Existing deep convolutional networks process feature maps densely, which introduces computational redundancy and reliance on spurious background correlations, leaving models brittle and hard to interpret. Method: Proposes the ERSM framework, which embeds a lightweight Energy-Mask Layer; each visual token's energy consists of a unary importance cost and a pairwise spatial coherence penalty, and end-to-end training yields energy-driven, input-adaptive spatial masking. Result: Validation on CNNs shows that ERSM produces emergent sparsity, improves robustness to structured occlusion, and generates highly interpretable spatial masks; the learned energy ranking significantly outperforms magnitude-based pruning in deletion-based robustness tests. Conclusion: ERSM acts as an intrinsic denoising mechanism that localizes semantic object regions without pixel-level supervision, combining efficiency, robustness, and interpretability. Abstract: Deep convolutional neural networks achieve remarkable performance by exhaustively processing dense spatial feature maps, yet this brute-force strategy introduces significant computational redundancy and encourages reliance on spurious background correlations. As a result, modern vision models remain brittle and difficult to interpret. We propose Energy-Regularized Spatial Masking (ERSM), a novel framework that reformulates feature selection as a differentiable energy minimization problem. By embedding a lightweight Energy-Mask Layer inside standard convolutional backbones, each visual token is assigned a scalar energy composed of two competing forces: an intrinsic Unary importance cost and a Pairwise spatial coherence penalty. Unlike prior pruning methods that enforce rigid sparsity budgets or rely on heuristic importance scores, ERSM allows the network to autonomously discover an optimal information-density equilibrium tailored to each input. We validate ERSM on convolutional architectures and demonstrate that it produces emergent sparsity, improved robustness to structured occlusion, and highly interpretable spatial masks, while preserving classification accuracy. Furthermore, we show that the learned energy ranking significantly outperforms magnitude-based pruning in deletion-based robustness tests, revealing ERSM as an intrinsic denoising mechanism that isolates semantic object regions without pixel-level supervision.
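
A unary-plus-pairwise energy of the kind the abstract describes might look like the sketch below. The abstract does not give the exact functional form, so this is an assumed one: `mask_energy`, the per-token `unary_cost` map, the squared-difference coherence term over 4-connected neighbors, and `pairwise_weight` are all hypothetical.

```python
import numpy as np

def mask_energy(mask, unary_cost, pairwise_weight=1.0):
    """Total energy of a soft spatial mask on an H x W token grid.

    unary term: cost of keeping each token, weighted by its mask value.
    pairwise term: squared mask differences between vertical and
    horizontal neighbors, which penalizes spatially incoherent masks.
    """
    unary = np.sum(unary_cost * mask)
    diff_vertical = np.diff(mask, axis=0)
    diff_horizontal = np.diff(mask, axis=1)
    pairwise = np.sum(diff_vertical ** 2) + np.sum(diff_horizontal ** 2)
    return unary + pairwise_weight * pairwise
```

Both terms are differentiable in the mask values, which is what lets an energy like this be minimized by ordinary gradient-based training. A checkerboard mask pays the maximum coherence penalty, while a uniform mask pays none.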

[164] Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Yuheng Shi,Xiaohuan Pei,Linfeng Wen,Minjing Dong,Chang Xu

Main category: cs.CV

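TL;DR: Q-Zoom is a query-aware adaptive high-resolution perception framework that uses a coarse-to-fine mechanism to improve the inference efficiency and accuracy of multimodal large language models (MLLMs) in document understanding and high-resolution scenarios.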

Details Motivation: The prevailing global resolution scaling paradigm feeds large numbers of redundant visual tokens into quadratic-complexity self-attention, which bottlenecks inference throughput and ignores spatial sparsity and query intent. Method: Proposes the Q-Zoom framework: 1) a lightweight Dynamic Gating Network that decides whether high-resolution processing can be skipped; 2) a Self-Distilled Region Proposal Network (SD-RPN) that localizes the task-relevant Region-of-Interest (RoI) from intermediate feature spaces; 3) a consistency-aware generation strategy that provides deterministic routing labels for the gate, while the SD-RPN is trained with fully self-supervised distillation; 4) a continuous spatio-temporal alignment scheme and targeted fine-tuning that fuse the local RoI with the global layout. Result: On Qwen2.5-VL-7B, Q-Zoom speeds up inference by 2.52x on Document & OCR benchmarks and by 4.39x in high-resolution scenarios while matching the baseline's peak accuracy; when configured for maximum perceptual fidelity it exceeds the baseline's peak performance by 1.1% and 8.1%, respectively; the gains transfer to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. Conclusion: Q-Zoom achieves a strong efficiency-accuracy trade-off and a dominant Pareto frontier, offering an efficient, adaptive, and scalable approach to high-resolution MLLM inference. Abstract: MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as a primary testbed, Q-Zoom accelerates inference by 2.52 times on Document & OCR benchmarks and 4.39 times in High-Resolution scenarios while matching the baseline's peak accuracy. 
Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline's peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. Project page is available at https://yuhengsss.github.io/Q-Zoom/.

[165] Multi-modal user interface control detection using cross-attention

Milad Moradi,Ke Yan,David Colwell,Matthias Samwald,Rhona Asgari

Main category: cs.CV

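TL;DR: This paper proposes a multi-modal UI control detection method that combines GPT-generated textual descriptions with YOLOv5, aligns visual and semantic features through cross-attention, and shows clear gains on complex or ambiguous controls on a dataset of over 16,000 UI screenshots.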

Details Motivation: UI control detection is challenged by visual ambiguity, design variability, and the lack of context in pixel-only approaches. Method: Proposes a multi-modal extension of YOLOv5 that brings in GPT-generated textual descriptions of UI images and fuses visual features with text embeddings through cross-attention modules; compares three fusion strategies: element-wise addition, weighted sum, and convolutional fusion. Result: Convolutional fusion performs best, with especially large gains on semantically complex or visually ambiguous control classes; overall the method outperforms the baseline YOLOv5. Conclusion: Fusing visual and textual modalities can substantially improve the robustness and context-awareness of UI control detection, giving a more reliable basis for automated testing, accessibility support, and UI analytics. Abstract: Detecting user interface (UI) controls from software screenshots is a critical task for automated testing, accessibility, and software analytics, yet it remains challenging due to visual ambiguities, design variability, and the lack of contextual cues in pixel-only approaches. In this paper, we introduce a novel multi-modal extension of YOLOv5 that integrates GPT-generated textual descriptions of UI images into the detection pipeline through cross-attention modules. By aligning visual features with semantic information derived from text embeddings, our model enables more robust and context-aware UI control detection. We evaluate the proposed framework on a large dataset of over 16,000 annotated UI screenshots spanning 23 control classes. Extensive experiments compare three fusion strategies, i.e. element-wise addition, weighted sum, and convolutional fusion, demonstrating consistent improvements over the baseline YOLOv5 model. Among these, convolutional fusion achieved the strongest performance, with significant gains in detecting semantically complex or visually ambiguous classes. These results establish that combining visual and textual modalities can substantially enhance UI element detection, particularly in edge cases where visual information alone is insufficient. Our findings open promising opportunities for more reliable and intelligent tools in software testing, accessibility support, and UI analytics, setting the stage for future research on efficient, robust, and generalizable multi-modal detection systems.
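
The three fusion strategies compared in the abstract differ only in how two aligned feature maps are combined. The sketch below assumes the text features have already been projected onto the same (C, H, W) grid as the visual features (for example by the cross-attention step); the function names and the use of a 1x1 convolution for "convolutional fusion" are illustrative assumptions, not details confirmed by the abstract.

```python
import numpy as np

def fuse_add(visual, text):
    """Element-wise addition of two (C, H, W) feature maps."""
    return visual + text

def fuse_weighted(visual, text, alpha=0.5):
    """Weighted sum; alpha would be a learned scalar in practice."""
    return alpha * visual + (1.0 - alpha) * text

def fuse_conv1x1(visual, text, kernel):
    """Concatenate along channels, then mix with a 1x1 convolution.

    kernel has shape (C_out, 2 * C); a 1x1 convolution is a per-pixel
    linear map over channels, written here with einsum.
    """
    stacked = np.concatenate([visual, text], axis=0)  # (2C, H, W)
    return np.einsum("oc,chw->ohw", kernel, stacked)
```

Addition and weighted sum are special cases of the 1x1 convolution with a fixed kernel, which is one plausible reason a learned convolutional fusion can do at least as well as the other two.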

[166] POS-ISP: Pipeline Optimization at the Sequence Level for Task-aware ISP

Jiyun Won,Heemin Yang,Woohyeok Kim,Jungseul Ok,Sunghyun Cho

Main category: cs.CV

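TL;DR: This paper proposes POS-ISP, an image signal processing (ISP) pipeline optimization framework based on sequence-level reinforcement learning. It predicts the full module sequence and its parameters in a single forward pass and optimizes end to end with a terminal task reward, which avoids intermediate supervision and redundant executions, improving performance on several downstream tasks while reducing computational cost.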

Details Motivation: Existing ISP pipeline optimization methods suffer from a training-inference mismatch (NAS) or from unstable training and high computational overhead (step-wise RL), so a more stable and efficient joint optimization scheme is needed. Method: Proposes POS-ISP, which formulates modular ISP optimization as a global sequence prediction problem and uses sequence-level reinforcement learning to predict the entire module sequence and its parameters in a single forward pass, optimized end to end with a terminal task reward. Result: POS-ISP is validated on multiple downstream tasks; compared with NAS and step-wise RL methods, it improves task performance at markedly lower computational cost. Conclusion: Sequence-level optimization is a more stable and efficient paradigm for task-aware ISP optimization. Abstract: Recent work has explored optimizing image signal processing (ISP) pipelines for various tasks by composing predefined modules and adapting them to task-specific objectives. However, jointly optimizing module sequences and parameters remains challenging. Existing approaches rely on neural architecture search (NAS) or step-wise reinforcement learning (RL), but NAS suffers from a training-inference mismatch, while step-wise RL leads to unstable training and high computational overhead due to stage-wise decision-making. We propose POS-ISP, a sequence-level RL framework that formulates modular ISP optimization as a global sequence prediction problem. Our method predicts the entire module sequence and its parameters in a single forward pass and optimizes the pipeline using a terminal task reward, eliminating the need for intermediate supervision and redundant executions. Experiments across multiple downstream tasks show that POS-ISP improves task performance while reducing computational cost, highlighting sequence-level optimization as a stable and efficient paradigm for task-aware ISP. The project page is available at https://w1jyun.github.io/POS-ISP

[167] Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

Jintao Chen,Chengyu Bai,Junjun Hu,Xinda Xue,Mu Xu

Main category: cs.CV

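TL;DR: This paper proposes the Grounded Forcing framework, which uses three mechanisms (Dual Memory KV Cache, Dual-Reference RoPE Injection, and Asymmetric Proximity Recache) to jointly address semantic forgetting, visual drift, and controllability loss in autoregressive video generation, significantly improving long-range consistency and visual stability.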

Details Motivation: In infinite-horizon generation, autoregressive video synthesis faces three intertwined challenges (semantic forgetting, visual drift, and controllability loss); existing methods mostly handle them in isolation and struggle to ensure long-term coherence. Method: Proposes the Grounded Forcing framework, comprising: 1) a Dual Memory KV Cache that decouples local temporal dynamics from global semantic anchors; 2) Dual-Reference RoPE Injection, which confines positional embeddings within the training manifold while keeping global semantics time-invariant; 3) Asymmetric Proximity Recache, which uses proximity-weighted cache updates for smooth semantic inheritance during prompt switches. Result: Experiments show that the method significantly improves long-range consistency and visual stability, laying a solid foundation for interactive long-form video generation. Conclusion: Through the joint design of semantic anchoring and decoupled dynamics, Grounded Forcing overcomes core bottlenecks in autoregressive video generation and advances long, controllable, and stable video synthesis. Abstract: Autoregressive video synthesis offers a promising pathway for infinite-horizon generation but is fundamentally hindered by three intertwined challenges: semantic forgetting from context limitations, visual drift due to positional extrapolation, and controllability loss during interactive instruction switching. Current methods often tackle these issues in isolation, limiting long-term coherence. We introduce Grounded Forcing, a novel framework that bridges time-independent semantics and proximal dynamics through three interlocking mechanisms. First, to address semantic forgetting, we propose a Dual Memory KV Cache that decouples local temporal dynamics from global semantic anchors, ensuring long-term semantic coherence and identity stability. Second, to suppress visual drift, we design Dual-Reference RoPE Injection, which confines positional embeddings within the training manifold while rendering global semantics time-invariant. Third, to resolve controllability issues, we develop Asymmetric Proximity Recache, which facilitates smooth semantic inheritance during prompt transitions via proximity-weighted cache updates. These components operate synergistically to tether the generative process to stable semantic cores while accommodating flexible local dynamics. Extensive experiments demonstrate that Grounded Forcing significantly enhances long-range consistency and visual stability, establishing a robust foundation for interactive long-form video synthesis.
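
One way to picture a cache that keeps global anchors separate from local dynamics is a fixed anchor set plus a sliding window. The sketch below is only a mental model: the class `DualMemoryCache`, the choice of the earliest entries as anchors, and the fixed window size are assumptions made for illustration, since the abstract does not say how anchors are selected or stored.

```python
from collections import deque

class DualMemoryCache:
    """Toy cache with a permanent anchor memory and a sliding recent memory."""

    def __init__(self, num_anchor, window):
        self.num_anchor = num_anchor
        self.anchors = []                   # never evicted (global semantics)
        self.recent = deque(maxlen=window)  # oldest entries fall out (local dynamics)

    def append(self, kv):
        # Assumption: the first few entries serve as anchors.
        if len(self.anchors) < self.num_anchor:
            self.anchors.append(kv)
        else:
            self.recent.append(kv)

    def context(self):
        """Entries visible to attention: anchors first, then the recent window."""
        return self.anchors + list(self.recent)
```

The context length stays bounded at `num_anchor + window` however long generation runs, while the anchors remain available at every step.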

[168] NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration: Methods and Results

Wenbin Zou,Tianyi Li,Kejun Wu,Huiping Zhuang,Zongwei Wu,Zhuyun Zhou,Radu Timofte,Kim-Hui Yap,Lap-Pui Chau,Yi Wang,Shiqi Zhou,Xiaodi Shi,Yuxiang Chen,Yilian Zhong,Shibo Yin,Yushun Fang,Xilei Zhu,Yahui Wang,Chen Lu,Zhitao Wang,Lifa Ha,Hengyu Man,Xiaopeng Fan,Priyansh Singh,Sidharth,Krrish Dev,Soham Kakkar,Vinit Jakhetiya,Ovais Iqbal Shah,Wei Zhou,Linfeng Li,Qi Xu,Zhenyang Liu,Kepeng Xu,Tong Qiao,Jiachen Tu,Guoyi Xu,Yaoxin Jiang,Jiajia Liu,Yaokun Shi

Main category: cs.CV

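TL;DR: This paper presents the NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration (BSCVR), which aims to advance research on recovering visually coherent videos from corrupted bitstreams whose decoding produces severe spatio-temporal artifacts and content distortion.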

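Details Motivation: To advance video restoration research under realistic bitstream corruption and to fill the gap left by the lack of a common evaluation benchmark for this emerging task. Method: Builds a public benchmark for bitstream-corrupted video restoration, including a dedicated dataset and a standardized evaluation protocol, and organizes a competition among participating teams with a technical analysis of their methods. Result: The challenge shows how difficult the task is and summarizes the main current technical trends, providing empirical evidence and direction for future work on robust video restoration. Conclusion: The BSCVR challenge established the first evaluation platform for video restoration under practical bitstream corruption, highlighted the difficulty of the task, and promoted technical progress and community consensus in this direction. Abstract: This paper reports on the NTIRE 2026 Challenge on Bitstream-Corrupted Video Restoration (BSCVR). The challenge aims to advance research on recovering visually coherent videos from corrupted bitstreams, whose decoding often produces severe spatial-temporal artifacts and content distortion. Built upon recent progress in bitstream-corrupted video recovery, the challenge provides a common benchmark for evaluating restoration methods under realistic corruption settings. We describe the dataset, evaluation protocol, and participating methods, and summarize the final results and main technical trends. The challenge highlights the difficulty of this emerging task and provides useful insights for future research on robust video restoration under practical bitstream corruption.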

[169] Making MLLMs Blind: Adversarial Smuggling Attacks in MLLM Content Moderation

Zhiheng Li,Zongyang Ma,Yuntong Pan,Ziqi Zhang,Xiaolei Lv,Bo Li,Jun Gao,Jianing Zhang,Chunfeng Yuan,Bing Li,Weiming Hu

Main category: cs.CV

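TL;DR: This paper introduces a new class of adversarial attack, Adversarial Smuggling Attacks, which exploit the gap between human and AI visual perception and semantic understanding to encode harmful content in visual forms that humans read easily but AI struggles to recognize, thereby bypassing MLLM-based content moderation. The authors build the first benchmark, SmuggleBench, show that mainstream models (e.g., GPT-5, Qwen3-VL) are highly vulnerable (ASR > 90%), analyze three root causes, and take a first look at mitigations such as CoT reasoning and adversarial fine-tuning.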

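Details Motivation: MLLM-based content moderation faces a new, covert threat: conventional adversarial attacks (perturbations, jailbreaks) target misclassification or harmful output generation rather than evasion of joint vision-language moderation, whereas harmful visual content that humans can read but AI cannot poses a practical risk. Method: Introduces the concept of adversarial smuggling attacks and divides them into two pathways, Perceptual Blindness (disrupting text recognition) and Reasoning Blockade (inhibiting semantic understanding); builds the SmuggleBench benchmark with 1,700 instances; measures Attack Success Rate (ASR) across multiple models; attributes the vulnerability from the perspectives of perception and reasoning; tests test-time chain-of-thought (CoT) reasoning and adversarial training via supervised fine-tuning (SFT) as mitigations. Result: Mainstream MLLMs (GPT-5, Qwen3-VL, and others) exceed 90% ASR on SmuggleBench; three root causes are identified: limited vision encoder capability, insufficient OCR robustness, and scarcity of domain-specific adversarial examples; CoT and SFT show some initial mitigation but do not fully solve the problem. Conclusion: Adversarial smuggling is a real threat to MLLM content moderation that deserves attention and requires coordinated defenses across vision encoding, OCR robustness, and adversarial data construction; current models are systematically vulnerable. Abstract: Multimodal Large Language Models (MLLMs) are increasingly being deployed as automated content moderators. Within this landscape, we uncover a critical threat: Adversarial Smuggling Attacks. Unlike adversarial perturbations (for misclassification) and adversarial jailbreaks (for harmful output generation), adversarial smuggling exploits the Human-AI capability gap. It encodes harmful content into human-readable visual formats that remain AI-unreadable, thereby evading automated detection and enabling the dissemination of harmful content. We classify smuggling attacks into two pathways: (1) Perceptual Blindness, disrupting text recognition; and (2) Reasoning Blockade, inhibiting semantic understanding despite successful text recognition. To evaluate this threat, we constructed SmuggleBench, the first comprehensive benchmark comprising 1,700 adversarial smuggling attack instances. Evaluations on SmuggleBench reveal that both proprietary (e.g., GPT-5) and open-source (e.g., Qwen3-VL) state-of-the-art models are vulnerable to this threat, producing Attack Success Rates (ASR) exceeding 90%. By analyzing the vulnerability through the lenses of perception and reasoning, we identify three root causes: the limited capabilities of vision encoders, the robustness gap in OCR, and the scarcity of domain-specific adversarial examples. We conduct a preliminary exploration of mitigation strategies, investigating the potential of test-time scaling (via CoT) and adversarial training (via SFT) to mitigate this threat. 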
Our code is publicly available at https://github.com/zhihengli-casia/smugglebench.

[170] Compression as an Adversarial Amplifier Through Decision Space Reduction

Lewis Evans,Harkrishan Jandu,Zihan Ye,Yang Lu,Shreyank N Gowda

Main category: cs.CV

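TL;DR: This paper studies how image compression affects the adversarial robustness of deep image classifiers and finds that compression can act as an adversarial amplifier, making compressed-domain attacks more effective than pixel-domain attacks.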

Details Motivation: Image compression is ubiquitous in modern visual pipelines, yet its impact on adversarial robustness remains poorly understood. Method: Studies a previously unexplored adversarial setting in which attacks are applied directly to compressed representations, and analyzes how compression, through decision space reduction (a non-invertible, information-losing transformation), contracts classification margins and increases sensitivity to perturbations. Result: Under the same nominal perturbation budget, compression-aware attacks are substantially more effective than pixel-space attacks; extensive experiments support the analysis and reveal a critical vulnerability in compression-in-the-loop deployments. Conclusion: Image compression does not improve robustness and may instead worsen adversarial vulnerability, so the compression stage needs careful treatment in real deployments. Abstract: Image compression is a ubiquitous component of modern visual pipelines, routinely applied by social media platforms and resource-constrained systems prior to inference. Despite its prevalence, the impact of compression on adversarial robustness remains poorly understood. We study a previously unexplored adversarial setting in which attacks are applied directly in compressed representations, and show that compression can act as an adversarial amplifier for deep image classifiers. Under identical nominal perturbation budgets, compression-aware attacks are substantially more effective than their pixel-space counterparts. We attribute this effect to decision space reduction, whereby compression induces a non-invertible, information-losing transformation that contracts classification margins and increases sensitivity to perturbations. Extensive experiments across standard benchmarks and architectures support our analysis and reveal a critical vulnerability in compression-in-the-loop deployment settings. Code will be released.

[171] Auditing Demographic Bias in Facial Landmark Detection for Fair Human-Robot Interaction

Pablo Parte,Roberto Valle,José M. Buenaposada,Luis Baumela

Main category: cs.CV

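TL;DR: This paper systematically audits age, gender, and race bias in facial landmark detection and proposes a controlled statistical methodology that separates demographic effects from confounding visual factors such as head pose and image resolution. After controlling for the confounders, gender and race disparities vanish, but an age effect remains significant, with older individuals affected most.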

Details Motivation: Fair human-robot interaction depends on reliable perception models, yet potential demographic bias in facial landmark detection, a low-level vision task, has not been studied systematically. Method: Proposes a controlled statistical methodology that disentangles demographic attributes (age, gender, race) from confounding visual factors (such as head pose and image resolution), and applies it to a standard representative model. Result: Confounding visual factors, especially head pose and image resolution, affect performance far more than demographic attributes; after controlling for them, gender and race disparities vanish, but the age effect remains significant, with worse performance for older individuals. Conclusion: Fairness issues can arise even in low-level vision components and can propagate through the HRI pipeline, affecting vulnerable populations; auditing and correcting such biases is essential for trustworthy and equitable robot perception systems. Abstract: Fairness in human-robot interaction critically depends on the reliability of the perceptual models that enable robots to interpret human behavior. While demographic biases have been widely studied in high-level facial analysis tasks, their presence in facial landmark detection remains unexplored. In this paper, we conduct a systematic audit of demographic bias in this task, analyzing the age, gender and race biases. To this end we introduce a controlled statistical methodology to disentangle demographic effects from confounding visual factors. Evaluations of a standard representative model demonstrate that confounding visual factors, particularly head pose and image resolution, heavily outweigh the impact of demographic attributes. Notably, after accounting for these confounders, we show that performance disparities across gender and race vanish. However, we identify a statistically significant age-related effect, with higher biases observed for older individuals. This shows that fairness issues can emerge even in low-level vision components and can propagate through the HRI pipeline, disproportionately affecting vulnerable populations. We argue that auditing and correcting such biases is a necessary step toward trustworthy and equitable robot perception systems.

[172] MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

Xiaoxiao Ma,Jiachen Lei,Tianfei Ren,Jie Huang,Siming Fu,Aiming Hao,Jiahong Wu,Xiangxiang Chu,Feng Zhao

Main category: cs.CV

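TL;DR: This paper proposes a stabilized reinforcement learning (RL) framework for masked autoregressive (MAR) models. It uses multi-trajectory expectation (MTE) to reduce the gradient noise introduced by the diffusion head, combined with uncertainty-aware token selection and a consistency-aware strategy, and significantly improves training stability, image quality, and spatial structure understanding.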

Details Motivation: Existing RL methods struggle in hybrid AR-diffusion frameworks because of interleaved inference and noisy log-probability estimation; in MAR models in particular, the diffusion head tends to cause unstable gradients and early performance saturation. Method: Proposes multi-trajectory expectation (MTE), which averages over multiple diffusion trajectories to reduce gradient noise; estimates token-level uncertainty from the multiple trajectories and applies MTE only to the top-k% most uncertain tokens; introduces a consistency-aware token selection strategy that filters out AR tokens poorly aligned with the final generated content. Result: Clearly outperforms GRPO and pre-RL baselines on multiple benchmarks, improving visual quality, training stability, and spatial structure understanding. Conclusion: The stabilized RL framework mitigates the training instability caused by the diffusion head in MAR models and offers a feasible, efficient approach to RL training of hybrid AR-diffusion models. Abstract: Reinforcement learning (RL) has been successfully applied to autoregressive (AR) and diffusion models. However, extending RL to hybrid AR-diffusion frameworks remains challenging due to interleaved inference and noisy log-probability estimation. In this work, we study masked autoregressive models (MAR) and show that the diffusion head plays a critical role in training dynamics, often introducing noisy gradients that lead to instability and early performance saturation. To address this issue, we propose a stabilized RL framework for MAR. We introduce multi-trajectory expectation (MTE), which estimates the optimization direction by averaging over multiple diffusion trajectories, thereby reducing diffusion-induced gradient noise. To avoid over-smoothing, we further estimate token-wise uncertainty from multiple trajectories and apply multi-trajectory optimization only to the top-k% uncertain tokens. In addition, we introduce a consistency-aware token selection strategy that filters out AR tokens that are less aligned with the final generated content. Extensive experiments across multiple benchmarks demonstrate that our method consistently improves visual quality, training stability, and spatial structure understanding over baseline GRPO and pre-RL models. Code is available at: https://github.com/AMAP-ML/mar-grpo.
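
The selective averaging described above can be sketched on a matrix of per-token log-probability estimates. This is a simplified reading of the abstract, not the released code: the function name `multi_trajectory_estimate`, the use of the standard deviation across trajectories as the uncertainty measure, and the fallback to the first trajectory for confident tokens are all assumptions.

```python
import numpy as np

def multi_trajectory_estimate(logp, top_k_percent=25.0):
    """Average across trajectories only for the most uncertain tokens.

    logp: array of shape (num_trajectories, num_tokens), one log-probability
    estimate per diffusion trajectory and token.
    Returns (estimate per token, boolean mask of tokens that used the mean).
    """
    num_tokens = logp.shape[1]
    single = logp[0]                 # single-trajectory estimate
    mean = logp.mean(axis=0)         # multi-trajectory expectation
    uncertainty = logp.std(axis=0)   # spread across trajectories

    k = max(1, int(np.ceil(num_tokens * top_k_percent / 100.0)))
    most_uncertain = np.argsort(-uncertainty)[:k]
    use_mean = np.zeros(num_tokens, dtype=bool)
    use_mean[most_uncertain] = True

    return np.where(use_mean, mean, single), use_mean
```

Restricting the average to the noisiest tokens is what the abstract motivates as a guard against over-smoothing: tokens whose estimates already agree across trajectories are left as they are.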

[173] CAAP: Capture-Aware Adversarial Patch Attacks on Palmprint Recognition Models

Renyang Liu,Jiale Li,Jie Zhang,Cong Wu,Xiaojun Jia,Shuxin Li,Wei Zhou,Kwok-Yan Lam,See-kiong Ng

Main category: cs.CV

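TL;DR: This paper proposes CAAP, a physically realizable adversarial patch attack framework for palmprint recognition. With a cross-shaped patch structure and capture-aware modules, it achieves highly transferable, universal attacks under realistic acquisition conditions and exposes the significant vulnerability of current deep palmprint recognition systems in the physical world.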

Details Motivation: Existing adversarial attack studies on palmprint recognition are largely confined to the digital domain and do not adequately account for the texture-dominant nature of palmprints or the distortions introduced by physical acquisition, leaving robustness to real physical attacks poorly understood. Method: Proposes the CAAP framework with three modules: ASIT (input-conditioned patch rendering), RaS (stochastic capture-aware simulation), and MS-DIFE (feature-level identity-disruptive guidance); adopts a cross-shaped patch topology to enlarge spatial coverage and disrupt long-range texture continuity; learns a universal adversarial patch that can be reused across inputs. Result: On the Tongji, IITD, and AISEC datasets, CAAP shows strong untargeted and targeted attack performance against both generic CNNs and palmprint-specific models, with favorable cross-model and cross-dataset transferability; adversarial training only partially mitigates the attack, and substantial residual vulnerability remains. Conclusion: Deep palmprint recognition systems remain highly vulnerable to physically realizable, capture-aware adversarial patch attacks, and more effective practical defenses are needed. Abstract: Palmprint recognition is deployed in security-critical applications, including access control and palm-based payment, due to its contactless acquisition and highly discriminative ridge-and-crease textures. However, the robustness of deep palmprint recognition systems against physically realizable attacks remains insufficiently understood. Existing studies are largely confined to the digital setting and do not adequately account for the texture-dominant nature of palmprint recognition or the distortions introduced during physical acquisition. To address this gap, we propose CAAP, a capture-aware adversarial patch framework for palmprint recognition. CAAP learns a universal patch that can be reused across inputs while remaining effective under realistic acquisition variation. To match the structural characteristics of palmprints, the framework adopts a cross-shaped patch topology, which enlarges spatial coverage under a fixed pixel budget and more effectively disrupts long-range texture continuity. CAAP further integrates three modules: ASIT for input-conditioned patch rendering, RaS for stochastic capture-aware simulation, and MS-DIFE for feature-level identity-disruptive guidance. We evaluate CAAP on the Tongji, IITD, and AISEC datasets against generic CNN backbones and palmprint-specific recognition models. Experiments show that CAAP achieves strong untargeted and targeted attack performance with favorable cross-model and cross-dataset transferability. 
The results further show that, although adversarial training can partially reduce the attack success rate, substantial residual vulnerability remains. These findings indicate that deep palmprint recognition systems remain vulnerable to physically realizable, capture-aware adversarial patch attacks, underscoring the need for more effective defenses in practice. Code available at https://github.com/ryliu68/CAAP.

[174] Canopy Tree Height Estimation Using Quantile Regression: Modeling and Evaluating Uncertainty in Remote Sensing

Karsten Schrödter,Jan Pauls,Fabian Gieseke

Main category: cs.CV

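TL;DR: This paper applies quantile regression to satellite-based tree height estimation models so that they provide statistically calibrated uncertainty estimates, which makes them more suitable for risk-sensitive scenarios.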

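Details Motivation: Most existing tree height estimation methods rely on point predictions and cannot quantify uncertainty, which limits their use in risk-sensitive scenarios. Method: Makes minor modifications to the prediction head of existing models and introduces quantile regression to obtain uncertainty quantification. Result: The models output statistically calibrated uncertainty estimates, and the level of uncertainty correlates with known challenges in remote sensing (e.g., terrain complexity, vegetation heterogeneity), indicating lower confidence under more difficult conditions. Conclusion: Quantile regression is an effective way to add uncertainty modeling to tree height estimation models and improves their reliability in applications such as ecological monitoring and biomass assessment. Abstract: Accurate tree height estimation is vital for ecological monitoring and biomass assessment. We apply quantile regression to existing tree height estimation models based on satellite data to incorporate uncertainty quantification. Most current approaches for tree height estimation rely on point predictions, which limits their applicability in risk-sensitive scenarios. In this work, we show that, with minor modifications of a given prediction head, existing models can be adapted to provide statistically calibrated uncertainty estimates via quantile regression. Furthermore, we demonstrate how our results correlate with known challenges in remote sensing (e.g., terrain complexity, vegetation heterogeneity), indicating that the model is less confident in more challenging conditions.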
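
Quantile regression is usually trained with the pinball (quantile) loss, and calibration is usually checked by counting how often the truth falls inside a predicted interval. The sketch below shows both in their standard textbook form; the function names are illustrative, and whether the paper uses exactly this loss and check is an assumption, since the abstract only names quantile regression.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Pinball loss for quantile level q in (0, 1).

    Under-prediction is weighted by q and over-prediction by (1 - q), so
    the minimizer is the q-th conditional quantile.
    """
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

def empirical_coverage(y_true, lower, upper):
    """Fraction of targets inside [lower, upper]; compare to the nominal level."""
    y_true = np.asarray(y_true, dtype=float)
    inside = (y_true >= np.asarray(lower)) & (y_true <= np.asarray(upper))
    return float(np.mean(inside))
```

For a head that predicts the 5% and 95% quantiles, a calibrated model should give an empirical coverage close to 0.90 on held-out data, and wider intervals mark the pixels where the model is less confident.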

[175] Generative Phomosaic with Structure-Aligned and Personalized Diffusion

Jaeyoung Chung,Hyunjin Son,Kyoung Mu Lee

Main category: cs.CV

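TL;DR: This paper proposes a diffusion-based generative approach to photomosaic creation that replaces traditional tile matching with conditional generation, improving diversity and structural consistency and supporting few-shot personalization.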

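Details Motivation: Traditional photomosaics rely on a large number of tile images and color matching, which leads to low diversity and structural inconsistency; a method that combines semantic expressiveness with structural coherence is needed. Method: Proposes a diffusion-based generative framework that uses a low-frequency conditioned diffusion mechanism to align global structure while preserving prompt-driven details, combined with few-shot personalized diffusion for user-specific tile generation. Result: Produces semantically rich, structurally coherent generative photomosaics and supports stylistically consistent or user-specific mosaic synthesis without an extensive image collection. Conclusion: The generative formulation overcomes the fundamental limitations of matching-based methods and offers a more flexible, controllable, and higher-quality way to create photomosaics. Abstract: We present the first generative approach to photomosaic creation. Traditional photomosaic methods rely on a large number of tile images and color-based matching, which limits both diversity and structural consistency. Our generative photomosaic framework synthesizes tile images using diffusion-based generation conditioned on reference images. A low-frequency conditioned diffusion mechanism aligns global structure while preserving prompt-driven details. This generative formulation enables photomosaic composition that is both semantically expressive and structurally coherent, effectively overcoming the fundamental limitations of matching-based approaches. By leveraging few-shot personalized diffusion, our model is able to produce user-specific or stylistically consistent tiles without requiring an extensive collection of images.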

[176] IQ-LUT: interpolated and quantized LUT for efficient image super-resolution

Yuxuan Zhang,Zhikai Dong,Xinning Chai,Xiangyun Zhou,Yi Xu,Zhengxue Cheng,Li Song

Main category: cs.CV

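TL;DR: This paper proposes IQ-LUT, which sharply reduces lookup table (LUT) storage while improving image super-resolution quality. It integrates interpolation and quantization into a single-input, multiple-output ECNN, adds residual learning to reduce dependence on quantization bit-depth, and uses knowledge distillation to guide non-uniform quantization and optimize the quantization levels.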

Details Motivation: When conventional LUT methods pursue larger receptive fields and higher bit-depth for better super-resolution quality, the index space grows exponentially and creates a storage bottleneck that hinders deployment on resource-constrained devices. Method: 1) Integrates interpolation and quantization into the single-input, multiple-output ECNN to shrink the index space; 2) introduces residual learning to reduce dependence on LUT bit-depth, which stabilizes training and improves detail reconstruction; 3) uses knowledge distillation to guide non-uniform quantization, optimizing the quantization levels to cut storage and compensate for quantization loss. Result: On benchmarks, LUT storage cost is reduced by up to 50x compared with ECNN while super-resolution quality improves. Conclusion: IQ-LUT balances storage efficiency and reconstruction quality and offers a new approach to efficient super-resolution inference in resource-constrained settings. Abstract: Lookup table (LUT) methods demonstrate considerable potential in accelerating image super-resolution inference. However, pursuing higher image quality through larger receptive fields and bit-depth triggers exponential growth in the LUT's index space, creating a storage bottleneck that limits deployment on resource-constrained devices. We introduce IQ-LUT, which achieves a reduction in LUT size while simultaneously enhancing super-resolution quality. First, we integrate interpolation and quantization into the single-input, multiple-output ECNN, which dramatically reduces the index space and thereby the overall LUT size. Second, the integration of residual learning mitigates the dependence on LUT bit-depth, which facilitates training stability and prioritizes the reconstruction of fine-grained details for superior visual quality. Finally, guided by knowledge distillation, our non-uniform quantization process optimizes the quantization levels, thereby reducing storage while also compensating for quantization loss. Extensive benchmarking demonstrates our approach substantially reduces storage costs (by up to 50x compared to ECNN) while achieving superior super-resolution quality.
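
The storage argument comes down to simple arithmetic: a table indexed by every combination of input pixel values has (2^bit_depth)^receptive_field entries. The sketch below shows that count, plus a generic one-dimensional example of storing only a few quantized levels and interpolating between them at lookup time. Both functions are illustrative assumptions (uniform levels, linear interpolation, one input) and do not reproduce the paper's ECNN-integrated, non-uniform scheme.

```python
def lut_entries(bit_depth, receptive_field):
    """Number of index combinations for a LUT over `receptive_field` pixels."""
    return (2 ** bit_depth) ** receptive_field

def interpolated_lookup(lut, x, bit_depth=8):
    """Look up x in a LUT sampled at len(lut) evenly spaced levels.

    x lies in [0, 2**bit_depth - 1]; values between stored levels are
    linearly interpolated from the two nearest entries.
    """
    pos = x / (2 ** bit_depth - 1) * (len(lut) - 1)
    lo = int(pos)                       # floor, since pos >= 0
    hi = min(lo + 1, len(lut) - 1)
    frac = pos - lo
    return (1.0 - frac) * lut[lo] + frac * lut[hi]
```

Under these assumptions, four 8-bit inputs need 2^32 entries, while quantizing each input to 4 bits needs 65,536, and interpolation recovers intermediate values that the coarser table no longer stores explicitly.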

[177] Synthetic Dataset Generation for Partially Observed Indoor Objects

Jelle Vermandere,Maarten Bassier,Maarten Vergauwen

Main category: cs.CV

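TL;DR: This paper proposes a Unity-based virtual scanning framework for generating realistic synthetic 3D scan datasets (V-Scan), addressing the high cost and time needed to acquire real-world paired scans with complete ground truth.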

Details Motivation: Learning-based 3D scene reconstruction and object completion methods depend on large amounts of partial scans paired with complete ground-truth geometry, but acquiring such data with real scanners (especially accurate ground truth for occluded regions) is costly and time-consuming. Method: A virtual scanning system is built in Unity that simulates real scanner behaviour (resolution, measurement range, distance-dependent noise); ray-based scanning models sensor visibility and occlusion; panoramic images colour the point clouds; and a procedural indoor scene generation pipeline automatically builds diverse layouts. Result: The V-Scan dataset is constructed, containing synthetic indoor scans, object-level partial point clouds, voxel-based occlusion grids, and complete ground-truth geometry, for supervised training and evaluation of learning-based reconstruction and completion methods. Conclusion: The framework efficiently generates high-quality, diverse synthetic scan data, providing reliable and scalable data support for 3D learning tasks. Abstract: Learning-based methods for 3D scene reconstruction and object completion require large datasets containing partial scans paired with complete ground-truth geometry. However, acquiring such datasets using real-world scanning systems is costly and time-consuming, particularly when accurate ground truth for occluded regions is required. In this work, we present a virtual scanning framework implemented in Unity for generating realistic synthetic 3D scan datasets. The proposed system simulates the behaviour of real-world scanners using configurable parameters such as scan resolution, measurement range, and distance-dependent noise. Instead of directly sampling mesh surfaces, the framework performs ray-based scanning from virtual viewpoints, enabling realistic modelling of sensor visibility and occlusion effects. In addition, panoramic images captured at the scanner location are used to assign colours to the resulting point clouds. To support scalable dataset creation, the scanner is integrated with a procedural indoor scene generation pipeline that automatically produces diverse room layouts and furniture arrangements. Using this system, we introduce the V-Scan dataset, which contains synthetic indoor scans together with object-level partial point clouds, voxel-based occlusion grids, and complete ground-truth geometry. The resulting dataset provides valuable supervision for training and evaluating learning-based methods for scene reconstruction and object completion.
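
To make the ray-based scanning idea concrete, here is a minimal sketch. It is not the Unity implementation: the scene is reduced to infinite planes `z = depth` in front of a scanner at the origin, and `scan_planes`, `max_range`, and `noise_per_m` are hypothetical names. It shows the three scanner properties the abstract lists: occlusion (only the nearest hit is kept), limited measurement range, and noise that grows with distance.

```python
import numpy as np

def scan_planes(directions, plane_depths, max_range, noise_per_m=0.0, rng=None):
    """Cast rays from the origin against planes z = depth; keep the nearest hit per ray."""
    if rng is None:
        rng = np.random.default_rng(0)
    directions = np.asarray(directions, dtype=np.float64)
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    points = []
    for ray in unit:
        if ray[2] <= 1e-9:
            continue  # ray points away from (or parallel to) every plane
        # The nearest surface along the ray occludes everything behind it.
        t = min(depth / ray[2] for depth in plane_depths)
        if t > max_range:
            continue  # beyond the simulated measurement range
        # Range noise with standard deviation proportional to distance (assumed model).
        t_noisy = t + rng.normal(0.0, noise_per_m * t)
        points.append(ray * t_noisy)
    return np.array(points)

rays = [[0, 0, 1], [0, 0, -1], [1, 0, 0.01]]
print(scan_planes(rays, plane_depths=[2.0, 5.0], max_range=10.0))
```

Sampling a mesh surface directly would also return points on the plane at depth 5, which a real scanner at this position could never see; casting rays and keeping the first hit is what produces realistic partial scans.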

[178] ModuSeg: Decoupling Object Discovery and Semantic Retrieval for Training-Free Weakly Supervised Segmentation

Qingze He,Fagui Liu,Dengke Zhang,Qingmao Wei,Quan Tang

Main category: cs.CV

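TL;DR: This paper proposes ModuSeg, a training-free weakly supervised semantic segmentation framework that explicitly decouples object discovery from semantic assignment. It combines a general mask proposer with an offline feature bank built from semantic foundation models to perform segmentation as non-parametric feature retrieval, and introduces semantic boundary purification and soft-masked feature aggregation to improve prototype quality.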

Details Motivation: Existing weakly supervised semantic segmentation methods often couple semantic recognition with object localization, so models attend only to sparse discriminative regions; although foundation models are promising, most approaches still follow a tightly coupled optimization paradigm, struggle to mitigate pseudo-label noise, and rely on time-consuming multi-stage retraining or unstable end-to-end joint optimization. Method: The ModuSeg framework 1) uses a general mask proposer to extract geometric proposals with reliable boundaries; 2) builds an offline feature bank with semantic foundation models, turning segmentation into non-parametric feature retrieval; 3) introduces semantic boundary purification and soft-masked feature aggregation to mitigate boundary ambiguity and quantization errors and extract high-quality category prototypes. Result: Highly competitive performance on standard benchmark datasets, with better preservation of fine boundaries and no parameter fine-tuning. Conclusion: A training-free paradigm that explicitly decouples object discovery from semantic assignment is an effective new path for weakly supervised semantic segmentation, and ModuSeg demonstrates its advantages in boundary fidelity and performance. Abstract: Weakly supervised semantic segmentation aims to achieve pixel-level predictions using image-level labels. Existing methods typically entangle semantic recognition and object localization, which often leads models to focus exclusively on sparse discriminative regions. Although foundation models show immense potential, many approaches still follow the tightly coupled optimization paradigm, struggling to effectively alleviate pseudo-label noise and often relying on time-consuming multi-stage retraining or unstable end-to-end joint optimization. To address the above challenges, we present ModuSeg, a training-free weakly supervised semantic segmentation framework centered on explicitly decoupling object discovery and semantic assignment. Specifically, we integrate a general mask proposer to extract geometric proposals with reliable boundaries, while leveraging semantic foundation models to construct an offline feature bank, transforming segmentation into a non-parametric feature retrieval process. Furthermore, we propose semantic boundary purification and soft-masked feature aggregation strategies to effectively mitigate boundary ambiguity and quantization errors, thereby extracting high-quality category prototypes. Extensive experiments demonstrate that the proposed decoupled architecture better preserves fine boundaries without parameter fine-tuning and achieves highly competitive performance on standard benchmark datasets. Code is available at https://github.com/Autumnair007/ModuSeg.
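
The retrieval step described above can be sketched roughly as follows. This is an assumption-laden illustration rather than the ModuSeg code: `soft_masked_pool` and `retrieve_label` are hypothetical names, the feature bank is a plain array of class prototypes, and cosine similarity is assumed as the matching score. A mask proposal is pooled into one feature vector with soft weights, then labelled by its nearest prototype, with no trainable parameters involved.

```python
import numpy as np

def soft_masked_pool(features: np.ndarray, soft_mask: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Weighted average of (H, W, C) features using an (H, W) soft mask in [0, 1]."""
    weights = soft_mask[..., None]
    return (features * weights).sum(axis=(0, 1)) / (soft_mask.sum() + eps)

def retrieve_label(query: np.ndarray, bank: np.ndarray, labels: list) -> str:
    """Return the label of the bank prototype with the highest cosine similarity."""
    q = query / (np.linalg.norm(query) + 1e-8)
    b = bank / (np.linalg.norm(bank, axis=1, keepdims=True) + 1e-8)
    return labels[int(np.argmax(b @ q))]

# Toy 2x2 feature map with 2 channels; the mask mostly covers the bottom-right pixel.
features = np.zeros((2, 2, 2))
features[0, 0] = [1.0, 0.0]
features[1, 1] = [0.0, 1.0]
mask = np.array([[0.1, 0.0], [0.0, 0.9]])

pooled = soft_masked_pool(features, mask)
print(retrieve_label(pooled, bank=np.eye(2), labels=["cat", "dog"]))
```

Soft weights let uncertain boundary pixels contribute less to the pooled vector than a hard binary mask would allow, which is one plausible reading of how soft-masked aggregation reduces boundary ambiguity.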

[179] Not all tokens contribute equally to diffusion learning

Guoqing Zhang,Lu Shi,Wanru Xu,Linna Zhang,Sen Wang,Fangfang Wang,Yigang Cen

Main category: cs.CV

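TL;DR: This paper proposes the DARE framework, which uses distribution-aware rectification and spatial ensembling to address conditional diffusion models neglecting semantically important tokens in text-to-video generation. Its two core components, DR-CFG (Distribution-Rectified Classifier-Free Guidance) and SRA (Spatial Representation Alignment), significantly improve generation fidelity and semantic alignment.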

Details Motivation: Existing conditional diffusion models tend to neglect semantically important tokens at inference time, leading to biased or incomplete generations. The causes are distributional bias from the long-tailed token frequency in training data, and spatial misalignment in cross-attention, where semantically important tokens are overshadowed by less informative ones. Method: The unified DARE framework comprises 1) Distribution-Rectified Classifier-Free Guidance (DR-CFG), which dynamically suppresses dominant tokens with low semantic density to balance the conditional distribution; and 2) Spatial Representation Alignment (SRA), which adaptively reweights cross-attention maps by token importance and enforces representation consistency, strengthening the spatial guidance of tokens with high semantic density. Result: Experiments on multiple benchmark datasets show that DARE consistently improves generation fidelity and semantic alignment, significantly outperforming existing methods. Conclusion: DARE strengthens semantic guidance in diffusion models through distributional debiasing and spatial consistency, mitigating generation defects caused by uneven token distributions and imbalanced attention allocation. Abstract: With the rapid development of conditional diffusion models, significant progress has been made in text-to-video generation. However, we observe that these models often neglect semantically important tokens during inference, leading to biased or incomplete generations under classifier-free guidance. We attribute this issue to two key factors: distributional bias caused by the long-tailed token frequency in training data, and spatial misalignment in cross-attention where semantically important tokens are overshadowed by less informative ones. To address these issues, we propose Distribution-Aware Rectification and Spatial Ensemble (DARE), a unified framework that improves semantic guidance in diffusion models from the perspectives of distributional debiasing and spatial consistency. First, we introduce Distribution-Rectified Classifier-Free Guidance (DR-CFG), which regularizes the training process by dynamically suppressing dominant tokens with low semantic density, encouraging the model to better capture underrepresented semantic cues and learn a more balanced conditional distribution. This design mitigates the risk of the model distribution overfitting to tokens with low semantic density. Second, we propose Spatial Representation Alignment (SRA), which adaptively reweights cross-attention maps according to token importance and enforces representation consistency, enabling semantically important tokens to exert stronger spatial guidance during generation. This mechanism effectively prevents low semantic-density tokens from dominating the attention allocation, thereby avoiding the dilution of the spatial and distributional guidance provided by high semantic-density tokens. Extensive experiments on multiple benchmark datasets demonstrate that DARE consistently improves generation fidelity and semantic alignment, achieving significant gains over existing approaches.
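
For context, the standard classifier-free guidance rule that DR-CFG builds on can be written in a few lines. The sketch below shows only the well-known baseline combination of conditional and unconditional noise predictions; the rectification that DR-CFG adds is not specified in the abstract and is not reproduced here. The function name `cfg` and the toy inputs are assumptions.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, scale: float) -> np.ndarray:
    """Standard classifier-free guidance: move from the unconditional prediction
    toward the conditional one, extrapolating when scale > 1."""
    eps_uncond = np.asarray(eps_uncond, dtype=np.float64)
    eps_cond = np.asarray(eps_cond, dtype=np.float64)
    return eps_uncond + scale * (eps_cond - eps_uncond)

# scale = 0 ignores the prompt, scale = 1 is plain conditional sampling,
# and larger scales amplify whatever the condition changes.
print(cfg([0.0, 2.0], [1.0, 2.0], scale=7.5))
```

Because the guidance term amplifies the whole difference between the two predictions, any component of the condition that dominates that difference is amplified too, which is consistent with the abstract's concern that frequent, low-information tokens can crowd out rarer but semantically important ones.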

[180] PRISM: Rethinking Scattered Atmosphere Reconstruction as a Unified Understanding and Generation Model for Real-world Dehazing

Chengyu Fang,Chunming He,Yuelin Zhang,Chubin Chen,Chenyang Zhu,Longxiang Tang,Xiu Li

Main category: cs.CV

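TL;DR: PRISM proposes PSAR, a physically structured dehazing framework that jointly reconstructs the clear scene and the scattering variables. With online non-uniform haze synthesis and a selective self-distillation adaptation strategy, it addresses non-uniform haze, multi-source illumination, and the lack of paired data in real scenes, achieving state-of-the-art performance on real-world image dehazing.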

Details Motivation: Real-world image dehazing faces non-uniform haze distribution, spatially varying illumination from multiple light sources, and a lack of paired real hazy-clean images. Method: The Proximal Scattered Atmosphere Reconstruction (PSAR) framework jointly reconstructs the clear scene and the scattering variables under the atmospheric scattering model; an online non-uniform haze synthesis pipeline and a Selective Self-distillation Adaptation scheme enable adaptive learning without paired real data and self-refinement guided by residual-haze detection. Result: State-of-the-art performance on multiple real-world dehazing benchmarks. Conclusion: By combining physical modelling with adaptive unsupervised learning, PRISM significantly improves the robustness and reliability of image dehazing in complex real scenes. Abstract: Real-world image dehazing (RID) aims to remove haze-induced degradation from real scenes. This task remains challenging due to non-uniform haze distribution, spatially varying illumination from multiple light sources, and the scarcity of paired real hazy-clean data. In PRISM, we propose Proximal Scattered Atmosphere Reconstruction (PSAR), a physically structured framework that jointly reconstructs the clear scene and scattering variables under the atmospheric scattering model, thereby improving reliability in complex regions and mixed-light conditions. To bridge the synthetic-to-real gap, we design an online non-uniform haze synthesis pipeline and a Selective Self-distillation Adaptation scheme for unpaired real-world scenarios, which enables the model to selectively learn from high-quality perceptual targets while leveraging its intrinsic scattering understanding to audit residual haze and guide self-refinement. Extensive experiments on real-world benchmarks demonstrate that PRISM achieves state-of-the-art performance on RID tasks.
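
The atmospheric scattering model mentioned above is commonly written as I = J·t + A·(1 − t) with transmission t = exp(−β·d), where J is the clear scene, A the atmospheric light, β the scattering coefficient, and d the depth. The sketch below uses that standard form to synthesize non-uniform haze by letting β vary per pixel. The function names and the per-pixel β map are illustrative assumptions, not the PRISM synthesis pipeline.

```python
import numpy as np

def add_haze(clear: np.ndarray, depth: np.ndarray, beta, airlight: float):
    """Apply I = J*t + A*(1 - t) with t = exp(-beta * depth); beta may vary per pixel."""
    t = np.exp(-np.asarray(beta) * depth)[..., None]  # (H, W, 1) transmission map
    return clear * t + airlight * (1.0 - t), t

def remove_haze(hazy: np.ndarray, t: np.ndarray, airlight: float, t_min: float = 0.05) -> np.ndarray:
    """Invert the model given t and A; clamp t to avoid amplifying noise in dense haze."""
    return (hazy - airlight) / np.maximum(t, t_min) + airlight

rng = np.random.default_rng(0)
clear = rng.random((2, 2, 3))
depth = np.full((2, 2), 2.0)
beta = np.array([[0.1, 0.5], [0.2, 0.3]])  # spatially varying haze density

hazy, t = add_haze(clear, depth, beta, airlight=0.9)
print(np.allclose(remove_haze(hazy, t, airlight=0.9), clear))
```

The inversion is exact here only because t and A are known. In real images both are unknown, which is why the abstract describes reconstructing the clear scene and the scattering variables jointly.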

[181] AnchorSplat: Feed-Forward 3D Gaussian Splatting With 3D Geometric Priors

Xiaoxue Zhang,Xiaoxu Zheng,Yixuan Yin,Tiao Zhao,Kaihua Tang,Michael Bi Mi,Zhan Xu,Dave Zhenyu Chen

Main category: cs.CV

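TL;DR: This paper proposes AnchorSplat, a Gaussian reconstruction framework based on 3D anchor alignment that drops the dependence on pixel alignment and models the scene directly in 3D space using geometric priors, substantially reducing the number of Gaussians while improving reconstruction quality and efficiency.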

Details Motivation: Existing feed-forward Gaussian reconstruction models are pixel-aligned, which tightly couples the Gaussian representation to the input images and leads to many Gaussians, strong dependence on resolution and number of views, and weak geometry awareness. Method: The AnchorSplat framework 1) introduces an anchor-aligned Gaussian representation guided by 3D geometric priors such as sparse point clouds, voxels, or RGB-D point clouds; 2) adds a Gaussian Refiner that adjusts the intermediate Gaussian parameters in a few forward passes. Result: State-of-the-art performance on the ScanNet++ v2 NVS benchmark, with more view-consistent, high-quality reconstructions using substantially fewer Gaussians. Conclusion: With 3D anchor alignment and a lightweight Refiner, AnchorSplat achieves scene-level 3D Gaussian reconstruction that is more efficient, more geometry-aware, and decoupled from image parameters. Abstract: Recent feed-forward Gaussian reconstruction models adopt a pixel-aligned formulation that maps each 2D pixel to a 3D Gaussian, entangling Gaussian representations tightly with the input images. In this paper, we propose AnchorSplat, a novel feed-forward 3DGS framework for scene-level reconstruction that represents the scene directly in 3D space. AnchorSplat introduces an anchor-aligned Gaussian representation guided by 3D geometric priors (e.g., sparse point clouds, voxels, or RGB-D point clouds), enabling more geometry-aware renderable 3D Gaussians that are independent of image resolution and number of views. This design substantially reduces the number of required Gaussians, improving computational efficiency while enhancing reconstruction fidelity. Beyond the anchor-aligned design, we utilize a Gaussian Refiner to adjust the intermediate Gaussians via merely a few forward passes. Experiments on the ScanNet++ v2 NVS benchmark demonstrate SOTA performance, outperforming previous methods with more view-consistent results and substantially fewer Gaussian primitives.

[182] Location Is All You Need: Continuous Spatiotemporal Neural Representations of Earth Observation Data

Mojgan Madadikhaljan,Jonathan Prexl,Isabelle Wittmann,Conrad M Albrecht,Michael Schmitt

Main category: cs.CV

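TL;DR: LIANet is a coordinate-based neural representation that models multi-temporal remote sensing data as a continuous spatiotemporal neural field. It reconstructs satellite imagery from spatial and temporal coordinates alone and supports fine-tuning on downstream tasks without the original data.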

Details Motivation: Provide a user-friendly alternative to geospatial foundation models that removes the data access and preprocessing overhead for end users and supports fine-tuning from labels alone. Method: LIANet (Location Is All You Need Network) is a coordinate-driven continuous spatiotemporal neural field that models multi-temporal spaceborne Earth observation data; it reconstructs satellite imagery from spatiotemporal coordinates and supports efficient fine-tuning for downstream tasks such as semantic segmentation and pixel-wise regression. Result: LIANet was pretrained on target areas of varying sizes; after fine-tuning, its downstream performance is competitive with training from scratch or using established geospatial foundation models (GFMs). Conclusion: LIANet is a lightweight, flexible, and practical neural representation framework that can complement geospatial foundation models, adapting to downstream tasks with competitive performance without relying on the original remote sensing data. Abstract: In this work, we present LIANet (Location Is All You Need Network), a coordinate-based neural representation that models multi-temporal spaceborne Earth observation (EO) data for a given region of interest as a continuous spatiotemporal neural field. Given only spatial and temporal coordinates, LIANet reconstructs the corresponding satellite imagery. Once pretrained, this neural representation can be adapted to various EO downstream tasks, such as semantic segmentation or pixel-wise regression, importantly, without requiring access to the original satellite data. LIANet intends to serve as a user-friendly alternative to Geospatial Foundation Models (GFMs) by eliminating the overhead of data access and preprocessing for end-users and enabling fine-tuning solely based on labels. We demonstrate the pretraining of LIANet across target areas of varying sizes and show that fine-tuning it for downstream tasks achieves competitive performance compared to training from scratch or using established GFMs. The source code and datasets are publicly available at https://github.com/mojganmadadi/LIANet/tree/v1.0.1.

[183] Novel Anomaly Detection Scenarios and Evaluation Metrics to Address the Ambiguity in the Definition of Normal Samples

Reiji Saito,Satoshi Kamiya,Kazuhiro Hotta

Main category: cs.CV

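TL;DR: This paper proposes new industrial anomaly detection scenarios that address the ambiguous definition of normal samples, and designs RePaste, which strengthens learning by iteratively re-pasting regions with high anomaly scores. It achieves state-of-the-art performance on the MVTec AD dataset.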

Details Motivation: In real industrial settings the definition of a "normal sample" is often ambiguous (for example, samples with small scratches or dust are still accepted), and as manufacturing equipment is upgraded the required defect-detection precision rises; conventional methods trained only on purely normal samples struggle to adapt to such specification changes. Method: RePaste works in multiple steps: regions predicted with high anomaly scores in the previous step are re-pasted into the current input image to strengthen the model's learning of subtle anomalies; a new evaluation metric suited to specification changes is also designed. Result: In the new scenarios on the MVTec AD benchmark, RePaste achieves state-of-the-art results on the proposed metric while maintaining high AUROC and PRO scores. Conclusion: RePaste handles the ambiguity of normal samples and changing specifications, offering an anomaly detection paradigm closer to practical industrial use. Abstract: In conventional anomaly detection, training data consist of only normal samples. However, in real-world scenarios, the definition of a normal sample is often ambiguous. For example, there are cases where a sample has small scratches or stains but is still acceptable for practical usage. On the other hand, higher precision is required when manufacturing equipment is upgraded. In such cases, normal samples may include small scratches, tiny dust particles, or a foreign object that we would prefer to classify as an anomaly. Such cases frequently occur in industrial settings, yet they have not been discussed until now. Thus, we propose novel scenarios and an evaluation metric to accommodate specification changes in real-world applications. Furthermore, to address the ambiguity of normal samples, we propose RePaste, which enhances learning by re-pasting regions with high anomaly scores from the previous step into the input for the next step. On our scenarios using the MVTec AD benchmark, RePaste achieved state-of-the-art performance with respect to the proposed evaluation metric, while maintaining high AUROC and PRO scores. Code: https://github.com/ReijiSoftmaxSaito/Scenario

[184] SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation

Qizhou Wang,Guansong Pang,Christopher Leckie

Main category: cs.CV

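TL;DR: This paper presents SurFITR, a dataset for forgery detection and localisation in surveillance-style images, in response to the falsification of visual evidence enabled by generative AI. Generated by a multimodal LLM-powered pipeline, it contains over 137k images with diverse tampering types and resolutions, and training on it markedly improves existing forgery detectors in surveillance scenarios.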

Details Motivation: Existing forgery detection models generalise poorly to surveillance scenarios because tampering in surveillance imagery is typically localised and subtle, with varied viewpoints, small or occluded subjects, and lower image quality, whereas existing datasets are mostly built on full-image synthesis or heavily manipulated object-centric images. Method: A large-scale surveillance-style forgery dataset, SurFITR, is built with a multimodal LLM-powered pipeline for semantically aware, fine-grained editing across diverse surveillance scenes, using multiple image editing models to generate over 137k tampered images. Result: Existing forgery detectors degrade significantly on SurFITR, while training on SurFITR substantially improves in-domain and cross-domain detection performance. Conclusion: SurFITR fills the gap in datasets for surveillance image forgery detection, provides a key resource for making forgery detectors more robust in real surveillance scenarios, and is publicly available. Abstract: We present the Surveillance Forgery Image Test Range (SurFITR), a dataset for surveillance-style image forgery detection and localisation, in response to recent advances in open-access image generation models that raise concerns about falsifying visual evidence. Existing forgery models, trained on datasets with full-image synthesis or large manipulated regions in object-centric images, struggle to generalise to surveillance scenarios. This is because tampering in surveillance imagery is typically localised and subtle, occurring in scenes with varied viewpoints, small or occluded subjects, and lower visual quality. To address this gap, SurFITR provides a large collection of forensically valuable imagery generated via a multimodal LLM-powered pipeline, enabling semantically aware, fine-grained editing across diverse surveillance scenes. It contains over 137k tampered images with varying resolutions and edit types, generated using multiple image editing models. Extensive experiments show that existing detectors degrade significantly on SurFITR, while training on SurFITR yields substantial improvements in both in-domain and cross-domain performance. SurFITR is publicly available on GitHub.

[185] Assessing the Added Value of Onboard Earth Observation Processing with the IRIDE HEO Service Segment

Parampuneet Kaur Thind,Charles Mwangi,Giovanni Varetto,Lorenzo Sarti,Andrea Papa,Andrea Taramelli

Main category: cs.CV

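TL;DR: This paper examines the limitations of ground-based processing architectures and, using the Hawk for Earth Observation (HEO) system in the IRIDE programme, shows the practical value of onboard processing for the burnt-area mapping service, including higher spatial resolution, detection of smaller events, and faster response. It is positioned as a complement to existing Copernicus services rather than a replacement.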

Details Motivation: Existing Earth observation services are constrained by downlink latency, bandwidth limits, and a lack of autonomous observation prioritisation, so timeliness and responsiveness need to improve. Method: Using the IRIDE burnt-area mapping service as a case study, the paper evaluates the added value of onboard processing (HEO) at the operational service level and compares it with conventional ground-based architectures. Result: Onboard processing supports sub-three-metre spatial resolution and a three-hectare minimum mapping unit, and improves system responsiveness; as an image-driven pre-classification layer, HEO supports downstream emergency and land-management workflows. Conclusion: Onboard intelligent processing has clear operational value for building low-latency Earth observation services and can serve as a complementary layer to existing services. Abstract: Current operational Earth Observation (EO) services, including the Copernicus Emergency Management Service (CEMS), the European Forest Fire Information System (EFFIS), and the Copernicus Land Monitoring Service (CLMS), rely primarily on ground-based processing pipelines. While these systems provide mature large-scale information products, they remain constrained by downlink latency, bandwidth limitations, and limited capability for autonomous observation prioritisation. The International Report for an Innovative Defence of Earth (IRIDE) programme is a national Earth observation initiative led by the Italian government to support public authorities through timely, objective information derived from spaceborne data. Rather than a single constellation, IRIDE is designed as a constellation of constellations, integrating heterogeneous sensing technologies within a unified service-oriented architecture. Within this framework, Hawk for Earth Observation (HEO) enables onboard generation of data products, allowing information extraction earlier in the processing chain. This paper examines the limitations of ground-only architectures and evaluates the added value of onboard processing at the operational service level. The IRIDE burnt-area mapping service is used as a representative case study to demonstrate how onboard intelligence can support higher spatial detail (sub-three-metre ground sampling distance), smaller detectable events (minimum mapping unit of three hectares), and improved system responsiveness. Rather than replacing existing Copernicus services, the IRIDE HEO capability is positioned as a complementary layer providing image-driven pre-classification to support downstream emergency and land-management workflows. This work highlights the operational value of onboard intelligence for emerging low-latency EO service architectures.

[186] Accuracy Improvement of Semi-Supervised Segmentation Using Supervised ClassMix and Sup-Unsup Feature Discriminator

Takahiro Mano,Reiji Saito,Kazuhiro Hotta

Main category: cs.CV

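TL;DR: This paper proposes an improved semi-supervised semantic segmentation method that pastes class labels and the corresponding regions from labeled images onto unlabeled images and their pseudo-labels, and aligns predictions on unlabeled images with those on labeled images. This mitigates inaccurate pseudo-labels and the data-quality gap, improving mIoU by 2.07% on average on the Chase and COVID-19 datasets.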

Details Motivation: Address inaccurate pseudo-labels in semi-supervised semantic segmentation and the data-quality gap between labeled and unlabeled images, which biases feature maps. Method: 1) Paste class labels and the corresponding image regions from labeled images onto unlabeled images and their pseudo-label images; 2) train the model so that predictions on unlabeled images become more similar to those on labeled images. Result: On the Chase and COVID-19 datasets, mIoU improves by 2.07% on average over conventional semi-supervised learning methods. Conclusion: The method mitigates the negative effects of pseudo-label errors and data-quality differences, improving semi-supervised semantic segmentation performance. Abstract: In semantic segmentation, the creation of pixel-level labels for training data incurs significant costs. To address this problem, semi-supervised learning, which utilizes a small number of labeled images alongside unlabeled images to enhance the performance, has gained attention. A conventional semi-supervised learning method, ClassMix, pastes class labels predicted from unlabeled images onto other images. However, since ClassMix performs operations using pseudo-labels obtained from unlabeled images, there is a risk of handling inaccurate labels. Additionally, there is a gap in data quality between labeled and unlabeled images, which can impact the feature maps. This study addresses these two issues. First, we propose a method where class labels from labeled images, along with the corresponding image regions, are pasted onto unlabeled images and their pseudo-labeled images. Second, we introduce a method that trains the model to make predictions on unlabeled images more similar to those on labeled images. Experiments on the Chase and COVID-19 datasets demonstrated an average improvement of 2.07% in mIoU compared to conventional semi-supervised learning methods.
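
The first step described above, pasting ground-truth regions from a labeled image onto an unlabeled image and its pseudo-label, can be sketched as a masked copy. The function name `supervised_classmix` and the way the pasted classes are chosen are assumptions for illustration; the abstract does not give those details.

```python
import numpy as np

def supervised_classmix(lab_img, lab_mask, unl_img, unl_pseudo, classes):
    """Paste pixels of the chosen classes from a labeled image (and its ground-truth
    labels) onto an unlabeled image (and its pseudo-label map)."""
    paste = np.isin(lab_mask, classes)                      # (H, W) boolean paste region
    mixed_img = np.where(paste[..., None], lab_img, unl_img)
    mixed_lbl = np.where(paste, lab_mask, unl_pseudo)
    return mixed_img, mixed_lbl

lab_img = np.ones((2, 2, 3))
unl_img = np.zeros((2, 2, 3))
lab_mask = np.array([[1, 0], [2, 0]])
unl_pseudo = np.full((2, 2), 3)

mixed_img, mixed_lbl = supervised_classmix(lab_img, lab_mask, unl_img, unl_pseudo, classes=[1])
print(mixed_lbl)
```

Inside the pasted region the labels are ground truth rather than predictions, so at least part of every mixed training target is guaranteed to be correct, which is the stated advantage over pasting pseudo-labeled regions.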

[187] A Utility-preserving De-identification Pipeline for Cross-hospital Radiology Data Sharing

Chenhao Liu,Zelin Wen,Yan Tong,Junjie Zhu,Xinyu Tian,Yuchi Liu,Ashu Gupta,Syed M. S. Islam,Tom Gedeon,Yue Yao

Main category: cs.CV

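TL;DR: This paper proposes a utility-preserving de-identification pipeline (UPDP) for radiology data that protects patient privacy while retaining the key pathology information needed for vision-language model training and cross-hospital transfer learning.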

Details Motivation: Large-scale radiology data are essential for medical AI systems but are hard to share across hospitals because of privacy concerns; existing de-identification methods focus on compliant release and overlook utility for large-scale vision-language model training and cross-hospital transfer. Method: Build a blacklist of privacy-sensitive terms and a whitelist of pathology-related terms; for images, use a generative filtering mechanism to synthesize images that retain pathology features but remove private information; apply ID filtering to reports; share the generated images together with the filtered reports. Result: On public chest X-ray benchmarks, the de-identified data remove private information while retaining diagnostically relevant pathology cues; models trained on them reach diagnostic accuracy close to models trained on the original data, with a marked drop in identity-recognition accuracy; in the cross-hospital setting, combining de-identified data with local data further improves performance. Conclusion: UPDP balances utility and security for cross-hospital radiology data sharing while protecting patient privacy, offering a feasible path to collaborative medical AI training. Abstract: Large-scale radiology data are critical for developing robust medical AI systems. However, sharing such data across hospitals remains heavily constrained by privacy concerns. Existing de-identification research in radiology mainly focuses on removing identifiable information to enable compliant data release. Yet whether de-identified radiology data can still preserve sufficient utility for large-scale vision-language model training and cross-hospital transfer remains underexplored. In this paper, we introduce a utility-preserving de-identification pipeline (UPDP) for cross-hospital radiology data sharing. Specifically, we compile a blacklist of privacy-sensitive terms and a whitelist of pathology-related terms. For radiology images, we use a generative filtering mechanism that synthesizes privacy-filtered, pathology-preserving counterparts of the original images. These synthetic image counterparts, together with ID-filtered reports, can then be securely shared across hospitals for downstream model development and evaluation. Experiments on public chest X-ray benchmarks demonstrate that our method effectively removes privacy-sensitive information while preserving diagnostically relevant pathology cues. Models trained on the de-identified data maintain competitive diagnostic accuracy compared with those trained on the original data, while exhibiting a marked decline in identity-related accuracy, confirming effective privacy protection. In the cross-hospital setting, we further show that de-identified data can be combined with local data to yield better performance.

[188] CSA-Graphs: A Privacy-Preserving Structural Dataset for Child Sexual Abuse Research

Carlos Caetano,Camila Laranjeira,Clara Ernesto,Artur Barros,João Macedo,Leo S. F. Ribeiro,Jefersson A. dos Santos,Sandra Avila

Main category: cs.CV

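TL;DR: This paper presents CSA-Graphs, a privacy-preserving graph-structured dataset that replaces original child sexual abuse images with scene graphs and skeleton graphs, reconciling legal and ethical constraints with the needs of algorithmic research.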

Details Motivation: Because of legal and ethical restrictions, child sexual abuse imagery (CSAI) datasets cannot be shared publicly, which severely hinders the reproducibility and progress of related computer vision research. Method: The CSA-Graphs dataset uses two graph-based modalities in place of the original images: scene graphs describing object relationships and skeleton graphs encoding human pose, removing explicit visual content while preserving contextual information. Result: Experiments show that both graph representations retain useful information for CSAI classification and that combining them further improves performance. Conclusion: CSA-Graphs offers a compliant, usable resource for child-safety computer vision research, balancing technical progress with privacy and ethical requirements. Abstract: Child Sexual Abuse Imagery (CSAI) classification is an important yet challenging problem for computer vision research due to the strict legal and ethical restrictions that prevent the public sharing of CSAI datasets. This limitation hinders reproducibility and slows progress in developing automated methods. In this work, we introduce CSA-Graphs, a privacy-preserving structural dataset. Instead of releasing the original images, we provide structural representations that remove explicit visual content while preserving contextual information. CSA-Graphs includes two complementary graph-based modalities: scene graphs describing object relationships and skeleton graphs encoding human pose. Experiments show that both representations retain useful information for classifying CSAI, and that combining them further improves performance. This dataset enables broader research on computer vision methods for child safety while respecting legal and ethical constraints.

[189] USCNet: Transformer-Based Multimodal Fusion with Segmentation Guidance for Urolithiasis Classification

Changmiao Wang,Songqi Zhang,Yongquan Zhang,Yifei Wang,Liya Liu,Nannan Li,Xingzhi Li,Jiexin Pan,Yi Jiang,Xiang Wan,Hai Wang,Ahmed Elazab

Main category: cs.CV

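TL;DR: This paper proposes USCNet, a new method that fuses CT images with electronic health record (EHR) data for accurate preoperative classification of kidney stones, significantly outperforming existing methods.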

Details Motivation: Current kidney stone composition analysis depends on postoperative specimens and cannot provide rapid preoperative classification, which limits personalized treatment and recurrence prevention. Method: USCNet is a Transformer-based multimodal fusion network with a CT-EHR attention mechanism and a segmentation-guided attention module, plus a dynamic loss function that jointly optimizes segmentation and classification. Result: On an in-house dataset, USCNet performs strongly on all evaluation metrics, with classification performance significantly surpassing mainstream existing methods. Conclusion: USCNet offers a promising clinical solution for accurate preoperative kidney stone classification, and the code is open source. Abstract: Kidney stone disease ranks among the most prevalent conditions in urology, and understanding the composition of these stones is essential for creating personalized treatment plans and preventing recurrence. Current methods for analyzing kidney stones depend on postoperative specimens, which prevents rapid classification before surgery. To overcome this limitation, we introduce a new approach called the Urinary Stone Segmentation and Classification Network (USCNet). This innovative method allows for precise preoperative classification of kidney stones by integrating Computed Tomography (CT) images with clinical data from Electronic Health Records (EHR). USCNet employs a Transformer-based multimodal fusion framework with CT-EHR attention and segmentation-guided attention modules for accurate classification. Moreover, a dynamic loss function is introduced to effectively balance the dual objectives of segmentation and classification. Experiments on an in-house kidney stone dataset show that USCNet demonstrates outstanding performance across all evaluation metrics, with its classification efficacy significantly surpassing existing mainstream methods. This study presents a promising solution for the precise preoperative classification of kidney stones, offering substantial clinical benefits. The source code has been made publicly available: https://github.com/ZhangSongqi0506/KidneyStone.

[190] Learning to Search: A Decision-Based Agent for Knowledge-Based Visual Question Answering

Zhuohong Chen,Zhenxian Wu,Yunyao Yu,Hangrui Xu,Zirui Liao,Zhifang Liu,Xiangwen Deng,Pen Jiao,Haoqian Wang

Main category: cs.CV

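TL;DR: This paper reformulates knowledge-based visual question answering (KB-VQA) as a search-agent problem in which multi-step decisions (Answer, Image Retrieval, Text Retrieval, Caption) dynamically coordinate retrieval and reasoning. The agent is fine-tuned with supervision from automatically constructed multi-step trajectories and achieves state-of-the-art results on InfoSeek and E-VQA.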

Details Motivation: Existing RAG methods use a fixed sequential pipeline that adapts poorly to diverse question types and separates retrieval from reasoning, resulting in poorly aligned evidence, non-adaptive queries, and badly timed termination. Method: KB-VQA is modelled as a multi-step decision-making search agent that selects one of four actions at each step; an automated pipeline collects multi-step trajectories containing the reasoning process, tool calls, and intermediate decisions; the trajectories serve as supervision for fine-tuning. Result: State-of-the-art performance on both InfoSeek and E-VQA, significantly outperforming the baselines. Conclusion: Recasting KB-VQA as a search agent jointly optimizes retrieval and reasoning and handles complex knowledge-based questions more flexibly and accurately. Abstract: Knowledge-based visual question answering (KB-VQA) requires vision-language models to understand images and use external knowledge, especially for rare entities and long-tail facts. Most existing retrieval-augmented generation (RAG) methods adopt a fixed pipeline that sequentially retrieves information, filters it, and then produces an answer. Such a design makes it difficult to adapt to diverse question types. Moreover, it separates retrieval from reasoning, making it hard for the model to decide when to search, how to refine queries, or when to stop. As a result, the retrieved evidence is often poorly aligned with the question. To address these limitations, we reformulate KB-VQA as a search-agent problem and model the solving process as a multi-step decision-making procedure. At each step, the agent selects one of four actions (Answer, Image Retrieval, Text Retrieval, and Caption) based on its current information state. We further design an automated pipeline to collect multi-step trajectories that record the agent's reasoning process, tool usage, and intermediate decisions. These trajectories are then used as supervision for fine-tuning. Experiments on InfoSeek and E-VQA demonstrate that our method achieves state-of-the-art performance, consistently outperforming prior baselines and confirming the effectiveness of our framework.
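
The multi-step decision procedure can be pictured as a simple control loop. The sketch below is a hypothetical skeleton, not the paper's agent: the four action names come from the abstract, but `run_agent`, the `policy` callable, the `tools` mapping, and the step limit are all assumptions, and the policy here is a scripted stand-in for the fine-tuned model.

```python
from enum import Enum

class Action(Enum):
    ANSWER = "answer"
    IMAGE_RETRIEVAL = "image_retrieval"
    TEXT_RETRIEVAL = "text_retrieval"
    CAPTION = "caption"

def run_agent(policy, tools, question, max_steps=5):
    """At each step the policy picks an action from the current state; non-answer
    actions call a tool and append its result to the evidence."""
    state = {"question": question, "evidence": []}
    for _ in range(max_steps):
        action, arg = policy(state)
        if action is Action.ANSWER:
            return arg, state["evidence"]
        state["evidence"].append((action, tools[action](arg)))
    return None, state["evidence"]  # step budget exhausted without an answer

# Scripted stand-in policy: caption first, then one text search, then answer.
def scripted_policy(state):
    n = len(state["evidence"])
    if n == 0:
        return Action.CAPTION, "image"
    if n == 1:
        return Action.TEXT_RETRIEVAL, "tower in caption"
    return Action.ANSWER, "Paris"

stub_tools = {
    Action.CAPTION: lambda arg: "a tall iron tower",
    Action.TEXT_RETRIEVAL: lambda arg: "The Eiffel Tower is in Paris.",
    Action.IMAGE_RETRIEVAL: lambda arg: "similar images",
}

print(run_agent(scripted_policy, stub_tools, "Which city is this?"))
```

Unlike a fixed retrieve-filter-answer pipeline, the loop lets the policy decide whether to search at all, which tool to call, and when to stop. The recorded `(action, result)` pairs correspond to the kind of trajectory the abstract describes using as fine-tuning supervision.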

[191] Bridging MRI and PET physiology: Untangling complementarity through orthogonal representations

Sonja Adomeit,Kartikay Tehlan,Lukas Förner,Katharina Weisser,Helen Scholtiseek,David Kaufmann,Julie Steinestel,Constantin Lapa,Thomas Kröncke,Thomas Wendler

Main category: cs.CV

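TL;DR: This paper proposes a multimodal fusion framework based on orthogonal subspace decomposition that splits the PSMA PET signal into an MRI-explainable physiological envelope and an orthogonal residual, revealing PET-specific signal that MRI physiological features cannot describe, most prominently in tumour regions.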

Details Motivation: Existing multimodal imaging analysis lacks a clear definition of shared versus modality-specific information, yet this distinction is essential for understanding the irreplaceable clinical value of each modality and for optimizing acquisition strategies. Method: A subspace decomposition framework trains an intensity-based implicit neural representation (INR) to map multiparametric MRI to PSMA PET uptake, and adds an SVD-based projection regularization that forces the PET residual to be orthogonal to the MRI feature manifold. Result: Validated on 13 prostate cancer patients: residual components within the MRI feature manifold are absorbed into the physiological envelope, while the orthogonal residual is largest in tumour regions, indicating that PSMA PET contains signal components MRI cannot represent. Conclusion: The method gives a structured characterization of modality complementarity from the perspective of representation geometry, going beyond the image-translation paradigm and providing a more clinically interpretable mathematical basis for multimodal fusion. Abstract: Multimodal imaging analysis often relies on joint latent representations, yet these approaches rarely define what information is shared versus modality-specific. Clarifying this distinction is clinically relevant, as it delineates the irreducible contribution of each modality and informs rational acquisition strategies. We propose a subspace decomposition framework that reframes multimodal fusion as a problem of orthogonal subspace separation rather than translation. We decompose Prostate-Specific Membrane Antigen (PSMA) PET uptake into an MRI-explainable physiological envelope and an orthogonal residual reflecting signal components not expressible within the MRI feature manifold. Using multiparametric MRI, we train an intensity-based, non-spatial implicit neural representation (INR) to map MRI feature vectors to PET uptake. We introduce a projection-based regularization using singular value decomposition to penalize residual components lying within the span of the MRI feature manifold. This enforces mathematical orthogonality between tissue-level physiological properties (structure, diffusion, perfusion) and intracellular PSMA expression. Tested on 13 prostate cancer patients, the model demonstrates that residual components spanned by MRI features are absorbed into the learned envelope, while the orthogonal residual is largest in tumour regions. This indicates that PSMA PET contains signal components not recoverable from MRI-derived physiological descriptors. The resulting decomposition provides a structured characterization of modality complementarity grounded in representation geometry rather than image translation.
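
One way to read the projection-based regularization is as a penalty on the part of the residual that lies in the column space of the MRI feature matrix. The sketch below illustrates that linear-algebra idea only; the function names, the use of a plain feature matrix, and the squared-norm penalty are assumptions, since the abstract does not state the exact loss.

```python
import numpy as np

def split_residual(mri_features: np.ndarray, residual: np.ndarray, tol: float = 1e-10):
    """Split a residual into its component inside span(mri_features) and the
    orthogonal remainder, using an SVD basis of the feature matrix."""
    U, s, _ = np.linalg.svd(mri_features, full_matrices=False)
    U = U[:, s > tol * s.max()]           # keep directions with non-negligible singular values
    in_span = U @ (U.T @ residual)        # orthogonal projection onto the column space
    return in_span, residual - in_span

def span_penalty(mri_features: np.ndarray, residual: np.ndarray) -> float:
    """Squared norm of the in-span component (assumed form of the regularizer)."""
    in_span, _ = split_residual(mri_features, residual)
    return float(in_span @ in_span)

# Three voxels, two MRI features; the third voxel direction is outside the feature span.
F = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
r = np.array([1.0, 2.0, 3.0])

in_span, orth = split_residual(F, r)
print(in_span, orth, span_penalty(F, r))
```

Driving this penalty toward zero pushes everything MRI can explain into the learned envelope, so what remains in the residual is, by construction, orthogonal to the MRI features.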

[192] DINO-QPM: Adapting Visual Foundation Models for Globally Interpretable Image Classification

Robert Zimmermann,Thomas Norrenbrock,Bodo Rosenhahn

Main category: cs.CV

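TL;DR: DINO-QPM is a lightweight interpretability adapter that converts the high-dimensional features of visual foundation models such as DINOv2 into human-understandable, contrastive, class-independent representations. With the backbone kept frozen, it uses average pooling and a sparsity loss to achieve globally interpretable image classification, and outperforms existing methods in both accuracy and explanation quality.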

Details Motivation: Visual foundation models such as DINOv2 perform well, but their high-dimensional, entangled feature representations severely hinder interpretability; a method is needed that provides human-understandable explanations without fine-tuning the backbone. Method: The DINO-QPM adapter operates on a frozen DINO backbone, drops the CLS token, and uses average pooling to connect patch embeddings to features, enabling spatial localisation; a sparsity loss suppresses background noise and spatial scatter; QPM is adapted to operate on the frozen backbone to produce globally interpretable representations. Result: DINO-QPM exceeds the classification accuracy of the DINOv2 linear probe and clearly outperforms other methods applicable to frozen visual foundation models on the newly introduced Plausibility metric and other interpretability metrics. Conclusion: DINO-QPM brings QPM's level of interpretability to visual foundation models as a lightweight adapter, delivering high-quality, spatially localisable, class-independent global explanations without sacrificing accuracy, and offers a new path to trustworthy deployment of frozen large models. Abstract: Although visual foundation models like DINOv2 provide state-of-the-art performance as feature extractors, their complex, high-dimensional representations create substantial hurdles for interpretability. This work proposes DINO-QPM, which converts these powerful but entangled features into contrastive, class-independent representations that are interpretable by humans. DINO-QPM is a lightweight interpretability adapter that pursues globally interpretable image classification, adapting the Quadratic Programming Enhanced Model (QPM) to operate on strictly frozen DINO backbones. While classification with visual foundation models typically relies on the CLS token, we deliberately diverge from this standard. By leveraging average-pooling, we directly connect the patch embeddings to the model's features and therefore enable spatial localisation of DINO-QPM's globally interpretable features within the input space. Furthermore, we apply a sparsity loss to minimise spatial scatter and background noise, ensuring that explanations are grounded in relevant object parts. With DINO-QPM we make the level of interpretability of QPM available as an adapter while exceeding the accuracy of DINOv2 linear probe. Evaluated through an introduced Plausibility metric and other interpretability metrics, extensive experiments demonstrate that DINO-QPM is superior to other applicable methods for frozen visual foundation models in both classification accuracy and explanation quality.
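
Why average pooling enables spatial localisation can be seen from linearity: a linear readout of the mean patch embedding equals the mean of the per-patch readouts, so each patch has an explicit additive contribution. The sketch below demonstrates that identity on random data. It is a simplification under stated assumptions (a single linear readout `w`, hypothetical function name), not the DINO-QPM architecture.

```python
import numpy as np

def patch_contributions(patch_emb: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Per-patch additive contribution to a linear readout of the average-pooled embedding.

    patch_emb: (num_patches, dim) patch embeddings; w: (dim,) readout weights.
    """
    return patch_emb @ w / patch_emb.shape[0]

rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 3))   # 4 patches, 3-dimensional embeddings
w = rng.normal(size=3)

pooled_score = w @ patches.mean(axis=0)
contrib = patch_contributions(patches, w)
print(np.isclose(contrib.sum(), pooled_score))
```

Reshaping `contrib` to the patch grid gives a heatmap of where a feature is active. A readout of the CLS token has no such decomposition, because the token mixes patch information through attention inside the backbone.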

[193] Multiple Domain Generalization Using Category Information Independent of Domain Differences

Reiji Saito,Kazuhiro Hotta

Main category: cs.CV

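TL;DR: This paper proposes a new method for domain-generalized segmentation that disentangles domain-invariant features from domain-specific ones and uses the quantum vectors in SQ-VAE to absorb the domain gap, improving cross-domain segmentation of blood vessels and cell nuclei.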

Details Motivation: Models trained on a training (source) domain often lose substantial accuracy on a different test (target) domain, especially in medical images where imaging equipment, staining methods, and similar factors cause domain shift. Method: (1) Separate domain-independent category information from source-domain-specific information; (2) use the quantum vectors in the Stochastically Quantized Variational AutoEncoder (SQ-VAE) to absorb the remaining domain gap. Result: Experiments on vascular segmentation and cell nucleus segmentation datasets show higher cross-domain segmentation accuracy than conventional methods. Conclusion: Disentangling domain-invariant representations and modelling the domain gap with SQ-VAE effectively improves domain generalization in medical image segmentation. Abstract: Domain generalization is a technique aimed at enabling models to maintain high accuracy when applied to new environments or datasets (unseen domains) that differ from the datasets used in training. Generally, the accuracy of models trained on a specific dataset (source domain) often decreases significantly when evaluated on different datasets (target domain). This issue arises due to differences in domains caused by varying environmental conditions such as imaging equipment and staining methods. Therefore, we undertook two initiatives to perform segmentation that does not depend on domain differences. We propose a method that separates category information independent of domain differences from the information specific to the source domain. By using information independent of domain differences, our method enables learning the segmentation targets (e.g., blood vessels and cell nuclei). Although we extract information that is independent of domain differences, this cannot completely bridge the domain gap between training and test data. Therefore, we absorb the domain gap using the quantum vectors in Stochastically Quantized Variational AutoEncoder (SQ-VAE). In experiments, we evaluated our method on datasets for vascular segmentation and cell nucleus segmentation. Our method improved accuracy compared to conventional methods.

[194] Energy-based Tissue Manifolds for Longitudinal Multiparametric MRI Analysis

Kartikay Tehlan,Lukas Förner,Nico Schmutzenhofer,Michael Frühwald,Matthias Wagner,Nassir Navab,Thomas Wendler

Main category: cs.CV

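TL;DR: This paper proposes a geometric framework for longitudinal multiparametric MRI analysis based on patient-specific energy modelling. An implicit energy function over sequence space is learned from a single baseline scan, and follow-up scans are evaluated against this fixed baseline energy manifold, enabling unsupervised, segmentation-free analysis of tissue evolution.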

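Details Motivation: Conventional longitudinal MRI analysis relies on image-space segmentation or registration and is sensitive to noise, artifacts, and anatomical variation; a geometric approach is needed that requires no annotations and captures changes in the intrinsic physical state of tissue. Method: Model each voxel's multi-sequence intensity vector (T1, T1c, T2, FLAIR, ADC) and train a compact implicit neural network via denoising score matching to learn the energy function $E_\theta(\mathbf{u})$ of the baseline scan; treat this energy manifold as a fixed geometric reference, characterise tissue state with differential-geometric quantities such as the gradient and Laplacian curvature, and assess how the distribution of sequence vectors in follow-up scans evolves relative to the baseline energy. Result: In a paediatric brain tumour case, follow-up scans before recurrence already showed rising energy and directional displacement toward the baseline tumour-associated energy minimum; a stable case remained within low-energy tissue basins without systematic drift, indicating that the method can detect abnormal tissue states before radiologically visible change. Conclusion: Patient-specific energy manifolds can serve as a geometric reference system for longitudinal mpMRI without segmentation or supervised learning, offering a new paradigm for manifold-based tracking of tissue at risk in neuro-oncology. Abstract: We propose a geometric framework for longitudinal multi-parametric MRI analysis based on patient-specific energy modelling in sequence space. Rather than operating on images with spatial networks, each voxel is represented by its multi-sequence intensity vector ($T1$, $T1c$, $T2$, FLAIR, ADC), and a compact implicit neural representation is trained via denoising score matching to learn an energy function $E_\theta(\mathbf{u})$ over $\mathbb{R}^d$ from a single baseline scan. The learned energy landscape provides a differential-geometric description of tissue regimes without segmentation labels. Local minima define tissue basins, gradient magnitude reflects proximity to regime boundaries, and Laplacian curvature characterises local constraint structure. Importantly, this baseline energy manifold is treated as a fixed geometric reference: it encodes the set of contrast combinations observed at diagnosis and is not retrained at follow-up. Longitudinal assessment is therefore formulated as evaluation of subsequent scans relative to this baseline geometry. Rather than comparing anatomical segmentations, we analyse how the distribution of MRI sequence vectors evolves under the baseline energy function. In a paediatric case with later recurrence, follow-up scans show progressive deviation in energy and directional displacement in sequence space toward the baseline tumour-associated regime before clear radiological reappearance. In a case with stable disease, voxel distributions remain confined to established low-energy basins without systematic drift.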
The presented cases serve as proof-of-concept that patient-specific energy manifolds can function as geometric reference systems for longitudinal mpMRI analysis without explicit segmentation or supervised classification, providing a foundation for further investigation of manifold-based tissue-at-risk tracking in neuro-oncology.

[195] TeaLeafVision: An Explainable and Robust Deep Learning Framework for Tea Leaf Disease Classification

Rafi Ahamed,Sidratul Moon Nafsin,Md Abir Rahman,Tasnia Tarannum Roza,Munaia Jannat Easha,Abu Raihan

Main category: cs.CV

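TL;DR: This paper presents a deep-learning approach to tea leaf disease identification. DenseNet201 reaches 99% accuracy on the teaLeafBD dataset; Grad-CAM, occlusion sensitivity analysis, and adversarial training are used to improve interpretability and robustness, and a prototype for agricultural use is built.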

Details Motivation: Tea is the world's second most consumed beverage, and accurate disease identification is essential to protect yield and economic value; existing methods struggle with disease detection under complex field conditions. Method: Evaluate several CNN models (DenseNet201, MobileNetV2, InceptionV3) on the seven-class teaLeafBD dataset; apply Grad-CAM visualisation, occlusion sensitivity analysis, and adversarial training to improve interpretability and noise resistance; develop a deployable prototype system. Result: DenseNet201 reaches 99% test accuracy on teaLeafBD; the visualisation and robustness analyses indicate that the model's decisions rest on reasonable evidence and resist noise well; a prototype for real agricultural use was built. Conclusion: Deep learning models, DenseNet201 in particular, show high accuracy and practicality for real-world tea leaf disease identification and, combined with interpretability and robustness techniques, have potential for deployment in intelligent agricultural diagnosis. Abstract: As the world's second most consumed beverage after water, tea is not just a cultural staple but a global economic force of profound scale and influence. More than a mere drink, it represents a quiet negotiation between nature, culture, and the human desire for a moment of reflection. The precise identification and detection of tea leaf disease is therefore crucial. With this goal, we have evaluated several Convolutional Neural Network (CNN) models on the teaLeafBD dataset, three of which show notable performance: DenseNet201, MobileNetV2, and InceptionV3. The teaLeafBD dataset contains seven classes, six disease classes and one healthy class, collected under various field conditions reflecting real-world challenges. Among the CNN models, DenseNet201 achieved the highest test accuracy of 99%. To enhance model reliability and interpretability, we have implemented Gradient-weighted Class Activation Mapping (Grad-CAM), occlusion sensitivity analysis, and adversarial training techniques to increase the noise resistance of the model. Finally, we have developed a prototype to leverage the model's capabilities in real-life agriculture. This paper illustrates the capabilities of deep learning models for classifying disease in real-life tea leaf disease detection and management.

[196] INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling

InSpatio Team,Donghui Shen,Guofeng Zhang,Haomin Liu,Haoyu Ji,Hujun Bao,Hongjia Zhai,Jialin Liu,Jing Guo,Nan Wang,Siji Pan,Weihong Pan,Weijian Xie,Xianbin Liu,Xiaojun Xiang,Xiaoyu Zhang,Xinyu Chen,Yifu Wang,Yipeng Chen,Zhenzhou Fan,Zhewen Le,Zhichao Ye,Ziqiang Zhao

Main category: cs.CV

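TL;DR: This paper proposes INSPATIO-WORLD, a framework that uses a Spatiotemporal Autoregressive (STAR) architecture and Joint Distribution Matching Distillation (JDMD) to reconstruct high-fidelity, interactive dynamic 3D scenes in real time from a single reference video, with marked gains in spatial consistency and interaction precision.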

Details Motivation: Existing video generation methods fall short in spatial persistence and visual realism, making seamless navigation in complex environments hard to support. Method: Proposes the INSPATIO-WORLD framework, built around a Spatiotemporal Autoregressive (STAR) architecture with an Implicit Spatiotemporal Cache (which maintains global consistency during long-horizon navigation) and an Explicit Spatial Constraint Module (which provides physically plausible camera-trajectory control); also introduces Joint Distribution Matching Distillation (JDMD) to mitigate the fidelity loss caused by over-reliance on synthetic data. Result: On the WorldScore-Dynamic benchmark, INSPATIO-WORLD significantly outperforms existing SOTA methods in spatial consistency and interaction precision, ranks first among real-time interactive methods, and establishes a practical pipeline for reconstructing and navigating 4D environments from monocular video. Conclusion: INSPATIO-WORLD offers an effective new paradigm for building world models with spatial consistency and real-time interactivity, advancing dynamic scene understanding and navigation from monocular video. Abstract: Building world models with spatial consistency and real-time interactivity remains a fundamental challenge in computer vision. Current video generation paradigms often struggle with a lack of spatial persistence and insufficient visual realism, making it difficult to support seamless navigation in complex environments. To address these challenges, we propose INSPATIO-WORLD, a novel real-time framework capable of recovering and generating high-fidelity, dynamic interactive scenes from a single reference video. At the core of our approach is a Spatiotemporal Autoregressive (STAR) architecture, which enables consistent and controllable scene evolution through two tightly coupled components: the Implicit Spatiotemporal Cache aggregates reference and historical observations into a latent world representation, ensuring global consistency during long-horizon navigation; the Explicit Spatial Constraint Module enforces geometric structure and translates user interactions into precise and physically plausible camera trajectories. Furthermore, we introduce Joint Distribution Matching Distillation (JDMD). By using real-world data distributions as a regularizing guide, JDMD effectively overcomes the fidelity degradation typically caused by over-reliance on synthetic data.
Extensive experiments demonstrate that INSPATIO-WORLD significantly outperforms existing state-of-the-art (SOTA) models in spatial consistency and interaction precision, ranking first among real-time interactive methods on the WorldScore-Dynamic benchmark, and establishing a practical pipeline for navigating 4D environments reconstructed from monocular videos.

[197] VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis

Jian Yu,Fei Shen,Cong Wang,Yi Xin,Si Shen,Xiaoyu Du,Jinhui Tang

Main category: cs.CV

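TL;DR: This paper proposes VersaVogue, a framework that unifies garment generation and virtual dressing. A trait-routing attention (TA) module enables disentangled feature injection under multi-source heterogeneous conditions, and a multi-perspective preference optimization (MPO) pipeline that needs no human annotation improves generation quality and controllability.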

Details Motivation: Existing methods treat garment generation and virtual dressing separately and, under multi-source heterogeneous conditions, rely on simple feature concatenation or static injection, which causes attribute entanglement and semantic interference and falls short of real fashion workflows. Method: Proposes the unified VersaVogue framework, comprising a trait-routing attention (TA) module (a mixture-of-experts mechanism that dynamically routes condition features to the most compatible experts and generative layers) and an automated multi-perspective preference optimization (MPO) pipeline (which combines evaluators of content fidelity, textual alignment, and perceptual quality to build preference pairs, then optimizes the model with DPO). Result: On garment generation and virtual dressing benchmarks, VersaVogue outperforms existing methods in visual fidelity, semantic consistency, and fine-grained controllability. Conclusion: VersaVogue models the design and showcase stages jointly, improving the flexibility, disentanglement, and realism of multi-condition fashion image synthesis and providing a more robust and controllable generation solution for practical fashion applications. Abstract: Diffusion models have driven remarkable advancements in fashion image generation, yet prior works usually treat garment generation and virtual dressing as separate problems, limiting their flexibility in real-world fashion workflows. Moreover, fashion image synthesis under multi-source heterogeneous conditions remains challenging, as existing methods typically rely on simple feature concatenation or static layer-wise injection, which often causes attribute entanglement and semantic interference. To address these issues, we propose VersaVogue, a unified framework for multi-condition controllable fashion synthesis that jointly supports garment generation and virtual dressing, corresponding to the design and showcase stages of the fashion lifecycle. Specifically, we introduce a trait-routing attention (TA) module that leverages a mixture-of-experts mechanism to dynamically route condition features to the most compatible experts and generative layers, enabling disentangled injection of visual attributes such as texture, shape, and color. To further improve realism and controllability, we develop an automated multi-perspective preference optimization (MPO) pipeline that constructs preference data without human annotation or task-specific reward models. By combining evaluators of content fidelity, textual alignment, and perceptual quality, MPO identifies reliable preference pairs, which are then used to optimize the model via direct preference optimization (DPO).
Extensive experiments on both garment generation and virtual dressing benchmarks demonstrate that VersaVogue consistently outperforms existing methods in visual fidelity, semantic consistency, and fine-grained controllability.
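
The last step of the MPO pipeline described above optimizes the model with direct preference optimization. As a hedged illustration, the sketch below computes the generic DPO loss for one preference pair from log-probabilities under a trainable policy and a frozen reference model. The function name `dpo_loss`, the `beta` default, and the use of plain sample-level log-probabilities are assumptions; the abstract does not say which diffusion-specific DPO variant VersaVogue uses.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO loss for a single (chosen, rejected) preference pair.

    Each argument is the log-probability of a full sample under either the
    trainable policy or the frozen reference model. beta scales how strongly
    the policy is pushed away from the reference.
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) computed as softplus(-x) in a numerically stable way.
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# The loss is lower when the policy favours the chosen sample more than
# the reference model does.
loss_good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
loss_bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

When the policy and the reference agree exactly, the loss equals log 2, which is a convenient sanity check for an implementation.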

[198] PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

Ruihang Xu,Dewei Zhou,Xiaolong Shen,Fan Ma,Yi Yang

Main category: cs.CV

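TL;DR: This paper proposes PhyEdit, a framework that introduces explicit 3D geometric simulation as visual guidance and combines it with joint 2D-3D supervision to improve the physical accuracy and spatial consistency of object manipulation in image editing. It also builds the RealManip-10K real-world dataset and the ManipEval benchmark to support research and evaluation.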

Details Motivation: Existing visual generative models lack explicit modelling of 3D geometry and perspective projection in image editing, which leads to inaccurate object scaling and positioning and prevents physically precise manipulation. Method: Proposes the PhyEdit framework, which uses explicit geometric simulation as contextual 3D-aware visual guidance combined with joint 2D-3D supervision; builds the RealManip-10K real-world dataset (with paired images and depth annotations) and the ManipEval multi-dimensional benchmark. Result: Clearly outperforms existing methods, including strong closed-source models, in 3D geometric accuracy and manipulation consistency; RealManip-10K and ManipEval provide the first real, quantifiable evaluation basis for this task. Conclusion: Explicit 3D priors and joint 2D-3D supervision are an effective route to physically accurate image editing, and PhyEdit offers a new paradigm for trustworthy visual manipulation in interactive world models. Abstract: Achieving physically accurate object manipulation in image editing is essential for its potential applications in interactive world models. However, existing visual generative models often fail at precise spatial manipulation, resulting in incorrect scaling and positioning of objects. This limitation primarily stems from the lack of explicit mechanisms to incorporate 3D geometry and perspective projection. To achieve accurate manipulation, we develop PhyEdit, an image editing framework that leverages explicit geometric simulation as contextual 3D-aware visual guidance. By combining this plug-and-play 3D prior with joint 2D-3D supervision, our method effectively improves physical accuracy and manipulation consistency. To support this method and evaluate performance, we present a real-world dataset, RealManip-10K, for 3D-aware object manipulation featuring paired images and depth annotations. We also propose ManipEval, a benchmark with multi-dimensional metrics to evaluate 3D spatial control and geometric consistency. Extensive experiments show that our approach outperforms existing methods, including strong closed-source models, in both 3D geometric accuracy and manipulation consistency.

[199] Geo-EVS: Geometry-Conditioned Extrapolative View Synthesis for Autonomous Driving

Yatong Lan,Rongkui Tang,Lei He

Main category: cs.CV

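TL;DR: This paper proposes Geo-EVS, a framework that combines geometry-aware reprojection with artifact-guided latent diffusion to improve the quality and geometric accuracy of out-of-trajectory novel view synthesis for autonomous driving under sparse supervision.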

Details Motivation: Existing extrapolative novel view synthesis methods degrade outside recorded trajectories because extrapolated poses lack strong geometric support and dense target-view supervision; the model needs to be explicitly exposed to out-of-trajectory defects during training. Method: Proposes Geo-EVS: (1) Geometry-Aware Reprojection (GAR), which uses a fine-tuned VGGT to reconstruct colored point clouds and reproject them to observed and virtual poses, producing geometric condition maps; (2) Artifact-Guided Latent Diffusion (AGLD), which injects reprojection-derived artifact masks during training to guide structure recovery; evaluation uses the LiDAR-Projected Sparse-Reference (LPSR) protocol. Result: On the Waymo dataset, Geo-EVS improves sparse-view synthesis quality and geometric accuracy, especially in high-angle and low-coverage settings, and also improves downstream 3D detection. Conclusion: Explicitly modelling geometric conditions and extrapolation defects makes extrapolative novel view synthesis more robust and practical, and suggests a way to reduce camera-rig dependency in autonomous driving. Abstract: Extrapolative novel view synthesis can reduce camera-rig dependency in autonomous driving by generating standardized virtual views from heterogeneous sensors. Existing methods degrade outside recorded trajectories because extrapolated poses provide weak geometric support and no dense target-view supervision. The key is to explicitly expose the model to out-of-trajectory condition defects during training. We propose Geo-EVS, a geometry-conditioned framework under sparse supervision. Geo-EVS has two components. Geometry-Aware Reprojection (GAR) uses fine-tuned VGGT to reconstruct colored point clouds and reproject them to observed and virtual target poses, producing geometric condition maps. This design unifies the reprojection path between training and inference. Artifact-Guided Latent Diffusion (AGLD) injects reprojection-derived artifact masks during training so the model learns to recover structure under missing support. For evaluation, we use a LiDAR-Projected Sparse-Reference (LPSR) protocol when dense extrapolated-view ground truth is unavailable. On Waymo, Geo-EVS improves sparse-view synthesis quality and geometric accuracy, especially in high-angle and low-coverage settings. It also improves downstream 3D detection.

[200] Non-identifiability of Explanations from Model Behavior in Deep Networks of Image Authenticity Judgments

Icaro Re Depaolini,Uri Hasson

Main category: cs.CV

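TL;DR: This paper examines the interpretability of deep neural networks that predict human authenticity judgments. Although the models predict well, their attribution maps (e.g., Grad-CAM, LIME) agree poorly across architectures and do not reliably reflect the underlying cognitive mechanisms, suggesting that post hoc explanations should not be treated as strong evidence of cognitive mechanism.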

Details Motivation: Deep neural networks can predict human judgments without necessarily using the information humans rely on; the explanatory value of existing attribution methods such as heatmaps depends on their robustness, which remains unclear. Method: Fit lightweight regression heads on multiple frozen pretrained vision models to predict human authenticity ratings; generate attribution maps with Grad-CAM, LIME, and multiscale pixel masking; evaluate attribution consistency within a model (across random seeds) and across architectures; and build ensemble models to improve prediction and attribution stability. Result: Several models reach about 80% of the noise ceiling; VGG relies mainly on image quality rather than authenticity-specific features; EfficientNetB3 and Barlow Twins yield relatively stable attributions; cross-architecture attribution agreement is nonetheless weak; ensembles improve both predictive performance and the reliability of image-level attribution. Conclusion: Deep networks can predict human authenticity judgments well but do not provide identifiable, consistent explanations; post hoc attributions should not be regarded as reliable evidence of cognitive mechanism. Abstract: Deep neural networks can predict human judgments, but this does not imply that they rely on human-like information or reveal the cues underlying those judgments. Prior work has addressed this issue using attribution heatmaps, but their explanatory value in itself depends on robustness. Here we tested the robustness of such explanations by evaluating whether models that predict human authenticity ratings also produce consistent explanations within and across architectures. We fit lightweight regression heads to multiple frozen pretrained vision models and generated attribution maps using Grad-CAM, LIME, and multiscale pixel masking. Several architectures predicted ratings well, reaching about 80% of the noise ceiling. VGG models achieved this by tracking image quality rather than authenticity-specific variance, limiting the relevance of their attributions. Among the remaining models, attribution maps were generally stable across random seeds within an architecture, especially for EfficientNetB3 and Barlow Twins, and consistency was higher for images judged as more authentic. Crucially, agreement in attribution across architectures was weak even when predictive performance was similar. To address this, we combined models in ensembles, which improved prediction of human authenticity judgments and enabled image-level attribution via pixel masking. We conclude that while deep networks can predict human authenticity judgments well, they do not produce identifiable explanations for those judgments.
More broadly, our findings suggest that post hoc explanations from successful models of behavior should be treated as weak evidence for cognitive mechanism.

[201] GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos

Yiqian Wu,Rawal Khirodkar,Egor Zakharov,Timur Bagautdinov,Lei Xiao,Zhaoen Su,Shunsuke Saito,Xiaogang Jin,Junxuan Li

Main category: cs.CV

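TL;DR: GenLCA is a diffusion-based generative method that generates and edits photorealistic full-body 3D avatars from text and image inputs. Its visibility-aware diffusion training strategy allows efficient training on large-scale real-world video data.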

Details Motivation: Existing methods struggle to use the vast but only partially observable pool of real-world video to train high-quality, animatable 3D diffusion models; the blurring and transparency artifacts that occlusion introduces when lifting 2D video to 3D representations must be addressed. Method: Proposes a visibility-aware diffusion training strategy that repurposes a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer, encodes video frames into structured 3D tokens, and computes losses only over valid regions; a flow-based diffusion model is then trained on this token dataset. Result: Achieves high-fidelity, animatable full-body 3D avatar generation and editing, clearly surpassing existing methods in realism, generalization, and controllability. Conclusion: GenLCA opens a technical path for training natively 3D diffusion models on large-scale, incomplete 2D video data, offering a new paradigm for generative 3D content creation. Abstract: We present GenLCA, a diffusion-based generative model for generating and editing photorealistic full-body avatars from text and image inputs. The generated avatars are faithful to the inputs, while supporting high-fidelity facial and full-body animations. The core idea is a novel paradigm that enables training a full-body 3D diffusion model from partially observable 2D data, allowing the training dataset to scale to millions of real-world videos. This scalability contributes to the superior photorealism and generalizability of GenLCA. Specifically, we scale up the dataset by repurposing a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer, which encodes unstructured video frames into structured 3D tokens. However, most real-world videos only provide partial observations of body parts, resulting in excessive blurring or transparency artifacts in the 3D tokens. To address this, we propose a novel visibility-aware diffusion training strategy that replaces invalid regions with learnable tokens and computes losses only over valid regions. We then train a flow-based diffusion model on the token dataset, inherently maintaining the photorealism and animatability provided by the pretrained avatar reconstruction model. Our approach effectively enables the use of large-scale real-world video data to train a diffusion model natively in 3D. We demonstrate the efficacy of our method through diverse and high-fidelity generation and editing results, outperforming existing solutions by a large margin. The project page is available at https://onethousandwu.com/GenLCA-Page.

[202] Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training

Changkun Liu,Jiezhi Yang,Zeman Li,Yuan Deng,Jiancong Guo,Luca Ballan

Main category: cs.CV

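TL;DR: Mem3R is a model for long-sequence streaming 3D perception. It decouples camera tracking from geometric mapping, using an implicit fast-weight memory (a lightweight MLP updated via TTT) for the former and an explicit fixed-size token state for the latter. This substantially reduces drift and forgetting, improves performance, shrinks the model, and preserves memory and throughput efficiency.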

Details Motivation: Existing recurrent streaming 3D perception models are limited by the capacity of compressed latent memories and tend to suffer pose drift and temporal forgetting over long sequences, which harms temporal consistency. Method: Proposes the hybrid-memory architecture Mem3R: camera tracking uses a lightweight MLP updated via Test-Time Training (TTT) as an implicit fast-weight memory; geometric mapping maintains an explicit, fixed-size token-based state; state update strategies from the CUT3R ecosystem (such as TTT3R) can be reused. Result: Compared with CUT3R, the parameter count drops from 793M to 644M; Absolute Trajectory Error (ATE) falls by up to 39% on 500 to 1000 frame sequences; downstream tasks such as video depth estimation and 3D reconstruction also improve, with constant GPU memory usage and comparable inference throughput. Conclusion: Mem3R shows that a decoupled hybrid memory design is effective for long-sequence streaming 3D perception, balancing accuracy, efficiency, and scalability, and provides a more robust solution for real-time applications such as robotics and AR. Abstract: Streaming 3D perception is well suited to robotics and augmented reality, where long visual streams must be processed efficiently and consistently. Recent recurrent models offer a promising solution by maintaining fixed-size states and enabling linear-time inference, but they often suffer from drift accumulation and temporal forgetting over long sequences due to the limited capacity of compressed latent memories. We propose Mem3R, a streaming 3D reconstruction model with a hybrid memory design that decouples camera tracking from geometric mapping to improve temporal consistency over long sequences. For camera tracking, Mem3R employs an implicit fast-weight memory implemented as a lightweight Multi-Layer Perceptron updated via Test-Time Training. For geometric mapping, Mem3R maintains an explicit token-based fixed-size state. Compared with CUT3R, this design not only significantly improves long-sequence performance but also reduces the model size from 793M to 644M parameters. Mem3R supports existing improved plug-and-play state update strategies developed for CUT3R. Specifically, integrating it with TTT3R decreases Absolute Trajectory Error by up to 39% over the base implementation on 500 to 1000 frame sequences. The resulting improvements also extend to other downstream tasks, including video depth estimation and 3D reconstruction, while preserving constant GPU memory usage and comparable inference throughput. Project page: https://lck666666.github.io/Mem3R/

[203] Are Face Embeddings Compatible Across Deep Neural Network Models?

Fizza Rubab,Yiying Tong,Arun Ross

Main category: cs.CV

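TL;DR: This paper investigates how similarly different deep neural network models (both domain-specific and foundation models) encode facial identity. By analysing the geometric structure of their embedding spaces, it finds that simple affine transformations can align face representations across models and substantially improve cross-model face recognition.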

Details Motivation: Investigate whether different DNN models (domain-specific and foundation models) encode facial identity in similar ways despite differing in training data, loss functions, and architectures. Method: Treat face image embeddings as point clouds, analyse the geometric structure of the embedding spaces of different DNN models, and test whether simple affine transformations can align the face representations of one model with another. Result: Low-capacity linear mappings substantially improve cross-model face recognition (identification and verification); alignment patterns generalize across datasets and vary systematically across model families, indicating convergence in facial identity representations. Conclusion: Different DNN models converge in how they encode facial identity, which supports cross-model alignment and has implications for model interoperability, ensemble design, and biometric template security. Abstract: Automated face recognition has made rapid strides over the past decade due to the unprecedented rise of deep neural network (DNN) models that can be trained for domain-specific tasks. At the same time, foundation models that are pretrained on broad vision or vision-language tasks have shown impressive generalization across diverse domains, including biometrics. This raises an important question: Do different DNN models, both domain-specific and foundation models, encode facial identity in similar ways, despite being trained on different datasets, loss functions, and architectures? In this regard, we directly analyze the geometric structure of embedding spaces imputed by different DNN models. Treating embeddings of face images as point clouds, we study whether simple affine transformations can align face representations of one model with another. Our findings reveal surprising cross-model compatibility: low-capacity linear mappings substantially improve cross-model face recognition over unaligned baselines for both face identification and verification tasks. Alignment patterns generalize across datasets and vary systematically across model families, indicating representational convergence in facial identity encoding. These findings have implications for model interoperability, ensemble design, and biometric template security.

[204] Region-Graph Optimal Transport Routing for Mixture-of-Experts Whole-Slide Image Classification

Xin Tian,Jiuliu Lu,Ephraim Tsalik,Bart Wanders,Colleen Knoth,Julian Knight

Main category: cs.CV

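TL;DR: This paper proposes ROAM, a spatially aware mixture-of-experts (MoE) aggregator for multiple instance learning (MIL). It routes region tokens to experts in a balanced way via capacity-constrained entropic optimal transport and adds graph regularisation for spatial coherence, achieving strong generalization on WSI classification.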

Details Motivation: Existing MIL methods send all instances through a shared pathway and adapt poorly to the high heterogeneity within a pathology slide, while the unconstrained softmax routing of conventional MoE tends to produce severely imbalanced expert utilisation and collapse to a near-single-pathway solution. Method: ROAM compresses dense image patches into spatial region tokens and routes at the region level; it uses entropic optimal transport (the Sinkhorn algorithm) with per-slide capacity constraints for balanced region-to-expert assignment; and it adds Sinkhorn iterations regularised over the spatial region graph so that neighbouring regions tend to route to the same expert. Result: On four WSI benchmarks with frozen foundation-model patch embeddings, ROAM is competitive with strong MIL and MoE baselines; on the NSCLC cross-dataset generalisation task (TCGA-CPTAC) it reaches an external AUC of 0.845 ± 0.019. Conclusion: Through structured, capacity-constrained, spatially aware optimal-transport routing, ROAM mitigates expert imbalance for MoE in WSI analysis and improves expressiveness and generalization, offering a more robust MIL aggregation paradigm for computational pathology. Abstract: Multiple Instance Learning (MIL) is the dominant framework for gigapixel whole-slide image (WSI) classification in computational pathology. However, current MIL aggregators route all instances through a shared pathway, constraining their capacity to specialise across the pathological heterogeneity inherent in each slide. Mixture-of-Experts (MoE) methods offer a natural remedy by partitioning instances across specialised expert subnetworks; yet unconstrained softmax routing may yield highly imbalanced utilisation, where one or a few experts absorb most routing mass, collapsing the mixture back to a near-single-pathway solution. To address these limitations, we propose ROAM (Region-graph OptimAl-transport Mixture-of-experts), a spatially aware MoE-MIL aggregator that routes region tokens to expert poolers via capacity-constrained entropic optimal transport, promoting balanced expert utilisation by construction. ROAM operates on spatial region tokens, obtained by compressing dense patch bags into spatially binned units that align routing with local tissue neighbourhoods, and introduces two key mechanisms: (i) region-to-expert assignment formulated as entropic optimal transport (Sinkhorn) with explicit per-slide capacity marginals, enforcing balanced expert utilisation without auxiliary load-balancing losses; and (ii) graph-regularised Sinkhorn iterations that diffuse routing assignments over the spatial region graph, encouraging neighbouring regions to coherently route to the same experts.
Evaluated on four WSI benchmarks with frozen foundation-model patch embeddings, ROAM achieves performance competitive with strong MIL and MoE baselines, and on NSCLC generalisation (TCGA-CPTAC) reaches an external AUC of 0.845 ± 0.019.
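
The capacity-constrained routing described above can be sketched with plain Sinkhorn iterations. The example below is a minimal illustration, not ROAM's implementation: the function name `sinkhorn_routing`, the uniform expert capacities, the `epsilon` value, and the toy affinity scores are assumptions, and the graph-regularisation step over the spatial region graph is omitted.

```python
import math


def sinkhorn_routing(scores, epsilon=0.5, num_iters=200):
    """Balanced soft assignment of regions to experts via Sinkhorn iterations.

    scores[i][j]: affinity of region i for expert j (higher is better).
    Row marginals: each of the n regions carries mass 1/n.
    Column marginals: each of the m experts has capacity 1/m
    (uniform capacities are an assumption of this sketch).
    Returns the transport plan as a list of rows.
    """
    n, m = len(scores), len(scores[0])
    row_mass, col_mass = 1.0 / n, 1.0 / m
    # Gibbs kernel of the entropic OT problem with cost = -score.
    kernel = [[math.exp(s / epsilon) for s in row] for row in scores]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(num_iters):
        # Rescale rows, then columns, to match the two marginals.
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(m))
             for i in range(n)]
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy case: every region prefers expert 0, so softmax routing would collapse.
scores = [[2.0, 0.0], [2.0, 0.1], [2.0, 0.0], [1.9, 0.0]]
plan = sinkhorn_routing(scores)
expert_load = [sum(row[j] for row in plan) for j in range(2)]
region_mass = [sum(row) for row in plan]
```

Even though all four toy regions score expert 0 highest, the column marginal forces each expert to receive half of the total routing mass, which is the balancing effect the abstract attributes to the capacity constraint.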

[205] Beyond Loss Values: Robust Dynamic Pruning via Loss Trajectory Alignment

Huaiyuan Qin,Muli Yang,Gabriel James Goenawan,Kai Wang,Zheng Wang,Peng Hu,Xi Peng,Hongyuan Zhu

Main category: cs.CV

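TL;DR: This paper proposes AlignPrune, a noise-robust dynamic data pruning method based on the Dynamic Alignment Score (DAS), which works as a plug-and-play module that improves existing pruning frameworks under label noise.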

Details Motivation: Existing dynamic data pruning methods rank samples by per-sample loss and, under label noise, tend to retain high-loss noisy samples by mistake, causing a marked drop in performance. Method: Proposes the Dynamic Alignment Score (DAS), a loss-trajectory-based ranking criterion, and builds the lightweight, plug-and-play AlignPrune module, which requires no change to the model architecture or training pipeline. Result: Across five benchmark datasets, multiple noise types, and multiple pruning ratios, AlignPrune improves accuracy by up to 6.3% over SOTA methods. Conclusion: AlignPrune provides a general, effective, and scalable solution for dynamic pruning on noisy data and encourages further research on robust learning in real-world settings. Abstract: Existing dynamic data pruning methods often fail under noisy-label settings, as they typically rely on per-sample loss as the ranking criterion. This could mistakenly lead to preserving noisy samples due to their high loss values, resulting in a significant performance drop. To address this, we propose AlignPrune, a noise-robust module designed to enhance the reliability of dynamic pruning under label noise. Specifically, AlignPrune introduces the Dynamic Alignment Score (DAS), which is a loss-trajectory-based criterion that enables more accurate identification of noisy samples, thereby improving pruning effectiveness. As a simple yet effective plug-and-play module, AlignPrune can be seamlessly integrated into state-of-the-art dynamic pruning frameworks, consistently outperforming them without modifying either the model architecture or the training pipeline. Extensive experiments on five widely-used benchmarks across various noise types and pruning ratios demonstrate the effectiveness of AlignPrune, boosting accuracy by up to 6.3% over state-of-the-art baselines. Our results offer a generalizable solution for pruning under noisy data, encouraging further exploration of learning in real-world scenarios. Code is available at: https://github.com/leonqin430/AlignPrune.

[206] Distilling Photon-Counting CT into Routine Chest CT through Clinically Validated Degradation Modeling

Junqi Liu,Xinze Zhou,Wenxuan Li,Scott Ye,Arkadiusz Sitek,Xiaofeng Yang,Yucheng Tang,Daguang Xu,Kai Ding,Kang Wang,Yang Yang,Alan L. Yuille,Zongwei Zhou

Main category: cs.CV

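TL;DR: This paper proposes SUMI, a method that simulates realistic acquisition degradation to enhance low-quality energy-integrating CT (EICT) to photon-counting CT (PCCT) quality without large-scale paired data, markedly improving image quality, clinical utility, and lesion detection performance.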

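Details Motivation: Photon-counting CT (PCCT) offers superior performance but has limited clinical availability; a way is needed to improve routine EICT image quality using a small amount of high-quality PCCT data, without relying on large numbers of paired scans. Method: Proposes SUMI: (1) explicitly model and simulate the degradation from PCCT to clinically plausible lower-quality EICT; (2) train a latent diffusion model on 1,046 PCCTs, using an autoencoder pre-trained on those 1,046 PCCTs and about 405,000 EICTs to extract general CT latent features; (3) build a dataset of 17,316 enhanced EICTs with radiologist-validated voxel-wise anatomical annotations. Result: On external data, SUMI outperforms SOTA image translation methods by 15% in SSIM and 20% in PSNR; radiologist-rated clinical utility improves; lesion detection sensitivity rises by up to 15% and F1 score by up to 10%. Conclusion: Emerging high-end imaging technologies such as PCCT can be systematically "distilled" into routine equipment such as EICT using limited high-quality scans as reference, broadening their clinical reach. Abstract: Photon-counting CT (PCCT) provides superior image quality with higher spatial resolution and lower noise compared to conventional energy-integrating CT (EICT), but its limited clinical availability restricts large-scale research and clinical deployment. To bridge this gap, we propose SUMI, a simulated degradation-to-enhancement method that learns to reverse realistic acquisition artifacts in low-quality EICT by leveraging high-quality PCCT as reference. Our central insight is to explicitly model realistic acquisition degradations, transforming PCCT into clinically plausible lower-quality counterparts and learning to invert this process. The simulated degradations were validated for clinical realism by board-certified radiologists, enabling faithful supervision without requiring paired acquisitions at scale.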
As outcomes of this technical contribution, we: (1) train a latent diffusion model on 1,046 PCCTs, using an autoencoder first pre-trained on both these PCCTs and 405,379 EICTs from 145 hospitals to extract general CT latent features that we release for reuse in other generative medical imaging tasks; (2) construct a large-scale dataset of over 17,316 publicly available EICTs enhanced to PCCT-like quality, with radiologist-validated voxel-wise annotations of airway trees, arteries, veins, lungs, and lobes; and (3) demonstrate substantial improvements: across external data, SUMI outperforms state-of-the-art image translation methods by 15% in SSIM and 20% in PSNR, improves radiologist-rated clinical utility in reader studies, and enhances downstream top-ranking lesion detection performance, increasing sensitivity by up to 15% and F1 score by up to 10%. Our results suggest that emerging imaging advances can be systematically distilled into routine EICT using limited high-quality scans as reference.
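
For readers unfamiliar with the image-quality metrics quoted above, the sketch below computes PSNR between a reference and an enhanced image given as flat lists of intensities. It is a generic textbook definition, not SUMI's evaluation code; the function name `psnr`, the `data_range` default of 1.0, and the toy intensities are assumptions, and the abstract does not say how its percentage improvements were computed.

```python
import math


def psnr(reference, enhanced, data_range=1.0):
    """Peak signal-to-noise ratio in decibels.

    reference, enhanced: equal-length sequences of pixel/voxel intensities.
    data_range: difference between the largest and smallest possible value
    (1.0 assumes intensities normalised to [0, 1]).
    """
    mse = sum((r - e) ** 2 for r, e in zip(reference, enhanced)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / mse)


# Toy example: a uniform error of 0.1 on a [0, 1] scale gives about 20 dB.
value = psnr([0.0, 1.0], [0.1, 0.9])
```

Because PSNR is logarithmic, a relative gain such as "20% in PSNR" is not the same as a 20% reduction in mean squared error, so the scale in use matters when comparing reported numbers.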

[207] From Blobs to Spokes: High-Fidelity Surface Reconstruction via Oriented Gaussians

Diego Gomez,Antoine Guédon,Nissim Maruani,Bingchen Gong,Maks Ovsjanikov

Main category: cs.CV

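TL;DR: This paper proposes Gaussian Wrapping, which equips 3D Gaussian Splatting (3DGS) with learnable oriented normals and a continuous occupancy field, enabling high-quality, watertight, lightweight 3D surface reconstruction and reaching SOTA on the DTU and Tanks and Temples datasets.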

Details Motivation: 3D Gaussian Splatting lacks a global geometric field, which makes surface extraction difficult; existing methods rely on heuristics such as TSDF fusion and struggle to produce accurate, watertight meshes. Method: Introduce a learnable oriented normal for each Gaussian and an adapted attenuation formulation, yielding closed-form expressions for the normal and occupancy fields at arbitrary locations in space; design a consistency loss and a dedicated densification strategy to close geometric holes; modify the differentiable rasterizer to output isosurface depth; and propose Primal Adaptive Meshing for region-of-interest meshing at arbitrary resolution. Result: Sets a new SOTA on DTU and Tanks and Temples, producing complete, watertight meshes far smaller than those of concurrent work and recovering thin structures such as bicycle spokes; also exposes fundamental biases in standard surface evaluation protocols and proposes more rigorous alternatives. Conclusion: Gaussian Wrapping provides a principled, differentiable, geometrically consistent surface modelling framework for 3DGS, markedly improving reconstruction quality and robustness and extending Gaussian Splatting from rendering toward 3D reconstruction. Abstract: 3D Gaussian Splatting (3DGS) has revolutionized fast novel view synthesis, yet its opacity-based formulation makes surface extraction fundamentally difficult. Unlike implicit methods built on Signed Distance Fields or occupancy, 3DGS lacks a global geometric field, forcing existing approaches to resort to heuristics such as TSDF fusion of blended depth maps. Inspired by the Objects as Volumes framework, we derive a principled occupancy field for Gaussian Splatting and show how it can be used to extract highly accurate watertight meshes of complex scenes. Our key contribution is to introduce a learnable oriented normal at each Gaussian element and to define an adapted attenuation formulation, which leads to closed-form expressions for both the normal and occupancy fields at arbitrary locations in space. We further introduce a novel consistency loss and a dedicated densification strategy to enforce Gaussians to wrap the entire surface by closing geometric holes, ensuring a complete shell of oriented primitives. We modify the differentiable rasterizer to output depth as an isosurface of our continuous model, and introduce Primal Adaptive Meshing for Region-of-Interest meshing at arbitrary resolution. We additionally expose fundamental biases in standard surface evaluation protocols and propose two more rigorous alternatives.
Overall, our method Gaussian Wrapping sets a new state-of-the-art on DTU and Tanks and Temples, producing complete, watertight meshes at a fraction of the size of concurrent work-recovering thin structures such as the notoriously elusive bicycle spokes.

[208] Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images

Yuechen Jiang,Enze Zhang,Md Mohsinul Kabir,Qianqian Xie,Stavroula Golfomitsou,Konstantinos Arvanitis,Sophia Ananiadou

Main category: cs.CV

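TL;DR: This paper introduces a multi-category, cross-cultural benchmark for inferring structured cultural metadata (e.g., creator, origin, period) from images, and uses an LLM-as-Judge framework to evaluate vision-language models (VLMs) on cultural reasoning, finding that current VLMs have significant limitations on this task.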

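Details Motivation: Vision-language models have made progress on image captioning, but accurately inferring structured cultural metadata (e.g., creator, origin, period) from images remains underexplored. Method: Builds a multi-category, cross-cultural benchmark for structured cultural metadata inference and uses an LLM-as-Judge framework to measure the semantic alignment between model outputs and reference annotations; cultural reasoning across cultural regions is assessed with exact-match, partial-match, and attribute-level accuracy. Result: Models capture only fragmented visual signals, performance varies substantially across cultural regions and metadata types, and predictions are inconsistent and weakly grounded. Conclusion: Current VLMs fall clearly short in structured cultural metadata inference beyond basic visual perception; deeper cultural-semantic modeling and evaluation methods are needed. Abstract: Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.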
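The three reported metrics can be illustrated with a small sketch. The abstract does not define them precisely, so the definitions here are assumptions: exact match means every attribute is correct, partial match means at least one is, and attribute-level accuracy is computed per attribute. Normalized string equality stands in for the LLM-as-Judge semantic comparison, and the records are made-up placeholders, not benchmark data.

```python
from typing import Dict, List

# Attribute names follow the examples in the abstract; the benchmark's real schema may differ.
ATTRIBUTES = ("creator", "origin", "period")


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(value.lower().split())


def score(predictions: List[Dict[str, str]], references: List[Dict[str, str]]) -> Dict[str, object]:
    """Return exact-match, partial-match, and per-attribute accuracy over paired records."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and of equal length")
    exact = 0
    partial = 0
    attr_hits = {name: 0 for name in ATTRIBUTES}
    for pred, ref in zip(predictions, references):
        # Stand-in for the judge: an attribute is correct only on normalized string equality.
        hits = [normalize(pred.get(name, "")) == normalize(ref[name]) for name in ATTRIBUTES]
        exact += int(all(hits))    # assumed definition: all attributes correct
        partial += int(any(hits))  # assumed definition: at least one attribute correct
        for name, hit in zip(ATTRIBUTES, hits):
            attr_hits[name] += int(hit)
    n = len(references)
    return {
        "exact_match": exact / n,
        "partial_match": partial / n,
        "attribute_accuracy": {name: attr_hits[name] / n for name in ATTRIBUTES},
    }


# Placeholder records for illustration only.
references = [
    {"creator": "Artist A", "origin": "Region X", "period": "Period 1"},
    {"creator": "Artist B", "origin": "Region Y", "period": "Period 2"},
]
predictions = [
    {"creator": "artist  a", "origin": "Region X", "period": "period 1"},
    {"creator": "Artist C", "origin": "Region Y", "period": "Period 3"},
]
result = score(predictions, references)
print(result)
```

Reporting the per-attribute breakdown alongside the aggregate scores is what makes variation across metadata types visible; grouping the same computation by cultural region would expose the cross-cultural variation in the same way.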

[209] TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

Teng Li,Ziyuan Huang,Cong Chen,Yangfu Li,Yuanhuiyi Lyu,Dandan Zheng,Chunhua Shen,Jun Zhang

Main category: cs.CV

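TL;DR: TC-AE is a ViT-based deep compression autoencoder architecture that mitigates latent representation collapse under high compression ratios by optimizing the token space (rather than increasing the channel count), improving reconstruction and generative performance.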

Details Motivation: Existing methods preserve reconstruction quality under high compression ratios by increasing the channel count of the latent representation, but this tends to cause latent representation collapse and hurts generative performance; a more effective compression path is needed from the token-space perspective. Method: Two innovations: (1) decomposing token-to-latent compression into two stages, which reduces structural information loss and supports effective token number scaling; (2) strengthening the semantic structure of image tokens through joint self-supervised training, yielding more generation-friendly latents. Result: TC-AE substantially improves reconstruction quality and generative performance under deep compression. Conclusion: Optimizing ViT-based compression autoencoders from the token space is a feasible and effective path that may advance ViT tokenizers for visual generation. Abstract: We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the channel number of latent representations to maintain reconstruction quality under high compression ratios. However, this strategy often leads to latent representation collapse, which degrades generative performance. Instead of relying on increasingly complex architectures or multi-stage training schemes, TC-AE addresses this challenge from the perspective of the token space, the key bridge between pixels and image latents, through two complementary innovations: Firstly, we study token number scaling by adjusting the patch size in ViT under a fixed latent budget, and identify aggressive token-to-latent compression as the key factor that limits effective scaling. To address this issue, we decompose token-to-latent compression into two stages, reducing structural information loss and enabling effective token number scaling for generation. Secondly, to further mitigate latent representation collapse, we enhance the semantic structure of image tokens via joint self-supervised training, leading to more generative-friendly latents. With these designs, TC-AE achieves substantially improved reconstruction and generative performance under deep compression. We hope our research will advance ViT-based tokenizers for visual generation.
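A short arithmetic sketch shows why shrinking the patch size under a fixed latent budget makes token-to-latent compression more aggressive. The image size, patch sizes, and latent count below are hypothetical values chosen for illustration, and treating the latent budget as a fixed number of latent positions is an assumption; none of these come from the paper.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of ViT tokens for a square image split into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side * side


def token_to_latent_ratio(image_size: int, patch_size: int, num_latents: int) -> float:
    """How many tokens must be squeezed into each latent position (assumed budget definition)."""
    return num_tokens(image_size, patch_size) / num_latents


IMAGE_SIZE = 256   # hypothetical input resolution
NUM_LATENTS = 64   # hypothetical fixed latent budget

for patch in (16, 8, 4):
    tokens = num_tokens(IMAGE_SIZE, patch)
    ratio = token_to_latent_ratio(IMAGE_SIZE, patch, NUM_LATENTS)
    print(f"patch={patch:2d} tokens={tokens:4d} tokens per latent={ratio:.0f}")
```

Under these assumed numbers, halving the patch size quadruples the token count, so the same latent budget has to absorb four times as many tokens at each step. This is the pressure that the abstract's two-stage decomposition is meant to relieve.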

[210] MoRight: Motion Control Done Right

Shaowei Liu,Xuanchi Ren,Tianchang Shen,Huan Ling,Saurabh Gupta,Shenlong Wang,Sanja Fidler,Jun Gao

Main category: cs.CV

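TL;DR: This paper proposes MoRight, a framework for motion-controlled video generation based on disentangled motion modeling; it supports independent control of object motion and camera viewpoint, models motion causality (active driving versus passive response), and supports both forward and inverse reasoning.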

Details Motivation: Existing methods cannot disentangle camera and object motion and ignore causal relationships between motions, making motion control inflexible and physical reactions unrealistic. Method: MoRight uses temporal cross-view attention to transfer object motion specified in a canonical static view to an arbitrary camera viewpoint, disentangling camera and object motion; it further decomposes motion into active (user-driven) and passive (consequence) components, learns motion causality from data, and supports forward and inverse reasoning. Result: Achieves state-of-the-art performance in generation quality, motion controllability, and interaction awareness on three benchmarks. Conclusion: MoRight jointly addresses disentangled motion control and causal modeling, offering a new paradigm for physically plausible, user-controllable dynamic-scene video generation. Abstract: Generating motion-controlled videos, where user-specified actions drive physically plausible scene dynamics under freely chosen viewpoints, demands two capabilities: (1) disentangled motion control, allowing users to separately control the object motion and adjust camera viewpoint; and (2) motion causality, ensuring that user-driven actions trigger coherent reactions from other objects rather than merely displacing pixels. Existing methods fall short on both fronts: they entangle camera and object motion into a single tracking signal and treat motion as kinematic displacement without modeling causal relationships between object motion. We introduce MoRight, a unified framework that addresses both limitations through disentangled motion modeling. Object motion is specified in a canonical static view and transferred to an arbitrary target camera viewpoint via temporal cross-view attention, enabling disentangled camera and object control. We further decompose motion into active (user-driven) and passive (consequence) components, training the model to learn motion causality from data. At inference, users can either supply active motion and MoRight predicts consequences (forward reasoning), or specify desired passive outcomes and MoRight recovers plausible driving actions (inverse reasoning), all while freely adjusting the camera viewpoint. Experiments on three benchmarks demonstrate state-of-the-art performance in generation quality, motion controllability, and interaction awareness.

[211] Fast Spatial Memory with Elastic Test-Time Training

Ziqiao Ma,Xueyang Yu,Haoyu Zhen,Yuncong Yang,Joyce Chai,Chuang Gan

Main category: cs.CV

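TL;DR: This paper proposes Elastic Test-Time Training and Fast Spatial Memory (FSM), which improve on LaCT to address catastrophic forgetting, overfitting, and memory bottlenecks in long-sequence 3D/4D reconstruction, enabling more robust multi-chunk adaptation.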

Details Motivation: LaCT performs well on long-context 3D reconstruction, but its fully plastic test-time updates are prone to catastrophic forgetting and overfitting, and it is limited to a single large input chunk, which makes arbitrarily long sequences hard to handle. Method: Proposes Elastic Test-Time Training, inspired by elastic weight consolidation (EWC), which stabilizes fast-weight updates with a Fisher-weighted elastic prior and maintains the anchor state as an exponential moving average; on this basis builds the Fast Spatial Memory (FSM) model for efficient, scalable 4D reconstruction. Result: FSM is pre-trained on large-scale 3D/4D data; experiments show fast adaptation with smaller chunks, improved 3D/4D reconstruction quality, mitigation of the camera-interpolation shortcut, and substantial relief of the activation-memory bottleneck. Conclusion: The work moves LaCT from the bounded single-chunk setting toward robust multi-chunk adaptation, a key step toward generalization to genuinely long sequences. Abstract: Large Chunk Test-Time Training (LaCT) has shown strong performance on long-context 3D reconstruction, but its fully plastic inference-time updates remain vulnerable to catastrophic forgetting and overfitting. As a result, LaCT is typically instantiated with a single large chunk spanning the full input sequence, falling short of the broader goal of handling arbitrarily long sequences in a single pass. We propose Elastic Test-Time Training, inspired by elastic weight consolidation, which stabilizes LaCT fast-weight updates with a Fisher-weighted elastic prior around a maintained anchor state. The anchor evolves as an exponential moving average of past fast weights to balance stability and plasticity. Based on this updated architecture, we introduce Fast Spatial Memory (FSM), an efficient and scalable model for 4D reconstruction that learns spatiotemporal representations from long observation sequences and renders novel view-time combinations. We pre-trained FSM on large-scale curated 3D/4D data to capture the dynamics and semantics of complex spatial environments. Extensive experiments show that FSM supports fast adaptation over long sequences and delivers high-quality 3D/4D reconstruction with smaller chunks while mitigating the camera-interpolation shortcut. Overall, we hope to advance LaCT beyond the bounded single-chunk setting toward robust multi-chunk adaptation, a necessary step for generalization to genuinely longer sequences, while substantially alleviating the activation-memory bottleneck.
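The idea of a Fisher-weighted elastic prior around an exponential-moving-average anchor can be sketched in a few lines. This is a generic EWC-style update written under stated assumptions, not the paper's actual rule: the penalty is taken to be 0.5 * lam * sum_i F_i * (w_i - a_i)^2, the Fisher values are supplied by the caller, and the function name `elastic_step` and all hyperparameters are hypothetical.

```python
from typing import List, Tuple


def elastic_step(
    w: List[float],
    anchor: List[float],
    fisher: List[float],
    grad: List[float],
    lr: float = 0.1,
    lam: float = 1.0,
    beta: float = 0.9,
) -> Tuple[List[float], List[float]]:
    """One fast-weight update with an elastic pull toward the anchor, then an EMA anchor update.

    The penalty gradient lam * F_i * (w_i - a_i) comes from the assumed quadratic
    prior 0.5 * lam * sum_i F_i * (w_i - a_i)^2.
    """
    new_w = [
        wi - lr * (gi + lam * fi * (wi - ai))
        for wi, ai, fi, gi in zip(w, anchor, fisher, grad)
    ]
    # The anchor tracks an exponential moving average of the fast weights.
    new_anchor = [beta * ai + (1.0 - beta) * wi for ai, wi in zip(anchor, new_w)]
    return new_w, new_anchor


# Toy example: the task gradient is zero, so only the elastic prior acts.
# Coordinate 0 has zero Fisher weight (fully plastic); coordinate 1 has a high one.
weights, anchor_state = elastic_step(
    w=[1.0, 1.0], anchor=[0.0, 0.0], fisher=[0.0, 10.0], grad=[0.0, 0.0]
)
print(weights, anchor_state)
```

In this toy setting the coordinate with high Fisher weight is pulled back to the anchor while the zero-Fisher coordinate stays where it was. That is the stability-plasticity trade-off the abstract describes, and `beta` controls how quickly the anchor itself follows the fast weights.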