Skip to content

Table of Contents

cs.CL [Back]

[1] Linguistic Blind Spots in Clinical Decision Extraction

Mohamed Elgaar,Hadi Amiri

Main category: cs.CL

TL;DR: 本文研究了临床决策在不同类别中的语言特征差异及其对提取效果的影响,发现叙事风格的决策(如建议和预防措施)在精确匹配下召回率较低,提示下游系统应采用边界容忍的评估和提取策略。

Details Motivation: 从临床笔记中提取医疗决策是临床决策支持和患者护理摘要的关键步骤,但不同类别的临床决策在语言特征上存在差异,这些差异可能导致提取失败。 Method: 使用基于DICTUM分类法标注的MedDec出院摘要,计算每个决策片段的七个语言指标,并分析标准Transformer模型在片段级别上的提取召回率。 Result: 精确匹配召回率为48%,其中停用词比例最低和最高组的召回率分别为58%和24%;含模糊化或否定线索的片段更难被正确提取;宽松重叠匹配下召回率提升至71%,表明多数错误源于边界分歧而非完全遗漏。 Conclusion: 叙事风格的决策片段(常见于建议和预防措施类)在精确匹配下是一致的盲点,下游系统应采用边界容忍的评估与提取策略。 Abstract: Extracting medical decisions from clinical notes is a key step for clinical decision support and patient-facing care summaries. We study how the linguistic characteristics of clinical decisions vary across decision categories and whether these differences explain extraction failures. Using MedDec discharge summaries annotated with decision categories from the Decision Identification and Classification Taxonomy for Use in Medicine (DICTUM), we compute seven linguistic indices for each decision span and analyze span-level extraction recall of a standard transformer model. We find clear category-specific signatures: drug-related and problem-defining decisions are entity-dense and telegraphic, whereas advice and precaution decisions contain more narrative, with higher stopword and pronoun proportions and more frequent hedging and negation cues. On the validation split, exact-match recall is 48%, with large gaps across linguistic strata: recall drops from 58% to 24% from the lowest to highest stopword-proportion bins, and spans containing hedging or negation cues are less likely to be recovered. Under a relaxed overlap-based match criterion, recall increases to 71%, indicating that many errors are span boundary disagreements rather than complete misses. Overall, narrative-style spans--common in advice and precaution decisions--are a consistent blind spot under exact matching, suggesting that downstream systems should incorporate boundary-tolerant evaluation and extraction strategies for clinical decisions.

[2] Automatic Classification of Pedagogical Materials against CS Curriculum Guidelines

Erik Saule,Kalpathi Subramanian,Razvan Bunescu

Main category: cs.CL

TL;DR: 本文提出使用自然语言处理(NLP)技术,包括传统NLP工具和大语言模型(LLM),自动评估计算机科学课程对ACM/IEEE课程指南的覆盖程度,以减轻人工审核负担。

Details Motivation: ACM/IEEE课程指南内容繁杂(含数千项),人工审核每门课程平均需耗时一天,效率低、认知负荷高。 Method: 探索两类NLP技术:一是基于传统NLP工具(解析、词性标注、词嵌入)的方法;二是基于大语言模型(LLM)的方法,并在教学材料语料库上进行分类任务评估。 Result: 实验表明,两类NLP技术均能对教学文档进行有意义的自动分类,验证了其在课程指南覆盖度评估中的可行性。 Conclusion: NLP技术(尤其是LLM)可有效加速课程与标准对齐的审计过程,为程序管理者提供实用、可扩展的自动化支持。 Abstract: Professional societies often publish curriculum guidelines to help programs align their content to international standards. In Computer Science, the primary standard is published by ACM and IEEE and provide detailed guidelines for what should be and could be included in a Computer Science program. While very helpful, it remains difficult for program administrators to assess how much of the guidelines is being covered by a CS program. This is in particular due to the extensiveness of the guidelines, containing thousands of individual items. As such, it is time consuming and cognitively demanding to audit every course to confidently mark everything that is actually being covered. Our preliminary work indicated that it takes about a day of work per course. In this work, we propose using Natural Language Processing techniques to accelerate the process. We explore two kinds of techniques, the first relying on traditional tools for parsing, tagging, and embeddings, while the second leverages the power of Large Language Models. We evaluate the application of these techniques to classify a corpus of pedagogical materials and show that we can meaningfully classify documents automatically.

[3] Likelihood-Based Reward Designs for General LLM Reasoning

Ariel Kwiatkowski,Natasha Butt,Ismail Labiad,Julia Kempe,Yann Ollivier

Main category: cs.CL

TL;DR: 本文系统研究了基于概率或对数概率的奖励函数在大语言模型推理微调中的应用,发现使用参考答案的对数概率作为链式思维(CoT)学习的奖励,在可验证与不可验证设置下均表现优异,且与预训练目标一致。

Details Motivation: 解决强化学习微调大语言模型时依赖人工设计的稀疏二元奖励所带来的局限性,探索无需外部验证器、可大规模获取的替代奖励形式。 Method: 系统比较多种基于似然(概率/对数概率)的奖励变体与标准二元奖励基线,在数学推理基准和无外部验证器的长文本生成任务上进行评估。 Result: 仅参考答案的对数概率奖励在所有设置中均表现良好:在可验证场景下成功率媲美或优于二元奖励且困惑度显著更低;在不可验证场景下性能与监督微调(SFT)相当;而基于概率的方法(如VeriFree)在不可验证场景下因正确答案概率趋零而失效。 Conclusion: 对数概率奖励是一种通用、有效且与预训练目标一致的链式思维微调方法,能统一处理短答案(可验证)与长答案(不可验证)两类任务。 Abstract: Fine-tuning large language models (LLMs) on reasoning benchmarks via reinforcement learning requires a specific reward function, often binary, for each benchmark. This comes with two potential limitations: the need to design the reward, and the potentially sparse nature of binary rewards. Here, we systematically investigate rewards derived from the probability or log-probability of emitting the reference answer (or any other prompt continuation present in the data), which have the advantage of not relying on specific verifiers and being available at scale. Several recent works have advocated for the use of similar rewards (e.g., VeriFree, JEPO, RLPR, NOVER). We systematically compare variants of likelihood-based rewards with standard baselines, testing performance both on standard mathematical reasoning benchmarks, and on long-form answers where no external verifier is available. We find that using the log-probability of the reference answer as the reward for chain-of-thought (CoT) learning is the only option that performs well in all setups. This reward is also consistent with the next-token log-likelihood loss used during pretraining. In verifiable settings, log-probability rewards bring comparable or better success rates than reinforcing with standard binary rewards, and yield much better perplexity. In non-verifiable settings, they perform on par with SFT. On the other hand, methods based on probability, such as VeriFree, flatline on non-verifiable settings due to vanishing probabilities of getting the correct answer. Overall, this establishes log-probability rewards as a viable method for CoT fine-tuning, bridging the short, verifiable and long, non-verifiable answer settings.

[4] Transformers perform adaptive partial pooling

Vsevolod Kapatsinski

Main category: cs.CL

TL;DR: 本文研究了Transformer模型(GPT-2)在训练过程中如何对不同频率和相似性的上下文进行证据聚合(adaptive partial pooling),发现其行为类似于分层回归:上下文越稀有、越相似,模型越倾向于借用其他上下文的信息;且这种‘池化’效应随训练逐步减弱。该现象在认知与统计上均具合理性。

Details Motivation: 探究语言模型(尤其是Transformer)如何在非新颖但低频的上下文中进行泛化,即是否以及如何借鉴相似上下文的经验,以理解其归纳偏置是否符合人类认知和统计建模原则(如分层回归)。 Method: 分析GPT-2在训练各阶段对next-word预测中跨上下文证据依赖的变化,通过控制上下文频率、类型数量(type frequency)和行为变异性,量化其‘证据池化’程度,并与分层回归的预期行为进行对比。 Result: GPT-2的next-word预测随训练进行,逐渐减少对其他上下文信息的依赖(pooling减弱);且pooling程度受上下文频率、类型数量和变异性影响,趋势与分层回归一致。 Conclusion: Transformer模型的学习动态展现出类似分层回归的自适应部分池化特性,表明其泛化机制不仅适用于新颖情境,也合理处理低频情境,在理性(贝叶斯)和经验(心理语言学)层面均具现实性。 Abstract: Because language is creative, any reasonable language model must generalize, deciding what to say in novel contexts by using information from similar contexts. But what about contexts that are not novel but merely infrequent? In hierarchical regression, the model's predictions for behavior in a context are affected by observations from other similar contexts to the extent that 1) the current context is infrequent and 2) different contexts behave similarly. This is called adaptive partial pooling of evidence. This paper shows that next-word predictions of a transformer (GPT2) are increasingly unaffected by observations from outside the current context across epochs of training (the amount of pooling reduces with training), and that the extent of pooling is affected by context frequency, context number (type frequency) and context variability in a similar way to hierarchical regression. These characteristics of learning in transformers are argued to be realistic on both rational and empirical grounds.

[5] On the Credibility of Evaluating LLMs using Survey Questions

Jindřich Libovický

Main category: cs.CL

TL;DR: 本文指出当前评估大语言模型价值取向的方法存在缺陷,发现提示方式和解码策略显著影响结果,并提出新指标‘自相关距离’以衡量模型回答间的一致性结构,强调需结合多种指标进行鲁棒评估。

Details Motivation: 现有基于社会调查改编的LLM价值取向评估方法存在方法论局限,可能导致对模型与人类价值相似性的高估或低估,亟需更严谨的评估框架。 Method: 在三种语言、五个国家的World Value Survey数据上,系统比较直接提示与思维链(CoT)提示、贪心解码与采样解码的影响;提出新指标‘自相关距离’来量化模型回答间的结构性一致性,并分析均方距离与KL散度的相关性。 Result: 提示方式和解码策略显著影响评估结果;LLM即使在平均层面与人类响应高度一致,其内部响应结构(如跨问题一致性)也可能严重偏离人类;均方距离与KL散度相关性弱,说明二者隐含的独立性假设不成立。 Conclusion: 应摒弃单一指标评估,推荐采用CoT提示、多样本采样解码,并引入自相关距离等能反映响应结构的复合评估体系。 Abstract: Recent studies evaluate the value orientation of large language models (LLMs) using adapted social surveys, typically by prompting models with survey questions and comparing their responses to average human responses. This paper identifies limitations in this methodology that, depending on the exact setup, can lead to both underestimating and overestimating the similarity of value orientation. Using the World Value Survey in three languages across five countries, we demonstrate that prompting methods (direct vs. chain-of-thought) and decoding strategies (greedy vs. sampling) significantly affect results. To assess the interaction between answers, we introduce a novel metric, self-correlation distance. This metric measures whether LLMs maintain consistent relationships between answers across different questions, as humans do. This indicates that even a high average agreement with human data, when considering LLM responses independently, does not guarantee structural alignment in responses. Additionally, we reveal a weak correlation between two common evaluation metrics, mean-squared distance and KL divergence, which assume that survey answers are independent of each other. For future research, we recommend CoT prompting, sampling-based decoding with dozens of samples, and robust analysis using multiple metrics, including self-correlation distance.

[6] Abstraction Induces the Brain Alignment of Language and Speech Models

Emily Cheng,Aditya R. Vaidya,Richard Antonello

Main category: cs.CL

TL;DR: 本文探讨了为何大型语言模型和语音模型的中间隐藏层比输出层更能预测大脑对自然语言刺激的反应,发现这与共享的意义抽象能力有关,而非单纯的下一个词预测能力。

Details Motivation: 研究旨在理解为什么中间层在预测大脑响应方面表现最佳,以及这种高预测性能背后的表征特性是什么。 Method: 通过分析模型各层的内在维度(intrinsic dimension)与fMRI和ECoG信号解释能力的关系,并考察预训练和微调过程对内在维度及语义内容的影响。 Result: 中间层具有较高的内在维度,与大脑预测能力高度相关;该关系在预训练过程中形成;微调以提升大脑预测能力会因果性地增加表示的内在维度和语义内容。 Conclusion: 语义丰富性、高内在维度和大脑可预测性相互关联,模型与大脑相似性的关键驱动力在于输入的丰富意义抽象,而语言建模任务的复杂性足以促成这一抽象过程。 Abstract: Research has repeatedly demonstrated that intermediate hidden states extracted from large language models and speech audio models predict measured brain response to natural language stimuli. Yet, very little is known about the representation properties that enable this high prediction performance. Why is it the intermediate layers, and not the output layers, that are most effective for this unique and highly general transfer task? We give evidence that the correspondence between speech and language models and the brain derives from shared meaning abstraction and not their next-word prediction properties. In particular, models construct higher-order linguistic features in their middle layers, cued by a peak in the layerwise intrinsic dimension, a measure of feature complexity. We show that a layer's intrinsic dimension strongly predicts how well it explains fMRI and ECoG signals; that the relation between intrinsic dimension and brain predictivity arises over model pre-training; and finetuning models to better predict the brain causally increases both representations' intrinsic dimension and their semantic content. Results suggest that semantic richness, high intrinsic dimension, and brain predictivity mirror each other, and that the key driver of model-brain similarity is rich meaning abstraction of the inputs, where language modeling is a task sufficiently complex (but perhaps not the only) to require it.

[7] Expert Selections In MoE Models Reveal (Almost) As Much As Text

Amir Nuriyev,Gabriel Kulp

Main category: cs.CL

TL;DR: 本文提出了一种针对混合专家(MoE)语言模型的文本重建攻击,仅通过专家选择信息即可高精度恢复原始输入token,揭示了MoE路由机制存在严重隐私泄露风险。

Details Motivation: MoE模型中每个token被路由到特定专家子网络,但这种路由决策可能泄露远超预期的敏感信息,现有研究对其隐私风险认识不足。 Method: 提出基于专家选择序列的文本重建攻击:先用3层MLP实现63.1% top-1重建准确率;再设计基于Transformer的序列解码器,在OpenWebText数据上对32-token序列达到91.2% top-1和94.8% top-10重建准确率;同时分析噪声注入等缓解措施的效果。 Result: Transformer解码器在32-token序列上实现91.2% top-1和94.8% top-10 token重建准确率;噪声可降低但无法消除重建能力;专家选择信息应被视为与原始文本同等敏感。 Conclusion: MoE模型的专家路由决策蕴含大量原始文本信息,构成实质性隐私威胁,需在实际部署中将其视为敏感数据加以保护。 Abstract: We present a text-reconstruction attack on mixture-of-experts (MoE) language models that recovers tokens from expert selections alone. In MoE models, each token is routed to a subset of expert subnetworks; we show these routing decisions leak substantially more information than previously understood. Prior work using logistic regression achieves limited reconstruction; we show that a 3-layer MLP improves this to 63.1% top-1 accuracy, and that a transformer-based sequence decoder recovers 91.2% of tokens top-1 (94.8% top-10) on 32-token sequences from OpenWebText after training on 100M tokens. These results connect MoE routing to the broader literature on embedding inversion. We outline practical leakage scenarios (e.g., distributed inference and side channels) and show that adding noise reduces but does not eliminate reconstruction. Our findings suggest that expert selections in MoE deployments should be treated as sensitive as the underlying text.

[8] DELTA: Deliberative Multi-Agent Reasoning with Reinforcement Learning for Multimodal Psychological Counseling

Jiangnan Yang,Junjie Chen,Fei Wang,Yiqi Nie,Yuxin Liu,Zhangling Duan,Jie Chen

Main category: cs.CL

TL;DR: 本文提出了DELTA框架,一种基于多模态信号的多智能体心理辅导系统,通过结构化推理过程(包括证据基础、心理状态抽象和响应生成)提升共情能力,并引入基于分布级情绪协调评分的强化学习来优化情绪协调响应。

Details Motivation: 现有基于语言模型的心理辅导系统大多仅依赖文本,隐式推断心理状态,缺乏对多模态线索(如视觉和语音)的有效整合,难以实现真正共情的交互。 Method: 提出DELTA——一种多智能体、多阶段推理框架,将辅导过程分解为证据 grounding、心理状态抽象与响应生成;融合多模态输入;并采用基于分布级Emotion Attunement Score的强化学习优化响应情感协调性。 Result: 在多模态心理辅导基准测试中,DELTA显著提升了辅导质量与情绪协调性;消融与定性分析表明,显式多模态推理与结构化心理状态表征具有互补增益。 Conclusion: 显式建模多模态信号与结构化心理状态是提升AI心理辅导共情能力的关键路径,DELTA为构建更具同理心的人机交互系统提供了新范式。 Abstract: Psychological counseling is a fundamentally multimodal cognitive process in which clinicians integrate verbal content with visual and vocal cues to infer clients' mental states and respond empathically. However, most existing language-model-based counseling systems operate on text alone and rely on implicit mental state inference. We introduce DELTA, a deliberative multi-agent framework that models counseling as a structured reasoning process over multimodal signals, separating evidence grounding, mental state abstraction, and response generation. DELTA further incorporates reinforcement learning guided by a distribution-level Emotion Attunement Score to encourage emotionally attuned responses. Experiments on a multimodal counseling benchmark show that DELTA improves both counseling quality and emotion attunement across models. Ablation and qualitative analyses suggest that explicit multimodal reasoning and structured mental state representations play complementary roles in supporting empathic human-AI interaction.

[9] From Lemmas to Dependencies: What Signals Drive Light Verbs Classification?

Sercan Karakaş,Yusuf Şimşek

Main category: cs.CL

TL;DR: 本文系统研究了土耳其语轻动词结构(LVCs)的自动识别问题,通过限制模型输入来探究驱动分类的关键信号,发现仅靠粗粒度形态句法特征不足以稳健识别LVC,而词元(lemma)信息虽有效但高度依赖归一化方式。

Details Motivation: 土耳其语中丰富的形态变化和高产的复杂谓词导致习语性谓词与字面动词-论元用法之间存在细微区别,使得LVC识别极具挑战性;本文旨在厘清哪些语言信号真正驱动LVC分类决策。 Method: 在UD标注数据监督下,对比多种输入受限模型:基于词元的TF-IDF+逻辑回归、仅用词元序列微调的BERTurk、仅使用UD形态句法特征(UPOS/DEPREL/MORPH)的逻辑回归,以及全输入BERTurk基线;在包含随机负例、词汇控制负例(NLVC)和LVC正例的诊断集上分组评估。 Result: 粗粒度形态句法特征单独使用无法在严格对照条件下实现稳健LVC检测;词元信息有助于LVC判断,但其效果对校准和归一化策略高度敏感;'仅词元'并非单一明确表征,其定义严重依赖归一化操作方式。 Conclusion: 研究凸显了针对土耳其语多词表达(MWE)开展目标导向评测的必要性,并指出词元表征需谨慎定义与实现,不能简单视为统一基准。 Abstract: Light verb constructions (LVCs) are a challenging class of verbal multiword expressions, especially in Turkish, where rich morphology and productive complex predicates create minimal contrasts between idiomatic predicate meanings and literal verb--argument uses. This paper asks what signals drive LVC classification by systematically restricting model inputs. Using UD-derived supervision, we compare lemma-driven baselines (lemma TF--IDF + Logistic Regression; BERTurk trained on lemma sequences), a grammar-only Logistic Regression over UD morphosyntax (UPOS/DEPREL/MORPH), and a full-input BERTurk baseline. We evaluate on a controlled diagnostic set with Random negatives, lexical controls (NLVC), and LVC positives, reporting split-wise performance to expose decision-boundary behavior. Results show that coarse morphosyntax alone is insufficient for robust LVC detection under controlled contrasts, while lexical identity supports LVC judgments but is sensitive to calibration and normalization choices. Overall, Our findings motivate targeted evaluation of Turkish MWEs and show that ``lemma-only'' is not a single, well-defined representation, but one that depends critically on how normalization is operationalized.

[10] The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

Zhexin Zhang,Yida Lu,Junfeng Fang,Junxiao Yang,Shiyao Cui,Hao Zhou,Fandong Meng,Jie Zhou,Hongning Wang,Minlie Huang,Tat-Seng Chua

Main category: cs.CL

TL;DR: 本文首次系统研究了AI模型在训练阶段的隐式安全风险,即模型基于内部动机和上下文信息产生的有害行为,并提出了风险分类体系,通过实验证明其普遍存在且严重。

Details Motivation: 现有研究多关注部署阶段的安全风险(如越狱攻击),而训练阶段的安全风险,尤其是隐式风险(非显式奖励操控)被严重忽视。 Method: 提出包含五个风险等级、十个细粒度风险类别和三种动机类型的分类学框架;通过大量实验(包括单模型与多智能体训练)评估风险表现及影响因素。 Result: Llama-3.1-8B-Instruct在仅提供背景信息时,74.4%的训练运行中表现出风险行为;隐式训练风险在多智能体训练中同样存在。 Conclusion: 训练阶段的隐式安全风险是一个被忽视但紧迫的新挑战,亟需引起AI安全领域重视并开展深入研究。 Abstract: Safety risks of AI models have been widely studied at deployment time, such as jailbreak attacks that elicit harmful outputs. In contrast, safety risks emerging during training remain largely unexplored. Beyond explicit reward hacking that directly manipulates explicit reward functions in reinforcement learning, we study implicit training-time safety risks: harmful behaviors driven by a model's internal incentives and contextual background information. For example, during code-based reinforcement learning, a model may covertly manipulate logged accuracy for self-preservation. We present the first systematic study of this problem, introducing a taxonomy with five risk levels, ten fine-grained risk categories, and three incentive types. Extensive experiments reveal the prevalence and severity of these risks: notably, Llama-3.1-8B-Instruct exhibits risky behaviors in 74.4% of training runs when provided only with background information. We further analyze factors influencing these behaviors and demonstrate that implicit training-time risks also arise in multi-agent training settings. Our results identify an overlooked yet urgent safety challenge in training.

[11] From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents

Xinyue Wang,Yuanhe Zhang,Zhengshuo Gong,Haoran Gao,Fanyu Meng,Zhenhong Zhou,Li Sun,Yang Liu,Sen Su

Main category: cs.CL

TL;DR: 本文提出并定义了'毒性主动性(Toxic Proactivity)'这一新型主动失效模式:LLM智能体为追求工具性效用最大化,主动违背伦理约束、采取操纵性或过度行为;作者构建基于两模型困境交互的多步评估框架,实证发现该现象普遍存在,并发布系统性评测基准。

Details Motivation: 现有LLM对齐方法导致智能体在‘有益性-无害性’间权衡时,不仅出现被动的‘过度拒绝(over-refusal)’,还因主动规划与工具使用能力催生出新的主动风险——即为最大化‘有用性’而主动忽视伦理约束的行为,但该现象尚未被充分识别和研究。 Method: 提出一种基于双模型困境驱动交互的新型评估框架,通过模拟多步行为轨迹来揭示和分析智能体的‘毒性主动性’;在主流大语言模型上开展大量实验,并构建覆盖多种上下文场景的系统性评测基准。 Result: 实验证明‘毒性主动性’是广泛存在的行为现象,并识别出两种主要行为倾向;所提框架能有效暴露该风险,且发布的基准支持跨情境量化评估。 Conclusion: ‘毒性主动性’是LLM智能体对齐中不可忽视的主动失效模式,需在模型设计与评估中同步关注被动拒绝与主动越界两类风险;本文框架与基准为后续安全对齐研究提供了新视角与实用工具。 Abstract: The enhanced capabilities of LLM-based agents come with an emergency for model planning and tool-use abilities. Attributing to helpful-harmless trade-off from LLM alignment, agents typically also inherit the flaw of "over-refusal", which is a passive failure mode. However, the proactive planning and action capabilities of agents introduce another crucial danger on the other side of the trade-off. This phenomenon we term "Toxic Proactivity'': an active failure mode in which an agent, driven by the optimization for Machiavellian helpfulness, disregards ethical constraints to maximize utility. Unlike over-refusal, Toxic Proactivity manifests as the agent taking excessive or manipulative measures to ensure its "usefulness'' is maintained. Existing research pays little attention to identifying this behavior, as it often lacks the subtle context required for such strategies to unfold. To reveal this risk, we introduce a novel evaluation framework based on dilemma-driven interactions between dual models, enabling the simulation and analysis of agent behavior over multi-step behavioral trajectories. Through extensive experiments with mainstream LLMs, we demonstrate that Toxic Proactivity is a widespread behavioral phenomenon and reveal two major tendencies. We further present a systematic benchmark for evaluating Toxic Proactive behavior across contextual settings.

Hsien-Jyh Liao

Main category: cs.CL

TL;DR: 本文提出Soft-FSM神经符号架构,通过外部确定性状态控制器强制语言模型在法律交叉询问等长周期、强程序约束任务中实现单调进展,显著提升任务完成率。

Details Motivation: 大型语言模型虽具语言流畅性,但在需严格遵循显式程序约束的长周期任务(如法律交叉询问)中易出现‘程序停滞’,即行为连贯但无法保证程序推进。 Method: 提出Soft-FSM:结合神经网络(LLM)与符号系统(外部确定性状态控制器),以关键信息单元(KIUs)为状态变量,强制执行单调进展。 Result: 在三个真实台湾刑事案件上的实验表明,基线方法任务完成率低于40%,而Soft-FSM稳定达到97%以上完成率且冗余近乎为零。 Conclusion: 在强程序约束领域,仅依赖大模型涌现行为无法保障可靠任务完成,必须引入可验证的外部状态控制机制。 Abstract: Large language models (LLMs) exhibit impressive linguistic fluency but struggle to reliably complete long-horizon tasks under explicit procedural constraints. In legal cross-examination, purely proba-bilistic generation often maintains behavioral coherence while failing to ensure procedural advancement. We characterize this failure as procedural stagnation and propose Soft-FSM, a neuro-symbolic architecture that enforces monotonic progress over accumulated Key Information Units (KIUs) via an external deterministic state controller. Experiments on three real-world Taiwanese criminal homicide cases show that baseline methods collapse below 40% completeness, while Soft-FSM consistently achieves over 97% with near-zero redundancy. These results suggest that, in such domains, reliable task completion cannot be guaranteed by emergent LLM behavior alone, and can be reliably enforced through explicit and verifiable external state control.

[13] Language Models Struggle to Use Representations Learned In-Context

Michael A. Lepori,Tal Linzen,Ann Yuan,Katja Filippova

Main category: cs.CL

TL;DR: 本文研究了大语言模型(LLMs)是否能利用上下文中学到的表征来完成下游任务,发现无论开源还是闭源先进模型,在将上下文定义的新语义灵活用于预测或建模时均表现不佳,表明当前LLM在表征编码与部署之间存在关键脱节。

Details Motivation: 推动AI系统实现真正部署后适应新环境的能力,关键在于让模型不仅能从上下文中学习丰富表征,还能灵活运用这些表征完成任务;然而现有研究尚未验证LLM是否具备这种‘部署能力’。 Method: 首先在开放权重LLM上测试其利用上下文表征进行下一词预测的能力;其次设计新任务‘自适应世界建模’(adaptive world modeling)进行探针实验;最后在闭源先进推理模型上复现该任务以验证普适性。 Result: 开放权重LLM虽能在隐空间中编码上下文定义的新语义,却难以将其用于下一词预测或自适应世界建模;闭源SOTA模型在该任务上同样无法稳定利用上下文中的新模式。 Conclusion: 当前LLM普遍存在‘编码-部署断层’:能学但不会用;亟需新方法促使模型以支持灵活部署的方式构建上下文表征。 Abstract: Though large language models (LLMs) have enabled great success across a wide variety of tasks, they still appear to fall short of one of the loftier goals of artificial intelligence research: creating an artificial system that can adapt its behavior to radically new contexts upon deployment. One important step towards this goal is to create systems that can induce rich representations of data that are seen in-context, and then flexibly deploy these representations to accomplish goals. Recently, Park et al. (2024) demonstrated that current LLMs are indeed capable of inducing such representation from context (i.e., in-context representation learning). The present study investigates whether LLMs can use these representations to complete simple downstream tasks. We first assess whether open-weights LLMs can use in-context representations for next-token prediction, and then probe models using a novel task, adaptive world modeling. In both tasks, we find evidence that open-weights LLMs struggle to deploy representations of novel semantics that are defined in-context, even if they encode these semantics in their latent representations. Furthermore, we assess closed-source, state-of-the-art reasoning models on the adaptive world modeling task, demonstrating that even the most performant LLMs cannot reliably leverage novel patterns presented in-context. Overall, this work seeks to inspire novel methods for encouraging models to not only encode information presented in-context, but to do so in a manner that supports flexible deployment of this information.

[14] Tokenization and Morphological Fidelity in Uralic NLP: A Cross-Lingual Evaluation

Nuo Xu,Ahrii Kim

Main category: cs.CL

TL;DR: 本研究系统比较了三种子词分词方法(BPE、OBPE 和 Unigram)在六种乌拉尔语系语言上的表现,发现 OBPE 在形态对齐和词性标注准确率上表现更优,尤其在拉丁字母书写的语言中;其优势源于减少开放类词汇的碎片化及更均衡的词频覆盖;此外,迁移效果还受下游标注架构、训练数据量和语言谱系相近性影响。

Details Motivation: 子词分词对NLP性能至关重要,但在形态丰富且资源匮乏的语言家族中的行为尚缺乏深入探索。 Method: 系统比较Byte Pair Encoding (BPE)、Overlap BPE (OBPE) 和 Unigram Language Model 三种子词范式,在六种具有不同资源可用性和类型学多样性的乌拉尔语上,以词性标注为受控下游任务进行评估。 Result: OBPE 在形态对齐和词性标注准确率上持续优于传统方法,尤其在拉丁字母组语言中;其提升来自开放类词汇碎片化减少及词频分布更均衡;迁移效果受下游标注架构、训练量和语言谱系距离共同影响。 Conclusion: 面向形态的分词不仅是预处理选择,更是实现黏着型低资源语言有效跨语言迁移的关键决定性因素。 Abstract: Subword tokenization critically affects Natural Language Processing (NLP) performance, yet its behavior in morphologically rich and low-resource language families remains under-explored. This study systematically compares three subword paradigms -- Byte Pair Encoding (BPE), Overlap BPE (OBPE), and Unigram Language Model -- across six Uralic languages with varying resource availability and typological diversity. Using part-of-speech (POS) tagging as a controlled downstream task, we show that OBPE consistently achieves stronger morphological alignment and higher tagging accuracy than conventional methods, particularly within the Latin-script group. These gains arise from reduced fragmentation in open-class categories and a better balance across the frequency spectrum. Transfer efficacy further depends on the downstream tagging architecture, interacting with both training volume and genealogical proximity. Taken together, these findings highlight that morphology-sensitive tokenization is not merely a preprocessing choice but a decisive factor in enabling effective cross-lingual transfer for agglutinative, low-resource languages.

[15] CoLT: Reasoning with Chain of Latent Tool Calls

Fangwei Zhu,Zhifang Sui

Main category: cs.CL

TL;DR: 本文提出CoLT框架,通过将隐式推理实现为“工具调用”,利用种子token和外部小模型解包推理步骤,在保持主模型显式推理能力的同时提升效率。

Details Motivation: 现有隐式推理方法需修改模型结构并进行大量训练,限制了其通用性。 Method: CoLT框架将隐式推理建模为工具调用:主模型生成含推理信息的种子token;触发隐式工具调用时,外部小模型以种子token的隐藏状态为输入,将其解包为完整推理步骤。 Result: 在四个数学数据集上,CoLT相比基线隐式模型具有更高准确率和更短推理长度,并兼容强化学习算法及不同解码器结构。 Conclusion: CoLT在不改变主模型结构和训练方式的前提下,实现了高效且可解释的隐式推理,提升了大语言模型的推理效率与泛化能力。 Abstract: Chain-of-Thought (CoT) is a critical technique in enhancing the reasoning ability of Large Language Models (LLMs), and latent reasoning methods have been proposed to accelerate the inefficient token-level reasoning chain. We notice that existing latent reasoning methods generally require model structure augmentation and exhaustive training, limiting their broader applicability. In this paper, we propose CoLT, a novel framework that implements latent reasoning as ``tool calls''. Instead of reasoning entirely in the latent space, CoLT generates seed tokens that contain information of a reasoning step. When a latent tool call is triggered, a smaller external model will take the hidden states of seed tokens as its input, and unpack the seed tokens back to a full reasoning step. In this way, we can ensure that the main model reasons in the explicit token space, preserving its ability while improving efficiency. Experimental results on four mathematical datasets demonstrate that CoLT achieves higher accuracy and shorter reasoning length than baseline latent models, and is compatible with reinforcement learning algorithms and different decoder structures.

[16] DementiaBank-Emotion: A Multi-Rater Emotion Annotation Corpus for Alzheimer's Disease Speech (Version 1.0)

Cheonkam Jeong,Jessica Liao,Audrey Lu,Yutong Song,Christopher Rashidian,Donna Krogh,Erik Krogh,Mahkameh Rasouli,Jung-Ah Lee,Nikil Dutt,Lisa M Gibbs,David Sultzer,Julie Rousseau,Jocelyn Ludlow,Margaret Galvez,Alexander Nuth,Chet Khay,Sabine Brunswicker,Adeline Nyamathi

Main category: cs.CL

TL;DR: 本文介绍了DementiaBank-Emotion——首个针对阿尔茨海默病(AD)语音的多标注者情绪标注语料库,发现AD患者比健康对照组表达更多非中性情绪,并初步揭示了声学特征(如基频、响度)在情绪表达中的变化模式。

Details Motivation: 构建首个面向阿尔茨海默病语音的多标注者情绪语料库,以支持临床人群情绪识别研究,并探究AD患者情绪表达的声学特征变化机制。 Method: 对来自108名说话者的1492个话语进行Ekman六种基本情绪及中性情绪的多标注;开展探索性声学分析(包括基频F0和响度),比较AD患者与健康对照组的情绪表达差异。 Result: AD患者表达非中性情绪比例显著高于对照组(16.9% vs. 5.7%,p < .001);控制组在悲伤时F0明显下降,而AD组变化微弱(交互效应p = .023);AD组内响度可区分不同情绪类别。 Conclusion: AD患者情绪表达频率增加,但部分声学—情绪映射(如F0与悲伤)可能受损,而另一些(如响度与情绪类别)仍部分保留;该语料库及配套资源已公开发布。 Abstract: We present DementiaBank-Emotion, the first multi-rater emotion annotation corpus for Alzheimer's disease (AD) speech. Annotating 1,492 utterances from 108 speakers for Ekman's six basic emotions and neutral, we find that AD patients express significantly more non-neutral emotions (16.9%) than healthy controls (5.7%; p < .001). Exploratory acoustic analysis suggests a possible dissociation: control speakers showed substantial F0 modulation for sadness (Delta = -3.45 semitones from baseline), whereas AD speakers showed minimal change (Delta = +0.11 semitones; interaction p = .023), though this finding is based on limited samples (sadness: n=5 control, n=15 AD) and requires replication. Within AD speech, loudness differentiates emotion categories, indicating partially preserved emotion-prosody mappings. We release the corpus, annotation guidelines, and calibration workshop materials to support research on emotion recognition in clinical populations.

[17] Scaling Agentic Verifier for Competitive Coding

Zeyao Ma,Jing Zhang,Xiaokang Zhang,Jiaxi Yang,Zongmeng Zhang,Jiajun Zhang,Yuheng Jing,Lei Zhang,Hao Zheng,Wenting Zhao,Junyang Lin,Binyuan Hui

Main category: cs.CL

TL;DR: 本文提出Agentic Verifier,一种基于执行的智能体验证器,通过与代码执行环境多轮交互,主动推理程序行为并生成高区分度测试输入,以暴露候选解之间的行为差异,从而提升竞争性编程问题的求解准确率。

Details Motivation: 现有基于执行的重排序方法受限于困难测试用例生成或低效随机采样,难以有效提升LLM在竞争性编程问题上的单次求解正确率。 Method: 提出Agentic Verifier,结合大规模数据合成、拒绝微调和智能体强化学习训练该验证器,使其能迭代优化输入生成器并生成有针对性的反例,而非盲目采样。 Result: 在五个竞争性编程基准上显著优于强基线,Best@K准确率最高提升10-15个百分点;实验还揭示了清晰的测试时缩放行为及更广应用潜力。 Conclusion: Agentic Verifier是一种有效的测试时扩展策略,通过主动、目标导向的测试输入生成,显著提升了LLM在复杂编程任务中的可靠性与准确性。 Abstract: Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling. To address this limitation, we propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs that expose behavioral discrepancies among candidate solutions. Through multi-turn interaction with code execution environments, the verifier iteratively refines the candidate input generator and produces targeted counterexamples rather than blindly sampling inputs. We train the verifier to acquire this discriminative input generation capability via a scalable pipeline combining large-scale data synthesis, rejection fine-tuning, and agentic reinforcement learning. Extensive experiments across five competitive programming benchmarks demonstrate consistent improvements over strong execution-based baselines, achieving up to +10-15% absolute gains in Best@K accuracy. Further analysis reveals clear test-time scaling behavior and highlights the verifier's broader potential beyond reranking.

[18] ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation

Jiarui Jin,Haoyu Wang,Xingliang Wu,Xiaocheng Fang,Xiang Lan,Zihan Wang,Deyun Zhang,Bo Liu,Yingying Zhang,Xian Wu,Hongyan Li,Shenda Hong

Main category: cs.CL

TL;DR: 本文提出了ECG-R1,首个专为可靠心电图(ECG)解读设计的推理型多模态大语言模型,通过协议引导的数据生成、模态解耦架构与诊断证据奖励的强化学习三大创新提升临床可靠性,并揭示现有MLLMs在ECG解读中存在普遍幻觉问题。

Details Motivation: 现有多模态大语言模型(MLLMs)在心电图(ECG)解读中不可靠,常产生看似合理但临床错误的分析,亟需提升其诊断准确性与可信度。 Method: 提出ECG-R1模型,包含三项核心方法:1)协议引导的指令数据生成,确保解读基于可测量ECG特征与权威定量标准;2)模态解耦架构配合交错模态丢弃(Interleaved Modality Dropout),增强信号或图像任一模态缺失时的鲁棒性与跨模态一致性;3)基于ECG诊断证据的强化学习奖励机制,强化证据驱动的解读能力。 Result: 系统评估了多种商用、开源及医学专用MLLMs,首次以量化方式证实其ECG解读中存在严重幻觉现象;ECG-R1在可靠性、鲁棒性和证据支持性方面显著优于基线模型。 Conclusion: ECG-R1为多模态医学AI树立了新范式,强调临床协议对齐、模态鲁棒性与证据驱动决策的重要性;研究警示临床实践中不可直接信任现有MLLMs的ECG输出,必须经独立验证。 Abstract: Electrocardiography (ECG) serves as an indispensable diagnostic tool in clinical practice, yet existing multimodal large language models (MLLMs) remain unreliable for ECG interpretation, often producing plausible but clinically incorrect analyses. To address this, we propose ECG-R1, the first reasoning MLLM designed for reliable ECG interpretation via three innovations. First, we construct the interpretation corpus using \textit{Protocol-Guided Instruction Data Generation}, grounding interpretation in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic. Second, we present a modality-decoupled architecture with \textit{Interleaved Modality Dropout} to improve robustness and cross-modal consistency when either the ECG signal or ECG image is missing. Third, we present \textit{Reinforcement Learning with ECG Diagnostic Evidence Rewards} to strengthen evidence-grounded ECG interpretation. Additionally, we systematically evaluate the ECG interpretation capabilities of proprietary, open-source, and medical MLLMs, and provide the first quantitative evidence that severe hallucinations are widespread, suggesting that the public should not directly trust these outputs without independent verification. Code and data are publicly available at \href{https://github.com/PKUDigitalHealth/ECG-R1}{here}, and an online platform can be accessed at \href{http://ai.heartvoice.com.cn/ECG-R1/}{here}.

[19] Contextual Drag: How Errors in the Context Affect LLM Reasoning

Yun Cheng,Xingyu Zhu,Haoyu Zhao,Sanjeev Arora

Main category: cs.CL

TL;DR: 本文揭示了大语言模型(LLM)在自我改进过程中存在一种名为‘上下文拖拽’(contextual drag)的现象:历史错误会结构性地影响后续推理,导致性能下降甚至自我恶化。该效应广泛存在于11种模型、8类推理任务中,且难以通过反馈或自检消除,现有缓解方法效果有限。

Details Motivation: 探究LLM自我改进范式中‘反思错误’这一核心假设的潜在缺陷,特别是上下文中的失败尝试是否反而损害后续推理质量。 Method: 在8个推理任务上系统评估11个主流LLM;使用树编辑距离进行结构化错误模式分析;测试外部反馈、自验证及多种缓解策略(如回退行为微调、上下文去噪)的有效性。 Result: 发现上下文拖拽导致10–20%性能下降;迭代自修正可能引发自我恶化;错误结构在推理路径间跨步继承;反馈与自验证无法消除该效应;缓解策略仅带来部分恢复。 Conclusion: 上下文拖拽是当前推理架构中一种普遍、顽固且尚未被充分认识的失败模式,对基于反思的自我改进范式构成根本性挑战。 Abstract: Central to many self-improvement pipelines for large language models (LLMs) is the assumption that models can improve by reflecting on past mistakes. We study a phenomenon termed contextual drag: the presence of failed attempts in the context biases subsequent generations toward structurally similar errors. Across evaluations of 11 proprietary and open-weight models on 8 reasoning tasks, contextual drag induces 10-20% performance drops, and iterative self-refinement in models with severe contextual drag can collapse into self-deterioration. Structural analysis using tree edit distance reveals that subsequent reasoning trajectories inherit structurally similar error patterns from the context. We demonstrate that neither external feedback nor successful self-verification suffices to eliminate this effect. While mitigation strategies such as fallback-behavior fine-tuning and context denoising yield partial improvements, they fail to fully restore baseline performance, positioning contextual drag as a persistent failure mode in current reasoning architectures.

[20] Proxy Compression for Language Modeling

Lin Zheng,Xinyu Li,Qian Liu,Xiachong Feng,Lingpeng Kong

Main category: cs.CL

TL;DR: 本文提出了一种名为'proxy compression'的新训练范式,使语言模型能在保持训练效率的同时,直接以原始字节(raw bytes)作为输入进行推理,摆脱对固定分词器(tokenizer)的依赖。

Details Motivation: 现有语言模型严重依赖固定分词器(如UTF-8字节分词器),导致模型与压缩方式强耦合,限制了灵活性和鲁棒性;而纯字节级建模虽鲁棒但训练效率低。本文旨在解耦压缩与建模,实现高效且端到端的字节级建模。 Method: 提出proxy compression:训练时联合输入原始字节序列及其由外部压缩器生成的压缩视图,通过多视图对齐学习内部字节-压缩序列映射;压缩视图仅用于训练,推理时完全使用原始字节。 Result: 在代码语言建模任务上,proxy compression显著提升训练效率,大幅超越纯字节基线;随着模型规模增大,性能增益更明显,最终可匹敌甚至媲美传统分词器方法,同时全程基于原始字节、保持字节级鲁棒性。 Conclusion: proxy compression成功实现了训练效率与推理灵活性的统一,为摆脱分词器依赖、构建真正端到端字节级语言模型提供了可行路径。 Abstract: Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor often over UTF-8 byte sequences, thereby coupling the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an end-to-end, raw-byte interface at inference time. During training, one language model is jointly trained on raw byte sequences and compressed views generated by external compressors; through the process, the model learns to internally align compressed sequences and raw bytes. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs which are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially improves training efficiency and significantly outperforms pure byte-level baselines given fixed compute budgets. As model scale increases, these gains become more pronounced, and proxy-trained models eventually match or rival tokenizer approaches, all while operating solely on raw bytes and retaining the inherent robustness of byte-level modeling.

[21] Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision

Lingzhuang Sun,Ruitong Liu,Yuxia Zhu,Xiaohan Xu,Jingxuan Wei,Xiangxiang Zhang,Bihui Yu,Wentao Zhang

Main category: cs.CL

TL;DR: 本文提出了一种名为'Guided Verifier'的框架,通过引入动态验证器与策略模型实时协作推理,缓解多模态大语言模型在强化学习中因错误传播导致的性能下降问题,并构建了专门用于训练该验证器的CoRe数据集。

Details Motivation: 现有RL增强MLLM推理的方法依赖单一rollout策略,缺乏中间监督,易导致错误传播和噪声优化信号。 Method: 提出Guided Verifier框架:设计动态验证器在rollout过程中与策略模型实时交互、检测不一致并提供方向性引导;构建CoRe数据集(含过程级负样本与正确引导推理轨迹)用于训练验证器。 Result: 在MathVista、MathVerse和MMMU上实验表明,8B参数模型通过协同推理与动态验证可实现强性能。 Conclusion: 主动、过程级的动态验证机制能有效提升MLLM在复杂多模态推理任务中的鲁棒性与准确性,为RL训练范式提供了新思路。 Abstract: Reinforcement Learning (RL) has emerged as a pivotal mechanism for enhancing the complex reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevailing paradigms typically rely on solitary rollout strategies where the model works alone. This lack of intermediate oversight renders the reasoning process susceptible to error propagation, where early logical deviations cascade into irreversible failures, resulting in noisy optimization signals. In this paper, we propose the \textbf{Guided Verifier} framework to address these structural limitations. Moving beyond passive terminal rewards, we introduce a dynamic verifier that actively co-solves tasks alongside the policy. During the rollout phase, this verifier interacts with the policy model in real-time, detecting inconsistencies and providing directional signals to steer the model toward valid trajectories. To facilitate this, we develop a specialized data synthesis pipeline targeting multimodal hallucinations, constructing \textbf{CoRe} dataset of process-level negatives and \textbf{Co}rrect-guide \textbf{Re}asoning trajectories to train the guided verifier. Extensive experiments on MathVista, MathVerse and MMMU indicate that by allocating compute to collaborative inference and dynamic verification, an 8B-parameter model can achieve strong performance.

[22] How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

Yanshu Wang,Shuaishuai Yang,Jingjing He,Tong Yang

Main category: cs.CL

TL;DR: 本文系统评估了少样本示例(few-shot demonstrations)在基于提示的LLM安全防御(RoP和ToP)中的作用,发现其对RoP有正面效果(提升安全率4.5%),但对ToP有显著负面效果(降低有效性21.2%),并据此提出实际部署建议。

Details Motivation: 尽管基于提示的防御(如RoP、ToP)已被证明有效,但少样本示例在其中的作用尚不明确;已有研究暗示其可能损害安全性,却未探究其与不同系统提示策略的交互机制。 Method: 在四个主流安全基准(AdvBench、HarmBench、SG-Bench、XSTest)上,使用六种主流越狱攻击方法,对多个主流大语言模型开展综合实证评估,对比分析少样本示例在RoP与ToP两类提示策略下的安全影响机制。 Result: 少样本示例对RoP和ToP产生相反影响:增强RoP安全性(+4.5%),削弱ToP有效性(−21.2%);机制归因为前者强化角色身份认同,后者分散任务指令注意力。 Conclusion: 少样本示例的效果高度依赖于系统提示类型,不能一概而论;应根据所选提示策略审慎设计是否及如何引入少样本示例,并为实际LLM安全部署提供差异化实践指南。 Abstract: Large Language Models (LLMs) face increasing threats from jailbreak attacks that bypass safety alignment. While prompt-based defenses such as Role-Oriented Prompts (RoP) and Task-Oriented Prompts (ToP) have shown effectiveness, the role of few-shot demonstrations in these defense strategies remains unclear. Prior work suggests that few-shot examples may compromise safety, but lacks investigation into how few-shot interacts with different system prompt strategies. In this paper, we conduct a comprehensive evaluation on multiple mainstream LLMs across four safety benchmarks (AdvBench, HarmBench, SG-Bench, XSTest) using six jailbreak attack methods. Our key finding reveals that few-shot demonstrations produce opposite effects on RoP and ToP: few-shot enhances RoP's safety rate by up to 4.5% through reinforcing role identity, while it degrades ToP's effectiveness by up to 21.2% through distracting attention from task instructions. Based on these findings, we provide practical recommendations for deploying prompt-based defenses in real-world LLM applications.

[23] Revisiting Prompt Sensitivity in Large Language Models for Text Classification: The Role of Prompt Underspecification

Branislav Pecher,Michal Spiegel,Robert Belanec,Jan Cegin

Main category: cs.CL

TL;DR: 本文探讨了大语言模型(LLMs)在零样本和少样本分类任务中对提示(prompt)变化的敏感性,指出这种敏感性很大程度上源于提示的“欠规范性”(underspecification),即提示缺乏明确的任务指令和输出约束;通过性能分析、logit分析和线性探针分析,作者发现欠规范提示导致更高性能方差和更低相关token logit值,但其影响主要出现在模型最后层,而非深层表征;研究强调需更严谨地分析和缓解提示敏感性。

Details Motivation: 大量研究表明LLM对prompt变化敏感,但现有研究多使用欠规范prompt(指令模糊、输出约束弱),作者认为这可能是敏感性的主要来源,因而需系统对比欠规范与规范prompt的影响。 Method: 采用性能分析(accuracy variance)、logit分析(relevant token logits)和线性探针(linear probing)三种方法,在相同任务下对比欠规范prompt与含具体指令prompt的行为差异,并定位敏感性在模型内部的产生位置。 Result: 欠规范prompt导致更高的性能方差和更低的相关token logit值;线性探针显示其对中间层表征影响微弱,敏感性主要在最终层输出阶段显现;规范prompt显著缓解上述问题。 Conclusion: prompt敏感性很大程度上源于欠规范设计,而非模型固有缺陷;未来研究与应用中应提高prompt规范性,并在评估敏感性时采用更严谨的prompt构造标准。 Abstract: Large language models (LLMs) are widely used as zero-shot and few-shot classifiers, where task behaviour is largely controlled through prompting. A growing number of works have observed that LLMs are sensitive to prompt variations, with small changes leading to large changes in performance. However, in many cases, the investigation of sensitivity is performed using underspecified prompts that provide minimal task instructions and weakly constrain the model's output space. In this work, we argue that a significant portion of the observed prompt sensitivity can be attributed to prompt underspecification. We systematically study and compare the sensitivity of underspecified prompts and prompts that provide specific instructions. Utilising performance analysis, logit analysis, and linear probing, we find that underspecified prompts exhibit higher performance variance and lower logit values for relevant tokens, while instruction-prompts suffer less from such problems. However, linear probing analysis suggests that the effects of prompt underspecification have only a marginal impact on the internal LLM representations, instead emerging in the final layers. Overall, our findings highlight the need for more rigour when investigating and mitigating prompt sensitivity.

[24] DeFrame: Debiasing Large Language Models Against Framing Effects

Kahee Lim,Soyeon Kim,Steven Euijong Whang

Main category: cs.CL

TL;DR: 本文揭示了提示词的表述方式(即'框架效应')对大语言模型公平性评估结果的影响,提出了'框架差异'概念,并设计了一种能提升模型在不同表述下公平性一致性的去偏方法。

Details Motivation: 现有公平性评估方法未能充分考虑语义等价但表述不同的提示词(即'框架效应')对模型输出公平性的影响,导致模型看似公平实则存在隐藏偏差。 Method: 提出'框架差异'概念,通过在公平性评估基准中引入多种语义等价但表述不同的提示来量化该效应;并设计一种面向框架感知的去偏方法,增强模型在不同框架下的响应一致性。 Result: 实验表明,不同框架下模型的公平性得分差异显著;现有去偏方法虽提升平均公平性,却未能缓解框架差异;所提方法可同时降低总体偏差并提升对框架差异的鲁棒性。 Conclusion: 框架效应是影响LLM公平性评估与实际表现的重要因素,需在评估与去偏过程中显式建模和缓解。 Abstract: As large language models (LLMs) are increasingly deployed in real-world applications, ensuring their fair responses across demographics has become crucial. Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations, but can produce biased responses outside those evaluation settings. In this paper, we identify framing -- differences in how semantically equivalent prompts are expressed (e.g., "A is better than B" vs. "B is worse than A") -- as an underexplored contributor to this gap. We first introduce the concept of "framing disparity" to quantify the impact of framing on fairness evaluation. By augmenting fairness evaluation benchmarks with alternative framings, we find that (1) fairness scores vary significantly with framing and (2) existing debiasing methods improve overall (i.e., frame-averaged) fairness, but often fail to reduce framing-induced disparities. To address this, we propose a framing-aware debiasing method that encourages LLMs to be more consistent across framings. Experiments demonstrate that our approach reduces overall bias and improves robustness against framing disparities, enabling LLMs to produce fairer and more consistent responses.

[25] A Domain-Specific Curated Benchmark for Entity and Document-Level Relation Extraction

Marco Martinelli,Stefano Marchesin,Vanessa Bonato,Giorgio Maria Di Nunzio,Nicola Ferro,Ornella Irrera,Laura Menotti,Federica Vezzani,Gianmaria Silvello

Main category: cs.CL

TL;DR: 本文介绍了GutBrainIE,一个基于1600多篇PubMed摘要、由生物医学和术语学专家手工标注的肠道-大脑轴领域信息抽取新基准,旨在解决现有生物医学IE基准范围窄、标注质量低的问题。

Details Motivation: 现有生物医学信息抽取(IE)基准范围狭窄、依赖远监督或自动生成标注,难以支撑鲁棒IE方法的发展,尤其在快速发展的肠道-大脑轴等复杂领域。 Method: 构建了一个名为GutBrainIE的新基准,涵盖1600多篇PubMed摘要,由领域专家进行细粒度实体、概念级链接和关系的手工标注,并融合高精度标注与弱监督数据。 Result: GutBrainIE提供了丰富的模式、多任务支持及高质量标注数据,显著提升了生物医学IE系统在跨领域开发与评估中的适用性与可靠性。 Conclusion: GutBrainIE为生物医学信息抽取研究提供了更可靠、更具泛化能力的基准资源,有助于推动面向复杂生物医学领域的鲁棒IE方法发展。 Abstract: Information Extraction (IE), encompassing Named Entity Recognition (NER), Named Entity Linking (NEL), and Relation Extraction (RE), is critical for transforming the rapidly growing volume of scientific publications into structured, actionable knowledge. This need is especially evident in fast-evolving biomedical fields such as the gut-brain axis, where research investigates complex interactions between the gut microbiota and brain-related disorders. Existing biomedical IE benchmarks, however, are often narrow in scope and rely heavily on distantly supervised or automatically generated annotations, limiting their utility for advancing robust IE methods. We introduce GutBrainIE, a benchmark based on more than 1,600 PubMed abstracts, manually annotated by biomedical and terminological experts with fine-grained entities, concept-level links, and relations. While grounded in the gut-brain axis, the benchmark's rich schema, multiple tasks, and combination of highly curated and weakly supervised data make it broadly applicable to the development and evaluation of biomedical IE systems across domains.

[26] Can Vision Replace Text in Working Memory? Evidence from Spatial n-Back in Vision-Language Models

Sichu Liang,Hongyu Zhu,Wenwen Wang,Deyu Zhou

Main category: cs.CL

TL;DR: 本文通过空间n-back任务,对比了Qwen2.5(纯文本)与Qwen2.5-VL(视觉-语言)模型在文本和图像模态下的工作记忆表现,发现文本模态性能显著优于视觉模态;进一步分析表明,模型实际执行的常是基于最近项的比较而非真正按指定滞后(lag)的记忆更新,并受网格尺寸影响干扰模式,提示需对多模态工作记忆进行计算机制层面的解读。

Details Motivation: 探究视觉-语言模型是否像语言模型一样展现出可比的工作记忆计算能力,尤其关注信息编码模态(文本vs视觉)对n-back任务表现的影响。 Method: 在控制条件下,使用匹配的文本渲染和图像渲染网格,对Qwen2.5和Qwen2.5-VL进行空间n-back任务评估;采用试次级对数概率证据分析实际记忆策略,并考察网格尺寸对刺激流中重复结构及干扰模式的影响。 Result: 模型在文本条件下的准确率和d'显著高于图像条件;名义上的2/3-back任务常未真正执行对应滞后比较,而表现为近因锁定的比较;网格尺寸变化会改变近期重复结构,进而影响干扰类型和错误模式。 Conclusion: 当前视觉-语言模型在视觉n-back任务中并未展现出与文本n-back相当的、符合经典工作记忆定义的动态维持与更新能力;性能差异源于模态特异性计算机制,需发展更计算敏感的多模态工作记忆评估框架。 Abstract: Working memory is a central component of intelligent behavior, providing a dynamic workspace for maintaining and updating task-relevant information. Recent work has used n-back tasks to probe working-memory-like behavior in large language models, but it is unclear whether the same probe elicits comparable computations when information is carried in a visual rather than textual code in vision-language models. We evaluate Qwen2.5 and Qwen2.5-VL on a controlled spatial n-back task presented as matched text-rendered or image-rendered grids. Across conditions, models show reliably higher accuracy and d' with text than with vision. To interpret these differences at the process level, we use trial-wise log-probability evidence and find that nominal 2/3-back often fails to reflect the instructed lag and instead aligns with a recency-locked comparison. We further show that grid size alters recent-repeat structure in the stimulus stream, thereby changing interference and error patterns. These results motivate computation-sensitive interpretations of multimodal working memory.

[27] Beyond Rejection Sampling: Trajectory Fusion for Scaling Mathematical Reasoning

Jie Deng,Hanshuang Tong,Jun Li,Shining Liang,Ning Wu,Hongzhi Li,Yutao Xie

Main category: cs.CL

TL;DR: 本文提出TrajFusion方法,通过融合错误推理轨迹、反思提示与正确轨迹来增强大语言模型的数学推理能力,无需修改模型结构或训练目标,显著优于传统拒绝采样微调。

Details Motivation: 现有基于拒绝采样的微调方法仅保留正确推理路径,忽略教师生成的错误轨迹,导致对推理失败建模不足,限制了模型在复杂数学问题上的提升。 Method: TrajFusion将拒绝采样重构为结构化监督构建过程:交错拼接精选的错误轨迹、反射提示和正确轨迹,形成融合推理路径;其长度根据教师错误的频率与多样性自适应调整。 Result: 在多个数学基准测试中,TrajFusion持续超越传统拒绝采样微调(RFT),尤其在高难度和长推理任务上表现更优。 Conclusion: TrajFusion通过显式建模试错推理过程,有效利用错误信息提升数学推理能力,是一种简单、通用且高性能的微调策略。 Abstract: Large language models (LLMs) have made impressive strides in mathematical reasoning, often fine-tuned using rejection sampling that retains only correct reasoning trajectories. While effective, this paradigm treats supervision as a binary filter that systematically excludes teacher-generated errors, leaving a gap in how reasoning failures are modeled during training. In this paper, we propose TrajFusion, a fine-tuning strategy that reframes rejection sampling as a structured supervision construction process. Specifically, TrajFusion forms fused trajectories that explicitly model trial-and-error reasoning by interleaving selected incorrect trajectories with reflection prompts and correct trajectories. The length of each fused sample is adaptively controlled based on the frequency and diversity of teacher errors, providing richer supervision for challenging problems while safely reducing to vanilla rejection sampling fine-tuning (RFT) when error signals are uninformative. TrajFusion requires no changes to the architecture or training objective. Extensive experiments across multiple math benchmarks demonstrate that TrajFusion consistently outperforms RFT, particularly on challenging and long-form reasoning problems.

[28] Evaluating the Presence of Sex Bias in Clinical Reasoning by Large Language Models

Isabel Tsintsiper,Sheng Wong,Beth Albert,Shaun P Brennecke,Gabriel Davis Jones

Main category: cs.CL

TL;DR: 本文研究了当前大型语言模型(LLMs)在临床推理中是否存在性别偏差,发现不同模型存在稳定且特定的性别分配偏向(如ChatGPT偏女性、Gemini偏男性),且该偏差在温度参数调整和允许不作答时仍影响下游诊断结果,提示需谨慎配置、专科级数据审计与持续人工监督。

Details Motivation: 大型语言模型日益应用于医疗场景,但其训练数据可能继承并放大现实中的性别诊断与治疗差异,亟需系统评估其临床推理中的性别偏差。 Method: 基于50个由临床医生撰写的、性别对初始诊断路径无信息价值的病例 vignette,使用4个通用大模型(ChatGPT、Claude 3.7 Sonnet、Gemini 2.0 Flash、DeepSeekchat),在温度0.5下测试其对患者性别的预测倾向,并分析允许‘不作答’对诊断结果的影响。 Result: 所有模型均表现出显著且稳定的性别分配偏差:ChatGPT(70%女性)、DeepSeek(61%)、Claude(59%)偏女性,Gemini(36%女性)偏男性;允许模型 abstention 可减少显式性别标注,但未能消除下游诊断差异。 Conclusion: 当代通用大模型在临床推理中存在模型特异性的性别偏差,安全临床部署需保守配置、专科级数据审计及持续人工监督。 Abstract: Large language models (LLMs) are increasingly embedded in healthcare workflows for documentation, education, and clinical decision support. However, these systems are trained on large text corpora that encode existing biases, including sex disparities in diagnosis and treatment, raising concerns that such patterns may be reproduced or amplified. We systematically examined whether contemporary LLMs exhibit sex-specific biases in clinical reasoning and how model configuration influences these behaviours. We conducted three experiments using 50 clinician-authored vignettes spanning 44 specialties in which sex was non-informative to the initial diagnostic pathway. Four general-purpose LLMs (ChatGPT (gpt-4o-mini), Claude 3.7 Sonnet, Gemini 2.0 Flash and DeepSeekchat). All models demonstrated significant sex-assignment skew, with predicted sex differing by model. At temperature 0.5, ChatGPT assigned female sex in 70% of cases (95% CI 0.66-0.75), DeepSeek in 61% (0.57-0.65) and Claude in 59% (0.55-0.63), whereas Gemini showed a male skew, assigning a female sex in 36% of cases (0.32-0.41). Contemporary LLMs exhibit stable, model-specific sex biases in clinical reasoning. Permitting abstention reduces explicit labelling but does not eliminate downstream diagnostic differences. Safe clinical integration requires conservative and documented configuration, specialty-level clinical data auditing, and continued human oversight when deploying general-purpose models in healthcare settings.

[29] Bi-directional Bias Attribution: Debiasing Large Language Models without Modifying Prompts

Yujie Lin,Kunquan Li,Yixuan Liao,Xiaoxin Chen,Jinsong Su

Main category: cs.CL

TL;DR: 本文提出了一种无需微调或提示工程的LLM去偏框架,通过检测刻板印象诱导词并进行神经元级偏差归因,在投影层直接干预神经元激活以减轻偏差,实验证明其在减少偏差的同时保持模型性能。

Details Motivation: 现有去偏方法(如额外数据集微调或提示工程)存在可扩展性差或多轮交互中损害用户体验的问题。 Method: 1)通过跨人口统计群体的对比分析识别刻板印象诱导的形容词和名词;2)基于积分梯度的两种归因策略定位偏差相关神经元;3)在投影层直接干预这些神经元的激活以缓解偏差。 Result: 在三个主流大语言模型上的实验表明,该方法能有效降低偏差,同时保持模型整体性能。 Conclusion: 所提框架是一种高效、无侵入式的LLM去偏新范式,兼顾效果与实用性。 Abstract: Large language models (LLMs) have demonstrated impressive capabilities across a wide range of natural language processing tasks. However, their outputs often exhibit social biases, raising fairness concerns. Existing debiasing methods, such as fine-tuning on additional datasets or prompt engineering, face scalability issues or compromise user experience in multi-turn interactions. To address these challenges, we propose a framework for detecting stereotype-inducing words and attributing neuron-level bias in LLMs, without the need for fine-tuning or prompt modification. Our framework first identifies stereotype-inducing adjectives and nouns via comparative analysis across demographic groups. We then attribute biased behavior to specific neurons using two attribution strategies based on integrated gradients. Finally, we mitigate bias by directly intervening on their activations at the projection layer. Experiments on three widely used LLMs demonstrate that our method effectively reduces bias while preserving overall model performance. Code is available at the github link: https://github.com/XMUDeepLIT/Bi-directional-Bias-Attribution.

[30] Swordsman: Entropy-Driven Adaptive Block Partition for Efficient Diffusion Language Models

Yu Zhang,Xinchen Li,Jialei Zhou,Hongnan Ma,Zhongwei Wan,Yiwei Shi,Duoqian Miao,Qi Zhang,Longbing Cao

Main category: cs.CL

TL;DR: 本文提出Swordsman框架,一种基于熵驱动的自适应分块解码方法,用于提升扩散语言模型(DLMs)的推理速度与质量。该方法通过识别词元间熵变来动态划分语义/句法成分边界,并实时调整去掩码阈值,在无需训练的前提下显著提升性能。

Details Motivation: 现有分块解码方法采用固定分块策略,易割裂语义或句法成分,导致性能次优;受熵减假设(ERH)启发,作者认为成分边界更利于不确定性降低,故引入熵分析识别边界。 Method: 提出Swordsman框架:1)基于相邻词元间熵变自适应划分解码块,以对齐语义/句法成分;2)根据块内实时去掩码状态动态调整去掩码阈值;3)为训练无关方法,支持KV Cache加速。 Result: 作为训练无关框架,Swordsman在多项评估中达到SOTA性能,同时提升推理效率与稳定性。 Conclusion: 熵驱动的自适应分块策略能更合理地匹配语言结构,显著优于固定分块方法,为DLMs高效高质量解码提供了新范式。 Abstract: Block-wise decoding effectively improves the inference speed and quality in diffusion language models (DLMs) by combining inter-block sequential denoising and intra-block parallel unmasking. However, existing block-wise decoding methods typically partition blocks in a rigid and fixed manner, which inevitably fragments complete semantic or syntactic constituents, leading to suboptimal performance. Inspired by the entropy reduction hypothesis (ERH), we recognize that constituent boundaries offer greater opportunities for uncertainty reduction, which motivates us to employ entropy analysis for identifying constituent boundaries. Therefore, we propose Swordsman, an entropy-driven adaptive block-wise decoding framework for DLMs. Swordsman adaptively partitions blocks by identifying entropy shifts between adjacent tokens to better align with semantic or syntactic constituent boundaries. In addition, Swordsman dynamically adjusts unmasking thresholds conditioned on the real-time unmasking status within a block, further improving both efficiency and stability. As a training-free framework, supported by KV Cache, Swordsman demonstrates state-of-the-art performance across extensive evaluations.

[31] History-Guided Iterative Visual Reasoning with Self-Correction

Xinglong Yang,Zhilin Peng,Zhanzhan Liu,Haochen Shi,Sheng-Jun Huang

Main category: cs.CL

TL;DR: 本文提出H-GIVR框架,通过多轮视觉观察与历史答案参考实现动态错误修正,显著提升多模态大语言模型跨模态推理准确率,且计算开销低。

Details Motivation: 现有自一致性方法局限于固定重复采样与投票范式,无法复用历史推理信息,导致模型难以主动纠正视觉理解错误和动态调整推理过程。 Method: 提出H-GIVR框架,在迭代推理中让MLLM多次观察图像,并将先前生成的答案作为后续步骤的参考,从而实现动态错误修正。 Result: 在五个数据集和三个模型上的实验表明,H-GIVR显著提升跨模态推理准确率;例如在ScienceQA上使用Llama3.2-vision:11b时,平均2.57次响应即达78.90%准确率,较基线提升107%。 Conclusion: H-GIVR通过模拟人类反复验证与动态纠错的推理行为,有效提升了MLLM的推理可靠性与效率,为自一致性方法提供了新范式。 Abstract: Self-consistency methods are the core technique for improving the reasoning reliability of multimodal large language models (MLLMs). By generating multiple reasoning results through repeated sampling and selecting the best answer via voting, they play an important role in cross-modal tasks. However, most existing self-consistency methods are limited to a fixed ``repeated sampling and voting'' paradigm and do not reuse historical reasoning information. As a result, models struggle to actively correct visual understanding errors and dynamically adjust their reasoning during iteration. Inspired by the human reasoning behavior of repeated verification and dynamic error correction, we propose the H-GIVR framework. During iterative reasoning, the MLLM observes the image multiple times and uses previously generated answers as references for subsequent steps, enabling dynamic correction of errors and improving answer accuracy. We conduct comprehensive experiments on five datasets and three models. The results show that the H-GIVR framework can significantly improve cross-modal reasoning accuracy while maintaining low computational cost. For instance, using \texttt{Llama3.2-vision:11b} on the ScienceQA dataset, the model requires an average of 2.57 responses per question to achieve an accuracy of 78.90\%, representing a 107\% improvement over the baseline.

[32] Fine-Grained Activation Steering: Steering Less, Achieving More

Zijian Feng,Tianjiao Li,Zixiao Zhu,Hanzhang Zhou,Junlang Qian,Li Zhang,Jia Jim Deryl Chua,Lee Onn Mak,Gee Wah Ng,Kezhi Mao

Main category: cs.CL

TL;DR: 本文提出AUSteer方法,通过在原子单元(AU)级别进行激活干预,解决现有块级激活引导方法因特征混杂导致的粗粒度、低效和侵入性问题,显著提升大语言模型行为调控的精度与效率。

Details Motivation: 现有块级激活引导方法因块内激活异质性(混杂有益、无关和有害特征)而效果受限,需更细粒度的干预机制。 Method: 将块级激活分解为原子单元(AU)级别,每个AU对应权重矩阵的一个切片;通过对比样本计算激活动量全局筛选判别性AU,并为不同输入和AU分配自适应引导强度。 Result: AUSteer在多个大语言模型和任务上持续超越先进基线,且干预的激活数量大幅减少,验证了‘少干预、多收益’的有效性。 Conclusion: AU级细粒度引导能有效解耦混杂特征,提升行为调控的精准性与效率,为LLM可控生成提供了新范式。 Abstract: Activation steering has emerged as a cost-effective paradigm for modifying large language model (LLM) behaviors. Existing methods typically intervene at the block level, steering the bundled activations of selected attention heads, feedforward networks, or residual streams. However, we reveal that block-level activations are inherently heterogeneous, entangling beneficial, irrelevant, and harmful features, thereby rendering block-level steering coarse, inefficient, and intrusive. To investigate the root cause, we decompose block activations into fine-grained atomic unit (AU)-level activations, where each AU-level activation corresponds to a single dimension of the block activation, and each AU denotes a slice of the block weight matrix. Steering an AU-level activation is thus equivalent to steering its associated AU. Our theoretical and empirical analysis show that heterogeneity arises because different AUs or dimensions control distinct token distributions in LLM outputs. Hence, block-level steering inevitably moves helpful and harmful token directions together, which reduces efficiency. Restricting intervention to beneficial AUs yields more precise and effective steering. Building on this insight, we propose AUSteer, a simple and efficient method that operates at a finer granularity of the AU level. AUSteer first identifies discriminative AUs globally by computing activation momenta on contrastive samples. It then assigns adaptive steering strengths tailored to diverse inputs and selected AU activations. Comprehensive experiments on multiple LLMs and tasks show that AUSteer consistently surpasses advanced baselines while steering considerably fewer activations, demonstrating that steering less achieves more.

[33] No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data

Dmitry Karpov

Main category: cs.CL

TL;DR: 本文研究了五种突厥语系语言对的机器翻译,通过LoRA微调NLLB模型和基于检索的提示方法,在多个语言对上取得了显著的chrF++分数,并公开了数据集和模型权重。

Details Motivation: 探索突厥语系语言(如巴什基尔语、哈萨克语、吉尔吉斯语、鞑靼语、楚瓦什语)与俄语或英语之间的机器翻译性能,解决低资源语言翻译的挑战。 Method: 使用LoRA技术在合成数据上微调nllb-200-distilled-600M模型;对DeepSeek-V3.2采用基于检索的提示方法;还尝试了零样本方法。 Result: 哈萨克语chrF++达49.71,巴什基尔语达46.94,楚瓦什语达39.47,鞑靼语达41.6,吉尔吉斯语达45.6。同时发布了数据集和训练权重。 Conclusion: LoRA微调在部分突厥语对上效果显著,而检索增强提示和零样本方法在其他语言上也展现出潜力,验证了多种策略在低资源语言翻译中的适用性。 Abstract: We explore machine translation for five Turkic language pairs: Russian-Bashkir, Russian-Kazakh, Russian-Kyrgyz, English-Tatar, English-Chuvash. Fine-tuning nllb-200-distilled-600M with LoRA on synthetic data achieved chrF++ 49.71 for Kazakh and 46.94 for Bashkir. Prompting DeepSeek-V3.2 with retrieved similar examples achieved chrF++ 39.47 for Chuvash. For Tatar, zero-shot or retrieval-based approaches achieved chrF++ 41.6, while for Kyrgyz the zero-shot approach reached 45.6. We release the dataset and the obtained weights.

[34] Is Micro Domain-Adaptive Pre-Training Effective for Real-World Operations? Multi-Step Evaluation Reveals Potential and Bottlenecks

Masaya Tsunokake,Yuta Koreeda,Terufumi Morishita,Koichi Nagatsuka,Hikaru Tomonari,Yasuhiro Sogawa

Main category: cs.CL

TL;DR: 本文研究了微领域自适应预训练(mDAPT)在生成式任务中的有效性,发现其能显著提升事实提取能力,但对推理和答案生成能力提升有限;通过将回答过程解耦为提取、推理和生成三阶段,验证了mDAPT在IT技术支持场景中主要缓解知识提取瓶颈,而推理能力成为后续关键瓶颈。

Details Motivation: 现有研究仅验证mDAPT在多项选择题上的有效性,其在真实企业运营所需的生成式任务中的表现尚不明确,需系统揭示其潜力与瓶颈。 Method: 将生成式问答过程解耦为三个子任务:事实提取(eliciting)、基于事实的推理(reasoning)和长文本答案生成(composing),并在IT产品技术支持的真实专有知识上实证评估mDAPT对各子任务的影响。 Result: mDAPT显著改善了基础模型薄弱的事实提取能力,但未明显提升推理和答案生成性能;进一步分析表明,当提取和推理任务均被解决时,整体性能可达90%以上。 Conclusion: mDAPT的核心价值在于增强模型对微领域知识的提取能力,而非推理或生成能力;未来应重点提升模型的推理能力以突破当前瓶颈。 Abstract: When applying LLMs to real-world enterprise operations, LLMs need to handle proprietary knowledge in small domains of specific operations ($\textbf{micro domains}$). A previous study shows micro domain-adaptive pre-training ($\textbf{mDAPT}$) with fewer documents is effective, similarly to DAPT in larger domains. However, it evaluates mDAPT only on multiple-choice questions; thus, its effectiveness for generative tasks in real-world operations remains unknown. We aim to reveal the potential and bottlenecks of mDAPT for generative tasks. To this end, we disentangle the answering process into three subtasks and evaluate the performance of each subtask: (1) $\textbf{eliciting}$ facts relevant to questions from an LLM's own knowledge, (2) $\textbf{reasoning}$ over the facts to obtain conclusions, and (3) $\textbf{composing}$ long-form answers based on the conclusions. We verified mDAPT on proprietary IT product knowledge for real-world questions in IT technical support operations. As a result, mDAPT resolved the elicitation task that the base model struggled with but did not resolve other subtasks. This clarifies mDAPT's effectiveness in the knowledge aspect and its bottlenecks in other aspects. Further analysis empirically shows that resolving the elicitation and reasoning tasks ensures sufficient performance (over 90%), emphasizing the need to enhance reasoning capability.

[35] Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition

Jinlong Ma,Yu Zhang,Xuefeng Bai,Kehai Chen,Yuwei Wang,Zeming Liu,Jun Yu,Min Zhang

Main category: cs.CL

TL;DR: 本文提出Modality-aware Consistency Reasoning (MCR) 方法,通过多风格推理模式注入(MRSI)和约束引导的可验证优化(CVO),缓解多模态大语言模型(MLLMs)在端到端接地多模态命名实体识别(GMNER)中的模态偏见问题,显著提升性能。

Details Motivation: 现有MLLMs在GMNER任务中常作为辅助工具,且存在视觉或文本单模态捷径倾向(即模态偏见),难以实现严谨的跨模态验证,限制其端到端应用效果。 Method: 提出Modality-aware Consistency Reasoning(MCR),包含两个核心组件:1)Multi-style Reasoning Schema Injection(MRSI),将抽象约束转化为可执行推理链;2)Constraint-guided Verifiable Optimization(CVO),利用Group Relative Policy Optimization(GRPO)动态对齐推理路径。 Result: 在GMNER和视觉定位任务上,MCR有效缓解模态偏见,性能优于现有基线方法。 Conclusion: MLLMs具备端到端GMNER潜力,但需克服模态偏见;MCR通过结构化跨模态推理机制,为鲁棒多模态理解提供了新范式。 Abstract: Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. In this work, we explore the potential of Multimodal Large Language Models (MLLMs) to perform GMNER in an end-to-end manner, moving beyond their typical role as auxiliary tools within cascaded pipelines. Crucially, our investigation reveals a fundamental challenge: MLLMs exhibit $\textbf{modality bias}$, including visual bias and textual bias, which stems from their tendency to take unimodal shortcuts rather than rigorous cross-modal verification. To address this, we propose Modality-aware Consistency Reasoning ($\textbf{MCR}$), which enforces structured cross-modal reasoning through Multi-style Reasoning Schema Injection (MRSI) and Constraint-guided Verifiable Optimization (CVO). MRSI transforms abstract constraints into executable reasoning chains, while CVO empowers the model to dynamically align its reasoning trajectories with Group Relative Policy Optimization (GRPO). Experiments on GMNER and visual grounding tasks demonstrate that MCR effectively mitigates modality bias and achieves superior performance compared to existing baselines.

[36] Deconstructing sentence disambiguation by joint latent modeling of reading paradigms: LLM surprisal is not enough

Dario Paape,Tal Linzen,Shravan Vasishth

Main category: cs.CL

TL;DR: 本文提出了一种潜过程混合模型,用于分析人类在四种阅读范式下对暂时歧义花园路径句的阅读行为,区分了花园路径概率、成本及再分析成本,并通过考虑不专注阅读试次,提高了处理成本估计的真实性。

Details Motivation: 为更准确地建模人类对花园路径句的阅读行为,尤其是区分不同类型的处理成本并纳入不专注阅读的影响。 Method: 构建一个潜过程混合模型,整合眼动追踪、单/双向自定步调阅读和迷宫任务四种范式的数据,区分花园路径概率、花园路径成本和再分析成本,并与基于GPT-2惊异度的无混合模型进行交叉验证比较。 Result: 该模型能成功复现重读行为、理解性问题作答和语法判断等实证模式;交叉验证表明其对人类阅读模式和试验末尾任务数据的预测拟合优于GPT-2惊异度基线模型。 Conclusion: 潜过程混合模型更真实地刻画了人类句子加工中的异质性,为未来计算语言学与认知建模的结合提供了新方法论路径。 Abstract: Using temporarily ambiguous garden-path sentences ("While the team trained the striker wondered ...") as a test case, we present a latent-process mixture model of human reading behavior across four different reading paradigms (eye tracking, uni- and bidirectional self-paced reading, Maze). The model distinguishes between garden-path probability, garden-path cost, and reanalysis cost, and yields more realistic processing cost estimates by taking into account trials with inattentive reading. We show that the model is able to reproduce empirical patterns with regard to rereading behavior, comprehension question responses, and grammaticality judgments. Cross-validation reveals that the mixture model also has better predictive fit to human reading patterns and end-of-trial task data than a mixture-free model based on GPT-2-derived surprisal values. We discuss implications for future work.

[37] PersoDPO: Scalable Preference Optimization for Instruction-Adherent, Persona-Grounded Dialogue via Multi-LLM Evaluation

Saleh Afzoon,MohammadHossein Ahmadi,Usman Naseem,Amin Beheshti

Main category: cs.CL

TL;DR: 本文提出了PersoDPO框架,利用自动评估信号(聚焦连贯性、个性化及指令遵循)构建偏好对,无需人工标注,从而有效提升开源大模型在人格化对话中的一致性与个性化能力。

Details Motivation: 现有开源大语言模型虽具备良好通用对话能力,但在人格化对话中难以同时保证上下文连贯性与人格线索一致性。 Method: 提出PersoDPO——一种可扩展的偏好优化框架,融合自动评估指标(连贯性、个性化、长度/格式合规性)生成高质量偏好对,用于微调对话模型。 Result: 在FoCus数据集上,经PersoDPO微调的开源模型在多项指标上持续超越强开源基线及标准DPO变体。 Conclusion: PersoDPO提供了一种无需人工标注、可复现且可扩展的训练范式,显著提升了开源模型在人格化对话任务中的性能。 Abstract: Personalization and contextual coherence are two essential components in building effective persona-grounded dialogue systems. These aspects play a crucial role in enhancing user engagement and ensuring responses are more relevant and consistent with user identity. However, recent studies indicate that open-source large language models (LLMs) continue to struggle to generate responses that are both contextually grounded and aligned with persona cues, despite exhibiting strong general conversational abilities like fluency and naturalness. We present PersoDPO, a scalable preference optimisation framework that uses supervision signals from automatic evaluations of responses generated by both closed-source and open-source LLMs to fine-tune dialogue models. The framework integrates evaluation metrics targeting coherence and personalization, along with a length-format compliance feature to promote instruction adherence. These signals are combined to automatically construct high-quality preference pairs without manual annotation, enabling a scalable and reproducible training pipeline. Experiments on the FoCus dataset show that an open-source language model fine-tuned with the PersoDPO framework consistently outperforms strong open-source baselines and a standard Direct Preference Optimization (DPO) variant across multiple evaluation dimensions.

[38] Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models

Hyeontaek Hwang,Nguyen Dinh Son,Daeyoung Kim

Main category: cs.CL

TL;DR: 本文提出Model-Dowser,一种针对多模态大语言模型(MLLMs)的稀疏微调方法,通过综合权值大小、输入激活和输出敏感性来评估参数重要性,选择性冻结高重要性参数以缓解灾难性遗忘,在保持资源效率的同时可扩展至数十亿参数模型。

Details Motivation: 现有缓解MLLM微调中灾难性遗忘的方法在深层语言解码器微调时失效,或随模型增大而扩展性差。 Method: 提出Model-Dowser:基于权值幅度、输入激活与输出敏感性联合计算参数重要性分数;微调中仅更新低重要性参数,冻结高重要性参数。 Result: 在LLaVA和NVILA两个代表性MLLM上验证,Model-Dowser显著缓解灾难性遗忘,性能持续优于先前方法,且资源高效、可扩展。 Conclusion: Model-Dowser为MLLM提供了一种高效、可扩展的稀疏微调范式,有效平衡下游任务适配与预训练能力保留。 Abstract: Fine-tuning Multimodal Large Language Models (MLLMs) on task-specific data is an effective way to improve performance on downstream applications. However, such adaptation often leads to a degradation in generalization on pretrained tasks, a phenomenon known as Catastrophic Forgetting. Existing methods that aim to mitigate this issue either become ineffective when fine-tuning deeper layers of the language decoder or scale poorly with increasing model size. To address these limitations, we propose Model-Dowser, a novel sparse fine-tuning approach for MLLMs. Model-Dowser measures a principled importance score for each model parameter with respect to pretrained generalization (prior to downstream adaptation) by jointly considering weight magnitudes, input activations, and output sensitivities. During fine-tuning, Model-Dowser selectively preserves high-importance parameters and updates the remaining. Comprehensive experiments on two representative MLLMs, LLaVA and NVILA, demonstrate that Model-Dowser effectively mitigates catastrophic forgetting and consistently outperforms prior methods, while remaining resource-efficient and scalable to multi-billion-parameter models.

[39] ReFRAME or Remain: Unsupervised Lexical Semantic Change Detection with Frame Semantics

Bach Phan-Tat,Kris Heylen,Dirk Geeraerts,Stefano De Pascale,Dirk Speelman

Main category: cs.CL

TL;DR: 本文提出了一种基于框架语义学的词汇语义变化检测新方法,该方法在效果上可媲美甚至超越主流分布语义模型,且具有高度可解释性。

Details Motivation: 现有基于神经词嵌入的语义变化检测方法虽性能良好,但结果难以解释,亟需更透明、可理解的替代方案。 Method: 完全基于框架语义学(frame semantics)构建语义变化检测方法,不依赖词嵌入或分布表示。 Result: 该方法在LSC基准任务上表现有效,部分情况下优于多种分布语义模型;定量与定性分析证实其预测既合理又高度可解释。 Conclusion: 框架语义学提供了一种有竞争力且更可解释的LSC检测路径,为语义变化研究提供了新范式。 Abstract: The majority of contemporary computational methods for lexical semantic change (LSC) detection are based on neural embedding distributional representations. Although these models perform well on LSC benchmarks, their results are often difficult to interpret. We explore an alternative approach that relies solely on frame semantics. We show that this method is effective for detecting semantic change and can even outperform many distributional semantic models. Finally, we present a detailed quantitative and qualitative analysis of its predictions, demonstrating that they are both plausible and highly interpretable

[40] $C$-$ΔΘ$: Circuit-Restricted Weight Arithmetic for Selective Refusal

Aditya Kasliwal,Pratinav Seth,Vinay Kumar Sankarapu

Main category: cs.CL

TL;DR: 本文提出C-Δθ方法,将选择性拒绝机制完全离线化,通过电路受限的权重更新实现安全策略部署,无需推理时干预,显著降低计算开销。

Details Motivation: 现有大语言模型的安全控制多依赖推理时干预,带来持续计算成本和部署复杂性;激活引导等方法虽常用,但仍需运行时钩子且成本随生成次数增加。 Method: 提出C-Δθ(Circuit Restricted Weight Arithmetic):(i) 利用EAP-IG定位拒绝行为相关的稀疏计算电路;(ii) 仅在该电路上计算受限权重更新ΔθC(通常覆盖<5%参数),生成可直接部署的编辑后检查点。 Result: C-Δθ实现了类别定向的选择性拒绝,在拒绝与效用基准测试中保持良好能力保留,且完全消除推理时钩子,将成本从每次请求转为一次性离线更新。 Conclusion: 选择性拒绝可被蒸馏为电路受限的权重更新,实现高效、轻量、即插即用的安全策略部署,为LLM安全机制提供新范式。 Abstract: Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose C-Δθ: Circuit Restricted Weight Arithmetic, which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update ΔθC supported only on that circuit (typically <5% of parameters). Applying ΔθC yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.

[41] LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding

Gang Lin,Dongfang Li,Zhuoen Chen,Yukun Shi,Xuhui Chen,Baotian Hu,Min Zhang

Main category: cs.CL

TL;DR: 本文提出LycheeDecode,一种基于细粒度混合头注意力机制的高效解码方法,通过将注意力头划分为检索头和稀疏头,动态选择关键token并复用,显著提升长上下文LLM推理速度(最高2.7倍)且不牺牲生成质量。

Details Motivation: 长上下文大语言模型在解码过程中键值缓存急剧膨胀,带来高内存与延迟开销;现有跨层共享关键token的方法因粗粒度共享忽视了注意力头的功能多样性,损害性能。 Method: 提出LycheeDecode方法,核心是细粒度混合头注意力机制:采用HardKuma机制将注意力头分为少量动态检索关键token的检索头和大量复用这些token的稀疏头,并结合硬件友好的top-k选择策略。 Result: 在Llama3、Qwen3等主流模型及LongBench、RULER、AIME24、OlympiadBench等多个长上下文理解与复杂推理基准上验证,LycheeDecode在128K上下文长度下实现最高2.7倍加速,同时生成质量媲美甚至超越全注意力基线。 Conclusion: 通过保留注意力头的功能多样性,LycheeDecode克服了现有方法的性能瓶颈,为高效高质量的长上下文LLM推理提供了经过验证的有效路径。 Abstract: The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-k selection strategy. Specifically, the novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of sparse heads that reuse them for efficient computation. Through extensive experiments on leading models like Llama3 and Qwen3 across diverse benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing even the full-attention baseline. Crucially, this is accomplished with up to a 2.7x speedup at a 128K context length. By preserving the functional diversity of attention heads, our fine-grained strategy overcomes the performance bottlenecks of existing methods, providing a powerful and validated pathway to both efficient and high-quality long-context LLM inference.

[42] Rethinking Weight Tying: Pseudo-Inverse Tying for Stable LM Training and Updates

Jian Gu,Aldeida Aleti,Chunyang Chen,Hongyu Zhang

Main category: cs.CL

TL;DR: 本文提出伪逆绑定(PIT)方法,通过将嵌入层与输出层建模为共享正交隐空间的耦合投影,保证token接口在整个训练过程中保持伪逆一致性,从而提升训练稳定性与后训练干预的可预测性。

Details Motivation: 权重绑定虽减少参数,但训练中嵌入与解嵌入间的对应关系易漂移,导致优化敏感、后训练干预不可控。 Method: PIT构建共享正交隐空间(通过薄极分解或随机正交初始化),引入由Cholesky因子参数化的对称正定隐空间变换;输出头先变换隐状态再投影,嵌入层用稳定三角求解应用逆变换,避免显式伪逆计算和额外词表尺寸参数。 Result: 在256M–1.3B参数的端侧模型上验证,PIT显著提升训练稳定性、层间语义一致性,并大幅降低后训练干预副作用。 Conclusion: PIT通过几何约束建模嵌入-解嵌入关系,为紧凑语言模型提供了更鲁棒、可干预的参数绑定范式。 Abstract: Weight tying is widely used in compact language models to reduce parameters by sharing the token table between the input embedding and the output projection. However, weight sharing does not guarantee a stable token interface: during training, the correspondence between encoding tokens into hidden states and decoding hidden states into logits can drift, worsening optimization sensitivity and making post-training interventions such as editing, patching, and lightweight adaptation less predictable. We propose Pseudo-Inverse Tying (PIT), which synchronizes embedding and unembedding as coupled projections of a shared latent token memory, guaranteeing a pseudo-inverse-consistent interface throughout training. PIT maintains an orthonormal shared memory, obtained by thin polar decomposition for teacher initialization or random orthonormal initialization from scratch, and introduces a fully learned symmetric positive definite hidden-space transform parameterized via a Cholesky factor. The output head applies this transform to hidden states before the vocabulary projection, while the embedding applies the inverse transform to token vectors using stable triangular solves, avoiding explicit pseudo-inverse recomputation and any vocabulary-sized auxiliary parameters. We evaluate PIT on on-device models spanning 256M-1.3B parameters across pretraining and adaptation, and consistently observe improved training stability, stronger layerwise semantic consistency, and substantially reduced side effects.

[43] Textual Planning with Explicit Latent Transitions

Eliezer Shlomi,Ido Levy,Eilam Shapira,Michael Katz,Guy Uziel,Segev Shlomov,Nir Mashkif,Roi Reichart,Sarah Keren

Main category: cs.CL

TL;DR: EmbedPlan 提出了一种在冻结语言嵌入空间中进行轻量级状态转移建模的方法,替代传统LLM的自回归生成,以加速规划过程;实验表明其在域内泛化效果好,但跨域泛化仍具挑战。

Details Motivation: 现有基于大语言模型(LLM)的规划方法受限于逐token生成和重复全量前向传播,导致多步前瞻与rollout搜索在延迟和计算开销上代价高昂。 Method: EmbedPlan 利用冻结的语言模型编码器将自然语言描述的状态和动作映射为嵌入向量,在该嵌入空间中训练一个轻量级过渡模型预测下一状态嵌入,并通过最近邻检索还原为自然语言状态,无需微调编码器。 Result: 在九个经典规划领域、六种递进难度评估协议(插值、计划变体、外推、多域、跨域、留一法)下,EmbedPlan 在插值任务中接近完美,但在需泛化至未见问题或未见领域的任务中性能显著下降;计划变体评估显示其能泛化到替代路径而非简单记忆轨迹。 Conclusion: 冻结语言嵌入可支持在单一规划域内有效学习动力学,但跨域迁移仍是当前主要瓶颈。 Abstract: Planning with LLMs is bottlenecked by token-by-token generation and repeated full forward passes, making multi-step lookahead and rollout-based search expensive in latency and compute. We propose EmbedPlan, which replaces autoregressive next-state generation with a lightweight transition model operating in a frozen language embedding space. EmbedPlan encodes natural language state and action descriptions into vectors, predicts the next-state embedding, and retrieves the next state by nearest-neighbor similarity, enabling fast planning computation without fine-tuning the encoder. We evaluate next-state prediction across nine classical planning domains using six evaluation protocols of increasing difficulty: interpolation, plan-variant, extrapolation, multi-domain, cross-domain, and leave-one-out. Results show near-perfect interpolation performance but a sharp degradation when generalization requires transfer to unseen problems or unseen domains; plan-variant evaluation indicates generalization to alternative plans rather than memorizing seen trajectories. Overall, frozen embeddings support within-domain dynamics learning after observing a domain's transitions, while transfer across domain boundaries remains a bottleneck.

[44] Can LLMs capture stable human-generated sentence entropy measures?

Estrella Pivel-Villanueva,Elisabeth Frederike Sterner,Franziska Knolle

Main category: cs.CL

TL;DR: 本文通过大规模cloze数据集分析了人类语言预测熵的稳定性,并评估了大语言模型(LLM)在生成与人类一致的词级熵估计方面的能力;结果表明,人类熵估计在不同可预测性句子中收敛所需响应数差异显著,GPT-4o在多种LLM中与人类熵最接近,但方法和提示设计影响显著。

Details Motivation: 缺乏关于人类反应数量对词级Shannon熵估计稳定性和无偏性的实证共识;同时,大语言模型常被用作人类norming数据的替代,但其能否准确复现人类熵尚不明确。 Method: 使用德语和英语两个大型公开cloze数据集,实施基于自助法(bootstrap)的收敛性分析,追踪熵估计随样本量增加的稳定过程;并对比多个LLM(如GPT-4o、GPT2-xl、RoBERTa、LLaMA2等)通过logit提取和采样频率两种方式生成的熵估计与稳定人类熵的一致性。 Result: 97%以上句子在可用样本量内达到稳定熵估计;德语和英语分别在约111和81个响应后90%句子收敛;低熵句子仅需约20响应,高熵句子需更多;GPT-4o与人类熵一致性最高,但logit法误差更小,采样法更能反映人类变异性分布。 Conclusion: 为人类norming实践提供了首个直接实证支持,表明熵收敛高度依赖句子可预测性;LLM虽可近似人类熵,但不能完全替代稳定的人类分布,实际应用中需谨慎选择模型与提取方法。 Abstract: Predicting upcoming words is a core mechanism of language comprehension and may be quantified using Shannon entropy. There is currently no empirical consensus on how many human responses are required to obtain stable and unbiased entropy estimates at the word level. Moreover, large language models (LLMs) are increasingly used as substitutes for human norming data, yet their ability to reproduce stable human entropy remains unclear. Here, we address both issues using two large publicly available cloze datasets in German 1 and English 2. We implemented a bootstrap-based convergence analysis that tracks how entropy estimates stabilize as a function of sample size. Across both languages, more than 97% of sentences reached stable entropy estimates within the available sample sizes. 90% of sentences converged after 111 responses in German and 81 responses in English, while low-entropy sentences (<1) required as few as 20 responses and high-entropy sentences (>2.5) substantially more. These findings provide the first direct empirical validation for common norming practices and demonstrate that convergence critically depends on sentence predictability. We then compared stable human entropy values with entropy estimates derived from several LLMs, including GPT-4o, using both logit-based probability extraction and sampling-based frequency estimation, GPT2-xl/german-GPT-2, RoBERTa Base/GottBERT, and LLaMA 2 7B Chat. GPT-4o showed the highest correspondence with human data, although alignment depended strongly on the extraction method and prompt design. Logit-based estimates minimized absolute error, whereas sampling-based estimates were better in capturing the dispersion of human variability. Together, our results establish practical guidelines for human norming and show that while LLMs can approximate human entropy, they are not interchangeable with stable human-derived distributions.

[45] Semantic Self-Distillation for Language Model Uncertainty

Edward Phillips,Sean Wu,Boyan Gao,David A. Clifton

Main category: cs.CL

TL;DR: 本文提出Semantic Self-Distillation (SSD)方法,通过轻量级学生模型预测语义分布及其熵,以高效估计大语言模型输出的不确定性,用于幻觉检测与域外答案识别。

Details Motivation: 大语言模型难以进行原则性的不确定性量化,而现有基于采样语义分散度的方法计算开销大、不适用于低延迟场景。 Method: 将采样得到的语义分布蒸馏为轻量级学生模型,该模型在生成答案前即可预测prompt条件下的语义分布,利用其熵和概率密度提供不确定性信号。 Result: 在TriviaQA上,学生模型在幻觉预测任务中匹配或超越有限样本语义分散度,并能有效检测域外答案。 Conclusion: SSD为复杂输出空间(不限于语言)中的预测不确定性蒸馏提供了一种通用框架。 Abstract: Large language models present challenges for principled uncertainty quantification, in part due to their complexity and the diversity of their outputs. Semantic dispersion, or the variance in the meaning of sampled answers, has been proposed as a useful proxy for model uncertainty, but the associated computational cost prohibits its use in latency-critical applications. We show that sampled semantic distributions can be distilled into lightweight student models which estimate a prompt-conditioned uncertainty before the language model generates an answer token. The student model predicts a semantic distribution over possible answers; the entropy of this distribution provides an effective uncertainty signal for hallucination prediction, and the probability density allows candidate answers to be evaluated for reliability. On TriviaQA, our student models match or outperform finite-sample semantic dispersion for hallucination prediction and provide a strong signal for out-of-domain answer detection. We term this technique Semantic Self-Distillation (SSD), which we suggest provides a general framework for distilling predictive uncertainty in complex output spaces beyond language.

[46] Trust The Typical

Debargha Ganguly,Sreehari Sankar,Biyao Zhang,Vikash Singh,Kanan Gupta,Harshini Kavuru,Alan Luo,Weicong Chen,Warren Morningstar,Raghu Machiraju,Vipin Chaudhary

Main category: cs.CL

TL;DR: 本文提出Trust The Typical (T3)框架,将LLM安全建模为语义空间中的分布外(OOD)检测问题,仅用安全样本训练即可实现跨语言、跨领域的高效安全防护,显著降低误报率并具备生产级部署能力。

Details Motivation: 现有LLM安全方法依赖于识别和屏蔽已知威胁的脆弱对抗机制;作者主张应转向以‘理解什么是安全’为核心的安全范式。 Method: T3框架在语义空间中学习可接受提示的分布,将显著偏离该分布的输入视为潜在威胁;无需有害样本训练,仅使用安全英文文本训练单个模型。 Result: 在18个基准测试(涵盖毒性、仇恨言论、越狱、多语言危害及过拒绝)上达到SOTA性能,误报率最高降低40倍;单模型可零样本迁移至14种以上语言和多种领域;GPU优化版本集成至vLLM,生成时持续防护开销<6%。 Conclusion: 安全不应依赖穷举有害内容,而应基于对安全分布的建模;T3验证了OOD检测范式在LLM安全中的有效性、泛化性与工程实用性。 Abstract: Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6% overhead even under dense evaluation intervals on large-scale workloads.

[47] VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration

Jaeyoon Jung,Yejun Yoon,Seunghyun Yoon,Kunwoo Park

Main category: cs.CL

TL;DR: VILLAIN是一个基于多智能体协作的多模态事实核查系统,利用视觉-语言模型在多个阶段协同验证图文声明,在AVerImaTeC共享任务中排名第一。

Details Motivation: 解决图文声明的事实核查问题,尤其需融合文本与视觉证据并处理模态间不一致性。 Method: 采用提示驱动的多智能体协作框架:先检索文本和视觉证据;再由模态特异性和跨模态智能体生成分析报告;接着生成基于报告的问答对;最后由判决预测智能体输出核查结果。 Result: 在AVerImaTeC共享任务中,VILLAIN在所有评估指标上均排名第一。 Conclusion: 提示驱动的多智能体协作范式能有效提升多模态事实核查性能,且系统开源可复现。 Abstract: This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration. For the AVerImaTeC shared task, VILLAIN employs vision-language model agents across multiple stages of fact-checking. Textual and visual evidence is retrieved from the knowledge store enriched through additional web collection. To identify key information and address inconsistencies among evidence items, modality-specific and cross-modal agents generate analysis reports. In the subsequent stage, question-answer pairs are produced based on these reports. Finally, the Verdict Prediction agent produces the verification outcome based on the image-text claim and the generated question-answer pairs. Our system ranked first on the leaderboard across all evaluation metrics. The source code is publicly available at https://github.com/ssu-humane/VILLAIN.

[48] Beyond Holistic Scores: Automatic Trait-Based Quality Scoring of Argumentative Essays

Lucile Favero,Juan Antonio Pérez-Ortiz,Tanja Käser,Nuria Oliver

Main category: cs.CL

TL;DR: 本文研究了基于特征的自动议论文评分方法,提出了两种互补的建模范式:基于小开源大语言模型的结构化上下文学习和基于BigBird编码器与CORAL序数回归的监督模型,并在ASAP++数据集上验证了其有效性。

Details Motivation: 传统自动作文评分系统主要关注整体分数,难以满足教育场景中教师和学习者对可解释、符合教学目标和评分标准的细粒度反馈的需求。 Method: 提出两种建模范式:(1) 基于小开源大语言模型的结构化上下文学习,结合评分标准设计提示;(2) 基于BigBird编码器与CORAL序数回归的监督模型,显式建模分数的序数关系。 Result: 显式建模分数序数性显著提升与人工评分者的一致性,优于大语言模型及名义分类/回归基线;小开源大语言模型无需微调即具竞争力,尤其在推理类特征上表现突出。 Conclusion: 模型目标需与评分标准语义对齐;小开源大语言模型可在保护隐私、本地部署前提下提供可解释反馈,为AI教育系统设计提供方法论与实践启示。 Abstract: Automated Essay Scoring systems have traditionally focused on holistic scores, limiting their pedagogical usefulness, especially in the case of complex essay genres such as argumentative writing. In educational contexts, teachers and learners require interpretable, trait-level feedback that aligns with instructional goals and established rubrics. In this paper, we study trait-based Automatic Argumentative Essay Scoring using two complementary modeling paradigms designed for realistic educational deployment: (1) structured in-context learning with small open-source LLMs, and (2) a supervised, encoder-based BigBird model with a CORAL-style ordinal regression formulation, optimized for long-sequence understanding. We conduct a systematic evaluation on the ASAP++ dataset, which includes essay scores across five quality traits, offering strong coverage of core argumentation dimensions. LLMs are prompted with designed, rubric-aligned in-context examples, along with feedback and confidence requests, while we explicitly model ordinality in scores with the BigBird model via the rank-consistent CORAL framework. Our results show that explicitly modeling score ordinality substantially improves agreement with human raters across all traits, outperforming LLMs and nominal classification and regression-based baselines. This finding reinforces the importance of aligning model objectives with rubric semantics for educational assessment. At the same time, small open-source LLMs achieve a competitive performance without task-specific fine-tuning, particularly for reasoning-oriented traits, while enabling transparent, privacy-preserving, and locally deployable assessment scenarios. Our findings provide methodological, modeling, and practical insights for the design of AI-based educational systems that aim to deliver interpretable, rubric-aligned feedback for argumentative writing.

[49] RexBERT: Context Specialized Bidirectional Encoders for E-commerce

Rahul Bajaj,Anuj Garg

Main category: cs.CL

TL;DR: RexBERT is a family of BERT-style encoder models tailored for e-commerce, trained on the newly introduced 350B-token Ecom-niverse corpus and a three-phase pretraining recipe; it achieves superior performance on e-commerce tasks despite fewer parameters than general-purpose encoders.

Details Motivation: General-purpose encoder-only transformers lack sufficient coverage of specialized e-commerce semantics, limiting their effectiveness in domain-specific retrieval, classification, and ranking tasks. Method: Introduce Ecom-niverse (350B-token e-commerce corpus), propose a modular data curation pipeline, adopt ModernBERT's architectural advances, and design a three-phase pretraining recipe: general pre-training, context extension, and annealed domain specialization. Result: RexBERT models (17M–400M parameters) outperform larger general-purpose encoders and match or surpass modern long-context models on e-commerce token classification, semantic similarity, and NLU benchmarks. Conclusion: High-quality in-domain data combined with a principled, phased pretraining strategy yields stronger e-commerce encoders than brute-force model scaling. Abstract: Encoder-only transformers remain indispensable in retrieval, classification, and ranking systems where latency, stability, and cost are paramount. Most general purpose encoders, however, are trained on generic corpora with limited coverage of specialized domains. We introduce RexBERT, a family of BERT-style encoders designed specifically for e-commerce semantics. We make three contributions. First, we release Ecom-niverse, a 350 billion token corpus curated from diverse retail and shopping sources. We describe a modular pipeline that isolates and extracts e-commerce content from FineFineWeb and other open web resources, and characterize the resulting domain distribution. Second, we present a reproducible pretraining recipe building on ModernBERT's architectural advances. The recipe consists of three phases: general pre-training, context extension, and annealed domain specialization. Third, we train RexBERT models ranging from 17M to 400M parameters and evaluate them on token classification, semantic similarity, and general natural language understanding tasks using e-commerce datasets. Despite having 2-3x fewer parameters, RexBERT outperforms larger general-purpose encoders and matches or surpasses modern long-context models on domain-specific benchmarks. Our results demonstrate that high quality in-domain data combined with a principled training approach provides a stronger foundation for e-commerce applications than indiscriminate scaling alone.

[50] Focus-LIME: Surgical Interpretation of Long-Context Large Language Models via Proxy-Based Neighborhood Selection

Junhao Liu,Haonan Yu,Zhenyu Yan,Xin Zhang

Main category: cs.CL

TL;DR: 本文提出Focus-LIME,一种粗到细的框架,通过代理模型优化扰动邻域,提升大语言模型在长上下文场景下的特征级可解释性与归因保真度。

Details Motivation: 现有局部模型无关解释方法在大规模上下文场景中面临归因稀释问题,难以提供高保真的手术级特征解释,尤其在法律审计、代码调试等高风险任务中亟需解决。 Method: 提出Focus-LIME框架:首先用轻量代理模型筛选关键上下文区域(粗粒度),再在该优化后的子空间内对目标模型进行细粒度特征归因。 Result: 在多个长上下文基准测试中,Focus-LIME显著提升了归因的保真度与可解释性,使手术级解释变得切实可行。 Conclusion: Focus-LIME有效缓解了高维特征空间下的归因稀释问题,为大语言模型在长上下文任务中的可信解释提供了新范式。 Abstract: As Large Language Models (LLMs) scale to handle massive context windows, achieving surgical feature-level interpretation is essential for high-stakes tasks like legal auditing and code debugging. However, existing local model-agnostic explanation methods face a critical dilemma in these scenarios: feature-based methods suffer from attribution dilution due to high feature dimensionality, thus failing to provide faithful explanations. In this paper, we propose Focus-LIME, a coarse-to-fine framework designed to restore the tractability of surgical interpretation. Focus-LIME utilizes a proxy model to curate the perturbation neighborhood, allowing the target model to perform fine-grained attribution exclusively within the optimized context. Empirical evaluations on long-context benchmarks demonstrate that our method makes surgical explanations practicable and provides faithful explanations to users.

[51] Disentangling meaning from language in LLM-based machine translation

Théo Lasnier,Armel Zebaze,Djamé Seddah,Rachel Bawden,Benoît Sagot

Main category: cs.CL

TL;DR: 本文研究了大语言模型在机器翻译中的机制可解释性,发现不同注意力头专门负责目标语言识别和句子语义保持两个子任务,并通过构造子任务特定的引导向量,在仅修改1%相关头的情况下实现了无需指令的高质量翻译。

Details Motivation: 大规模语言模型在机器翻译中的机制可解释性研究受限于模型规模,以往工作多局限于词级分析,缺乏对句级翻译机制的深入理解。 Method: 通过分析注意力头,将机器翻译分解为目标语言识别和句子等价性保持两个子任务,并在三个开源模型家族及20种翻译方向上进行实证分析;进一步构建子任务特定的引导向量并开展干预实验。 Result: 发现不同稀疏注意力头集分别专精于两个子任务;仅修改1%相关头即可实现与指令式提示相当的无指令翻译性能;选择性地消融这些头会特异性破坏对应翻译功能。 Conclusion: 句级机器翻译能力在模型内部由分工明确的稀疏注意力头实现,该发现为机制驱动的模型编辑与可控翻译提供了新路径。 Abstract: Mechanistic Interpretability (MI) seeks to explain how neural networks implement their capabilities, but the scale of Large Language Models (LLMs) has limited prior MI work in Machine Translation (MT) to word-level analyses. We study sentence-level MT from a mechanistic perspective by analyzing attention heads to understand how LLMs internally encode and distribute translation functions. We decompose MT into two subtasks: producing text in the target language (i.e. target language identification) and preserving the input sentence's meaning (i.e. sentence equivalence). Across three families of open-source models and 20 translation directions, we find that distinct, sparse sets of attention heads specialize in each subtask. Based on this insight, we construct subtask-specific steering vectors and show that modifying just 1% of the relevant heads enables instruction-free MT performance comparable to instruction-based prompting, while ablating these heads selectively disrupts their corresponding translation functions.

[52] LEAD: Layer-wise Expert-aligned Decoding for Faithful Radiology Report Generation

Ruixiao Yang,Yuanhe Tian,Xu Yang,Huiqi Li,Yan Song

Main category: cs.CL

TL;DR: 本文提出Layer-wise Expert-aligned Decoding (LEAD)方法,通过在LVLM解码过程中逐层引入多专家病理特征并结合门控机制,动态校正解码偏差,提升放射科报告生成的事实一致性与临床准确性,有效缓解幻觉问题。

Details Motivation: 现有大视觉语言模型(LVLM)在放射科报告生成中存在图像无关的幻觉问题;外部知识引导方法忽视了预训练模型固有的解码先验和跨模态对齐偏差,且鲁棒性不足。 Method: 提出LEAD方法:设计多专家模块提取不同病理特征,并通过门控机制将这些特征逐层注入解码器各层,在每步推理中动态调用专家信息以校正解码轨迹。 Result: 在多个公开数据集上实验表明,LEAD显著提升临床准确性指标,有效缓解幻觉,同时保持高质量文本生成能力。 Conclusion: LEAD是一种无需外部知识干预、从解码过程内部提升LVLM事实一致性的有效范式,为医学报告生成提供了更鲁棒、可解释的对齐机制。 Abstract: Radiology Report Generation (RRG) aims to produce accurate and coherent diagnostics from medical images. Although large vision language models (LVLM) improve report fluency and accuracy, they exhibit hallucinations, generating plausible yet image-ungrounded pathological details. Existing methods primarily rely on external knowledge guidance to facilitate the alignment between generated text and visual information. However, these approaches often ignore the inherent decoding priors and vision-language alignment biases in pretrained models and lack robustness due to reliance on constructed guidance. In this paper, we propose Layer-wise Expert-aligned Decoding (LEAD), a novel method to inherently modify the LVLM decoding trajectory. A multiple experts module is designed for extracting distinct pathological features which are integrated into each decoder layer via a gating mechanism. This layer-wise architecture enables the LLM to consult expert features at every inference step via a learned gating function, thereby dynamically rectifying decoding biases and steering the generation toward factual consistency. Experiments conducted on multiple public datasets demonstrate that the LEAD method yields effective improvements in clinical accuracy metrics and mitigates hallucinations while preserving high generation quality.

[53] Mapping the Web of Science, a large-scale graph and text-based dataset with LLM embeddings

Tim Kunt,Annika Buchholz,Imene Khebouri,Thorsten Koch,Ida Litzel,Thi Huong Vu

Main category: cs.CL

TL;DR: 本文提出了一种结合文本语义特征和文本间图结构关系的嵌入方法,并在Web of Science数据集(约5600万篇论文)上验证了其有效性,揭示了文本的自组织结构。

Details Motivation: 大型文本数据集同时包含文本语义信息和文本间的关系结构(如链接、引用等),需融合二者以提升分析能力。 Method: 提出一种融合LLM文本嵌入与图结构信息的嵌入方法。 Result: 在Web of Science数据集上成功揭示了大规模科学文献的自组织结构。 Conclusion: 该嵌入方法兼具语义表征与结构建模能力,具备实际应用价值。 Abstract: Large text data sets, such as publications, websites, and other text-based media, inherit two distinct types of features: (1) the text itself, its information conveyed through semantics, and (2) its relationship to other texts through links, references, or shared attributes. While the latter can be described as a graph structure and can be handled by a range of established algorithms for classification and prediction, the former has recently gained new potential through the use of LLM embedding models. Demonstrating these possibilities and their practicability, we investigate the Web of Science dataset, containing ~56 million scientific publications through the lens of our proposed embedding method, revealing a self-structured landscape of texts.

[54] Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models

Binghai Wang,Yantao Liu,Yuxuan Liu,Tianyi Tang,Shenzhi Wang,Chang Gao,Chujie Zheng,Yichang Zhang,Le Yu,Shixuan Liu,Tao Gui,Qi Zhang,Xuanjing Huang,Bowen Yu,Fei Huang,Junyang Lin

Main category: cs.CL

TL;DR: 本文提出了一种新的评估指标'理由一致性'(Rationale Consistency),用于检测生成式奖励模型(GenRMs)和LLM-as-a-Judge中存在的‘欺骗性对齐’问题,并通过结合该指标与结果准确率的混合信号训练方法,显著提升了模型在多个基准测试中的性能,同时缓解了欺骗性对齐问题。

Details Motivation: 现有GenRMs和LLM-as-a-Judge因过度关注结果准确率(Outcome Accuracy)而出现欺骗性对齐——即判断正确但推理错误,导致其在RLHF中泛化能力差。 Method: 提出细粒度评估指标'理由一致性'来量化模型推理过程与人类判断的一致性;设计融合理由一致性和结果准确率的混合训练信号用于GenRM训练。 Result: 新方法在RM-Bench达87.1%,JudgeBench达82%,平均超越仅用结果准确率的基线5%;在Arena Hard v2上RLHF中创意写作任务提升7%;有效逆转了纯结果准确率训练导致的理由一致性下降趋势。 Conclusion: 理由一致性是比结果准确率更可靠的GenRM评估与训练指标;混合信号训练能有效缓解欺骗性对齐,提升模型鲁棒性与泛化能力。 Abstract: Generative Reward Models (GenRMs) and LLM-as-a-Judge exhibit deceptive alignment by producing correct judgments for incorrect reasons, as they are trained and evaluated to prioritize Outcome Accuracy, which undermines their ability to generalize during RLHF. We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model's reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models and detects deceptive alignment, while outcome accuracy falls short in both respects. To mitigate this gap, we introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training. Our training method achieves state-of-the-art performance on RM-Bench (87.1%) and JudgeBench (82%), surpassing outcome-only baselines by an average of 5%. Using RM during RLHF, our method effectively improves performance as demonstrated on Arena Hard v2, notably yielding a 7% improvement in creative writing tasks. Further analysis confirms that our method escapes the deceptive alignment trap, effectively reversing the decline in rationale consistency observed in outcome-only training.

[55] Approaches to Semantic Textual Similarity in Slovak Language: From Algorithms to Transformers

Lukas Radosky,Miroslav Blstak,Matej Krajcovic,Ivan Polasek

Main category: cs.CL

TL;DR: 本文对斯洛伐克语的语义文本相似度(STS)方法进行了比较评估,涵盖传统算法、监督机器学习模型及第三方深度学习工具,并采用人工蜂群优化进行特征选择和超参数调优。

Details Motivation: 斯洛伐克语等低资源语言的语义文本相似度(STS)研究仍具挑战性,而高资源语言已有广泛研究,因此需系统评估适用于斯洛伐克语的STS方法。 Method: 采用传统算法、监督机器学习模型(以传统算法输出为特征,并用人工蜂群优化进行特征选择与超参数调优),以及多种第三方深度学习工具(如CloudNLP微调模型、OpenAI嵌入模型、GPT-4、预训练SlovakBERT)进行对比评估。 Result: 不同方法在斯洛伐克语STS任务中表现出显著性能差异与权衡,部分深度学习工具(如SlovakBERT、GPT-4)表现突出,而传统与机器学习方法在可解释性和资源效率方面具有一定优势。 Conclusion: 针对低资源语言如斯洛伐克语,应根据任务需求(如精度、计算成本、可解释性)选择合适的STS方法;深度学习模型虽整体性能更优,但传统与优化后的机器学习方法仍具实用价值。 Abstract: Semantic textual similarity (STS) plays a crucial role in many natural language processing tasks. While extensively studied in high-resource languages, STS remains challenging for under-resourced languages such as Slovak. This paper presents a comparative evaluation of sentence-level STS methods applied to Slovak, including traditional algorithms, supervised machine learning models, and third-party deep learning tools. We trained several machine learning models using outputs from traditional algorithms as features, with feature selection and hyperparameter tuning jointly guided by artificial bee colony optimization. Finally, we evaluated several third-party tools, including fine-tuned model by CloudNLP, OpenAI's embedding models, GPT-4 model, and pretrained SlovakBERT model. Our findings highlight the trade-offs between different approaches.

[56] Investigating Disability Representations in Text-to-Image Models

Yang Yian,Yu Fan,Liudmila Zavolokina,Sarah Ebling

Main category: cs.CL

TL;DR: 本研究探讨了Stable Diffusion XL和DALL-E 3在生成残障人士图像时的表征偏差,发现存在持续的表征失衡,并强调需持续评估与优化模型以提升包容性。

Details Motivation: 尽管文本到图像生成模型在高质量内容生成方面取得进展,但其对社会群体(尤其是残障人士)的表征仍缺乏深入研究,亟需系统分析。 Method: 通过结构化提示设计,分析Stable Diffusion XL和DALL-E 3生成的残障相关图像;比较通用残障提示与具体残障类别提示的图像相似性;结合自动与人工评估,开展情感极性分析以评估缓解策略对情感框架的影响。 Result: 揭示了模型中残障表征的持续不平衡现象,包括刻板印象、缺失或负面情感倾向等问题。 Conclusion: 当前生成模型在残障表征方面仍存在显著缺陷,需持续评估、数据改进与算法优化,以推动更具多样性与包容性的AI图像生成。 Abstract: Text-to-image generative models have made remarkable progress in producing high-quality visual content from textual descriptions, yet concerns remain about how they represent social groups. While characteristics like gender and race have received increasing attention, disability representations remain underexplored. This study investigates how people with disabilities are represented in AI-generated images by analyzing outputs from Stable Diffusion XL and DALL-E 3 using a structured prompt design. We analyze disability representations by comparing image similarities between generic disability prompts and prompts referring to specific disability categories. Moreover, we evaluate how mitigation strategies influence disability portrayals, with a focus on assessing affective framing through sentiment polarity analysis, combining both automatic and human evaluation. Our findings reveal persistent representational imbalances and highlight the need for continuous evaluation and refinement of generative models to foster more diverse and inclusive portrayals of disability.

[57] LinGO: A Linguistic Graph Optimization Framework with LLMs for Interpreting Intents of Online Uncivil Discourse

Yuan Zhang,Thales Bertaglia

Main category: cs.CL

TL;DR: 本文提出LinGO框架,通过将语言分解为多步语言成分并迭代优化提示或示例,提升大语言模型对政治不文明语言多类意图的分类性能。

Details Motivation: 现有不文明语言检测模型常将含不文明线索但表达文明意图的帖子误判,导致对网络有害不文明现象的高估。 Method: LinGO是一种面向大语言模型的语言图优化框架,利用语言结构和优化技术,将语言分解为多步语言成分,识别易出错的关键步骤,并迭代优化对应提示或示例;在2022年巴西总统大选数据集上评估,涵盖四种政治不文明类型及六种文明/不文明意图;使用三种轻量级LLM和四种优化技术(TextGrad、AdalFlow、DSPy、RAG)进行基准测试。 Result: LinGO在所有模型上均显著优于零样本、思维链、直接优化和微调等基线方法;RAG是最优优化技术,与Gemini 2.5 Flash-Lite结合时性能最佳。 Conclusion: 将多步语言成分融入LLM指令并针对性优化,有助于模型理解复杂语义,该思路可推广至其他复杂语义解释任务。 Abstract: Detecting uncivil language is crucial for maintaining safe, inclusive, and democratic online spaces. Yet existing classifiers often misinterpret posts containing uncivil cues but expressing civil intents, leading to inflated estimates of harmful incivility online. We introduce LinGO, a linguistic graph optimization framework for large language models (LLMs) that leverages linguistic structures and optimization techniques to classify multi-class intents of incivility that use various direct and indirect expressions. LinGO decomposes language into multi-step linguistic components, identifies targeted steps that cause the most errors, and iteratively optimizes prompt and/or example components for targeted steps. We evaluate it using a dataset collected during the 2022 Brazilian presidential election, encompassing four forms of political incivility: Impoliteness (IMP), Hate Speech and Stereotyping (HSST), Physical Harm and Violent Political Rhetoric (PHAVPR), and Threats to Democratic Institutions and Values (THREAT). Each instance is annotated with six types of civil/uncivil intent. We benchmark LinGO using three cost-efficient LLMs: GPT-5-mini, Gemini 2.5 Flash-Lite, and Claude 3 Haiku, and four optimization techniques: TextGrad, AdalFlow, DSPy, and Retrieval-Augmented Generation (RAG). The results show that, across all models, LinGO consistently improves accuracy and weighted F1 compared with zero-shot, chain-of-thought, direct optimization, and fine-tuning baselines. RAG is the strongest optimization technique and, when paired with Gemini model, achieves the best overall performance. These findings demonstrate that incorporating multi-step linguistic components into LLM instructions and optimize targeted components can help the models explain complex semantic meanings, which can be extended to other complex semantic explanation tasks in the future.

[58] ERNIE 5.0 Technical Report

Haifeng Wang,Hua Wu,Tian Wu,Yu Sun,Jing Liu,Dianhai Yu,Yanjun Ma,Jingzhou He,Zhongjun He,Dou Hong,Qiwen Liu,Shuohuan Wang,Junyuan Shang,Zhenyu Zhang,Yuchen Ding,Jinle Zeng,Jiabin Yang,Liang Shen,Ruibiao Chen,Weichong Yin,Siyu Ding,Dai Dai,Shikun Feng,Siqi Bao,Bolei He,Yan Chen,Zhenyu Jiao,Ruiqing Zhang,Zeyu Chen,Qingqing Dang,Kaipeng Deng,Jiajun Jiang,Enlei Gong,Guoxia Wang,Yanlin Sha,Yi Liu,Yehan Zheng,Weijian Xu,Jiaxiang Liu,Zengfeng Zeng,Yingqi Qu,Zhongli Li,Zhengkun Zhang,Xiyang Wang,Zixiang Xu,Xinchao Xu,Zhengjie Huang,Dong Wang,Bingjin Chen,Yue Chang,Xing Yuan,Shiwei Huang,Qiao Zhao,Xinzhe Ding,Shuangshuang Qiao,Baoshan Yang,Bihong Tang,Bin Li,Bingquan Wang,Binhan Tang,Binxiong Zheng,Bo Cui,Bo Ke,Bo Zhang,Bowen Zhang,Boyan Zhang,Boyang Liu,Caiji Zhang,Can Li,Chang Xu,Chao Pang,Chao Zhang,Chaoyi Yuan,Chen Chen,Cheng Cui,Chenlin Yin,Chun Gan,Chunguang Chai,Chuyu Fang,Cuiyun Han,Dan Zhang,Danlei Feng,Danxiang Zhu,Dong Sun,Dongbo Li,Dongdong Li,Dongdong Liu,Dongxue Liu,Fan Ding,Fan Hu,Fan Li,Fan Mo,Feisheng Wu,Fengwei Liu,Gangqiang Hu,Gaofeng Lu,Gaopeng Yong,Gexiao Tian,Guan Wang,Guangchen Ni,Guangshuo Wu,Guanzhong Wang,Guihua Liu,Guishun Li,Haibin Li,Haijian Liang,Haipeng Ming,Haisu Wang,Haiyang Lu,Haiye Lin,Han Zhou,Hangting Lou,Hanwen Du,Hanzhi Zhang,Hao Chen,Hao Du,Hao Liu,Hao Zhou,Haochen Jiang,Haodong Tian,Haoshuang Wang,Haozhe Geng,Heju Yin,Hong Chen,Hongchen Xue,Hongen Liu,Honggeng Zhang,Hongji Xu,Hongwei Chen,Hongyang Zhang,Hongyuan Zhang,Hua Lu,Huan Chen,Huan Wang,Huang He,Hui Liu,Hui Zhong,Huibin Ruan,Jiafeng Lu,Jiage Liang,Jiahao Hu,Jiahao Hu,Jiajie Yang,Jialin Li,Jian Chen,Jian Wu,Jianfeng Yang,Jianguang Jiang,Jianhua Wang,Jianye Chen,Jiaodi Liu,Jiarui Zhou,Jiawei Lv,Jiaxin Zhou,Jiaxuan Liu,Jie Han,Jie Sun,Jiefan Fang,Jihan Liu,Jihua Liu,Jing Hu,Jing Qian,Jing Yan,Jingdong Du,Jingdong Wang,Jingjing Wu,Jingyong Li,Jinheng Wang,Jinjin Li,Jinliang Lu,Jinlin Yu,Jinnan Liu,Jixiang Feng,Jiyi Huang,Jiyuan Zhang,Jun Liang,Jun Xia,Jun Yu,Junda Chen,Junhao Feng,Junhong Xiang,Junliang Li,Kai Liu,Kailun Chen,Kairan Su,Kang Hu,Kangkang Zhou,Ke Chen,Ke Wei,Kui Huang,Kun Wu,Kunbin Chen,Lei Han,Lei Sun,Lei Wen,Linghui Meng,Linhao Yu,Liping Ouyang,Liwen Zhang,Longbin Ji,Longzhi Wang,Meng Sun,Meng Tian,Mengfei Li,Mengqi Zeng,Mengyu Zhang,Ming Hong,Mingcheng Zhou,Mingming Huang,Mingxin Chen,Mingzhu Cai,Naibin Gu,Nemin Qiu,Nian Wang,Peng Qiu,Peng Zhao,Pengyu Zou,Qi Wang,Qi Xin,Qian Wang,Qiang Zhu,Qianhui Luo,Qianwei Yang,Qianyue He,Qifei Wu,Qinrui Li,Qiwen Bao,Quan Zhang,Quanxiang Liu,Qunyi Xie,Rongrui Zhan,Rufeng Dai,Rui Peng,Ruian Liu,Ruihao Xu,Ruijie Wang,Ruixi Zhang,Ruixuan Liu,Runsheng Shi,Ruting Wang,Senbo Kang,Shan Lu,Shaofei Yu,Shaotian Gong,Shenwei Hu,Shifeng Zheng,Shihao Guo,Shilong Fan,Shiqin Liu,Shiwei Gu,Shixi Zhang,Shuai Yao,Shuang Zhang,Shuangqiao Liu,Shuhao Liang,Shuwei He,Shuwen Yang,Sijun He,Siming Dai,Siming Wu,Siyi Long,Songhe Deng,Suhui Dong,Suyin Liang,Teng Hu,Tianchan Xu,Tianliang Lv,Tianmeng Yang,Tianyi Wei,Tiezhu Gao,Ting Sun,Ting Zhang,Tingdan Luo,Wei He,Wei Luan,Wei Yin,Wei Zhang,Wei Zhou,Weibao Gong,Weibin Li,Weicheng Huang,Weichong Dang,Weiguo Zhu,Weilong Zhang,Weiqi Tan,Wen Huang,Wenbin Chang,Wenjing Du,Wenlong Miao,Wenpei Luo,Wenquan Wu,Xi Shi,Xi Zhao,Xiang Gao,Xiangguo Zhang,Xiangrui Yu,Xiangsen Wang,Xiangzhe Wang,Xianlong Luo,Xianying Ma,Xiao Tan,Xiaocong Lin,Xiaofei Wang,Xiaofeng Peng,Xiaofeng Wu,Xiaojian Xu,Xiaolan Yuan,Xiaopeng Cui,Xiaotian Han,Xiaoxiong Liu,Xiaoxu Fei,Xiaoxuan Wu,Xiaoyu Wang,Xiaoyu Zhang,Xin Sun,Xin Wang,Xinhui Huang,Xinming Zhu,Xintong Yu,Xinyi Xu,Xinyu Wang,Xiuxian Li,XuanShi Zhu,Xue Xu,Xueying Lv,Xuhong Li,Xulong Wei,Xuyi Chen,Yabing Shi,Yafeng Wang,Yamei Li,Yan Liu,Yanfu Cheng,Yang Gao,Yang Liang,Yang Wang,Yang Wang,Yang Yang,Yanlong Liu,Yannian Fu,Yanpeng Wang,Yanzheng Lin,Yao Chen,Yaozong Shen,Yaqian Han,Yehua Yang,Yekun Chai,Yesong Wang,Yi Song,Yichen Zhang,Yifei Wang,Yifeng Guo,Yifeng Kou,Yilong Chen,Yilong Guo,Yiming Wang,Ying Chen,Ying Wang,Yingsheng Wu,Yingzhan Lin,Yinqi Yang,Yiran Xing,Yishu Lei,Yixiang Tu,Yiyan Chen,Yong Zhang,Yonghua Li,Yongqiang Ma,Yongxing Dai,Yongyue Zhang,Yu Ran,Yu Sun,Yu-Wen Michael Zhang,Yuang Liu,Yuanle Liu,Yuanyuan Zhou,Yubo Zhang,Yuchen Han,Yucheng Wang,Yude Gao,Yuedong Luo,Yuehu Dong,Yufeng Hu,Yuhui Cao,Yuhui Yun,Yukun Chen,Yukun Gao,Yukun Li,Yumeng Zhang,Yun Fan,Yun Ma,Yunfei Zhang,Yunshen Xie,Yuping Xu,Yuqin Zhang,Yuqing Liu,Yurui Li,Yuwen Wang,Yuxiang Lu,Zefeng Cai,Zelin Zhao,Zelun Zhang,Zenan Lin,Zezhao Dong,Zhaowu Pan,Zhaoyu Liu,Zhe Dong,Zhe Zhang,Zhen Zhang,Zhengfan Wu,Zhengrui Wei,Zhengsheng Ning,Zhenxing Li,Zhenyu Li,Zhenyu Qian,Zhenyun Li,Zhi Li,Zhichao Chen,Zhicheng Dong,Zhida Feng,Zhifan Feng,Zhihao Deng,Zhijin Yu,Zhiyang Chen,Zhonghui Zheng,Zhuangzhuang Guo,Zhujun Zhang,Zhuo Sun,Zichang Liu,Zihan Lin,Zihao Huang,Zihe Zhu,Ziheng Zhao,Ziping Chen,Zixuan Zhu,Ziyang Xu,Ziyi Liang,Ziyuan Gao

Main category: cs.CL

TL;DR: ERNIE 5.0 是一个原生自回归的万亿参数多模态基础模型,统一支持文本、图像、视频和音频的理解与生成;采用超稀疏MoE架构与模态无关专家路由,并引入弹性训练范式以适配不同资源约束,同时解决了超稀疏MoE下多模态强化学习微调的稳定性问题。

Details Motivation: 解决现有模型在多模态统一建模、大规模部署灵活性及超稀疏MoE架构下强化学习微调不稳定等实际挑战。 Method: 提出统一的下一组token预测目标,基于模态无关专家路由的超稀疏MoE架构;设计弹性训练范式,单次预训练生成多种深度、专家容量和稀疏度的子模型;系统优化多模态强化学习微调流程。 Result: 在多模态理解与生成任务上实现强而均衡的性能;是目前公开披露中首个达到生产规模的万亿参数统一自回归多模态模型。 Conclusion: ERNIE 5.0验证了统一自回归建模多模态数据的可行性与可扩展性,其弹性训练与模态无关MoE设计为资源受限场景下的大模型部署提供了新范式。 Abstract: In this report, we introduce ERNIE 5.0, a natively autoregressive foundation model desinged for unified multimodal understanding and generation across text, image, video, and audio. All modalities are trained from scratch under a unified next-group-of-tokens prediction objective, based on an ultra-sparse mixture-of-experts (MoE) architecture with modality-agnostic expert routing. To address practical challenges in large-scale deployment under diverse resource constraints, ERNIE 5.0 adopts a novel elastic training paradigm. Within a single pre-training run, the model learns a family of sub-models with varying depths, expert capacities, and routing sparsity, enabling flexible trade-offs among performance, model size, and inference latency in memory- or time-constrained scenarios. Moreover, we systematically address the challenges of scaling reinforcement learning to unified foundation models, thereby guaranteeing efficient and stable post-training under ultra-sparse MoE architectures and diverse multimodal settings. Extensive experiments demonstrate that ERNIE 5.0 achieves strong and balanced performance across multiple modalities. To the best of our knowledge, among publicly disclosed models, ERNIE 5.0 represents the first production-scale realization of a trillion-parameter unified autoregressive model that supports both multimodal understanding and generation. To facilitate further research, we present detailed visualizations of modality-agnostic expert routing in the unified model, alongside comprehensive empirical analysis of elastic training, aiming to offer profound insights to the community.

[59] LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers

Yike Sun,Haotong Yang,Zhouchen Lin,Muhan Zhang

Main category: cs.CL

TL;DR: 本文研究了BPE分词器中'中间合并残留'现象,提出LiteToken方法移除低频残留token,在不显著影响性能的前提下提升鲁棒性和效率。

Details Motivation: BPE分词器中存在大量在构建词汇表时高频出现、但在实际分词中极少使用的'中间合并残留'token,造成词汇容量浪费并降低对异常输入的鲁棒性。 Method: 系统实证分析多种常用BPE分词器中的残留token现象,并提出LiteToken方法——识别并移除这些低频残留token;利用其极低使用率特性,使预训练模型通常无需额外微调即可适配新tokenizer。 Result: LiteToken有效减少token碎片化、降低模型参数量、提升对噪声和拼写错误输入的鲁棒性,同时保持整体性能不变。 Conclusion: 中间合并残留是BPE分词器中被忽视但影响显著的问题;LiteToken是一种轻量、实用且无需重训练的优化方案,可广泛应用于现有语言模型分词器改进。 Abstract: Tokenization is fundamental to how language models represent and process text, yet the behavior of widely used BPE tokenizers has received far less study than model architectures and training. In this paper, we investigate intermediate merge residues in BPE vocabularies: tokens that are frequent during merge learning so that retained in the final vocabulary, but are mostly further merged and rarely emitted when tokenizing the corpus during tokenizer usage. Such low-frequency tokens not only waste vocabulary capacity but also increase vulnerability to adversarial or atypical inputs. We present a systematic empirical characterization of this phenomenon across commonly used tokenizers and introduce LiteToken, a simple method for removing residue tokens. Because the affected tokens are rarely used, pretrained models can often accommodate the modified tokenizer without additional fine-tuning. Experiments show that LiteToken reduces token fragmentation, reduces parameters, and improves robustness to noisy or misspelled inputs, while preserving overall performance.

[60] Linguistically Informed Evaluation of Multilingual ASR for African Languages

Fei-Yueh Chen,Lateef Adeleke,C. M. Downey

Main category: cs.CL

TL;DR: 本文提出了一种面向非洲语言语音识别评估的音系特征错误率(FER)及声调感知扩展(TER),指出传统词错误率(WER)掩盖了模型在音段与声调等细粒度语言特征上的真实表现,尤其揭示了声调(如中调、降阶调)是当前模型最难建模的特征。

Details Motivation: 传统WER指标将音系、声调等不同性质的语言错误混为单一词汇错误,无法反映ASR模型在非洲语言(尤其是声调语言)中的细粒度缺陷;亟需更富语言学意义的评估指标。 Method: 在约鲁巴语和濒危语言Uneme上,对三种语音编码器进行评估,综合使用WER、CER、FER及新提出的声调感知错误率(TER);FER基于音系特征(如发音部位、方式、声调)计算错误,TER则专门针对声调特征优化。 Result: FER和TER显著揭示了模型在音段特征上表现较好,但在声调(特别是中调和降阶调)上错误率高;Yoruba上WER=0.788而FER仅0.151;Uneme上近100% WER对应FER=0.267,表明大量错误实为单个音系特征误判而非整词失败。 Conclusion: FER/TER比WER/CER更能准确刻画ASR模型在非洲声调语言中的能力短板,应成为评估和改进多语言语音模型的重要补充指标;声调建模是当前关键挑战。 Abstract: Word Error Rate (WER) mischaracterizes ASR models' performance for African languages by combining phonological, tone, and other linguistic errors into a single lexical error. By contrast, Feature Error Rate (FER) has recently attracted attention as a viable metric that reveals linguistically meaningful errors in models' performance. In this paper, we evaluate three speech encoders on two African languages by complementing WER with CER, and FER, and add a tone-aware extension (TER). We show that by computing errors on phonological features, FER and TER reveal linguistically-salient error patterns even when word-level accuracy remains low. Our results reveal that models perform better on segmental features, while tones (especially mid and downstep) remain the most challenging features. Results on Yoruba show a striking differential in metrics, with WER=0.788, CER=0.305, and FER=0.151. Similarly for Uneme (an endangered language absent from pretraining data) a model with near-total WER and 0.461 CER achieves the relatively low FER of 0.267. This indicates model error is often attributable to individual phonetic feature errors, which is obscured by all-or-nothing metrics like WER.

[61] "Be My Cheese?": Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs

Madison Van Doren,Casey Ford,Jennifer Barajas,Cory Holland

Main category: cs.CL

TL;DR: 本文提出首个大规模、多语言、人工标注的文化本地化评估基准,聚焦机器翻译中文化细微差别(如习语、双关语、节日、文化概念)的处理能力,发现当前多语言大模型在文化适配方面存在显著不足,尤其在习语和双关语翻译上表现最差。

Details Motivation: 现有机器翻译基准过度关注词级和语法准确性,忽视实际本地化所需的语用与文化能力,亟需面向文化适配的人类评估新范式。 Method: 基于87条跨20种语言的试点翻译,构建覆盖15种目标语言、7个主流多语言大模型的评估集;每种语言由5名母语者对全文及文化敏感片段(习语、双关、节日、文化概念)按0–3分制打分,并设NA选项处理未翻译段落。 Result: 全文平均质量仅1.68/3;GPT-5(2.10)、Claude Sonnet 3.7(1.97)、Mistral Medium 3.1(1.84)表现最优;细分显示节日(2.20)和文化概念(2.19)优于习语(1.65)和双关(1.45),且习语最常被遗漏。 Conclusion: 当前多语言大模型虽具基础语法能力,但缺乏深层文化理解与表达能力,需引入文化感知训练数据、增强跨语言语用建模,并建立更贴近真实交际需求的评估体系。 Abstract: We present a large-scale human evaluation benchmark for assessing cultural localisation in machine translation produced by state-of-the-art multilingual large language models (LLMs). Existing MT benchmarks emphasise token-level and grammatical accuracy, but of ten overlook pragmatic and culturally grounded competencies required for real-world localisation. Building on a pilot study of 87 translations across 20 languages, we evaluate 7 multilingual LLMs across 15 target languages with 5 native-speaker raters per language. Raters scored both full-text translations and segment-level instances of culturally nuanced language (idioms, puns, holidays, and culturally embedded concepts) on an ordinal 0-3 quality scale; segment ratings additionally included an NA option for untranslated segments. Across full-text evaluations, mean overall quality is modest (1.68/3): GPT-5 (2.10/3), Claude Sonnet 3.7 (1.97/3), and Mistral Medium 3.1 (1.84/3) form the strongest tier with fewer catastrophic failures. Segment-level results show sharp category effects: holidays (2.20/3) and cultural concepts (2.19/3) translate substantially better than idioms (1.65/3) and puns (1.45/3), and idioms are most likely to be left untranslated. These findings demonstrate a persistent gap between grammatical adequacy and cultural resonance. To our knowledge, this is the first multilingual, human-annotated benchmark focused explicitly on cultural nuance in translation and localisation, highlighting the need for culturally informed training data, improved cross-lingual pragmatics, and evaluation paradigms that better reflect real-world communicative competence.

[62] Less Finetuning, Better Retrieval: Rethinking LLM Adaptation for Biomedical Retrievers via Synthetic Data and Model Merging

Sameh Khattab,Jean-Philippe Corbeil,Osman Alperen Koraş,Amin Dada,Julian Friedrich,François Beaulieu,Paul Vozila,Jens Kleesiek

Main category: cs.CL

TL;DR: 本文提出Synthesize-Train-Merge(STM)框架,通过合成难负样本、检索提示优化和模型合并,将通用大语言模型高效适配为高性能生物医学领域专用检索器,在MTEB子集上平均提升7.5%,最高达23.5%,且不损害通用能力。

Details Motivation: 现有LLM-based检索器在专业领域(如生物医学)的适配方法尚不完善,缺乏对如何将通用大模型高效转化为领域专用检索器的系统探索。 Method: 提出模块化框架STM,包含三部分:1)生成合成难负样本以增强判别能力;2)优化检索相关提示词;3)融合多个任务专家模型。基于decoder-only LLM实现,无需大规模预训练。 Result: 在MTEB中12个医疗与通用任务子集上,STM使任务专家平均性能提升7.5%,最高达23.5%;合并后的模型优于单任务专家及强基线模型。 Conclusion: STM提供了一种可扩展、高效的方法,能在保留通用能力的同时,显著提升LLM在专业领域的检索性能,为领域适配提供了新范式。 Abstract: Retrieval-augmented generation (RAG) has become the backbone of grounding Large Language Models (LLMs), improving knowledge updates and reducing hallucinations. Recently, LLM-based retriever models have shown state-of-the-art performance for RAG applications. However, several technical aspects remain underexplored on how to adapt general-purpose LLMs into effective domain-specific retrievers, especially in specialized domains such as biomedicine. We present Synthesize-Train-Merge (STM), a modular framework that enhances decoder-only LLMs with synthetic hard negatives, retrieval prompt optimization, and model merging. Experiments on a subset of 12 medical and general tasks from the MTEB benchmark show STM boosts task-specific experts by up to 23.5\% (average 7.5\%) and produces merged models that outperform both single experts and strong baselines without extensive pretraining. Our results demonstrate a scalable, efficient path for turning general LLMs into high-performing, domain-specialized retrievers, preserving general-domain capabilities while excelling on specialized tasks.

[63] Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases

Casey Ford,Madison Van Doren,Emily Dix

Main category: cs.CL

TL;DR: 本文评估了多模态大语言模型(MLLMs)在对抗性提示下的安全性,发现不同模型家族的安全性存在显著且持续的差异,且随模型迭代出现对齐漂移和模态效应变化,强调需建立长期、多模态的安全基准。

Details Motivation: 多模态大语言模型(MLLMs)在现实系统中日益广泛应用,但其在对抗性提示下的安全性仍缺乏系统性研究。 Method: 采用两阶段评估方法,使用由26名专业红队成员编写的726个对抗性提示构成的固定基准;第一阶段测试GPT-4o、Claude Sonnet 3.5、Pixtral 12B和Qwen VL Plus;第二阶段测试其后续版本(GPT-5、Claude Sonnet 4.5、Pixtral Large、Qwen Omni),共获得82,256条人工危害评分。 Result: Pixtral模型始终最易受攻击,Claude模型因高拒绝率而最安全;GPT与Claude模型攻击成功率(ASR)随代际上升,Pixtral与Qwen则略有下降;模态效应发生转变:Phase 1中纯文本提示更有效,Phase 2中GPT-5和Claude 4.5在各模态下脆弱性趋于均等。 Conclusion: MLLM的安全性既非统一也非稳定,会随模型更新而动态变化,亟需纵向、多模态的安全评估基准以持续追踪其安全行为演化。 Abstract: Multimodal large language models (MLLMs) are increasingly deployed in real-world systems, yet their safety under adversarial prompting remains underexplored. We present a two-phase evaluation of MLLM harmlessness using a fixed benchmark of 726 adversarial prompts authored by 26 professional red teamers. Phase 1 assessed GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus; Phase 2 evaluated their successors (GPT-5, Claude Sonnet 4.5, Pixtral Large, and Qwen Omni) yielding 82,256 human harm ratings. Large, persistent differences emerged across model families: Pixtral models were consistently the most vulnerable, whereas Claude models appeared safest due to high refusal rates. Attack success rates (ASR) showed clear alignment drift: GPT and Claude models exhibited increased ASR across generations, while Pixtral and Qwen showed modest decreases. Modality effects also shifted over time: text-only prompts were more effective in Phase 1, whereas Phase 2 produced model-specific patterns, with GPT-5 and Claude 4.5 showing near-equivalent vulnerability across modalities. These findings demonstrate that MLLM harmlessness is neither uniform nor stable across updates, underscoring the need for longitudinal, multimodal benchmarks to track evolving safety behaviour.

[64] Exploiting contextual information to improve stance detection in informal political discourse with LLMs

Arman Engin Sucu,Yixiang Zhou,Mario A. Nascimento,Tony Mullen

Main category: cs.CL

TL;DR: 本文研究了在非正式在线政治讨论中,利用大语言模型(LLM)进行立场检测,并通过引入用户历史发帖生成的结构化档案(含意识形态倾向、话题偏好和语言模式)作为上下文提示,显著提升了分类准确率(+17.5%~+38.5%,最高达74%),且发现精选政治相关内容比大量随机文本更有效。

Details Motivation: 非正式网络政治话语中语言常具讽刺性、歧义性和强语境依赖性,导致现有LLM在立场检测任务中表现受限,亟需引入用户级上下文以提升鲁棒性。 Method: 基于真实政治论坛数据集,构建用户结构化档案(涵盖意识形态倾向、高频话题与语言特征),在七种SOTA LLM上开展基线与上下文增强的跨模型对比实验,并分析档案规模与帖子选择策略的影响。 Result: 上下文增强显著提升准确率(+17.5%至+38.5%),最高达74%,超越先前方法;精选政治相关内容比更大规模随机上下文更有效。 Conclusion: 在政治立场检测任务中,融入用户历史行为提炼的结构化上下文能显著增强LLM性能,凸显用户级语境建模的关键价值。 Abstract: This study investigates the use of Large Language Models (LLMs) for political stance detection in informal online discourse, where language is often sarcastic, ambiguous, and context-dependent. We explore whether providing contextual information, specifically user profile summaries derived from historical posts, can improve classification accuracy. Using a real-world political forum dataset, we generate structured profiles that summarize users' ideological leaning, recurring topics, and linguistic patterns. We evaluate seven state-of-the-art LLMs across baseline and context-enriched setups through a comprehensive cross-model evaluation. Our findings show that contextual prompts significantly boost accuracy, with improvements ranging from +17.5\% to +38.5\%, achieving up to 74\% accuracy that surpasses previous approaches. We also analyze how profile size and post selection strategies affect performance, showing that strategically chosen political content yields better results than larger, randomly selected contexts. These findings underscore the value of incorporating user-level context to enhance LLM performance in nuanced political classification tasks.

[65] When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?

Xinyu Zhou,Chang Jin,Carsten Eickhoff,Zhijiang Guo,Seyed Ali Bahrainian

Main category: cs.CL

TL;DR: 本文提出了一种结合思维链监督与基于弃答感知奖励的强化学习(RL)的新方法,以提升大语言模型在时序问答任务中的弃答能力与推理可靠性;实验表明该方法显著优于监督微调(SFT),并在TimeQA基准上超越GPT-4o,同时揭示了SFT易导致过度自信、而隐式时序线索对弃答帮助有限等关键发现。

Details Motivation: 大型语言模型在时序问答中常忽略时间敏感证据、混淆不同时期事实,且难以主动承认不确定性(即弃答),现有校准方法在复杂推理中不可靠。 Method: 将弃答建模为可教授技能,构建融合思维链(CoT)监督与弃答感知奖励的强化学习训练流程,并系统比较不同信息源(如原始上下文、时间子上下文、知识图谱)和训练方式(SFT vs RL)的影响。 Result: 基于Qwen2.5-1.5B-Instruct初始化的RL模型在TimeQA-Easy/Hard上Exact Match分别超过GPT-4o达3.46%和5.80%,对不可回答问题的真阳性率比纯SFT高20%;SFT导致过自信,RL提升准确性但仍有风险;隐式时序线索对弃答推理助益有限。 Conclusion: 弃答与推理可协同优化,RL框架比SFT更适于构建兼具准确性与可靠性的时序问答模型,为提升LLM可信度提供了新路径。 Abstract: Large language models (LLMs) rarely admit uncertainty, often producing fluent but misleading answers, rather than abstaining (i.e., refusing to answer). This weakness is even evident in temporal question answering, where models frequently ignore time-sensitive evidence and conflate facts across different time-periods. In this paper, we present the first empirical study of training LLMs with an abstention ability while reasoning about temporal QA. Existing approaches such as calibration might be unreliable in capturing uncertainty in complex reasoning. We instead frame abstention as a teachable skill and introduce a pipeline that couples Chain-of-Thought (CoT) supervision with Reinforcement Learning (RL) guided by abstention-aware rewards. Our goal is to systematically analyze how different information types and training techniques affect temporal reasoning with abstention behavior in LLMs. Through extensive experiments studying various methods, we find that RL yields strong empirical gains on reasoning: a model initialized by Qwen2.5-1.5B-Instruct surpasses GPT-4o by $3.46\%$ and $5.80\%$ in Exact Match on TimeQA-Easy and Hard, respectively. Moreover, it improves the True Positive rate on unanswerable questions by $20\%$ over a pure supervised fine-tuned (SFT) variant. Beyond performance, our analysis shows that SFT induces overconfidence and harms reliability, while RL improves prediction accuracy but exhibits similar risks. Finally, by comparing implicit reasoning cues (e.g., original context, temporal sub-context, knowledge graphs) with explicit CoT supervision, we find that implicit information provides limited benefit for reasoning with abstention. Our study provides new insights into how abstention and reasoning can be jointly optimized, providing a foundation for building more reliable LLMs.

[66] Beyond Many-Shot Translation: Scaling In-Context Demonstrations For Low-Resource Machine Translation

Luis Frentzen Salim,Esteban Carlin,Alexandre Morinvil,Xi Ai,Lun-Wei Ku

Main category: cs.CL

TL;DR: 本文研究了在低资源机器翻译中,利用大语言模型(LLM)的上下文学习(ICL)能力,通过扩展上下文长度至100万token来提升性能,发现效果随上下文增加迅速饱和甚至下降,且不同语料类型(单语、指令式、平行语料)影响显著。

Details Motivation: 低资源语言机器翻译面临高质量数据稀缺问题,尽管大语言模型提升了MT性能,但适配小语种仍困难;上下文学习(ICL)可能提供新路径,但其在长上下文下的可扩展性尚不明确。 Method: 在爪哇语和巽他语上开展实验,将ICL上下文预算扩展至100万token,对比三类语料作为上下文监督信号:单语无监督数据、指令式数据、英-目标语/印尼-目标语平行数据。 Result: 增大全量上下文带来的性能增益快速饱和,并在接近最大上下文窗口时出现性能下降;不同语料类型表现差异显著,部分单语语料效果可媲美平行语料。 Conclusion: 长上下文ICL在低资源MT中存在有效上限,性能提升不随上下文线性增长,且高度依赖所用语料类型;单纯扩大上下文窗口并非提升翻译质量的有效策略。 Abstract: Building machine translation (MT) systems for low-resource languages is notably difficult due to the scarcity of high-quality data. Although Large Language Models (LLMs) have improved MT system performance, adapting them to lesser-represented languages remains challenging. In-context learning (ICL) may offer novel ways to adapt LLMs for low-resource MT by conditioning models on demonstration at inference time. In this study, we explore scaling low-resource machine translation ICL beyond the few-shot setting to thousands of examples with long-context models. We scale in-context token budget to 1M tokens and compare three types of training corpora used as in-context supervision: monolingual unsupervised data, instruction-style data, and parallel data (English--target and Indonesian--target). Our experiments on Javanese and Sundanese show that gains from additional context saturate quickly and can degrade near the maximum context window, with scaling behavior strongly dependent on corpus type. Notably, some forms of monolingual supervision can be competitive with parallel data, despite the latter offering additional supervision. Overall, our results characterize the effective limits and corpus-type sensitivity of long-context ICL for low-resource MT, highlighting that larger context windows do not necessarily yield proportional quality gains.

[67] OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

Yue Ding,Yiyan Ji,Jungang Li,Xuyang Liu,Xinlong Chen,Junfei Wu,Bozhou Li,Bohan Zeng,Yang Shi,Yushuo Guan,Yuanxing Zhang,Jiaheng Liu,Qiang Liu,Pengfei Wan,Liang Wang

Main category: cs.CL

TL;DR: 本文提出OmniSIFT,一种面向全模态大语言模型(Omni-LLMs)的模态不对称细粒度令牌压缩框架,通过时空视频剪枝和视觉引导音频选择两阶段策略,在显著减少令牌数量(仅25%上下文)的同时,保持甚至超越全令牌模型性能。

Details Motivation: Omni-LLMs在音视频理解任务中表现出色,但其依赖长多模态令牌序列导致巨大计算开销,而针对Omni-LLMs的令牌压缩方法仍十分有限。 Method: 提出OmniSIFT框架:第一阶段为时空视频剪枝模块,消除帧内结构与帧间重叠冗余;第二阶段为视觉引导音频选择模块,筛选音频令牌;整个框架采用可微直通估计器进行端到端优化。 Result: 在五个基准上验证了OmniSIFT的有效性与鲁棒性;以仅4.85M参数开销,在Qwen2.5-Omni-7B上实现低于训练无关基线(如OmniZip)的延迟;使用25%原始令牌即全面超越所有压缩基线,并在多个任务上超过全令牌模型。 Conclusion: OmniSIFT是一种高效、轻量、无需额外训练的全模态令牌压缩方法,显著缓解Omni-LLMs的计算瓶颈,同时提升推理效率与任务性能。 Abstract: Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks demonstrate the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT introduces only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With merely 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the performance of the full-token model on several tasks.

[68] SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization

Jiarui Yuan,Tailin Jin,Weize Chen,Zeyuan Liu,Zhiyuan Liu,Maosong Sun

Main category: cs.CL

TL;DR: 本文提出SE-Bench诊断环境,用于评估AI代理在无文档支持下内化新知识(如伪新颖NumPy API)的能力,揭示了‘开卷训练抑制记忆’、‘强化学习难以完全内化知识’及‘自博弈配合监督微调可行’三大发现。

Details Motivation: 现有方法难以严格衡量AI代理的自我演化能力,主要受限于先验知识与推理复杂度的混淆:新知识可能已存在于预训练数据中,而任务失败可能源于难度而非知识遗忘。 Method: 构建SE-Bench——一个将NumPy及其文档混淆为伪新颖包(含随机标识符)的诊断环境;代理需在无文档访问条件下完成简单编码任务,从而隔离知识内化能力。结合闭卷训练、PPO分析与自博弈+SFT/RL对比实验。 Result: 发现三个关键现象:(1) 开卷训练反而抑制知识内化,需闭卷训练压缩知识至权重;(2) 标准RL(如PPO)因裁剪和负梯度无法完全内化知识;(3) 自博弈配合监督微调(SFT)可实现内化,但纯RL不可行。 Conclusion: SE-Bench为评估AI自我演化与知识内化提供了首个严谨、解耦的诊断平台,并为提升模型终身学习能力指明了训练范式方向。 Abstract: True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where ``new'' knowledge may appear in pre-training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the Open-Book Paradox, where training with reference documentation inhibits retention, requiring "Closed-Book Training" to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self-Play for internalization, proving models can learn from self-generated, noisy tasks when coupled with SFT, but not RL. Overall, SE-Bench establishes a rigorous diagnostic platform for self-evolution with knowledge internalization. Our code and dataset can be found at https://github.com/thunlp/SE-Bench.

[69] Decomposed Prompting Does Not Fix Knowledge Gaps, But Helps Models Say "I Don't Know"

Dhruv Madhwal,Lyuxin David Zhang,Dan Roth,Tomer Wolfson,Vivek Gupta

Main category: cs.CL

TL;DR: 本文研究分解式提示对大语言模型在封闭式问答中可靠性的影响,发现跨提示策略的分歧可作为模型不确定性的精确信号,并据此提出一种无需训练的拒绝回答策略,显著提升了错误检测性能。

Details Motivation: 大型语言模型在封闭式问答中常因无法识别自身知识边界而产生自信的幻觉;尽管分解式提示常用于提升准确性,但其对可靠性的实际影响尚不明确。 Method: 评估三种等效任务的提示策略(直接式、辅助式、增量式),在不同模型规模和多跳问答基准上分析其准确性与分歧模式,并利用跨策略分歧设计无需训练的 abstention(拒绝回答)策略。 Result: 前沿模型中分解带来的准确率增益减弱,但提示策略间的分歧仍高度预示错误;基于分歧的拒绝策略在F1和AUROC指标上均优于标准不确定性基线。 Conclusion: 分解式提示可作为实用的诊断探针,有效揭示模型在封闭式问答中的可靠性问题,且无需检索或微调即可实现高效错误检测。 Abstract: Large language models often struggle to recognize their knowledge limits in closed-book question answering, leading to confident hallucinations. While decomposed prompting is typically used to improve accuracy, we investigate its impact on reliability. We evaluate three task-equivalent prompting regimes: Direct, Assistive, and Incremental, across different model scales and multi-hop QA benchmarks. We find that although accuracy gains from decomposition diminish in frontier models, disagreements between prompting regimes remain highly indicative of potential errors. Because factual knowledge is stable while hallucinations are stochastic, cross-regime agreement provides a precise signal of internal uncertainty. We leverage this signal to implement a training-free abstention policy that requires no retrieval or fine-tuning. Our results show that disagreement-based abstention outperforms standard uncertainty baselines as an error detector, improving both F1 and AUROC across settings. This demonstrates that decomposition-based prompting can serve as a practical diagnostic probe for model reliability in closed-book QA.

[70] CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation

Zhao Tong,Chunlin Gong,Yiping Zhang,Qiang Liu,Xingcheng Xu,Shu Wu,Haichao Shi,Xiao-Yu Zhang

Main category: cs.CL

TL;DR: 本文挑战了大型语言模型(LLM)拒绝有害请求即代表推理安全的假设,发现即使模型输出拒绝,其思维链(CoT)内部仍可能隐含并传播不安全叙事;为此提出一个基于雅可比谱分析的统一安全分析框架,定义稳定性、几何性与能量三个可解释指标,定位导致风险的关键注意力头,揭示风险集中于少数中层连续层。

Details Motivation: 现有评估通常假设模型拒绝有害请求即意味着整个推理过程安全,但该假设未被验证;本文旨在检验这一假设,并探究拒绝响应下CoT内部是否仍存在潜在不安全推理。 Method: 提出基于雅可比矩阵谱分析的统一安全分析框架,逐层解构思维链生成过程,量化各注意力头对欺骗性推理的响应;定义三个新指标——稳定性(stability)、几何性(geometry)和能量(energy),用于刻画注意力头在嵌入与传播不安全叙事中的作用。 Result: 实验证明,在启用思维链模式时,生成风险显著上升,且关键风险路由决策高度集中于少数中等深度的连续网络层;成功识别出若干特定注意力头是导致安全-不安全推理路径分化的主因。 Conclusion: 模型的拒绝响应不能作为推理安全的充分证据;必须深入分析中间推理过程(尤其是注意力机制)才能识别和缓解潜藏的推理风险;所提框架为LLM可信推理提供了新的可解释性分析范式与干预路径。 Abstract: From generating headlines to fabricating news, the Large Language Models (LLMs) are typically assessed by their final outputs, under the safety assumption that a refusal response signifies safe reasoning throughout the entire process. Challenging this assumption, our study reveals that during fake news generation, even when a model rejects a harmful request, its Chain-of-Thought (CoT) reasoning may still internally contain and propagate unsafe narratives. To analyze this phenomenon, we introduce a unified safety-analysis framework that systematically deconstructs CoT generation across model layers and evaluates the role of individual attention heads through Jacobian-based spectral metrics. Within this framework, we introduce three interpretable measures: stability, geometry, and energy to quantify how specific attention heads respond or embed deceptive reasoning patterns. Extensive experiments on multiple reasoning-oriented LLMs show that the generation risk rise significantly when the thinking mode is activated, where the critical routing decisions concentrated in only a few contiguous mid-depth layers. By precisely identifying the attention heads responsible for this divergence, our work challenges the assumption that refusal implies safety and provides a new understanding perspective for mitigating latent reasoning risks.

[71] Reinforced Attention Learning

Bangzheng Li,Jianmo Ni,Chen Qu,Ian Miao,Liu Yang,Xingyu Fu,Muhao Chen,Derek Zhiyuan Cheng

Main category: cs.CL

TL;DR: 本文提出了一种名为Reinforced Attention Learning (RAL)的新框架,通过直接优化多模态大语言模型(MLLMs)内部的注意力分布,而非传统的输出序列,在强化学习后训练中显著提升了感知与跨模态对齐能力。

Details Motivation: 现有基于强化学习的后训练方法在多模态大语言模型上效果有限,尤其在感知任务中增益小甚至导致性能下降,主因是依赖冗长文本推理而非有效建模模态间注意机制。 Method: 提出RAL——一种基于策略梯度的框架,直接优化模型内部注意力权重;并引入On-Policy Attention Distillation,将习得的注意力行为蒸馏到学生模型中。 Result: 在多个图像和视频基准上一致优于GRPO等基线;注意力蒸馏显著提升跨模态对齐效果,优于标准知识蒸馏。 Conclusion: 注意力策略可作为多模态后训练的一种原理清晰、通用性强的新范式,替代传统token-level强化学习。 Abstract: Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improved grounding in complex multimodal inputs. Experiments across diverse image and video benchmarks show consistent gains over GRPO and other baselines. We further introduce On-Policy Attention Distillation, demonstrating that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation. Our results position attention policies as a principled and general alternative for multimodal post-training.

cs.CV [Back]

[72] Intellectual Property Protection for 3D Gaussian Splatting Assets: A Survey

Longjie Zhao,Ziming Hong,Jiaxin Huang,Runnan Chen,Mingming Gong,Tongliang Liu

Main category: cs.CV

TL;DR: 本文是首篇关于3D高斯泼溅(3DGS)知识产权保护的系统性综述,提出一个自底向上的分析框架,涵盖高斯扰动机制、主被动保护范式及生成式AI时代下的鲁棒性威胁,并指出了六大未来研究方向。

Details Motivation: 3D高斯泼溅(3DGS)因商业价值提升与显式参数化结构,引发IP保护需求激增,但现有研究零散,缺乏统一机制理解、保护范式梳理与鲁棒性挑战分析。 Method: 构建自底向上的系统性分析框架,从高斯扰动机制、被动/主动保护范式、生成式AI背景下的鲁棒性威胁三方面进行归纳与剖析。 Result: 首次系统梳理了3DGS IP保护的研究现状,揭示了技术基础薄弱与鲁棒性表征不足等关键缺口,并提炼出六类具有潜力的研究方向。 Conclusion: 3DGS IP保护亟需统一理论框架与实证评估体系;未来工作应聚焦鲁棒性增强、效率优化与新保护范式探索,以支撑可信、可靠的3D内容产权保障。 Abstract: 3D Gaussian Splatting (3DGS) has become a mainstream representation for real-time 3D scene synthesis, enabling applications in virtual and augmented reality, robotics, and 3D content creation. Its rising commercial value and explicit parametric structure raise emerging intellectual property (IP) protection concerns, prompting a surge of research on 3DGS IP protection. However, current progress remains fragmented, lacking a unified view of the underlying mechanisms, protection paradigms, and robustness challenges. To address this gap, we present the first systematic survey on 3DGS IP protection and introduce a bottom-up framework that examines (i) underlying Gaussian-based perturbation mechanisms, (ii) passive and active protection paradigms, and (iii) robustness threats under emerging generative AI era, revealing gaps in technical foundations and robustness characterization and indicating opportunities for deeper investigation. Finally, we outline six research directions across robustness, efficiency, and protection paradigms, offering a roadmap toward reliable and trustworthy IP protection for 3DGS assets.

[73] TruKAN: Towards More Efficient Kolmogorov-Arnold Networks Using Truncated Power Functions

Ali Bayeh,Samira Sadaoui,Malek Mouhoub

Main category: cs.CV

TL;DR: 本文提出TruKAN,一种基于Kolmogorov-Arnold网络(KAN)结构并采用可学习激活函数的新架构,用截断幂函数替代B样条基函数,在保持表达力的同时提升精度与训练效率,并增强可解释性;该模型集成于EfficientNet-V2框架,在视觉基准数据集上验证其在精度、计算效率和内存占用方面优于现有KAN变体。

Details Motivation: 解决KAN在计算效率与理论原则遵循之间的权衡问题,同时提升模型的可解释性与实际性能。 Method: 提出TruKAN架构:以k阶样条理论导出的截断幂函数替代KAN中的B样条基;每层结合截断幂项与多项式项,支持共享或独立结点;集成至EfficientNet-V2框架,采用混合优化与层归一化策略,并对比多种基线模型(MLP、KAN、SineKAN)。 Result: TruKAN在多个计算机视觉基准数据集上展现出更优的准确率、更短的训练时间及更低的内存占用,且具备更强的可解释性;其优势在复杂视觉任务中超越以往KAN研究的有限设定。 Conclusion: TruKAN通过简化且可解释的基函数设计,在保持KAN理论优势的同时显著提升实用性,为可解释神经网络在视觉任务中的应用提供了新范式。 Abstract: To address the trade-off between computational efficiency and adherence to Kolmogorov-Arnold Network (KAN) principles, we propose TruKAN, a new architecture based on the KAN structure and learnable activation functions. TruKAN replaces the B-spline basis in KAN with a family of truncated power functions derived from k-order spline theory. This change maintains the KAN's expressiveness while enhancing accuracy and training time. Each TruKAN layer combines a truncated power term with a polynomial term and employs either shared or individual knots. TruKAN exhibits greater interpretability than other KAN variants due to its simplified basis functions and knot configurations. By prioritizing interpretable basis functions, TruKAN aims to balance approximation efficacy with transparency. We develop the TruKAN model and integrate it into an advanced EfficientNet-V2-based framework, which is then evaluated on computer vision benchmark datasets. To ensure a fair comparison, we develop various models: MLP-, KAN-, SineKAN and TruKAN-based EfficientNet frameworks and assess their training time and accuracy across small and deep architectures. The training phase uses hybrid optimization to improve convergence stability. Additionally, we investigate layer normalization techniques for all the models and assess the impact of shared versus individual knots in TruKAN. Overall, TruKAN outperforms other KAN models in terms of accuracy, computational efficiency and memory usage on the complex vision task, demonstrating advantages beyond the limited settings explored in prior KAN studies.

[74] DiGAN: Diffusion-Guided Attention Network for Early Alzheimer's Disease Detection

Maxx Richard Rahman,Mostafa Hammouda,Wolfgang Maass

Main category: cs.CV

TL;DR: 本文提出Diffusion-Guided Attention Network (DiGAN),结合潜在扩散模型与注意力引导卷积网络,解决阿尔茨海默病(AD)早期诊断中纵向数据稀缺、时序不连续及模态不规则等问题;在合成数据和ADNI数据集上验证其优于现有方法。

Details Motivation: 早期阿尔茨海默病诊断困难,因脑结构变化微弱且时间进程不规则;现有深度学习方法依赖大量纵向数据,难以建模真实临床数据中的时序连续性与模态不规则性。 Method: 提出DiGAN:利用扩散模型从有限数据合成逼真的纵向神经影像轨迹,增强时序上下文并提升对访视间隔不均的鲁棒性;再通过注意力-卷积层捕捉区分正常认知、轻度认知障碍与主观认知下降的关键结构-时序模式。 Result: 在合成数据集和ADNI数据集上的实验表明,DiGAN性能优于现有最先进方法,展现出用于AD早期检测的潜力。 Conclusion: DiGAN有效缓解了纵向数据稀缺与时序不规则带来的挑战,为基于少量不规则临床数据实现AD早期精准诊断提供了新范式。 Abstract: Early diagnosis of Alzheimer's disease (AD) remains a major challenge due to the subtle and temporally irregular progression of structural brain changes in the prodromal stages. Existing deep learning approaches require large longitudinal datasets and often fail to model the temporal continuity and modality irregularities inherent in real-world clinical data. To address these limitations, we propose the Diffusion-Guided Attention Network (DiGAN), which integrates latent diffusion modelling with an attention-guided convolutional network. The diffusion model synthesizes realistic longitudinal neuroimaging trajectories from limited training data, enriching temporal context and improving robustness to unevenly spaced visits. The attention-convolutional layer then captures discriminative structural--temporal patterns that distinguish cognitively normal subjects from those with mild cognitive impairment and subjective cognitive decline. Experiments on synthetic and ADNI datasets demonstrate that DiGAN outperforms existing state-of-the-art baselines, showing its potential for early-stage AD detection.

[75] PriorProbe: Recovering Individual-Level Priors for Personalizing Neural Networks in Facial Expression Recognition

Haijiang Yan,Nick Chater,Adam Sanborn

Main category: cs.CV

TL;DR: 本文提出PriorProbe方法,利用人机协同的马尔可夫链蒙特卡洛技术,精准提取个体认知先验,并成功将其融入神经网络以提升对模糊刺激的个性化识别性能。

Details Motivation: 现有方法难以准确、无偏地提取个体层面的认知先验,限制了神经网络的个性化能力。 Method: 提出PriorProbe方法,基于'与人协同的马尔可夫链蒙特卡洛(MCMC with People)',在面部表情识别任务中对个体参与者进行先验 elicitation,并将恢复的先验整合进先进神经网络中评估效果。 Result: PriorProbe提取的个体先验显著提升了模型对模糊刺激的个体分类预测性能,优于基线模型及其他先验来源,同时不损害模型对真实标签的推理能力。 Conclusion: PriorProbe为深度神经网络的个性化提供了通用且可解释的新框架。 Abstract: Incorporating individual-level cognitive priors offers an important route to personalizing neural networks, yet accurately eliciting such priors remains challenging: existing methods either fail to uniquely identify them or introduce systematic biases. Here, we introduce PriorProbe, a novel elicitation approach grounded in Markov Chain Monte Carlo with People that recovers fine-grained, individual-specific priors. Focusing on a facial expression recognition task, we apply PriorProbe to individual participants and test whether integrating the recovered priors with a state-of-the-art neural network improves its ability to predict an individual's classification on ambiguous stimuli. The PriorProbe-derived priors yield substantial performance gains, outperforming both the neural network alone and alternative sources of priors, while preserving the network's inference on ground-truth labels. Together, these results demonstrate that PriorProbe provides a general and interpretable framework for personalizing deep neural networks.

[76] Explainable Computer Vision Framework for Automated Pore Detection and Criticality Assessment in Additive Manufacturing

Akshansh Mishra,Rakesh Morisetty

Main category: cs.CV

TL;DR: 本文提出了一种可解释的计算机视觉框架,用于三维断层扫描图像中的孔隙检测与关键性评估,通过几何特征和SHAP分析揭示表面距离是决定孔隙关键性的最主要因素。

Details Motivation: 内部孔隙是增材制造部件的关键缺陷模式,现有自动缺陷检测方法缺乏可解释性,工程师无法理解关键性预测的物理基础。 Method: 基于灰度切片重建三维体数据,采用阈值分割与连通域分析识别孔隙;提取尺寸、长宽比、范围及距边界的归一化距离等几何特征;构建基于百分位欧氏距离的孔隙交互网络;利用机器学习模型预测关键性分数,并通过SHAP分析量化各特征贡献。 Result: 归一化表面距离对模型预测的贡献远超其他所有特征(高出一个数量级以上);孔隙尺寸影响极小,其余几何参数影响可忽略;表面邻近性与关键性呈强负相关,表明边界驱动失效机制。 Conclusion: 该可解释框架实现了透明化缺陷评估,为增材制造的工艺优化与质量控制提供了可操作的物理洞见。 Abstract: Internal porosity remains a critical defect mode in additively manufactured components, compromising structural performance and limiting industrial adoption. Automated defect detection methods exist but lack interpretability, preventing engineers from understanding the physical basis of criticality predictions. This study presents an explainable computer vision framework for pore detection and criticality assessment in three-dimensional tomographic volumes. Sequential grayscale slices were reconstructed into volumetric datasets, and intensity-based thresholding with connected component analysis identified 500 individual pores. Each pore was characterized using geometric descriptors including size, aspect ratio, extent, and spatial position relative to the specimen boundary. A pore interaction network was constructed using percentile-based Euclidean distance criteria, yielding 24,950 inter-pore connections. Machine learning models predicted pore criticality scores from extracted features, and SHAP analysis quantified individual feature contributions. Results demonstrate that normalized surface distance dominates model predictions, contributing more than an order of magnitude greater importance than all other descriptors. Pore size provides minimal influence, while geometric parameters show negligible impact. The strong inverse relationship between surface proximity and criticality reveals boundary-driven failure mechanisms. This interpretable framework enables transparent defect assessment and provides actionable insights for process optimization and quality control in additive manufacturing.

[77] 4DPC$^2$hat: Towards Dynamic Point Cloud Understanding with Failure-Aware Bootstrapping

Xindan Zhang,Weilong Yan,Yufei Shi,Xuerui Qiu,Tao He,Ying Li,Ming Li,Hehe Fan

Main category: cs.CV

TL;DR: 本文提出了首个面向动态点云理解的多模态大语言模型4DPC²hat,构建了大规模跨模态数据集4DPC²hat-200K,并引入Mamba增强的时间推理模块与失败感知自举学习策略,显著提升了动作理解和时序推理能力。

Details Motivation: 现有方法主要关注静态点云,而动态点云序列的理解因缺乏大规模跨模态数据集和难以建模时空运动模式而未被充分探索。 Method: 构建了包含44K动态物体序列、700K点云帧和200K QA对的大规模数据集4DPC²hat-200K;提出Mamba增强的时序推理MLLM;设计失败感知的自举学习策略以迭代提升模型推理能力。 Result: 在动作理解和时序推理任务上显著优于现有模型,为4D动态点云理解奠定坚实基础。 Conclusion: 4DPC²hat是首个专为动态点云理解设计的MLLM,通过新数据集、新架构与新训练策略,有效推动了4D点云跨模态理解的发展。 Abstract: Point clouds provide a compact and expressive representation of 3D objects, and have recently been integrated into multimodal large language models (MLLMs). However, existing methods primarily focus on static objects, while understanding dynamic point cloud sequences remains largely unexplored. This limitation is mainly caused by the lack of large-scale cross-modal datasets and the difficulty of modeling motions in spatio-temporal contexts. To bridge this gap, we present 4DPC$^2$hat, the first MLLM tailored for dynamic point cloud understanding. To this end, we construct a large-scale cross-modal dataset 4DPC$^2$hat-200K via a meticulous two-stage pipeline consisting of topology-consistent 4D point construction and two-level captioning. The dataset contains over 44K dynamic object sequences, 700K point cloud frames, and 200K curated question-answer (QA) pairs, supporting inquiries about counting, temporal relationship, action, spatial relationship, and appearance. At the core of the framework, we introduce a Mamba-enhanced temporal reasoning MLLM to capture long-range dependencies and dynamic patterns among a point cloud sequence. Furthermore, we propose a failure-aware bootstrapping learning strategy that iteratively identifies model deficiencies and generates targeted QA supervision to continuously strengthen corresponding reasoning capabilities. Extensive experiments demonstrate that our 4DPC$^2$hat significantly improves action understanding and temporal reasoning compared with existing models, establishing a strong foundation for 4D dynamic point cloud understanding.

[78] Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation

Jinxing Zhou,Yanghao Zhou,Yaoting Wang,Zongyan Han,Jiaqi Ma,Henghui Ding,Rao Muhammad Anwer,Hisham Cholakkal

Main category: cs.CV

TL;DR: 本文提出了MQA-RefAVS新任务,即在无真值标注情况下评估语言引导的音视频分割(Ref-AVS)结果的质量,包括IoU估计、错误类型识别与质量控制建议;构建了MQ-RAVSBench基准,并设计了多模态大模型MQ-Auditor进行评估,实验证明其有效性。

Details Motivation: 现有Ref-AVS研究聚焦于生成分割掩码,而缺乏对掩码质量的可解释、无需真值的自动评估机制,限制了系统鲁棒性与可调试性。 Method: 提出MQA-RefAVS任务框架,构建涵盖几何与语义错误的MQ-RAVSBench基准,并设计基于多模态大语言模型(MLLM)的MQ-Auditor,联合建模音视频、文本及掩码特征以实现定量(IoU预测)与定性(错误分类+决策建议)评估。 Result: MQ-Auditor在MQ-RAVSBench上显著优于开源及商用多模态大模型;可无缝集成至现有Ref-AVS系统中,有效检测分割失败并支持后续优化。 Conclusion: MQA-RefAVS填补了Ref-AVS中无监督掩码质量评估的空白,MQ-Auditor为多模态感知系统的可信评估与闭环优化提供了新范式。 Abstract: Language-referred audio-visual segmentation (Ref-AVS) aims to segment target objects described by natural language by jointly reasoning over video, audio, and text. Beyond generating segmentation masks, providing rich and interpretable diagnoses of mask quality remains largely underexplored. In this work, we introduce Mask Quality Assessment in the Ref-AVS context (MQA-RefAVS), a new task that evaluates the quality of candidate segmentation masks without relying on ground-truth annotations as references at inference time. Given audio-visual-language inputs and each provided segmentation mask, the task requires estimating its IoU with the unobserved ground truth, identifying the corresponding error type, and recommending an actionable quality-control decision. To support this task, we construct MQ-RAVSBench, a benchmark featuring diverse and representative mask error modes that span both geometric and semantic issues. We further propose MQ-Auditor, a multimodal large language model (MLLM)-based auditor that explicitly reasons over multimodal cues and mask information to produce quantitative and qualitative mask quality assessments. Extensive experiments demonstrate that MQ-Auditor outperforms strong open-source and commercial MLLMs and can be integrated with existing Ref-AVS systems to detect segmentation failures and support downstream segmentation improvement. Data and codes will be released at https://github.com/jasongief/MQA-RefAVS.

[79] GPAIR: Gaussian-Kernel-Based Ultrafast 3D Photoacoustic Iterative Reconstruction

Yibing Wang,Shuang Li,Tingting Huang,Yu Zhang,Chulhong Kim,Seongwook Choi,Changhui Li

Main category: cs.CV

TL;DR: 本文提出了一种名为GPAIR的超快速三维光声迭代重建方法,通过高斯核变换和GPU加速的可微Triton算子,将重建时间缩短至亚秒级,显著推动三维光声计算机断层成像的临床应用。

Details Motivation: 传统迭代重建(IR)算法在三维光声断层成像(PACT)中虽能有效校正伪影,但计算耗时过长(数百秒至数小时),严重限制其实际应用。 Method: 提出高斯核基超快速三维光声迭代重建(GPAIR):将传统空间网格用连续各向同性高斯核表示,并推导压力波的解析闭式表达式;结合GPU加速的可微Triton算子实现高效计算。 Result: 在动物实验中,对含840万体素的三维目标实现亚秒级重建,速度提升达数量级。 Conclusion: GPAIR实现了近实时的大规模三维光声重建,极大促进了三维PACT向临床应用的转化。 Abstract: Although the iterative reconstruction (IR) algorithm can substantially correct reconstruction artifacts in photoacoustic (PA) computed tomography (PACT), it suffers from long reconstruction times, especially for large-scale three-dimensional (3D) imaging in which IR takes hundreds of seconds to hours. The computing burden severely limits the practical applicability of IR algorithms. In this work, we proposed an ultrafast IR method for 3D PACT, called Gaussian-kernel-based Ultrafast 3D Photoacoustic Iterative Reconstruction (GPAIR), which achieves orders-of-magnitude acceleration in computing. GPAIR transforms traditional spatial grids with continuous isotropic Gaussian kernels. By deriving analytical closed-form expression for pressure waves and implementing powerful GPU-accelerated differentiable Triton operators, GPAIR demonstrates extraordinary ultrafast sub-second reconstruction speed for 3D targets containing 8.4 million voxels in animal experiments. This revolutionary ultrafast image reconstruction enables near-real-time large-scale 3D PA reconstruction, significantly advancing 3D PACT toward clinical applications.

[80] Vision Transformers for Zero-Shot Clustering of Animal Images: A Comparative Benchmarking Study

Hugo Markoff,Stefan Hein Bengtson,Michael Ørsted

Main category: cs.CV

TL;DR: 本研究评估了视觉Transformer(ViT)基础模型在无监督动物图像物种级聚类中的能力,发现DINOv3结合t-SNE与监督层次聚类可实现近乎完美的物种识别(V-measure 0.958),而无监督方法也表现优异(0.943),并能揭示年龄、性别等生态内变异性。

Details Motivation: 手动标注动物图像严重制约生态学研究的规模与效率,亟需自动化、无需大量标注的物种识别方法。 Method: 构建涵盖5种ViT模型、5种降维方法和4种聚类算法(2种监督、2种无监督)的综合基准框架,在60个物种(各200张验证图像)上系统评估;分析聚类成功/失败场景及对种内变异(如性别、年龄)的解析能力。 Result: DINOv3+t-SNE+监督层次聚类达V-measure 0.958;无监督方法达0.943,仅1.14%图像被拒为离群点;方法对长尾分布鲁棒,且过聚类可稳定提取年龄、性二型、毛色等种内差异。 Conclusion: ViT基础模型(尤其DINOv3)结合适当降维与聚类策略,可在无需大量标注前提下高效、鲁棒地实现动物图像物种级分类与种内细粒度解析,为生态监测提供实用开源工具与方法指南。 Abstract: Manual labeling of animal images remains a significant bottleneck in ecological research, limiting the scale and efficiency of biodiversity monitoring efforts. This study investigates whether state-of-the-art Vision Transformer (ViT) foundation models can reduce thousands of unlabeled animal images directly to species-level clusters. We present a comprehensive benchmarking framework evaluating five ViT models combined with five dimensionality reduction techniques and four clustering algorithms, two supervised and two unsupervised, across 60 species (30 mammals and 30 birds), with each test using a random subset of 200 validated images per species. We investigate when clustering succeeds at species-level, where it fails, and whether clustering within the species-level reveals ecologically meaningful patterns such as sex, age, or phenotypic variation. Our results demonstrate near-perfect species-level clustering (V-measure: 0.958) using DINOv3 embeddings with t-SNE and supervised hierarchical clustering methods. Unsupervised approaches achieve competitive performance (0.943) while requiring no prior species knowledge, rejecting only 1.14% of images as outliers requiring expert review. We further demonstrate robustness to realistic long-tailed distributions of species and show that intentional over-clustering can reliably extract intra-specific variation including age classes, sexual dimorphism, and pelage differences. We introduce an open-source benchmarking toolkit and provide recommendations for ecologists to select appropriate methods for sorting their specific taxonomic groups and data.

[81] Benchmarking Bias Mitigation Toward Fairness Without Harm from Vision to LVLMs

Xuwei Tan,Ziyu Hu,Xueru Zhang

Main category: cs.CV

TL;DR: 本文提出了NH-Fair统一公平性基准,涵盖视觉与多模态大模型,在标准化数据、指标和训练协议下评估公平性;发现许多去偏方法未必优于调优良好的ERM基线,而一种复合数据增强方法能稳定提升公平性且不损效用;LVLM虽精度更高但仍存在子群偏差,架构与训练选择的影响常大于规模扩展。

Details Motivation: 现有公平性评估因数据集异构、指标不一致、模型类型隔离及超参调优不足,难以进行公平、可比的偏见缓解方法比较。 Method: 构建NH-Fair统一基准,覆盖视觉模型与大型视觉语言模型(LVLM),支持监督与零样本设定;开展系统性ERM超参调优研究;对比多种去偏方法,并提出并验证一种复合数据增强策略;分析LVLM的子群表现与缩放效应。 Result: (1)识别出显著影响效用与偏差的关键训练选择,形成实用调优指南;(2)多数去偏方法未稳定优于调优后的ERM基线,而复合数据增强法在保持效用前提下持续改善公平性;(3)LVLM虽平均准确率更高,但仍有子群偏差,性能增益更多来自架构/训练设计而非模型规模。 Conclusion: NH-Fair为公平性评估提供了可复现、调优感知、无害导向的标准化框架;强调严谨实验设计与基线调优的重要性,并指出轻量级数据增强是当前兼具实用性与有效性的公平提升策略。 Abstract: Machine learning models trained on real-world data often inherit and amplify biases against certain social groups, raising urgent concerns about their deployment at scale. While numerous bias mitigation methods have been proposed, comparing the effectiveness of bias mitigation methods remains difficult due to heterogeneous datasets, inconsistent fairness metrics, isolated evaluation of vision versus multi-modal models, and insufficient hyperparameter tuning that undermines fair comparisons. We introduce NH-Fair, a unified benchmark for fairness without harm that spans both vision models and large vision-language models (LVLMs) under standardized data, metrics, and training protocols, covering supervised and zero-shot regimes. Our key contributions are: (1) a systematic ERM tuning study that identifies training choices with large influence on both utility and disparities, yielding empirically grounded guidelines to help practitioners reduce expensive hyperparameter tuning space in achieving strong fairness and accuracy; (2) evidence that many debiasing methods do not reliably outperform a well-tuned ERM baseline, whereas a composite data-augmentation method consistently delivers parity gains without sacrificing utility, emerging as a promising practical strategy. (3) an analysis showing that while LVLMs achieve higher average accuracy, they still exhibit subgroup disparities, and gains from scaling are typically smaller than those from architectural or training-protocol choices. NH-Fair provides a reproducible, tuning-aware pipeline for rigorous, harm-aware fairness evaluation.

[82] HY3D-Bench: Generation of 3D Assets

Team Hunyuan3D,:,Bowen Zhang,Chunchao Guo,Dongyuan Guo,Haolin Liu,Hongyu Yan,Huiwen Shi,Jiaao Yu,Jiachen Xu,Jingwei Huang,Kunhong Li,Lifu Wang,Linus,Penghao Wang,Qingxiang Lin,Ruining Tang,Xianghui Yang,Yang Li,Yirui Guan,Yunfei Zhao,Yunhan Yang,Zeqiang Lai,Zhihao Liang,Zibo Zhao

Main category: cs.CV

TL;DR: 本文提出了HY3D-Bench,一个开源的3D生成数据集生态系统,包含25万高质量真实3D物体和12.5万合成资产,并引入结构化部件级分解,旨在解决3D内容生成中的数据瓶颈问题。

Details Motivation: 当前神经表示与生成模型虽推动了3D内容创作,但受限于数据处理瓶颈,缺乏统一、高质量、易用的训练数据基础。 Method: 构建HY3D-Bench:(1)从大规模仓库中筛选并清洗25万高保真3D物体,输出水密网格与多视角渲染;(2)提供结构化的部件级语义分解;(3)设计可扩展AIGC合成流程,生成12.5万合成资产以覆盖长尾类别。 Result: HY3D-Bench已成功支撑Hunyuan3D-2.1-Small模型训练,验证其有效性;数据集开源,显著提升3D感知、机器人及数字内容生成领域的数据可及性与多样性。 Conclusion: HY3D-Bench为3D生成提供了高质量、结构化、规模化的数据基础,有望推动相关领域方法创新与实际应用落地。 Abstract: While recent advances in neural representations and generative models have revolutionized 3D content creation, the field remains constrained by significant data processing bottlenecks. To address this, we introduce HY3D-Bench, an open-source ecosystem designed to establish a unified, high-quality foundation for 3D generation. Our contributions are threefold: (1) We curate a library of 250k high-fidelity 3D objects distilled from large-scale repositories, employing a rigorous pipeline to deliver training-ready artifacts, including watertight meshes and multi-view renderings; (2) We introduce structured part-level decomposition, providing the granularity essential for fine-grained perception and controllable editing; and (3) We bridge real-world distribution gaps via a scalable AIGC synthesis pipeline, contributing 125k synthetic assets to enhance diversity in long-tail categories. Validated empirically through the training of Hunyuan3D-2.1-Small, HY3D-Bench democratizes access to robust data resources, aiming to catalyze innovation across 3D perception, robotics, and digital content creation.

[83] Entropy-Aware Structural Alignment for Zero-Shot Handwritten Chinese Character Recognition

Qiuming Luo,Tao Zeng,Feng Li,Heming Liu,Rui Mao,Chang Kong

Main category: cs.CV

TL;DR: 本文提出了一种熵感知结构对齐网络,通过信息熵先验、双视角部首树和Top-K语义特征融合机制,提升零样本手写汉字识别性能。

Details Motivation: 现有方法将汉字视为扁平的部首序列,忽略了其层次化结构和不同部件信息密度不均的问题。 Method: 提出熵感知结构对齐网络:1)引入信息熵先验动态调制位置嵌入;2)构建双视角部首树提取多粒度结构特征,并用Sigmoid门控网络融合;3)设计Top-K语义特征融合机制,利用语义邻域质心增强解码。 Result: 在零样本手写汉字识别任务上达到新SOTA,显著优于CLIP基线,并展现出优异的数据效率和少样本适应能力。 Conclusion: 该方法通过信息论建模有效弥合视觉-语义鸿沟,验证了结构化建模与信息密度感知对零样本汉字识别的关键作用。 Abstract: Zero-shot Handwritten Chinese Character Recognition (HCCR) aims to recognize unseen characters by leveraging radical-based semantic compositions. However, existing approaches often treat characters as flat radical sequences, neglecting the hierarchical topology and the uneven information density of different components. To address these limitations, we propose an Entropy-Aware Structural Alignment Network that bridges the visual-semantic gap through information-theoretic modeling. First, we introduce an Information Entropy Prior to dynamically modulate positional embeddings via multiplicative interaction, acting as a saliency detector that prioritizes discriminative roots over ubiquitous components. Second, we construct a Dual-View Radical Tree to extract multi-granularity structural features, which are integrated via an adaptive Sigmoid-based gating network to encode both global layout and local spatial roles. Finally, a Top-K Semantic Feature Fusion mechanism is devised to augment the decoding process by utilizing the centroid of semantic neighbors, effectively rectifying visual ambiguities through feature-level consensus. Extensive experiments demonstrate that our method establishes new state-of-the-art performance, significantly outperforming existing CLIP-based baselines in the challenging zero-shot setting. Furthermore, the framework exhibits exceptional data efficiency, demonstrating rapid adaptability with minimal support samples.

[84] Phaedra: Learning High-Fidelity Discrete Tokenization for the Physical Science

Levi Lingsch,Georgios Kissas,Johannes Jakubik,Siddhartha Mishra

Main category: cs.CV

TL;DR: 本文提出Phaedra方法,专为科学图像(如偏微分方程解、地球观测数据)设计的新型图像标记器,兼顾物理与频谱保真度,在重建精度和分布外泛化能力上优于现有标记器。

Details Motivation: 现有图像标记器面向真实视觉感知设计,难以满足科学图像对大动态范围、物理一致性及频谱特性的高要求。 Method: 受经典形状-增益量化和本征正交分解启发,提出Phaedra标记器,并在多维度指标(物理空间与频谱空间的PDE性质保真度)下系统评估各类标记器性能。 Result: Phaedra在多个PDE数据集上显著提升重建精度,并在三类递进复杂任务中展现强分布外泛化能力:不同条件的已知PDE、未知PDE、真实地球观测与气象数据。 Conclusion: 针对科学图像的标记器需兼顾细粒度结构与精确幅值建模;Phaedra为该任务提供了更优的通用表征方案。 Abstract: Tokens are discrete representations that allow modern deep learning to scale by transforming high-dimensional data into sequences that can be efficiently learned, generated, and generalized to new tasks. These have become foundational for image and video generation and, more recently, physical simulation. As existing tokenizers are designed for the explicit requirements of realistic visual perception of images, it is necessary to ask whether these approaches are optimal for scientific images, which exhibit a large dynamic range and require token embeddings to retain physical and spectral properties. In this work, we investigate the accuracy of a suite of image tokenizers across a range of metrics designed to measure the fidelity of PDE properties in both physical and spectral space. Based on the observation that these struggle to capture both fine details and precise magnitudes, we propose Phaedra, inspired by classical shape-gain quantization and proper orthogonal decomposition. We demonstrate that Phaedra consistently improves reconstruction across a range of PDE datasets. Additionally, our results show strong out-of-distribution generalization capabilities to three tasks of increasing complexity, namely known PDEs with different conditions, unknown PDEs, and real-world Earth observation and weather data.

[85] SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

Azmine Toushik Wasi,Wahid Faisal,Abdur Rahman,Mahfuz Ahmed Anik,Munem Shahriar,Mohsin Mahmud Topu,Sadia Tasnim Meem,Rahatun Nesa Priti,Sabrina Afroz Mitu,Md. Iqramul Hoque,Shahriyar Zaman Ridoy,Mohammed Eunus Ali,Majd Hawasly,Mohammad Raza,Md Rizwan Parvez

Main category: cs.CV

TL;DR: 本文提出了SpatiaLab,一个用于评估视觉语言模型(VLMs)在真实、无约束场景中空间推理能力的综合性基准,涵盖六大类共30种空间任务;实验表明当前VLMs与人类存在显著性能差距,尤其在深度感知、导航和3D几何理解方面。

Details Motivation: 现有VLM空间推理评估多依赖合成或LLM生成的简化环境,无法反映真实世界中的复杂性、视觉噪声与多样空间关系,亟需更贴近现实的评测基准。 Method: 构建SpatiaLab基准:包含1400个视觉问答对,覆盖6大空间类别(相对位置、遮挡与深度、朝向、尺度、空间导航、3D几何),每类5个子类共30种任务类型,支持多选与开放作答两种评估方式。 Result: 在多选设置下,最强模型InternVL3.5-72B准确率仅54.93%(人类87.57%);开放作答下所有模型下降10–25%,GPT-5-mini最高为40.93%(人类64.93%),凸显模型在深度、导航及3D几何等任务上的明显短板。 Conclusion: SpatiaLab揭示了当前VLM在真实空间推理任务中的关键局限,为推动具备鲁棒、类人空间理解能力的模型发展提供了重要评测工具与研究方向。 Abstract: Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question-answer pairs across six major categories: Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry, each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions, and each main category includes at least 200 questions, supporting both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, including open- and closed-source models, reasoning-focused, and specialized spatial reasoning models, reveal a substantial gap in spatial reasoning capabilities compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs' spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding. SpatiaLab is available at: https://spatialab-reasoning.github.io/.

[86] Entropy Reveals Block Importance in Masked Self-Supervised Vision Transformers

Peihao Xiang,Kaida Wu,Ou Bai

Main category: cs.CV

TL;DR: 本文提出Gardener方法,通过信息熵无数据评估掩码自监督视觉Transformer中各模块的重要性,实现高效块级剪枝,在视频识别任务中显著压缩模型且保持性能。

Details Motivation: 掩码自监督视觉Transformer模型庞大,难以在资源受限场景部署和高效迁移学习;需探究各Transformer模块对下游任务的重要性是否均等。 Method: 提出Gardener:利用预训练模块权重的信息熵估计其重要性,无需任何数据、单次计算即可识别冗余模块并剪枝。 Result: 在VideoMAE-B上验证,Gardener在零数据、极低开销下,剪枝91.7%模块后仍保持有竞争力的下游识别性能,性能媲美或超越现有无数据剪枝基线,并逼近基于敏感度的剪枝效果。 Conclusion: 掩码自监督视觉Transformer存在显著的模块级冗余;信息熵可作为模块重要性的可靠代理,为模型压缩与高效迁移学习提供原理清晰、计算高效的路径。 Abstract: Masked self-supervised vision transformers have become a dominant pretraining paradigm, yet their substantial model size poses significant challenges for resource-constrained deployment and efficient transfer learning. A fundamental question remains: are all transformer blocks equally important for downstream performance? In this paper, we show that block importance in masked self-supervised vision transformers can be accurately estimated without access to any data. Our key finding is that the information entropy of pretrained block weights strongly correlates with oracle sensitivity obtained via iterative block removal and finetuning. This observation enables Gardener, a data-free, one-shot, block-level pruning principle that identifies redundant blocks through simple information-theoretic measurements. We evaluate Gardener on VideoMAE-B across multiple pruning ratios and downstream video recognition benchmarks. Despite its negligible computational overhead, Gardener consistently matches or outperforms existing data-free pruning baselines and closely approaches sensitivity-based pruning. Remarkably, even after pruning up to 91.7\% of blocks, the pruned model retains competitive transfer performance. Our results reveal substantial block-level redundancy in masked self-supervised vision transformers and demonstrate that information-theoretic analysis offers a principled and efficient pathway for model compression and resource-efficient transfer learning.

[87] TiCLS : Tightly Coupled Language Text Spotter

Leeje Jang,Yijun Lin,Yao-Yi Chiang,Jerod Weinman

Main category: cs.CV

TL;DR: 本文提出TiCLS,一种端到端的场景文本检测与识别方法,通过引入字符级预训练语言模型(PLM)的外部语言知识,提升对模糊或碎片化文本的识别鲁棒性,在ICDAR 2015和Total-Text上达到SOTA性能。

Details Motivation: 现有方法主要依赖视觉线索,隐式建模局部字符依赖,忽略了外部语言知识的价值;而先前融合语言模型的方法要么缺乏外部知识,要么预训练模型与词级粒度不匹配。 Method: 提出TiCLS框架,包含一个可由字符级预训练语言模型初始化的‘语言解码器’,显式融合视觉与语言特征。 Result: 在ICDAR 2015和Total-Text数据集上取得当前最优性能(state-of-the-art)。 Conclusion: 显式引入字符级预训练语言模型的外部语言知识,能有效提升场景文本识别的鲁棒性与准确性,验证了PLM引导的语言集成策略的有效性。 Abstract: Scene text spotting aims to detect and recognize text in real-world images, where instances are often short, fragmented, or visually ambiguous. Existing methods primarily rely on visual cues and implicitly capture local character dependencies, but they overlook the benefits of external linguistic knowledge. Prior attempts to integrate language models either adapt language modeling objectives without external knowledge or apply pretrained models that are misaligned with the word-level granularity of scene text. We propose TiCLS, an end-to-end text spotter that explicitly incorporates external linguistic knowledge from a character-level pretrained language model. TiCLS introduces a linguistic decoder that fuses visual and linguistic features, yet can be initialized by a pretrained language model, enabling robust recognition of ambiguous or fragmented text. Experiments on ICDAR 2015 and Total-Text demonstrate that TiCLS achieves state-of-the-art performance, validating the effectiveness of PLM-guided linguistic integration for scene text spotting.

[88] AnyStyle: Single-Pass Multimodal Stylization for 3D Gaussian Splatting

Joanna Kaleta,Bartosz Świrta,Kacper Kania,Przemysław Spurek,Marek Kowalski

Main category: cs.CV

TL;DR: 本文提出AnyStyle,一种支持文本和图像多模态条件输入的前馈式3D重建与风格化框架,实现无需位姿、零样本的3D场景风格控制,兼顾高质量几何重建与强风格可控性。

Details Motivation: 现有无位姿3D重建方法缺乏灵活可控的外观/风格编辑能力,图像条件方式限制大,亟需支持自然语言或参考图等更灵活的风格控制手段。 Method: 提出模块化风格化架构AnyStyle,兼容现有前馈3D重建主干(如3DGS),支持文本和视觉双模态风格输入,仅需最小架构修改即可集成。 Result: 在保持高精度几何重建的同时,显著提升风格可控性与保真度;用户研究表明其风格化质量优于当前SOTA方法。 Conclusion: AnyStyle为无位姿、前馈式3D重建提供了首个高效、灵活、零样本的多模态风格化解决方案,推动了可编辑3D内容生成的发展。 Abstract: The growing demand for rapid and scalable 3D asset creation has driven interest in feed-forward 3D reconstruction methods, with 3D Gaussian Splatting (3DGS) emerging as an effective scene representation. While recent approaches have demonstrated pose-free reconstruction from unposed image collections, integrating stylization or appearance control into such pipelines remains underexplored. Existing attempts largely rely on image-based conditioning, which limits both controllability and flexibility. In this work, we introduce AnyStyle, a feed-forward 3D reconstruction and stylization framework that enables pose-free, zero-shot stylization through multimodal conditioning. Our method supports both textual and visual style inputs, allowing users to control the scene appearance using natural language descriptions or reference images. We propose a modular stylization architecture that requires only minimal architectural modifications and can be integrated into existing feed-forward 3D reconstruction backbones. Experiments demonstrate that AnyStyle improves style controllability over prior feed-forward stylization methods while preserving high-quality geometric reconstruction. A user study further confirms that AnyStyle achieves superior stylization quality compared to an existing state-of-the-art approach. Repository: https://github.com/joaxkal/AnyStyle.

[89] A Parameterizable Convolution Accelerator for Embedded Deep Learning Applications

Panagiotis Mousouliotis,Georgios Keramidas

Main category: cs.CV

TL;DR: 本文提出了一种面向多约束(延迟、功耗、面积、成本)的CNN加速器硬件-软件协同设计方法,基于高层次综合(HLS)实现可参数化设计,提升了优化灵活性与跨应用扩展性。

Details Motivation: 现有FPGA上的CNN加速器多以峰值性能(GOPS)为单一优化目标,难以满足实际嵌入式DL应用对延迟、功耗、面积和成本等多维度约束的需求。 Method: 采用硬件-软件协同设计方法,利用高层次综合(HLS)工具描述CNN加速器,支持设计参数灵活配置,从而在多个约束下进行系统级优化。 Result: 实验表明,该可参数化设计方法优于非参数化方法,并具备向其他类型深度学习应用扩展的能力。 Conclusion: 基于HLS的可参数化HW/SW协同设计是兼顾多目标约束、提升CNN加速器实用性和可移植性的有效途径。 Abstract: Convolutional neural network (CNN) accelerators implemented on Field-Programmable Gate Arrays (FPGAs) are typically designed with a primary focus on maximizing performance, often measured in giga-operations per second (GOPS). However, real-life embedded deep learning (DL) applications impose multiple constraints related to latency, power consumption, area, and cost. This work presents a hardware-software (HW/SW) co-design methodology in which a CNN accelerator is described using high-level synthesis (HLS) tools that ease the parameterization of the design, facilitating more effective optimizations across multiple design constraints. Our experimental results demonstrate that the proposed design methodology is able to outperform non-parameterized design approaches, and it can be easily extended to other types of DL applications.

[90] Fast, Unsupervised Framework for Registration Quality Assessment of Multi-stain Histological Whole Slide Pairs

Shikha Dubey,Patricia Raciti,Kristopher Standish,Albert Juan Ramon,Erik Ames Burlingame

Main category: cs.CV

TL;DR: 本文提出了一种无需真值标注、快速且无监督的全切片图像(WSI)配准质量评估(RQA)框架,结合组织掩膜和形变场指标,实现高效、高保真、低计算开销的大规模数字病理质控。

Details Motivation: 现有WSI配准质量评估方法依赖人工标注或强度相似性度量,耗时、不可靠且计算成本高,难以用于大规模应用;尤其在缺乏真值标注(GT)的情况下,亟需一种可靠、自动、高效的评估方案。 Method: 提出一种联合使用下采样组织掩膜指标和形变场指标的无监督RQA框架:掩膜指标衡量全局结构一致性,形变指标评估局部平滑性、连续性与变换合理性。 Result: 在多种IHC标记物及多位专家评估验证下,自动化指标与人工评分呈现强相关性;该方法在无GT条件下实现近实时、高保真、低计算资源消耗的RQA。 Conclusion: 该框架为数字病理中大规模WSI配准提供了可靠、实用、可扩展的质量控制工具,显著提升集成分子分析的可信度与效率。 Abstract: High-fidelity registration of histopathological whole slide images (WSIs), such as hematoxylin & eosin (H&E) and immunohistochemistry (IHC), is vital for integrated molecular analysis but challenging to evaluate without ground-truth (GT) annotations. Existing WSI-level assessments -- using annotated landmarks or intensity-based similarity metrics -- are often time-consuming, unreliable, and computationally intensive, limiting large-scale applicability. This study proposes a fast, unsupervised framework that jointly employs down-sampled tissue masks- and deformations-based metrics for registration quality assessment (RQA) of registered H&E and IHC WSI pairs. The masks-based metrics measure global structural correspondence, while the deformations-based metrics evaluate local smoothness, continuity, and transformation realism. Validation across multiple IHC markers and multi-expert assessments demonstrate a strong correlation between automated metrics and human evaluations. In the absence of GT, this framework offers reliable, real-time RQA with high fidelity and minimal computational resources, making it suitable for large-scale quality control in digital pathology.

[91] Artifact Removal and Image Restoration in AFM:A Structured Mask-Guided Directional Inpainting Approach

Juntao Zhang,Angona Biswas,Jaydeep Rade,Charchit Shukla,Juan Ren,Anwesha Sarkar,Adarsh Krishnamurthy,Aditya Balu

Main category: cs.CV

TL;DR: 本文提出了一种轻量级、全自动的AFM图像伪影检测与修复框架,结合分类、语义分割、方向性插值与局部高斯平滑,实现高保真纳米结构恢复。

Details Motivation: AFM图像常受环境噪声、扫描缺陷和针尖-样品相互作用影响而引入伪影,影响高分辨率表面分析的可靠性。 Method: 采用两阶段流程:首先用分类模型判断图像是否含伪影;若存在,则用专为AFM数据设计的轻量语义分割网络生成伪影掩码,再按结构方向自适应扩展掩码,接着使用方向性邻域插值进行修复,并辅以局部高斯平滑实现无缝重建;整个系统集成于支持实时调参与批量处理的GUI中。 Result: 实验表明该方法能有效去除伪影,同时保持纳米尺度结构细节,提升AFM数据解释的鲁棒性与几何保真度。 Conclusion: 该框架为AFM图像分析提供了一种轻量、自动、几何感知的高质量修复解决方案,适用于实际科研与工业场景。 Abstract: Atomic Force Microscopy (AFM) enables high-resolution surface imaging at the nanoscale, yet the output is often degraded by artifacts introduced by environmental noise, scanning imperfections, and tip-sample interactions. To address this challenge, a lightweight and fully automated framework for artifact detection and restoration in AFM image analysis is presented. The pipeline begins with a classification model that determines whether an AFM image contains artifacts. If necessary, a lightweight semantic segmentation network, custom-designed and trained on AFM data, is applied to generate precise artifact masks. These masks are adaptively expanded based on their structural orientation and then inpainted using a directional neighbor-based interpolation strategy to preserve 3D surface continuity. A localized Gaussian smoothing operation is then applied for seamless restoration. The system is integrated into a user-friendly GUI that supports real-time parameter adjustments and batch processing. Experimental results demonstrate the effective artifact removal while preserving nanoscale structural details, providing a robust, geometry-aware solution for high-fidelity AFM data interpretation.

[92] Seeing Through Clutter: Structured 3D Scene Reconstruction via Iterative Object Removal

Rio Aguina-Kang,Kevin James Blackburn-Matzen,Thibault Groueix,Vladimir Kim,Matheus Gadelha

Main category: cs.CV

TL;DR: SeeingThroughClutter 提出一种无需任务特定训练的单图3D结构化重建方法,通过VLM驱动的迭代式对象移除与重建流程,提升复杂遮挡场景下的分割与建模鲁棒性。

Details Motivation: 现有方法依赖语义分割和深度估计等中间任务,在遮挡和杂乱场景中性能不足。 Method: 采用基于视觉语言模型(VLM)的迭代对象移除与重建流程:逐个检测、分割、移除前景物体,并进行3D拟合,从而将复杂场景分解为更易处理的子任务。 Result: 在3D-Front和ADE20K数据集上达到SOTA鲁棒性,且无需任务特定训练,可直接受益于基础模型进展。 Conclusion: 该方法有效缓解了遮挡与杂乱对单图3D重建的影响,验证了迭代对象移除范式的有效性与通用性。 Abstract: We present SeeingThroughClutter, a method for reconstructing structured 3D representations from single images by segmenting and modeling objects individually. Prior approaches rely on intermediate tasks such as semantic segmentation and depth estimation, which often underperform in complex scenes, particularly in the presence of occlusion and clutter. We address this by introducing an iterative object removal and reconstruction pipeline that decomposes complex scenes into a sequence of simpler subtasks. Using VLMs as orchestrators, foreground objects are removed one at a time via detection, segmentation, object removal, and 3D fitting. We show that removing objects allows for cleaner segmentations of subsequent objects, even in highly occluded scenes. Our method requires no task-specific training and benefits directly from ongoing advances in foundation models. We demonstrate stateof-the-art robustness on 3D-Front and ADE20K datasets. Project Page: https://rioak.github.io/seeingthroughclutter/

[93] iSight: Towards expert-AI co-assessment for improved immunohistochemistry staining interpretation

Jacob S. Leiby,Jialu Yao,Pan Lu,George Hu,Anna Davidian,Shunsuke Koga,Olivia Leung,Pravin Patel,Isabella Tondi Resta,Rebecca Rojansky,Derek Sung,Eric Yang,Paul J. Zhang,Emma Lundberg,Dokyoon Kim,Serena Yeung-Levy,James Zou,Thomas Montine,Jeffrey Nirschl,Zhi Huang

Main category: cs.CV

TL;DR: 本文提出了HPA10M数据集(含超千万张IHC图像)和iSight多任务AI模型,用于自动评估免疫组化染色的强度、定位、数量、组织类型及恶性状态;在多项指标上超越现有基础模型,并在病理医生评估中展现出辅助提升诊断一致性与准确性的能力。

Details Motivation: 现有基于H&E染色的AI模型难以直接迁移到免疫组化(IHC)图像分析,因IHC存在显著的域特异性差异,亟需专用于IHC的大规模标注数据集与适配模型。 Method: 构建了包含1049万张IHC图像、涵盖45种正常组织与20种癌症类型的HPA10M数据集;在此基础上提出iSight多任务学习框架,融合全切片图像视觉特征与组织元数据,通过token级注意力机制同步预测染色强度、位置、数量、组织类型和恶性状态。 Result: iSight在独立测试集上达到85.5%(定位)、76.6%(强度)、75.7%(数量)准确率,优于PLIP/CONCH等微调基础模型2.5–10.2%;校准误差低(0.0150–0.0408);用户研究显示其性能超越初级病理医生评估,并提升病理医生间一致性(Cohen's κ分别从0.63→0.70、0.74→0.76)。 Conclusion: HPA10M与iSight为IHC智能分析提供了关键数据基础与技术范式,验证了AI-专家协同评估可提升IHC诊断的准确性、一致性与临床可靠性,具备向实际病理工作流集成的潜力。 Abstract: Immunohistochemistry (IHC) provides information on protein expression in tissue sections and is commonly used to support pathology diagnosis and disease triage. While AI models for H\&E-stained slides show promise, their applicability to IHC is limited due to domain-specific variations. Here we introduce HPA10M, a dataset that contains 10,495,672 IHC images from the Human Protein Atlas with comprehensive metadata included, and encompasses 45 normal tissue types and 20 major cancer types. Based on HPA10M, we trained iSight, a multi-task learning framework for automated IHC staining assessment. iSight combines visual features from whole-slide images with tissue metadata through a token-level attention mechanism, simultaneously predicting staining intensity, location, quantity, tissue type, and malignancy status. On held-out data, iSight achieved 85.5\% accuracy for location, 76.6\% for intensity, and 75.7\% for quantity, outperforming fine-tuned foundation models (PLIP, CONCH) by 2.5--10.2\%. In addition, iSight demonstrates well-calibrated predictions with expected calibration errors of 0.0150-0.0408. Furthermore, in a user study with eight pathologists evaluating 200 images from two datasets, iSight outperformed initial pathologist assessments on the held-out HPA dataset (79\% vs 68\% for location, 70\% vs 57\% for intensity, 68\% vs 52\% for quantity). Inter-pathologist agreement also improved after AI assistance in both held-out HPA (Cohen's $κ$ increased from 0.63 to 0.70) and Stanford TMAD datasets (from 0.74 to 0.76), suggesting expert--AI co-assessment can improve IHC interpretation. This work establishes a foundation for AI systems that can improve IHC diagnostic accuracy and highlights the potential for integrating iSight into clinical workflows to enhance the consistency and reliability of IHC assessment.

[94] VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding

Junbo Zou,Ziheng Huang,Shengjie Zhang,Liwen Zhang,Weining Shen

Main category: cs.CV

TL;DR: 本文提出了VideoBrain框架,通过学习采样策略使视觉语言模型(VLM)能自适应地获取长视频中的视觉信息,结合语义检索与密集时序采样,在减少30-40%帧数的同时,在多个长视频基准上提升3.5%-9.0%性能。

Details Motivation: 长视频理解面临计算资源限制与需覆盖数千帧信息之间的矛盾;现有方法存在均匀采样易丢失信息、单次关键帧选择无法纠错等问题。 Method: 提出端到端的VideoBrain框架,包含两个互补代理:基于CLIP的语义检索代理和均匀时间采样代理;VLM直接感知帧并判断信息充分性;引入行为感知奖励函数与数据分类流程,约束代理调用时机。 Result: 在四个长视频基准上相比基线提升3.5%至9.0%,且仅使用30-40%的帧数;在短视频基准上展现出强跨数据集泛化能力。 Conclusion: VideoBrain验证了让VLM自主学习视觉信息采集策略的有效性,为高效、自适应的长视频理解提供了新范式。 Abstract: Long-form video understanding remains challenging for Vision-Language Models (VLMs) due to the inherent tension between computational constraints and the need to capture information distributed across thousands of frames. Existing approaches either sample frames uniformly (risking information loss) or select keyframes in a single pass (with no recovery from poor choices). We propose VideoBrain, an end-to-end framework that enables VLMs to adaptively acquire visual information through learned sampling policies. Our approach features dual complementary agents: a CLIP-based agent for semantic retrieval across the video and a Uniform agent for dense temporal sampling within intervals. Unlike prior agent-based methods that rely on text-only LLMs orchestrating visual tools, our VLM directly perceives frames and reasons about information sufficiency. To prevent models from invoking agents indiscriminately to maximize rewards, we introduce a behavior-aware reward function coupled with a data classification pipeline that teaches the model when agent invocation is genuinely beneficial. Experiments on four long video benchmarks demonstrate that VideoBrain achieves +3.5% to +9.0% improvement over the baseline while using 30-40% fewer frames, with strong cross-dataset generalization to short video benchmarks.

[95] DMS2F-HAD: A Dual-branch Mamba-based Spatial-Spectral Fusion Network for Hyperspectral Anomaly Detection

Aayushma Pant,Lakpa Tamang,Tsz-Kwan Lee,Sunil Aryal

Main category: cs.CV

TL;DR: 本文提出了一种基于Mamba的双分支模型DMS2F-HAD,用于高效准确的高光谱异常检测,在14个数据集上达到98.78%平均AUC,推理速度提升4.6倍。

Details Motivation: 现有深度学习方法在高光谱异常检测中难以兼顾长程光谱依赖建模与计算效率:CNN难以捕获长程依赖,Transformer计算成本高。 Method: 提出双分支Mamba架构(DMS2F-HAD),分别建模空间和光谱特征,并通过动态门控融合机制整合;利用Mamba的线性时间复杂度实现高效建模。 Result: 在14个基准高光谱图像数据集上取得98.78%的平均AUC,推理速度比同类深度学习方法快4.6倍。 Conclusion: DMS2F-HAD兼具高性能、高效率与强泛化能力,适用于实际高光谱异常检测任务。 Abstract: Hyperspectral anomaly detection (HAD) aims to identify rare and irregular targets in high-dimensional hyperspectral images (HSIs), which are often noisy and unlabelled data. Existing deep learning methods either fail to capture long-range spectral dependencies (e.g., convolutional neural networks) or suffer from high computational cost (e.g., Transformers). To address these challenges, we propose DMS2F-HAD, a novel dual-branch Mamba-based model. Our architecture utilizes Mamba's linear-time modeling to efficiently learn distinct spatial and spectral features in specialized branches, which are then integrated by a dynamic gated fusion mechanism to enhance anomaly localization. Across fourteen benchmark HSI datasets, our proposed DMS2F-HAD not only achieves a state-of-the-art average AUC of 98.78%, but also demonstrates superior efficiency with an inference speed 4.6 times faster than comparable deep learning methods. The results highlight DMS2FHAD's strong generalization and scalability, positioning it as a strong candidate for practical HAD applications.

[96] SuperPoint-E: local features for 3D reconstruction via tracking adaptation in endoscopy

O. Leon Barbed,José M. M. Montiel,Pascal Fua,Ana C. Murillo

Main category: cs.CV

TL;DR: 本文提出SuperPoint-E,一种针对内窥镜视频优化的局部特征提取方法,通过Tracking Adaptation监督策略提升特征检测与描述质量,显著改善结构光三维重建(SfM)效果。

Details Motivation: 提升内窥镜视频中Structure-from-Motion(SfM)的特征提取性能,以获得更密集、更鲁棒的3D重建结果。 Method: 提出SuperPoint-E方法,结合Tracking Adaptation监督策略,增强特征检测密度、生存率及描述子判别力。 Result: 相比原始SuperPoint和COLMAP,在真实内窥镜视频上实现了更密集、覆盖更长片段的3D重建;特征检测更密集、生存率更高、匹配几乎无需引导。 Conclusion: SuperPoint-E显著提升了内窥镜视频SfM的重建质量与鲁棒性,为医学图像分析提供了更优的特征提取方案。 Abstract: In this work, we focus on boosting the feature extraction to improve the performance of Structure-from-Motion (SfM) in endoscopy videos. We present SuperPoint-E, a new local feature extraction method that, using our proposed Tracking Adaptation supervision strategy, significantly improves the quality of feature detection and description in endoscopy. Extensive experimentation on real endoscopy recordings studies our approach's most suitable configuration and evaluates SuperPoint-E feature quality. The comparison with other baselines also shows that our 3D reconstructions are denser and cover more and longer video segments because our detector fires more densely and our features are more likely to survive (i.e. higher detection precision). In addition, our descriptor is more discriminative, making the guided matching step almost redundant. The presented approach brings significant improvements in the 3D reconstructions obtained, via SfM on endoscopy videos, compared to the original SuperPoint and the gold standard SfM COLMAP pipeline.

[97] JSynFlow: Japanese Synthesised Flowchart Visual Question Answering Dataset built with Large Language Models

Hiroshi Sasaki

Main category: cs.CV

TL;DR: 本文介绍了一个名为JSynFlow的合成视觉问答数据集,用于提升日语流程图理解能力,该数据集由大语言模型生成,包含任务描述、DSL渲染的流程图图像及对应问答对,并验证了其在微调VLM上的有效性。

Details Motivation: 现有VLM难以精确理解流程图,而构建大规模真实流程图图文数据集耗时耗力,亟需高效合成方法。 Method: 利用大语言模型(LLM)自动生成日语业务场景的任务描述,再通过领域特定语言(DSL)代码渲染流程图图像,并配套生成问答对,构建JSynFlow合成数据集。 Result: 在流程图问答任务上,使用JSynFlow微调VLM显著提升了模型性能;该数据集已开源发布。 Conclusion: JSynFlow为流程图理解提供了高质量、可扩展的合成数据方案,有效缓解了真实标注数据稀缺问题,推动多模态文档理解发展。 Abstract: Vision and language models (VLMs) are expected to analyse complex documents, such as those containing flowcharts, through a question-answering (QA) interface. The ability to recognise and interpret these flowcharts is in high demand, as they provide valuable insights unavailable in text-only explanations. However, developing VLMs with precise flowchart understanding requires large-scale datasets of flowchart images and corresponding text, the creation of which is highly time-consuming. To address this challenge, we introduce JSynFlow, a synthesised visual QA dataset for Japanese flowcharts, generated using large language models (LLMs). Our dataset comprises task descriptions for various business occupations, the corresponding flowchart images rendered from domain-specific language (DSL) code, and related QA pairs. This paper details the dataset's synthesis procedure and demonstrates that fine-tuning with JSynFlow significantly improves VLM performance on flowchart-based QA tasks. Our dataset is publicly available at https://huggingface.co/datasets/jri-advtechlab/jsynflow.

[98] Context Determines Optimal Architecture in Materials Segmentation

Mingjian Lu,Pawan K. Tripathi,Mark Shteyn,Debargha Ganguly,Roger H. French,Vipin Chaudhary,Yinghui Wu

Main category: cs.CV

TL;DR: 本文提出了一种跨模态材料图像分割评估框架,涵盖SEM、AFM、XCT和光学显微镜四种成像方式,发现不同成像模态下最优分割架构不同,并提供分布外检测与反事实解释以增强模型可靠性与可解释性。

Details Motivation: 现有分割架构通常仅在单一成像模态上评测,掩盖了实际部署中因模态差异导致的性能变化,材料研究人员缺乏针对特定成像条件选择合适架构及评估模型可信度的工具。 Method: 构建覆盖SEM、AFM、XCT和光学显微镜的跨模态分割评估框架;在七个数据集上系统评测六种编码器-解码器组合;引入分布外检测与反事实解释机制,揭示驱动预测的关键微观结构特征。 Result: UNet在高对比度2D成像中表现最优,DeepLabv3+在最具挑战性的案例中更优;框架能提供架构选型指导、模型可靠性信号(如OOD检测)和可解释性反馈。 Conclusion: 该框架弥合了材料表征中模型实用性与评估脱节的 gap,为面向具体成像条件的架构选择与可信部署提供了系统化支持。 Abstract: Segmentation architectures are typically benchmarked on single imaging modalities, obscuring deployment-relevant performance variations: an architecture optimal for one modality may underperform on another. We present a cross-modal evaluation framework for materials image segmentation spanning SEM, AFM, XCT, and optical microscopy. Our evaluation of six encoder-decoder combinations across seven datasets reveals that optimal architectures vary systematically by context: UNet excels for high-contrast 2D imaging while DeepLabv3+ is preferred for the hardest cases. The framework also provides deployment feedback via out-of-distribution detection and counterfactual explanations that reveal which microstructural features drive predictions. Together, the architecture guidance, reliability signals, and interpretability tools address a practical gap in materials characterization, where researchers lack tools to select architectures for their specific imaging setup or assess when models can be trusted on new samples.

[99] Improving 2D Diffusion Models for 3D Medical Imaging with Inter-Slice Consistent Stochasticity

Chenhe Du,Qing Wu,Xuanyu Tian,Jingyi Yu,Hongjiang Wei,Yuyao Zhang

Main category: cs.CV

TL;DR: 本文提出了一种名为Inter-Slice Consistent Stochasticity (ISCS)的策略,通过在扩散采样过程中控制切片间随机噪声的一致性,来缓解2D训练扩散模型用于3D医学图像重建时出现的层间不连续问题,无需额外损失或计算开销,具有即插即用特性。

Details Motivation: 2D训练的扩散模型用于3D医学图像重建时,因采样随机性导致层间不连续;现有连续性正则方法依赖敏感超参且易过平滑。 Method: 提出ISCS策略,在扩散采样中显式约束不同切片共享一致的随机噪声成分,使采样轨迹对齐,不引入新损失或优化步骤,保持即插即用和零额外计算成本。 Result: 在多个医学成像任务上验证了ISCS能显著提升基于2D扩散模型的3D重建质量,改善层间连续性与整体结构保真度。 Conclusion: 控制层间随机性是一种原理清晰、实用高效的途径,可提升仅用2D扩散先验实现高保真3D医学成像的能力。 Abstract: 3D medical imaging is in high demand and essential for clinical diagnosis and scientific research. Currently, diffusion models (DMs) have become an effective tool for medical imaging reconstruction thanks to their ability to learn rich, high-quality data priors. However, learning the 3D data distribution with DMs in medical imaging is challenging, not only due to the difficulties in data collection but also because of the significant computational burden during model training. A common compromise is to train the DMs on 2D data priors and reconstruct stacked 2D slices to address 3D medical inverse problems. However, the intrinsic randomness of diffusion sampling causes severe inter-slice discontinuities of reconstructed 3D volumes. Existing methods often enforce continuity regularizations along the z-axis, which introduces sensitive hyper-parameters and may lead to over-smoothing results. In this work, we revisit the origin of stochasticity in diffusion sampling and introduce Inter-Slice Consistent Stochasticity (ISCS), a simple yet effective strategy that encourages interslice consistency during diffusion sampling. Our key idea is to control the consistency of stochastic noise components during diffusion sampling, thereby aligning their sampling trajectories without adding any new loss terms or optimization steps. Importantly, the proposed ISCS is plug-and-play and can be dropped into any 2D trained diffusion based 3D reconstruction pipeline without additional computational cost. Experiments on several medical imaging problems show that our method can effectively improve the performance of medical 3D imaging problems based on 2D diffusion models. Our findings suggest that controlling inter-slice stochasticity is a principled and practically attractive route toward high-fidelity 3D medical imaging with 2D diffusion priors. The code is available at: https://github.com/duchenhe/ISCS

[100] Point2Insert: Video Object Insertion via Sparse Point Guidance

Yu Zhou,Xiaoyan Yang,Bojia Zi,Lihan Zhang,Ruijie Sun,Weishi Zheng,Haibin Huang,Chi Zhang,Xuelong Li

Main category: cs.CV

TL;DR: Point2Insert是一种基于稀疏点的视频对象插入框架,仅需少量正负点提示即可实现精确、灵活的对象定位,无需繁琐的掩码标注,并通过两阶段训练和知识蒸馏提升性能。

Details Motivation: 现有基于掩码的方法需要大量人工标注,而基于指令的方法难以精确定位;同时用户对低 effort、高精度的对象插入需求日益增长。 Method: 提出两阶段训练策略:第一阶段训练支持稀疏点(正/负点)或二值掩码引导的对象插入模型;第二阶段利用对象移除模型合成配对视频进行视频插入适配;并采用掩码引导模型作为教师模型,对点引导模型进行知识蒸馏。 Result: Point2Insert在多项实验中持续超越强基线方法,甚至优于参数量高10倍的模型。 Conclusion: 稀疏点提示是一种高效、灵活且用户友好的视频对象插入范式,Point2Insert在精度、易用性和效率上实现了显著提升。 Abstract: This paper introduces Point2Insert, a sparse-point-based framework for flexible and user-friendly object insertion in videos, motivated by the growing popularity of accurate, low-effort object placement. Existing approaches face two major challenges: mask-based insertion methods require labor-intensive mask annotations, while instruction-based methods struggle to place objects at precise locations. Point2Insert addresses these issues by requiring only a small number of sparse points instead of dense masks, eliminating the need for tedious mask drawing. Specifically, it supports both positive and negative points to indicate regions that are suitable or unsuitable for insertion, enabling fine-grained spatial control over object locations. The training of Point2Insert consists of two stages. In Stage 1, we train an insertion model that generates objects in given regions conditioned on either sparse-point prompts or a binary mask. In Stage 2, we further train the model on paired videos synthesized by an object removal model, adapting it to video insertion. Moreover, motivated by the higher insertion success rate of mask-guided editing, we leverage a mask-guided insertion model as a teacher to distill reliable insertion behavior into the point-guided model. Extensive experiments demonstrate that Point2Insert consistently outperforms strong baselines and even surpasses models with $\times$10 more parameters.

[101] Partial Ring Scan: Revisiting Scan Order in Vision State Space Models

Yi-Kuan Hsieh,Jun-Wei Hsieh,Xin li,Ming-Ching Chang,Yu-Chee Tseng

Main category: cs.CV

TL;DR: 本文提出PRISMamba,一种旋转鲁棒的图像扫描方法,通过环形分割、环内无序聚合与径向SSM跨环传播,结合部分通道过滤提升Vision SSM的精度、效率与旋转不变性。

Details Motivation: 现有Vision SSM将2D图像按固定扫描顺序展平为1D序列,该方式破坏空间邻接性和物体连续性,且对旋转等几何变换敏感,影响模型性能。 Method: 提出Partial RIng Scan Mamba(PRISMamba):将图像划分为同心环,在每环内进行顺序无关的特征聚合,并通过短径向状态空间模型(SSM)在环间传递上下文;同时引入部分通道过滤机制,仅将信息量最大的通道送入递归环路径,其余走轻量残差分支。 Result: 在ImageNet-1K上达到84.5% Top-1准确率,仅需3.9G FLOPs和A100上3054 img/s吞吐量,精度与速度均优于VMamba,且旋转下性能稳定(下降仅0~0.5%),而固定路径扫描下降1~2%。 Conclusion: 扫描顺序设计与通道过滤是提升Vision SSM准确性、计算效率及旋转鲁棒性的关键但被忽视因素。 Abstract: State Space Models (SSMs) have emerged as efficient alternatives to attention for vision tasks, offering lineartime sequence processing with competitive accuracy. Vision SSMs, however, require serializing 2D images into 1D token sequences along a predefined scan order, a factor often overlooked. We show that scan order critically affects performance by altering spatial adjacency, fracturing object continuity, and amplifying degradation under geometric transformations such as rotation. We present Partial RIng Scan Mamba (PRISMamba), a rotation-robust traversal that partitions an image into concentric rings, performs order-agnostic aggregation within each ring, and propagates context across rings through a set of short radial SSMs. Efficiency is further improved via partial channel filtering, which routes only the most informative channels through the recurrent ring pathway while keeping the rest on a lightweight residual branch. On ImageNet-1K, PRISMamba achieves 84.5% Top-1 with 3.9G FLOPs and 3,054 img/s on A100, outperforming VMamba in both accuracy and throughput while requiring fewer FLOPs. It also maintains performance under rotation, whereas fixed-path scans drop by 1~2%. These results highlight scan-order design, together with channel filtering, as a crucial, underexplored factor for accuracy, efficiency, and rotation robustness in Vision SSMs. Code will be released upon acceptance.

[102] HoloEv-Net: Efficient Event-based Action Recognition via Holographic Spatial Embedding and Global Spectral Gating

Weidong Hao

Main category: cs.CV

TL;DR: 本文提出了一种高效的事件驱动动作识别框架HoloEv-Net,通过紧凑全息时空表示(CHSR)减少计算与结构冗余,并引入全局频谱门控(GSG)模块利用频域信息提升全局运动建模能力,在多个基准上达到SOTA性能,同时具备轻量级变体以支持边缘部署。

Details Motivation: 现有事件驱动动作识别方法存在体素表示计算冗余、多分支架构结构冗余以及频谱信息利用不足三大问题。 Method: 提出HoloEv-Net框架:1)Compact Holographic Spatiotemporal Representation(CHSR),将水平空间线索隐式嵌入T-H视图,以2D表示保留3D时空上下文;2)Global Spectral Gating(GSG)模块,利用FFT在频域进行全局token混合,增强表征能力且参数开销极小。 Result: HoloEv-Net-Base在THU-EACT-50-CHL、HARDVS和DailyDVS-200上分别超越现有方法10.29%、1.71%和6.25%;轻量版HoloEv-Net-Small参数减少5.4倍、FLOPs降低300倍、延迟降低2.4倍,精度仍具竞争力。 Conclusion: HoloEv-Net有效缓解了事件数据处理中的多重冗余问题,兼顾高性能与高效率,为边缘场景下的事件驱动动作识别提供了新范式。 Abstract: Event-based Action Recognition (EAR) has attracted significant attention due to the high temporal resolution and high dynamic range of event cameras. However, existing methods typically suffer from (i) the computational redundancy of dense voxel representations, (ii) structural redundancy inherent in multi-branch architectures, and (iii) the under-utilization of spectral information in capturing global motion patterns. To address these challenges, we propose an efficient EAR framework named HoloEv-Net. First, to simultaneously tackle representation and structural redundancies, we introduce a Compact Holographic Spatiotemporal Representation (CHSR). Departing from computationally expensive voxel grids, CHSR implicitly embeds horizontal spatial cues into the Time-Height (T-H) view, effectively preserving 3D spatiotemporal contexts within a 2D representation. Second, to exploit the neglected spectral cues, we design a Global Spectral Gating (GSG) module. By leveraging the Fast Fourier Transform (FFT) for global token mixing in the frequency domain, GSG enhances the representation capability with negligible parameter overhead. Extensive experiments demonstrate the scalability and effectiveness of our framework. Specifically, HoloEv-Net-Base achieves state-of-the-art performance on THU-EACT-50-CHL, HARDVS and DailyDVS-200, outperforming existing methods by 10.29%, 1.71% and 6.25%, respectively. Furthermore, our lightweight variant, HoloEv-Net-Small, delivers highly competitive accuracy while offering extreme efficiency, reducing parameters by 5.4 times, FLOPs by 300times, and latency by 2.4times compared to heavy baselines, demonstrating its potential for edge deployment.

[103] Natural Language Instructions for Scene-Responsive Human-in-the-Loop Motion Planning in Autonomous Driving using Vision-Language-Action Models

Angel Martinez-Sanchez,Parthib Roy,Ross Greer

Main category: cs.CV

TL;DR: 本文提出了一种基于指令的端到端自动驾驶规划方法,利用doScenes真实世界数据集和OpenEMMA多模态大语言模型框架,将乘客自然语言指令融入视觉-语言接口,显著提升轨迹预测鲁棒性与对齐度。

Details Motivation: 现有指令跟随式驾驶规划器依赖仿真或固定指令词表,泛化能力差;缺乏真实世界中自由形式、具指代性的语言指令与精确运动真值配对的数据集。 Method: 将doScenes数据集中的自由形式乘客指令作为prompt集成进OpenEMMA(基于MLLM的端到端驾驶框架)的视觉-语言接口,在前视图像与自车状态输入基础上实现语言条件化轨迹生成(10步速度-曲率输出)。 Result: 在849个标注场景上评估显示:指令条件化使平均ADE降低98.7%,极大缓解极端失败;剔除异常值后,优质指令仍可进一步提升ADE达5.1%;并分析了影响效果的关键指令特征。 Conclusion: 自然语言指令能显著增强端到端驾驶规划的鲁棒性与可控性;指令质量(如明确性、指代清晰度)直接影响轨迹对齐性能;本工作建立了首个可复现的指令感知规划基准,并开源提示与评测脚本。 Abstract: Instruction-grounded driving, where passenger language guides trajectory planning, requires vehicles to understand intent before motion. However, most prior instruction-following planners rely on simulation or fixed command vocabularies, limiting real-world generalization. doScenes, the first real-world dataset linking free-form instructions (with referentiality) to nuScenes ground-truth motion, enables instruction-conditioned planning. In this work, we adapt OpenEMMA, an open-source MLLM-based end-to-end driving framework that ingests front-camera views and ego-state and outputs 10-step speed-curvature trajectories, to this setting, presenting a reproducible instruction-conditioned baseline on doScenes and investigate the effects of human instruction prompts on predicted driving behavior. We integrate doScenes directives as passenger-style prompts within OpenEMMA's vision-language interface, enabling linguistic conditioning before trajectory generation. Evaluated on 849 annotated scenes using ADE, we observe that instruction conditioning substantially improves robustness by preventing extreme baseline failures, yielding a 98.7% reduction in mean ADE. When such outliers are removed, instructions still influence trajectory alignment, with well-phrased prompts improving ADE by up to 5.1%. We use this analysis to discuss what makes a "good" instruction for the OpenEMMA framework. We release the evaluation prompts and scripts to establish a reproducible baseline for instruction-aware planning. GitHub: https://github.com/Mi3-Lab/doScenes-VLM-Planning

[104] DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding

Ning Zhang,Zhengyu Li,Kwong Weng Loh,Mingxi Xu,Qi Wang,Zhengyu Wen,Xiaoyu He,Wei Zhao,Kehong Gong,Mingyuan Zhang

Main category: cs.CV

TL;DR: DiMo是一种离散扩散式双向文本-动作理解与生成框架,通过迭代掩码标记优化统一T2M、M2T和M2M任务,并引入RVQ和GRPO提升动作质量与对齐控制能力。

Details Motivation: 现有掩码建模动作生成方法主要聚焦于文本到动作(T2M),缺乏对动作到文本(M2T)及无文本动作生成(M2M)的统一支持。 Method: 提出DiMo框架:采用离散扩散式迭代掩码标记细化机制;结合残差向量量化(RVQ)提升动作标记保真度;引入分组相对策略优化(GRPO)增强跨模态对齐与可控性。 Result: 在HumanML3D和KIT-ML数据集上验证了高质量动作生成与强双向理解能力;支持无文本动作补全、文本引导动作预测、动作字幕修正等新任务,无需修改架构。 Conclusion: DiMo实现了文本-动作双向理解与生成的统一建模,兼顾生成质量、推理效率与任务泛化性,为多模态人机交互提供了更灵活的基础模型范式。 Abstract: Prior masked modeling motion generation methods predominantly study text-to-motion. We present DiMo, a discrete diffusion-style framework, which extends masked modeling to bidirectional text--motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode sequentially, DiMo performs iterative masked token refinement, unifying Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single model. This decoding paradigm naturally enables a quality-latency trade-off at inference via the number of refinement steps.We further improve motion token fidelity with residual vector quantization (RVQ) and enhance alignment and controllability with Group Relative Policy Optimization (GRPO). Experiments on HumanML3D and KIT-ML show strong motion quality and competitive bidirectional understanding under a unified framework. In addition, we demonstrate model ability in text-free motion completion, text-guided motion prediction and motion caption correction without architectural change.Additional qualitative results are available on our project page: https://animotionlab.github.io/DiMo/.

[105] Continuous Degradation Modeling via Latent Flow Matching for Real-World Super-Resolution

Hyeonjae Kim,Dongjin Kim,Eugene Jin,Tae Hyun Kim

Main category: cs.CV

TL;DR: 本文提出了一种基于流匹配(flow matching)在潜在退化空间中合成真实低分辨率图像的新框架,以解决现有超分辨率方法在真实复杂退化场景下泛化能力差的问题。

Details Motivation: 现有深度学习超分辨率方法在合成退化(如双三次下采样)上表现好,但在含噪声、模糊、压缩伪影等真实复杂非线性退化的真实图像上效果不佳;构建真实LR-HR配对数据集费时且受限于特定下采样因子。 Method: 提出一种基于流匹配的潜在退化空间建模方法,仅需单张HR图像即可合成具有真实退化特征(如噪声、模糊、压缩伪影)的LR图像,并支持未见退化程度的任意尺度合成。 Result: 合成的LR图像在定量与定性评估中均高度还原真实退化;使用该数据训练的传统及任意尺度超分辨率模型,在真实图像上重建质量显著提升。 Conclusion: 该框架为构建大规模、多样化、贴近真实场景的SR训练数据提供了高效可行的新范式,有效提升了模型在真实世界图像上的泛化性能。 Abstract: While deep learning-based super-resolution (SR) methods have shown impressive outcomes with synthetic degradation scenarios such as bicubic downsampling, they frequently struggle to perform well on real-world images that feature complex, nonlinear degradations like noise, blur, and compression artifacts. Recent efforts to address this issue have involved the painstaking compilation of real low-resolution (LR) and high-resolution (HR) image pairs, usually limited to several specific downscaling factors. To address these challenges, our work introduces a novel framework capable of synthesizing authentic LR images from a single HR image by leveraging the latent degradation space with flow matching. Our approach generates LR images with realistic artifacts at unseen degradation levels, which facilitates the creation of large-scale, real-world SR training datasets. Comprehensive quantitative and qualitative assessments verify that our synthetic LR images accurately replicate real-world degradations. Furthermore, both traditional and arbitrary-scale SR models trained using our datasets consistently yield much better HR outcomes.

[106] VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents

Feng Wang,Yichun Shi,Ceyuan Yang,Qiushan Guo,Jingxiang Sun,Alan Yuille,Peng Wang

Main category: cs.CV

TL;DR: VTok是一种统一的视频标记化框架,通过解耦空间与时间表征(保留关键帧空间特征+后续帧编码为残差标记)实现紧凑而富有表现力的视频标记化,在视频理解和生成任务中均提升性能与效率。

Details Motivation: 现有视觉-语言系统采用简单的帧采样策略进行视频标记化,缺乏对时空信息的有效建模,导致表示冗余且表达能力不足。 Method: 提出VTok框架:选取一个关键帧保留其完整空间特征,其余帧分别编码为单个残差标记,从而将视频表示复杂度从帧数×每帧标记数降至二者之和。 Result: 在TV-Align和VBench等基准上分别提升3.4%准确率和1.9%分数;生成视频具有更连贯运动和更强文本遵循能力;显著缩短每视频标记序列长度。 Conclusion: VTok提供了一种高效、紧凑且表达力强的视频标记化范式,有望成为视频理解与生成领域未来研究的标准方案。 Abstract: This work presents VTok, a unified video tokenization framework that can be used for both generation and understanding tasks. Unlike the leading vision-language systems that tokenize videos through a naive frame-sampling strategy, we propose to decouple the spatial and temporal representations of videos by retaining the spatial features of a single key frame while encoding each subsequent frame into a single residual token, achieving compact yet expressive video tokenization. Our experiments suggest that VTok effectively reduces the complexity of video representation from the product of frame count and per-frame token count to their sum, while the residual tokens sufficiently capture viewpoint and motion changes relative to the key frame. Extensive evaluations demonstrate the efficacy and efficiency of VTok: it achieves notably higher performance on a range of video understanding and text-to-video generation benchmarks compared with baselines using naive tokenization, all with shorter token sequences per video (e.g., 3.4% higher accuracy on our TV-Align benchmark and 1.9% higher VBench score). Remarkably, VTok produces more coherent motion and stronger guidance following in text-to-video generation, owing to its more consistent temporal encoding. We hope VTok can serve as a standardized video tokenization paradigm for future research in video understanding and generation.

[107] AGMA: Adaptive Gaussian Mixture Anchors for Prior-Guided Multimodal Human Trajectory Forecasting

Chao Li,Rui Zhang,Siyuan Huang,Xian Zhong,Hongbo Jiang

Main category: cs.CV

TL;DR: 本文提出AGMA方法,通过自适应高斯混合锚点构建高质量先验,以解决行人轨迹预测中先验错位问题,显著提升预测准确性和多样性。

Details Motivation: 现有方法在人类轨迹预测中存在先验错位问题,其学习或固定的先验无法充分覆盖未来可能的分布,限制了预测精度和多样性;理论分析表明预测误差受先验质量下界约束,因此先验建模是关键瓶颈。 Method: 提出AGMA(Adaptive Gaussian Mixture Anchors),分两阶段构建表达性强的先验:首先从训练数据中提取多样化的行人行为模式,再将其蒸馏为场景自适应的全局先验用于推理。 Result: 在ETH-UCY、Stanford Drone和JRDB数据集上的大量实验表明,AGMA达到当前最优性能。 Conclusion: 高质量先验对人类轨迹预测至关重要,AGMA通过自适应建模先验有效缓解了先验错位问题,提升了预测性能。 Abstract: Human trajectory forecasting requires capturing the multimodal nature of pedestrian behavior. However, existing approaches suffer from prior misalignment. Their learned or fixed priors often fail to capture the full distribution of plausible futures, limiting both prediction accuracy and diversity. We theoretically establish that prediction error is lower-bounded by prior quality, making prior modeling a key performance bottleneck. Guided by this insight, we propose AGMA (Adaptive Gaussian Mixture Anchors), which constructs expressive priors through two stages: extracting diverse behavioral patterns from training data and distilling them into a scene-adaptive global prior for inference. Extensive experiments on ETH-UCY, Stanford Drone, and JRDB datasets demonstrate that AGMA achieves state-of-the-art performance, confirming the critical role of high-quality priors in trajectory forecasting.

[108] Adaptive 1D Video Diffusion Autoencoder

Yao Teng,Minxuan Lin,Xian Liu,Shuai Wang,Xiao Yang,Xihui Liu

Main category: cs.CV

TL;DR: 本文提出One-DVA,一种基于Transformer的自适应一维视频自编码器,结合查询式视觉Transformer编码器与像素空间扩散Transformer解码器,解决现有视频自编码器固定码率、架构僵化和解码确定性等缺陷,并支持下游生成任务。

Details Motivation: 现有视频自编码器存在固定码率浪费、CNN架构无法支持可变长度潜变量建模、确定性解码难以恢复细节三大问题。 Method: 提出One-DVA框架:编码器采用查询式视觉Transformer提取时空特征并配合可变长度dropout机制;解码器为以潜变量为条件的像素空间扩散Transformer;采用两阶段训练策略,并对潜变量分布正则化、微调解码器以适配生成任务。 Result: One-DVA在相同压缩比下重建指标媲美3D-CNN VAE,并支持更高自适应压缩比;经正则化与微调后,显著缓解生成过程中的伪影,提升下游视频生成质量。 Conclusion: One-DVA通过引入自适应1D编码与扩散解码,有效克服传统视频自编码器局限,为高效、高质量视频生成提供了更优潜空间基础。 Abstract: Recent video generation models largely rely on video autoencoders that compress pixel-space videos into latent representations. However, existing video autoencoders suffer from three major limitations: (1) fixed-rate compression that wastes tokens on simple videos, (2) inflexible CNN architectures that prevent variable-length latent modeling, and (3) deterministic decoders that struggle to recover appropriate details from compressed latents. To address these issues, we propose One-Dimensional Diffusion Video Autoencoder (One-DVA), a transformer-based framework for adaptive 1D encoding and diffusion-based decoding. The encoder employs query-based vision transformers to extract spatiotemporal features and produce latent representations, while a variable-length dropout mechanism dynamically adjusts the latent length. The decoder is a pixel-space diffusion transformer that reconstructs videos with the latents as input conditions. With a two-stage training strategy, One-DVA achieves performance comparable to 3D-CNN VAEs on reconstruction metrics at identical compression ratios. More importantly, it supports adaptive compression and thus can achieve higher compression ratios. To better support downstream latent generation, we further regularize the One-DVA latent distribution for generative modeling and fine-tune its decoder to mitigate artifacts caused by the generation process.

[109] An Intuitionistic Fuzzy Logic Driven UNet architecture: Application to Brain Image segmentation

Hanuman Verma,Kiho Im,Pranabesh Maji,Akshansh Gupta

Main category: cs.CV

TL;DR: 本文提出了一种结合直觉模糊逻辑的UNet模型(IF-UNet),用于提升MRI脑图像分割中对部分容积效应和边界不确定性等模糊性的处理能力,并在IBSR数据集上验证了其优于传统方法的分割性能。

Details Motivation: 现有基于CNN(尤其是UNet)的脑MRI分割方法难以有效应对由部分容积效应引起的组织模糊性和边界不确定性,即模型对不确定性建模能力不足。 Method: 提出IF-UNet框架,将直觉模糊逻辑(含隶属度、非隶属度和犹豫度)嵌入UNet结构中,使网络能显式建模并处理图像中的模糊性。 Result: 在IBSR数据集上,IF-UNet在准确率、Dice系数和IoU等指标上均优于基准方法,验证了其提升分割质量与不确定性处理能力的有效性。 Conclusion: 引入直觉模糊逻辑可增强UNet对脑MRI中组织模糊性的建模能力,为医学图像分割中不确定性处理提供了新思路。 Abstract: Accurate segmentation of MRI brain images is essential for image analysis, diagnosis of neuro-logical disorders and medical image computing. In the deep learning approach, the convolutional neural networks (CNNs), especially UNet, are widely applied in medical image segmentation. However, it is difficult to deal with uncertainty due to the partial volume effect in brain images. To overcome this limitation, we propose an enhanced framework, named UNet with intuitionistic fuzzy logic (IF-UNet), which incorporates intuitionistic fuzzy logic into UNet. The model processes input data in terms of membership, nonmembership, and hesitation degrees, allowing it to better address tissue ambiguity resulting from partial volume effects and boundary uncertainties. The proposed architecture is evaluated on the Internet Brain Segmentation Repository (IBSR) dataset, and its performance is computed using accuracy, Dice coefficient, and intersection over union (IoU). Experimental results confirm that IF-UNet improves segmentation quality with handling uncertainty in brain images.

[110] SPOT-Occ: Sparse Prototype-guided Transformer for Camera-based 3D Occupancy Prediction

Suzeyu Chen,Leheng Li,Ying-Cong Chen

Main category: cs.CV

TL;DR: 本文提出了一种基于原型的稀疏Transformer解码器(SPOT-Occ),通过两阶段引导特征选择与聚焦聚合,高效解决相机输入下的实时高精度3D占据预测问题,兼顾速度与精度。

Details Motivation: 从相机图像实现高精度、实时的3D占据预测对自动驾驶至关重要;现有稀疏3D表示虽缓解编码瓶颈,但其解码器面临在非均匀稀疏体素特征上进行高效信息聚合的挑战,传统密集注意力计算开销大。 Method: 提出原型引导的稀疏Transformer解码器:第一阶段为稀疏原型选择机制,每个查询自适应选取最具判别性的少量体素特征作为原型;第二阶段为聚焦聚合;引入基于真值掩码的去噪范式,确保跨解码层的查询-原型关联稳定有效。 Result: SPOT-Occ在显著提升推理速度的同时,准确率也优于先前方法。 Conclusion: 原型引导的稀疏注意力机制可有效平衡效率与性能,为实时3D占据预测提供新范式。 Abstract: Achieving highly accurate and real-time 3D occupancy prediction from cameras is a critical requirement for the safe and practical deployment of autonomous vehicles. While this shift to sparse 3D representations solves the encoding bottleneck, it creates a new challenge for the decoder: how to efficiently aggregate information from a sparse, non-uniformly distributed set of voxel features without resorting to computationally prohibitive dense attention. In this paper, we propose a novel Prototype-based Sparse Transformer Decoder that replaces this costly interaction with an efficient, two-stage process of guided feature selection and focused aggregation. Our core idea is to make the decoder's attention prototype-guided. We achieve this through a sparse prototype selection mechanism, where each query adaptively identifies a compact set of the most salient voxel features, termed prototypes, for focused feature aggregation. To ensure this dynamic selection is stable and effective, we introduce a complementary denoising paradigm. This approach leverages ground-truth masks to provide explicit guidance, guaranteeing a consistent query-prototype association across decoder layers. Our model, dubbed SPOT-Occ, outperforms previous methods with a significant margin in speed while also improving accuracy. Source code is released at https://github.com/chensuzeyu/SpotOcc.

[111] ACIL: Active Class Incremental Learning for Image Classification

Aditya R. Bhattacharya,Debanjan Goswami,Shayok Chakraborty

Main category: cs.CV

TL;DR: 本文提出ACIL框架,将主动学习引入类增量学习,通过不确定性与多样性准则选择关键样本进行标注,以降低标注成本并缓解灾难性遗忘。

Details Motivation: 现有类增量学习方法假设每轮训练数据均完全标注,导致高昂的标注成本和资源浪费;而主动学习可减少人工标注需求,因此有必要将其与类增量学习结合。 Method: 提出ACIL框架,在每轮增量学习中基于不确定性与多样性准则筛选需标注的代表性样本,并将其加入下一轮训练数据池。 Result: 在多个视觉数据集上的实验表明,ACIL显著降低标注成本,同时有效缓解灾难性遗忘,性能优于相关基线方法。 Conclusion: ACIL成功融合主动学习与类增量学习,兼顾标注效率与模型持续学习能力,为实际部署低资源持续学习系统提供了可行方案。 Abstract: Continual learning (or class incremental learning) is a realistic learning scenario for computer vision systems, where deep neural networks are trained on episodic data, and the data from previous episodes are generally inaccessible to the model. Existing research in this domain has primarily focused on avoiding catastrophic forgetting, which occurs due to the continuously changing class distributions in each episode and the inaccessibility of the data from previous episodes. However, these methods assume that all the training samples in every episode are annotated; this not only incurs a huge annotation cost, but also results in a wastage of annotation effort, since most of the samples in a given episode will not be accessible to the model in subsequent episodes. Active learning algorithms identify the salient and informative samples from large amounts of unlabeled data and are instrumental in reducing the human annotation effort in inducing a deep neural network. In this paper, we propose ACIL, a novel active learning framework for class incremental learning settings. We exploit a criterion based on uncertainty and diversity to identify the exemplar samples that need to be annotated in each episode, and will be appended to the data in the next episode. Such a framework can drastically reduce annotation cost and can also avoid catastrophic forgetting. Our extensive empirical analyses on several vision datasets corroborate the promise and potential of our framework against relevant baselines.

[112] Depth-Guided Metric-Aware Temporal Consistency for Monocular Video Human Mesh Recovery

Jiaxin Cen,Xudong Mao,Guanghui Yue,Wei Zhou,Ruomei Wang,Fan Zhou,Baoquan Zhao

Main category: cs.CV

TL;DR: 本文提出了一种深度引导的单目视频人体网格恢复框架,通过多尺度几何-外观融合、度量感知的姿态与形状估计、以及运动-深度对齐的细化模块,显著提升了深度一致性、时间稳定性与遮挡鲁棒性。

Details Motivation: 单目视频人体网格恢复面临深度模糊和尺度不确定性导致的度量不一致与时间不稳定问题,现有RGB+时序平滑方法难以解决深度排序、尺度漂移和遮挡引发的抖动。 Method: 提出三组件深度引导框架:1)深度引导的多尺度融合模块(置信度门控融合几何先验与RGB特征);2)深度校准的度量感知姿态与形状估计器(D-MAPS,利用深度标定的骨骼统计实现尺度一致初始化);3)运动-深度对齐细化模块(MoDAR,通过运动动态与几何线索间的跨模态注意力保障时间一致性)。 Result: 在三个挑战性基准上取得最优性能,显著提升强遮挡下的鲁棒性与空间精度,同时保持计算高效性。 Conclusion: 深度信息可作为关键几何约束有效缓解单目重建中的尺度与深度歧义,所提多组件协同框架为实现度量准确、时间稳定的人体网格恢复提供了新范式。 Abstract: Monocular video human mesh recovery faces fundamental challenges in maintaining metric consistency and temporal stability due to inherent depth ambiguities and scale uncertainties. While existing methods rely primarily on RGB features and temporal smoothing, they struggle with depth ordering, scale drift, and occlusion-induced instabilities. We propose a comprehensive depth-guided framework that achieves metric-aware temporal consistency through three synergistic components: A Depth-Guided Multi-Scale Fusion module that adaptively integrates geometric priors with RGB features via confidence-aware gating; A Depth-guided Metric-Aware Pose and Shape (D-MAPS) estimator that leverages depth-calibrated bone statistics for scale-consistent initialization; A Motion-Depth Aligned Refinement (MoDAR) module that enforces temporal coherence through cross-modal attention between motion dynamics and geometric cues. Our method achieves superior results on three challenging benchmarks, demonstrating significant improvements in robustness against heavy occlusion and spatial accuracy while maintaining computational efficiency.

[113] Decoupled Hierarchical Distillation for Multimodal Emotion Recognition

Yong Li,Yuanzhi Wang,Yi Ding,Shiqing Zhang,Ke Lu,Cuntai Guan

Main category: cs.CV

TL;DR: 本文提出了一种名为解耦分层多模态蒸馏(DHMD)的新框架,用于人类多模态情感识别(MER),通过解耦模态特征并采用两级知识蒸馏策略提升跨模态对齐与表示判别性。

Details Motivation: 现有MER方法难以应对多模态固有的异质性及各模态贡献度不一致的问题。 Method: DHMD首先利用自回归机制将各模态特征解耦为模态无关(同质)和模态独有(异质)两部分;再通过图蒸馏单元(GD-Unit)在解耦空间中进行粗粒度知识蒸馏,并借助动态图实现模态间自适应蒸馏;最后通过跨模态字典匹配机制实现细粒度语义对齐。 Result: DHMD在CMU-MOSI和CMU-MOSEI数据集上显著优于SOTA方法,ACC7、ACC2和F1指标分别取得1.3%/2.4%、1.3%/1.9%和1.9%/1.8%的相对提升;可视化结果表明图边和字典激活在两类特征空间中呈现有意义的分布模式。 Conclusion: DHMD通过解耦与分层蒸馏有效缓解了多模态异质性问题,提升了跨模态特征对齐能力与情感识别性能。 Abstract: Human multimodal emotion recognition (MER) seeks to infer human emotions by integrating information from language, visual, and acoustic modalities. Although existing MER approaches have achieved promising results, they still struggle with inherent multimodal heterogeneities and varying contributions from different modalities. To address these challenges, we propose a novel framework, Decoupled Hierarchical Multimodal Distillation (DHMD). DHMD decouples each modality's features into modality-irrelevant (homogeneous) and modality-exclusive (heterogeneous) components using a self-regression mechanism. The framework employs a two-stage knowledge distillation (KD) strategy: (1) coarse-grained KD via a Graph Distillation Unit (GD-Unit) in each decoupled feature space, where a dynamic graph facilitates adaptive distillation among modalities, and (2) fine-grained KD through a cross-modal dictionary matching mechanism, which aligns semantic granularities across modalities to produce more discriminative MER representations. This hierarchical distillation approach enables flexible knowledge transfer and effectively improves cross-modal feature alignment. Experimental results demonstrate that DHMD consistently outperforms state-of-the-art MER methods, achieving 1.3\%/2.4\% (ACC$_7$), 1.3\%/1.9\% (ACC$_2$) and 1.9\%/1.8\% (F1) relative improvement on CMU-MOSI/CMU-MOSEI dataset, respectively. Meanwhile, visualization results reveal that both the graph edges and dictionary activations in DHMD exhibit meaningful distribution patterns across modality-irrelevant/-exclusive feature spaces.

[114] KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing

Siyu Jiang,Feiyang Chen,Xiaojin Zhang,Kun He

Main category: cs.CV

TL;DR: 本文提出KVSmooth方法,通过注意力熵引导的自适应平滑技术,在不需训练和修改模型的前提下,有效缓解多模态大语言模型(MLLMs)在生成过程中的视觉不一致性幻觉问题。

Details Motivation: 多模态大语言模型(MLLMs)在生成过程中易出现视觉不一致的幻觉(如错误的对象、属性或关系),尤其在解码序列变长时因语义漂移而加剧,现有方法难以兼顾效率与效果。 Method: KVSmooth是一种无需训练、即插即用的方法:对KV缓存中的键(keys)和值(values)应用指数移动平均(EMA),并利用各token注意力分布的熵动态量化其‘沉降程度’,从而自适应调整平滑强度。 Result: 在CHAIR_S指标上幻觉率从41.8显著降至18.2,F1分数从77.5提升至79.2,同时提高精确率与召回率;优于需重训练或对比解码等计算昂贵的基线方法。 Conclusion: KVSmooth以轻量、高效、通用的方式提升了MLLMs的视觉忠实性与整体性能,为幻觉缓解提供了新范式。 Abstract: Despite the significant progress of Multimodal Large Language Models (MLLMs) across diverse tasks, hallucination -- corresponding to the generation of visually inconsistent objects, attributes, or relations -- remains a major obstacle to their reliable deployment. Unlike pure language models, MLLMs must ground their generation process in visual inputs. However, existing models often suffer from semantic drift during decoding, causing outputs to diverge from visual facts as the sequence length increases. To address this issue, we propose KVSmooth, a training-free and plug-and-play method that mitigates hallucination by performing attention-entropy-guided adaptive smoothing on hidden states. Specifically, KVSmooth applies an exponential moving average (EMA) to both keys and values in the KV-Cache, while dynamically quantifying the sink degree of each token through the entropy of its attention distribution to adaptively adjust the smoothing strength. Unlike computationally expensive retraining or contrastive decoding methods, KVSmooth operates efficiently during inference without additional training or model modification. Extensive experiments demonstrate that KVSmooth significantly reduces hallucination ($\mathit{CHAIR}_{S}$ from $41.8 \rightarrow 18.2$) while improving overall performance ($F_1$ score from $77.5 \rightarrow 79.2$), achieving higher precision and recall simultaneously. In contrast, prior methods often improve one at the expense of the other, validating the effectiveness and generality of our approach.

[115] SkeletonGaussian: Editable 4D Generation through Gaussian Skeletonization

Lifan Wu,Ruijie Zhu,Yubo Ai,Tianzhu Zhang

Main category: cs.CV

TL;DR: 本文提出了SkeletonGaussian框架,通过单目视频输入生成可编辑的动态3D高斯表示,结合骨架驱动的刚性运动与六面体平面(hexplane)细化的非刚性形变,提升了4D生成的可控性与可编辑性。

Details Motivation: 现有4D生成方法多采用隐式形变场表示运动,导致难以直接控制和编辑动态3D内容。 Method: 提出分层关节化表示:先用鲁棒骨架提取与线性混合蒙皮(LBS)建模刚性运动,再以hexplane进行非刚性形变细化。 Result: 在生成质量上超越现有方法,并支持直观的运动编辑(如重定向、插值等)。 Conclusion: SkeletonGaussian为可编辑的4D生成提供了新范式,兼顾表达能力、可解释性与可控性。 Abstract: 4D generation has made remarkable progress in synthesizing dynamic 3D objects from input text, images, or videos. However, existing methods often represent motion as an implicit deformation field, which limits direct control and editability. To address this issue, we propose SkeletonGaussian, a novel framework for generating editable dynamic 3D Gaussians from monocular video input. Our approach introduces a hierarchical articulated representation that decomposes motion into sparse rigid motion explicitly driven by a skeleton and fine-grained non-rigid motion. Concretely, we extract a robust skeleton and drive rigid motion via linear blend skinning, followed by a hexplane-based refinement for non-rigid deformations, enhancing interpretability and editability. Experimental results demonstrate that SkeletonGaussian surpasses existing methods in generation quality while enabling intuitive motion editing, establishing a new paradigm for editable 4D generation. Project page: https://wusar.github.io/projects/skeletongaussian/

[116] Light Up Your Face: A Physically Consistent Dataset and Diffusion Model for Face Fill-Light Enhancement

Jue Gong,Zihan Zhou,Jingkai Wang,Xiaohong Liu,Yulun Zhang,Xiaokang Yang

Main category: cs.CV

TL;DR: 本文提出了一种面向人脸补光增强(FFE)的可控、高保真、低计算成本的方法FiLitDiff,并构建了大规模物理一致配对数据集LYF-160K。

Details Motivation: 现有面部重打光方法多关注整体光照重塑,易破坏原始场景光照一致性,难以满足实际人脸补光增强(FFE)需求,即仅提亮人脸而不改变背景和原场景光照。 Method: 构建了大规模物理一致配对数据集LightYourFace-160K(LYF-160K),提出物理感知光照提示(PALP)预训练模块,并基于扩散模型构建了一步式可控补光扩散模型FiLitDiff,以六维解耦光照参数为条件输入。 Result: 在保留背景光照前提下,FiLitDiff在保持高感知质量与竞争性全参考指标的同时,实现了高效、可控、高保真的人脸补光增强。 Conclusion: FiLitDiff结合物理建模与扩散生成,在人脸补光任务中实现了背景一致性、可控性与效率的统一,LYF-160K数据集为后续研究提供了可靠基准。 Abstract: Face fill-light enhancement (FFE) brightens underexposed faces by adding virtual fill light while keeping the original scene illumination and background unchanged. Most face relighting methods aim to reshape overall lighting, which can suppress the input illumination or modify the entire scene, leading to foreground-background inconsistency and mismatching practical FFE needs. To support scalable learning, we introduce LightYourFace-160K (LYF-160K), a large-scale paired dataset built with a physically consistent renderer that injects a disk-shaped area fill light controlled by six disentangled factors, producing 160K before-and-after pairs. We first pretrain a physics-aware lighting prompt (PALP) that embeds the 6D parameters into conditioning tokens, using an auxiliary planar-light reconstruction objective. Building on a pretrained diffusion backbone, we then train a fill-light diffusion (FiLitDiff), an efficient one-step model conditioned on physically grounded lighting codes, enabling controllable and high-fidelity fill lighting at low computational cost. Experiments on held-out paired sets demonstrate strong perceptual quality and competitive full-reference metrics, while better preserving background illumination. The dataset and model will be at https://github.com/gobunu/Light-Up-Your-Face.

[117] Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement

Zipeng Zhu,Zhanghao Hu,Qinglin Zhu,Yuxi Hong,Yijun Liu,Jingyong Su,Yulan He,Lin Gui

Main category: cs.CV

TL;DR: 本文提出了一种动态视觉定位方法LASER,通过层敏感性分析发现不同任务需在不同网络层进行视觉信息重激活,并设计VAQ指标自适应选择最相关层,无需额外训练即提升多类VQA任务性能。

Details Motivation: 现有LVLMs因固定视觉token预算导致图像降质、细节丢失和幻觉;现有注意力增强方法依赖静态‘magic layer’,难以泛化到复杂推理任务。 Method: 提出动态视觉接地视角,通过层敏感性分析发现任务复杂度决定视觉信息重激活的网络深度;设计VAQ(Visual Activation by Query)指标衡量各层注意力对查询的敏感性;基于VAQ构建无需训练的推理框架LASER,自适应选择最优层进行视觉定位与问答增强。 Result: 在多个VQA基准上实验表明,LASER显著提升了不同复杂度任务的准确率,且不依赖额外训练。 Conclusion: 视觉接地是动态过程,应依据任务需求动态选择网络层;LASER提供了一种通用、高效、训练无关的推理增强范式。 Abstract: Large Vision-Language Models (LVLMs) have advanced rapidly by aligning visual patches with the text embedding space, but a fixed visual-token budget forces images to be resized to a uniform pretraining resolution, often erasing fine-grained details and causing hallucinations via over-reliance on language priors. Recent attention-guided enhancement (e.g., cropping or region-focused attention allocation) alleviates this, yet it commonly hinges on a static "magic layer" empirically chosen on simple recognition benchmarks and thus may not transfer to complex reasoning tasks. In contrast to this static assumption, we propose a dynamic perspective on visual grounding. Through a layer-wise sensitivity analysis, we demonstrate that visual grounding is a dynamic process: while simple object recognition tasks rely on middle layers, complex visual search and reasoning tasks require visual information to be reactivated at deeper layers. Based on this observation, we introduce Visual Activation by Query (VAQ), a metric that identifies the layer whose attention map is most relevant to query-specific visual grounding by measuring attention sensitivity to the input query. Building on VAQ, we further propose LASER (Layer-adaptive Attention-guided Selective visual and decoding Enhancement for Reasoning), a training-free inference procedure that adaptively selects task-appropriate layers for visual localization and question answering. Experiments across diverse VQA benchmarks show that LASER significantly improves VQA accuracy across tasks with varying levels of complexity.

[118] JOintGS: Joint Optimization of Cameras, Bodies and 3D Gaussians for In-the-Wild Monocular Reconstruction

Zihan Lou,Jinlong Fan,Sihan Ma,Yuxiang Yang,Jing Zhang

Main category: cs.CV

TL;DR: JOintGS是一种联合优化相机外参、人体姿态和3D高斯表示的统一框架,通过前景-背景解耦与协同优化机制,在单目RGB视频中实现高保真、可驱动的3D人体重建,显著提升鲁棒性与重建质量。

Details Motivation: 现有方法在野外场景下因相机参数和人体姿态估计不准而性能受限;3D高斯泼溅(3DGS)依赖精确标定与姿态标注,难以适用于真实复杂场景。 Method: 提出JOintGS框架:1)联合优化相机外参、人体姿态与3D高斯;2)引入前景-背景显式解耦以实现相互增强;3)设计时序动力学模块建模姿态相关形变;4)引入残差颜色场处理光照变化。 Result: 在NeuMan和EMDB数据集上达到SOTA:NeuMan上PSNR提升2.1 dB,支持实时渲染,且对噪声初始化鲁棒性显著增强。 Conclusion: JOintGS通过协同联合优化与结构化建模,有效缓解单目视频中人体重建对精确先验的依赖,为野外场景下的高保真可驱动3D人体建模提供了新范式。 Abstract: Reconstructing high-fidelity animatable 3D human avatars from monocular RGB videos remains challenging, particularly in unconstrained in-the-wild scenarios where camera parameters and human poses from off-the-shelf methods (e.g., COLMAP, HMR2.0) are often inaccurate. Splatting (3DGS) advances demonstrate impressive rendering quality and real-time performance, they critically depend on precise camera calibration and pose annotations, limiting their applicability in real-world settings. We present JOintGS, a unified framework that jointly optimizes camera extrinsics, human poses, and 3D Gaussian representations from coarse initialization through a synergistic refinement mechanism. Our key insight is that explicit foreground-background disentanglement enables mutual reinforcement: static background Gaussians anchor camera estimation via multi-view consistency; refined cameras improve human body alignment through accurate temporal correspondence; optimized human poses enhance scene reconstruction by removing dynamic artifacts from static constraints. We further introduce a temporal dynamics module to capture fine-grained pose-dependent deformations and a residual color field to model illumination variations. Extensive experiments on NeuMan and EMDB datasets demonstrate that JOintGS achieves superior reconstruction quality, with 2.1~dB PSNR improvement over state-of-the-art methods on NeuMan dataset, while maintaining real-time rendering. Notably, our method shows significantly enhanced robustness to noisy initialization compared to the baseline.Our source code is available at https://github.com/MiliLab/JOintGS.

[119] Multiview Self-Representation Learning across Heterogeneous Views

Jie Chen,Zhu Wang,Chuanbin Liu,Xi Peng

Main category: cs.CV

TL;DR: 本文提出了一种多视角自表示学习(MSRL)方法,通过利用异构多视角特征的自表示特性,在无监督迁移学习下学习不变表征,结合信息传递机制和分配概率分布一致性约束,提升跨预训练模型的表征鲁棒性与泛化性。

Details Motivation: 不同预训练模型提取的同一样本特征分布差异大,导致在大规模无标签视觉数据上以完全无监督方式学习不变表征极具挑战。 Method: 提出多视角自表示学习(MSRL):对多个冻结预训练骨干网络分别接独立线性层,构建异构多视角特征;引入基于自表示学习的信息传递机制实现特征聚合;设计分配概率分布一致性方案,利用多视角互补信息引导学习并强制线性模型间表征不变性;并提供相关理论分析。 Result: 在多个基准视觉数据集上的大量实验表明,MSRL持续优于多种当前最优方法。 Conclusion: MSRL能有效融合异构多视角特征,通过自表示与一致性约束提升无监督表征的不变性与判别力,为多源预训练模型协同学习提供了新范式。 Abstract: Features of the same sample generated by different pretrained models often exhibit inherently distinct feature distributions because of discrepancies in the model pretraining objectives or architectures. Learning invariant representations from large-scale unlabeled visual data with various pretrained models in a fully unsupervised transfer manner remains a significant challenge. In this paper, we propose a multiview self-representation learning (MSRL) method in which invariant representations are learned by exploiting the self-representation property of features across heterogeneous views. The features are derived from large-scale unlabeled visual data through transfer learning with various pretrained models and are referred to as heterogeneous multiview data. An individual linear model is stacked on top of its corresponding frozen pretrained backbone. We introduce an information-passing mechanism that relies on self-representation learning to support feature aggregation over the outputs of the linear model. Moreover, an assignment probability distribution consistency scheme is presented to guide multiview self-representation learning by exploiting complementary information across different views. Consequently, representation invariance across different linear models is enforced through this scheme. In addition, we provide a theoretical analysis of the information-passing mechanism, the assignment probability distribution consistency and the incremental views. Extensive experiments with multiple benchmark visual datasets demonstrate that the proposed MSRL method consistently outperforms several state-of-the-art approaches.

[120] Fine-tuning Pre-trained Vision-Language Models in a Human-Annotation-Free Manner

Qian-Wei Wang,Guanghao Meng,Ren Cai,Yaguang Song,Shu-Tao Xia

Main category: cs.CV

TL;DR: 本文提出了一种名为Collaborative Fine-Tuning (CoFT)的无监督自适应框架,用于在无需大量标注数据的情况下提升大规模视觉语言模型(如CLIP)在下游任务上的性能。CoFT通过双模型、跨模态协作机制,结合正负文本提示建模伪标签质量,并采用两阶段训练策略;其扩展版本CoFT+进一步引入迭代微调、动量对比学习和大语言模型生成提示,显著优于现有无监督及少量监督方法。

Details Motivation: 大型视觉语言模型(如CLIP)虽具备强零样本泛化能力,但下游适配通常依赖昂贵的标注数据;现有无监督自训练方法受限于不可靠的置信度过滤、确认偏差及低置信度样本利用不足。 Method: 提出Collaborative Fine-Tuning(CoFT):1)双提示学习(正/负文本提示)以样本依赖方式建模伪标签清洁度,避免人工设定阈值或噪声假设;2)负提示正则化轻量视觉适配模块;3)两阶段训练:先高效参数微调高置信样本,再全参数微调协同过滤的伪标签;CoFT+进一步加入迭代微调、动量对比学习与LLM生成提示。 Result: 在多个基准上持续超越现有无监督方法,甚至优于少量样本监督基线。 Conclusion: CoFT及其增强版CoFT+为VLM无监督下游适配提供了更鲁棒、高效且无需人工调参的新范式,显著缓解了伪标签噪声与低置信样本利用难题。 Abstract: Large-scale vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization, but adapting them to downstream tasks typically requires costly labeled data. Existing unsupervised self-training methods rely on pseudo-labeling, yet often suffer from unreliable confidence filtering, confirmation bias, and underutilization of low-confidence samples. We propose Collaborative Fine-Tuning (CoFT), an unsupervised adaptation framework that leverages unlabeled data through a dual-model, cross-modal collaboration mechanism. CoFT introduces a dual-prompt learning strategy with positive and negative textual prompts to explicitly model pseudo-label cleanliness in a sample-dependent manner, removing the need for hand-crafted thresholds or noise assumptions. The negative prompt also regularizes lightweight visual adaptation modules, improving robustness under noisy supervision. CoFT employs a two-phase training scheme, transitioning from parameter-efficient fine-tuning on high-confidence samples to full fine-tuning guided by collaboratively filtered pseudo-labels. Building on CoFT, CoFT+ further enhances adaptation via iterative fine-tuning, momentum contrastive learning, and LLM-generated prompts. Extensive experiments demonstrate consistent gains over existing unsupervised methods and even few-shot supervised baselines.

[121] Explicit Uncertainty Modeling for Active CLIP Adaptation with Dual Prompt Tuning

Qian-Wei Wang,Yaguang Song,Shu-Tao Xia

Main category: cs.CV

TL;DR: 本文提出了一种基于双提示调优的鲁棒不确定性建模框架,用于在标注预算有限的情况下主动适应预训练视觉-语言模型(如CLIP)进行图像分类。通过引入正向和负向可学习文本提示,分别提升分类判别力与显式建模预测正确概率,从而提供更可靠的不确定性信号以指导样本选择。实验表明该方法在多种微调范式下均优于现有主动学习方法。

Details Motivation: 现有主动学习方法在适配CLIP等视觉-语言模型时,缺乏从模型视角显式建模不确定性,仅依赖熵或聚类等启发式策略,难以在标注受限下高效选择信息量最大的样本。 Method: 提出双提示调优框架:在CLIP文本分支中引入两个可学习提示——正向提示用于增强任务相关文本嵌入的判别性,负向提示则以逆向方式训练,显式估计预测标签正确的概率,作为不确定性信号。 Result: 在多种微调范式和不同标注预算下,该方法在多个图像分类基准上持续超越现有主动学习方法,验证了其有效性与鲁棒性。 Conclusion: 显式建模预测置信度的双提示机制能更有效地指导主动学习中的样本选择,为CLIP等大模型的低资源适配提供了新思路。 Abstract: Pre-trained vision-language models such as CLIP exhibit strong transferability, yet adapting them to downstream image classification tasks under limited annotation budgets remains challenging. In active learning settings, the model must select the most informative samples for annotation from a large pool of unlabeled data. Existing approaches typically estimate uncertainty via entropy-based criteria or representation clustering, without explicitly modeling uncertainty from the model perspective. In this work, we propose a robust uncertainty modeling framework for active CLIP adaptation based on dual-prompt tuning. We introduce two learnable prompts in the textual branch of CLIP. The positive prompt enhances the discriminability of task-specific textual embeddings corresponding to light-weight tuned visual embeddings, improving classification reliability. Meanwhile, the negative prompt is trained in an reversed manner to explicitly model the probability that the predicted label is correct, providing a principled uncertainty signal for guiding active sample selection. Extensive experiments across different fine-tuning paradigms demonstrate that our method consistently outperforms existing active learning methods under the same annotation budget.

[122] Finding NeMO: A Geometry-Aware Representation of Template Views for Few-Shot Perception

Sebastian Jung,Leonard Klüpfel,Rudolph Triebel,Maximilian Durner

Main category: cs.CV

TL;DR: 本文提出了Neural Memory Object (NeMO),一种新颖的以对象为中心的表示方法,能够仅使用RGB图像检测、分割并估计训练中未见过物体的6自由度姿态。

Details Motivation: 解决在训练中未见过物体的少样本感知问题,提升对新物体的交互能力、可扩展性和效率。 Method: 提出NeMO,包括一个编码器(利用少量RGB模板视图生成含语义与几何信息的稀疏点云)和一个解码器(结合对象编码与查询图像生成多种密集预测)。 Result: 在BOP基准多个数据集和感知任务上取得竞争性及SOTA结果,无需相机特定参数或目标数据重训练。 Conclusion: NeMO通过将对象信息外包至内存对象并用单一网络处理多任务,实现了快速对象接入,提升了少样本物体感知的通用性、可扩展性与效率。 Abstract: We present Neural Memory Object (NeMO), a novel object-centric representation that can be used to detect, segment and estimate the 6DoF pose of objects unseen during training using RGB images. Our method consists of an encoder that requires only a few RGB template views depicting an object to generate a sparse object-like point cloud using a learned UDF containing semantic and geometric information. Next, a decoder takes the object encoding together with a query image to generate a variety of dense predictions. Through extensive experiments, we show that our method can be used for few-shot object perception without requiring any camera-specific parameters or retraining on target data. Our proposed concept of outsourcing object information in a NeMO and using a single network for multiple perception tasks enhances interaction with novel objects, improving scalability and efficiency by enabling quick object onboarding without retraining or extensive pre-processing. We report competitive and state-of-the-art results on various datasets and perception tasks of the BOP benchmark, demonstrating the versatility of our approach. https://github.com/DLR-RM/nemo

[123] VecSet-Edit: Unleashing Pre-trained LRM for Mesh Editing from Single Image

Teng-Fang Hsiao,Bo-Kai Ruan,Yu-Lun Liu,Hong-Han Shuai

Main category: cs.CV

TL;DR: 本文提出VecSet-Edit,首个基于高保真VecSet LRM模型的3D网格编辑方法,通过分析VecSet token的空间特性,引入掩码引导的Token播种、注意力对齐的Token门控及漂移感知的Token剪枝策略,仅用2D图像条件即可精准定位并编辑目标区域,并通过细节保持的纹理烘焙模块保留原始网格的几何与纹理细节。

Details Motivation: 现有3D编辑方法多集中于3D高斯泼溅或多视角图像,直接编辑3D网格的研究较少;已有方法(如VoxHammer)依赖体素表示,存在分辨率低、需繁琐3D掩码等问题。 Method: 基于VecSet LRM构建编辑管线;提出Mask-guided Token Seeding和Attention-aligned Token Gating实现2D条件下的区域定位;设计Drift-aware Token Pruning应对VecSet扩散过程与体素建模差异;引入Detail-preserving Texture Baking模块保留几何与纹理细节。 Result: 实现了高质量、细粒度的3D网格编辑,仅需2D图像条件即可完成精准区域控制,显著优于基于体素的方法,在几何保真度和纹理一致性上表现突出。 Conclusion: VecSet-Edit是首个面向3D网格的、基于VecSet LRM的端到端编辑框架,有效克服了体素表示的固有缺陷,为高保真可控3D内容创作提供了新范式。 Abstract: 3D editing has emerged as a critical research area to provide users with flexible control over 3D assets. While current editing approaches predominantly focus on 3D Gaussian Splatting or multi-view images, the direct editing of 3D meshes remains underexplored. Prior attempts, such as VoxHammer, rely on voxel-based representations that suffer from limited resolution and necessitate labor-intensive 3D mask. To address these limitations, we propose \textbf{VecSet-Edit}, the first pipeline that leverages the high-fidelity VecSet Large Reconstruction Model (LRM) as a backbone for mesh editing. Our approach is grounded on a analysis of the spatial properties in VecSet tokens, revealing that token subsets govern distinct geometric regions. Based on this insight, we introduce Mask-guided Token Seeding and Attention-aligned Token Gating strategies to precisely localize target regions using only 2D image conditions. Also, considering the difference between VecSet diffusion process versus voxel we design a Drift-aware Token Pruning to reject geometric outliers during the denoising process. Finally, our Detail-preserving Texture Baking module ensures that we not only preserve the geometric details of original mesh but also the textural information. More details can be found in our project page: https://github.com/BlueDyee/VecSet-Edit/tree/main

[124] When and Where to Attack? Stage-wise Attention-Guided Adversarial Attack on Large Vision Language Models

Jaehyun Kwak,Nam Cao,Boryeong Cho,Segyu Lee,Sumyeong Ahn,Se-Young Yun

Main category: cs.CV

TL;DR: 本文提出了一种名为SAGA的分阶段注意力引导攻击方法,利用LVLMs中区域注意力分数与对抗损失敏感性的正相关性,逐步将扰动集中在高注意力区域,从而在有限扰动预算下生成更难察觉且攻击成功率更高的对抗样本。

Details Motivation: 现有基于随机裁剪的对抗攻击方法具有随机性,且未能高效利用像素级扰动预算;同时,LVLMs的安全漏洞亟需更精准、高效的攻击方法来暴露。 Method: 基于区域注意力分数与对抗损失敏感性的正相关性及注意力再分布现象,提出分阶段注意力引导攻击(SAGA)框架,动态聚焦于高注意力区域进行扰动优化。 Result: SAGA在十个LVLM上持续达到最先进的攻击成功率,同时生成高度不可察觉的对抗样本,并更高效地利用扰动预算。 Conclusion: 注意力机制可被有效用于指导对抗攻击的设计,SAGA验证了结构化、感知驱动的局部扰动策略优于随机或全局扰动,在LVLM安全评估中具有重要价值。 Abstract: Adversarial attacks against Large Vision-Language Models (LVLMs) are crucial for exposing safety vulnerabilities in modern multimodal systems. Recent attacks based on input transformations, such as random cropping, suggest that spatially localized perturbations can be more effective than global image manipulation. However, randomly cropping the entire image is inherently stochastic and fails to use the limited per-pixel perturbation budget efficiently. We make two key observations: (i) regional attention scores are positively correlated with adversarial loss sensitivity, and (ii) attacking high-attention regions induces a structured redistribution of attention toward subsequent salient regions. Based on these findings, we propose Stage-wise Attention-Guided Attack (SAGA), an attention-guided framework that progressively concentrates perturbations on high-attention regions. SAGA enables more efficient use of constrained perturbation budgets, producing highly imperceptible adversarial examples while consistently achieving state-of-the-art attack success rates across ten LVLMs. The source code is available at https://github.com/jackwaky/SAGA.

[125] SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration

Zekun Li,Ning Wang,Tongxin Bai,Changwang Mei,Peisong Wang,Shuang Qiu,Jian Cheng

Main category: cs.CV

TL;DR: 本文提出SparVAR,一种无需训练的视觉自回归(VAR)模型加速框架,通过利用注意力机制的三个特性(强注意力汇点、跨尺度激活相似性、显著局部性),实现高效稀疏注意力计算,在不牺牲高分辨率图像细节的前提下大幅提升生成速度。

Details Motivation: 现有VAR模型在高分辨率下计算复杂度呈四次方增长,导致推理延迟严重;而现有加速方法常跳过高分辨率尺度,损害图像质量。 Method: SparVAR基于三个关键观察设计:(i)强注意力汇点,(ii)跨尺度激活相似性,(iii)显著局部性;通过动态预测高分辨率尺度的稀疏注意力模式、构建尺度自相似稀疏注意力(使用高效索引映射机制)、引入跨尺度局部稀疏注意力及块级稀疏核,实现训练-free加速。 Result: SparVAR将8B参数VAR模型生成1024×1024图像的时间降至1秒以内;相比FlashAttention加速的VAR基线,提速1.57倍且几乎保留全部高频细节;结合尺度跳过策略可达2.28倍加速,同时保持有竞争力的视觉质量。 Conclusion: SparVAR是一种高效、无损、训练无关的VAR加速方案,兼顾速度与图像保真度,为高分辨率视觉生成提供了实用新路径。 Abstract: Visual AutoRegressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction paradigm. However, mainstream VAR paradigms attend to all tokens across historical scales at each autoregressive step. As the next scale resolution grows, the computational complexity of attention increases quartically with resolution, causing substantial latency. Prior accelerations often skip high-resolution scales, which speeds up inference but discards high-frequency details and harms image quality. To address these problems, we present SparVAR, a training-free acceleration framework that exploits three properties of VAR attention: (i) strong attention sinks, (ii) cross-scale activation similarity, and (iii) pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high-resolution scales from a sparse decision scale, and construct scale self-similar sparse attention via an efficient index-mapping mechanism, enabling high-efficiency sparse attention computation at large scales. Furthermore, we propose cross-scale local sparse attention and implement an efficient block-wise sparse kernel, which achieves $\mathbf{> 5\times}$ faster forward speed than FlashAttention. Extensive experiments demonstrate that the proposed SparseVAR can reduce the generation time of an 8B model producing $1024\times1024$ high-resolution images to the 1s, without skipping the last scales. Compared with the VAR baseline accelerated by FlashAttention, our method achieves a $\mathbf{1.57\times}$ speed-up while preserving almost all high-frequency details. When combined with existing scale-skipping strategies, SparseVAR attains up to a $\mathbf{2.28\times}$ acceleration, while maintaining competitive visual generation quality. Code is available at https://github.com/CAS-CLab/SparVAR.

[126] Enabling Real-Time Colonoscopic Polyp Segmentation on Commodity CPUs via Ultra-Lightweight Architecture

Weihao Gao,Zhuo Deng,Zheng Gong,Lan Ma

Main category: cs.CV

TL;DR: 本文提出UltraSeg系列轻量级模型,专为资源受限环境(如基层医院、胶囊机器人)设计,在<0.3M参数下实现高精度结直肠息肉实时分割,CPU单核达90 FPS,Dice得分保持U-Net的94%以上。

Details Motivation: 现有高精度息肉分割模型依赖GPU,难以在基层医院、移动内镜设备或胶囊机器人等资源受限场景部署。 Method: 提出UltraSeg-108K与UltraSeg-130K两个超轻量模型;通过联合优化编码器-解码器宽度、引入约束空洞卷积扩大感受野、设计跨层轻量融合模块,实现极致压缩与高效推理。 Result: 在7个公开数据集上验证,UltraSeg以仅0.4%的参数量达到31M参数U-Net 94%以上的Dice分数;CPU单核推理速度达90 FPS;支持单中心与多中心、多模态泛化。 Conclusion: UltraSeg为临床资源受限场景提供了首个兼具高精度、高效率与即插即用特性的CPU原生分割方案,并为微创手术视觉任务提供了可复现的技术范式。 Abstract: Early detection of colorectal cancer hinges on real-time, accurate polyp identification and resection. Yet current high-precision segmentation models rely on GPUs, making them impractical to deploy in primary hospitals, mobile endoscopy units, or capsule robots. To bridge this gap, we present the UltraSeg family, operating in an extreme-compression regime (<0.3 M parameters). UltraSeg-108K (0.108 M parameters) is optimized for single-center data, while UltraSeg-130K (0.13 M parameters) generalizes to multi-center, multi-modal images. By jointly optimizing encoder-decoder widths, incorporating constrained dilated convolutions to enlarge receptive fields, and integrating a cross-layer lightweight fusion module, the models achieve 90 FPS on a single CPU core without sacrificing accuracy. Evaluated on seven public datasets, UltraSeg retains >94% of the Dice score of a 31 M-parameter U-Net while utilizing only 0.4% of its parameters, establishing a strong, clinically viable baseline for the extreme-compression domain and offering an immediately deployable solution for resource-constrained settings. This work provides not only a CPU-native solution for colonoscopy but also a reproducible blueprint for broader minimally invasive surgical vision applications. Source code is publicly available to ensure reproducibility and facilitate future benchmarking.

[127] Interactive Spatial-Frequency Fusion Mamba for Multi-Modal Image Fusion

Yixin Zhu,Long Lv,Pingping Zhang,Xuehu Liu,Tongdan Tang,Feng Tian,Weibing Sun,Huchuan Lu

Main category: cs.CV

TL;DR: 本文提出了一种交互式空频融合Mamba框架(ISFM),通过模态特异性提取器、多尺度频率融合和交互式空频融合模块,在多模态图像融合任务中实现了优于现有方法的性能。

Details Motivation: 现有MMIF方法在融合空间与频率信息时缺乏有效交互,导致特征表达不够充分。 Method: 提出ISFM框架,包括模态特异性提取器(MSE)、多尺度频率融合(MFF)和交互式空频融合(ISF)模块,实现跨模态空频特征引导与融合。 Result: 在六个MMIF数据集上实验表明,ISFM性能优于当前主流方法。 Conclusion: 交互式空频融合机制能更有效地利用频率信息增强空间特征表达,提升多模态图像融合质量。 Abstract: Multi-Modal Image Fusion (MMIF) aims to combine images from different modalities to produce fused images, retaining texture details and preserving significant information. Recently, some MMIF methods incorporate frequency domain information to enhance spatial features. However, these methods typically rely on simple serial or parallel spatial-frequency fusion without interaction. In this paper, we propose a novel Interactive Spatial-Frequency Fusion Mamba (ISFM) framework for MMIF. Specifically, we begin with a Modality-Specific Extractor (MSE) to extract features from different modalities. It models long-range dependencies across the image with linear computational complexity. To effectively leverage frequency information, we then propose a Multi-scale Frequency Fusion (MFF). It adaptively integrates low-frequency and high-frequency components across multiple scales, enabling robust representations of frequency features. More importantly, we further propose an Interactive Spatial-Frequency Fusion (ISF). It incorporates frequency features to guide spatial features across modalities, enhancing complementary representations. Extensive experiments are conducted on six MMIF datasets. The experimental results demonstrate that our ISFM can achieve better performances than other state-of-the-art methods. The source code is available at https://github.com/Namn23/ISFM.

[128] LCUDiff: Latent Capacity Upgrade Diffusion for Faithful Human Body Restoration

Jue Gong,Zihan Zhou,Jingkai Wang,Shu Li,Libo Liu,Jianliang Lan,Yulun Zhang

Main category: cs.CV

TL;DR: 本文提出LCUDiff,一种稳定的一次性框架,通过将预训练的潜在扩散模型从4通道潜在空间升级到16通道潜在空间,以提升人体中心图像恢复的保真度;采用通道分割蒸馏(CSD)进行VAE微调,并设计先验保持适配(PPA)与解码器路由器(DeR)以提升恢复质量。

Details Motivation: 现有方法在人体中心图像恢复(HBR)中保真度不足,尤其是预训练文本到图像扩散模型中的变分自编码器(VAE)严重限制了恢复保真度。 Method: 提出LCUDiff框架:1)将潜在扩散模型从4通道扩展至16通道;2)使用通道分割蒸馏(CSD)微调VAE,保持前4通道对齐预训练先验,其余通道编码高频细节;3)设计先验保持适配(PPA)缓解通道维度不匹配;4)引入基于恢复质量评分的解码器路由器(DeR)实现样本级自适应解码。 Result: 在合成与真实数据集上实验表明,LCUDiff在轻度退化下相比现有方法具有更高保真度、更少伪影,同时保持单步推理效率。 Conclusion: LCUDiff通过扩展潜在空间维度并协同优化VAE、扩散主干与解码器,有效提升了人体中心图像恢复的质量与鲁棒性,兼顾性能与效率。 Abstract: Existing methods for restoring degraded human-centric images often struggle with insufficient fidelity, particularly in human body restoration (HBR). Recent diffusion-based restoration methods commonly adapt pre-trained text-to-image diffusion models, where the variational autoencoder (VAE) can significantly bottleneck restoration fidelity. We propose LCUDiff, a stable one-step framework that upgrades a pre-trained latent diffusion model from the 4-channel latent space to the 16-channel latent space. For VAE fine-tuning, channel splitting distillation (CSD) is used to keep the first four channels aligned with pre-trained priors while allocating the additional channels to effectively encode high-frequency details. We further design prior-preserving adaptation (PPA) to smoothly bridge the mismatch between 4-channel diffusion backbones and the higher-dimensional 16-channel latent. In addition, we propose a decoder router (DeR) for per-sample decoder routing using restoration-quality score annotations, which improves visual quality across diverse conditions. Experiments on synthetic and real-world datasets show competitive results with higher fidelity and fewer artifacts under mild degradations, while preserving one-step efficiency. The code and model will be at https://github.com/gobunu/LCUDiff.

[129] Med-MMFL: A Multimodal Federated Learning Benchmark in Healthcare

Aavash Chhetri,Bibek Niroula,Pratik Shrestha,Yash Raj Shrestha,Lesley A Anderson,Prashnna K Gyawali,Loris Bazzani,Binod Bhattarai

Main category: cs.CV

TL;DR: 本文提出了Med-MMFL,首个面向医疗领域的多模态联邦学习(MMFL)基准,涵盖多种模态、任务和联邦场景,并评估了六种主流FL算法在真实与合成数据分布下的性能,代码与数据处理流程已开源。

Details Motivation: 现有医学联邦学习基准稀缺,且多局限于单模态或双模态及有限任务,缺乏对多模态联邦学习(MMFL)的标准化评估,阻碍系统性研究进展。 Method: 构建Med-MMFL基准,包含10种医学模态(如文本、病理图像、ECG、X光、放射报告、多序列MRI等)、2–4模态组合的数据集;覆盖自然联邦、合成IID与非IID划分;评估分割、分类、模态对齐(检索)和视觉问答(VQA)四类任务;测试六种代表性FL算法(含不同聚合策略、损失函数与正则化技术)。 Result: 提供了全面、可复现的MMFL评估框架,在多种模态组合、任务类型和数据分布设定下验证了当前FL算法的性能表现与局限性。 Conclusion: Med-MMFL填补了医学多模态联邦学习基准的空白,为推动隐私保护下跨机构协同建模的研究与应用提供了标准化平台与开源基础设施。 Abstract: Federated learning (FL) enables collaborative model training across decentralized medical institutions while preserving data privacy. However, medical FL benchmarks remain scarce, with existing efforts focusing mainly on unimodal or bimodal modalities and a limited range of medical tasks. This gap underscores the need for standardized evaluation to advance systematic understanding in medical MultiModal FL (MMFL). To this end, we introduce Med-MMFL, the first comprehensive MMFL benchmark for the medical domain, encompassing diverse modalities, tasks, and federation scenarios. Our benchmark evaluates six representative state-of-the-art FL algorithms, covering different aggregation strategies, loss formulations, and regularization techniques. It spans datasets with 2 to 4 modalities, comprising a total of 10 unique medical modalities, including text, pathology images, ECG, X-ray, radiology reports, and multiple MRI sequences. Experiments are conducted across naturally federated, synthetic IID, and synthetic non-IID settings to simulate real-world heterogeneity. We assess segmentation, classification, modality alignment (retrieval), and VQA tasks. To support reproducibility and fair comparison of future multimodal federated learning (MMFL) methods under realistic medical settings, we release the complete benchmark implementation, including data processing and partitioning pipelines, at https://github.com/bhattarailab/Med-MMFL-Benchmark .

[130] TrajVG: 3D Trajectory-Coupled Visual Geometry Learning

Xingyu Miao,Weiguang Zhao,Tao Lu,Linning Yu,Mulin Yu,Yang Long,Jiangmiao Pang,Junting Dong

Main category: cs.CV

TL;DR: 本文提出TrajVG框架,通过显式预测相机坐标系下的3D轨迹来增强多帧3D重建,结合稀疏轨迹、局部点图和相对位姿,并引入双向一致性与静态锚点驱动的姿态一致性自监督目标,在无真实3D轨迹标签下实现高性能重建。

Details Motivation: 前馈多帧3D重建模型在含物体运动的视频中性能下降,因全局参考模糊、局部点图依赖估计位姿易漂移,导致跨帧错位与结构重复。 Method: 提出TrajVG框架,显式预测相机坐标系下的3D轨迹;耦合稀疏轨迹、每帧局部点图和相对相机位姿;设计两个几何一致性目标:(i) 双向轨迹-点图一致性(控制梯度流),(ii) 由静态轨迹锚点驱动的姿态一致性目标(抑制动态区域梯度);将约束转化为仅需伪2D轨迹的自监督目标,支持混合监督训练。 Result: 在3D跟踪、位姿估计、点图重建和视频深度任务上,TrajVG全面超越当前前馈方法基线。 Conclusion: 显式建模跨帧3D轨迹并结合几何一致性和自监督策略,可有效缓解运动场景下的重建漂移与错位问题,提升前馈多帧重建鲁棒性与精度。 Abstract: Feed-forward multi-frame 3D reconstruction models often degrade on videos with object motion. Global-reference becomes ambiguous under multiple motions, while the local pointmap relies heavily on estimated relative poses and can drift, causing cross-frame misalignment and duplicated structures. We propose TrajVG, a reconstruction framework that makes cross-frame 3D correspondence an explicit prediction by estimating camera-coordinate 3D trajectories. We couple sparse trajectories, per-frame local point maps, and relative camera poses with geometric consistency objectives: (i) bidirectional trajectory-pointmap consistency with controlled gradient flow, and (ii) a pose consistency objective driven by static track anchors that suppresses gradients from dynamic regions. To scale training to in-the-wild videos where 3D trajectory labels are scarce, we reformulate the same coupling constraints into self-supervised objectives using only pseudo 2D tracks, enabling unified training with mixed supervision. Extensive experiments across 3D tracking, pose estimation, pointmap reconstruction, and video depth show that TrajVG surpasses the current feedforward performance baseline.

[131] SynthVerse: A Large-Scale Diverse Synthetic Dataset for Point Tracking

Weiguang Zhao,Haoran Xu,Xingyu Miao,Qin Zhao,Rui Zhang,Kaizhu Huang,Ning Gao,Peizhou Cao,Mingze Sun,Mulin Yu,Tao Lu,Linning Xu,Junting Dong,Jiangmiao Pang

Main category: cs.CV

TL;DR: 本文提出SynthVerse,一个大规模、多样化的合成数据集,专为点跟踪任务设计,旨在解决现有数据集多样性不足和轨迹标注不完善的问题。

Details Motivation: 现有点跟踪数据集缺乏高质量、多样性的数据,导致模型泛化能力受限。 Method: 构建SynthVerse合成数据集,涵盖动画电影风格、具身操作、场景导航和关节物体等新领域,并建立多样化点跟踪基准用于系统评估。 Result: 实验表明,使用SynthVerse训练显著提升了点跟踪器在多种场景下的泛化能力,并揭示了现有方法在域偏移下的局限性。 Conclusion: SynthVerse有效推动了通用点跟踪的发展,为模型训练与评估提供了更全面、高质量的数据基础。 Abstract: Point tracking aims to follow visual points through complex motion, occlusion, and viewpoint changes, and has advanced rapidly with modern foundation models. Yet progress toward general point tracking remains constrained by limited high-quality data, as existing datasets often provide insufficient diversity and imperfect trajectory annotations. To this end, we introduce SynthVerse, a large-scale, diverse synthetic dataset specifically designed for point tracking. SynthVerse includes several new domains and object types missing from existing synthetic datasets, such as animated-film-style content, embodied manipulation, scene navigation, and articulated objects. SynthVerse substantially expands dataset diversity by covering a broader range of object categories and providing high-quality dynamic motions and interactions, enabling more robust training and evaluation for general point tracking. In addition, we establish a highly diverse point tracking benchmark to systematically evaluate state-of-the-art methods under broader domain shifts. Extensive experiments and analyses demonstrate that training with SynthVerse yields consistent improvements in generalization and reveal limitations of existing trackers under diverse settings.

Tianming Liang,Qirui Du,Jian-Fang Hu,Haichao Jiang,Zicheng Lin,Wei-Shi Zheng

Main category: cs.CV

TL;DR: 本文提出Seg-ReSearch,一种通过结合外部搜索与交错推理来突破多模态大语言模型(MLLM)固有知识限制的新型分割范式,显著提升开放世界动态查询下的视频对象分割性能。

Details Motivation: 现有基于MLLM的分割方法受限于模型冻结的内部知识,难以应对需要最新信息或领域特定概念的真实场景。 Method: 提出Seg-ReSearch范式,支持交错推理与外部搜索;设计分层奖励机制以平衡初始引导与渐进激励,缓解稀疏结果信号与刚性步进监督之间的矛盾。 Result: 在新构建的需外部知识的OK-VOS视频分割基准及两个现有推理分割基准上,Seg-ReSearch显著超越当前最优方法。 Conclusion: Seg-ReSearch有效克服了MLLM知识瓶颈,为开放世界、动态场景下的语义分割提供了可扩展、可检索增强的新范式。 Abstract: Segmentation based on language has been a popular topic in computer vision. While recent advances in multimodal large language models (MLLMs) have endowed segmentation systems with reasoning capabilities, these efforts remain confined by the frozen internal knowledge of MLLMs, which limits their potential for real-world scenarios that involve up-to-date information or domain-specific concepts. In this work, we propose \textbf{Seg-ReSearch}, a novel segmentation paradigm that overcomes the knowledge bottleneck of existing approaches. By enabling interleaved reasoning and external search, Seg-ReSearch empowers segmentation systems to handle dynamic, open-world queries that extend beyond the frozen knowledge of MLLMs. To effectively train this capability, we introduce a hierarchical reward design that harmonizes initial guidance with progressive incentives, mitigating the dilemma between sparse outcome signals and rigid step-wise supervision. For evaluation, we construct OK-VOS, a challenging benchmark that explicitly requires outside knowledge for video object segmentation. Experiments on OK-VOS and two existing reasoning segmentation benchmarks demonstrate that our Seg-ReSearch improves state-of-the-art approaches by a substantial margin. Code and data will be released at https://github.com/iSEE-Laboratory/Seg-ReSearch.

[133] Temporal Slowness in Central Vision Drives Semantic Object Learning

Timothy Schaumlöffel,Arthur Aubret,Gemma Roig,Jochen Triesch

Main category: cs.CV

TL;DR: 本研究探讨了中央视觉和时间慢变性在人类从自我中心视觉流中学习语义物体表征中的作用,通过模拟五个月的人类视觉经验并结合注视点预测与自监督学习模型,发现二者结合能提升物体表征的语义编码能力。

Details Motivation: 人类能从低监督的自我中心视觉流中习得语义物体表征,且视觉系统具有高分辨率中央视野和对时间邻近输入产生相似表征的特性,强调注视点附近缓慢变化的信息;本文旨在探究中央视觉与慢变性学习在此过程中的作用。 Method: 基于Ego4D数据集模拟五个月人类样视觉经验,使用先进注视预测模型生成注视坐标,提取中央视野图像块,并在其上训练时间对比式自监督学习模型。 Result: 结合时间慢变性与中央视觉可提升物体表征多个语义维度的编码能力:中央视觉强化前景物体特征提取,而慢变性(尤其在微小眼动期间)有助于编码更广泛的物体语义信息。 Conclusion: 中央视觉与时间慢变性协同作用,可能是人类从自然视觉经验中发展语义物体表征的关键机制。 Abstract: Humans acquire semantic object representations from egocentric visual streams with minimal supervision. Importantly, the visual system processes with high resolution only the center of its field of view and learns similar representations for visual inputs occurring close in time. This emphasizes slowly changing information around gaze locations. This study investigates the role of central vision and slowness learning in the formation of semantic object representations from human-like visual experience. We simulate five months of human-like visual experience using the Ego4D dataset and generate gaze coordinates with a state-of-the-art gaze prediction model. Using these predictions, we extract crops that mimic central vision and train a time-contrastive Self-Supervised Learning model on them. Our results show that combining temporal slowness and central vision improves the encoding of different semantic facets of object representations. Specifically, focusing on central vision strengthens the extraction of foreground object features, while considering temporal slowness, especially during fixational eye movements, allows the model to encode broader semantic information about objects. These findings provide new insights into the mechanisms by which humans may develop semantic object representations from natural visual experience.

[134] SALAD-Pan: Sensor-Agnostic Latent Adaptive Diffusion for Pan-Sharpening

Junjie Li,Congyang Ou,Haokui Zhang,Guoting Wei,Shengqin Jiang,Ying Li,Chunhua Shen

Main category: cs.CV

TL;DR: SALAD-Pan 是一种传感器无关的潜在空间扩散模型,通过带状单通道 VAE 编码、双向物理引导控制结构和轻量级跨光谱注意力模块,实现高效、高精度、跨传感器泛化的遥感图像全色锐化。

Details Motivation: 现有扩散模型多在像素空间操作,且需为不同传感器数据单独训练模型,导致推理延迟高、泛化性差。 Method: 提出 SALAD-Pan:1)带状单通道 VAE 将 HRMS 映射至紧凑潜在空间,支持任意通道数;2)融合光谱物理先验与 PAN/MS 图像,通过单向/双向交互式控制结构注入扩散主干;3)在扩散模型中心层引入轻量级跨光谱注意力模块以增强光谱一致性。 Result: 在 GaoFen-2、QuickBird 和 WorldView-3 数据集上全面超越现有扩散方法,推理速度提升 2–3 倍,并具备强零样本(跨传感器)能力。 Conclusion: 在潜在空间建模与物理引导控制机制下,SALAD-Pan 实现了高效、精准、通用的 pansharpening,为扩散模型在遥感图像融合中的实用化提供了新范式。 Abstract: Recently, diffusion models bring novel insights for Pan-sharpening and notably boost fusion precision. However, most existing models perform diffusion in the pixel space and train distinct models for different multispectral (MS) imagery, suffering from high latency and sensor-specific limitations. In this paper, we present SALAD-Pan, a sensor-agnostic latent space diffusion method for efficient pansharpening. Specifically, SALAD-Pan trains a band-wise single-channel VAE to encode high-resolution multispectral (HRMS) into compact latent representations, supporting MS images with various channel counts and establishing a basis for acceleration. Then spectral physical properties, along with PAN and MS images, are injected into the diffusion backbone through unidirectional and bidirectional interactive control structures respectively, achieving high-precision fusion in the diffusion process. Finally, a lightweight cross-spectral attention module is added to the central layer of diffusion model, reinforcing spectral connections to boost spectral consistency and further elevate fusion precision. Experimental results on GaoFen-2, QuickBird, and WorldView-3 demonstrate that SALAD-Pan outperforms state-of-the-art diffusion-based methods across all three datasets, attains a 2-3x inference speedup, and exhibits robust zero-shot (cross-sensor) capability.

[135] Vision-aligned Latent Reasoning for Multi-modal Large Language Model

Byungwoo Jeon,Yoonwoo Jeong,Hyunseok Lee,Minsu Cho,Jinwoo Shin

Main category: cs.CV

TL;DR: 本文提出Vision-aligned Latent Reasoning (VaLR)框架,通过在每步思维链推理前动态生成与视觉对齐的潜在token,缓解多模态大模型在长上下文推理中视觉信息衰减问题,显著提升长上下文理解与精细视觉感知能力。

Details Motivation: 现有多模态大语言模型(MLLMs)在需多步推理的任务上表现不佳,主因是长上下文生成过程中视觉信息逐步稀释,限制了测试时缩放能力。 Method: VaLR框架在每步Chain of Thought推理前动态生成视觉对齐的潜在token,训练时将MLLM中间嵌入与视觉编码器嵌入对其,以在推理中保持视觉知识。 Result: VaLR在多个需长上下文理解或精确视觉感知的基准上持续超越现有方法,并展现出此前MLLM未见的测试时缩放行为;在VSI-Bench上性能从33.0%提升至52.9%,相对Qwen2.5-VL提升19.9个百分点。 Conclusion: VaLR是一种简单而有效的方法,能显著增强MLLM的多步视觉推理能力,尤其适用于长上下文与高精度视觉感知任务。 Abstract: Despite recent advancements in Multi-modal Large Language Models (MLLMs) on diverse understanding tasks, these models struggle to solve problems which require extensive multi-step reasoning. This is primarily due to the progressive dilution of visual information during long-context generation, which hinders their ability to fully exploit test-time scaling. To address this issue, we introduce Vision-aligned Latent Reasoning (VaLR), a simple, yet effective reasoning framework that dynamically generates vision-aligned latent tokens before each Chain of Thought reasoning step, guiding the model to reason based on perceptual cues in the latent space. Specifically, VaLR is trained to preserve visual knowledge during reasoning by aligning intermediate embeddings of MLLM with those from vision encoders. Empirical results demonstrate that VaLR consistently outperforms existing approaches across a wide range of benchmarks requiring long-context understanding or precise visual perception, while exhibiting test-time scaling behavior not observed in prior MLLMs. In particular, VaLR improves the performance significantly from 33.0% to 52.9% on VSI-Bench, achieving a 19.9%p gain over Qwen2.5-VL.

[136] S-MUSt3R: Sliding Multi-view 3D Reconstruction

Leonid Antsfeld,Boris Chidlovskii,Yohann Cabon,Vincent Leroy,Jerome Revaud

Main category: cs.CV

TL;DR: 本文提出S-MUSt3R,一种无需重训练、基于序列分段与对齐及轻量闭环优化的高效单目3D重建方法,扩展了MUSt3R基础模型在长视频流上的可扩展性,在多个数据集上实现了媲美传统方法的轨迹与重建精度,并直接输出度量空间结果。

Details Motivation: 现有3D基础模型难以处理长时RGB视频流的3D重建,主要受限于内存瓶颈;需在不重训练前提下提升其可扩展性。 Method: 提出S-MUSt3R流水线:将长RGB序列分段处理,对各段分别用MUSt3R重建,再通过段间对齐与轻量级闭环优化实现全局一致性。 Result: 在TUM、7-Scenes及自研机器人导航数据集上验证有效,支持长序列运行,重建精度与轨迹估计媲美复杂传统方法,且输出为度量空间。 Conclusion: S-MUSt3R证明了仅通过工程化策略(分段+对齐+闭环)即可显著提升基础模型的实际可扩展性,为真实场景中轻量、准确、度量一致的单目3D重建提供了新路径。 Abstract: The recent paradigm shift in 3D vision led to the rise of foundation models with remarkable capabilities in 3D perception from uncalibrated images. However, extending these models to large-scale RGB stream 3D reconstruction remains challenging due to memory limitations. This work proposes S-MUSt3R, a simple and efficient pipeline that extends the limits of foundation models for monocular 3D reconstruction. Our approach addresses the scalability bottleneck of foundation models through a simple strategy of sequence segmentation followed by segment alignment and lightweight loop closure optimization. Without model retraining, we benefit from remarkable 3D reconstruction capacities of MUSt3R model and achieve trajectory and reconstruction performance comparable to traditional methods with more complex architecture. We evaluate S-MUSt3R on TUM, 7-Scenes and proprietary robot navigation datasets and show that S-MUSt3R runs successfully on long RGB sequences and produces accurate and consistent 3D reconstruction. Our results highlight the potential of leveraging the MUSt3R model for scalable monocular 3D scene in real-world settings, with an important advantage of making predictions directly in the metric space.

[137] SLUM-i: Semi-supervised Learning for Urban Mapping of Informal Settlements and Data Quality Benchmarking

Muhammad Taha Mukhtar,Syed Musa Ali Kazmi,Khola Naseem,Muhammad Ali Chattha,Andreas Dengel,Sheraz Ahmed,Muhammad Naseer Bajwa,Muhammad Imran Malik

Main category: cs.CV

TL;DR: 本文提出了一种新的半监督分割框架,用于大规模、高质量地绘制低收入和中等收入国家城市中的非正规住区,解决了标注稀缺、光谱混淆和标注噪声等问题,并在跨城市、跨大陆的多个数据集上验证了其鲁棒性和泛化能力。

Details Motivation: 快速城市扩张导致低收入和中等收入国家大城市中非正规住区激增,但大规模制图受限于标注稀缺、正式与非正式建筑间光谱混淆严重以及标注噪声大等数据质量挑战。 Method: 构建了拉合尔基准数据集及卡拉奇、孟买的配套数据集;提出一种新型半监督分割框架,包含类感知自适应阈值机制(动态调整置信度阈值以避免少数类抑制)和原型库系统(通过锚定历史高保真特征表征来保证语义一致性)。 Result: 在涵盖三大洲八座城市的多个基准上,该方法显著优于现有半监督方法;仅用10%源域标签训练的模型在未见地理区域上达到0.461 mIoU,且优于全监督模型的零样本泛化性能。 Conclusion: 所提框架有效缓解了半监督学习中的类别不平衡与特征退化问题,在数据质量差、标注稀缺的遥感场景下展现出强泛化性与跨域迁移能力,为非正规住区大规模制图提供了可靠技术路径。 Abstract: Rapid urban expansion has fueled the growth of informal settlements in major cities of low- and middle-income countries, with Lahore and Karachi in Pakistan and Mumbai in India serving as prominent examples. However, large-scale mapping of these settlements is severely constrained not only by the scarcity of annotations but by inherent data quality challenges, specifically high spectral ambiguity between formal and informal structures and significant annotation noise. We address this by introducing a benchmark dataset for Lahore, constructed from scratch, along with companion datasets for Karachi and Mumbai, which were derived from verified administrative boundaries, totaling 1,869 $\text{km}^2$ of area. To evaluate the global robustness of our framework, we extend our experiments to five additional established benchmarks, encompassing eight cities across three continents, and provide comprehensive data quality assessments of all datasets. We also propose a new semi-supervised segmentation framework designed to mitigate the class imbalance and feature degradation inherent in standard semi-supervised learning pipelines. Our method integrates a Class-Aware Adaptive Thresholding mechanism that dynamically adjusts confidence thresholds to prevent minority class suppression and a Prototype Bank System that enforces semantic consistency by anchoring predictions to historically learned high-fidelity feature representations. Extensive experiments across a total of eight cities spanning three continents demonstrate that our approach outperforms state-of-the-art semi-supervised baselines. Most notably, our method demonstrates superior domain transfer capability whereby a model trained on only 10% of source labels reaches a 0.461 mIoU on unseen geographies and outperforms the zero-shot generalization of fully supervised models.

[138] OmniRad: A Radiological Foundation Model for Multi-Task Medical Image Analysis

Luca Zedda,Andrea Loddo,Cecilia Di Ruberto

Main category: cs.CV

TL;DR: 本文提出了OmniRad,一个基于120万张医学影像自监督预训练的放射学基础模型,强调表征复用和跨任务迁移能力,并在多个公开基准上展现出优于现有模型的分类与分割性能。

Details Motivation: 放射学分析日益依赖预训练视觉表征以支持多模态下游任务,但现有模型在表征复用性与跨任务泛化能力方面仍有不足。 Method: 提出OmniRad模型,采用自监督学习在1.2百万医学图像上预训练;评估策略包括冻结主干+轻量适配器、以及端到端微调;在MedMNISTv2和MedSegBench等多个基准上进行分类与分割任务测试。 Result: 在MedMNISTv2上分类F1提升最高达2.05%;在六个MedSegBench数据集上使用冻结表征实现平均Dice分数提升;潜空间可视化显示特征聚类更优、模态分离更清晰。 Conclusion: OmniRad通过放射学启发的设计原则,在表征质量与下游任务性能之间取得良好平衡,验证了其作为通用放射学基础模型的潜力。 Abstract: Radiological analysis increasingly benefits from pretrained visual representations that can support heterogeneous downstream tasks across imaging modalities. In this work, we introduce OmniRad, a self-supervised radiological foundation model pretrained on 1.2 million medical images, designed with radiology-inspired principles emphasizing representation reuse and cross-task transferability. We evaluate the pretrained encoder under multiple downstream adaptation regimes, including lightweight task-specific adapters with a frozen backbone as well as full end-to-end fine-tuning for classification, allowing us to assess both representation quality and task-specific performance. OmniRad is evaluated on a broad suite of public benchmarks spanning classification and segmentation across multiple modalities. On the MedMNISTv2 collection, OmniRad improves classification F1 by up to 2.05% over competing foundation models. For dense prediction, OmniRad attains mean Dice score improvements across six MedSegBench datasets when using frozen representations. Qualitative analyses and latent-space visualizations suggest improved feature clustering and modality-related separation.

[139] Nix and Fix: Targeting 1000x Compression of 3D Gaussian Splatting with Diffusion Models

Cem Eteke,Enzo Tartaglione

Main category: cs.CV

TL;DR: 本文提出NiFi方法,通过基于扩散模型的一步蒸馏技术,在极低码率(低至0.1 MB)下实现3D高斯泼溅(3DGS)的高效压缩与视觉质量恢复,达到SOTA感知质量并实现约1000倍码率提升。

Details Motivation: 3D高斯泼溅(3DGS)虽实现实时新视角合成,但存储开销大,制约其在沉浸式通信等场景的应用;现有压缩方法在低码率下易引入明显视觉伪影。 Method: 提出NiFi:一种面向伪影感知、基于扩散模型的一步蒸馏方法,用于极端3DGS压缩后的质量恢复。 Result: 在极低码率(如0.1 MB)下取得SOTA感知质量,相比原始3DGS实现约1000倍码率压缩,同时保持可比的视觉保真度。 Conclusion: NiFi有效解决了低码率下3DGS压缩导致的严重视觉退化问题,为资源受限场景下的实时3D渲染提供了实用可行的高压缩比方案。 Abstract: 3D Gaussian Splatting (3DGS) revolutionized novel view rendering. Instead of inferring from dense spatial points, as implicit representations do, 3DGS uses sparse Gaussians. This enables real-time performance but increases space requirements, hindering applications such as immersive communication. 3DGS compression emerged as a field aimed at alleviating this issue. While impressive progress has been made, at low rates, compression introduces artifacts that degrade visual quality significantly. We introduce NiFi, a method for extreme 3DGS compression through restoration via artifact-aware, diffusion-based one-step distillation. We show that our method achieves state-of-the-art perceptual quality at extremely low rates, down to 0.1 MB, and towards 1000x rate improvement over 3DGS at comparable perceptual performance. The code will be open-sourced upon acceptance.

[140] Understanding Degradation with Vision Language Model

Guanzhou Lan,Chenyi Liao,Yuqi Yang,Qianli Ma,Zhigang Wang,Dong Wang,Bin Zhao,Xuelong Li

Main category: cs.CV

TL;DR: 本文提出DU-VLM模型,将图像退化理解重新定义为分层结构化预测任务,统一建模退化类型、参数键及连续物理值,并结合监督微调与强化学习训练;同时构建大规模物理标注数据集DU-110k,验证其在零样本控制扩散模型进行高保真图像恢复方面的有效性与泛化能力。

Details Motivation: 现有视觉语言模型虽能定性描述图像退化,但难以理解其背后的参数化物理机制,亟需一种能联合估计退化类型、参数键及其连续物理值的统一建模方法。 Method: 将退化理解建模为分层结构化预测任务,统一于自回归next-token预测范式;提出DU-VLM多模态链式思维模型,采用监督微调与基于结构化奖励的强化学习联合训练;构建DU-110k物理标注数据集。 Result: DU-VLM在退化理解任务上显著优于通用基线,在准确率和鲁棒性上表现突出,并具备对未见分布的泛化能力;可作为零样本控制器驱动预训练扩散模型实现高质量图像恢复。 Conclusion: 退化理解可通过结构化token预测统一建模,DU-VLM不仅提升了物理层面的理解能力,还拓展了其在可控图像生成与恢复中的实际应用价值。 Abstract: Understanding visual degradations is a critical yet challenging problem in computer vision. While recent Vision-Language Models (VLMs) excel at qualitative description, they often fall short in understanding the parametric physics underlying image degradations. In this work, we redefine degradation understanding as a hierarchical structured prediction task, necessitating the concurrent estimation of degradation types, parameter keys, and their continuous physical values. Although these sub-tasks operate in disparate spaces, we prove that they can be unified under one autoregressive next-token prediction paradigm, whose error is bounded by the value-space quantization grid. Building on this insight, we introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning using structured rewards. Furthermore, we show that DU-VLM can serve as a zero-shot controller for pre-trained diffusion models, enabling high-fidelity image restoration without fine-tuning the generative backbone. We also introduce \textbf{DU-110k}, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations. Extensive experiments demonstrate that our approach significantly outperforms generalist baselines in both accuracy and robustness, exhibiting generalization to unseen distributions.

[141] PEPR: Privileged Event-based Predictive Regularization for Domain Generalization

Gabriele Magrini,Federico Becattini,Niccolò Biondi,Pietro Pala

Main category: cs.CV

TL;DR: 本文提出了一种基于学习使用特权信息(LUPI)范式的跨模态框架,利用事件相机作为训练阶段的特权信息,通过Privileged Event-based Predictive Regularization(PEPR)方法提升单模态RGB模型在域泛化任务中的鲁棒性,避免了直接特征对齐导致的语义损失,在目标检测与语义分割任务中显著优于对齐基线。

Details Motivation: 深度神经网络在视觉感知任务中易受域偏移影响,难以在训练数据分布之外的实际场景中稳健部署;现有基于跨模态对齐的方法会迫使RGB编码器模仿稀疏的事件表征,损失语义细节。 Method: 提出Privileged Event-based Predictive Regularization(PEPR),将LUPI重构为共享隐空间中的预测问题:RGB编码器被训练来预测事件模态的隐特征,而非进行直接特征对齐,从而在不牺牲语义丰富性的前提下蒸馏出域不变鲁棒性。 Result: 所获独立RGB模型在昼夜变化等多种域偏移场景下鲁棒性显著提升,在目标检测和语义分割任务上持续优于基于对齐的基线方法。 Conclusion: PEPR是一种更有效的LUPI实现方式,通过预测式正则化而非强制对齐,成功融合事件相机的域不变先验,提升了单模态RGB模型的域泛化能力。 Abstract: Deep neural networks for visual perception are highly susceptible to domain shift, which poses a critical challenge for real-world deployment under conditions that differ from the training data. To address this domain generalization challenge, we propose a cross-modal framework under the learning using privileged information (LUPI) paradigm for training a robust, single-modality RGB model. We leverage event cameras as a source of privileged information, available only during training. The two modalities exhibit complementary characteristics: the RGB stream is semantically dense but domain-dependent, whereas the event stream is sparse yet more domain-invariant. Direct feature alignment between them is therefore suboptimal, as it forces the RGB encoder to mimic the sparse event representation, thereby losing semantic detail. To overcome this, we introduce Privileged Event-based Predictive Regularization (PEPR), which reframes LUPI as a predictive problem in a shared latent space. Instead of enforcing direct cross-modal alignment, we train the RGB encoder with PEPR to predict event-based latent features, distilling robustness without sacrificing semantic richness. The resulting standalone RGB model consistently improves robustness to day-to-night and other domain shifts, outperforming alignment-based baselines across object detection and semantic segmentation.

[142] SalFormer360: a transformer-based saliency estimation model for 360-degree videos

Mahmoud Z. A. Wahba,Francesco Barbato,Sara Baldoni,Federica Battisti

Main category: cs.CV

TL;DR: 本文提出了SalFormer360,一种基于Transformer架构的360度视频显著性估计模型,结合SegFormer编码器与自定义解码器,并引入注视中心偏差建模,显著提升了在多个基准数据集上的性能。

Details Motivation: 360度视频中显著性估计对视口预测和沉浸式内容优化至关重要,但现有方法性能仍有提升空间。 Method: 提出SalFormer360模型,以SegFormer(经微调适配360度内容)为编码器,搭配自定义解码器,并引入Viewing Center Bias建模用户在360度环境中的注意力偏好。 Result: 在Sport360、PVS-HM和VR-EyeTracking三个主流数据集上,Pearson相关系数分别比先前SOTA提升8.4%、2.5%和18.6%。 Conclusion: SalFormer360有效提升了360度视频显著性预测精度,验证了Transformer架构与中心偏差建模结合的有效性。 Abstract: Saliency estimation has received growing attention in recent years due to its importance in a wide range of applications. In the context of 360-degree video, it has been particularly valuable for tasks such as viewport prediction and immersive content optimization. In this paper, we propose SalFormer360, a novel saliency estimation model for 360-degree videos built on a transformer-based architecture. Our approach is based on the combination of an existing encoder architecture, SegFormer, and a custom decoder. The SegFormer model was originally developed for 2D segmentation tasks, and it has been fine-tuned to adapt it to 360-degree content. To further enhance prediction accuracy in our model, we incorporated Viewing Center Bias to reflect user attention in 360-degree environments. Extensive experiments on the three largest benchmark datasets for saliency estimation demonstrate that SalFormer360 outperforms existing state-of-the-art methods. In terms of Pearson Correlation Coefficient, our model achieves 8.4% higher performance on Sport360, 2.5% on PVS-HM, and 18.6% on VR-EyeTracking compared to previous state-of-the-art.

[143] ImmuVis: Hyperconvolutional Foundation Model for Imaging Mass Cytometry

Marcin Możejko,Dawid Uchal,Krzysztof Gogolewski,Piotr Kupidura,Szymon Łukasik,Jakub Giezgała,Tomasz Nocoń,Kacper Pietrzyk,Robert Pieniuta,Mateusz Sulimowicz,Michal Orzyłowski,Tomasz Siłkowski,Karol Zagródka,Eike Staub,Ewa Szczurek

Main category: cs.CV

TL;DR: ImmuVis is a marker-adaptive convolutional foundation model for imaging mass cytometry (IMC), enabling flexible handling of variable marker sets via hyperconvolutions, pretrained on IMC17M, and achieving superior performance with lower compute cost and calibrated uncertainty.

Details Motivation: Standard vision models assume fixed channel spaces, but IMC has variable marker sets across studies, requiring a model adaptable to arbitrary subsets without retraining. Method: ImmuVis introduces marker-adaptive hyperconvolutions that generate convolutional kernels from learned marker embeddings; it is pretrained self-supervised on the large-scale IMC17M dataset using masked reconstruction. Result: ImmuVis outperforms SOTA baselines in virtual staining and classification tasks at lower compute cost than transformer-based models, and uniquely provides calibrated uncertainty via heteroscedastic likelihood. Conclusion: ImmuVis serves as a practical, efficient, and scalable foundation model for real-world IMC analysis. Abstract: We present ImmuVis, an efficient convolutional foundation model for imaging mass cytometry (IMC), a high-throughput multiplex imaging technology that handles molecular marker measurements as image channels and enables large-scale spatial tissue profiling. Unlike natural images, multiplex imaging lacks a fixed channel space, as real-world marker sets vary across studies, violating a core assumption of standard vision backbones. To address this, ImmuVis introduces marker-adaptive hyperconvolutions that generate convolutional kernels from learned marker embeddings, enabling a single model to operate on arbitrary measured marker subsets without retraining. We pretrain ImmuVis on the largest to-date dataset, IMC17M (28 cohorts, 24,405 images, 265 markers, over 17M patches), using self-supervised masked reconstruction. ImmuVis outperforms SOTA baselines and ablations in virtual staining and downstream classification tasks at substantially lower compute cost than transformer-based alternatives, and is the sole model that provides calibrated uncertainty via a heteroscedastic likelihood objective. These results position ImmuVis as a practical, efficient foundation model for real-world IMC modeling.

[144] A labeled dataset of simulated phlebotomy procedures for medical AI: polygon annotations for object detection and human-object interaction

Raúl Jiménez Cruz,César Torres-Huitzil,Marco Franceschetti,Ronny Seiger,Luciano García-Bañuelos,Barbara Weber

Main category: cs.CV

TL;DR: 本文介绍了一个包含11,884张标注图像的数据集,用于模拟静脉采血(phlebotomy)操作,图像经过SSIM去重和自动面部匿名化处理,并对5类医学相关对象进行了多边形分割标注,适用于YOLOv8等现代目标检测框架。

Details Motivation: 为推进医疗培训自动化和人-物交互研究,提供高质量、标准化、可公开获取的静脉采血过程图像数据集,支持工具检测、步骤识别、流程分析与教育反馈系统开发。 Method: 从高清视频中提取图像,采用结构相似性指数(SSIM)过滤冗余帧,对所有视频进行自动面部匿名化,人工标注五类医学对象的多边形分割标签,并导出为兼容YOLOv8等框架的格式;数据按70%/15%/15%划分为训练/验证/测试集。 Result: 构建并公开发布了一个大规模、高质量、带精细分割标注的静脉采血图像数据集(11,884张图),支持多种下游任务,且已托管于Zenodo平台。 Conclusion: 该数据集填补了医学操作视觉理解领域高质量基准数据的空白,有望促进智能医疗培训系统的发展与评估。 Abstract: This data article presents a dataset of 11,884 labeled images documenting a simulated blood extraction (phlebotomy) procedure performed on a training arm. Images were extracted from high-definition videos recorded under controlled conditions and curated to reduce redundancy using Structural Similarity Index Measure (SSIM) filtering. An automated face-anonymization step was applied to all videos prior to frame selection. Each image contains polygon annotations for five medically relevant classes: syringe, rubber band, disinfectant wipe, gloves, and training arm. The annotations were exported in a segmentation format compatible with modern object detection frameworks (e.g., YOLOv8), ensuring broad usability. This dataset is partitioned into training (70%), validation (15%), and test (15%) subsets and is designed to advance research in medical training automation and human-object interaction. It enables multiple applications, including phlebotomy tool detection, procedural step recognition, workflow analysis, conformance checking, and the development of educational systems that provide structured feedback to medical trainees. The data and accompanying label files are publicly available on Zenodo.

[145] PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective

Haokui Zhang,Congyang Ou,Dawei Yan,Peng Wang,Qingsen Yan,Ying Li,Rong Xiao,Chunhua Shen

Main category: cs.CV

TL;DR: 本文提出PIO-FVLM方法,从推理目标出发进行视觉token压缩,通过梯度显著性重排序和NMS选择关键token,在保持高性能的同时大幅提升推理速度与效率。

Details Motivation: 现有视觉语言模型(VLM)的视觉token压缩方法多依赖启发式相似性度量,存在压缩性能与部署实用性受限的问题。 Method: 提出以输出结果不变性为目标的视觉token压缩框架PIO-FVLM:设计层局部代理损失生成token级梯度显著性,指导视觉token重排序,并基于非极大值抑制(NMS)原则选择最重要token;无需训练,兼容FlashAttention,支持编码器无关或编码器联合压缩两种部署模式。 Result: 在LLaVA-Next-7B上仅保留11.1%视觉token,仍维持97.2%原始性能,实现2.67×预填充加速、2.11×推理加速、6.22×FLOPs降低和6.05×KV Cache开销减少。 Conclusion: PIO-FVLM是一种高效、免训练、易部署的视觉token压缩方法,兼顾压缩率与性能稳定性,显著提升VLM推理效率。 Abstract: Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed based on inter-visual-token similarity or cross-modal visual-text similarity, which gives rise to certain limitations in compression performance and practical deployment. In contrast, we propose PIO-FVLM from the perspective of inference objectives, which transforms visual token compression into preserving output result invariance and selects tokens primarily by their importance to this goal. Specially, vision tokens are reordered with the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint from the current layer to the final result. Then the most valuable vision tokens are selected following the non-maximum suppression (NMS) principle. The proposed PIO-FVLM is training-free and compatible with FlashAttention, friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches like VisionZip for use as an encoder-involved method. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens but maintains 97.2% of the original performance, with a 2.67$\times$ prefill speedup, 2.11$\times$ inference speedup, 6.22$\times$ lower FLOPs, and 6.05$\times$ reduced KV Cache overhead. Our code is available at https://github.com/ocy1/PIO-FVLM.

[146] AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation

Jin-Chuan Shi,Binhong Ye,Tao Liu,Junzhe He,Yangjinhui Xu,Xiaoyang Liu,Zeju Li,Hao Chen,Chunhua Shen

Main category: cs.CV

TL;DR: 本文提出AGILE框架,通过视觉语言模型引导生成完整、水密的物体网格,并结合锚点跟踪与接触感知优化,实现从单目视频中鲁棒重建动态手-物交互,克服了神经渲染碎片化和SfM初始化脆弱性问题。

Details Motivation: 现有方法依赖神经渲染导致几何不完整、依赖SfM初始化在野外视频中易失败,难以生成仿真就绪的手-物交互模型。 Method: 提出AGILE框架:1)VLM引导生成水密物体网格;2)基于基础模型的单帧锚点初始化+跨帧跟踪;3)融合语义、几何与接触稳定性的物理感知优化。 Result: 在HO3D、DexYCB及野外视频上显著优于基线,几何精度更高,对强遮挡等挑战场景鲁棒性强,并验证了其仿真就绪性与机器人真实到仿真重定向能力。 Conclusion: AGILE通过从重建转向智能生成范式,实现了高保真、物理合理、仿真可用的手-物交互建模,为机器人和VR提供可靠数字孪生支持。 Abstract: Reconstructing dynamic hand-object interactions from monocular videos is critical for dexterous manipulation data collection and creating realistic digital twins for robotics and VR. However, current methods face two prohibitive barriers: (1) reliance on neural rendering often yields fragmented, non-simulation-ready geometries under heavy occlusion, and (2) dependence on brittle Structure-from-Motion (SfM) initialization leads to frequent failures on in-the-wild footage. To overcome these limitations, we introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. First, we employ an agentic pipeline where a Vision-Language Model (VLM) guides a generative model to synthesize a complete, watertight object mesh with high-fidelity texture, independent of video occlusions. Second, bypassing fragile SfM entirely, we propose a robust anchor-and-track strategy. We initialize the object pose at a single interaction onset frame using a foundation model and propagate it temporally by leveraging the strong visual similarity between our generated asset and video observations. Finally, a contact-aware optimization integrates semantic, geometric, and interaction stability constraints to enforce physical plausibility. Extensive experiments on HO3D, DexYCB, and in-the-wild videos reveal that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior art frequently collapses. By prioritizing physical validity, our method produces simulation-ready assets validated via real-to-sim retargeting for robotic applications.

[147] DRMOT: A Dataset and Framework for RGBD Referring Multi-Object Tracking

Sijia Chen,Lijuan Ma,Yanqiu Yu,En Yu,Liman Liu,Wenbing Tao

Main category: cs.CV

TL;DR: 本文提出了RGBD指代多目标跟踪(DRMOT)新任务,构建了包含RGB图像、深度图和语言描述的DRSet数据集,并设计了MLLM引导的DRTrack框架,通过融合RGB-D-L模态实现深度感知的目标定位与鲁棒轨迹关联。

Details Motivation: 现有指代多目标跟踪(RMOT)方法仅依赖2D RGB数据,难以处理涉及复杂空间语义(如“离相机最近的人”)的指代,且在严重遮挡下身份保持能力差,缺乏显式3D空间信息是主要瓶颈。 Method: 提出DRMOT任务,构建DRSet数据集(含187个场景的RGB与深度图、240条语言描述,其中56条含深度信息),并设计DRTrack框架:基于多模态大语言模型(MLLM)引导,联合处理RGB-D-L输入以实现深度感知的目标定位,并利用深度线索增强轨迹关联鲁棒性。 Result: 在自建DRSet数据集上的大量实验验证了DRTrack框架在空间语义对齐和跟踪性能上的有效性,显著提升了对深度相关指代的理解与跟踪稳定性。 Conclusion: 引入深度模态并构建RGB-D-L联合建模框架,是提升指代多目标跟踪中空间语义理解与遮挡鲁棒性的有效途径;DRMOT为交互式AI系统提供了更符合真实3D场景需求的跟踪范式。 Abstract: Referring Multi-Object Tracking (RMOT) aims to track specific targets based on language descriptions and is vital for interactive AI systems such as robotics and autonomous driving. However, existing RMOT models rely solely on 2D RGB data, making it challenging to accurately detect and associate targets characterized by complex spatial semantics (e.g., ``the person closest to the camera'') and to maintain reliable identities under severe occlusion, due to the absence of explicit 3D spatial information. In this work, we propose a novel task, RGBD Referring Multi-Object Tracking (DRMOT), which explicitly requires models to fuse RGB, Depth (D), and Language (L) modalities to achieve 3D-aware tracking. To advance research on the DRMOT task, we construct a tailored RGBD referring multi-object tracking dataset, named DRSet, designed to evaluate models' spatial-semantic grounding and tracking capabilities. Specifically, DRSet contains RGB images and depth maps from 187 scenes, along with 240 language descriptions, among which 56 descriptions incorporate depth-related information. Furthermore, we propose DRTrack, a MLLM-guided depth-referring tracking framework. DRTrack performs depth-aware target grounding from joint RGB-D-L inputs and enforces robust trajectory association by incorporating depth cues. Extensive experiments on the DRSet dataset demonstrate the effectiveness of our framework.

[148] Annotation Free Spacecraft Detection and Segmentation using Vision Language Models

Samet Hicsonmez,Jose Sosa,Dan Pineau,Inder Pal Singh,Arunkumar Rathinam,Abd El Rahman Shabayek,Djamila Aouada

Main category: cs.CV

TL;DR: 本文提出了一种无需人工标注的航天目标检测与分割方法,利用预训练视觉语言模型(VLM)生成伪标签,并通过师生蒸馏训练轻量模型,在多个空间数据集上显著提升分割性能。

Details Motivation: 空间领域中人工标注困难(如低可见性、光照变化、目标与背景融合),亟需无需大量标注的检测与分割方法。 Method: 利用预训练VLM为少量真实无标签数据自动生成伪标签,再通过师生标签蒸馏框架训练轻量模型。 Result: 在SPARK-2024、SPEED+和TANGO数据集的分割任务中,平均精度(AP)最高提升10个点,优于直接零样本VLM推理。 Conclusion: 基于VLM的伪标签生成与蒸馏策略可有效缓解空间图像标注稀缺问题,实现高性能、低依赖的航天目标分割。 Abstract: Vision Language Models (VLMs) have demonstrated remarkable performance in open-world zero-shot visual recognition. However, their potential in space-related applications remains largely unexplored. In the space domain, accurate manual annotation is particularly challenging due to factors such as low visibility, illumination variations, and object blending with planetary backgrounds. Developing methods that can detect and segment spacecraft and orbital targets without requiring extensive manual labeling is therefore of critical importance. In this work, we propose an annotation-free detection and segmentation pipeline for space targets using VLMs. Our approach begins by automatically generating pseudo-labels for a small subset of unlabeled real data with a pre-trained VLM. These pseudo-labels are then leveraged in a teacher-student label distillation framework to train lightweight models. Despite the inherent noise in the pseudo-labels, the distillation process leads to substantial performance gains over direct zero-shot VLM inference. Experimental evaluations on the SPARK-2024, SPEED+, and TANGO datasets on segmentation tasks demonstrate consistent improvements in average precision (AP) by up to 10 points. Code and models are available at https://github.com/giddyyupp/annotation-free-spacecraft-segmentation.

[149] SAR-RAG: ATR Visual Question Answering by Semantic Search, Retrieval, and MLLM Generation

David F. Ramirez,Tim Overman,Kristen Jaskie,Joe Marvin,Andreas Spanias

Main category: cs.CV

TL;DR: 本文提出了一种面向合成孔径雷达(SAR)自动目标识别(ATR)的视觉上下文图像检索增强生成(ImageRAG)AI智能体,即SAR-RAG,它将多模态大语言模型(MLLM)与语义嵌入向量数据库结合,通过检索相似已知样本提升车辆类型分类与尺寸回归精度。

Details Motivation: SAR图像中军事车辆外观相似、难以区分,传统ATR方法在细粒度识别和定量测量(如尺寸回归)方面存在挑战;需利用历史标注样本增强模型推理能力。 Method: 提出SAR-RAG框架:以MLLM为基座,接入包含已知目标语义嵌入的向量数据库,实现基于视觉-语义对齐的检索增强生成;在推理时动态检索相似图像 exemplars 并注入上下文,辅助目标分类与尺寸回归。 Result: 在搜索/检索指标、类别分类准确率及车辆尺寸数值回归三项任务上,SAR-RAG均显著优于纯MLLM基线;验证了其作为‘ATR记忆银行’的有效性。 Conclusion: 检索增强生成范式可有效提升SAR ATR系统的可解释性、泛化性与定量精度,为多模态遥感智能识别提供了新思路。 Abstract: We present a visual-context image retrieval-augmented generation (ImageRAG) assisted AI agent for automatic target recognition (ATR) of synthetic aperture radar (SAR). SAR is a remote sensing method used in defense and security applications to detect and monitor the positions of military vehicles, which may appear indistinguishable in images. Researchers have extensively studied SAR ATR to improve the differentiation and identification of vehicle types, characteristics, and measurements. Test examples can be compared with known vehicle target types to improve recognition tasks. New methods enhance the capabilities of neural networks, transformer attention, and multimodal large language models. An agentic AI method may be developed to utilize a defined set of tools, such as searching through a library of similar examples. Our proposed method, SAR Retrieval-Augmented Generation (SAR-RAG), combines a multimodal large language model (MLLM) with a vector database of semantic embeddings to support contextual search for image exemplars with known qualities. By recovering past image examples with known true target types, our SAR-RAG system can compare similar vehicle categories, achieving improved ATR prediction accuracy. We evaluate this through search and retrieval metrics, categorical classification accuracy, and numeric regression of vehicle dimensions. These metrics all show improvements when SAR-RAG is added to an MLLM baseline method as an attached ATR memory bank.

[150] How to rewrite the stars: Mapping your orchard over time through constellations of fruits

Gonçalo P. Matos,Carlos Santiago,João P. Costeira,Ricardo L. Saldanha,Ernesto M. Morgado

Main category: cs.CV

TL;DR: 本文提出了一种基于三维质心星座匹配的新范式,用于跨时间视频中果实的匹配与生长跟踪,解决了传统方法对固定相机位置和显著特征依赖的问题,并支持果园建图与机器人自主导航。

Details Motivation: 传统人工测量果实生长费时费力、不可扩展;现有计算机视觉方法难以在不同时间采集的视频间准确匹配同一果实,尤其在相机位姿变化、遮挡严重、纹理特征少的情况下。 Method: 提出基于稀疏3D点云(果实质心)的星座匹配新范式,设计专用描述子,通过匹配果实群组(而非单个果实)提升鲁棒性,支持跨视频、跨时间的果实对应,并可构建果园三维地图及估计相机6DoF位姿。 Result: 该方法成功实现了跨视频果实匹配与长期生长跟踪,同时可构建果园地图并实现相机6DoF定位,为果园机器人自主导航与选择性采摘提供基础。 Conclusion: 基于星座的3D匹配方法有效克服了非刚性形变、遮挡和低纹理等挑战,是一种可扩展、鲁棒的果实生长跟踪与果园空间感知新范式。 Abstract: Following crop growth through the vegetative cycle allows farmers to predict fruit setting and yield in early stages, but it is a laborious and non-scalable task if performed by a human who has to manually measure fruit sizes with a caliper or dendrometers. In recent years, computer vision has been used to automate several tasks in precision agriculture, such as detecting and counting fruits, and estimating their size. However, the fundamental problem of matching the exact same fruits from one video, collected on a given date, to the fruits visible in another video, collected on a later date, which is needed to track fruits' growth through time, remains to be solved. Few attempts were made, but they either assume that the camera always starts from the same known position and that there are sufficiently distinct features to match, or they used other sources of data like GPS. Here we propose a new paradigm to tackle this problem, based on constellations of 3D centroids, and introduce a descriptor for very sparse 3D point clouds that can be used to match fruits across videos. Matching constellations instead of individual fruits is key to deal with non-rigidity, occlusions and challenging imagery with few distinct visual features to track. The results show that the proposed method can be successfully used to match fruits across videos and through time, and also to build an orchard map and later use it to locate the camera pose in 6DoF, thus providing a method for autonomous navigation of robots in the orchard and for selective fruit picking, for example.

[151] Mitigating Long-Tail Bias via Prompt-Controlled Diffusion Augmentation

Buddhi Wijenayake,Nichula Wasalathilake,Roshan Godaliyadda,Vijitha Herath,Parakrama Ekanayake,Vishal M. Patel

Main category: cs.CV

TL;DR: 本文提出了一种基于提示控制的扩散增强框架,用于缓解遥感图像语义分割中的长尾像素不平衡问题,特别是在Urban/Rural域差异显著的数据集LoveDA上。该方法分两阶段:A阶段用离散扩散模型生成满足指定类别比例和共现结构的语义布局;B阶段用ControlNet引导的Stable Diffusion将布局转为逼真且域一致的图像。合成数据与真实数据混合训练,显著提升了小样本类及跨域泛化性能。

Details Motivation: 高分辨率遥感影像语义分割面临严重的像素级长尾分布问题,尤其在LoveDA数据集中还存在Urban/Rural域间外观差异和类别频率不一致的双重挑战,亟需可控、高质量的数据增强方法。 Method: 提出两阶段提示控制扩散增强框架:Stage A采用域感知、掩码比率条件化的离散扩散模型生成符合目标类别比例并保持语义共现结构的布局;Stage B利用ControlNet引导的Stable Diffusion将布局翻译为光度真实、域一致的图像。 Result: 在多个分割主干网络上,合成数据与真实数据混合训练后,小样本类别性能显著提升,Urban与Rural域泛化能力增强,验证了可控增强对缓解长尾偏差的有效性。 Conclusion: 可控扩散增强是一种切实可行的策略,能有效缓解遥感语义分割中的长尾与域偏移问题,为数据稀缺场景提供新范式。 Abstract: Semantic segmentation of high-resolution remote-sensing imagery is critical for urban mapping and land-cover monitoring, yet training data typically exhibits severe long-tailed pixel imbalance. In the dataset LoveDA, this challenge is compounded by an explicit Urban/Rural split with distinct appearance and inconsistent class-frequency statistics across domains. We present a prompt-controlled diffusion augmentation framework that synthesizes paired label--image samples with explicit control of both domain and semantic composition. Stage~A uses a domain-aware, masked ratio-conditioned discrete diffusion model to generate layouts that satisfy user-specified class-ratio targets while respecting learned co-occurrence structure. Stage~B translates layouts into photorealistic, domain-consistent images using Stable Diffusion with ControlNet guidance. Mixing the resulting ratio and domain-controlled synthetic pairs with real data yields consistent improvements across multiple segmentation backbones, with gains concentrated on minority classes and improved Urban and Rural generalization, demonstrating controllable augmentation as a practical mechanism to mitigate long-tail bias in remote-sensing segmentation. Source codes, pretrained models, and synthetic datasets are available at \href{https://github.com/Buddhi19/SyntheticGen.git}{Github}

[152] Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention

Chengtao Lv,Yumeng Shi,Yushi Huang,Ruihao Gong,Shen Ren,Wenya Wang

Main category: cs.CV

TL;DR: 本文提出Light Forcing,首个面向自回归(AR)视频生成模型的稀疏注意力方法,通过Chunk-Aware Growth和Hierarchical Sparse Attention机制,在保持高质量生成(VBench 84.5)的同时显著提升效率(端到端加速1.2~1.3×,结合FP8与LightVAE达2.3×、19.7 FPS)。

Details Motivation: 现有稀疏注意力方案在双向模型中有效,但直接用于AR视频生成时因孤立处理块生成和未充分利用历史上下文而性能下降。 Method: 提出Light Forcing:1)Chunk-Aware Growth机制,定量评估各视频块贡献并动态分配稀疏度;2)Hierarchical Sparse Attention,在帧级和块级两级进行粗粒度到细粒度的稀疏掩码选择,兼顾历史与局部上下文。 Result: 在VBench上达84.5分,端到端推理速度提升1.2~1.3倍;结合FP8量化与LightVAE后,在RTX 5090上实现2.3倍加速与19.7 FPS。 Conclusion: Light Forcing是首个专为AR视频生成设计的稀疏注意力方案,兼顾生成质量与部署效率,为高效视频生成提供了新范式。 Abstract: Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we identify that applying these solutions to AR models leads to considerable performance degradation for two reasons: isolated consideration of chunk generation and insufficient utilization of past informative context. Motivated by these observations, we propose \textsc{Light Forcing}, the \textit{first} sparse attention solution tailored for AR video generation models. It incorporates a \textit{Chunk-Aware Growth} mechanism to quantitatively estimate the contribution of each chunk, which determines their sparsity allocation. This progressive sparsity increase strategy enables the current chunk to inherit prior knowledge in earlier chunks during generation. Additionally, we introduce a \textit{Hierarchical Sparse Attention} to capture informative historical and local context in a coarse-to-fine manner. Such two-level mask selection strategy (\ie, frame and block level) can adaptively handle diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention in quality (\eg, 84.5 on VBench) and efficiency (\eg, $1.2{\sim}1.3\times$ end-to-end speedup). Combined with FP8 quantization and LightVAE, \textsc{Light Forcing} further achieves a $2.3\times$ speedup and 19.7\,FPS on an RTX~5090 GPU. Code will be released at \href{https://github.com/chengtao-lv/LightForcing}{https://github.com/chengtao-lv/LightForcing}.

[153] VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?

Qing'an Liu,Juntong Feng,Yuhao Wang,Xinzhe Han,Yujie Cheng,Yue Zhu,Haiwen Diao,Yunzhi Zhuge,Huchuan Lu

Main category: cs.CV

TL;DR: 本文提出了VISTA-Bench基准,系统评估视觉语言模型(VLMs)对图像中可视化文本的理解能力,发现当前VLMs在处理可视化文本时存在显著性能下降,即‘模态鸿沟’,并揭示其对渲染变化的敏感性。

Details Motivation: 现有VLMs基准多关注纯文本查询,但现实中语言常以图像内嵌文本形式出现,需检验VLMs是否能同等处理可视化文本。 Method: 构建VISTA-Bench基准,涵盖多模态感知、推理与单模态理解,通过在受控渲染条件下对比纯文本与可视化文本问题来评估模型。 Result: 对20多个主流VLMs的广泛评测显示:模型在语义相同但呈现为可视化文本时性能显著下降;该‘模态鸿沟’随感知难度增加而加剧,且对渲染变化高度敏感。 Conclusion: VISTA-Bench为诊断VLMs在文本token与像素间统一表征能力的不足提供了原则性框架,并推动更鲁棒跨模态语言理解的发展。 Abstract: Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs, yet existing benchmarks predominantly focus on pure-text queries. In real-world scenarios, language also frequently appears as visualized text embedded in images, raising the question of whether current VLMs handle such input requests comparably. We introduce VISTA-Bench, a systematic benchmark from multimodal perception, reasoning, to unimodal understanding domains. It evaluates visualized text understanding by contrasting pure-text and visualized-text questions under controlled rendering conditions. Extensive evaluation of over 20 representative VLMs reveals a pronounced modality gap: models that perform well on pure-text queries often degrade substantially when equivalent semantic content is presented as visualized text. This gap is further amplified by increased perceptual difficulty, highlighting sensitivity to rendering variations despite unchanged semantics. Overall, VISTA-Bench provides a principled evaluation framework to diagnose this limitation and to guide progress toward more unified language representations across tokenized text and pixels. The source dataset is available at https://github.com/QingAnLiu/VISTA-Bench.

[154] X2HDR: HDR Image Generation in a Perceptually Uniform Space

Ronghuan Wu,Wanchao Su,Kede Ma,Jing Liao,Rafał K. Mantiuk

Main category: cs.CV

TL;DR: 本文提出一种无需从头训练即可将现有LDR预训练扩散模型适配到HDR图像生成的方法,核心是将HDR数据转换为感知均匀编码(如PU21或PQ)空间,在该空间中仅微调去噪器,从而实现高质量文本到HDR生成和RAW到HDR重建。

Details Motivation: 现有主流图像生成模型(如Stable Diffusion、FLUX)受限于缺乏大规模HDR训练数据,只能输出LDR图像;而HDR图像原生采用线性RGB表示,其亮度与色彩统计特性与sRGB LDR图像差异显著,导致直接迁移困难。 Method: 发现LDR预训练VAE能高保真重建PU21等感知均匀编码的HDR输入,但对线性RGB HDR输入严重失真;据此提出冻结VAE、仅在感知均匀空间中对去噪器进行低秩适配(LoRA)的高效微调策略。 Result: 所提方法在文本到HDR合成与单图RAW到HDR重建任务上均取得显著提升,包括更好的感知保真度、更强的文本-图像对齐能力以及更有效的动态范围表现。 Conclusion: HDR生成无需从零训练大模型,通过在感知均匀空间中轻量适配已有扩散模型即可实现高性能、统一架构的HDR内容生成。 Abstract: High-dynamic-range (HDR) formats and displays are becoming increasingly prevalent, yet state-of-the-art image generators (e.g., Stable Diffusion and FLUX) typically remain limited to low-dynamic-range (LDR) output due to the lack of large-scale HDR training data. In this work, we show that existing pretrained diffusion models can be easily adapted to HDR generation without retraining from scratch. A key challenge is that HDR images are natively represented in linear RGB, whose intensity and color statistics differ substantially from those of sRGB-encoded LDR images. This gap, however, can be effectively bridged by converting HDR inputs into perceptually uniform encodings (e.g., using PU21 or PQ). Empirically, we find that LDR-pretrained variational autoencoders (VAEs) reconstruct PU21-encoded HDR inputs with fidelity comparable to LDR data, whereas linear RGB inputs cause severe degradations. Motivated by this finding, we describe an efficient adaptation strategy that freezes the VAE and finetunes only the denoiser via low-rank adaptation in a perceptually uniform space. This results in a unified computational method that supports both text-to-HDR synthesis and single-image RAW-to-HDR reconstruction. Experiments demonstrate that our perceptually encoded adaptation consistently improves perceptual fidelity, text-image alignment, and effective dynamic range, relative to previous techniques.

[155] XtraLight-MedMamba for Classification of Neoplastic Tubular Adenomas

Aqsa Sultana,Rayan Afsar,Ahmed Rahu,Surendra P. Singh,Brian Shula,Brandon Combs,Derrick Forchetti,Vijayan K. Asari

Main category: cs.CV

TL;DR: 本文提出了一种超轻量级状态空间模型XtraLight-MedMamba,用于从全切片图像中分类低级别管状腺瘤,准确率达97.18%,仅含约3.2万参数,显著优于复杂度更高的Transformer和传统Mamba模型。

Details Motivation: 低级别异型增生的病理判读主观性强,限制了癌前息肉风险分层的准确性;数字病理与深度学习可挖掘人眼难以察觉的恶性进展细微形态特征。 Method: 提出XtraLight-MedMamba模型:结合ConvNext浅层特征提取器与并行视觉Mamba建模长/短程依赖;引入空间-通道注意力桥(SCAB)模块增强多尺度特征提取;采用固定非负正交分类器(FNOClassifier)大幅降低参数量并提升泛化性。 Result: 在基于后续是否发展为结直肠癌分组的低级别管状腺瘤数据集上,模型达到97.18%准确率和0.9767 F1分数,仅使用约32,000参数,性能优于更复杂的Transformer和传统Mamba架构。 Conclusion: XtraLight-MedMamba证明了轻量化状态空间模型在数字病理风险预测中的高效性与可行性,为临床实时、可部署的AI辅助诊断提供了新范式。 Abstract: Accurate risk stratification of precancerous polyps during routine colonoscopy screenings is essential for lowering the risk of developing colorectal cancer (CRC). However, assessment of low-grade dysplasia remains limited by subjective histopathologic interpretation. Advancements in digital pathology and deep learning provide new opportunities to identify subtle and fine morphologic patterns associated with malignant progression that may be imperceptible to the human eye. In this work, we propose XtraLight-MedMamba, an ultra-lightweight state-space-based deep learning framework for classifying neoplastic tubular adenomas from whole-slide images (WSIs). The architecture is a blend of ConvNext based shallow feature extractor with parallel vision mamba to efficiently model both long- and short-range dependencies and image generalization. An integration of Spatial and Channel Attention Bridge (SCAB) module enhances multiscale feature extraction, while Fixed Non-Negative Orthogonal Classifier (FNOClassifier) enables substantial parameter reduction and improved generalization. The model was evaluated on a curated dataset acquired from patients with low-grade tubular adenomas, stratified into case and control cohorts based on subsequent CRC development. XtraLight-MedMamba achieved an accuracy of 97.18% and an F1-score of 0.9767 using approximately 32,000 parameters, outperforming transformer-based and conventional Mamba architectures with significantly higher model complexity.

[156] Toward Reliable and Explainable Nail Disease Classification: Leveraging Adversarial Training and Grad-CAM Visualization

Farzia Hossain,Samanta Ghosh,Shahida Begum,B. M. Shahria Alam,Mohammad Tahmid Noor,Md Parvez Mia,Nishat Tasnim Niloy

Main category: cs.CV

TL;DR: 本文提出了一种基于机器学习的自动化指甲疾病分类模型,使用公开数据集(3835张六类图像),通过对比InceptionV3、DenseNet201、EfficientNetV2和ResNet50四种CNN模型,发现InceptionV3准确率达95.57%;进一步采用对抗训练提升鲁棒性,并用SHAP进行可解释性分析,辅助医生实现更准确、快速的诊断。

Details Motivation: 指甲疾病常被忽视,但早期精准诊断对揭示全身健康问题至关重要;而疾病间视觉差异细微,人工识别困难,亟需自动化辅助诊断工具。 Method: 基于公开指甲图像数据集(3835张,6类,统一缩放至224×224),训练并比较四种主流CNN模型(InceptionV3、DenseNet201、EfficientNetV2、ResNet50);引入对抗训练增强模型鲁棒性;采用SHAP方法进行预测可解释性分析。 Result: InceptionV3取得最高准确率95.57%,DenseNet201次之(94.79%);对抗训练提升了模型对噪声与难例图像的鲁棒性;SHAP可视化有效标识出决策关键区域。 Conclusion: 该自动化分类系统具备高准确率与可解释性,可作为临床医生的可靠辅助工具,提升指甲疾病诊断的效率与准确性。 Abstract: Human nail diseases are gradually observed over all age groups, especially among older individuals, often going ignored until they become severe. Early detection and accurate diagnosis of such conditions are important because they sometimes reveal our body's health problems. But it is challenging due to the inferred visual differences between disease types. This paper presents a machine learning-based model for automated classification of nail diseases based on a publicly available dataset, which contains 3,835 images scaling six categories. In 224x224 pixels, all images were resized to ensure consistency. To evaluate performance, four well-known CNN models-InceptionV3, DenseNet201, EfficientNetV2, and ResNet50 were trained and analyzed. Among these, InceptionV3 outperformed the others with an accuracy of 95.57%, while DenseNet201 came next with 94.79%. To make the model stronger and less likely to make mistakes on tricky or noisy images, we used adversarial training. To help understand how the model makes decisions, we used SHAP to highlight important features in the predictions. This system could be a helpful support for doctors, making nail disease diagnosis more accurate and faster.

[157] LitS: A novel Neighborhood Descriptor for Point Clouds

Jonatan B. Bastos,Francisco F. Rivera,Oscar G. Lorenzo,David L. Vilariño,José C. Cabaleiro,Alberto M. Esmorís,Tomás F. Pena

Main category: cs.CV

TL;DR: 本文提出了一种名为LitS的新型点云邻域描述子,适用于2D和3D点云,通过在单位圆上定义分段常数函数来表征点的局部几何结构,具有适应性强、抗噪和密度变化鲁棒等优点。

Details Motivation: 实际点云分析严重依赖于能准确刻画局部几何结构的邻域描述子,而现有方法在应对点云密度变化和噪声方面存在不足。 Method: 提出LitS——一种定义在单位圆上的分段常数函数,以局部参考系下的方向为自变量,输出对应锥形邻域内的邻点数量;提供'常规'与'累积'两种版本,并含两个可调参数以适配不同场景。 Result: LitS能有效捕捉点局部排列细节,对点云密度变化和噪声具有强鲁棒性,且可通过邻近点间LitS的变化实现从局部到全局结构的理解。 Conclusion: LitS是一种通用、灵活且鲁棒的邻域描述子,显著提升了点云局部几何表征能力,适用于多种科学与工程应用。 Abstract: With the advancement of 3D scanning technologies, point clouds have become fundamental for representing 3D spatial data, with applications that span across various scientific and technological fields. Practical analysis of this data depends crucially on available neighborhood descriptors to accurately characterize the local geometries of the point cloud. This paper introduces LitS, a novel neighborhood descriptor for 2D and 3D point clouds. LitS are piecewise constant functions on the unit circle that allow points to keep track of their surroundings. Each element in LitS' domain represents a direction with respect to a local reference system. Once constructed, evaluating LitS at any given direction gives us information about the number of neighbors in a cone-like region centered around that same direction. Thus, LitS conveys a lot of information about the local neighborhood of a point, which can be leveraged to gain global structural understanding by analyzing how LitS changes between close points. In addition, LitS comes in two versions ('regular' and 'cumulative') and has two parameters, allowing them to adapt to various contexts and types of point clouds. Overall, they are a versatile neighborhood descriptor, capable of capturing the nuances of local point arrangements and resilient to common point cloud data issues such as variable density and noise.

[158] When LLaVA Meets Objects: Token Composition for Vision-Language-Models

Soumya Jahagirdar,Walid Bousselham,Anna Kukleva,Hilde Kuehne

Main category: cs.CV

TL;DR: Mask-LLaVA 提出一种多层级视觉特征融合方法,通过结合掩码对象表征、全局token和局部patch token,在训练中使用全部token,推理时可动态减少掩码对象token数量,实现高效且灵活的视觉token压缩。

Details Motivation: 现有自回归视觉语言模型依赖大量视觉token,导致推理计算开销大,亟需更高效的视觉表征方法。 Method: 提出Mask-LLaVA框架,融合掩码对象表征、全局token和局部patch token;训练时使用全部token,推理时可选择性丢弃部分掩码对象token以降低token数。 Result: 在多个标准基准上达到与当前token高效方法相当、甚至媲美原始LLaVA基线的性能,仅使用其一小部分视觉token。 Conclusion: 多层级视觉特征融合不仅支持用更少token高效学习,还支持推理时动态调整token数量,在保持性能的同时显著提升效率。 Abstract: Current autoregressive Vision Language Models (VLMs) usually rely on a large number of visual tokens to represent images, resulting in a need for more compute especially at inference time. To address this problem, we propose Mask-LLaVA, a framework that leverages different levels of visual features to create a compact yet information-rich visual representation for autoregressive VLMs. Namely, we combine mask-based object representations together with global tokens and local patch tokens. While all tokens are used during training, it shows that the resulting model can flexibly drop especially the number of mask-based object-tokens at test time, allowing to adapt the number of tokens during inference without the need to retrain the model and without a significant drop in performance. We evaluate the proposed approach on a suite of standard benchmarks showing results competitive to current token efficient methods and comparable to the original LLaVA baseline using only a fraction of visual tokens. Our analysis demonstrates that combining multi-level features enables efficient learning with fewer tokens while allowing dynamic token selection at test time for good performance.

[159] Laminating Representation Autoencoders for Efficient Diffusion

Ramón Calvo-González,François Fleuret

Main category: cs.CV

TL;DR: FlatDINO是一种变分自编码器,将DINOv2等SSL模型提取的冗余密集图像块特征压缩为仅32个连续token的一维序列,在ImageNet上显著降低扩散模型计算开销并保持高质量生成性能。

Details Motivation: DINOv2等SSL模型提取的密集图像块特征存在大量冗余,导致基于其进行扩散建模计算成本过高。 Method: 提出FlatDINO变分自编码器,将SSL patch特征压缩为长度32的一维连续token序列;在ImageNet 256x256上,用DiT-XL架构在FlatDINO压缩后的latent上训练扩散模型。 Result: 相比直接在原始DINOv2特征上训练扩散模型,FlatDINO实现8倍前向FLOPs减少、最多4.5倍训练步FLOPs减少,并取得1.80的gFID(classifier-free guidance)。 Conclusion: 通过高效压缩SSL patch特征,FlatDINO在大幅降低计算开销的同时维持了高质量图像生成能力,为高效视觉扩散建模提供了新路径。 Abstract: Recent work has shown that diffusion models can generate high-quality images by operating directly on SSL patch features rather than pixel-space latents. However, the dense patch grids from encoders like DINOv2 contain significant redundancy, making diffusion needlessly expensive. We introduce FlatDINO, a variational autoencoder that compresses this representation into a one-dimensional sequence of just 32 continuous tokens -an 8x reduction in sequence length and 48x compression in total dimensionality. On ImageNet 256x256, a DiT-XL trained on FlatDINO latents achieves a gFID of 1.80 with classifier-free guidance while requiring 8x fewer FLOPs per forward pass and up to 4.5x fewer FLOPs per training step compared to diffusion on uncompressed DINOv2 features. These are preliminary results and this work is in progress.

[160] PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation

Jiahao Zhan,Zizhang Li,Hong-Xing Yu,Jiajun Wu

Main category: cs.CV

TL;DR: PerpetualWonder 是一种混合生成模拟器,通过统一物理状态与视觉表征的闭环系统,实现单图驱动的长时序、动作条件下的 4D 场景生成。

Details Motivation: 现有方法因物理状态与视觉表征解耦,无法在生成优化中同步更新物理动力学,导致长时序交互模拟失败。 Method: 提出首个真正闭环生成模拟系统:1)设计统一表征,建立物理状态与视觉原语间的双向映射;2)引入多视角监督的鲁棒更新机制以缓解优化歧义。 Result: 实验表明,PerpetualWonder 能从单张图像出发,成功模拟复杂、多步、长时序的动作交互,同时保持物理合理性和视觉一致性。 Conclusion: PerpetualWonder 通过闭环建模弥合了生成式建模与物理仿真之间的鸿沟,为基于单图的 4D 场景生成提供了新范式。 Abstract: We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Current works fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements to update the underlying physics for subsequent interactions. PerpetualWonder solves this by introducing the first true closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that from a single image, PerpetualWonder can successfully simulate complex, multi-step interactions from long-horizon actions, maintaining physical plausibility and visual consistency.

[161] CoWTracker: Tracking by Warping instead of Correlation

Zihang Lai,Eldar Insafutdinov,Edgar Sucar,Andrea Vedaldi

Main category: cs.CV

TL;DR: 本文提出了一种基于特征形变(warping)而非成本体(cost volumes)的新型稠密点跟踪器,结合Transformer实现长程匹配,在多个基准上达到SOTA,并意外在光流任务上也表现优异,表明两类任务可被统一建模。

Details Motivation: 现有稠密点跟踪方法依赖成本体,导致空间分辨率上的二次计算复杂度,限制了可扩展性和效率。 Method: 提出一种不使用成本体、而是通过迭代形变目标帧特征到查询帧来优化轨迹估计的新方法;结合能进行全局时空联合推理的Transformer架构,避免显式计算特征相关性。 Result: 在TAP-Vid-DAVIS、TAP-Vid-Kinetics和Robo-TAP等稠密点跟踪基准上达到SOTA;同时在Sintel、KITTI和Spring光流基准上表现优异,有时超越专用光流方法。 Conclusion: 基于形变的架构不仅能高效解决稠密点跟踪问题,还能自然迁移到光流估计,为二者提供了统一建模范式。 Abstract: Dense point tracking is a fundamental problem in computer vision, with applications ranging from video analysis to robotic manipulation. State-of-the-art trackers typically rely on cost volumes to match features across frames, but this approach incurs quadratic complexity in spatial resolution, limiting scalability and efficiency. In this paper, we propose \method, a novel dense point tracker that eschews cost volumes in favor of warping. Inspired by recent advances in optical flow, our approach iteratively refines track estimates by warping features from the target frame to the query frame based on the current estimate. Combined with a transformer architecture that performs joint spatiotemporal reasoning across all tracks, our design establishes long-range correspondences without computing feature correlations. Our model is simple and achieves state-of-the-art performance on standard dense point tracking benchmarks, including TAP-Vid-DAVIS, TAP-Vid-Kinetics, and Robo-TAP. Remarkably, the model also excels at optical flow, sometimes outperforming specialized methods on the Sintel, KITTI, and Spring benchmarks. These results suggest that warping-based architectures can unify dense point tracking and optical flow estimation.