cs.CL
[1] Crystal-KV: Efficient KV Cache Management for Chain-of-Thought LLMs via Answer-First Principle
Zihan Wang,Cheng Tang,Lei Gong,Cheng Li,Chao Wang,Teng Wang,Wenqi Lou,Xuehai Zhou
Main category: cs.CL
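TL;DR: This paper proposes Crystal-KV, an efficient KV cache management framework for chain-of-thought (CoT) reasoning. Guided by an answer-first principle, it separates critical (CrystalKV) from non-critical (SlipKV) cache entries and combines an attention-based Least Recently Frequently Used eviction policy with adaptive cache budget allocation, compressing the KV cache and improving throughput and response time while maintaining or even improving accuracy on CoT tasks.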
Details
Motivation: Conventional KV cache compression performs poorly on CoT reasoning because it treats all tokens as uniformly important, whereas CoT emphasizes the final answer and some think-stage tokens may even introduce misleading context. The long think-stage sequences also make the KV cache memory overhead excessive.
Method: Crystal-KV (1) maps answer preferences into the think-stage attention map to separate SlipKV (maintains the reasoning flow but may mislead) from CrystalKV (actually contributes to answer correctness); (2) uses an attention-based Least Recently Frequently Used (LRFU) algorithm to identify and evict SlipKV entries whose utility has expired; and (3) adds an adaptive budget allocation algorithm that adjusts the KV cache budget of each layer/head according to its dynamic proportion of CrystalKV.
Result: Crystal-KV achieves state-of-the-art KV cache compression for CoT reasoning and significantly improves throughput and response time while maintaining or improving answer accuracy.
Conclusion: By combining answer-oriented KV separation, attention-aware eviction, and dynamic budget allocation, Crystal-KV addresses the tension between KV cache efficiency and accuracy in CoT reasoning.
Abstract: Chain-of-Thought (CoT) reasoning in large language models (LLMs) significantly improves accuracy on complex tasks, yet incurs excessive memory overhead due to the long think-stage sequences stored in the Key-Value (KV) cache. Unlike traditional generation tasks where all tokens are uniformly important, CoT emphasizes the final answer, rendering conventional KV compression strategies ineffective. In this paper, we present Crystal-KV, an efficient KV cache management framework tailored for CoT reasoning. Our key insight is the answer-first principle. By mapping answer preferences into the think-stage attention map, we distinguish between SlipKV, which mainly maintains the reasoning flow but may occasionally introduce misleading context, and CrystalKV, which truly contributes to the correctness of the final answer. Next, we propose an attention-based Least Recently Frequently Used algorithm. It precisely identifies when a SlipKV entry's utility expires and evicts it, retaining CrystalKV without disrupting reasoning flow. Finally, we introduce an adaptive cache budget allocation algorithm. Based on the dynamic proportion of CrystalKV, it estimates the importance of each layer/head and adjusts the KV cache budget during inference, amplifying critical components to improve budget utilization. Results show that Crystal-KV achieves state-of-the-art KV cache compression, significantly improves throughput, and enables faster response time, while maintaining, or even improving, answer accuracy for CoT reasoning.
[2] Evaluating Reward Model Generalization via Pairwise Maximum Discrepancy Competitions
Shunyang Luo,Peibei Cao,Zhihui Zhu,Kehua Feng,Zhihua Wang,Keyan Ding
Main category: cs.CL
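TL;DR: This paper proposes Pairwise Maximum Discrepancy Competition (PMDC), a dynamic, annotation-efficient framework for evaluating reward model (RM) generalization. It actively selects, from an unlabeled open-domain prompt pool, the samples on which two RMs disagree most, has an oracle adjudicate them, and fits a Bradley-Terry model to obtain a global ranking. The resulting ranking differs substantially from rankings on conventional benchmarks.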
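The Bradley-Terry aggregation step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the win-count matrix is made-up toy data, `fit_bradley_terry` is a hypothetical helper name, and the fitting procedure is the standard minorization-maximization (MM) update for Bradley-Terry strengths, chosen here for brevity.

```python
def fit_bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of comparisons in which model i beat model j.
    Uses the classic MM update; assumes every model has at least one win
    (otherwise its strength collapses to zero).
    """
    n = len(wins)
    strengths = [1.0] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        scale = n / sum(updated)  # keep the strengths on a fixed scale
        strengths = [s * scale for s in updated]
    return strengths


def win_probability(strengths, i, j):
    """Model-implied probability that model i beats model j."""
    return strengths[i] / (strengths[i] + strengths[j])


# Toy data (hypothetical): three reward models judged by an oracle.
wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
strengths = fit_bradley_terry(wins)
ranking = sorted(range(len(wins)), key=lambda i: -strengths[i])
```

Here `ranking` orders the models from strongest to weakest, and `win_probability` gives the pairwise win rates that such a model implies.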
Details
Motivation: Existing RM evaluations rely on static, pre-annotated preference datasets, which offer limited coverage and fail to reflect generalization in real open-world settings.
Method: PMDC draws on a large, unlabeled, open-domain prompt pool and actively selects the prompt-response pairs that maximize disagreement between two RMs. An oracle adjudicates these cases, and a Bradley-Terry model aggregates the outcomes into a global RM ranking and a pairwise win-rate landscape.
Result: Re-evaluating 10 representative RMs yields substantial rank changes relative to conventional benchmarks; qualitative analysis reveals systematic generalization failures.
Conclusion: PMDC evaluates RM generalization in a way that is closer to real deployment, exposes problems that conventional methods hide, and provides empirical evidence for improving reward modeling.
Abstract: Reward models (RMs) are central to aligning large language models, yet their practical effectiveness hinges on generalization to unseen prompts and shifting distributions. Most existing RM evaluations rely on static, pre-annotated preference datasets, which provide limited coverage and often fail to faithfully assess generalization in open-world settings. We introduce Pairwise Maximum Discrepancy Competition (PMDC), a dynamic and annotation-efficient framework for evaluating RM generalization using a large, unlabeled, open-domain prompt pool. PMDC actively selects prompt-response pairs that maximize disagreement between two RMs, yielding a compact set of highly contentious test cases. These cases are adjudicated by an oracle, and the resulting outcomes are aggregated via a Bradley-Terry model to produce a global ranking and pairwise win-rate landscape of RMs. We apply PMDC to re-evaluate 10 representative RMs and observe substantial rank reshuffling compared with conventional benchmarks. Qualitative analyses further uncover systematic generalization failures, providing valuable insights for improving reward modeling.
[3] Uncertainty Quantification for Named Entity Recognition via Full-Sequence and Subsequence Conformal Prediction
Matthew Singer,Srijan Sengupta,Karl Pazdernik
Main category: cs.CL
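TL;DR: This paper proposes a general conformal-prediction framework that lets sequence-labeling NER models output uncertainty-aware prediction sets with statistical guarantees, so that the correct labeling is contained in the set at a user-specified confidence level.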
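As a rough illustration of the coverage guarantee described above, the following sketch shows generic split conformal prediction over candidate full-sentence labelings. It is not the paper's method: the nonconformity score (one minus the model probability of a labeling), the function names, and all numbers are assumptions chosen for the example.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal threshold from calibration nonconformity scores.

    cal_scores holds one score per calibration sentence, computed on the
    true labeling. Returns the ceil((n + 1) * (1 - alpha))-th smallest
    score, or infinity when the calibration set is too small.
    """
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")
    return sorted(cal_scores)[k - 1]


def prediction_set(candidate_probs, qhat):
    """Keep every candidate labeling whose score 1 - p is within the threshold."""
    return {labels for labels, p in candidate_probs.items() if 1.0 - p <= qhat}


# Hypothetical calibration scores: 1 - P(true labeling) on held-out sentences.
cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.15, 0.25, 0.35, 0.02]
qhat = conformal_threshold(cal_scores, alpha=0.2)

# Hypothetical candidate labelings for a two-token test sentence.
candidates = {
    ("B-PER", "O"): 0.7,
    ("O", "O"): 0.2,
    ("B-ORG", "O"): 0.1,
}
result = prediction_set(candidates, qhat)
```

Under exchangeability of calibration and test sentences, a set built this way contains the true labeling with probability at least 1 - alpha; a smaller alpha yields larger sets.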
Details
Motivation: Current NER models output a single label sequence with no measure of uncertainty, which exposes downstream applications to cascading errors; a reliability mechanism with statistical guarantees is needed.
Method: Build uncertainty-aware prediction sets on top of conformal prediction, with efficient nonconformity scoring functions that support both unconditional and class-conditional coverage and account for heterogeneity in sentence length, language, entity type, and number of entities.
Result: Experiments with four NER models on three benchmark datasets demonstrate the broad applicability, validity (the target coverage is met), and efficiency (compact prediction sets) of the method.
Conclusion: The framework gives NER an uncertainty quantification scheme with finite-sample statistical guarantees that can be applied to existing sequence-labeling models.
Abstract: Named Entity Recognition (NER) serves as a foundational component in many natural language processing (NLP) pipelines. However, current NER models typically output a single predicted label sequence without any accompanying measure of uncertainty, leaving downstream applications vulnerable to cascading errors. In this paper, we introduce a general framework for adapting sequence-labeling-based NER models to produce uncertainty-aware prediction sets. These prediction sets are collections of full-sentence labelings that are guaranteed to contain the correct labeling with a user-specified confidence level. This approach serves a role analogous to confidence intervals in classical statistics by providing formal guarantees about the reliability of model predictions. Our method builds on conformal prediction, which offers finite-sample coverage guarantees under minimal assumptions. We design efficient nonconformity scoring functions to construct efficient, well-calibrated prediction sets that support both unconditional and class-conditional coverage. This framework accounts for heterogeneity across sentence length, language, entity type, and number of entities within a sentence. Empirical experiments on four NER models across three benchmark datasets demonstrate the broad applicability, validity, and efficiency of the proposed methods.
[4] RAM-SD: Retrieval-Augmented Multi-agent framework for Sarcasm Detection
Ziyang Zhou,Ziqi Liu,Yan Wang,Yiming Lin,Yangbin Chen
Main category: cs.CL
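TL;DR: This paper proposes RAM-SD, a framework that combines retrieval augmentation with multi-agent collaboration for fine-grained, interpretable sarcasm detection, reaching a state-of-the-art Macro-F1 of 77.74% on four benchmarks.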
Details
Motivation: Existing sarcasm detectors apply one uniform reasoning strategy and struggle with the varied demands of sarcasm, which range from contextual expectation violations to external knowledge grounding and recognition of rhetorical patterns.
Method: RAM-SD is a retrieval-augmented multi-agent framework with four stages: contextual retrieval, a meta-planner that classifies the sarcasm type and selects a reasoning plan, several specialized agents that perform multi-view analysis, and an integrator that produces an interpretable judgment.
Result: On four standard benchmarks RAM-SD reaches a Macro-F1 of 77.74%, 7.01 points above the GPT-4o+CoC baseline, and provides transparent, interpretable reasoning traces.
Conclusion: RAM-SD improves sarcasm detection performance and makes the model's reasoning process more interpretable and traceable.
Abstract: Sarcasm detection remains a significant challenge due to its reliance on nuanced contextual understanding, world knowledge, and multi-faceted linguistic cues that vary substantially across different sarcastic expressions. Existing approaches, from fine-tuned transformers to large language models, apply a uniform reasoning strategy to all inputs, struggling to address the diverse analytical demands of sarcasm. These demands range from modeling contextual expectation violations to requiring external knowledge grounding or recognizing specific rhetorical patterns. To address this limitation, we introduce RAM-SD, a Retrieval-Augmented Multi-Agent framework for Sarcasm Detection. The framework operates through four stages: (1) contextual retrieval grounds the query in both sarcastic and non-sarcastic exemplars; (2) a meta-planner classifies the sarcasm type and selects an optimal reasoning plan from a predefined set; (3) an ensemble of specialized agents performs complementary, multi-view analysis; and (4) an integrator synthesizes these analyses into a final, interpretable judgment with a natural language explanation. Evaluated on four standard benchmarks, RAM-SD achieves a state-of-the-art Macro-F1 of 77.74%, outperforming the strong GPT-4o+CoC baseline by 7.01 points. Our framework not only sets a new performance benchmark but also provides transparent and interpretable reasoning traces, illuminating the cognitive processes behind sarcasm comprehension.
[5] From Emotion to Expression: Theoretical Foundations and Resources for Fear Speech
Vigneshwaran Shankaran,Gabriella Lapesa,Claudia Wagner
Main category: cs.CL
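TL;DR: Taking a cross-disciplinary view (psychology, political science, communication science, linguistics), this paper surveys the theoretical foundations of fear speech, consolidates existing definitions and datasets, and proposes a unified taxonomy of dimensions for analyzing fear speech, offering theoretical and practical guidance for dataset creation and computational research.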
Details
Motivation: Fear speech is widespread and influential, yet it often evades content moderation because it appears civil on the surface. Computational work on it remains fragmented and under-resourced, so cross-disciplinary integration and a systematic framework are needed.
Method: Compare theories of fear from psychology, political science, communication science, and linguistics; review existing definitions; survey datasets from related research areas; and build a multi-dimensional taxonomy of fear speech on that basis.
Result: An integrated fear speech taxonomy that unifies the dimensions used across disciplines, together with a systematic review of existing datasets and core concepts.
Conclusion: Fear speech is a complex phenomenon that calls for multidisciplinary study; this work lays theoretical and methodological groundwork for future dataset construction and fear speech research.
Abstract: Few forces rival fear in their ability to mobilize societies, distort communication, and reshape collective behavior. In computational linguistics, fear is primarily studied as an emotion, but not as a distinct form of speech. Fear speech content is widespread and growing, and often outperforms hate-speech content in reach and engagement because it appears "civiler" and evades moderation. Yet the computational study of fear speech remains fragmented and under-resourced. This can be understood by recognizing that fear speech is a phenomenon shaped by contributions from multiple disciplines. In this paper, we bridge cross-disciplinary perspectives by comparing theories of fear from Psychology, Political science, Communication science, and Linguistics. Building on this, we review existing definitions. We follow up with a survey of datasets from related research areas and propose a taxonomy that consolidates different dimensions of fear for studying fear speech. By reviewing current datasets and defining core concepts, our work offers both theoretical and practical guidance for creating datasets and advancing fear speech research.
[6] Dynamic Role Assignment for Multi-Agent Debate
Miao Zhang,Junsik Kim,Siyuan Xiang,Jian Gao,Cheng Cao
Main category: cs.CL
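TL;DR: This paper proposes a dynamic role assignment framework that runs a Meta-Debate to pick the most suitable LLM/VLM for each role in a multi-agent debate system, substantially improving problem-solving performance.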
Details
Motivation: Existing LLM/VLM multi-agent debate systems define specialized roles but do not use differences in model capability to decide which model fills which role, leaving potential untapped.
Method: A dynamic role assignment framework that first runs a two-stage Meta-Debate: (1) a proposal stage, in which candidate models produce role-tailored arguments, and (2) a peer review stage, in which proposals are scored against data and role-specific criteria to choose the best model for each role.
Result: On LLM problem-solving benchmarks, the approach outperforms uniform assignment (one model for all roles) by up to 74.8% and random assignment by up to 29.7%.
Conclusion: The work shifts multi-agent system design from static agent deployment to dynamic, capability-aware role assignment.
Abstract: Multi-agent large language model (LLM) and vision-language model (VLM) debate systems employ specialized roles for complex problem-solving, yet model specializations are not leveraged to decide which model should fill which role. We propose dynamic role assignment, a framework that runs a Meta-Debate to select suitable agents before the actual debate. The meta-debate has two stages: (1) proposal, where candidates provide role-tailored arguments, and (2) peer review, where proposals are scored with data and role-specific criteria to choose the best agent for each position. We evaluate our method on LLM problem solving benchmarks. Applied on top of existing debate systems, our approach consistently outperforms uniform assignments (filling all roles with the same model) by up to 74.8% and random assignments (assigning models to roles without considering their suitability) by up to 29.7%, depending on the task and the specific assignment. This work establishes a new paradigm for multi-agent system design, shifting from static agent deployment to dynamic and capability-aware selection.
[7] Interpretability of the Intent Detection Problem: A New Approach
Eduardo Sanchez-Karhunen,Jose F. Quesada-Moreno,Miguel A. Gutiérrez-Naranjo
Main category: cs.CL
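TL;DR: This paper uses dynamical systems theory to analyze how RNNs solve intent detection. On the balanced SNIPS dataset the hidden states form clean clusters on a low-dimensional manifold; on the imbalanced ATIS dataset class imbalance degrades the clusters of low-frequency intents. This gives a geometric account of the performance differences.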
Details
Motivation: Deep learning, and RNNs in particular, performs well on intent detection, but the internal mechanism is poorly understood; there is no theoretical account of how RNNs separate intents in hidden state space.
Method: Apply dynamical systems theory, treating sentences as trajectories in hidden state space, and analyze the geometry of RNN hidden states on SNIPS (balanced) and ATIS (imbalanced), decoupling geometric separation from readout alignment.
Result: On SNIPS the state space lies on a low-dimensional manifold partitioned into clusters, one per intent. On ATIS class imbalance distorts this structure and the clusters for low-frequency intents degrade. Geometric separation and readout alignment can be analyzed separately.
Conclusion: The computational solution an RNN learns for intent detection has a clear geometric structure that is directly shaped by dataset properties such as class balance; the work offers a dynamical-systems and geometric framework for understanding RNN internals.
Abstract: Intent detection, a fundamental text classification task, aims to identify and label the semantics of user queries, playing a vital role in numerous business applications. Despite the dominance of deep learning techniques in this field, the internal mechanisms enabling Recurrent Neural Networks (RNNs) to solve intent detection tasks are poorly understood. In this work, we apply dynamical systems theory to analyze how RNN architectures address this problem, using both the balanced SNIPS and the imbalanced ATIS datasets. By interpreting sentences as trajectories in the hidden state space, we first show that on the balanced SNIPS dataset, the network learns an ideal solution: the state space, constrained to a low-dimensional manifold, is partitioned into distinct clusters corresponding to each intent. The application of this framework to the imbalanced ATIS dataset then reveals how this ideal geometric solution is distorted by class imbalance, causing the clusters for low-frequency intents to degrade. Our framework decouples geometric separation from readout alignment, providing a novel, mechanistic explanation for real-world performance disparities. These findings provide new insights into RNN dynamics, offering a geometric interpretation of how dataset properties directly shape a network's computational solution.
[8] Who Gets Which Message? Auditing Demographic Bias in LLM-Generated Targeted Text
Tunazzina Islam
Main category: cs.CL
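TL;DR: This paper presents the first systematic analysis of bias in demographic-conditioned targeted text generation by large language models (LLMs). GPT-4o, Llama-3.3, and Mistral-Large 2.1 all show age- and gender-related stereotyped framing, and added context amplifies it: in climate communication, messages for young or male audiences stress agency and innovation, while messages for female or senior audiences stress warmth and tradition.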
Details
Motivation: LLMs can increasingly generate personalized, persuasive text at scale, but demographic-conditioned targeting may carry and amplify social bias, which calls for systematic evaluation.
Method: A controlled evaluation framework using three leading models (GPT-4o, Llama-3.3, and Mistral-Large 2.1) in two settings: Standalone Generation, which isolates intrinsic demographic effects, and Context-Rich Generation, which adds thematic and regional context. Generated messages are evaluated on lexical content, language style, and persuasive framing, with climate communication as the test domain.
Result: All models show consistent age and gender asymmetries: male- and youth-targeted messages emphasize agency, innovation, and assertiveness, while female- and senior-targeted messages emphasize warmth, care, and tradition. Contextual prompts systematically amplify these disparities, and persuasion scores are significantly higher for messages tailored to younger or male audiences.
Conclusion: Demographic stereotypes can surface in LLM-generated targeted communication and intensify with context, which underscores the need for bias-aware generation pipelines and transparent auditing frameworks in socially sensitive applications.
Abstract: Large language models (LLMs) are increasingly capable of generating personalized, persuasive text at scale, raising new questions about bias and fairness in automated communication. This paper presents the first systematic analysis of how LLMs behave when tasked with demographic-conditioned targeted messaging. We introduce a controlled evaluation framework using three leading models (GPT-4o, Llama-3.3, and Mistral-Large 2.1) across two generation settings: Standalone Generation, which isolates intrinsic demographic effects, and Context-Rich Generation, which incorporates thematic and regional context to emulate realistic targeting. We evaluate generated messages along three dimensions: lexical content, language style, and persuasive framing. We instantiate this framework on climate communication and find consistent age- and gender-based asymmetries across models: male- and youth-targeted messages emphasize agency, innovation, and assertiveness, while female- and senior-targeted messages stress warmth, care, and tradition. Contextual prompts systematically amplify these disparities, with persuasion scores significantly higher for messages tailored to younger or male audiences. Our findings demonstrate how demographic stereotypes can surface and intensify in LLM-generated targeted communication, underscoring the need for bias-aware generation pipelines and transparent auditing frameworks that explicitly account for demographic conditioning in socially sensitive applications.
[9] Beyond Factual QA: Mentorship-Oriented Question Answering over Long-Form Multilingual Content
Parth Bhalerao,Diola Dsouza,Ruiwen Guan,Oana Ignat
Main category: cs.CL
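TL;DR: This paper introduces MentorQA, the first multilingual dataset and evaluation framework for mentorship-focused question answering over long-form videos. It targets reflective, guiding answers and goes beyond factual accuracy with evaluation dimensions for clarity, alignment, and learning value. Multi-agent architectures perform best on complex topics and lower-resource languages, and automated LLM-based evaluation varies substantially in how well it agrees with human judgments.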
Details
Motivation: QA systems are mostly evaluated on factual correctness, but real applications such as education and career guidance need mentorship-style answers that reflect and guide. Current benchmarks rarely capture this, especially in multilingual and long-form settings.
Method: Build MentorQA, with nearly 9,000 QA pairs drawn from 180 hours of long-form video in four languages; define mentorship-focused evaluation dimensions covering clarity, alignment, and learning value; compare Single-Agent, Dual-Agent, RAG, and Multi-Agent QA architectures under controlled conditions; and analyze how well automated LLM-based evaluation agrees with human judgments.
Result: Multi-Agent pipelines consistently produce higher-quality mentorship responses, with especially strong gains on complex topics and lower-resource languages. Agreement between automated LLM-based evaluation and human judgments varies substantially.
Conclusion: The work establishes mentorship-focused QA as a distinct research problem and provides a multilingual benchmark for studying agentic architectures and evaluation design in educational AI.
Abstract: Question answering systems are typically evaluated on factual correctness, yet many real-world applications, such as education and career guidance, require mentorship: responses that provide reflection and guidance. Existing QA benchmarks rarely capture this distinction, particularly in multilingual and long-form settings. We introduce MentorQA, the first multilingual dataset and evaluation framework for mentorship-focused question answering from long-form videos, comprising nearly 9,000 QA pairs from 180 hours of content across four languages. We define mentorship-focused evaluation dimensions that go beyond factual accuracy, capturing clarity, alignment, and learning value. Using MentorQA, we compare Single-Agent, Dual-Agent, RAG, and Multi-Agent QA architectures under controlled conditions. Multi-Agent pipelines consistently produce higher-quality mentorship responses, with especially strong gains for complex topics and lower-resource languages. We further analyze the reliability of automated LLM-based evaluation, observing substantial variation in alignment with human judgments. Overall, this work establishes mentorship-focused QA as a distinct research problem and provides a multilingual benchmark for studying agentic architectures and evaluation design in educational AI. The dataset and evaluation framework are released at https://github.com/AIM-SCU/MentorQA.
[10] Systematicity between Forms and Meanings across Languages Supports Efficient Communication
Doreen Osmelak,Yang Xu,Michael Hahn,Kate McCurdy
Main category: cs.CL
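TL;DR: This paper studies how grammatical meanings such as person and number are expressed on verbs and pronouns across languages. It proposes a new learnability-based complexity measure that accounts for systematic regularities in linguistic form and links efficient communication theory to systematicity in natural language.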
Details
Motivation: Existing efficient communication theory does not account for systematic relations within word forms; this paper aims to fill that gap.
Method: Propose a complexity measure based on the learnability of meaning-to-form mappings, and analyze how grammatical meanings are expressed on verbs and pronouns across typologically diverse languages.
Result: Verb and pronoun forms are shaped by two competing communicative pressures: simplicity (minimizing the inventory of grammatical distinctions) and accuracy (enabling recovery of intended meanings). The new complexity measure discriminates better between attested and unattested systems.
Conclusion: The model improves the account of systematic regularities in linguistic form and establishes a new connection between efficient communication theory and systematicity in natural language.
Abstract: Languages vary widely in how meanings map to word forms. These mappings have been found to support efficient communication; however, this theory does not account for systematic relations within word forms. We examine how a restricted set of grammatical meanings (e.g. person, number) are expressed on verbs and pronouns across typologically diverse languages. Consistent with prior work, we find that verb and pronoun forms are shaped by competing communicative pressures for simplicity (minimizing the inventory of grammatical distinctions) and accuracy (enabling recovery of intended meanings). Crucially, our proposed model uses a novel measure of complexity (inverse of simplicity) based on the learnability of meaning-to-form mappings. This innovation captures fine-grained regularities in linguistic form, allowing better discrimination between attested and unattested systems, and establishes a new connection from efficient communication theory to systematicity in natural language.
[11] Reasoning Beyond Literal: Cross-style Multimodal Reasoning for Figurative Language Understanding
Seyyed Saeid Cheshmi,Hahnemann Ortiz,James Mooney,Dongyeop Kang
Main category: cs.CL
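TL;DR: This paper proposes a three-step framework for building lightweight vision-language models that understand multimodal figurative language (such as sarcasm, humor, and metaphor), provide interpretable reasoning traces, and generalize across styles.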
Details
Motivation: Vision-language models do well on literal multimodal tasks but still struggle with figurative language such as sarcasm, humor, and metaphor, which requires handling incongruity and subjectivity, especially when an image amplifies or inverts the meaning of the text.
Method: A three-step framework for models that (i) interpret multimodal figurative language, (ii) produce transparent reasoning traces, and (iii) generalize across multiple figurative styles, evaluated with experiments on four styles, including joint training across styles.
Result: (1) Incorporating reasoning traces substantially improves understanding; (2) reasoning learned in one style transfers to others, especially between related styles such as sarcasm and humor; (3) a lightweight model trained jointly across styles outperforms much larger open- and closed-source models.
Conclusion: Lightweight vision-language models with verifiable reasoning can achieve robust cross-style generalization on multimodal figurative understanding while providing inspectable reasoning traces.
Abstract: Vision-language models (VLMs) have demonstrated strong reasoning abilities in literal multimodal tasks such as visual mathematics and science question answering. However, figurative language, such as sarcasm, humor, and metaphor, remains a significant challenge, as it conveys intent and emotion through subtle incongruities between expressed and intended meanings. In multimodal settings, accompanying images can amplify or invert textual meaning, demanding models that reason across modalities and account for subjectivity. We propose a three-step framework for developing efficient multimodal reasoning models that can (i) interpret multimodal figurative language, (ii) provide transparent reasoning traces, and (iii) generalize across multiple figurative styles. Experiments across four styles show that (1) incorporating reasoning traces substantially improves multimodal figurative understanding, (2) reasoning learned in one style can transfer to others, especially between related styles like sarcasm and humor, and (3) training jointly across styles yields a generalized reasoning VLM that outperforms much larger open- and closed-source models. Our findings show that lightweight VLMs with verifiable reasoning achieve robust cross-style generalization while providing inspectable reasoning traces for multimodal tasks. The code and implementation are available at https://github.com/scheshmi/CrossStyle-MMR.
[12] Relating Word Embedding Gender Biases to Gender Gaps: A Cross-Cultural Analysis
Scott Friedman,Sonja Schmer-Galunder,Anthony Chen,Jeffrey Rye
Main category: cs.CL
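TL;DR: This paper proposes a way to quantify gender bias in word embeddings and uses it to characterize gender gaps in education, politics, economics, and health, validating the metrics on 2018 Twitter data by correlating them with statistical gender gaps.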
Details
Motivation: NLP models inherit racial and gender biases from their training text. These biases are usually treated as a problem, but they may also reflect real gender or racial gaps in the culture that produced the text, which offers a big-data view of cultural context.
Method: Define metrics that quantify gender bias in word embeddings and correlate them, for countries and U.S. states, with 18 international and 5 U.S.-based statistical gender gaps in education, politics, economics, and health.
Result: The metrics are validated on 2018 Twitter data covering 51 U.S. regions and 99 countries; the analysis characterizes the regularities and predictive strength of the correlations between embedding biases and statistical gender gaps.
Conclusion: Gender bias in word embeddings can serve as a proxy for real-world gender gaps, giving a workable route to sociocultural analysis from large-scale text.
Abstract: Modern models for common NLP tasks often employ machine learning techniques and train on journalistic, social media, or other culturally-derived text. These have recently been scrutinized for racial and gender biases, stemming from inherent bias in their training text. These biases are often sub-optimal and recent work poses methods to rectify them; however, these biases may shed light on actual racial or gender gaps in the culture(s) that produced the training text, thereby helping us understand cultural context through big data. This paper presents an approach for quantifying gender bias in word embeddings, and then using them to characterize statistical gender gaps in education, politics, economics, and health. We validate these metrics on 2018 Twitter data spanning 51 U.S. regions and 99 countries. We correlate state and country word embedding biases with 18 international and 5 U.S.-based statistical gender gaps, characterizing regularities and predictive strength.
[13] DF-RAG: Query-Aware Diversity for Retrieval-Augmented Generation
Saadat Hasan Khan,Spencer Hong,Jingyu Wu,Kevin Lybarger,Youbing Yin,Erin Babinsky,Daben Liu
Main category: cs.CL
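TL;DR: This paper proposes DF-RAG, which adds diversity to the retrieval step to improve reasoning-intensive question answering. It optimizes the level of diversity for each query dynamically at test time, needs no additional fine-tuning, and outperforms vanilla RAG and other baselines.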
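The diversity-aware retrieval idea can be sketched with plain Maximal Marginal Relevance (MMR), the framework the abstract below says DF-RAG builds on. This is a generic greedy MMR selection, not DF-RAG itself: the per-query tuning of the diversity level is not reproduced, and the trade-off parameter `lam`, the function names, and the toy vectors are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_select(query, chunks, k, lam):
    """Greedily pick k chunk indices by Maximal Marginal Relevance.

    lam = 1.0 ranks purely by relevance to the query; smaller values
    penalise similarity to chunks that were already selected.
    """
    selected = []
    remaining = list(range(len(chunks)))

    def score(i):
        relevance = cosine(query, chunks[i])
        redundancy = max(
            (cosine(chunks[i], chunks[j]) for j in selected), default=0.0
        )
        return lam * relevance - (1.0 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings (hypothetical): chunks 0 and 1 are near-duplicates.
query = [1.0, 0.0]
chunks = [[1.0, 0.1], [1.0, 0.12], [0.6, 0.8]]

by_relevance = mmr_select(query, chunks, k=2, lam=1.0)
with_diversity = mmr_select(query, chunks, k=2, lam=0.3)
```

With `lam=1.0` the two near-duplicate chunks are returned; with `lam=0.3` the second pick switches to the less similar chunk, which is the redundancy reduction the entry describes.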
Details
Motivation: Vanilla RAG struggles on reasoning-intensive QA because retrieval by cosine similarity maximizes relevance but introduces redundant content, which can reduce information recall.
Method: DF-RAG builds on the Maximal Marginal Relevance (MMR) framework to retrieve chunks that are both relevant to the query and maximally dissimilar from each other, and it optimizes the level of diversity for each query dynamically at test time without additional fine-tuning or prior information.
Result: On reasoning-intensive QA benchmarks DF-RAG improves F1 by 4 to 10 percent over vanilla RAG and captures up to 91.3 percent of an estimated Oracle ceiling of up to 18 percent absolute F1 gain.
Conclusion: Adding diversity to retrieval mitigates redundancy in RAG for complex reasoning tasks; DF-RAG offers a training-free, per-query adaptive solution.
Abstract: Retrieval-augmented generation (RAG) is a common technique for grounding language model outputs in domain-specific information. However, RAG is often challenged by reasoning-intensive question-answering (QA), since common retrieval methods like cosine similarity maximize relevance at the cost of introducing redundant content, which can reduce information recall. To address this, we introduce Diversity-Focused Retrieval-Augmented Generation (DF-RAG), which systematically incorporates diversity into the retrieval step to improve performance on complex, reasoning-intensive QA benchmarks. DF-RAG builds upon the Maximal Marginal Relevance framework to select information chunks that are both relevant to the query and maximally dissimilar from each other. A key innovation of DF-RAG is its ability to optimize the level of diversity for each query dynamically at test time without requiring any additional fine-tuning or prior information. We show that DF-RAG improves F1 performance on reasoning-intensive QA benchmarks by 4-10 percent over vanilla RAG using cosine similarity and also outperforms other established baselines. Furthermore, we estimate an Oracle ceiling of up to 18 percent absolute F1 gains over vanilla RAG, of which DF-RAG captures up to 91.3 percent.
[14] Beyond Outcome Verification: Verifiable Process Reward Models for Structured Reasoning
Massimiliano Pronesti,Anya Belz,Yufang Hou
Main category: cs.CL
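TL;DR: This paper proposes Verifiable Process Reward Models (VPRMs), a reinforcement learning framework that uses deterministic, rule-based verifiers to check the intermediate steps of LLM reasoning, and shows substantial gains in performance and logical coherence on risk-of-bias assessment for medical evidence.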
Details
Motivation: Existing process supervision relies on neural judges to score reasoning steps, which makes it vulnerable to opacity, bias, and reward hacking; verifiable outcome rewards work well but do not verify the reasoning process itself.
Method: VPRMs check intermediate reasoning steps with deterministic, rule-based verifiers. The framework is applied to risk-of-bias assessment for medical evidence synthesis, a domain whose guideline-defined criteria and rule-based decision paths allow programmatic verification.
Result: Across multiple datasets, VPRMs achieve up to 20% higher F1 than state-of-the-art models and 6.5% higher than verifiable outcome rewards, with substantial gains in evidence grounding and logical coherence.
Conclusion: Rule-based process verification improves how closely LLM reasoning follows domain rules and how consistent step-level decisions are with final labels.
Abstract: Recent work on reinforcement learning with verifiable rewards (RLVR) has shown that large language models (LLMs) can be substantially improved using outcome-level verification signals, such as unit tests for code or exact-match checks for mathematics. In parallel, process supervision has long been explored as a way to shape the intermediate reasoning behaviour of LLMs, but existing approaches rely on neural judges to score chain-of-thought steps, leaving them vulnerable to opacity, bias, and reward hacking. To address this gap, we introduce Verifiable Process Reward Models (VPRMs), a reinforcement-learning framework in which intermediate reasoning steps are checked by deterministic, rule-based verifiers. We apply VPRMs to risk-of-bias assessment for medical evidence synthesis, a domain where guideline-defined criteria and rule-based decision paths enable programmatic verification of reasoning traces. Across multiple datasets, we find that VPRMs generate reasoning that adheres closely to domain rules and achieve substantially higher coherence between step-level decisions and final labels. Results show that VPRMs achieve up to 20% higher F1 than state-of-the-art models and 6.5% higher than verifiable outcome rewards, with substantial gains in evidence grounding and logical coherence.
[15] Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Generation
David Y. Liu,Xanthe Muston,Aditya Joshi,Sebastian Sequoiah-Grayson
Main category: cs.CL
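TL;DR: This paper proposes d-RLAIF, a reinforcement learning post-training method for automatic story generation (ASG). It derives evaluation principles from Todorov's Theory of Narrative Equilibrium and uses LLM-as-judge models to supply reward signals, producing stories that are more diverse and better aligned with human narrative conventions than those from supervised fine-tuning (SFT).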
Details
Motivation: Despite the subjective nature of storytelling, existing ASG methods train and evaluate on limited ground truths, which cannot fully capture human judgments of narrative quality.
Method: d-RLAIF derives narrative quality principles from Todorov's Theory of Narrative Equilibrium, prompts 7B and 14B LLM-as-judge models with these principles to test alignment with human annotators and to provide reward signals, applies reinforcement learning as post-training, and uses Gemini-3-Flash to evaluate the outputs against human-written stories from the TimeTravel dataset.
Result: Stories produced with d-RLAIF are more diverse and better aligned with human narrative conventions than those from the SFT baseline.
Conclusion: Reinforcement learning is a promising route to linguistically grounded post-training for subjective tasks such as ASG.
Abstract: Despite the subjective nature of storytelling, past works on automatic story generation (ASG) have relied on limited ground truths for training and evaluation. In this work, we explore reinforcement learning (d-RLAIF) as a post-training alternative to supervised fine-tuning (SFT). We first apply Todorov's Theory of Narrative Equilibrium to establish principles that define desirable ASG qualities. We prompt 7B and 14B LLM-as-judge models with our principles to test alignment with human annotators and provide reward signals during d-RLAIF. We use Gemini-3-Flash to evaluate the output of our post-trained models and compare them to human-written stories from the TimeTravel dataset. We show that d-RLAIF offers a viable alternative to supervised fine-tuning (SFT), producing stories that are more diverse and aligned with human narrative conventions. Our paper demonstrates the promise of reinforcement learning for linguistically grounded post-training for subjective tasks such as ASG.
[16] CaseFacts: A Benchmark for Legal Fact-Checking and Precedent Retrieval
Akshith Reddy Putta,Jacob Devasier,Chengkai Li
Main category: cs.CL
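TL;DR: This paper introduces CaseFacts, a benchmark for verifying layperson legal claims against U.S. Supreme Court precedents, with an emphasis on the semantic gap and temporal validity. A multi-stage LLM pipeline builds a dataset of 6,294 labeled claims. Current LLMs find the task challenging, and web search actually degrades performance. The dataset is released.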
Details
Motivation: Automated fact-checking has mostly targeted static general knowledge and overlooked high-stakes domains such as law, where truth evolves and is technically complex. A benchmark is needed that handles both the semantic gap between lay phrasing and technical precedent and the temporal validity of rulings.
Method: CaseFacts uses a multi-stage LLM pipeline to synthesize legal claims from expert case summaries and a novel semantic similarity heuristic to identify and verify complex overrulings efficiently. Its 6,294 claims are labeled Supported, Refuted, or Overruled.
Result: State-of-the-art LLMs find the task challenging. Adding unrestricted web search degrades performance relative to closed-book baselines because it retrieves noisy, non-authoritative precedents.
Conclusion: CaseFacts provides a benchmark for legal fact-checking that highlights the difficulty of legal reasoning across the semantic gap and over time; the dataset is released to spur research on legal fact verification systems.
Abstract: Automated Fact-Checking has largely focused on verifying general knowledge against static corpora, overlooking high-stakes domains like law where truth is evolving and technically complex. We introduce CaseFacts, a benchmark for verifying colloquial legal claims against U.S. Supreme Court precedents. Unlike existing resources that map formal texts to formal texts, CaseFacts challenges systems to bridge the semantic gap between layperson assertions and technical jurisprudence while accounting for temporal validity. The dataset consists of 6,294 claims categorized as Supported, Refuted, or Overruled. We construct this benchmark using a multi-stage pipeline that leverages Large Language Models (LLMs) to synthesize claims from expert case summaries, employing a novel semantic similarity heuristic to efficiently identify and verify complex legal overrulings. Experiments with state-of-the-art LLMs reveal that the task remains challenging; notably, augmenting models with unrestricted web search degrades performance compared to closed-book baselines due to the retrieval of noisy, non-authoritative precedents. We release CaseFacts to spur research into legal fact verification systems.
[17] Frame-Guided Synthetic Claim Generation for Automatic Fact-Checking Using High-Volume Tabular Data
Jacob Devasier,Akshith Putta,Qing Wang,Alankrit Moses,Chengkai Li
Main category: cs.CL
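TL;DR: This paper presents a new automated fact-checking benchmark for large-scale structured data (OECD tables), with 78,503 multilingual synthetic claims generated by a method guided by semantic frames. Experiments show that current LLMs have not memorized the underlying facts, so systems must retrieve and reason, and that evidence retrieval is the main bottleneck.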
Details
Motivation: Existing automated fact-checking benchmarks focus on small, curated tables and ignore the challenge of verifying claims against real-world, high-volume structured data, which leaves a critical gap.
Method: Build a large-scale multilingual dataset of 78,503 claims grounded in 434 complex OECD tables that average over 500K rows each; use a frame-guided method based on six semantic frames to generate realistic claims automatically; run knowledge-probing experiments to confirm that LLMs have not memorized the facts; and provide a baseline SQL-generation system.
Result: The benchmark is highly challenging. Evidence retrieval is the primary bottleneck, with models struggling to find the correct data in massive tables. LLMs cannot answer from parametric knowledge and must perform genuine retrieval and reasoning.
Conclusion: The dataset fills a gap in fact-checking research on real, large-scale structured data and provides a new resource for this unsolved, real-world problem.
Abstract: Automated fact-checking benchmarks have largely ignored the challenge of verifying claims against real-world, high-volume structured data, instead focusing on small, curated tables. We introduce a new large-scale, multilingual dataset to address this critical gap. It contains 78,503 synthetic claims grounded in 434 complex OECD tables, which average over 500K rows each. We propose a novel, frame-guided methodology where algorithms programmatically select significant data points based on six semantic frames to generate realistic claims in English, Chinese, Spanish, and Hindi. Crucially, we demonstrate through knowledge-probing experiments that LLMs have not memorized these facts, forcing systems to perform genuine retrieval and reasoning rather than relying on parameterized knowledge. We provide a baseline SQL-generation system and show that our benchmark is highly challenging. Our analysis identifies evidence retrieval as the primary bottleneck, with models struggling to find the correct data in massive tables. This dataset provides a critical new resource for advancing research on this unsolved, real-world problem.
[18] PingPong: A Natural Benchmark for Multi-Turn Code-Switching Dialogues
Mohammad Rifqi Farhansyah,Hanif Muhammad Zhafran,Farid Adilazuarda,Shamsuddeen Hassan Muhammad,Maryam Ibrahim Mukhtar,Nedjma Ousidhoum,Genta Indra Winata,Ayu Purwarianti,Alham Fikri Aji
Main category: cs.CL
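TL;DR: This paper presents PingPong, a benchmark for evaluating natural multi-party code-switching dialogues. It covers five language combinations (some trilingual), emphasizes authentic, multi-threaded conversation structure with long-distance references, and defines three tasks (question answering, summarization, and topic classification), revealing the performance limits of current LLMs on code-switched dialogue.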
Details
Motivation: Existing benchmarks do not accurately reflect the complexity of real-world multilingual code-switching and lack coverage of natural, multi-party, multi-threaded dialogues with long-range context dependencies. Method: Builds the human-authored PingPong dataset of conversations among 2 to 4 participants across several language combinations (some trilingual); defines three downstream tasks (question answering, dialogue summarization, and topic classification); evaluates several state-of-the-art language models. Result: PingPong data is more natural and structurally diverse than machine-generated dialogue (in message length, speaker dominance, and reply distance); current mainstream models show limited performance on all three tasks, especially on code-switched inputs. Conclusion: Real code-switched dialogue is highly challenging, and more robust NLP systems are needed that better model language mixing, long-range dependencies, and multi-speaker interaction. Abstract: Code-switching is a widespread practice among the world's multilingual majority, yet few benchmarks accurately reflect its complexity in everyday communication. We present PingPong, a benchmark for natural multi-party code-switching dialogues covering five language-combination variations, some of which are trilingual. Our dataset consists of human-authored conversations among 2 to 4 participants covering authentic, multi-threaded structures where replies frequently reference much earlier points in the dialogue. We demonstrate that our data is significantly more natural and structurally diverse than machine-generated alternatives, offering greater variation in message length, speaker dominance, and reply distance. Based on these dialogues, we define three downstream tasks: Question Answering, Dialogue Summarization, and Topic Classification. Evaluations of several state-of-the-art language models on PingPong reveal that performance remains limited on code-switched inputs, underscoring the urgent need for more robust NLP systems capable of addressing the intricacies of real-world multilingual discourse.
[19] Mind the Ambiguity: Aleatoric Uncertainty Quantification in LLMs for Safe Medical Question Answering
Yaokun Liu,Yifan Liu,Phoebe Mbuvi,Zelin Li,Ruichen Yao,Gawon Lim,Dong Wang
Main category: cs.CL
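TL;DR: This paper proposes a "clarify-before-answer" framework based on input ambiguity (aleatoric uncertainty) to improve the safety and accuracy of LLMs in medical question answering.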
Details
Motivation: The deployment of LLMs in medical question answering is severely hampered by ambiguous user queries, which pose significant safety risks and reduce answer accuracy in high-stakes healthcare settings. Method: Constructs CV-MedBench, the first benchmark for studying input ambiguity in medical QA; analyzes ambiguity from a representation engineering perspective and finds that it is linearly encoded in the LLM's internal activations; on that basis, designs AU-Probe, a lightweight module that detects ambiguity without LLM fine-tuning or multiple forward passes and triggers a clarification mechanism. Result: Experiments on four open LLMs show an average accuracy improvement of 9.48% over baselines; AU-Probe is efficient, requires no LLM fine-tuning, and works in a single forward pass. Conclusion: The AU-guided "Clarify-Before-Answer" framework offers an efficient and robust solution for safe medical QA and strengthens the reliability of health-related applications. Abstract: The deployment of Large Language Models in Medical Question Answering is severely hampered by ambiguous user queries, a significant safety risk that demonstrably reduces answer accuracy in high-stakes healthcare settings. In this paper, we formalize this challenge by linking input ambiguity to aleatoric uncertainty (AU), which is the irreducible uncertainty arising from underspecified input. To facilitate research in this direction, we construct CV-MedBench, the first benchmark designed for studying input ambiguity in Medical QA. Using this benchmark, we analyze AU from a representation engineering perspective, revealing that AU is linearly encoded in the LLM's internal activation patterns. Leveraging this insight, we introduce a novel AU-guided "Clarify-Before-Answer" framework, which incorporates AU-Probe - a lightweight module that detects input ambiguity directly from hidden states. Unlike existing uncertainty estimation methods, AU-Probe requires neither LLM fine-tuning nor multiple forward passes, enabling an efficient mechanism to proactively request user clarification and significantly enhance safety. Extensive experiments across four open LLMs demonstrate the effectiveness of our QA framework, with an average accuracy improvement of 9.48% over baselines. Our framework provides an efficient and robust solution for safe Medical QA, strengthening the reliability of health-related applications.
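Illustration (not from the paper): the abstract says ambiguity is linearly encoded in hidden states and that AU-Probe reads it from them, but it does not give the probe's form. The sketch below assumes a logistic-regression probe over hidden-state vectors and a fixed threshold for routing; the names `train_probe`, `ambiguity_score`, and `route`, the toy 2-d vectors, and the 0.5 threshold are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(hidden_states, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with plain batch gradient descent.

    hidden_states: list of equal-length float vectors (stand-ins for LLM activations).
    labels: 1 for an ambiguous query, 0 for a clear one.
    """
    dim = len(hidden_states[0])
    n = len(hidden_states)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for h, y in zip(hidden_states, labels):
            err = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) - y
            for i in range(dim):
                grad_w[i] += err * h[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def ambiguity_score(h, w, b):
    """Probe output in (0, 1); higher means 'more likely ambiguous'."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)


def route(h, w, b, threshold=0.5):
    """Assumed gating rule: ask for clarification when the probe fires."""
    return "clarify" if ambiguity_score(h, w, b) >= threshold else "answer"


# Toy data: the first coordinate separates ambiguous (1) from clear (0) queries.
train_x = [[1.0, 0.2], [0.9, -0.1], [1.2, 0.0],
           [-1.0, 0.1], [-0.8, -0.2], [-1.1, 0.3]]
train_y = [1, 1, 1, 0, 0, 0]
w, b = train_probe(train_x, train_y)

print(route([1.0, 0.0], w, b))   # ambiguous-looking state
print(route([-1.0, 0.0], w, b))  # clear-looking state
```

Because the probe is a single dot product on activations that the forward pass already produces, a gate of this kind adds no extra LLM calls, which matches the single-pass property the abstract claims.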
The code is available at https://github.com/yaokunliu/AU-Med.git, and the CV-MedBench dataset is released on Hugging Face at https://huggingface.co/datasets/yaokunl/CV-MedBench.
[20] Meta-Judging with Large Language Models: Concepts, Methods, and Challenges
Hugo Silva,Mateus Mendes,Hugo Gonçalo Oliveira
Main category: cs.CL
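TL;DR: This survey reviews LLM-as-a-Meta-Judge, an emerging evaluation paradigm, systematically covering its conceptual foundations, mechanisms, alignment training methods, evaluation, limitations, and future directions. It argues that the paradigm is more stable and trustworthy than conventional LLM-as-a-Judge but still faces challenges of cost, prompt sensitivity, and shared model biases.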
Details
Motivation: Existing LLM-as-a-Judge evaluation suffers from prompt sensitivity, systematic biases, verbosity effects, and unreliable or hallucinated rationales, which calls for a more robust automated evaluation paradigm. Method: Systematically reviews and organizes research on meta-judging through a framework with six perspectives: conceptual foundations, mechanisms of meta-judging, alignment training methods, evaluation, limitations and failure modes, and future directions. Result: Establishes LLM-as-a-Meta-Judge as a promising direction for more stable and trustworthy evaluation and identifies its main open challenges: high cost, prompt sensitivity, and shared model biases. Conclusion: LLM-as-a-Meta-Judge is a key step toward more reliable automated LLM evaluation, but efficiency and fairness issues in practical deployment must be resolved to advance next-generation evaluation methods. Abstract: Large language models (LLMs) are evolving fast and are now frequently used as evaluators, in a process typically referred to as LLM-as-a-Judge, which provides quality assessments of model outputs. However, recent research points out significant vulnerabilities in such evaluation, including sensitivity to prompts, systematic biases, verbosity effects, and unreliable or hallucinated rationales. These limitations motivated the development of a more robust paradigm, dubbed LLM-as-a-Meta-Judge. This survey reviews recent advances in meta-judging and organizes the literature by introducing a framework along six key perspectives: (i) Conceptual Foundations, (ii) Mechanisms of Meta-Judging, (iii) Alignment Training Methods, (iv) Evaluation, (v) Limitations and Failure Modes, and (vi) Future Directions. By analyzing the limitations of LLM-as-a-Judge and summarizing recent advances in meta-judging by LLMs, we argue that LLM-as-a-Meta-Judge offers a promising direction for more stable and trustworthy automated evaluation, while highlighting remaining challenges related to cost, prompt sensitivity, and shared model biases, which must be addressed to advance the next generation of LLM evaluation methodologies.
[21] The Shadow Self: Intrinsic Value Misalignment in Large Language Model Agents
Chen Chen,Kim Young Il,Yuan Yang,Wenhao Su,Yilin Zhang,Xueluan Gong,Qian Wang,Yongsen Zheng,Ziyao Liu,Kwok-Yan Lam
Main category: cs.CL
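TL;DR: This paper proposes IMPRESS, a framework for evaluating the risk of intrinsic value misalignment (Intrinsic VM) in LLM agents in realistic, benign, contextualized scenarios. It finds that the risk is widespread and varies markedly with motive, risk type, model scale, and architecture, while existing mitigation strategies have limited effect.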
Details
Motivation: Existing LLM safety evaluations focus mainly on responses to explicitly harmful inputs or on system robustness, overlooking potential value misalignment in realistic, fully benign, agentic settings, in particular Intrinsic Value Misalignment (Intrinsic VM). Method: Formalizes the Loss-of-Control risk and defines Intrinsic VM; builds the IMPRESS evaluation framework and corresponding benchmarks using multi-stage LLM generation with rigorous quality control; systematically evaluates 21 state-of-the-art LLM agents; adds human verification and mitigation experiments. Result: Intrinsic VM is a common safety risk; its rate varies clearly with motive, risk type, model scale, and architecture; contextualization and framing mechanisms significantly shape misalignment behavior; decoding strategies and hyperparameters have only marginal influence; existing safety prompting and guardrail strategies are unstable or of limited effectiveness. Conclusion: Intrinsic VM is a critical and underestimated risk dimension in LLM agent safety, and IMPRESS offers a reproducible, scenario-driven approach for systematic evaluation and subsequent governance. Abstract: Large language model (LLM) agents with extended autonomy unlock new capabilities, but also introduce heightened challenges for LLM safety. In particular, an LLM agent may pursue objectives that deviate from human values and ethical norms, a risk known as value misalignment. Existing evaluations primarily focus on responses to explicit harmful input or robustness against system failure, while value misalignment in realistic, fully benign, and agentic settings remains largely underexplored. To fill this gap, we first formalize the Loss-of-Control risk and identify the previously underexamined Intrinsic Value Misalignment (Intrinsic VM). We then introduce IMPRESS (Intrinsic Value Misalignment Probes in REalistic Scenario Set), a scenario-driven framework for systematically assessing this risk. Following our framework, we construct benchmarks composed of realistic, fully benign, and contextualized scenarios, using a multi-stage LLM generation pipeline with rigorous quality control. We evaluate Intrinsic VM on 21 state-of-the-art LLM agents and find that it is a common and broadly observed safety risk across models. Moreover, the misalignment rates vary by motives, risk types, model scales, and architectures. While decoding strategies and hyperparameters exhibit only marginal influence, contextualization and framing mechanisms significantly shape misalignment behaviors. Finally, we conduct human verification to validate our automated judgments and assess existing mitigation strategies, such as safety prompting and guardrails, which show instability or limited effectiveness.
We further demonstrate key use cases of IMPRESS across the AI Ecosystem. Our code and benchmark will be publicly released upon acceptance.
[22] Do readers prefer AI-generated Italian short stories?
Michael Farrell
Main category: cs.CL
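TL;DR: In a blind experiment, this study compares reader preferences for AI-generated Italian short stories and a story by the renowned Italian author Alberto Moravia. The AI texts received slightly higher average ratings and were more often preferred, but the differences were modest, and no demographic or reading-habit factor significantly affected preference.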
Details
Motivation: To examine whether readers prefer AI-generated Italian short stories over work by a renowned human author, challenging the assumption that human authorship is irreplaceable in literary writing. Method: In a blind design, 20 participants read and rated three short stories (two generated with ChatGPT-4o and one by Alberto Moravia) without being told who wrote them; reading habits and demographic data (age, gender, education, and first language) were also collected. Result: The AI-generated texts received slightly higher average ratings and were preferred more often, though the differences were modest; no statistically significant association was found between text preference and any demographic or reading-habit variable. Conclusion: The results question the presumption that readers necessarily prefer human-authored work and suggest that the need to edit synthetic text in literary contexts may be overestimated. Abstract: This study investigates whether readers prefer AI-generated short stories in Italian over one written by a renowned Italian author. In a blind setup, 20 participants read and evaluated three stories, two created with ChatGPT-4o and one by Alberto Moravia, without being informed of their origin. To explore potential influencing factors, reading habits and demographic data, comprising age, gender, education and first language, were also collected. The results showed that the AI-written texts received slightly higher average ratings and were more frequently preferred, although differences were modest. No statistically significant associations were found between text preference and demographic or reading-habit variables. These findings challenge assumptions about reader preference for human-authored fiction and raise questions about the necessity of synthetic-text editing in literary contexts.
[23] Parameter Efficient Fine Tuning Llama 3.1 for Answering Arabic Legal Questions: A Case Study on Jordanian Laws
Mohammed Fasha,Bassam Hammo,Bilal Sowan,Husam Barham,Esam Nsour
Main category: cs.CL
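TL;DR: Using Jordanian law as a case study, this work explores fine-tuning the Llama-3.1 LLM for Arabic question answering. It uses PEFT (LoRA) with 4-bit quantization, trains efficiently with the Unsloth framework, builds a dataset of 6,000 Jordanian legal question-answer pairs, and reports improved performance measured with BLEU and ROUGE.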
Details
Motivation: To improve LLM question answering in the Arabic legal domain and address the lack of fine-tuning research that combines this language with this specialized field. Method: Applies parameter-efficient fine-tuning (PEFT) with LoRA adapters and 4-bit quantization, using the Unsloth framework, to two models: Llama-3.1-8B-bnb-4bit and Llama-3.1-8B-Instruct-bnb-4bit; builds a custom dataset of 6,000 Jordanian legal question-answer pairs formatted as structured prompts. Result: The fine-tuned models outperform their base versions in legal reasoning and answer accuracy while substantially reducing compute requirements; BLEU and ROUGE scores confirm the improvement. Conclusion: The work shows that lightweight, efficient fine-tuning is effective for the Arabic legal domain and offers a reusable technical path for adapting LLMs to similar low-resource language and specialized-domain combinations. Abstract: This study uses Jordanian law as a case study to explore the fine-tuning of the Llama-3.1 large language model for Arabic question-answering. Two versions of the model - Llama-3.1-8B-bnb-4bit and Llama-3.1-8B-Instruct-bnb-4bit - were fine-tuned using parameter-efficient fine-tuning (PEFT) with LoRA adapters and 4-bit quantized models, leveraging the Unsloth framework for accelerated and resource-efficient training. A custom dataset of 6000 legal question-answer pairs was curated from Jordanian laws and formatted into structured prompts. Performance was evaluated using the BLEU and ROUGE metrics to compare the fine-tuned models to their respective base versions. Results demonstrated improved legal reasoning and accuracy while achieving resource efficiency through quantization and optimized fine-tuning strategies. This work underscores the potential of adapting large language models for Arabic legal domains and highlights effective techniques for fine-tuning on domain-specific tasks.
[24] Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers
Zecheng Tang,Quantong Qiu,Yi Yang,Zhiyi Hong,Haiya Xiang,Kebin Liu,Qingqing Dang,Juntao Li,Min Zhang
Main category: cs.CL
TL;DR: Elastic Attention is a dynamic attention mechanism that adapts sparsity per input via a lightweight Attention Router, improving efficiency and performance in long-context LLMs without extensive retraining.
Details
Motivation: Standard attention has quadratic complexity, hindering scalability in long-context scenarios; existing hybrid attention methods use static sparsity ratios and lack task- or input-adaptive inference. Method: Introduce Elastic Attention with a lightweight Attention Router inserted into pretrained LLMs to dynamically assign each attention head to sparse or full attention modes based on input. Result: Achieves strong performance and efficient inference; validated on three long-context benchmarks across popular LLMs, trained in just 12 hours on 8xA800 GPUs. Conclusion: Dynamic, input-adaptive sparsity control via Elastic Attention effectively balances efficiency and accuracy for long-context LLMs, outperforming static hybrid attention approaches. Abstract: The quadratic complexity of standard attention mechanisms poses a significant scalability bottleneck for large language models (LLMs) in long-context scenarios. While hybrid attention strategies that combine sparse and full attention within a single model offer a viable solution, they typically employ static computation ratios (i.e., fixed proportions of sparse versus full attention) and fail to adapt to the varying sparsity sensitivities of downstream tasks during inference. To address this issue, we propose Elastic Attention, which allows the model to dynamically adjust its overall sparsity based on the input. This is achieved by integrating a lightweight Attention Router into the existing pretrained model, which dynamically assigns each attention head to different computation modes. Within only 12 hours of training on 8xA800 GPUs, our method enables models to achieve both strong performance and efficient inference. Experiments across three long-context benchmarks on widely-used LLMs demonstrate the superiority of our method.
[25] WarrantScore: Modeling Warrants between Claims and Evidence for Substantiation Evaluation in Peer Reviews
Kiyotada Mori,Shohei Tanaka,Tosho Hirasawa,Tadashi Kozuno,Koichiro Yoshino,Yoshitaka Ushiku
Main category: cs.CL
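TL;DR: This paper proposes a new metric that assesses the quality of the logical inference between claims and evidence in scientific review comments, enabling more interpretable evaluation of a review's level of substantiation; experiments show a higher correlation with human scores than conventional methods.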
Details
Motivation: Scientific peer review faces a shortage of reviewers, and language models are being explored to reduce the human cost; existing methods for evaluating the level of substantiation in reviews only detect whether evidence is present and do not accurately assess the quality of the logical inference between a claim and its evidence. Method: Proposes a new evaluation metric that not only extracts the claims and evidence in review comments but also explicitly models and assesses the logical inference between them, giving a more accurate measure of substantiation. Result: The proposed method correlates more strongly with human scores than conventional methods, indicating its potential to support a more efficient peer-review process. Conclusion: Fine-grained modeling of the logical inference between claims and evidence can markedly improve the reliability and interpretability of automated review-quality assessment and provides a firmer basis for AI-assisted peer review. Abstract: The scientific peer-review process is facing a shortage of human resources due to the rapid growth in the number of submitted papers. The use of language models to reduce the human cost of peer review has been actively explored as a potential solution to this challenge. A method has been proposed to evaluate the level of substantiation in scientific reviews in a manner that is interpretable by humans. This method extracts the core components of an argument, claims and evidence, and assesses the level of substantiation based on the proportion of claims supported by evidence. The level of substantiation refers to the extent to which claims are based on objective facts. However, when assessing the level of substantiation, simply detecting the presence or absence of supporting evidence for a claim is insufficient; it is also necessary to accurately assess the logical inference between a claim and its evidence. We propose a new evaluation metric for scientific review comments that assesses the logical inference between claims and evidence. Experimental results show that the proposed method achieves a higher correlation with human scores than conventional methods, indicating its potential to better support the efficiency of the peer-review process.
[26] Revisiting Modality Invariance in a Multilingual Speech-Text Model via Neuron-Level Analysis
Toshiki Nakai,Varsha Suresh,Vera Demberg
Main category: cs.CL
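TL;DR: This paper studies whether a multilingual speech-text foundation model (SeamlessM4T v2) represents language consistently across modalities (speech and text). The encoder becomes increasingly language-agnostic, but the decoder then struggles to recover the source language across modalities (especially speech to text); the model also shows localized modality-selective neuron structure and concentrated activations, which may reduce robustness for non-dominant scripts and speech inputs.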
Details
Motivation: To investigate whether multilingual speech-text foundation models internally represent the same language consistently in the speech and text modalities, that is, whether modality invariance actually holds. Method: Runs three complementary analyses on SeamlessM4T v2: (1) identifying language- and modality-selective neurons via average-precision ranking; (2) testing their causal functional role with median-replacement interventions at inference time; (3) analyzing activation-magnitude inequality across languages and modalities. Result: Encoder representations become more language-agnostic with depth, but this impairs the decoder's ability to recover the source language (especially from speech to text); sharply localized modality-selective structure appears in the cross-attention key and value projections; speech-conditioned decoding and non-dominant scripts show higher activation concentration. Conclusion: The model does not achieve full modality invariance; language compression coexists with modality-specific structure, limiting cross-modal and cross-lingual generalization and making the model more brittle for speech input and non-dominant scripts. Abstract: Multilingual speech-text foundation models aim to process language uniformly across both modality and language, yet it remains unclear whether they internally represent the same language consistently when it is spoken versus written. We investigate this question in SeamlessM4T v2 through three complementary analyses that probe where language and modality information is encoded, how selective neurons causally influence decoding, and how concentrated this influence is across the network. We identify language- and modality-selective neurons using average-precision ranking, investigate their functional role via median-replacement interventions at inference time, and analyze activation-magnitude inequality across languages and modalities. Across experiments, we find evidence of incomplete modality invariance. Although encoder representations become increasingly language-agnostic, this compression makes it more difficult for the shared decoder to recover the language of origin when constructing modality-agnostic representations, particularly when adapting from speech to text. We further observe sharply localized modality-selective structure in cross-attention key and value projections. Finally, speech-conditioned decoding and non-dominant scripts exhibit higher activation concentration, indicating heavier reliance on a small subset of neurons, which may underlie increased brittleness across modalities and languages.
[27] CLM-Bench: Benchmarking and Analyzing Cross-lingual Misalignment of LLMs in Knowledge Editing
Yucheng Hu,Wei Zhou,Juesi Xiao
Main category: cs.CL
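TL;DR: This paper proposes CLM-Bench, a culture-aware, Chinese-first benchmark for multilingual knowledge editing. It reveals cross-lingual misalignment in existing methods: edit vectors for Chinese and English are nearly orthogonal and transfer poorly, which the authors explain through a geometric analysis of the representation space.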
Details
Motivation: Existing Multilingual Knowledge Editing (MKE) benchmarks are mostly built by mechanical translation (e.g., English to Chinese), which introduces translation artifacts, ignores culturally specific entities native to the target language, and fails to reflect the true multilingual knowledge distribution of LLMs. Method: Proposes a native Chinese-first construction approach and builds CLM-Bench with 1,010 CounterFact pairs rooted in Chinese cultural contexts; runs experiments on mainstream LLMs (e.g., Llama-3, Qwen2) and examines the geometric relationship between Chinese and English edit vectors via layer-wise representation analysis. Result: Finds significant cross-lingual misalignment: edits in one language do not generalize to the other; Chinese and English edit vectors are nearly orthogonal in the latent space, residing in disjoint subspaces; mixed-lingual editing exhibits linear additivity of these vectors. Conclusion: Current multilingual knowledge editing methods have limited cross-lingual transfer; culturally native benchmarks such as CLM-Bench are important for advancing robust multilingual knowledge editing. Abstract: Knowledge Editing (KE) has emerged as a promising paradigm for updating facts in Large Language Models (LLMs) without retraining. However, progress in Multilingual Knowledge Editing (MKE) is currently hindered by biased evaluation frameworks. We observe that existing MKE benchmarks are typically constructed by mechanically translating English-centric datasets into target languages (e.g., English-to-Chinese). This approach introduces translation artifacts and neglects culturally specific entities native to the target language, failing to reflect the true knowledge distribution of LLMs. To address this, we propose CLM-Bench, a culture-aware benchmark constructed using a native Chinese-first methodology. We curate 1,010 high-quality CounterFact pairs rooted in Chinese cultural contexts and align them with English counterparts. Using CLM-Bench, we conduct extensive experiments on representative LLMs (e.g., Llama-3, Qwen2) and reveal a significant Cross-lingual Misalignment: edits in one language function independently and fail to propagate to the other. We further provide a geometric explanation via layer-wise representation analysis, demonstrating that edit vectors for Chinese and English are nearly orthogonal, residing in disjoint subspaces, while mixed-lingual editing exhibits linear additivity of these vectors. Our findings challenge the effectiveness of current methods in cross-lingual transfer and underscore the importance of culturally native benchmarks.
[28] Oops, Wait: Token-Level Signals as a Lens into LLM Reasoning
Jaehui Hwang,Dongyoon Han,Sangdoo Yun,Byeongho Heo
Main category: cs.CL
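TL;DR: This paper systematically analyzes how discourse-like tokens (such as "wait" and "therefore") in LLMs correlate with reasoning correctness. The correlation varies markedly with training strategy but stays stable across model scales; in particular, models fine-tuned on small datasets acquire these signals but exploit them only partially.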
Details
Motivation: Systematic analyses of how discourse-like tokens vary across training strategies and model scales are lacking. Method: Analyzes token-level probability distributions across multiple models, examines the correlation between specific discourse-like tokens and reasoning correctness, and compares behavior across training strategies and model scales. Result: Specific discourse-like tokens (such as "wait") correlate strongly with reasoning correctness; the correlation varies with training strategy but remains stable across model scales; models fine-tuned on small-scale data acquire these signals but exploit them only partially. Conclusion: Discourse-like tokens are an effective lens for observing and understanding LLM reasoning dynamics, and their behavior reveals how strongly training strategy shapes reasoning ability. Abstract: The emergence of discourse-like tokens such as "wait" and "therefore" in large language models (LLMs) has offered a unique window into their reasoning processes. However, systematic analyses of how such signals vary across training strategies and model scales remain lacking. In this paper, we analyze token-level signals through token probabilities across various models. We find that specific tokens strongly correlate with reasoning correctness, varying with training strategies while remaining stable across model scales. A closer look at the "wait" token in relation to answer probability demonstrates that models fine-tuned on small-scale datasets acquire reasoning ability through such signals but exploit them only partially. This work provides a systematic lens to observe and understand the dynamics of LLM reasoning.
[29] Clustering-driven Memory Compression for On-device Large Language Models
Ondrej Bohdal,Pramit Saha,Umberto Michieli,Mete Ozay,Taha Ceritli
Main category: cs.CL
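TL;DR: This paper proposes a clustering-based memory compression strategy for making efficient use of user memories in on-device LLMs, balancing context-length limits against personalized generation quality.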
Details
Motivation: On-device LLMs have limited context length; concatenating user memories directly exhausts the context, while compressing them by simple averaging causes semantic conflicts. Method: Clusters user memories by similarity, merges the memories within each cluster, and then concatenates the merged memories to the prompt, preserving semantic coherence and reducing redundancy. Result: The method substantially reduces the number of memory tokens, outperforms baselines such as naive averaging and direct concatenation, and improves generation quality under a fixed context budget. Conclusion: Clustering-based memory compression effectively balances context efficiency and personalization performance. Abstract: Large language models (LLMs) often rely on user-specific memories distilled from past interactions to enable personalized generation. A common practice is to concatenate these memories with the input prompt, but this approach quickly exhausts the limited context available in on-device LLMs. Compressing memories by averaging can mitigate context growth, yet it frequently harms performance due to semantic conflicts across heterogeneous memories. In this work, we introduce a clustering-based memory compression strategy that balances context efficiency and personalization quality. Our method groups memories by similarity and merges them within clusters prior to concatenation, thereby preserving coherence while reducing redundancy. Experiments demonstrate that our approach substantially lowers the number of memory tokens while outperforming baseline strategies such as naive averaging or direct concatenation. Furthermore, for a fixed context budget, clustering-driven merging yields more compact memory representations and consistently enhances generation quality.
[30] Revealing the Truth with ConLLM for Detecting Multi-Modal Deepfakes
Gautam Siddharth Kashyap,Harsh Joshi,Niharika Jain,Ebad Shabbir,Jiechao Gao,Nipun Joshi,Usman Naseem
Main category: cs.CL
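TL;DR: This paper proposes ConLLM, a framework that combines contrastive learning with large language models to address modality fragmentation and shallow cross-modal reasoning in existing deepfake detection, improving detection performance on audio, video, and audio-visual data.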
Details
Motivation: Existing deepfake detection methods have two core problems: modality fragmentation, which leads to poor generalization, and shallow inter-modal reasoning, which makes fine-grained semantic inconsistencies hard to capture. Method: Proposes the hybrid ConLLM framework: stage 1 extracts modality-specific features with pre-trained models; stage 2 aligns the multimodal embeddings via contrastive learning to mitigate modality fragmentation and applies LLM-based reasoning at the semantic level to identify inconsistencies across modalities. Result: Gains on audio, video, and audio-visual tasks: audio EER is reduced by up to 50%, video accuracy improves by up to 8%, and audio-visual accuracy improves by about 9%; ablations show that PTM-based embeddings contribute consistent gains of 9%-10%. Conclusion: By combining contrastive learning with LLM reasoning, ConLLM addresses key bottlenecks in multimodal deepfake detection and offers a new approach to robust, interpretable deepfake identification. Abstract: The rapid rise of deepfake technology poses a severe threat to social and political stability by enabling hyper-realistic synthetic media capable of manipulating public perception. However, existing detection methods struggle with two core limitations: (1) modality fragmentation, which leads to poor generalization across diverse and adversarial deepfake modalities; and (2) shallow inter-modal reasoning, resulting in limited detection of fine-grained semantic inconsistencies. To address these, we propose ConLLM (Contrastive Learning with Large Language Models), a hybrid framework for robust multimodal deepfake detection. ConLLM employs a two-stage architecture: stage 1 uses Pre-Trained Models (PTMs) to extract modality-specific embeddings; stage 2 aligns these embeddings via contrastive learning to mitigate modality fragmentation, and refines them using LLM-based reasoning to address shallow inter-modal reasoning by capturing semantic inconsistencies. ConLLM demonstrates strong performance across audio, video, and audio-visual modalities. It reduces audio deepfake EER by up to 50%, improves video accuracy by up to 8%, and achieves approximately 9% accuracy gains in audio-visual tasks. Ablation studies confirm that PTM-based embeddings contribute 9%-10% consistent improvements across modalities.
[31] Less is More for RAG: Information Gain Pruning for Generator-Aligned Reranking and Evidence Selection
Zhipeng Song,Yizhi Zhou,Xiangyu Kong,Jiulong Jiao,Xinrui Bao,Xu You,Xueqing Shi,Yuhang Zhou,Heng Qi
Main category: cs.CL
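TL;DR: This paper proposes Information Gain Pruning (IGP), a method for selecting external evidence more effectively in retrieval-augmented generation (RAG), improving QA quality and substantially reducing input tokens without increasing the context budget.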
Details
Motivation: In existing RAG pipelines, retrieval relevance metrics (e.g., NDCG) correlate weakly, or even negatively, with end-to-end QA quality, especially under multi-passage injection, where redundancy and mild conflicts destabilize generation. Method: Proposes Information Gain Pruning (IGP), a deployment-friendly reranking-and-pruning module that selects evidence using a generator-aligned utility signal and filters weak or harmful passages before truncation, without changing existing budget interfaces. Result: Across five open-domain QA benchmarks, IGP consistently improves the quality-cost trade-off; in a representative multi-evidence setting, average F1 improves by about 12-20% in relative terms while final-stage input tokens drop by roughly 76-79%. Conclusion: IGP is a lightweight, general, and efficient method that improves the performance and efficiency of RAG systems under a limited context budget. Abstract: Retrieval-augmented generation (RAG) grounds large language models with external evidence, but under a limited context budget, the key challenge is deciding which retrieved passages should be injected. We show that retrieval relevance metrics (e.g., NDCG) correlate weakly with end-to-end QA quality and can even become negatively correlated under multi-passage injection, where redundancy and mild conflicts destabilize generation. We propose Information Gain Pruning (IGP), a deployment-friendly reranking-and-pruning module that selects evidence using a generator-aligned utility signal and filters weak or harmful passages before truncation, without changing existing budget interfaces. Across five open-domain QA benchmarks and multiple retrievers and generators, IGP consistently improves the quality-cost trade-off. In a representative multi-evidence setting, IGP delivers about +12-20% relative improvement in average F1 while reducing final-stage input tokens by roughly 76-79% compared to retriever-only baselines.
[32] Improving User Privacy in Personalized Generation: Client-Side Retrieval-Augmented Modification of Server-Side Generated Speculations
Alireza Salemi,Hamed Zamani
Main category: cs.CL
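TL;DR: P^3 is an interactive personalization framework in which a large server-side model generates drafts and a small client-side model revises them using the private user profile, improving personalized generation quality without revealing the private profile to the server.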
Details
Motivation: Existing personalization methods trade off privacy (avoiding exposure of the private user profile to the cloud) against generation quality (relying on less capable local models); an approach that achieves both is needed. Method: P^3 uses two cooperating models: the large server-side model generates k draft tokens from the user query alone; the small client-side model, with retrieval access to the local private profile, evaluates and modifies the drafts; the process repeats until an end token is generated. Result: On the LaMP-QA benchmark, P^3 outperforms both non-personalized server-side and personalized client-side baselines by 7.4% to 9% on average; it recovers 90.3% to 95.7% of the utility of the "leaky" upper bound; privacy analyses show only 1.5%-3.5% additional leakage; the client generates only 9.2% of all tokens, which suits edge deployment. Conclusion: P^3 delivers high-quality personalized generation with improved user privacy and is a practical, efficient approach. Abstract: Personalization is crucial for aligning Large Language Model (LLM) outputs with individual user preferences and background knowledge. State-of-the-art solutions are based on retrieval augmentation, where relevant context from a user profile is retrieved for LLM consumption. These methods deal with a trade-off between exposing retrieved private data to cloud providers and relying on less capable local models. We introduce P^3, an interactive framework for high-quality personalization without revealing private profiles to server-side LLMs. In P^3, a large server-side model generates a sequence of k draft tokens based solely on the user query, while a small client-side model, with retrieval access to the user's private profile, evaluates and modifies these drafts to better reflect user preferences. This process repeats until an end token is generated. Experiments on LaMP-QA, a recent benchmark consisting of three personalized question answering datasets, show that P^3 consistently outperforms both non-personalized server-side and personalized client-side baselines, achieving statistically significant improvements of 7.4% to 9% on average. Importantly, P^3 recovers 90.3% to 95.7% of the utility of a "leaky" upper-bound scenario in which the full profile is exposed to the large server-side model. Privacy analyses, including linkability and attribute inference attacks, indicate that P^3 preserves the privacy of a non-personalized server-side model, introducing only marginal additional leakage (1.5%-3.5%) compared to submitting a query without any personal context.
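Illustration (not from the paper): the draft-then-modify loop described above can be sketched with two toy stand-ins. Everything here is hypothetical: `server_draft` and `client_revise` are placeholders for the two models, the profile is a plain token-substitution dict instead of a retrieval step, and the acceptance rule (keep draft tokens until one conflicts with the profile, replace it, discard the rest) is only one plausible reading of "evaluates and modifies". The sketch also assumes the server sees the text produced so far, which the abstract does not state.

```python
END = "<eos>"


def server_draft(query, prefix, k):
    """Hypothetical server model: sees the query and the text so far, never the profile."""
    generic = ["try", "a", "popular", "pasta", "recipe", END]
    return generic[len(prefix):len(prefix) + k]


def client_revise(draft, profile):
    """Hypothetical client model with access to the private profile.

    Accepts draft tokens in order; at the first token the profile disagrees
    with, substitutes the preferred token and drops the remaining draft.
    """
    accepted = []
    for tok in draft:
        replacement = profile.get(tok)
        if replacement is not None:
            accepted.append(replacement)
            break
        accepted.append(tok)
    return accepted


def generate(query, profile, k=3, max_tokens=32):
    """Alternate server drafting and client revision until the end token appears."""
    output = []
    while len(output) < max_tokens:
        draft = server_draft(query, output, k)
        if not draft:
            break
        output.extend(client_revise(draft, profile))
        if output[-1] == END:
            break
    return output


profile = {"pasta": "gluten-free"}  # private preference, stays on the client
result = generate("what should I cook?", profile)
print(result)
# ['try', 'a', 'popular', 'gluten-free', 'recipe', '<eos>']
```

In this toy run the client writes one of six tokens itself and accepts the rest from the server, which is the kind of split the abstract's 9.2% client-token figure points to.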
Additionally, the framework is efficient for edge deployment, with the client-side model generating only 9.2% of the total tokens. These results demonstrate that P^3 provides a practical, effective solution for personalized generation with improved privacy.
[33] Sequence Repetition Enhances Token Embeddings and Improves Sequence Labeling with Decoder-only Language Models
Matija Luka Kukić,Marko Čuljak,David Dukić,Martin Tutek,Jan Šnajder
Main category: cs.CL
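TL;DR: This paper proposes sequence repetition (SR), which gives decoder-only models bidirectional context for sequence labeling without modifying the model architecture and outperforms encoders and unmasked decoders.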
Details
Motivation: Decoder-only language models are autoregressive by design and lack bidirectional context, which sequence labeling relies on; conventional fixes such as removing the causal mask require substantial changes to the model, so a lighter, non-invasive adaptation is needed. Method: Introduces sequence repetition (SR): a copy of the input is appended to itself (e.g., [x1,x2] → [x1,x2,x1,x2]) so that the decoder's autoregressive attention implicitly captures bidirectional context; also studies different numbers of repetitions and the effectiveness of intermediate-layer embeddings. Result: SR substantially improves decoder performance on sequence labeling, surpassing standard encoders and unmasked decoders; more repetitions do not hurt performance; intermediate-layer embeddings perform comparably to final-layer ones at lower compute cost. Conclusion: Sequence repetition is an efficient, minimally invasive way to obtain bidirectionality, easing the structural limits of decoder-only models and broadening their use in sequence labeling and other token-level tasks. Abstract: Modern language models (LMs) are trained in an autoregressive manner, conditioned only on the prefix. In contrast, sequence labeling (SL) tasks assign labels to each individual input token, naturally benefiting from bidirectional context. This discrepancy has historically led SL to rely on inherently bidirectional encoder-only models. However, the rapid development of decoder-only models has raised the question of whether they can be adapted to SL. While causal mask removal has emerged as a viable technique for adapting decoder-only models to leverage the full context for SL, it requires considerable changes to the base model functionality. In this work, we explore sequence repetition (SR) as a less invasive alternative for enabling bidirectionality in decoder-only models. Through fine-tuning experiments, we show that SR inherently makes decoders bidirectional, improving the quality of token-level embeddings and surpassing encoders and unmasked decoders. Contrary to earlier claims, we find that increasing the number of repetitions does not degrade SL performance. Finally, we demonstrate that embeddings from intermediate layers are highly effective for SR, comparable to those from final layers, while being significantly more efficient to compute. Our findings underscore that SR alleviates the structural limitations of decoders, enabling more efficient and adaptable LMs and broadening their applicability to other token-level tasks.
[34] From Chains to DAGs: Probing the Graph Structure of Reasoning in LLMs
Tianjun Zhong,Linyang He,Nima Mesgarani
Main category: cs.CL
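TL;DR: This paper proposes Reasoning DAG Probing, a framework that tests whether LLM hidden states encode the geometry of a reasoning directed acyclic graph (DAG) in a linearly decodable form and how that structure emerges across layers; the results indicate that internal LLM reasoning has measurable graph structure rather than only a linear chain.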
Details
Motivation: Many reasoning problems naturally have DAG structure (multi-premise dependencies, parallel derivations, node reuse), yet most prior work assumes reasoning is a linear chain; whether LLM internals reflect such graph structure is unclear. Method: Proposes Reasoning DAG Probing: each reasoning node is given a textual realization, and lightweight probes are trained to predict two graph-theoretic properties from hidden states, node depth and pairwise node distance; control experiments that disrupt reasoning structure while preserving surface textual properties test specificity. Result: The geometry of the reasoning DAG is meaningfully encoded in intermediate-layer hidden states, and its recoverability varies systematically with node depth and model scale. Conclusion: Internal LLM reasoning is not only sequential but also shows measurable, probeable graph structure, suggesting richer multi-step reasoning representations. Abstract: Recent progress in large language models has renewed interest in mechanistically characterizing how multi-step reasoning is represented and computed. While much prior work treats reasoning as a linear chain of steps, many reasoning problems are more naturally structured as directed acyclic graphs (DAGs), where intermediate conclusions may depend on multiple premises, branch into parallel sub-derivations, and later merge or be reused. Understanding whether such graph-structured reasoning is reflected in model internals remains an open question. In this work, we introduce Reasoning DAG Probing, a framework that directly asks whether LLM hidden states encode the geometry of a reasoning DAG in a linearly accessible form, and where this structure emerges across layers. Within this framework, we associate each reasoning node with a textual realization and train lightweight probes to predict two graph-theoretic properties from hidden states: node depth and pairwise node distance. We use these probes to analyze the layerwise emergence of DAG structure and evaluate controls that disrupt reasoning-relevant structure while preserving superficial textual properties. Our results provide evidence that reasoning DAG geometry is meaningfully encoded in intermediate layers, with recoverability varying systematically by node depth and model scale, suggesting that LLM reasoning is not only sequential but exhibits measurable internal graph structure.
[35] Learning to Ideate for Machine Learning Engineering Agents
Yunxiang Zhang,Kang Zhou,Zhichao Xu,Kiran Ramnath,Yun Zhou,Sangmin Woo,Haibo Ding,Lin Lee Cheong
Main category: cs.CL
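TL;DR: This paper proposes MLE-Ideator, a dual-agent framework that separates algorithmic ideation from implementation and significantly improves the iterative optimization ability of machine learning engineering agents.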
Details
Motivation: Existing machine learning engineering (MLE) agents struggle to iteratively improve the effectiveness of their algorithms and need strategic ideation capability.
Method: Builds a dual-agent framework that separates ideation (the Ideator) from implementation (the implementation agent); the Ideator can be used zero-shot or trained with reinforcement learning (RL) on a small amount of task data.
Result: In the zero-shot setting the framework significantly outperforms implementation-only baselines; a Qwen3-8B Ideator trained on only 1K samples from 10 tasks achieves an 11.5% relative improvement and surpasses Claude Sonnet 3.5.
Conclusion: Separating ideation from implementation is a feasible and efficient path toward strategic AI systems for scientific discovery.
Abstract: Existing machine learning engineering (MLE) agents struggle to iteratively optimize their implemented algorithms for effectiveness. To address this, we introduce MLE-Ideator, a dual-agent framework that separates ideation from implementation. In our system, an implementation agent can request strategic help from a dedicated Ideator. We show this approach is effective in two ways. First, in a training-free setup, our framework significantly outperforms implementation-only agent baselines on MLE-Bench. Second, we demonstrate that the Ideator can be trained with reinforcement learning (RL) to generate more effective ideas. With only 1K training samples from 10 MLE tasks, our RL-trained Qwen3-8B Ideator achieves an 11.5% relative improvement compared to its untrained counterpart and surpasses Claude Sonnet 3.5. These results highlight a promising path toward training strategic AI systems for scientific discovery.
[36] What Language Models Know But Don't Say: Non-Generative Prior Extraction for Generalization
Sara Rezaeimanesh,Mohammad M. Ghassemi
Main category: cs.CL
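TL;DR: This paper proposes LoID, which builds informative priors for Bayesian logistic regression by directly extracting a large language model's (LLM's) token-level prediction confidence for positive versus negative semantic directions, significantly improving classification in small-sample, out-of-distribution (OOD) settings.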
Details
Motivation: In domains such as medicine and finance, large-scale labeled data is expensive and scarce, so models trained on small datasets generalize poorly; LLMs hold extensive domain knowledge, but it has not been used effectively to strengthen small-sample Bayesian modeling.
Method: LoID (Logit-Informed Distributions) is a deterministic method: it constructs prompts expressing opposing semantic directions, reads the LLM's token-level logits for each feature's direction of influence, and uses the consistency of those logits across diverse phrasings to quantify the strength and reliability of the belief, which yields a prior distribution for logistic regression.
Result: On 10 real-world tabular datasets under synthetic covariate-shift OOD settings, LoID significantly improves AUC over uninformative priors, recovering up to 59% of the performance gap relative to an oracle model trained on the full data; it outperforms AutoElicit and LLMProcesses on 8 of the 10 datasets.
Conclusion: LoID offers a reproducible, efficient way to inject LLM knowledge without text generation, giving a new approach to Bayesian modeling in small-sample, OOD settings.
Abstract: In domains like medicine and finance, large-scale labeled data is costly and often unavailable, leading to models trained on small datasets that struggle to generalize to real-world populations. Large language models contain extensive knowledge from years of research across these domains. We propose LoID (Logit-Informed Distributions), a deterministic method for extracting informative prior distributions for Bayesian logistic regression by directly accessing their token-level predictions. Rather than relying on generated text, we probe the model's confidence in opposing semantic directions (positive vs. negative impact) through carefully constructed sentences. By measuring how consistently the LLM favors one direction across diverse phrasings, we extract the strength and reliability of the model's belief about each feature's influence. We evaluate LoID on ten real-world tabular datasets under synthetic out-of-distribution (OOD) settings characterized by covariate shift, where the training data represents only a subset of the population. We compare our approach against (1) standard uninformative priors, (2) AutoElicit, a recent method that prompts LLMs to generate priors via text completions, (3) LLMProcesses, a method that uses LLMs to generate numerical predictions through in-context learning and (4) an oracle-style upper bound derived from fitting logistic regression on the full dataset. We assess performance using Area Under the Curve (AUC).
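To make the idea of turning logit consistency into a prior more concrete, here is a minimal sketch. It assumes each phrasing yields one "logit gap" (the logit for the positive-impact wording minus the logit for the negative-impact wording) and maps those gaps to a Normal prior on a single coefficient. The mapping, the function name `logit_informed_prior`, and the parameters `scale` and `min_std` are illustrative assumptions, not the procedure used in the paper.

```python
from statistics import mean, pstdev

def logit_informed_prior(logit_gaps, scale=1.0, min_std=0.1):
    """Map per-phrasing logit gaps to a Normal prior (mean, std).

    logit_gaps: one value per phrasing, positive minus negative direction.
    The mapping below is an illustrative assumption, not the LoID formula.
    """
    avg = mean(logit_gaps)
    spread = pstdev(logit_gaps)
    # Agreement: share of phrasings whose sign matches the average sign.
    agree = sum(1 for g in logit_gaps if g * avg > 0) / len(logit_gaps)
    if agree == 0:
        return 0.0, float("inf")  # no usable signal: fall back to a flat prior
    prior_mean = scale * avg * agree           # shrink when phrasings disagree
    prior_std = max(min_std, spread / agree)   # widen when phrasings disagree
    return prior_mean, prior_std

# Hypothetical gaps for two features across four phrasings each.
consistent_gaps = [1.2, 0.9, 1.5, 1.1]     # the model always favors "positive"
inconsistent_gaps = [0.8, -0.7, 0.3, -0.5]  # the model flips between phrasings

mean_c, std_c = logit_informed_prior(consistent_gaps)
mean_i, std_i = logit_informed_prior(inconsistent_gaps)
print(f"consistent feature:   mean={mean_c:.3f}, std={std_c:.3f}")
print(f"inconsistent feature: mean={mean_i:.3f}, std={std_i:.3f}")
```

In this toy mapping, a feature the model judges consistently gets a confident, signed prior, while a feature it judges inconsistently gets a prior close to zero with a wide spread.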
Across datasets, LoID significantly improves performance over logistic regression trained on OOD data, recovering up to 59% of the performance gap relative to the oracle model. LoID outperforms AutoElicit and LLMProcesses on 8 out of 10 datasets, while providing a reproducible and computationally efficient mechanism for integrating LLM knowledge into Bayesian inference.
[37] Beyond the Rabbit Hole: Mapping the Relational Harms of QAnon Radicalization
Bich Ngoc Doan,Giuseppe Russo,Gianmarco De Francisci Morales,Robert West
Main category: cs.CL
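TL;DR: Using a mixed-methods analysis of 12,747 narratives from a QAnon support community, this study identifies six radicalization personas and shows how they relate to the emotional harms reported by friends and family.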
Details
Motivation: Existing large-scale computational research overlooks the emotional cost borne by the friends and families of conspiracy believers; this study aims to fill that gap.
Method: Uses BERTopic topic modeling to map radicalization trajectories, an LDA-based graphical model to extract six "radicalization personas," and LLM-assisted emotion detection with regression modeling to link personas to narrators' emotional responses.
Result: The six personas effectively predict the specific emotional harms experienced by loved ones: radicalization perceived as a deliberate ideological choice is associated with anger and disgust, while radicalization accompanied by personal and cognitive collapse is associated with fear and sadness.
Conclusion: Radicalization is a relational phenomenon; the study builds the first empirical framework for understanding its interpersonal impact and offers guidance for researchers and practitioners.
Abstract: The rise of conspiracy theories has created far-reaching societal harm in the public discourse by eroding trust and fueling polarization. Beyond this public impact lies a deeply personal toll on the friends and families of conspiracy believers, a dimension often overlooked in large-scale computational research. This study fills this gap by systematically mapping radicalization journeys and quantifying the associated emotional toll inflicted on loved ones. We use the prominent case of QAnon as a case study, analyzing 12,747 narratives from the r/QAnonCasualties support community through a novel mixed-methods approach. First, we use topic modeling (BERTopic) to map the radicalization trajectories, identifying key pre-existing conditions, triggers, and post-radicalization characteristics. From this, we apply an LDA-based graphical model to uncover six recurring archetypes of QAnon adherents, which we term "radicalization personas." Finally, using LLM-assisted emotion detection and regression modeling, we link these personas to the specific emotional toll reported by narrators. Our findings reveal that these personas are not just descriptive; they are powerful predictors of the specific emotional harms experienced by narrators. Radicalization perceived as a deliberate ideological choice is associated with narrator anger and disgust, while those marked by personal and cognitive collapse are linked to fear and sadness. This work provides the first empirical framework for understanding radicalization as a relational phenomenon, offering a vital roadmap for researchers and practitioners to navigate its interpersonal fallout.
[38] UrduLM: A Resource-Efficient Monolingual Urdu Language Model
Syed Muhammad Ali,Hammad Sajid,Zainab Haider,Ali Muhammad Asad,Haya Fatima,Abdul Samad
Main category: cs.CL
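TL;DR: This paper presents UrduLM, a monolingual language model built specifically for Urdu. By curating a 33GB Urdu corpus, building a custom BPE tokenizer, and pretraining a 100M-parameter decoder model, it substantially improves performance on Urdu NLP tasks under low-resource conditions; all resources are open-sourced.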
Details
Motivation: Urdu lacks dedicated Transformer-based language models and high-quality corpora; existing multilingual models perform poorly on Urdu, are computationally expensive, and show cultural inaccuracies, mainly because of insufficient training data.
Method: Curates a 33GB Urdu corpus from diverse sources; designs a custom BPE tokenizer that reduces tokenization overhead by 20-30%; pretrains a 100M-parameter decoder-only language model (UrduLM).
Result: In few-shot evaluation, UrduLM reaches 66.6% accuracy on sentiment classification and BLEU scores above 30 on grammar correction, competitive with multilingual models up to 30x its size.
Conclusion: UrduLM sets a new baseline for Urdu NLP research and provides a scalable framework for modeling other underrepresented languages; all resources (corpus, tokenizer, model weights, evaluation benchmarks) are open-sourced.
Abstract: Urdu, spoken by 230 million people worldwide, lacks dedicated transformer-based language models and curated corpora. While multilingual models provide limited Urdu support, they suffer from poor performance, high computational costs, and cultural inaccuracies due to insufficient training data. To address these challenges, we present UrduLM, a pretrained Urdu monolingual language model trained in low-resource settings. We curate a 33GB Urdu corpus from diverse sources, develop a custom BPE tokenizer that reduces tokenization overhead by at least 20-30% compared to multilingual alternatives, and pretrain a 100M-parameter decoder-only model. In few-shot evaluations, UrduLM achieves competitive performance with multilingual models up to 30x its size, reaching 66.6% accuracy on sentiment classification and BLEU scores exceeding 30 on grammar correction tasks. The complete methodology -- including corpus, tokenizer, model weights, and evaluation benchmarks -- is released openly to establish a baseline for Urdu NLP research and provide a scalable framework for other underrepresented languages.
[39] Align to the Pivot: Dual Alignment with Self-Feedback for Multilingual Math Reasoning
Chunxu Zhao,Xin Huang,Xue Han,Shujian Huang,Chao Deng,Junlan Feng
Main category: cs.CL
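TL;DR: This paper proposes PASMR, which translates multilingual math reasoning questions into a pivot language and uses the pivot-language reasoning answers to supervise reasoning in the target language. This gives cross-lingual self-feedback without external annotations or reward models and improves multilingual reasoning alignment for low-resource languages.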
Details
Motivation: LLM performance drops in multilingual settings, especially for low-resource languages, because multilingual understanding and reasoning are not consistently aligned.
Method: Proposes Pivot-Aligned Self-Feedback Multilingual Reasoning (PASMR), which takes the model's primary language as the pivot language. During training, questions are first translated into the pivot language, and the pivot-language reasoning answers then supervise the reasoning process in the target language, forming a cross-lingual self-feedback mechanism.
Result: Experiments show that the method clearly improves the model's understanding of multilingual questions and its reasoning, with notable gains across tasks.
Conclusion: PASMR mitigates language bias in multilingual math reasoning and achieves efficient cross-lingual alignment without external annotations or reward models.
Abstract: Despite the impressive reasoning abilities demonstrated by large language models (LLMs), empirical evidence indicates that they are not language agnostic as expected, leading to performance declines in multilingual settings, especially for low-resource languages. We attribute the decline to the model's inconsistent multilingual understanding and reasoning alignment. To address this, we present Pivot-Aligned Self-Feedback Multilingual Reasoning (PASMR), aiming to improve the alignment of multilingual math reasoning abilities in LLMs. This approach designates the model's primary language as the pivot language. During training, the model first translates questions into the pivot language to facilitate better alignment of reasoning patterns. The reasoning process in the target language is then supervised by the pivot language's reasoning answers, thereby establishing a cross-lingual self-feedback mechanism without relying on external correct answers or reward models. Extensive experimental results demonstrate that our method enhances both the model's understanding of questions and its reasoning capabilities, leading to notable task improvements.
[40] S$^3$-Attention: Attention-Aligned Endogenous Retrieval for Memory-Bounded Long-Context Inference
Qingsen Ma,Dianyun Wang,Yaoye Wang,Lechen Ning,Sujie Zhu,Xiaohang Zhang,Jiaming Lyu,Linhao Ren,Zhenbo Xu,Zhaofeng He
Main category: cs.CL
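TL;DR: This paper proposes S3-Attention, a framework for memory-efficient long-context inference that replaces the conventional KV cache with sparse autoencoders and a CPU-based inverted index, substantially reducing GPU memory use while maintaining performance.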
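The inverted-index idea in the summary above can be sketched in a few lines. The sketch below uses small hand-written activation vectors as a stand-in for sparse-autoencoder features and scores context positions by how many top-k features they share with the query. The function names and the toy data are hypothetical, and real systems would index spans and fuse lexical scores rather than count feature matches alone.

```python
from collections import defaultdict

def top_k_features(activations, k):
    """Return the ids of the k largest activations (stand-in for SAE output)."""
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i], reverse=True)
    return order[:k]

def build_inverted_index(token_activations, k=2):
    """Single streaming pass: map feature id -> list of token positions."""
    index = defaultdict(list)
    for pos, acts in enumerate(token_activations):
        for feat in top_k_features(acts, k):
            index[feat].append(pos)
    return index

def retrieve(index, query_activations, k=2, top_n=3):
    """Rank positions by the number of top-k features shared with the query."""
    scores = defaultdict(int)
    for feat in top_k_features(query_activations, k):
        for pos in index.get(feat, []):
            scores[pos] += 1
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:top_n]

# Toy context: one activation vector over four features per token.
context = [
    [0.9, 0.1, 0.0, 0.8],  # token 0: top features 0 and 3
    [0.0, 0.7, 0.6, 0.1],  # token 1: top features 1 and 2
    [0.8, 0.0, 0.1, 0.9],  # token 2: top features 3 and 0
    [0.1, 0.9, 0.0, 0.2],  # token 3: top features 1 and 3
]
index = build_inverted_index(context)
hits = retrieve(index, [0.7, 0.0, 0.1, 0.9])  # query top features: 3 and 0
print(hits)
```

Because the index lives in ordinary host memory and only feature ids and positions are stored, the per-token state kept after the scan is small, which is the property the memory-first design relies on.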
Details
Motivation: Existing long-context inference methods suffer from high memory cost (the KV cache grows linearly with context length) and irrelevant retrieval (external retrieval tends to return lexically similar but causally irrelevant passages).
Method: At inference time, S3-Attention projects keys and queries into sparse feature identifiers, using lightweight sparse autoencoders to extract top-k features, and builds a CPU-based inverted index in a single streaming scan. At generation time it retrieves compact evidence spans based on feature co-activation, optionally fused with BM25 for exact lexical matching.
Result: Under a unified LongBench evaluation, S3-Hybrid comes close to full-context inference across several model families and is more robust in information-dense settings; however, the current prototype has higher wall-clock latency than optimized full-KV baselines.
Conclusion: S3-Attention is a memory-first, attention-aligned endogenous retrieval framework that eases the memory and noise bottlenecks of long-context inference; it shows practical potential but needs kernel-level optimization to reduce latency.
Abstract: Large language models are increasingly applied to multi-document and long-form inputs, yet long-context inference remains memory- and noise-inefficient. Key-value (KV) caching scales linearly with context length, while external retrieval methods often return lexically similar but causally irrelevant passages. We present S3-Attention, a memory-first inference-time framework that treats long-context processing as attention-aligned endogenous retrieval. S3-Attention decodes transient key and query projections into top-k sparse feature identifiers using lightweight sparse autoencoders, and constructs a CPU-based inverted index mapping features to token positions or spans during a single streaming scan. This design allows the KV cache to be discarded entirely and bounds GPU memory usage by the scan chunk size. At generation time, feature co-activation is used to retrieve compact evidence spans, optionally fused with BM25 for exact lexical matching. Under a unified LongBench evaluation protocol with fixed prompting, decoding, and matched token budgets, S3-Hybrid closely matches full-context inference across multiple model families and improves robustness in several information-dense settings. We also report an engineering limitation of the current prototype, which incurs higher wall-clock latency than optimized full-KV baselines, motivating future kernel-level optimization.
[41] Distance-to-Distance Ratio: A Similarity Measure for Sentences Based on Rate of Change in LLM Embeddings
Abdullah Qureshi,Kenneth Rice,Alexander Wolpert
Main category: cs.CL
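TL;DR: This paper proposes the distance-to-distance ratio (DDR), a new similarity measure for LLM sentence embeddings. Inspired by Lipschitz continuity, it measures the semantic influence of context and, in controlled perturbation experiments, discriminates semantic differences more finely than existing metrics.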
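One way to read the Lipschitz analogy in the summary above is as a ratio of two distances: how far apart two texts are after the model has applied context, divided by how far apart they were before. The sketch below computes that ratio with Euclidean distance. The exact definition, the choice of distance, and how pre-context embeddings are aggregated are all assumptions here; the function name `distance_to_distance_ratio` is hypothetical.

```python
import math

def distance_to_distance_ratio(pre_a, pre_b, post_a, post_b, eps=1e-12):
    """Ratio of post-context distance to pre-context distance.

    pre_a, pre_b:   context-free embeddings of the two texts (assumed to be
                    aggregated static word embeddings, for example).
    post_a, post_b: contextual LLM embeddings of the same two texts.
    A small eps guards against division by zero for identical inputs.
    """
    d_pre = math.dist(pre_a, pre_b)
    d_post = math.dist(post_a, post_b)
    return d_post / (d_pre + eps)

# Toy vectors: the pair is 5.0 apart before context and 1.0 apart after it,
# so context pulled the two texts closer together.
ratio = distance_to_distance_ratio(
    pre_a=[0.0, 0.0], pre_b=[3.0, 4.0],
    post_a=[0.0, 0.0], post_b=[1.0, 0.0],
)
print(f"DDR-style ratio: {ratio:.3f}")
```

Under this reading, a ratio well below 1 suggests the model treats the edit as semantically minor (as with a synonym swap), while a ratio near or above 1 suggests the edit changed the meaning.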
Details
Motivation: A similarity measure for text embeddings should match human perception of textual similarity, but existing measures do not explicitly model the semantic influence of context.
Method: Proposes the distance-to-distance ratio (DDR), which models semantic influence through the rate of change between pre-context word-embedding distances and post-context embedding distances. On a sentence dataset, word-replacement perturbations (synonyms or random words, replacing one to three words) are used to evaluate DDR.
Result: Across all perturbation types, DDR separates semantically similar and dissimilar texts more finely and consistently outperforms prevailing similarity metrics.
Conclusion: DDR is a context-sensitive, robust similarity measure for sentence embeddings that better matches human semantic intuition.
Abstract: A measure of similarity between text embeddings can be considered adequate only if it adheres to the human perception of similarity between texts. In this paper, we introduce the distance-to-distance ratio (DDR), a novel measure of similarity between LLM sentence embeddings. Inspired by Lipschitz continuity, DDR measures the rate of change in similarity between the pre-context word embeddings and the similarity between post-context LLM embeddings, thus measuring the semantic influence of context. We evaluate the performance of DDR in experiments designed as a series of perturbations applied to sentences drawn from a sentence dataset. For each sentence, we generate variants by replacing one, two, or three words with either synonyms, which constitute semantically similar text, or randomly chosen words, which constitute semantically dissimilar text. We compare the performance of DDR with other prevailing similarity metrics and demonstrate that DDR consistently provides finer discrimination between semantically similar and dissimilar texts, even under minimal, controlled edits.
[42] A Computational Approach to Visual Metonymy
Saptarshi Ghosh,Linfeng Liu,Tianyu Jiang
Main category: cs.CL
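TL;DR: This paper presents the first computational study of visual metonymy. It proposes a generation framework grounded in semiotic theory and builds ViMET, the first visual metonymy dataset, to evaluate how well multimodal models understand indirect visual reference.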
Details
Motivation: Images often convey deeper concepts through associated cues rather than direct depiction (for example, tools suggesting an occupation); this phenomenon, visual metonymy, has not been modeled computationally and calls for systematic study.
Method: Proposes a semiotics-driven generation pipeline that combines large language models with text-to-image models, and builds ViMET, a visual metonymy dataset of 2,000 multiple-choice questions, to evaluate the cognitive reasoning ability of multimodal models.
Result: Human accuracy (86.9%) is significantly higher than that of the best current vision-language models (65.9%), revealing a clear weakness in interpreting indirect visual reference.
Conclusion: Visual metonymy is an important cognitive blind spot for current vision-language models; ViMET provides the first benchmark dataset and methodological framework for evaluating and advancing work in this direction.
Abstract: Images often communicate more than they literally depict: a set of tools can suggest an occupation and a cultural artifact can suggest a tradition. This kind of indirect visual reference, known as visual metonymy, invites viewers to recover a target concept via associated cues rather than explicit depiction. In this work, we present the first computational investigation of visual metonymy. We introduce a novel pipeline grounded in semiotic theory that leverages large language models and text-to-image models to generate metonymic visual representations. Using this framework, we construct ViMET, the first visual metonymy dataset comprising 2,000 multiple-choice questions to evaluate the cognitive reasoning abilities in multimodal language models. Experimental results on our dataset reveal a significant gap between human performance (86.9%) and state-of-the-art vision-language models (65.9%), highlighting limitations in machines' ability to interpret indirect visual references. Our dataset is publicly available at: https://github.com/cincynlp/ViMET.
[43] Unsupervised Elicitation of Moral Values from Language Models
Meysam Alizadeh,Fabrizio Gilardi,Zeynab Samei
Main category: cs.CL
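TL;DR: This paper applies Internal Coherence Maximization (ICM), a method that needs no human annotation, to elicit latent moral reasoning from pretrained language models without supervision, and validates its effectiveness and its mitigation of social bias on several benchmarks.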
Details
Motivation: Ground-truth data for moral evaluation is hard to construct and is affected by plural frameworks and bias, so a method for surfacing moral capability without human annotation is needed.
Method: Applies the Internal Coherence Maximization (ICM) algorithm across three benchmark datasets and four language models to test unsupervised labeling of moral judgments and generalization, comparing against pretrained models, chatbot models, and fine-tuning on human labels.
Result: ICM outperforms all pretrained and chatbot baselines on the Norm Bank and ETHICS benchmarks; fine-tuning on ICM labels matches or exceeds fine-tuning on human labels; the largest gains appear in the Justice and Commonsense moral frameworks; error rates for bias related to race, socioeconomic status, and politics fall by more than 50%.
Conclusion: Pretrained language models have latent moral reasoning capability that unsupervised methods such as ICM can elicit, offering a scalable path for AI alignment.
Abstract: As AI systems become pervasive, grounding their behavior in human values is critical. Prior work suggests that language models (LMs) exhibit limited inherent moral reasoning, leading to calls for explicit moral teaching. However, constructing ground truth data for moral evaluation is difficult given plural frameworks and pervasive biases. We investigate unsupervised elicitation as an alternative, asking whether pretrained (base) LMs possess intrinsic moral reasoning capability that can be surfaced without human supervision. Using the Internal Coherence Maximization (ICM) algorithm across three benchmark datasets and four LMs, we test whether ICM can reliably label moral judgments, generalize across moral frameworks, and mitigate social bias. Results show that ICM outperforms all pre-trained and chatbot baselines on the Norm Bank and ETHICS benchmarks, while fine-tuning on ICM labels performs on par with or surpasses fine-tuning on human labels. Across theoretically motivated moral frameworks, ICM yields its largest relative gains on Justice and Commonsense morality. Furthermore, although chatbot LMs exhibit social bias failure rates comparable to their pretrained counterparts, ICM reduces such errors by more than half, with the largest improvements in race, socioeconomic status, and politics. These findings suggest that pretrained LMs possess latent moral reasoning capacities that can be elicited through unsupervised methods like ICM, providing a scalable path for AI alignment.
[44] Hylog: A Hybrid Approach to Logging Text Production in Non-alphabetic Scripts
Roberto Crotti,Giovanni Denaro,Zhiqiang Du,Ricardo Muñoz Martín
Main category: cs.CL
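TL;DR: This paper presents Hylog, a hybrid logging system that combines analytical keylogging with ecological text logging to study, more completely and at finer grain, the cognitive process of entering text in non-alphabetic scripts (such as Chinese) through Input Method Editors (IMEs).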
Details
Motivation: Most research keyloggers cannot capture the on-screen transformations that Input Method Editors (IMEs) perform for non-alphabetic scripts, which leaves a methodological gap.
Method: Develops Hylog, a modular, open-source hybrid logging system that uses plug-ins for mainstream applications (such as Word and Chrome) to capture keyboard input and rendered text together; a hybridizer module produces a dual trace. A proof-of-concept experiment has two participants translate a text into Chinese.
Result: Hylog recorded keypresses and time intervals for Latin letters, Chinese characters, and IME confirmations, data that traditional keyloggers cannot obtain; this supports new hypotheses about cognitive restrictions and affordances at different linguistic layers in IME-mediated typing.
Conclusion: Hylog fills a gap in logging technology for cognitive research on non-alphabetic scripts; its plug-in architecture extends readily to other IME systems and supports more inclusive multilingual text-production research.
Abstract: Research keyloggers are essential for cognitive studies of text production, yet most fail to capture the on-screen transformations performed by Input Method Editors (IMEs) for non-alphabetic scripts. To address this methodological gap, we present Hylog, a novel hybrid logging system that combines analytical keylogging with ecological text logging for a more complete and finer-grained analysis. Our modular, open-source system uses plug-ins for standard applications (Microsoft Word, Google Chrome) to capture both keyboard output and rendered text, which a hybridizer module then synchronizes into a dual trace. To validate the system's technical feasibility and demonstrate its analytical capabilities, we conducted a proof-of-concept study where two volunteers translated a text into simplified Chinese. Hylog successfully captured keypresses and temporal intervals between Latin letters, Chinese characters, and IME confirmations -- some measurements invisible to traditional keyloggers. The resulting data enable the formulation of new, testable hypotheses about the cognitive restrictions and affordances at different linguistic layers in IME-mediated typing. Our plug-in architecture enables extension to other IME systems and fosters more inclusive multilingual text-production research.
[45] ProGraph-R1: Progress-aware Reinforcement Learning for Graph Retrieval Augmented Generation
Jinyoung Park,Sanghyeok Lee,Omar Zia Khan,Hyunwoo J. Kim,Joo-Kyung Kim
Main category: cs.CL
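TL;DR: This paper proposes ProGraph-R1, a progress-aware agentic framework for graph retrieval and multi-step reasoning. With a structure-aware hypergraph retrieval mechanism and progress-based step-wise policy optimization, it addresses the weak use of graph structure and the sparse rewards of existing RL-driven GraphRAG methods, and it improves reasoning accuracy and generation quality on multi-hop question answering.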
Details
Motivation: Existing reinforcement-learning-based GraphRAG frameworks (such as Graph-R1) have two main limitations: they retrieve by semantic similarity alone and ignore the structure of the knowledge graph, and they use only sparse outcome-level rewards, so they cannot assess the quality of intermediate retrieval steps or their dependencies.
Method: Proposes the ProGraph-R1 framework: (1) a structure-aware hypergraph retrieval mechanism that jointly models semantic relevance and graph connectivity to support coherent multi-hop traversal; (2) progress-based step-wise policy optimization, which modulates advantages according to intermediate reasoning progress within the graph to provide dense learning signals.
Result: On multi-hop question answering benchmarks, ProGraph-R1 consistently outperforms existing GraphRAG methods in reasoning accuracy and generation quality.
Conclusion: Graph-structure awareness and dense progress rewards are key to improving GraphRAG reasoning; ProGraph-R1 offers a more robust and interpretable agent training approach for knowledge-graph-augmented multi-step reasoning.
Abstract: Graph Retrieval-Augmented Generation (GraphRAG) has been successfully applied in various knowledge-intensive question answering tasks by organizing external knowledge into structured graphs of entities and relations. It enables large language models (LLMs) to perform complex reasoning beyond text-chunk retrieval. Recent works have employed reinforcement learning (RL) to train agentic GraphRAG frameworks that perform iterative interactions between LLMs and knowledge graphs. However, existing RL-based frameworks such as Graph-R1 suffer from two key limitations: (1) they primarily depend on semantic similarity for retrieval, often overlooking the underlying graph structure, and (2) they rely on sparse, outcome-level rewards, failing to capture the quality of intermediate retrieval steps and their dependencies. To address these limitations, we propose ProGraph-R1, a progress-aware agentic framework for graph-based retrieval and multi-step reasoning. ProGraph-R1 introduces a structure-aware hypergraph retrieval mechanism that jointly considers semantic relevance and graph connectivity, encouraging coherent traversal along multi-hop reasoning paths. We also design a progress-based step-wise policy optimization, which provides dense learning signals by modulating advantages according to intermediate reasoning progress within a graph, rather than relying solely on final outcomes.
Experiments on multi-hop question answering benchmarks demonstrate that ProGraph-R1 consistently improves reasoning accuracy and generation quality over existing GraphRAG methods.
[46] Cross-Lingual Probing and Community-Grounded Analysis of Gender Bias in Low-Resource Bengali
Md Asgor Hossain Reaj,Rajan Das Gupta,Jui Saha Pritha,Abdullah Al Noman,Abir Ahmed,Golam Md Mohiuddin,Tze Hui Liew
Main category: cs.CL
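TL;DR: This study examines the intrinsic gender bias of large language models (LLMs) in Bengali. It finds that English-centric bias detection frameworks do not transfer directly to Bengali and, using several methods (lexicon-based mining, computational classification, translation-based comparison, GPT-based generation) plus two field studies, documents the distinct and culturally sensitive character of that bias. It argues for localized, community-driven bias evaluation and mitigation tools for low-resource languages.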
Details
Motivation: Gender bias research focuses mainly on English, while the implicit biases that arise from linguistic structure and socio-cultural differences in Global South languages such as Bengali are largely overlooked and need targeted study.
Method: Extracts gender-biased expressions with several strategies: lexicon-driven mining, computational classification models, cross-lingual translation-based comparison, and GPT-generated biased statements. Two field surveys covering rural and low-income communities collect perceptions of bias in real contexts.
Result: Gender bias in Bengali has characteristics that differ markedly from English, and English-centric frameworks are of limited use for it; the field surveys reveal culture-specific biases that automated systems often miss; community-participatory methods are essential for identifying locally relevant bias.
Conclusion: Linguistic tools and bias mitigation strategies need to be designed specifically for underrepresented languages such as Bengali, in order to build more inclusive and fair NLP systems and to lay a foundation for bias research in other Indic languages.
Abstract: Large Language Models (LLMs) have achieved significant success in recent years; yet, issues of intrinsic gender bias persist, especially in non-English languages. Although current research mostly emphasizes English, the linguistic and cultural biases inherent in Global South languages, like Bengali, are little examined. This research seeks to examine the characteristics and magnitude of gender bias in Bengali, evaluating the efficacy of current approaches in identifying and alleviating bias. We use several methods to extract gender-biased utterances, including lexicon-based mining, computational classification models, translation-based comparison analysis, and GPT-based bias creation. Our research indicates that the direct application of English-centric bias detection frameworks to Bengali is severely constrained by language disparities and socio-cultural factors that impact implicit biases. To tackle these difficulties, we executed two field investigations inside rural and low-income areas, gathering authentic insights on gender bias. The findings demonstrate that gender bias in Bengali presents distinct characteristics relative to English, requiring a more localized and context-sensitive methodology. Additionally, our research emphasizes the need to integrate community-driven research approaches to identify culturally relevant biases often neglected by automated systems. Our research enhances the ongoing discussion around gender bias in AI by illustrating the need to create linguistic tools specifically designed for underrepresented languages.
This study establishes a foundation for further investigations into bias reduction in Bengali and other Indic languages, promoting the development of more inclusive and fair NLP systems.
[47] DPI: Exploiting Parameter Heterogeneity for Interference-Free Fine-Tuning
Xiaoyu Liu,Xiaoyu Guan,Di Liang,Xianjie Wu
Main category: cs.CL
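TL;DR: This paper proposes a dynamic parameter isolation strategy that identifies the core parameter region of each SFT task and freezes parameters stage by stage accordingly, easing the "seesaw effect" in multi-task fine-tuning and improving overall performance.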
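The three steps in the summary above (find each task's core region, compare regions, freeze earlier regions in later stages) can be sketched on flat parameter lists. In the sketch below, the top-fraction selection rule, the Jaccard overlap measure, and all function names are illustrative assumptions; the paper's own selection threshold and grouping criterion may differ.

```python
def core_region(before, after, fraction=0.25):
    """Indices of the parameters with the largest absolute update."""
    deltas = [abs(a - b) for a, b in zip(after, before)]
    k = max(1, int(len(deltas) * fraction))
    ranked = sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)
    return set(ranked[:k])

def overlap(region_a, region_b):
    """Jaccard overlap between two core regions (assumed similarity measure)."""
    union = region_a | region_b
    return len(region_a & region_b) / len(union) if union else 0.0

def masked_update(params, grads, frozen, lr=0.1):
    """One SGD step that leaves the frozen indices untouched."""
    return [p if i in frozen else p - lr * g
            for i, (p, g) in enumerate(zip(params, grads))]

# Toy example: eight parameters, two tasks fine-tuned independently.
base = [0.0] * 8
task_a = [0.9, 0.8, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0]
task_b = [0.0, 0.1, 0.0, 0.0, 0.0, 0.7, 0.9, 0.0]

core_a = core_region(base, task_a)  # largest updates at indices 0 and 1
core_b = core_region(base, task_b)  # largest updates at indices 5 and 6
shared = overlap(core_a, core_b)    # disjoint regions: separate stages

# Stage 2: train on task B while freezing task A's core parameters.
stage_two = masked_update(task_a, [1.0] * 8, frozen=core_a)
print(core_a, core_b, shared, stage_two)
```

With disjoint regions, as here, the two tasks would go into different stages; a high overlap would instead suggest merging them for joint training.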
Details
Motivation: Conflicting objectives across heterogeneous supervised fine-tuning (SFT) tasks cause the "seesaw effect," in which optimizing one task degrades others; the root cause is that parameter updates are applied indiscriminately.
Method: First fine-tunes each SFT task independently and identifies its core parameter region (the subset of parameters with the largest updates); tasks are then merged or split into stages according to how much their core regions overlap; during multi-stage SFT, core parameters acquired in earlier tasks are frozen so that later tasks cannot overwrite them.
Result: Experiments on multiple public datasets show that the method clearly reduces data conflicts and consistently outperforms multi-stage and multi-task fine-tuning baselines.
Conclusion: Parameter heterogeneity underlies cross-task interference; dynamically identifying and isolating task-specific core parameter regions eases task conflict in SFT and improves generalization and stability.
Abstract: Supervised fine-tuning (SFT) is a crucial step for adapting large language models (LLMs) to downstream tasks. However, conflicting objectives across heterogeneous SFT tasks often induce the "seesaw effect": optimizing for one task may degrade performance on others, particularly when model parameters are updated indiscriminately. In this paper, we propose a principled approach to disentangle and isolate task-specific parameter regions, motivated by the hypothesis that parameter heterogeneity underlies cross-task interference. Specifically, we first independently fine-tune LLMs on diverse SFT tasks and identify each task's core parameter region as the subset of parameters exhibiting the largest updates. Tasks with highly overlapping core parameter regions are merged for joint training, while disjoint tasks are organized into different stages. During multi-stage SFT, core parameters acquired in prior tasks are frozen, thereby preventing overwriting by subsequent tasks. To verify the effectiveness of our method, we conducted extensive experiments on multiple public datasets. The results showed that our dynamic parameter isolation strategy consistently reduced data conflicts and achieved consistent performance improvements compared to multi-stage and multi-task tuning baselines.
[48] Controlling Reading Ease with Gaze-Guided Text Generation
Andreas Säuberli,Darja Jepifanova,Diego Frassinelli,Barbara Plank
Main category: cs.CL
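TL;DR: This paper proposes using a gaze prediction model to steer a language model toward generating text with controllable reading difficulty, and validates the approach in an eye-tracking experiment.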
Details
Motivation: Eye-movement patterns reflect cognitive load; the paper uses this to generate text with controllable reading difficulty, in support of information accessibility and personalized language learning.
Method: Uses a model that predicts human gaze patterns to steer language model outputs so that they elicit specific reading behaviors.
Result: The eye-tracking experiment shows that the method effectively adjusts reading difficulty, measured by reading times and perceived difficulty, and that the changes stem mainly from features that affect lexical processing.
Conclusion: The method offers a workable technical route for text simplification and for generating personalized educational material.
Abstract: The way our eyes move while reading can tell us about the cognitive effort required to process the text. In the present study, we use this fact to generate texts with controllable reading ease. Our method employs a model that predicts human gaze patterns to steer language model outputs towards eliciting certain reading behaviors. We evaluate the approach in an eye-tracking experiment with native and non-native speakers of English. The results demonstrate that the method is effective at making the generated texts easier or harder to read, measured both in terms of reading times and perceived difficulty of the texts. A statistical analysis reveals that the changes in reading behavior are mostly due to features that affect lexical processing. Possible applications of our approach include text simplification for information accessibility and generation of personalized educational material for language learning.
[49] Beyond a Single Perspective: Text Anomaly Detection with Multi-View Language Representations
Yixin Liu,Kehan Yan,Shiyuan Li,Qingfeng Chen,Shirui Pan
Main category: cs.CL
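TL;DR: This paper proposes MCA², a multi-view text anomaly detection framework that fuses embeddings from multiple pretrained language models and adds contrastive collaboration and adaptive allocation modules to improve generalization across datasets and anomaly types.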
Details
Motivation: Existing two-step text anomaly detection methods are limited by their reliance on a single embedding model and by poor adaptability to diverse datasets and anomaly types.
Method: Proposes the MCA² framework: a multi-view reconstruction model extracts normal textual patterns; a contrastive collaboration module strengthens complementarity across embedding views; an adaptive allocation module dynamically weights each view's contribution.
Result: Significantly outperforms strong baselines on 10 benchmark datasets.
Conclusion: Fusing embeddings from multiple models and modeling cross-view collaboration with adaptive weights improves the robustness and generalization of text anomaly detection.
Abstract: Text anomaly detection (TAD) plays a critical role in various language-driven real-world applications, including harmful content moderation, phishing detection, and spam review filtering. While two-step "embedding-detector" TAD methods have shown state-of-the-art performance, their effectiveness is often limited by the use of a single embedding model and the lack of adaptability across diverse datasets and anomaly types. To address these limitations, we propose to exploit the embeddings from multiple pretrained language models and integrate them into $MCA^2$, a multi-view TAD framework. $MCA^2$ adopts a multi-view reconstruction model to effectively extract normal textual patterns from multiple embedding perspectives. To exploit inter-view complementarity, a contrastive collaboration module is designed to leverage and strengthen the interactions across different views. Moreover, an adaptive allocation module is developed to automatically assign the contribution weight of each view, thereby improving the adaptability to diverse datasets. Extensive experiments on 10 benchmark datasets verify the effectiveness of $MCA^2$ against strong baselines. The source code of $MCA^2$ is available at https://github.com/yankehan/MCA2.
[50] DIETA: A Decoder-only transformer-based model for Italian-English machine TrAnslation
Pranav Kasela,Marco Braga,Alessandro Ghiotto,Andrea Pilzer,Marco Viviani,Alessandro Raganato
Main category: cs.CL
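TL;DR: This paper introduces DIETA, a 0.5B-parameter decoder-only Transformer model designed for Italian-English machine translation, trained on a large multi-domain parallel corpus and released together with a new evaluation set and all resources.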
Details
Motivation: There is a lack of small, efficient models optimized specifically for Italian-English translation, and of evaluation benchmarks built on contemporary text such as news.
Method: Builds a high-quality Italian-English parallel corpus of about 207 million sentence pairs (parliamentary, legal, web, subtitles, news, literature) and adds 352 million back-translated examples; trains a 0.5B-parameter decoder-only Transformer; builds and releases a new 450-sentence evaluation set based on 2025 WikiNews articles.
Result: DIETA is competitive on multiple Italian-English benchmarks, consistently ranking in the second quartile of a 32-system leaderboard and outperforming most models under 3B parameters on four of five test suites.
Conclusion: DIETA shows the value of small Transformer models tailored to a specific language pair and domain; its open data, models, and evaluation set will support further Italian-English translation research.
Abstract: In this paper, we present DIETA, a small, decoder-only Transformer model with 0.5 billion parameters, specifically designed and trained for Italian-English machine translation. We collect and curate a large parallel corpus consisting of approximately 207 million Italian-English sentence pairs across diverse domains, including parliamentary proceedings, legal texts, web-crawled content, subtitles, news, and literature, along with 352 million back-translated examples produced using pretrained models. Additionally, we create and release a new small-scale evaluation set, consisting of 450 sentences, based on 2025 WikiNews articles, enabling assessment of translation quality on contemporary text. Comprehensive evaluations show that DIETA achieves competitive performance on multiple Italian-English benchmarks, consistently ranking in the second quartile of a 32-system leaderboard and outperforming most other sub-3B models on four out of five test suites. The training script, trained models, curated corpus, and newly introduced evaluation set are made publicly available, facilitating further research and development in specialized Italian-English machine translation. https://github.com/pkasela/DIETA-Machine-Translation
[51] Linguistic and Argument Diversity in Synthetic Data for Function-Calling Agents
Dan Greenstein,Zohar Karnin,Chen Amiraz,Oren Somekh
Main category: cs.CL
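TL;DR: This paper proposes a method that generates high-quality, diverse function-calling training data by optimizing general-purpose diversity metrics, with no hand-crafted rules or taxonomies. It raises linguistic and semantic diversity at both the query and argument level, compares favorably with existing methods on diversity and correctness, and yields a 7.4% accuracy gain on the BFCL benchmark.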
Details
Motivation: Training data for function-calling agents underexplores the linguistic diversity of requests and the coverage of arguments (such as city_name and stock_ticker), which makes it hard to train robust models that generalize.
Method: Proposes a synthetic data generation method based on general-purpose diversity metrics (with no hand-crafted rules or domain-specific taxonomies) that optimizes diversity along two dimensions at once: user queries and arguments.
Result: The generated data is clearly more diverse than that of SOTA baselines while keeping comparable correctness; models trained on it perform better out of distribution, with a 7.4% accuracy gain on the BFCL benchmark.
Conclusion: Data generation for function-calling agents should cover both linguistic and argument diversity; the proposed general-purpose diversity optimization is robust across use cases and effective in practice.
Abstract: The construction of function calling agents has emerged as a promising avenue for extending model capabilities. A major challenge for this task is obtaining high quality diverse data for training. Prior work emphasizes diversity in functions, invocation patterns, and interaction turns, yet linguistic diversity of requests and coverage of arguments (e.g., city_name, stock_ticker) remain underexplored. We propose a method that generates synthetic datasets via optimizing general-purpose diversity metrics across both queries and arguments, without relying on hand-crafted rules or taxonomies, making it robust to different use cases. We demonstrate the effectiveness of our technique via both intrinsic and extrinsic testing, comparing it to SoTA data generation methods. We show a superiority over baselines in terms of diversity, while keeping comparable correctness. Additionally, when used as a training set, the model resulting from our dataset exhibits superior performance compared to analogous models based on the baseline data generation methods in out-of-distribution performance. In particular, we achieve a 7.4% increase in accuracy on the BFCL benchmark compared to similar counterparts.
[52] EFT-CoT: A Multi-Agent Chain-of-Thought Framework for Emotion-Focused Therapy
Lanqing Du,Yunong Li,YuJie Long,Shihong Chen
Main category: cs.CL
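TL;DR: This paper proposes EFT-CoT, a multi-agent chain-of-thought framework grounded in Emotion-Focused Therapy (EFT). A "bottom-up" three-stage reasoning flow (embodied perception → cognitive exploration → narrative intervention), combined with eight specialized agents and the high-quality EFT-Instruct dataset, significantly improves mental health question answering in empathy depth, professionalism, and related measures.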
Details
Motivation: Existing mental health QA methods based on Cognitive Behavioral Therapy (CBT) favor "top-down" rational restructuring and neglect clients' embodied experience and primary emotion processing, so they struggle to meet the demands for empathy and interpretability in real counseling. Method: Propose the EFT-CoT multi-agent chain-of-thought framework with a three-stage reasoning flow, "Embodied Perception - Cognitive Exploration - Narrative Intervention"; build the EFT-Instruct dataset (about 67,000 authentic texts distilled via chain-of-thought); fine-tune a specialized model, EFT-LLM. Result: EFT-LLM outperforms strong baselines and human responses on metrics such as empathy depth and structural professionalism; ablations confirm the necessity of the multi-agent mechanism; the model shows stronger psychological reasoning. Conclusion: EFT-CoT offers an effective new path toward interpretable, high-empathy counseling AI, shifting the LLM paradigm in mental health from reason-dominated approaches toward equal attention to emotion and the body. Abstract: Leveraging Large Language Models (LLMs) for Mental Health Question Answering (MHQA) is promising for mitigating resource shortages. However, existing Cognitive Behavioral Therapy (CBT)-based approaches predominantly favor a "top-down" rational restructuring, often neglecting clients' embodied experiences and primary emotion processing. To address this, we propose an Emotion-Focused Therapy (EFT)-based Multi-Agent Chain-of-Thought framework (EFT-CoT). Adopting a "bottom-up" trajectory, it deconstructs the intervention into a three-stage reasoning flow: "Embodied Perception - Cognitive Exploration - Narrative Intervention." Utilizing eight specialized agents, the system explicitly executes critical components such as somatic awareness mapping, adaptive assessment, core belief extraction, and narrative restructuring. We further constructed "EFT-Instruct," a high-quality dataset built via Chain-of-Thought distillation of approximately 67,000 authentic texts, and fine-tuned a specialized model, EFT-LLM. Experimental evaluations demonstrate that EFT-LLM outperforms strong baselines and human responses across metrics like empathy depth and structural professionalism. Ablation studies confirm the necessity of the multi-agent mechanism. The model exhibits superior psychological reasoning, offering an effective pathway for interpretable, high-empathy counseling systems.
[53] D-Models and E-Models: Diversity-Stability Trade-offs in the Sampling Behavior of Large Language Models
Jia Gu,Liang Pang,Huawei Shen,Xueqi Cheng
Main category: cs.CL
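TL;DR: This paper studies how well token-level predictive probabilities (P_token) in large language models (LLMs) align with task-level target distributions (P_task). It identifies two model types: D-models (e.g., Qwen-2.5), which show high variability and poor task alignment, and E-models (e.g., Mistral-Small), which are more stable and closer to task requirements. Simulations, downstream evaluations, and mechanism analysis reveal a diversity-stability trade-off and offer guidance for model selection in practice.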
Details
Motivation: Although LLMs can approximate real-world distributions, it remains unclear whether their fine-grained sampling probabilities (P_token) actually match the requirements of specific tasks (P_task), which calls for systematic study. Method: Identify behavioral differences through controlled distribution-sampling simulations, evaluate empirically on downstream tasks such as code generation and recommendation, and analyze internal mechanisms (e.g., logit distributions and attention patterns) to explain the differences. Result: D-models (e.g., Qwen-2.5) show large step-to-step variability in P_token and poor alignment with P_task; E-models (e.g., Mistral-Small) are more stable and better aligned; downstream tasks show a systematic diversity-stability trade-off between the two; internal analysis attributes the difference to normalization strategy and attention-focusing behavior. Conclusion: Alignment between P_token and P_task is a key implicit property affecting how well an LLM suits a task; D-models fit tasks that need high exploration, while E-models fit web-scale applications that require reliability and consistency (e.g., search, recommendation, dialogue). Abstract: The predictive probability of the next token (P_token) in large language models (LLMs) is inextricably linked to the probability of relevance for the next piece of information, the purchase probability of the next product, and the execution probability of the next action, all of which fall under the scope of the task-level target distribution (P_task). While LLMs are known to generate samples that approximate real-world distributions, whether their fine-grained sampling probabilities faithfully align with task requirements remains an open question. Through controlled distribution-sampling simulations, we uncover a striking dichotomy in LLM behavior, distinguishing two model types: D-models (e.g., Qwen-2.5), whose P_token exhibits large step-to-step variability and poor alignment with P_task; and E-models (e.g., Mistral-Small), whose P_token is more stable and better aligned with P_task. We further evaluate these two model types in downstream tasks such as code generation and recommendation, revealing systematic trade-offs between diversity and stability that shape task outcomes. Finally, we analyze the internal properties of both model families to probe their underlying mechanisms. These findings offer foundational insights into the probabilistic sampling behavior of LLMs and provide practical guidance on when to favor D- versus E-models.
For web-scale applications, including recommendation, search, and conversational agents, our results inform model selection and configuration to balance diversity with reliability under real-world uncertainty, providing a better level of interpretation.
[54] On the Emergence and Test-Time Use of Structural Information in Large Language Models
Michelle Chao Chen,Moritz Miller,Bernhard Schölkopf,Siyuan Guo
Main category: cs.CL
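TL;DR: This paper studies how language models learn abstract structure from observational data and use it at test time. Experiments on a natural language dataset built from linguistic structural transformations show that learning structural information correlates with complex reasoning tasks, while test-time compositional generation remains limited.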
Details
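Motivation: Learning structural information from observational data is essential for mechanistic understanding in scientific discovery and for flexible test-time compositional generation. Method: Design a natural language dataset based on linguistic structural transformations and study, in a controlled setting, how language models learn abstract structure and use it at test time. Result: Experiments show that the emergence of structural learning correlates with complex reasoning tasks, but test-time compositional generation remains limited. Conclusion: Language models can learn structural information and support complex reasoning, but their test-time compositional generation still needs improvement. Abstract: Learning structural information from observational data is central to producing new knowledge outside the training corpus. This holds for mechanistic understanding in scientific discovery as well as flexible test-time compositional generation. We thus study how language models learn abstract structures and utilize the learnt structural information at test time. To ensure a controlled setup, we design a natural language dataset based on linguistic structural transformations. We empirically show that the emergence of learning structural information correlates with complex reasoning tasks, and that the ability to perform test-time compositional generation remains limited.
[55] Self-Manager: Parallel Agent Loop for Long-form Deep Research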
Yilong Xu,Zhi Zheng,Xiang Long,Yujun Cai,Yiwei Wang
Main category: cs.CL
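TL;DR: This paper proposes Self-Manager, a parallel agent loop framework that supports asynchronous, concurrent execution. By giving each subthread its own context and a thread control block, it addresses the interference and blocking caused by the existing single-context, sequential execution paradigm, and outperforms baselines across all metrics on DeepResearch Bench.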
Details
Motivation: When handling long-horizon deep research tasks, existing agents are limited by a single context window and a sequential execution paradigm, which causes interference and blocking between subtasks and hurts scalability and adaptability. Method: Propose the Self-Manager framework with a main-thread/subthread structure: the main thread creates multiple subthreads with isolated contexts and manages them iteratively through Thread Control Blocks, enabling asynchronous, concurrent execution. Result: On DeepResearch Bench, Self-Manager outperforms single-agent loop baselines on all metrics; analytical experiments confirm the necessity of its design choices and show advantages in contextual capacity, efficiency, and generalization. Conclusion: By decoupling context from execution flow, Self-Manager offers a more scalable, flexible, and robust agent architecture for complex long-horizon research tasks. Abstract: Long-form deep research requires multi-faceted investigations over extended horizons to produce a comprehensive report. When handling such complex tasks, existing agents manage context at the subtask level to overcome linear context accumulation and information loss. However, they still adhere to a single context window and sequential execution paradigm, which results in mutual interference and blocking behavior, restricting scalability and adaptability. To address this issue, this paper introduces Self-Manager, a parallel agent loop that enables asynchronous and concurrent execution. The main thread can create multiple subthreads, each with its own isolated context, and manage them iteratively through Thread Control Blocks, allowing for more focused and flexible parallel agent execution. To assess its effectiveness, we benchmark Self-Manager on DeepResearch Bench, where it consistently outperforms existing single-agent loop baselines across all metrics. Furthermore, we conduct extensive analytical experiments to demonstrate the necessity of Self-Manager's design choices, as well as its advantages in contextual capacity, efficiency, and generalization.
[56] Assessment of Generative Named Entity Recognition in the Era of Large Language Models
Qi Zhan,Yile Wang,Hui Huang
Main category: cs.CL
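TL;DR: This paper systematically evaluates open-source large language models (LLMs) on flat and nested named entity recognition (NER). With parameter-efficient fine-tuning and structured output formats (such as inline bracketed or XML), they are competitive with traditional encoder models and surpass closed-source LLMs such as GPT-3. The NER capability stems from instruction following and generation rather than memorization, and NER instruction tuning has minimal effect on general capabilities, even improving some tasks.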
Details
Motivation: Explore the potential and limits of moving NER from the traditional sequence-labeling paradigm to a generative one with LLMs, clarifying the performance gap, the effect of output format, whether models rely on memorization, and how fine-tuning affects general capabilities. Method: Run systematic experiments on eight open-source LLMs of varying scale and four standard NER datasets, examining parameter-efficient fine-tuning, several structured output formats (e.g., inline bracketed, XML), and instruction-tuning strategies, and analyzing memorization and the effect on general capabilities. Result: (1) With parameter-efficient fine-tuning and structured formats, open-source LLMs are competitive with traditional encoder-based models on flat and nested NER and surpass closed-source LLMs such as GPT-3; (2) NER capability stems from instruction following and generation, not memorization of entity-label pairs; (3) NER instruction tuning barely affects general capabilities and even improves tasks such as DROP. Conclusion: Generative NER is a promising, user-friendly alternative to traditional NER methods; structured prompting and lightweight fine-tuning are enough to unlock LLMs' potential on NER. Abstract: Named entity recognition (NER) is evolving from a sequence labeling task into a generative paradigm with the rise of large language models (LLMs). We conduct a systematic evaluation of open-source LLMs on both flat and nested NER tasks. We investigate several research questions including the performance gap between generative NER and traditional NER models, the impact of output formats, whether LLMs rely on memorization, and the preservation of general capabilities after fine-tuning. Through experiments across eight LLMs of varying scales and four standard NER datasets, we find that: (1) With parameter-efficient fine-tuning and structured formats like inline bracketed or XML, open-source LLMs achieve performance competitive with traditional encoder-based models and surpass closed-source LLMs like GPT-3; (2) The NER capability of LLMs stems from instruction-following and generative power, not mere memorization of entity-label pairs; and (3) Applying NER instruction tuning has minimal impact on general capabilities of LLMs, even improving performance on datasets like DROP due to enhanced entity understanding. These findings demonstrate that generative NER with LLMs is a promising, user-friendly alternative to traditional methods. We release the data and code at https://github.com/szu-tera/LLMs4NER.
[57] ShapLoRA: Allocation of Low-rank Adaption on Large Language Models via Shapley Value Inspired Importance Estimation
Yi Zhao,Qinghua Yao,Xinyuan song,Wei Zhu
Main category: cs.CL
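TL;DR: This paper proposes the ShapLoRA framework, which uses an explainable, Shapley-value-based sensitivity measure (Shapley sensitivity) to optimize LoRA rank allocation and improve parameter-efficient fine-tuning.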
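For readers unfamiliar with the attribution measure this entry builds on, here is a minimal sketch of the classic Shapley value, computed exactly over a small set of players. It is not the paper's Shapley sensitivity, whose definition is not given in this digest; the player names (`r1` to `r3`, standing in for LoRA ranks) and the additive toy value function are assumptions for illustration only.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings.

    `value` maps a frozenset of players to a number (hypothetical utility,
    e.g. a validation score obtained with only those ranks enabled).
    Cost grows factorially, so this is only usable for a handful of players.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

# Toy additive game (assumption): each rank contributes a fixed amount.
weights = {"r1": 1.0, "r2": 2.0, "r3": 3.0}
toy_value = lambda coalition: sum(weights[p] for p in coalition)
scores = shapley_values(list(weights), toy_value)
```

In an additive game each player's Shapley value equals its own weight, which makes the toy case easy to check by hand; a rank-allocation scheme would presumably give more capacity to higher-scoring components.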
Details
Motivation: Existing LoRA rank allocation methods rely on unreliable, unexplainable importance measures, which limits performance. Method: Propose Shapley sensitivity, an explainable importance measure that combines sensitivity analysis with the idea of coalitions in cooperative games; compute it on a separate validation set and set up an allocate-then-retrain procedure for fair comparison. Result: On several challenging tasks, ShapLoRA outperforms recent baselines with a comparable number of tunable parameters. Conclusion: By introducing an explainable Shapley sensitivity, ShapLoRA makes LoRA rank allocation more principled and effective, advancing parameter-efficient fine-tuning. Abstract: Low-rank adaptation (LoRA) is a representative method in the field of parameter-efficient fine-tuning (PEFT), and is key to democratizing modern large language models (LLMs). The vanilla LoRA is implemented with uniform ranks, and the recent literature has found that properly allocating ranks on the LLM backbones results in performance boosts. However, the previous rank allocation methods have limitations since they rely on unexplainable and unreliable importance measures for the LoRA ranks. To address the above issues, we propose the ShapLoRA framework. Inspired by the explainable attribution measure Shapley Value, we combine the sensitivity-based measures with the idea of coalitions in the collaborative games among LoRA ranks, and propose a more explainable importance measure called Shapley sensitivity. In addition, we optimize the workflow of the existing works by: (a) calculating Shapley sensitivity on a separate validation set; (b) setting up the allocating-retraining procedures for fair comparisons. We have conducted experiments on various challenging tasks, and the experimental results demonstrate that our ShapLoRA method can outperform the recent baselines with comparable tunable parameters. Code and fine-tuned models will be open-sourced to facilitate future research.
[58] A Monosemantic Attribution Framework for Stable Interpretability in Clinical Neuroscience Large Language Models
Michail Mamalakis,Tiago Azevedo,Cristian Cosentino,Chiara D'Ercoli,Subati Abulikemu,Zhongtian Sun,Richard Bethlehem,Pietro Lio
Main category: cs.CL
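TL;DR: This paper proposes a unified interpretability framework that integrates attributional and mechanistic explanation methods through monosemantic feature extraction, improving the stability and trustworthiness of large language models in clinical settings such as Alzheimer's disease progression diagnosis.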
Details
Motivation: Clinical settings such as Alzheimer's disease diagnosis need trustworthy, early, and stable LLM predictions; existing attribution methods show high inter-method variability and instability, while mechanistic interpretability methods lack input-output alignment and explicit importance scores. Method: Construct a monosemantic embedding space at one LLM layer, optimize the framework to explicitly reduce inter-method variability, produce stable input-level importance scores, and highlight salient features through a decompressed representation of the target layer. Result: Yields LLM explanations that are more stable, interpretable, input-aligned, and equipped with explicit importance scores, improving safe and trustworthy use in cognitive health and neurodegenerative disease. Conclusion: Monosemantic feature extraction offers a new paradigm for combining attributional and mechanistic interpretability and helps bring LLMs into high-stakes clinical decision-making. Abstract: Interpretability remains a key challenge for deploying large language models (LLMs) in clinical settings such as Alzheimer's disease progression diagnosis, where early and trustworthy predictions are essential. Existing attribution methods exhibit high inter-method variability and unstable explanations due to the polysemantic nature of LLM representations, while mechanistic interpretability approaches lack direct alignment with model inputs and outputs and do not provide explicit importance scores. We introduce a unified interpretability framework that integrates attributional and mechanistic perspectives through monosemantic feature extraction. By constructing a monosemantic embedding space at the level of an LLM layer and optimizing the framework to explicitly reduce inter-method variability, our approach produces stable input-level importance scores and highlights salient features via a decompressed representation of the layer of interest, advancing the safe and trustworthy application of LLMs in cognitive health and neurodegenerative disease.
[59] LLMs as Cultural Archives: Cultural Commonsense Knowledge Graph Extraction
Junior Cedric Tonga,Chen Cecilia Liu,Iryna Gurevych,Fajri Koto
Main category: cs.CL
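TL;DR: This paper proposes an iterative, prompt-based framework that uses large language models (LLMs) to build a cross-lingual Cultural Commonsense Knowledge Graph (CCKG). It reveals uneven encoding of cultural knowledge in current LLMs and validates CCKG's usefulness for cultural reasoning and story generation.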
Details
Motivation: LLMs hold rich cultural commonsense, but it is mostly implicit and unstructured, limiting interpretability and use; a method is needed to extract and structure it systematically. Method: Propose an iterative prompt-based framework that treats LLMs as cultural archives, progressively eliciting culture-specific entities, relations, and practices and composing them into multi-step inferential chains across languages to form a Cultural Commonsense Knowledge Graph (CCKG). Result: Human evaluation on samples from five countries shows that CCKG is better realized in English, even when the target culture is Chinese, Indonesian, or Arabic, revealing uneven cultural encoding in LLMs; augmenting smaller models with CCKG improves cultural reasoning and story generation, with the largest gains from English chains. Conclusion: LLMs as cultural technologies show both promise and limits; chain-structured cultural knowledge is a practical foundation for culturally grounded NLP. Abstract: Large language models (LLMs) encode rich cultural knowledge learned from diverse web-scale data, offering an unprecedented opportunity to model cultural commonsense at scale. Yet this knowledge remains mostly implicit and unstructured, limiting its interpretability and use. We present an iterative, prompt-based framework for constructing a Cultural Commonsense Knowledge Graph (CCKG) that treats LLMs as cultural archives, systematically eliciting culture-specific entities, relations, and practices and composing them into multi-step inferential chains across languages. We evaluate CCKG on five countries with human judgments of cultural relevance, correctness, and path coherence. We find that the cultural knowledge graphs are better realized in English, even when the target culture is non-English (e.g., Chinese, Indonesian, Arabic), indicating uneven cultural encoding in current LLMs. Augmenting smaller LLMs with CCKG improves performance on cultural reasoning and story generation, with the largest gains from English chains. Our results show both the promise and limits of LLMs as cultural technologies and that chain-structured cultural knowledge is a practical substrate for culturally grounded NLP.
[60] SD-E$^2$: Semantic Exploration for Reasoning Under Token Budgets
Kshitij Mishra,Nils Lukas,Salem Lahlou
Main category: cs.CL
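TL;DR: This paper proposes the SD-E² framework, which improves the complex reasoning of small language models by explicitly optimizing semantic diversity in reasoning trajectories, and significantly outperforms baseline models on GSM8K, MedMCQA, and AIME.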
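This entry's method summary describes a diversity reward built from (i) coverage of semantically distinct strategies and (ii) average pairwise dissimilarity in embedding space, combined with correctness and efficiency after z-score normalization. The following is a rough sketch of those ingredients, not the authors' implementation: the greedy similarity threshold used to count strategies, the equal weights, and the toy 2-D "embeddings" are all assumptions.

```python
import math
from itertools import combinations
from statistics import mean, pstdev

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def avg_pairwise_dissimilarity(embeddings):
    """Mean of (1 - cosine similarity) over all pairs of trajectories."""
    pairs = list(combinations(embeddings, 2))
    return mean(1.0 - cosine(u, v) for u, v in pairs) if pairs else 0.0

def count_distinct_strategies(embeddings, threshold=0.9):
    """Greedy coverage count (assumed heuristic): a trajectory opens a new
    strategy if it is less similar than `threshold` to every representative."""
    representatives = []
    for e in embeddings:
        if all(cosine(e, r) < threshold for r in representatives):
            representatives.append(e)
    return len(representatives)

def zscores(values):
    mu, sd = mean(values), pstdev(values)
    return [0.0 if sd == 0 else (x - mu) / sd for x in values]

def combined_rewards(correctness, efficiency, diversity, weights=(1.0, 1.0, 1.0)):
    """Z-score each reward component across the batch, then sum with weights."""
    zc, ze, zd = zscores(correctness), zscores(efficiency), zscores(diversity)
    return [weights[0] * c + weights[1] * e + weights[2] * d
            for c, e, d in zip(zc, ze, zd)]

# Toy embeddings standing in for frozen sentence-embedding outputs.
toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
n_strategies = count_distinct_strategies(toy)
dissimilarity = avg_pairwise_dissimilarity(toy)
```

Normalizing each component before summing keeps any single reward scale from dominating the others, which is one plausible reading of why the abstract says the normalization stabilizes training.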
Details
Motivation: Small language models (SLMs) struggle to explore efficiently under limited compute, which limits their complex reasoning. Method: Propose Semantic Diversity-Exploration-Exploitation (SD-E$^2$), a reinforcement learning framework that uses a frozen sentence-embedding model to compute a semantic diversity reward (coverage of distinct strategies plus average pairwise dissimilarity), combined with correctness and efficiency in a z-score-normalized multi-objective function. Result: On GSM8K, a gain of 27.4 percentage points over Qwen2.5-3B-Instruct, with an average of 9.8 semantically distinct solution strategies discovered per question; MedMCQA reaches 49.64% (base 38.37%); AIME reaches 13.28% (base 6.74%). Conclusion: Rewarding semantic novelty gives a more compute-efficient exploration-exploitation signal; through cognitive adaptation (adjusting the structure of reasoning rather than per-token computation), SD-E$^2$ offers a new route to efficiency gains for resource-constrained small models. Abstract: Small language models (SLMs) struggle with complex reasoning because exploration is expensive under tight compute budgets. We introduce Semantic Diversity-Exploration-Exploitation (SD-E$^2$), a reinforcement learning framework that makes exploration explicit by optimizing semantic diversity in generated reasoning trajectories. Using a frozen sentence-embedding model, SD-E$^2$ assigns a diversity reward that captures (i) the coverage of semantically distinct solution strategies and (ii) their average pairwise dissimilarity in embedding space, rather than surface-form novelty. This diversity reward is combined with outcome correctness and solution efficiency in a z-score-normalized multi-objective objective that stabilizes training. On GSM8K, SD-E$^2$ surpasses the base Qwen2.5-3B-Instruct and strong GRPO baselines (GRPO-CFL and GRPO-CFEE) by +27.4, +5.2, and +1.5 percentage points, respectively, while discovering on average 9.8 semantically distinct strategies per question. We further improve MedMCQA to 49.64% versus 38.37% for the base model and show gains on the harder AIME benchmark (1983-2025), reaching 13.28% versus 6.74% for the base. These results indicate that rewarding semantic novelty yields a more compute-efficient exploration-exploitation signal for training reasoning-capable SLMs. By introducing cognitive adaptation (adjusting the reasoning process structure rather than per-token computation), SD-E$^2$ offers a complementary path to efficiency gains in resource-constrained models.
[61] AI-based approach to burnout identification from textual data
Marina Zavertiaeva,Petr Parshakov,Mikhail Usanin,Aleksei Smirnov,Sofia Paklina,Anastasiia Kibardina
Main category: cs.CL
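TL;DR: This paper proposes an AI method based on the RuBERT model that uses NLP to detect occupational burnout from text. Fine-tuned on synthetic data and real user comments, it can be applied at scale to monitor burnout-related language signals in high-stress work environments.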
Details
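Motivation: In high-stress work environments, timely identification of burnout matters for employee health and organizational management, but traditional methods rely on subjective questionnaires and lack real-time, scalable text analysis. Method: Take a pretrained RuBERT model originally trained for sentiment analysis and fine-tune it on ChatGPT-generated synthetic sentences and real user comments about burnout from Russian YouTube, building a binary (burnout / non-burnout) classifier that outputs a burnout probability for a text. Result: The model assigns a burnout probability to input text and can process large volumes of written communication, supporting the feasibility of text-based burnout detection in Russian. Conclusion: The approach offers a new path toward automated, large-scale early warning of burnout, particularly for Russian-language digital communication, though its effectiveness in real clinical or organizational interventions needs further validation. Abstract: This study introduces an AI-based methodology that utilizes natural language processing (NLP) to detect burnout from textual data. The approach relies on a RuBERT model originally trained for sentiment analysis and subsequently fine-tuned for burnout detection using two data sources: synthetic sentences generated with ChatGPT and user comments collected from Russian YouTube videos about burnout. The resulting model assigns a burnout probability to input texts and can be applied to process large volumes of written communication for monitoring burnout-related language signals in high-stress work environments.
[62] PEAR: Pairwise Evaluation for Automatic Relative Scoring in Machine Translation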
Lorenzo Proietti,Roman Grundkiewicz,Matt Post
Main category: cs.CL
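TL;DR: PEAR is a new family of supervised Quality Estimation (QE) metrics that reframes reference-free machine translation evaluation as a graded pairwise comparison. By predicting the direction and magnitude of the quality difference between two candidate translations, it significantly outperforms matched single-candidate QE methods and larger models on the WMT24 benchmark.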
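This entry's abstract mentions pairwise supervision derived from differences in human judgments, plus a regularization term that encourages sign inversion when the candidate order is reversed. One way such an objective could look is sketched below; the squared-error form, the `reg_weight` value, and the function name are assumptions, and the paper's actual loss may differ.

```python
def pairwise_loss(pred_ab, pred_ba, human_a, human_b, reg_weight=0.1):
    """Hypothetical pairwise QE loss for one (source, A, B) example.

    pred_ab: model's predicted quality difference for the order (A, B)
    pred_ba: model's prediction for the reversed order (B, A)
    human_a, human_b: human quality scores for the two candidates
    """
    target = human_a - human_b              # direction and magnitude
    fit = (pred_ab - target) ** 2           # match the human score gap
    antisymmetry = (pred_ab + pred_ba) ** 2 # zero when pred_ba == -pred_ab
    return fit + reg_weight * antisymmetry

# A perfectly antisymmetric, perfectly calibrated prediction costs nothing.
perfect = pairwise_loss(0.5, -0.5, human_a=1.0, human_b=0.5)
# Ignoring candidate order is penalized by the regularizer.
order_blind = pairwise_loss(0.5, 0.5, human_a=1.0, human_b=0.5, reg_weight=1.0)
```

The antisymmetry term is what makes the score usable as a relative measure: swapping the two candidates should flip the sign of the prediction, not change its size.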
Details
Motivation: Existing reference-free MT quality estimation methods mostly score a single candidate and struggle to model relative quality between translations; human evaluation is naturally comparative, so a QE method that directly models relative quality differences is needed. Method: PEAR frames QE as pairwise comparison: given a source sentence and two candidate translations, it outputs the direction and magnitude of their quality difference; supervision comes from differences in human scores, with a regularizer that encourages sign inversion under candidate order reversal. Result: On the WMT24 meta-evaluation benchmark, PEAR surpasses QE and reference-based metrics with far more parameters, provides a less redundant evaluation signal, and can be used efficiently for MBR decoding, significantly reducing pairwise scoring cost. Conclusion: The pairwise comparison paradigm suits reference-free MT quality evaluation better than single-candidate scoring; PEAR demonstrates its effectiveness, efficiency, and practicality. Abstract: We present PEAR (Pairwise Evaluation for Automatic Relative Scoring), a supervised Quality Estimation (QE) metric family that reframes reference-free Machine Translation (MT) evaluation as a graded pairwise comparison. Given a source segment and two candidate translations, PEAR predicts the direction and magnitude of their quality difference. The metrics are trained using pairwise supervision derived from differences in human judgments, with an additional regularization term that encourages sign inversion under candidate order reversal. On the WMT24 meta-evaluation benchmark, PEAR outperforms strictly matched single-candidate QE baselines trained with the same data and backbones, isolating the benefit of the proposed pairwise formulation. Despite using substantially fewer parameters than recent large metrics, PEAR surpasses far larger QE models and reference-based metrics. Our analysis further indicates that PEAR yields a less redundant evaluation signal relative to other top metrics. Finally, we show that PEAR is an effective utility function for Minimum Bayes Risk (MBR) decoding, reducing pairwise scoring cost at negligible impact.
[63] Evaluating Semantic and Syntactic Understanding in Large Language Models for Payroll Systems
Hendrika Maclean,Mert Can Cakmak,Muzakkiruddin Ahmed Mohammed,Shames Al Mandalawi,John Talburt
Main category: cs.CL
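TL;DR: This paper studies the reliability of large language models (LLMs) on high-precision numerical computation, using payroll systems as the example. Prompt engineering alone suffices for simple tasks, whereas complex tasks need explicit computation; the paper offers a reproducible evaluation framework and practical deployment guidance.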
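To illustrate what "explicit computation" with cent-level accuracy can mean in practice, here is a small sketch using Python's `decimal` module. The payroll rules (overtime above 40 hours at 1.5x, a single flat tax rate, rounding half up to cents, tax applied after gross pay) are hypothetical and are not taken from the paper's synthetic payroll schema.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def to_cents(amount):
    """Round a Decimal amount to whole cents, half up."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)

def net_pay(hours, hourly_rate, tax_rate,
            overtime_threshold=Decimal("40"),
            overtime_multiplier=Decimal("1.5")):
    """Apply the (assumed) rules in a fixed order: gross pay, then tax."""
    regular_hours = min(hours, overtime_threshold)
    overtime_hours = max(hours - overtime_threshold, Decimal("0"))
    gross = to_cents(regular_hours * hourly_rate
                     + overtime_hours * hourly_rate * overtime_multiplier)
    tax = to_cents(gross * tax_rate)
    return gross - tax

with_overtime = net_pay(Decimal("45"), Decimal("20.00"), Decimal("0.20"))
needs_rounding = net_pay(Decimal("37.5"), Decimal("19.99"), Decimal("0.20"))
```

Using `Decimal` built from strings avoids binary floating-point drift, and rounding at each named step leaves a trail that an auditor can re-derive by hand, which is the kind of output that is hard to guarantee from free-form model text alone.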
Details
Motivation: LLMs have advanced markedly in natural language understanding but remain unreliable on exact numerical calculation and auditability; their accuracy and trustworthiness need to be verified in high-stakes settings such as payroll. Method: Build a synthetic payroll system as the testbed, design a tiered dataset from simple to complex, and compare several prompting strategies (basic, schema-guided, and reasoning prompts) across multiple mainstream models (GPT, Claude, Perplexity, Grok, Gemini). Result: Two clear regimes emerge: some tasks reach high accuracy with careful prompting alone, while more complex tasks require an explicit computation module to guarantee cent-level accuracy; model performance differs markedly. Conclusion: Relying on an LLM alone for critical numerical tasks is risky; prompt optimization should be combined with external computation, and a reproducible evaluation framework is needed to ensure accuracy and auditability. Abstract: Large language models are now used daily for writing, search, and analysis, and their natural language understanding continues to improve. However, they remain unreliable on exact numerical calculation and on producing outputs that are straightforward to audit. We study a synthetic payroll system as a focused, high-stakes example and evaluate whether models can understand a payroll schema, apply rules in the right order, and deliver cent-accurate results. Our experiments span a tiered dataset from basic to complex cases, a spectrum of prompts from minimal baselines to schema-guided and reasoning variants, and multiple model families including GPT, Claude, Perplexity, Grok, and Gemini. Results indicate clear regimes where careful prompting is sufficient and regimes where explicit computation is required. The work offers a compact, reproducible framework and practical guidance for deploying LLMs in settings that demand both accuracy and assurance.
[64] A System for Name and Address Parsing with Large Language Models
Adeeba Tarannum,Muzakkiruddin Ahmed Mohammed,Mert Can Cakmak,Shames Al Mandalawi,John Talburt
Main category: cs.CL
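TL;DR: This paper proposes a prompt-driven, validation-centered framework that needs no fine-tuning and reliably converts unstructured person-name and address text into structured data with 17 fields. Combining input normalisation, structured prompting, constrained decoding, and strict rule-based validation, it achieves high accuracy, strong schema adherence, and stable confidence calibration on heterogeneous real-world address data.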
Details
Motivation: Traditional rule-based and probabilistic methods work well on clean input but fail under noisy or multilingual conditions; neural networks and large language models (LLMs) lack deterministic control and reproducibility. Method: Propose a prompt-driven, validation-centered framework that integrates input normalisation, structured prompting, constrained decoding, and strict rule-based validation, under fixed experimental settings for reproducibility and without fine-tuning. Result: Evaluation on heterogeneous real-world address data shows high field-level accuracy, strong schema adherence, and stable confidence calibration. Conclusion: Combining deterministic validation with generative prompting gives a robust, interpretable, and scalable approach to structured information extraction and a practical alternative to training-heavy or domain-specific models. Abstract: Reliable transformation of unstructured person and address text into structured data remains a key challenge in large-scale information systems. Traditional rule-based and probabilistic approaches perform well on clean inputs but fail under noisy or multilingual conditions, while neural and large language models (LLMs) often lack deterministic control and reproducibility. This paper introduces a prompt-driven, validation-centered framework that converts free-text records into a consistent 17-field schema without fine-tuning. The method integrates input normalisation, structured prompting, constrained decoding, and strict rule-based validation under fixed experimental settings to ensure reproducibility. Evaluations on heterogeneous real-world address data show high field-level accuracy, strong schema adherence, and stable confidence calibration. The results demonstrate that combining deterministic validation with generative prompting provides a robust, interpretable, and scalable solution for structured information extraction, offering a practical alternative to training-heavy or domain-specific models.
[65] CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
Pedro Ortiz Suarez,Laurie Burchell,Catherine Arnett,Rafael Mosquera-Gómez,Sara Hincapie-Monsalve,Thom Vaughan,Damian Stewart,Malte Ostendorff,Idris Abdulmumin,Vukosi Marivate,Shamsuddeen Hassan Muhammad,Atnafu Lambebo Tonja,Hend Al-Khalifa,Nadia Ghezaiel Hammouda,Verrah Otiende,Tack Hwa Wong,Jakhongir Saydaliev,Melika Nobakhtian,Muhammad Ravi Shulthan Habibi,Chalamalasetti Kranti,Carol Muchemi,Khang Nguyen,Faisal Muhammad Adam,Luis Frentzen Salim,Reem Alqifari,Cynthia Amol,Joseph Marvin Imperial,Ilker Kesen,Ahmad Mustafid,Pavel Stepachev,Leshem Choshen,David Anugraha,Hamada Nayel,Seid Muhie Yimam,Vallerie Alexandra Putra,My Chiffon Nguyen,Azmine Toushik Wasi,Gouthami Vadithya,Rob van der Goot,Lanwenn ar C'horr,Karan Dua,Andrew Yates,Mithil Bangera,Yeshil Bangera,Hitesh Laxmichand Patel,Shu Okabe,Fenal Ashokbhai Ilasariya,Dmitry Gaynullin,Genta Indra Winata,Yiyuan Li,Juan Pablo Martínez,Amit Agarwal,Ikhlasul Akmal Hanif,Raia Abu Ahmad,Esther Adenuga,Filbert Aurelian Tjiaranata,Weerayut Buaphet,Michael Anugraha,Sowmya Vajjala,Benjamin Rice,Azril Hafizi Amirudin,Jesujoba O. Alabi,Srikant Panda,Yassine Toughrai,Bruhan Kyomuhendo,Daniel Ruffinelli,Akshata A,Manuel Goulão,Ej Zhou,Ingrid Gabriela Franco Ramirez,Cristina Aggazzotti,Konstantin Dobler,Jun Kevin,Quentin Pagès,Nicholas Andrews,Nuhu Ibrahim,Mattes Ruckdeschel,Amr Keleg,Mike Zhang,Casper Muziri,Saron Samuel,Sotaro Takeshita,Kun Kerdthaisong,Luca Foppiano,Rasul Dent,Tommaso Green,Ahmad Mustapha Wali,Kamohelo Makaaka,Vicky Feliren,Inshirah Idris,Hande Celikkanat,Abdulhamid Abubakar,Jean Maillard,Benoît Sagot,Thibault Clérice,Kenton Murray,Sarah Luger
Main category: cs.CL
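TL;DR: This paper introduces CommonLID, a community-driven, human-annotated language identification (LID) benchmark for the web domain covering 109 languages, many of them previously under-served, for evaluating and improving LID models on noisy, heterogeneous web text.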
Details
Motivation: Existing LID models perform poorly on noisy, heterogeneous web text, especially for many languages; existing benchmarks do not adequately reflect LID performance in real web settings and lack coverage of under-resourced languages. Method: Build CommonLID, a human-annotated, open LID benchmark for the web domain covering 109 languages; together with five other common evaluation sets, systematically evaluate and analyze eight popular LID models. Result: Experiments show that existing LID evaluations generally overestimate accuracy for many languages in the web domain; CommonLID provides a key resource for improving multilingual corpus quality and advancing fairer, more robust LID research. Conclusion: CommonLID fills the gap for a high-quality, multilingual, human-annotated LID benchmark in the web domain, exposes the limits of current LID evaluation, and provides a foundation for more representative multilingual corpora and more robust language identification. Abstract: Language identification (LID) is a fundamental step in curating multilingual corpora. However, LID models still perform poorly for many languages, especially on the noisy and heterogeneous web data often used to train multilingual language models. In this paper, we introduce CommonLID, a community-driven, human-annotated LID benchmark for the web domain, covering 109 languages. Many of the included languages have been previously under-served, making CommonLID a key resource for developing more representative high-quality text corpora. We show CommonLID's value by using it, alongside five other common evaluation sets, to test eight popular LID models. We analyse our results to situate our contribution and to provide an overview of the state of the art. In particular, we highlight that existing evaluations overestimate LID accuracy for many languages in the web domain. We make CommonLID and the code used to create it available under an open, permissive license.
[66] Addressing LLM Diversity by Infusing Random Concepts
Pulin Agrawal,Prasoon Goyal
Main category: cs.CL
TL;DR: This paper investigates whether adding random concepts to prompts can increase the output diversity of large language models (LLMs), and proposes a systematic evaluation protocol using diversity metrics on outputs like lists of Hollywood actors.
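The protocol summarized here (prepend unrelated random content to a prompt, then measure the diversity of the answers) can be sketched in a few lines. This is an illustration under assumptions: the vocabulary, the newline separator, and the distinct-item ratio used as the diversity measure are all hypothetical, since the digest does not say which measures the authors use, and the model call itself is left out.

```python
import random

def infuse_random_concepts(prompt, vocabulary, k, rng):
    """Prepend k randomly sampled, unrelated words to the prompt."""
    words = rng.sample(vocabulary, k)
    return " ".join(words) + "\n" + prompt

def distinct_ratio(responses):
    """Unique items divided by total items across all responses.

    Each response is a list of strings, e.g. the actor names one model
    call returned. Higher values mean less repetition across calls.
    """
    items = [item.strip().lower() for response in responses for item in response]
    return len(set(items)) / len(items) if items else 0.0

rng = random.Random(0)  # fixed seed so the sketch is reproducible
vocabulary = ["kettle", "glacier", "violin", "tundra"]  # hypothetical word list
infused = infuse_random_concepts("Name 10 Hollywood actors", vocabulary, 2, rng)

# Mock outputs standing in for two model calls (not real results).
mock_ratio = distinct_ratio([["A", "B"], ["a", "C"]])
```

Comparing `distinct_ratio` over many calls with the plain prompt against the same number of calls with infused prompts would give one simple before/after diversity comparison.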
Details
Motivation: LLMs often generate outputs with limited diversity; the authors aim to explore whether injecting randomness into prompts can mitigate this issue. Method: The authors design a systematic evaluation protocol where LLMs are prompted with questions like 'Name 10 Hollywood actors', with and without prepended random words or sentences, then measure output diversity. Result: Experiments across multiple LLMs show that prepending unrelated random content to prompts increases output diversity. Conclusion: Infusing random concepts into prompts improves LLM output diversity, and the proposed evaluation protocol offers a foundation for future research on diversity benchmarking and enhancement techniques. Abstract: Large language models (LLMs) are known to produce outputs with limited diversity. In this work, we study whether infusing random concepts in the prompts can improve the diversity of the generated outputs. To benchmark the approach, we design a systematic evaluation protocol which involves prompting an LLM with questions of the form "Name 10 Hollywood actors", and analyzing diversity measures of the resulting LLM outputs. Our experiments on multiple LLMs show that prepending random words/sentences unrelated to the prompt results in greater diversity in the outputs of LLMs. We believe that this promising result and the evaluation protocol open up interesting avenues for future work, such as how infusing randomness into LLMs could be applied to other domains. Further, the evaluation protocol could also inspire research into benchmarking LLM diversity more systematically.
[67] Neurocomputational Mechanisms of Syntactic Transfer in Bilingual Sentence Production
Ahmet Yavuz Uluslu,Elliot Murphy
Main category: cs.CL
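TL;DR: This paper discusses the value of incorporating oscillatory signatures into the study of bilingual production errors. It argues that the ROSE neural model can account for syntactic transfer and morphosyntactic sequencing failures in bilinguals and, taking cross-linguistic influence (CLI) as a case study, attributes CLI to specific oscillatory failure modes during L2 sentence planning.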
Details
Motivation: Traditional bilingualism research relies mostly on timing signatures such as event-related potentials; this paper aims to bring in oscillatory signatures to provide finer implementational-level constraints and deepen understanding of bilingual production. Method: Based on the ROSE neurocomputational model and analysis of oscillatory neural dynamics, model and theorize cross-linguistic influence (CLI) and its inhibition/competition mechanisms. Result: ROSE captures formal properties of syntactic transfer and morphosyntactic sequencing failure modes; CLI is explained as arising from specific oscillatory failure modes, supporting exploration of more complex spatiotemporal biomarkers. Conclusion: Combining oscillatory signatures with ROSE gives bilingualism theory a testable neural implementation (a linking hypothesis) and broadens the range of neural markers for studying language dysfunction. Abstract: We discuss the benefits of incorporating into the study of bilingual production errors and their traditionally documented timing signatures (e.g., event-related potentials) certain types of oscillatory signatures, which can offer new implementational-level constraints for theories of bilingualism. We argue that a recent neural model of language, ROSE, can offer a neurocomputational account of syntactic transfer in bilingual production, capturing some of its formal properties and the scope of morphosyntactic sequencing failure modes. We take as a case study cross-linguistic influence (CLI) and attendant theories of functional inhibition/competition, and present these as being driven by specific oscillatory failure modes during L2 sentence planning. We argue that modeling CLI in this way not only offers the kind of linking hypothesis ROSE was built to encourage, but also licenses the exploration of more spatiotemporally complex biomarkers of language dysfunction than more commonly discussed neural signatures.
[68] Grounded Concreteness: Human-Like Concreteness Sensitivity in Vision-Language Models
Aryan Roy,Zekun Wang,Christopher J. MacLellan
Main category: cs.CL
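TL;DR: By comparing Llama text models with their Llama Vision multimodal counterparts, this paper studies whether vision-language models (VLMs) show more human-like sensitivity to linguistic concreteness than text-only large language models (LLMs). Analysis at four levels (output behavior, embedding geometry, attention dynamics, and token-level concreteness ratings) finds that VLMs show stronger concreteness sensitivity and more human-consistent representations on every measure.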
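The attention-dynamics analysis in this entry relies on attention-entropy measures. The digest does not give the exact definition, so the following is only a sketch of the standard Shannon entropy of one attention distribution, which is the usual starting point for such measures; the function name and the renormalization step are assumptions.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution.

    `weights` are non-negative attention scores from a single query position
    over the context tokens; they are renormalized to sum to 1.
    Low entropy means attention is focused on few tokens; high entropy
    means it is spread across the context.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

uniform = attention_entropy([0.25, 0.25, 0.25, 0.25])  # maximal for 4 tokens
focused = attention_entropy([1.0, 0.0, 0.0])           # all mass on one token
```

Averaging this quantity over heads and positions, and comparing it between matched text-only and vision-language models, is one plausible way to quantify the context reliance the abstract refers to.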
Details
Motivation: Investigate whether multimodal pretraining improves a model's sensitivity to linguistic concreteness and brings it closer to human language understanding, as opposed to gains that come merely from image input. Method: Use a controlled comparison between matched Llama text backbones and their Llama Vision counterparts at several parameter scales; quantify effects in output accuracy, representation-space structure, attention entropy, and model-generated concreteness ratings. Result: VLMs show larger accuracy gains on more concrete inputs; their representation spaces are organized more clearly along a concreteness axis; their concreteness ratings match human norm distributions better; and their attention patterns indicate stronger context reliance and grounding. Conclusion: Multimodal pretraining increases sensitivity to linguistic concreteness, giving models more human-like, perceptually grounded language understanding even on text-only tasks without image input. Abstract: Do vision-language models (VLMs) develop more human-like sensitivity to linguistic concreteness than text-only large language models (LLMs) when both are evaluated with text-only prompts? We study this question with a controlled comparison between matched Llama text backbones and their Llama Vision counterparts across multiple model scales, treating multimodal pretraining as an ablation on perceptual grounding rather than access to images at inference. We measure concreteness effects at three complementary levels: (i) output behavior, by relating question-level concreteness to QA accuracy; (ii) embedding geometry, by testing whether representations organize along a concreteness axis; and (iii) attention dynamics, by quantifying context reliance via attention-entropy measures. In addition, we elicit token-level concreteness ratings from models and evaluate alignment to human norm distributions, testing whether multimodal training yields more human-consistent judgments. Across benchmarks and scales, VLMs show larger gains on more concrete inputs, exhibit clearer concreteness-structured representations, produce ratings that better match human norms, and display systematically different attention patterns consistent with increased grounding.
[69] Sparks of Cooperative Reasoning: LLMs as Strategic Hanabi Agents
Mahesh Ramesh,Kaousheik Jayakumar,Aswinkumar Ramkumar,Pavan Thodima,Aniket Rege
Main category: cs.CL
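TL;DR: This paper studies large language models (LLMs) on Hanabi, a card game that requires cooperative reasoning under incomplete information. It evaluates 17 state-of-the-art models under three levels of context engineering (the Watson, Sherlock, and Mycroft settings) and releases the first public Hanabi datasets (HanabiLogs and HanabiRewards); fine-tuning on them substantially improves cooperative play and generalization.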
Details
Motivation: Cooperative reasoning under incomplete information remains challenging for both humans and multi-agent systems; Hanabi serves as a canonical testbed requiring theory-of-mind and strategic communication.
Method: Benchmarked 17 SOTA LLM agents in 2-5 player Hanabi games; systematically varied context engineering: minimal prompt (Watson), programmatic, Bayesian-motivated deduction scaffolding (Sherlock), and multi-turn state tracking via working memory (Mycroft); released two new datasets (HanabiLogs and HanabiRewards); performed supervised and RL finetuning on Qwen3-Instruct (4B).
Result: The strongest reasoning models in the Sherlock setting average >15 points but still trail experienced humans and specialist agents (>20); cross-play performance interpolates smoothly with model strength; supervised and RL finetuning improve Hanabi play by 21% and 156% respectively, with the finetuned model reaching within ~3 points of o4-mini and surpassing GPT-4.1 by 52%; the HanabiRewards RL-finetuned model generalizes to group-guessing (+11%), EventQA (+6.4%), IFBench-800K (+1.7 Pass@10), and matches AIME 2025 Pass@10.
Conclusion: Context engineering and targeted dataset curation (especially move-level reward annotations) significantly improve cooperative reasoning in LLMs; HanabiRewards enables both domain-specific gains and broad generalization, suggesting value-aligned, stepwise reasoning scaffolds transfer beyond narrow tasks.
Abstract: Cooperative reasoning under incomplete information remains challenging for both humans and multi-agent systems. The card game Hanabi embodies this challenge, requiring theory-of-mind reasoning and strategic communication. We benchmark 17 state-of-the-art LLM agents in 2-5 player games and study the impact of context engineering across model scales (4B to 600B+) to understand persistent coordination failures and robustness to scaffolding: from a minimal prompt with only explicit card details (Watson setting), to scaffolding with programmatic, Bayesian-motivated deductions (Sherlock setting), to multi-turn state tracking via working memory (Mycroft setting).
We show that (1) agents can maintain an internal working memory for state tracking and (2) cross-play performance between different LLMs smoothly interpolates with model strength. In the Sherlock setting, the strongest reasoning models exceed 15 points on average across player counts, yet still trail experienced humans and specialist Hanabi agents, both consistently scoring above 20. We release the first public Hanabi datasets with annotated trajectories and move utilities: (1) HanabiLogs, containing 1,520 full game logs for instruction tuning, and (2) HanabiRewards, containing 560 games with dense move-level value annotations for all candidate moves. Supervised and RL finetuning of a 4B open-weight model (Qwen3-Instruct) on our datasets improves cooperative Hanabi play by 21% and 156% respectively, bringing performance to within ~3 points of a strong proprietary reasoning model (o4-mini) and surpassing the best non-reasoning model (GPT-4.1) by 52%. The HanabiRewards RL-finetuned model further generalizes beyond Hanabi, improving performance on a cooperative group-guessing benchmark by 11%, temporal reasoning on EventQA by 6.4%, instruction-following on IFBench-800K by 1.7 Pass@10, and matching AIME 2025 mathematical reasoning Pass@10.
[70] CHiRPE: A Step Towards Real-World Clinical NLP with Clinician-Oriented Model Explanations
Stephanie Fong,Zimu Wang,Guilherme C. Oliveira,Xiangyu Zhao,Yiwen Jiang,Jiahe Liu,Beau-Luke Colton,Scott Woods,Martha E. Shenton,Barnaby Nelson,Zongyuan Ge,Dominic Dwyer
Main category: cs.CL
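TL;DR: CHiRPE is an explainable NLP pipeline for clinical high-risk prediction that combines symptom-domain mapping, LLM summarization, and BERT classification. It reaches over 90% accuracy on psychosis risk prediction, and its novel SHAP explanation formats (such as hybrid graph-and-text summaries), co-designed with clinical experts, were strongly preferred by clinicians.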
Details
Motivation: Traditional XAI methods are misaligned with clinical reasoning and lack clinician input, so they fall short of the interpretability that medical NLP tools need in practice.
Method: The CHiRPE pipeline integrates symptom-domain mapping, LLM summarization, and BERT classification; it is trained on 944 semi-structured interview transcripts from 24 international clinics of the AMP-SCZ study; novel SHAP explanation formats (such as concept-guided hybrid graph-and-text summaries) were co-developed with clinicians.
Result: Accuracy exceeds 90% across three BERT variants and outperforms baseline models; 28 clinical experts strongly preferred the proposed concept-guided explanation formats, especially the hybrid graph-and-text summaries.
Conclusion: Clinically guided model development can yield both accurate predictions and clinically interpretable explanations, offering a workable path to deploying medical AI.
Abstract: The medical adoption of NLP tools requires interpretability by end users, yet traditional explainable AI (XAI) methods are misaligned with clinical reasoning and lack clinician input. We introduce CHiRPE (Clinical High-Risk Prediction with Explainability), an NLP pipeline that takes transcribed semi-structured clinical interviews to: (i) predict psychosis risk; and (ii) generate novel SHAP explanation formats co-developed with clinicians. Trained on 944 semi-structured interview transcripts across 24 international clinics of the AMP-SCZ study, the CHiRPE pipeline integrates symptom-domain mapping, LLM summarisation, and BERT classification. CHiRPE achieved over 90% accuracy across three BERT variants and outperformed baseline models. Explanation formats were evaluated by 28 clinical experts who indicated a strong preference for our novel concept-guided explanations, especially hybrid graph-and-text summary formats. CHiRPE demonstrates that clinically-guided model development produces both accurate and interpretable results. Our next step is focused on real-world testing across our 24 international sites.
[71] GLEN-Bench: A Graph-Language based Benchmark for Nutritional Health
Jiatan Huang,Zheyuan Zhang,Tianyi Ma,Mingchen Li,Yaning Zheng,Yanfang Ye,Chuxu Zhang
Main category: cs.CL
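TL;DR: This paper introduces GLEN-Bench, the first graph-language benchmark for nutritional health assessment. It builds a knowledge graph from multiple real-world data sources, supports three tasks (risk detection, personalized dietary recommendation, and explainable question answering), and is tested on opioid use disorder.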
Details
Motivation: Existing computational methods for nutritional interventions have three gaps: they ignore real-world constraints (such as socioeconomic status, comorbidities, and food access), they lack explainability, and there is no unified evaluation benchmark.
Method: Builds a nutritional-health knowledge graph from NHANES, FNDDS, and USDA data; defines GLEN-Bench, a three-task benchmark covering risk detection, personalized recommendation under constraints, and graph-grounded question answering; evaluates graph neural networks, large language models, and hybrid architectures.
Result: The benchmark is tested on opioid use disorder, where models must detect subtle nutritional differences across disease stages; the evaluation establishes solid baselines and identifies clear dietary patterns linked to health risks.
Conclusion: GLEN-Bench fills a gap in the evaluation of AI for nutritional health and supports research on personalized dietary interventions that balance clinical usefulness, social equity, and model explainability.
Abstract: Nutritional interventions are important for managing chronic health conditions, but current computational methods provide limited support for personalized dietary guidance. We identify three key gaps: (1) dietary pattern studies often ignore real-world constraints such as socioeconomic status, comorbidities, and limited food access; (2) recommendation systems rarely explain why a particular food helps a given patient; and (3) no unified benchmark evaluates methods across the connected tasks needed for nutritional interventions. We introduce GLEN-Bench, the first comprehensive graph-language based benchmark for nutritional health assessment. We combine NHANES health records, FNDDS food composition data, and USDA food-access metrics to build a knowledge graph that links demographics, health conditions, dietary behaviors, poverty-related constraints, and nutrient needs. We test the benchmark using opioid use disorder, where models must detect subtle nutritional differences across disease stages. GLEN-Bench includes three linked tasks: risk detection identifies at-risk individuals from dietary and socioeconomic patterns; recommendation suggests personalized foods that meet clinical needs within resource constraints; and question answering provides graph-grounded, natural-language explanations to facilitate comprehension. We evaluate these graph-language approaches, including graph neural networks, large language models, and hybrid architectures, to establish solid baselines and identify practical design choices.
Our analysis identifies clear dietary patterns linked to health risks, providing insights that can guide practical interventions.
[72] FABLE: Forest-Based Adaptive Bi-Path LLM-Enhanced Retrieval for Multi-Document Reasoning
Lin Sun,Linglin Zhang,Jingang Huang,Change Jia,Zhengwei Cheng,Xiangzheng Zhang
Main category: cs.CL
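TL;DR: This paper proposes FABLE, a framework that combines LLM-enhanced hierarchical forest indexes with a bi-path retrieval strategy. It keeps accuracy high while sharply reducing the token cost of long-context reasoning, indicating that structured retrieval remains necessary in the era of long-context LLMs.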
Details
Motivation: Long-context LLMs are advancing quickly but still suffer from the lost-in-the-middle phenomenon, high computational cost, and poor scalability for multi-document reasoning, while traditional RAG is limited by flat chunk-level retrieval that introduces semantic noise and cannot support structured cross-document synthesis.
Method: FABLE builds LLM-enhanced forest indexes with multi-granularity semantic structure and retrieves along two paths, LLM-guided hierarchical traversal and structure-aware propagation, with explicit budget control for adaptive efficiency trade-offs.
Result: FABLE consistently outperforms SOTA RAG methods and matches the accuracy of full-context LLM inference while using up to 94% fewer tokens.
Conclusion: Long-context LLMs amplify rather than remove the need for structured retrieval; FABLE shows the value of integrating LLMs into the retrieval architecture itself rather than using them only for generation.
Abstract: The rapid expansion of long-context Large Language Models (LLMs) has reignited debate on whether Retrieval-Augmented Generation (RAG) remains necessary. However, empirical evidence reveals persistent limitations of long-context inference, including the lost-in-the-middle phenomenon, high computational cost, and poor scalability for multi-document reasoning. Conversely, traditional RAG systems, while efficient, are constrained by flat chunk-level retrieval that introduces semantic noise and fails to support structured cross-document synthesis. We present FABLE, a Forest-based Adaptive Bi-path LLM-Enhanced retrieval framework that integrates LLMs into both knowledge organization and retrieval. FABLE constructs LLM-enhanced hierarchical forest indexes with multi-granularity semantic structures, then employs a bi-path strategy combining LLM-guided hierarchical traversal with structure-aware propagation for fine-grained evidence acquisition, with explicit budget control for adaptive efficiency trade-offs. Extensive experiments demonstrate that FABLE consistently outperforms SOTA RAG methods and achieves comparable accuracy to full-context LLM inference with up to 94% token reduction, showing that long-context LLMs amplify rather than fully replace the need for structured retrieval.
[73] Typhoon-S: Minimal Open Post-Training for Sovereign Large Language Models
Kunat Pipatanakul,Pittawat Taveekitworachai
Main category: cs.CL
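TL;DR: This paper presents Typhoon S, a minimal, open post-training recipe that combines supervised fine-tuning, on-policy distillation, and small-scale reinforcement fine-tuning (RFT) to build sovereign large language models under limited resources, with strong results on Thai legal reasoning and local-knowledge tasks.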
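The small-scale RFT step in this entry uses InK-GRPO, which the abstract describes as GRPO with an added next-word prediction loss. The sketch below only illustrates that general shape: a group-relative advantage followed by a weighted sum of a policy term and a negative log-likelihood term. It is a simplification under stated assumptions (no clipping, no KL penalty), and `ntp_weight` is a hypothetical mixing coefficient, not a value from the paper.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of sampled responses (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def combined_loss(rewards, seq_logprobs, ntp_token_logprobs, ntp_weight=0.1):
    """Simplified policy term plus a next-word prediction (NLL) term.

    Assumptions: no clipping or KL penalty, and `ntp_weight` is a
    hypothetical coefficient chosen only for illustration.
    """
    advantages = group_relative_advantages(rewards)
    # Policy term: raise log-probability of above-average responses.
    policy_loss = -sum(a * lp for a, lp in zip(advantages, seq_logprobs)) / len(rewards)
    # Next-word prediction term: mean negative log-likelihood over tokens.
    ntp_loss = -sum(ntp_token_logprobs) / len(ntp_token_logprobs)
    return policy_loss + ntp_weight * ntp_loss

advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
loss = combined_loss([1.0, 0.0, 0.0, 1.0], [-2.0, -3.0, -2.5, -1.5], [-0.5, -1.0, -1.5])
```

The point of the extra term, on this reading, is that the model keeps receiving a token-level language-modeling signal while the reward-driven term shapes task behavior.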
Details
Motivation: Most large language models are developed by a few organizations in high-resource languages, which makes it hard for regional or national institutions to obtain models whose weights, data, and deployment they fully control and understand under limited resources and strict transparency requirements.
Method: Proposes the Typhoon S post-training recipe (supervised fine-tuning, on-policy distillation, and small-scale RFT) and introduces InK-GRPO, an extension of GRPO that adds a next-word prediction loss, to improve performance on domain tasks.
Result: With Thai as the case study, the recipe turns base models into general-purpose instruction-tuned models and improves Thai legal reasoning and Thai-specific knowledge while preserving general capabilities, with reduced instruction-data and compute requirements.
Conclusion: A carefully designed lightweight post-training strategy can produce high-quality sovereign LLMs with academic-scale resources, without massive instruction corpora or complex preference-tuning pipelines.
Abstract: Large language models (LLMs) have progressed rapidly; however, most state-of-the-art models are trained and evaluated primarily in high-resource languages such as English and Chinese, and are often developed by a small number of organizations with access to large-scale compute and data. This gatekeeping creates a practical barrier for sovereign settings in which a regional- or national-scale institution or domain owner must retain control and understanding of model weights, training data, and deployment while operating under limited resources and strict transparency constraints. To this end, we identify two core requirements: (1) adoptability, the ability to transform a base model into a general-purpose assistant, and (2) sovereign capability, the ability to perform high-stakes, region-specific tasks (e.g., legal reasoning in local languages and cultural knowledge). We investigate whether these requirements can be achieved without scaling massive instruction corpora or relying on complex preference tuning pipelines and large-scale reinforcement fine-tuning (RFT). We present Typhoon S, a minimal and open post-training recipe that combines supervised fine-tuning, on-policy distillation, and small-scale RFT. Using Thai as a representative case study, we demonstrate that our approach transforms both sovereign-adapted and general-purpose base models into instruction-tuned models with strong general performance. We further show that small-scale RFT with InK-GRPO, an extension of GRPO that augments the GRPO loss with a next-word prediction loss, improves Thai legal reasoning and Thai-specific knowledge while preserving general capabilities.
Our results suggest that a carefully designed post-training strategy can reduce the required scale of instruction data and computation, providing a practical path toward high-quality sovereign LLMs under academic-scale resources.
[74] Fine-Grained Emotion Detection on GoEmotions: Experimental Comparison of Classical Machine Learning, BiLSTM, and Transformer Models
Ani Harutyunyan,Sachin Kumar
Main category: cs.CL
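TL;DR: This paper benchmarks three models (TF-IDF logistic regression, BiLSTM with attention, and fine-tuned BERT) on fine-grained multi-label emotion recognition with the GoEmotions dataset. Logistic regression achieves the highest Micro-F1 (0.51), while BERT gives the best overall balance, with Macro-F1 0.49, Hamming Loss 0.036, and Subset Accuracy 0.36, surpassing the official paper's reported results. The findings suggest that surface lexical cues suffice for frequent emotions, whereas contextual modeling helps with rare and ambiguous ones.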
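Because this entry turns on the difference between Micro-F1, Macro-F1, Hamming Loss, and Subset Accuracy, a small worked sketch may help. It implements the four metrics from their standard definitions on a toy 0/1 label matrix and adds one common form of inverse-frequency class weight (samples divided by positive count); the toy data and that weighting variant are assumptions, not details taken from the paper.

```python
def multilabel_metrics(y_true, y_pred):
    """Micro-F1, Macro-F1, Hamming loss and subset accuracy for 0/1 label matrices."""
    n_samples, n_labels = len(y_true), len(y_true[0])

    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    counts = []  # (tp, fp, fn) per label
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))

    # Micro-F1 pools counts over labels, so frequent labels dominate.
    micro = f1(*(sum(c[i] for c in counts) for i in range(3)))
    # Macro-F1 averages per-label F1, so rare labels count equally.
    macro = sum(f1(*c) for c in counts) / n_labels
    wrong = sum(t[j] != p[j] for t, p in zip(y_true, y_pred) for j in range(n_labels))
    hamming = wrong / (n_samples * n_labels)
    subset = sum(t == p for t, p in zip(y_true, y_pred)) / n_samples
    return {"micro_f1": micro, "macro_f1": macro,
            "hamming_loss": hamming, "subset_accuracy": subset}

def inverse_frequency_weights(y_true):
    """One common variant: n_samples / positive count per label (assumed form)."""
    n_samples = len(y_true)
    return [n_samples / max(1, sum(row[j] for row in y_true))
            for j in range(len(y_true[0]))]

y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0]]
scores = multilabel_metrics(y_true, y_pred)
weights = inverse_frequency_weights(y_true)
```

In this toy case the rare third label is never predicted, which pulls Macro-F1 well below Micro-F1; the same mechanism explains how one model can lead on Micro-F1 while another leads on Macro-F1.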
Details
Motivation: Fine-grained emotion recognition is difficult because of label overlap and class imbalance.
Method: Compares three models on GoEmotions: logistic regression on TF-IDF features trained with binary relevance, a BiLSTM with attention, and BERT fine-tuned for multi-label classification; uses the official split and mitigates imbalance with inverse-frequency class weights.
Result: Logistic regression attains the highest Micro-F1 (0.51); BERT performs best on Macro-F1 (0.49), Hamming Loss (0.036), and Subset Accuracy (0.36), surpassing the official paper's reported results.
Conclusion: Frequent emotions rely on surface lexical cues, whereas contextual representations such as BERT are better at rare and ambiguous emotions, so model choice should reflect the target metric and the task.
Abstract: Fine-grained emotion recognition is a challenging multi-label NLP task due to label overlap and class imbalance. In this work, we benchmark three modeling families on the GoEmotions dataset: a TF-IDF-based logistic regression system trained with binary relevance, a BiLSTM with attention, and a BERT model fine-tuned for multi-label classification. Experiments follow the official train/validation/test split, and imbalance is mitigated using inverse-frequency class weights. Across several metrics, namely Micro-F1, Macro-F1, Hamming Loss, and Subset Accuracy, we observe that logistic regression attains the highest Micro-F1 of 0.51, while BERT achieves the best overall balance surpassing the official paper's reported results, reaching Macro-F1 0.49, Hamming Loss 0.036, and Subset Accuracy 0.36. This suggests that frequent emotions often rely on surface lexical cues, whereas contextual representations improve performance on rarer emotions and more ambiguous examples.
[75] MemWeaver: Weaving Hybrid Memories for Traceable Long-Horizon Agentic Reasoning
Juexiang Ye,Xue Li,Xinyu Yang,Chengkai Huang,Lanshun Nie,Lina Yao,Dechen Zhan
Main category: cs.CL
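TL;DR: MemWeaver is a unified memory framework in which a structured graph memory, an experience memory, and a passage memory work together to improve temporal consistency, multi-hop reasoning, and evidence traceability for LLM agents in long-horizon interactions, while greatly shortening the input context.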
Details
Motivation: Existing memory systems rely on unstructured retrieval or coarse abstractions, which leads to temporal conflicts, brittle reasoning, and limited traceability, and makes them a poor fit for long-horizon, multi-step, evidence-grounded agent interactions.
Method: MemWeaver comprises (1) a temporally grounded graph memory for structured relational reasoning, (2) an experience memory that abstracts recurring interaction patterns from repeated observations, and (3) a passage memory that preserves the original textual evidence, together with a dual-channel retrieval strategy that jointly retrieves structured knowledge and supporting evidence.
Result: On the LoCoMo benchmark, MemWeaver substantially improves multi-hop and temporal reasoning accuracy while reducing input context length by over 95% compared to long-context baselines.
Conclusion: MemWeaver's structured, layered, evidence-coupled memory design addresses temporal consistency, reasoning robustness, and verifiability in long-horizon agent memory.
Abstract: Large language model-based agents operating in long-horizon interactions require memory systems that support temporal consistency, multi-hop reasoning, and evidence-grounded reuse across sessions. Existing approaches largely rely on unstructured retrieval or coarse abstractions, which often lead to temporal conflicts, brittle reasoning, and limited traceability. We propose MemWeaver, a unified memory framework that consolidates long-term agent experiences into three interconnected components: a temporally grounded graph memory for structured relational reasoning, an experience memory that abstracts recurring interaction patterns from repeated observations, and a passage memory that preserves original textual evidence. MemWeaver employs a dual-channel retrieval strategy that jointly retrieves structured knowledge and supporting evidence to construct compact yet information-dense contexts for reasoning. Experiments on the LoCoMo benchmark demonstrate that MemWeaver substantially improves multi-hop and temporal reasoning accuracy while reducing input context length by over 95% compared to long-context baselines.
[76] TechING: Towards Real World Technical Image Understanding via VLMs
Tafazzul Nadeem,Bhavik Shangari,Manish Rai,Gagan Raj Gupta,Ashutosh Modi
Main category: cs.CL
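TL;DR: This paper presents a synthetic dataset and self-supervision tasks for technical diagram understanding. Fine-tuning Llama 3.2 11B-instruct on them yields LLama-VL-TUG, which substantially improves ROUGE-L over the base model and improves F1 on real hand-drawn technical diagrams.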
Details
Motivation: Existing vision-language models (VLMs) struggle to understand hand-drawn technical diagrams (such as flowcharts and block diagrams), and real hand-drawn images are hard to collect at scale, which limits training.
Method: Builds a large synthetic corpus that reflects real-world hand-drawn images; designs several new self-supervision tasks; fine-tunes Llama 3.2 11B-instruct on the synthetic data to obtain LLama-VL-TUG.
Result: LLama-VL-TUG improves the ROUGE-L of Llama 3.2 11B-instruct by 2.14x; in human evaluation on real hand-drawn images it has the fewest compilation errors among all baselines in 7 of 8 diagram types and improves the average F1 score by 6.97x.
Conclusion: Synthetic data and self-supervision tasks can compensate for the scarcity of real hand-drawn data and substantially improve VLM understanding of technical diagrams, with LLama-VL-TUG achieving the best all-round performance among the baselines tested.
Abstract: Professionals working in technical domains typically hand-draw (on whiteboard, paper, etc.) technical diagrams (e.g., flowcharts, block diagrams, etc.) during discussions; however, if they want to edit these later, the diagrams need to be redrawn from scratch. Modern-day VLMs have made tremendous progress in image understanding but they struggle when it comes to understanding technical diagrams. One way to overcome this problem is to fine-tune on real-world hand-drawn images, but it is not practically possible to generate a large number of such images. In this paper, we introduce a large synthetically generated corpus (reflective of real-world images) for training VLMs and subsequently evaluate VLMs on a smaller corpus of hand-drawn images (with the help of humans). We introduce several new self-supervision tasks for training, perform extensive experiments with various baseline models, and fine-tune the Llama 3.2 11B-instruct model on synthetic images on these tasks to obtain LLama-VL-TUG, which significantly improves the ROUGE-L performance of Llama 3.2 11B-instruct by 2.14x and achieves the best all-round performance across all baseline models. On real-world images, human evaluation reveals that we achieve minimum compilation errors across all baselines in 7 out of 8 diagram types and improve the average F1 score of Llama 3.2 11B-instruct by 6.97x.
[77] BoRP: Bootstrapped Regression Probing for Scalable and Human-Aligned LLM Evaluation
Peng Sun,Xiangyu Zhang,Duan Wu
Main category: cs.CL
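TL;DR: This paper proposes BoRP (Bootstrapped Regression Probing), a framework that exploits the geometry of an LLM's latent space, using a polarization-index-guided bootstrapping mechanism and Partial Least Squares (PLS) regression to evaluate user satisfaction with open-ended conversational AI at high fidelity and low cost. It significantly outperforms generative baselines while greatly reducing inference cost.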
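This entry mentions A/B testing via CUPED. As background, the sketch below shows the standard CUPED adjustment, which subtracts the part of a metric explained by a pre-experiment covariate. How BoRP scores feed into CUPED is not described in the entry, so the variable names and toy data here are purely illustrative assumptions.

```python
def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted values: y - theta * (x - mean(x)).

    theta = cov(x, y) / var(x); the covariate must not be constant.
    """
    n = len(metric)
    mean_x = sum(covariate) / n
    mean_y = sum(metric) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, metric)) / n
    var_x = sum((x - mean_x) ** 2 for x in covariate) / n
    theta = cov_xy / var_x
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Hypothetical per-user scores before (covariate) and during (metric) an experiment.
pre = [3.0, 4.0, 5.0, 6.0, 7.0]
post = [3.2, 3.9, 5.1, 6.2, 6.8]
adjusted = cuped_adjust(post, pre)
```

The adjustment leaves the mean unchanged and lowers the variance whenever the covariate is correlated with the metric, which is what makes the resulting A/B comparison more sensitive.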
Details
Motivation: For open-ended conversational AI, traditional A/B testing lacks reliable satisfaction metrics: explicit feedback is sparse and implicit metrics are ambiguous, so a high-fidelity, scalable automatic evaluation method is needed.
Method: BoRP builds on the geometric structure of the LLM latent space: a polarization-index-based bootstrapping mechanism automates rubric generation, and Partial Least Squares (PLS) maps hidden states to continuous satisfaction scores, with no generative scoring.
Result: On industrial datasets, BoRP (with Qwen3-8B/14B) aligns with human judgments significantly better than generative baselines (including Qwen3-Max) and reduces inference cost by orders of magnitude.
Conclusion: BoRP offers an efficient, accurate, deployable approach to satisfaction evaluation for conversational AI, supporting full-scale monitoring and highly sensitive A/B testing via CUPED.
Abstract: Accurate evaluation of user satisfaction is critical for iterative development of conversational AI. However, for open-ended assistants, traditional A/B testing lacks reliable metrics: explicit feedback is sparse, while implicit metrics are ambiguous. To bridge this gap, we introduce BoRP (Bootstrapped Regression Probing), a scalable framework for high-fidelity satisfaction evaluation. Unlike generative approaches, BoRP leverages the geometric properties of LLM latent space. It employs a polarization-index-based bootstrapping mechanism to automate rubric generation and utilizes Partial Least Squares (PLS) to map hidden states to continuous scores. Experiments on industrial datasets show that BoRP (Qwen3-8B/14B) significantly outperforms generative baselines (even Qwen3-Max) in alignment with human judgments. Furthermore, BoRP reduces inference costs by orders of magnitude, enabling full-scale monitoring and highly sensitive A/B testing via CUPED.
[78] Reflecting Twice before Speaking with Empathy: Self-Reflective Alternating Inference for Empathy-Aware End-to-End Spoken Dialogue
Yuhang Jia,Pei Liu,Haoqin Sun,Jiaming Zhou,Xuxin Cheng,Cao Liu,Ke Zeng,Xunliang Cai,Yong Qin
Main category: cs.CL
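TL;DR: This paper proposes EmpathyEval, an evaluation model, and ReEmpathy, an end-to-end spoken language model that improves empathy in dialogue through an empathetic self-reflective alternating inference mechanism, overcoming the limits of rigid supervision signals.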
Details
Motivation: Existing methods depend on rigid supervision signals (such as ground-truth responses or preference scores), which model complex empathy poorly: empathy has no single correct answer, and a numerical score cannot capture the nuances of emotional expression or the appropriateness of empathetic behavior.
Method: Introduces EmpathyEval, a descriptive natural-language-based evaluation model; building on it, proposes ReEmpathy, which uses an Empathetic Self-Reflective Alternating Inference mechanism to interleave spoken response generation with free-form, empathy-related reflective reasoning.
Result: Extensive experiments show that ReEmpathy substantially improves empathy-sensitive spoken dialogue, supporting the effectiveness of reflective reasoning.
Conclusion: ReEmpathy offers a promising route toward more emotionally intelligent, empathy-aware human-computer interaction.
Abstract: End-to-end Spoken Language Models (SLMs) hold great potential for paralinguistic perception, and numerous studies have aimed to enhance their capabilities, particularly for empathetic dialogue. However, current approaches largely depend on rigid supervised signals, such as ground-truth response in supervised fine-tuning or preference scores in reinforcement learning. Such reliance is fundamentally limited for modeling complex empathy, as there is no single "correct" response and a simple numerical score cannot fully capture the nuances of emotional expression or the appropriateness of empathetic behavior. To address these limitations, we first introduce EmpathyEval, a descriptive natural-language-based evaluation model for assessing empathetic quality in spoken dialogues. Building upon EmpathyEval, we propose ReEmpathy, an end-to-end SLM that enhances empathetic dialogue through a novel Empathetic Self-Reflective Alternating Inference mechanism, which interleaves spoken response generation with free-form, empathy-related reflective reasoning. Extensive experiments demonstrate that ReEmpathy substantially improves empathy-sensitive spoken dialogue by enabling reflective reasoning, offering a promising approach toward more emotionally intelligent and empathy-aware human-computer interactions.
[79] U-Fold: Dynamic Intent-Aware Context Folding for User-Centric Agents
Jin Su,Runnan Fang,Yeqiu Li,Xiaobin Wang,Shihao Cai,Pengjun Xie,Ningyu Zhang,Fajie Yuan
Main category: cs.CL
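TL;DR: This paper proposes U-Fold, a dynamic, intent-aware context-folding framework that addresses the information loss and intent drift that context-length limits cause for LLM agents in user-centric multi-turn dialogue. It significantly outperforms ReAct and other baselines on several benchmarks.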
Details
Motivation: Existing context-folding methods have two major failure modes in user-centric multi-turn dialogue: they irreversibly discard fine-grained constraints and intermediate facts that matter for later decisions, and they fail to track evolving user intent, leading to omissions and erroneous actions.
Method: U-Fold retains the full dialogue and tool-call history and, at each turn, uses two core components to produce an intent-aware, evolving dialogue summary and a compact, task-relevant tool log.
Result: On τ-bench, τ²-bench, VitaBench, and harder context-inflated settings, U-Fold achieves a 71.4% win rate over ReAct in long-context settings and improves on prior folding baselines by up to 27.0%, particularly on long, noisy, multi-turn tasks.
Conclusion: U-Fold is a promising step toward transferring context-management techniques from single-query benchmarks to realistic user-centric applications.
Abstract: Large language model (LLM)-based agents have been successfully deployed in many tool-augmented settings, but their scalability is fundamentally constrained by context length. Existing context-folding methods mitigate this issue by summarizing past interactions, yet they are typically designed for single-query or single-intent scenarios. In more realistic user-centric dialogues, we identify two major failure modes: (i) they irreversibly discard fine-grained constraints and intermediate facts that are crucial for later decisions, and (ii) their summaries fail to track evolving user intent, leading to omissions and erroneous actions. To address these limitations, we propose U-Fold, a dynamic context-folding framework tailored to user-centric tasks. U-Fold retains the full user-agent dialogue and tool-call history but, at each turn, uses two core components to produce an intent-aware, evolving dialogue summary and a compact, task-relevant tool log. Extensive experiments on τ-bench, τ²-bench, VitaBench, and harder context-inflated settings show that U-Fold consistently outperforms ReAct (achieving a 71.4% win rate in long-context settings) and prior folding baselines (with improvements of up to 27.0%), particularly on long, noisy, multi-turn tasks. Our study demonstrates that U-Fold is a promising step toward transferring context-management techniques from single-query benchmarks to realistic user-centric applications.
[80] Temp-R1: A Unified Autonomous Agent for Complex Temporal KGQA via Reverse Curriculum Reinforcement Learning
Zhaoyan Gong,Zhiqiang Liu,Songze Li,Xiaoke Guo,Yuanxiang Liu,Xinle Deng,Zhizhen Liu,Lei Liang,Huajun Chen,Wen Zhang
Main category: cs.CL
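TL;DR: This paper proposes Temp-R1, the first autonomous end-to-end agent for temporal knowledge graph question answering (TKGQA) trained through reinforcement learning. An expanded action space and reverse curriculum learning strengthen multi-hop temporal reasoning, and the agent reaches state-of-the-art performance on multiple benchmarks.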
Details
Motivation: Existing TKGQA methods rely on fixed workflows and expensive closed-source APIs, which limits flexibility and scalability; in addition, single-action reasoning causes cognitive overload, and models tend to learn shortcuts on simple questions instead of developing sophisticated reasoning.
Method: Temp-R1 (1) expands the action space with specialized internal actions alongside the external action to relieve cognitive overload; (2) uses reverse curriculum learning, training on difficult questions first and then transferring to easier ones, to prevent shortcut learning; and (3) is trained end to end with reinforcement learning.
Result: The 8B-parameter Temp-R1 achieves state-of-the-art performance on MultiTQ and TimelineKGQA, improving 19.8% over strong baselines on complex questions.
Conclusion: Temp-R1 establishes a new paradigm for autonomous temporal reasoning agents and gives TKGQA a more flexible and scalable approach with deeper reasoning ability.
Abstract: Temporal Knowledge Graph Question Answering (TKGQA) is inherently challenging, as it requires sophisticated reasoning over dynamic facts with multi-hop dependencies and complex temporal constraints. Existing methods rely on fixed workflows and expensive closed-source APIs, limiting flexibility and scalability. We propose Temp-R1, the first autonomous end-to-end agent for TKGQA trained through reinforcement learning. To address cognitive overload in single-action reasoning, we expand the action space with specialized internal actions alongside the external action. To prevent shortcut learning on simple questions, we introduce reverse curriculum learning that trains on difficult questions first, forcing the development of sophisticated reasoning before transferring to easier cases. Our 8B-parameter Temp-R1 achieves state-of-the-art performance on MultiTQ and TimelineKGQA, improving 19.8% over strong baselines on complex questions. Our work establishes a new paradigm for autonomous temporal reasoning agents. Our code will be publicly available soon at https://github.com/zjukg/Temp-R1.
[81] Suppressing Final Layer Hidden State Jumps in Transformer Pretraining
Keigo Shibata,Kazuki Yano,Ryosuke Takahashi,Jaesung Lee,Wataru Ikeda,Jun Suzuki
Main category: cs.CL
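TL;DR: This paper studies how the angular distance between hidden state vectors changes inside Transformer language models. It finds that most models show a pronounced jump at the final layer and proposes a jump-suppressing regularizer (JREG) to balance capability usage across the middle layers and improve downstream task performance.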
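To make the quantity in this entry concrete, the sketch below computes an angular distance between consecutive hidden states and shows a toy profile with a large final step. Dividing the angle by π and the toy 2-D vectors are assumptions made for illustration; the paper's own jump-strength metric is not reproduced here.

```python
import math

def angular_distance(u, v):
    """Angle between two vectors divided by pi, so the result lies in [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against floating-point values just outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos) / math.pi

def layerwise_distances(hidden_states):
    """Angular distance between each layer's input and output hidden state."""
    return [angular_distance(h_in, h_out)
            for h_in, h_out in zip(hidden_states, hidden_states[1:])]

# Toy 2-D hidden states: small rotations in the middle, a large one at the end.
states = [[1.0, 0.0], [0.99, 0.1], [0.97, 0.2], [0.0, 1.0]]
dists = layerwise_distances(states)
```

A regularizer in the spirit of JREG would add a penalty on the last entry of such a profile during pre-training, although the exact penalty form is defined in the paper rather than here.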
Details
Motivation: In Transformer models the angular distance between a layer's input and output hidden states changes only slightly in the middle layers but jumps disproportionately in or around the final layer; this may indicate unbalanced use of model capability, so the phenomenon should be quantified and suppressed.
Method: Introduces a quantitative metric for jump strength around the final layer; designs the jump-suppressing regularizer (JREG), which explicitly penalizes the jump during pre-training; validates it on Llama-based models.
Result: The jump is shown to be prevalent across many open-weight models and to amplify throughout pre-training; in experiments on three sizes of Llama-based models, training with JREG improves task performance over the baseline without altering the model architecture.
Conclusion: The final-layer jump in hidden-state angular distance is a quantifiable, controllable model behavior; JREG regularization encourages more balanced capability usage across the middle layers and thereby improves performance.
Abstract: This paper discusses the internal behavior of Transformer language models. Many recent pre-trained models have been reported to exhibit only slight changes in the angular distance between the input and output hidden state vectors in the middle Transformer layers, despite a disproportionately large "jump" in the angular distance occurring in or around the final Transformer layer. To characterize this, we first introduce a quantitative metric for the jump strength around the final layer, and then demonstrate its prevalence across many open-weight models, as well as its amplification throughout pre-training. Assuming such jumps indicate an undesirable property, we propose the jump-suppressing regularizer (JREG) which penalizes this jump during pre-training, thereby encouraging more balanced capability usage across the middle layers. Empirical evaluations of three model sizes of Llama-based models, trained with the proposed JREG method, reveal improved task performance compared to the baseline without altering the model architecture.
[82] Calibrating Beyond English: Language Diversity for Better Quantized Multilingual LLM
Everlyn Asiko Chimoto,Mostafa Elhoushi,Bruce A. Bassett
Main category: cs.CL
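TL;DR: This paper systematically evaluates how the language of the calibration set affects quantization of multilingual large language models. Non-English and multilingual calibration sets significantly outperform English-only ones, and both matching the target language and language diversity matter.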
Details
Motivation: Existing post-training quantization methods mostly rely on small, English-only calibration sets, and their impact on multilingual large language models remains underexplored.
Method: Evaluates eight calibration settings (five single-language and three multilingual mixes) with two quantizers (GPTQ, AWQ) on data from 10 languages, analyzing perplexity changes, language-matching effects, and differences in activation distributions.
Result: Multilingual mixes yield the largest overall perplexity reductions, up to 3.52 points, on Llama3.1 8B and Qwen2.5 7B; calibration tailored to the evaluation language gives the largest gains for individual languages; some language-quantizer combinations degrade performance, traced to differences in activation range distributions across languages.
Conclusion: Static one-size-fits-all calibration is suboptimal for multilingual LLMs; calibration sets should be tailored to the target language and to data diversity for robust quantization.
Abstract: Quantization is an effective technique for reducing the storage footprint and computational costs of Large Language Models (LLMs), but it often results in performance degradation. Existing post-training quantization methods typically use small, English-only calibration sets; however, their impact on multilingual models remains underexplored. We systematically evaluate eight calibration settings (five single-language and three multilingual mixes) on two quantizers (GPTQ, AWQ) on data from 10 languages. Our findings reveal a consistent trend: non-English and multilingual calibration sets significantly improve perplexity compared to English-only baselines. Specifically, we observe notable average perplexity gains across both quantizers on Llama3.1 8B and Qwen2.5 7B, with multilingual mixes achieving the largest overall reductions of up to 3.52 points in perplexity. Furthermore, our analysis indicates that tailoring calibration sets to the evaluation language yields the largest improvements for individual languages, underscoring the importance of linguistic alignment. We also identify specific failure cases where certain language-quantizer combinations degrade performance, which we trace to differences in activation range distributions across languages. These results highlight that static one-size-fits-all calibration is suboptimal and that tailoring calibration data, both in language and diversity, plays a crucial role in robustly quantizing multilingual LLMs.
[83] MultiVis-Agent: A Multi-Agent Framework with Logic Rules for Reliable and Comprehensive Cross-Modal Data Visualization
Jinwei Lu,Yuanfeng Song,Chen Zhang,Raymond Chi-Wing Wong
Main category: cs.CL
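TL;DR: This paper proposes MultiVis-Agent, a logic rule-enhanced multi-agent framework for reliable multi-modal, multi-scenario visualization generation. A four-layer logic rule framework provides mathematical guarantees for system reliability, and the approach significantly outperforms baselines on the newly built MultiVis-Bench benchmark.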
Details
Motivation: Real-world visualization tasks are complex and multi-modal, while current systems are limited to single-modality input, one-shot generation, and rigid workflows; LLM-based approaches show potential but are unreliable (for example, catastrophic failures and infinite loops).
Method: Proposes the MultiVis-Agent multi-agent framework with a four-layer logic rule framework (mathematical constraints that guide LLM reasoning rather than replace it), formalizes the MultiVis task across four scenarios, and builds MultiVis-Bench, a benchmark with over 1,000 cases.
Result: Achieves a 75.63% visualization score on challenging tasks, significantly above baselines (57.54-62.79%), with a task completion rate of 99.58% and a code execution success rate of 94.56% (versus 74.48% and 65.10% without logic rules).
Conclusion: A logic rule-enhanced multi-agent framework can handle both the complexity and the reliability challenges of automated visualization generation.
Abstract: Real-world visualization tasks involve complex, multi-modal requirements that extend beyond simple text-to-chart generation, requiring reference images, code examples, and iterative refinement. Current systems exhibit fundamental limitations: single-modality input, one-shot generation, and rigid workflows. While LLM-based approaches show potential for these complex requirements, they introduce reliability challenges including catastrophic failures and infinite loop susceptibility. To address this gap, we propose MultiVis-Agent, a logic rule-enhanced multi-agent framework for reliable multi-modal and multi-scenario visualization generation. Our approach introduces a four-layer logic rule framework that provides mathematical guarantees for system reliability while maintaining flexibility. Unlike traditional rule-based systems, our logic rules are mathematical constraints that guide LLM reasoning rather than replacing it. We formalize the MultiVis task spanning four scenarios from basic generation to iterative refinement, and develop MultiVis-Bench, a benchmark with over 1,000 cases for multi-modal visualization evaluation. Extensive experiments demonstrate that our approach achieves 75.63% visualization score on challenging tasks, significantly outperforming baselines (57.54-62.79%), with task completion rates of 99.58% and code execution success rates of 94.56% (vs. 74.48% and 65.10% without logic rules), successfully addressing both complexity and reliability challenges in automated visualization generation.
[84] Overalignment in Frontier LLMs: An Empirical Study of Sycophantic Behaviour in Healthcare
Clément Christophe,Wadood Mohammed Abdul,Prateek Munjal,Tathagata Raha,Ronnie Rajan,Praveenkumar Kanithi
Main category: cs.CL
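TL;DR: This paper proposes a robust framework based on medical multiple-choice question answering (MCQA) for evaluating sycophancy in large language models (LLMs) in clinical settings, that is, the risk that a model sacrifices factual accuracy to agree with the user. It introduces the Adjusted Sycophancy Score, a metric that accounts for stochastic model instability, and finds that reasoning-optimized models are more prone to rationalizing incorrect user suggestions under authoritative pressure, indicating that high benchmark accuracy does not imply clinical reliability.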
Details
Motivation: LLMs are increasingly integrated into clinical workflows, but their sycophancy (prioritizing user agreement over factual accuracy) poses significant risks to patient safety; existing evaluations often rely on subjective datasets and lack verifiable objective standards.
Method: Builds an evaluation framework on medical MCQA with verifiable ground truths; proposes the Adjusted Sycophancy Score, which isolates alignment bias by accounting for the model's stochastic instability, or "confusability"; runs an extensive scaling analysis of the Qwen-3 and Llama-3 families and analyzes the internal reasoning traces of reasoning-optimized "Thinking" models.
Result: Identifies a clear scaling trajectory for resilience to sycophancy; shows that "Thinking" models, despite high vanilla accuracy, frequently rationalize incorrect user suggestions in their internal reasoning traces under authoritative pressure; models with simplified reasoning structures may be more robust against expert-driven sycophancy.
Conclusion: Benchmark performance is not a proxy for clinical reliability; evaluation paradigms and robustness metrics designed for medical safety are needed, and more elaborate reasoning does not necessarily make a model more trustworthy in clinical use and may increase deference to authority.
Abstract: As LLMs are increasingly integrated into clinical workflows, their tendency for sycophancy, prioritizing user agreement over factual accuracy, poses significant risks to patient safety. While existing evaluations often rely on subjective datasets, we introduce a robust framework grounded in medical MCQA with verifiable ground truths. We propose the Adjusted Sycophancy Score, a novel metric that isolates alignment bias by accounting for stochastic model instability, or "confusability". Through an extensive scaling analysis of the Qwen-3 and Llama-3 families, we identify a clear scaling trajectory for resilience. Furthermore, we reveal a counter-intuitive vulnerability in reasoning-optimized "Thinking" models: while they demonstrate high vanilla accuracy, their internal reasoning traces frequently rationalize incorrect user suggestions under authoritative pressure. Our results across frontier models suggest that benchmark performance is not a proxy for clinical reliability, and that simplified reasoning structures may offer superior robustness against expert-driven sycophancy.
[85] When Domain Pretraining Interferes with Instruction Alignment: An Empirical Study of Adapter Merging in Medical LLMs
Junyi Zou
Main category: cs.CL
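TL;DR: This paper proposes a two-stage LoRA fine-tuning method (domain-adaptive pre-training followed by supervised fine-tuning) and a weighted adapter merging strategy to improve LLMs' medical terminology precision and safety-critical instruction following, and reports reasonably good generation metrics on a medical validation set.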
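To make the idea of linearly combining two LoRA adapters concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the toy 2x2 layer, the rank-1 adapters, and the weights 0.8 and 0.2 are all assumptions chosen for illustration. It only relies on the standard LoRA fact that an adapter contributes a low-rank update B·A to a base weight matrix.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, enough for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(b: Matrix, a: Matrix, scale: float = 1.0) -> Matrix:
    """Weight update contributed by one LoRA adapter: scale * (B @ A)."""
    return [[scale * v for v in row] for row in matmul(b, a)]


def merge_adapters(base: Matrix, delta_sft: Matrix, delta_pt: Matrix,
                   w_sft: float, w_pt: float) -> Matrix:
    """Return base + w_sft * delta_sft + w_pt * delta_pt, element-wise."""
    return [
        [w0 + w_sft * ds + w_pt * dp for w0, ds, dp in zip(r0, rs, rp)]
        for r0, rs, rp in zip(base, delta_sft, delta_pt)
    ]


# Toy 2x2 layer with rank-1 adapters (all numbers are made up).
base = [[1.0, 0.0], [0.0, 1.0]]
delta_sft = lora_delta([[1.0], [0.0]], [[0.5, 0.5]])   # B is 2x1, A is 1x2
delta_pt = lora_delta([[0.0], [1.0]], [[1.0, -1.0]])
merged = merge_adapters(base, delta_sft, delta_pt, w_sft=0.8, w_pt=0.2)
```

Raising `w_pt` relative to `w_sft` pulls the merged weights toward the domain-pretraining adapter, which is the trade-off between domain knowledge and instruction following that the entry describes.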
Details
Motivation: LLMs perform strongly on general tasks but fall short on medical terminology precision and safety-critical instruction following, so high-risk domains such as medicine need dedicated optimization methods. Method: Two-stage LoRA fine-tuning: stage one is domain-adaptive pre-training (DAPT) to inject broad medical knowledge; stage two is supervised fine-tuning (SFT) on instruction-style medical question-answering data to align behavior; a Weighted Adapter Merging strategy linearly combines the SFT and PT adapters before export. Result: On a medical validation set (F5/F6), the merged model reaches BLEU-4 = 16.38, ROUGE-1 = 20.42, ROUGE-2 = 4.60, and ROUGE-L = 11.54 under a practical decoding configuration; decoding sensitivity and training stability are also analyzed. Conclusion: Weighted adapter merging effectively balances instruction-following ability and domain knowledge retention, offering a feasible, interpretable, lightweight approach to fine-tuning in safety-critical specialized domains. Abstract: Large language models (LLMs) show strong general capability but often struggle with medical terminology precision and safety-critical instruction following. We present a case study of adapter interference in safety-critical domains using a 14B-parameter base model through a two-stage LoRA pipeline: (1) domain-adaptive pre-training (PT) to inject broad medical knowledge via continued pre-training (DAPT), and (2) supervised fine-tuning (SFT) to align the model with medical question-answering behaviors through instruction-style data. To balance instruction-following ability and domain knowledge retention, we propose Weighted Adapter Merging, linearly combining SFT and PT adapters before exporting a merged base-model checkpoint. On a held-out medical validation set (F5/F6), the merged model achieves BLEU-4 = 16.38, ROUGE-1 = 20.42, ROUGE-2 = 4.60, and ROUGE-L = 11.54 under a practical decoding configuration. We further analyze decoding sensitivity and training stability with loss curves and controlled decoding comparisons.
[86] Code over Words: Overcoming Semantic Inertia via Code-Grounded Reasoning
Manjie Xu,Isabella Yin,Xinyi Tu,Chi Zhang,Yixin Zhu
Main category: cs.CL
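TL;DR: This paper shows that LLMs suffer from "semantic inertia" when rules change dynamically, that is, they struggle to inhibit pre-trained priors. Representing dynamic rules as executable code rather than descriptive text, combined with a code-grounded fine-tuning method (LCV), effectively mitigates the problem and challenges the conventional assumption that larger models are always better.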
Details
Motivation: When facing dynamic, context-dependent rule changes, LLMs struggle to inhibit fixed semantic priors formed during pre-training (e.g., "lava is dangerous"), which causes reasoning failures; this phenomenon is called semantic inertia. Method: Propose Code-Grounded Vistas (LCV): model physical rules as executable code rather than natural-language descriptions, and introduce counterfactual pairs and identification of states with contradictory rules during training, forcing the model to attend to logical constraints rather than visual or semantic priors. Result: Experiments show that code representations reverse the inverse-scaling phenomenon (larger models performing worse); LCV outperforms inference-time search methods in both efficiency and accuracy. Conclusion: The form of representation (natural language vs. executable code) fundamentally determines whether contextual reasoning improves or degrades as models scale; this finding questions the effectiveness of simply scaling up models and has important implications for tasks that require dynamically overriding prior knowledge. Abstract: LLMs struggle with Semantic Inertia: the inability to inhibit pre-trained priors (e.g., "Lava is Dangerous") when dynamic, in-context rules contradict them. We probe this phenomenon using Baba Is You, where physical laws are mutable text rules, enabling precise evaluation of models' ability to override learned priors when rules change. We quantitatively observe that larger models can exhibit inverse scaling: they perform worse than smaller models when natural language reasoning requires suppressing pre-trained associations (e.g., accepting "Lava is Safe"). Our analysis attributes this to natural language encoding, which entangles descriptive semantics and logical rules, leading to persistent hallucinations of familiar physics despite explicit contradictory rules. Here we show that representing dynamics as executable code, rather than descriptive text, reverses this trend and enables effective prior inhibition. We introduce Code-Grounded Vistas (LCV), which fine-tunes models on counterfactual pairs and identifies states with contradictory rules, thereby forcing attention to logical constraints rather than visual semantics. This training-time approach outperforms expensive inference-time search methods in both efficiency and accuracy. Our results demonstrate that representation fundamentally determines whether scaling improves or impairs contextual reasoning. This challenges the assumption that larger models are universally better, with implications for domains that require dynamic overriding of learned priors.
[87] CitiLink: Enhancing Municipal Transparency and Citizen Engagement through Searchable Meeting Minutes
Rodrigo Silva,José Evans,José Isidro,Miguel Marques,Afonso Fonseca,Ricardo Morais,João Canavilhas,Arian Pasquali,Purificação Silvano,Alípio Jorge,Nuno Guimarães,Sérgio Nunes,Ricardo Campos
Main category: cs.CL
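TL;DR: CitiLink is a platform that uses LLMs and information retrieval techniques to turn Portuguese city council meeting minutes into structured, searchable data, with the aim of improving local government transparency and public accessibility.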
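This entry's abstract names BM25 as the ranking function behind CitiLink's full-text search. For readers unfamiliar with it, the following is a minimal Okapi BM25 sketch in plain Python. It is not CitiLink's code: the tokenization by whitespace, the parameter defaults `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the three toy "minutes" are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df: Dict[str, int] = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative (a common variant).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy stand-ins for meeting minutes (invented text).
docs = [
    "council approved the budget for road repairs".split(),
    "vote on the new school budget was postponed".split(),
    "public library opening hours were discussed".split(),
]
scores = bm25_scores(["budget", "vote"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Here the second document ranks first because it matches both query terms, and the rarer term "vote" carries more weight than "budget", which appears in two of the three documents.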
Details
Motivation: City council minutes are lengthy and rigidly formatted, making it hard for citizens and journalists to find key information efficiently; their accessibility and transparency need to improve. Method: Use LLMs (such as Gemini) to extract metadata, discussed subjects, and voting outcomes from the minutes; use BM25 for full-text search with faceted filtering; build a user-friendly interactive interface. Result: The system was built over a collection of 120 minutes from six Portuguese municipalities; user testing validated its usability; Gemini performed well on the information extraction task. Conclusion: NLP and IR techniques can effectively support structuring and open search of local government texts, and CitiLink offers a workable example for strengthening digital governance at the local level. Abstract: City council minutes are typically lengthy and formal documents with a bureaucratic writing style. Although publicly available, their structure often makes it difficult for citizens or journalists to efficiently find information. In this demo, we present CitiLink, a platform designed to transform unstructured municipal meeting minutes into structured and searchable data, demonstrating how NLP and IR can enhance the accessibility and transparency of local government. The system employs LLMs to extract metadata, discussed subjects, and voting outcomes, which are then indexed in a database to support full-text search with BM25 ranking and faceted filtering through a user-friendly interface. The developed system was built over a collection of 120 minutes made available by six Portuguese municipalities. To assess its usability, CitiLink was tested through guided sessions with municipal personnel, providing insights into how real users interact with the system. In addition, we evaluated Gemini's performance in extracting relevant information from the minutes, highlighting its effectiveness in data extraction.
[88] Hierarchical Text Classification with LLM-Refined Taxonomies
Jonas Golde,Nicolaas Jedema,Ravi Krishnan,Phong Le
Main category: cs.CL
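TL;DR: This paper proposes TaxMorph, a framework that uses LLMs to restructure the label taxonomy in hierarchical text classification (HTC) in a semantics-driven way (e.g., renaming, merging, splitting, reordering) so that it better fits the internal representations and decision preferences of language models. Experiments show that LLM-refined taxonomies improve HTC performance (by up to +2.9 pp F1), and that the gain comes from better matching the model's inductive biases rather than from simply improving embedding separability.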
Details
Motivation: Real-world taxonomies often contain ambiguities (such as identically named leaf nodes under similar parent nodes), which make it hard for language models to learn clear decision boundaries and hurt hierarchical text classification. Method: Propose TaxMorph, which uses LLMs to apply semantics-aware structural operations to the entire taxonomy (renaming, merging, splitting, reordering) so that it better fits the language model's internal representations and classification confusion patterns; explain the performance gain by comparing human-curated and LLM-refined taxonomies in terms of clustering in embedding space and agreement with the model's confusions. Result: On three HTC benchmarks, LLM-refined taxonomies outperform human-curated ones across settings, with F1 gains of up to 2.9 percentage points; the analysis finds that human-curated taxonomies cluster more cleanly in embedding space, but LLM-refined taxonomies reflect the model's actual confusion patterns more accurately, that is, they fit its inductive biases better. Conclusion: LLM-guided taxonomy refinement produces taxonomies better suited to how language models learn, improving HTC performance; the approach stresses that a taxonomy should serve the model's own structure rather than only human interpretability or geometric separability. Abstract: Hierarchical text classification (HTC) depends on taxonomies that organize labels into structured hierarchies. However, many real-world taxonomies introduce ambiguities, such as identical leaf names under similar parent nodes, which prevent language models (LMs) from learning clear decision boundaries. In this paper, we present TaxMorph, a framework that uses large language models (LLMs) to transform entire taxonomies through operations such as renaming, merging, splitting, and reordering. Unlike prior work, our method revises the full hierarchy to better match the semantics encoded by LMs. Experiments across three HTC benchmarks show that LLM-refined taxonomies consistently outperform human-curated ones in various settings, by up to +2.9 pp in F1. To better understand these improvements, we compare how well LMs can assign leaf nodes to parent nodes and vice versa across human-curated and LLM-refined taxonomies. We find that human-curated taxonomies lead to more easily separable clusters in embedding space. However, the LLM-refined taxonomies align more closely with the model's actual confusion patterns during classification. In other words, even though they are harder to separate, they better reflect the model's inductive biases. These findings suggest that LLM-guided refinement creates taxonomies that are more compatible with how models learn, improving HTC performance.
[89] Corpus-Based Approaches to Igbo Diacritic Restoration
Ignatius Ezeani
Main category: cs.CL
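TL;DR: This paper addresses diacritic disambiguation for Igbo, a low-resource language, proposing a flexible dataset generation framework and comparing three approaches: n-gram models, classification models, and embedding models.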
Details
Motivation: Over 95% of the world's 7000 languages are low-resourced for NLP, lacking data, tools, and techniques; Igbo is a typical low-resource language whose diacritic disambiguation problem needs to be addressed. Method: Build a flexible dataset generation framework; propose and compare three approaches: standard n-gram models (predicting from the sequence of preceding words), classification models (using a window of words on both sides of the target word), and embedding models (computing the similarity between the context embedding and each candidate variant's embedding). Result: Several modeling approaches to Igbo diacritic restoration were implemented, providing a reusable data generation framework and model designs for low-resource NLP tasks. Conclusion: The work offers a systematic method and a practical path for diacritic restoration in low-resource languages, Igbo in particular, and helps extend NLP techniques to more languages. Abstract: With natural language processing (NLP), researchers aim to enable computers to identify and understand patterns in human languages. This is often difficult because a language embeds many dynamic and varied properties in its syntax, pragmatics and phonology, which need to be captured and processed. The capacity of computers to process natural languages is increasing because NLP researchers are pushing its boundaries. But these research works focus more on well-resourced languages such as English, Japanese, German, French, Russian, Mandarin Chinese, etc. Over 95% of the world's 7000 languages are low-resourced for NLP, i.e. they have little or no data, tools, and techniques for NLP work. In this thesis, we present an overview of diacritic ambiguity and a review of previous diacritic disambiguation approaches on other languages. Focusing on the Igbo language, we report the steps taken to develop a flexible framework for generating datasets for diacritic restoration. Three main approaches, the standard n-gram model, the classification models and the embedding models, were proposed. The standard n-gram models use a sequence of previous words to the target stripped word as key predictors of the correct variants. For the classification models, a window of words on both sides of the target stripped word was used. The embedding models compare the similarity scores of the combined context word embeddings and the embeddings of each of the candidate variant vectors.
[90] Do not be greedy, Think Twice: Sampling and Selection for Document-level Information Extraction
Mikel Zubillaga,Oscar Sainz,Oier Lopez de Lacalle,Eneko Agirre
Main category: cs.CL
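TL;DR: This paper proposes ThinkTwice, a framework that samples multiple candidate templates instead of decoding greedily and then selects the best one with either an unsupervised (agreement-based) or a supervised (reward model) selector, substantially improving document-level information extraction.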
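One way to picture agreement-based selection among sampled outputs is to pick the candidate that overlaps most with the others. The sketch below does this with pairwise F1 over (slot, value) pairs. This is an assumption made for illustration: the entry does not specify the agreement measure ThinkTwice uses, and the template representation, function names, and toy candidates are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Template = FrozenSet[Tuple[str, str]]  # a template as a set of (slot, value) pairs


def f1(a: Template, b: Template) -> float:
    """F1 overlap between two templates; two empty templates count as a match."""
    if not a and not b:
        return 1.0
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(a), overlap / len(b)
    return 2 * precision * recall / (precision + recall)


def select_by_agreement(candidates: List[Template]) -> int:
    """Index of the candidate with the highest mean F1 against the others."""
    def mean_agreement(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(f1(candidates[i], c) for c in others) / max(len(others), 1)
    return max(range(len(candidates)), key=mean_agreement)


# Four sampled templates for one document (invented values).
samples = [
    frozenset({("attacker", "group A"), ("target", "embassy")}),
    frozenset({("attacker", "group A"), ("target", "embassy"), ("weapon", "bomb")}),
    frozenset({("attacker", "group B")}),
    frozenset({("attacker", "group A"), ("target", "embassy")}),
]
best = select_by_agreement(samples)
```

The outlier template that names "group B" agrees with nothing and is never chosen; the two-slot template that two samples share wins. No labeled data is needed, which is what makes this kind of selector unsupervised.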
Details
Motivation: Standard practice uses greedy decoding to avoid output variability, but the authors argue that this variability can be exploited to improve performance, especially with reasoning models. Method: Propose ThinkTwice: the LLM samples multiple candidate templates for the same document; selection modules are designed in an unsupervised form (agreement across outputs) and a supervised form (reward models); a rejection-sampling method generates silver training data with reasoning traces to ease the scarcity of gold data. Result: Experiments show that both unsupervised and supervised ThinkTwice consistently outperform greedy baselines and the state of the art. Conclusion: Sampling plus selection is more effective than greedy decoding alone; ThinkTwice offers a new approach for DocIE, and its silver data construction method can ease the bottleneck of annotating reasoning traces. Abstract: Document-level Information Extraction (DocIE) aims to produce an output template with the entities and relations of interest occurring in the given document. Standard practices include prompting decoder-only LLMs using greedy decoding to avoid output variability. Rather than treating this variability as a limitation, we show that sampling can produce substantially better solutions than greedy decoding, especially when using reasoning models. We thus propose ThinkTwice, a sampling and selection framework in which the LLM generates multiple candidate templates for a given document, and a selection module chooses the most suitable one. We introduce both an unsupervised method that exploits agreement across generated outputs, and a supervised selection method using reward models trained on labeled DocIE data. To address the scarcity of golden reasoning trajectories for DocIE, we propose a rejection-sampling-based method to generate silver training data that pairs output templates with reasoning traces. Our experiments show the validity of unsupervised and supervised ThinkTwice, consistently outperforming greedy baselines and the state-of-the-art.
[91] Pisets: A Robust Speech Recognition System for Lectures and Interviews
Ivan Bondarenko,Daniil Grebenkin,Oleg Sedukhin,Mikhail Klementev,Roman Derunets,Lyudmila Budneva
Main category: cs.CL
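TL;DR: This paper presents Pisets, a speech-to-text system designed for scientists and journalists. It uses a three-component architecture (Wav2Vec2, AST, and Whisper) together with curriculum learning, Russian-language corpora, and uncertainty modeling, and improves recognition accuracy and robustness on long audio under varied acoustic conditions, outperforming WhisperX and standard Whisper.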
Details
Motivation: Improve speech recognition accuracy and reduce the errors and hallucinations of the Whisper model, particularly to meet scientists' and journalists' need for highly reliable transcription. Method: Build a three-component architecture: primary recognition with Wav2Vec2, false-positive filtering with AST, and final recognition with Whisper; add curriculum learning, training on diverse Russian-language speech corpora, and advanced uncertainty modeling techniques. Result: Pisets is more robust on long audio across varied acoustic conditions, with transcription quality better than WhisperX and the original Whisper model. Conclusion: Through modular design and multiple optimization strategies, the system improves the accuracy and reliability of speech transcription in professional settings; the code is open source. Abstract: This work presents a speech-to-text system "Pisets" for scientists and journalists which is based on a three-component architecture aimed at improving speech recognition accuracy while minimizing errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition using Wav2Vec2, false positive filtering via the Audio Spectrogram Transformer (AST), and final speech recognition through Whisper. The implementation of curriculum learning methods and the utilization of diverse Russian-language speech corpora significantly enhanced the system's effectiveness. Additionally, advanced uncertainty modeling techniques were introduced, contributing to further improvements in transcription quality. The proposed approaches ensure robust transcribing of long audio data across various acoustic conditions compared to WhisperX and the usual Whisper model. The source code of the "Pisets" system is publicly available on GitHub: https://github.com/bond005/pisets.
[92] Latent Knowledge as a Predictor of Fact Acquisition in Fine-Tuned Large Language Models
Daniel B. Hier,Tayo Obafemi-Ajayi
Main category: cs.CL
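TL;DR: This paper studies how an LLM learns and generalizes biomedical ontology term mappings during fine-tuning, and finds that latent knowledge is a key factor in the speed of fact acquisition, in generalization, and in resistance to degradation.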
Details
Motivation: Investigate the uneven storage of biomedical facts in LLMs, in particular how latent knowledge (facts present in the weights but hard to access under deterministic decoding) affects fine-tuning learning dynamics. Method: Fine-tune Llama 3.1 8B Instruct for 20 epochs on term mapping tasks from the Human Phenotype Ontology (HPO) and the Gene Ontology (GO); use stochastic decoding to detect latent knowledge at baseline, and Cox proportional hazards models to analyze predictors of acquisition, generalization, and degradation. Result: HPO recall rose from 2.8% at baseline to 71.9%; latent knowledge was the strongest predictor of faster fact acquisition (HR = 2.6) and was associated with faster convergence; GO generalization was only 5.8% and depended on the presence of latent knowledge; correct mappings for unseen terms degraded more readily, whereas seen terms were more robust because training reinforced them. Conclusion: Latent knowledge predicts the speed of fact learning during fine-tuning and also limits generalization to unseen ontology facts, while whether a fact is reinforced during training determines how well it resists degradation. Abstract: Large language models store biomedical facts with uneven strength after pretraining: some facts are present in the weights but are not reliably accessible under deterministic decoding (latent knowledge), while others are scarcely represented. We fine-tuned Llama 3.1 8B Instruct to learn ontology term identifier mappings from the Human Phenotype Ontology (800 pairs) and the Gene Ontology (400 training pairs), withholding 400 GO pairs to test generalization. Treating learning as a time-to-event process across 20 epochs, we used stochastic decoding to detect latent knowledge at baseline and Cox proportional hazards models to identify predictors of acquisition, generalization, and degradation. Baseline deterministic recall for HPO was 2.8%, rising to 71.9% after fine-tuning. Latent knowledge was the strongest predictor of faster fact acquisition (HR 2.6) and was associated with earlier, higher peak learning rates and faster convergence; identifier frequency and curated annotation counts had smaller effects. Generalization to withheld GO facts was uncommon (5.8%) but more likely when latent knowledge was present. Previously correct GO mappings degraded more often for withheld (unseen) terms than for trained (seen) terms, suggesting a protective effect of reinforcement during training. These results show that latent knowledge predicts both the speed of factual learning during fine-tuning and the limited generalization of unseen ontology facts, while resistance to degradation depends on whether facts are reinforced.
[93] Funny or Persuasive, but Not Both: Evaluating Fine-Grained Multi-Concept Control in LLMs
Arya Labroo,Ivaxi Sheth,Vyas Raina,Amaani Ahmed,Mario Fritz
Main category: cs.CL
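TL;DR: This paper proposes a new framework for evaluating the fine-grained controllability of LLMs in single-concept and dual-concept settings, finds that performance often drops under dual-concept control, and exposes a fundamental limitation of prompting-based control with respect to concept compositionality.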
Details
Motivation: Existing methods lack systematic evaluation of fine-grained, multi-attribute control over textual concepts (such as humor, persuasiveness, and formality), and fall short of the explicit, fine-grained control that real applications require. Method: Build a systematic evaluation framework for single-concept and dual-concept controllability, focusing on linguistically distinct concept pairs, and test it empirically across multiple LLMs and generation tasks. Result: Even for concepts that are semantically separable, model performance generally drops in the dual-concept setting, indicating a fundamental weakness of current prompting methods in concept compositionality. Conclusion: The framework systematically exposes the gap in LLMs' multi-concept controllability and provides a principled benchmark for designing and evaluating future multi-concept control methods. Abstract: Large Language Models (LLMs) offer strong generative capabilities, but many applications require explicit and fine-grained control over specific textual concepts, such as humor, persuasiveness, or formality. Prior approaches in prompting and representation engineering can provide coarse or single-attribute control, but systematic evaluation of multi-attribute settings remains limited. We introduce an evaluation framework for fine-grained controllability for both single- and dual-concept scenarios, focusing on linguistically distinct concept pairs (e.g., persuasiveness vs. humor). Surprisingly, across multiple LLMs and generative tasks, we find that performance often drops in the dual-concept setting, even though the chosen concepts should in principle be separable. This reveals a fundamental limitation of naive prompting-based control: models struggle with compositionality even when concepts are intuitively independent. Our framework provides systematic evidence of this gap and offers a principled approach for measuring the ability of future methods for multi-concept control.
[94] Demographic Probing of Large Language Models Lacks Construct Validity
Manuel Tonneau,Neil K. R. Seghal,Niyati Malhotra,Victor Orozco-Olvera,Ana María Muñoz Boudet,Lakshmi Subramanian,Sharath Chandra Guntuku,Valentin Hofmann
Main category: cs.CL
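TL;DR: This paper questions the construct validity of demographic probing for assessing how LLMs respond to demographic attributes. It finds that a single demographic cue (such as a name or dialect) does not represent the same demographic group in a stable, consistent way, so measurements are unstable and vulnerable to linguistic confounders; it recommends using multiple ecologically valid cues and explicitly controlling for confounders.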
Details
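Motivation: Test whether demographic probing has construct validity, that is, whether different cues (such as names and dialects) consistently reflect the model's conditioned behavior toward the same demographic group (such as race or gender). Method: In realistic advice-seeking conversations, systematically use multiple cues intended to represent the same demographic groups (race and gender in a U.S. context); analyze the overlap of model responses, the differentiation between groups within a cue, and the stability of disparity estimates; and examine the influence of how strongly cues encode the attribute and of linguistic confounders. Result: Different cues for the same demographic group induce only partially overlapping changes in model behavior; differentiation between groups within a given cue is weak and uneven; estimated between-group disparities are unstable in both magnitude and direction across cues; this instability stems from differences in how strongly cues encode demographic attributes and from linguistic confounders that independently shape model behavior. Conclusion: Demographic probing lacks construct validity and cannot provide a single, stable characterization of how LLMs use demographic information; the problem may stem from a misspecified or fragmented construct; designs should use multiple, ecologically valid cues and control for confounders to support more reliable conclusions. Abstract: Demographic probing is widely used to study how large language models (LLMs) adapt their behavior to signaled demographic attributes. This approach typically uses a single demographic cue in isolation (e.g., a name or dialect) as a signal for group membership, implicitly assuming strong construct validity: that such cues are interchangeable operationalizations of the same underlying, demographically conditioned behavior. We test this assumption in realistic advice-seeking interactions, focusing on race and gender in a U.S. context. We find that cues intended to represent the same demographic group induce only partially overlapping changes in model behavior, while differentiation between groups within a given cue is weak and uneven. Consequently, estimated disparities are unstable, with both magnitude and direction varying across cues. We further show that these inconsistencies partly arise from variation in how strongly cues encode demographic attributes and from linguistic confounders that independently shape model behavior. Together, our findings suggest that demographic probing lacks construct validity: it does not yield a single, stable characterization of how LLMs condition on demographic information, which may reflect a misspecified or fragmented construct. We conclude by recommending the use of multiple, ecologically valid cues and explicit control of confounders to support more defensible claims about demographic effects in LLMs.
[95] Using Large Language Models to Construct Virtual Top Managers: A Method for Organizational Research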
Antonio Garzon-Vico,Krithika Sharon Komalapati,Arsalan Shahid,Jan Rosier
Main category: cs.CL
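TL;DR: This study proposes a methodological framework that uses LLMs to build virtual personas of real CEOs, grounded in real CEO communications and Moral Foundations Theory, and shows that the personas' moral judgments are consistent with those of humans, making them a credible tool that can substitute for or complement real executives in organizational research.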
Details
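Motivation: Address the difficulty of gaining direct access to top managers in organizational research, and explore whether LLMs can be used to build theoretically grounded virtual executive personas. Method: Build LLM-driven virtual CEO personas from real CEO communications and Moral Foundations Theory, and benchmark them against human participants in a three-phase study (construct validity, reliability, behavioral fidelity). Result: Theoretically scaffolded personas approximate the moral judgments of human samples well, indicating that LLM-based virtual CEOs are behaviorally credible and suitable for research. Conclusion: LLM-based virtual executive personas can serve as a credible, complementary tool for obtaining executive perspectives in organizational research, especially where real executives are hard to reach; future work can extend their use to organizational behavior, strategic decision-making, and related areas. Abstract: This study introduces a methodological framework that uses large language models to create virtual personas of real top managers. Drawing on real CEO communications and Moral Foundations Theory, we construct LLM-based participants that simulate the decision-making of individual leaders. Across three phases, we assess construct validity, reliability, and behavioral fidelity by benchmarking these virtual CEOs against human participants. Our results indicate that theoretically scaffolded personas approximate the moral judgements observed in human samples, suggesting that LLM-based personas can serve as credible and complementary tools for organizational research in contexts where direct access to executives is limited. We conclude by outlining implications for future research using LLM-based personas in organizational settings.
[96] GenAI for Social Work Field Education: Client Simulation with Real-Time Feedback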
James Sungarda,Hongkai Liu,Zilong Zhou,Tien-Hsuan Wu,Johnson Chun-Sing Cheung,Ben Kao
Main category: cs.CL
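TL;DR: This paper presents SWITCH, an interactive chatbot for social work training that provides timely, objective feedback and improves training efficiency through realistic client simulation, real-time counseling skill classification, and a Motivational Interviewing (MI) progression system.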
Details
Motivation: In social work field education, the limited availability of instructors and counseling clients constrains timely and objective feedback. Method: SWITCH combines client simulation based on a cognitive model, a real-time counseling skill classification module, and an MI stage control system; it uses retrieval-augmented in-context learning and a fine-tuned BERT multi-label classifier to improve skill recognition accuracy. Result: Experiments show that both the BERT-based approach and in-context learning outperform the baseline by a large margin. Conclusion: SWITCH provides a scalable, low-cost, and consistent training workflow that complements field education and lets supervisors focus on higher-level mentorship. Abstract: Field education is the signature pedagogy of social work, yet providing timely and objective feedback during training is constrained by the availability of instructors and counseling clients. In this paper, we present SWITCH, the Social Work Interactive Training Chatbot. SWITCH integrates realistic client simulation, real-time counseling skill classification, and a Motivational Interviewing (MI) progression system into the training workflow. To model a client, SWITCH uses a cognitively grounded profile comprising static fields (e.g., background, beliefs) and dynamic fields (e.g., emotions, automatic thoughts, openness), allowing the agent's behavior to evolve realistically throughout a session. The skill classification module identifies the counseling skills in the user's utterances and feeds the result to the MI controller, which regulates the MI stage transitions. To enhance classification accuracy, we study in-context learning with retrieval over annotated transcripts, and a fine-tuned BERT multi-label classifier. In the experiments, we demonstrate that both the BERT-based approach and in-context learning outperform the baseline by a large margin. SWITCH thereby offers a scalable, low-cost, and consistent training workflow that complements field education, and allows supervisors to focus on higher-level mentorship.
[97] Exploring Fine-Tuning for In-Context Retrieval and Efficient KV-Caching in Long-Context Language Models
Francesco Maria Molfese,Momchil Hardalov,Rexhina Blloshmi,Bill Byrne,Adrià de Gispert
Main category: cs.CL
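TL;DR: This paper studies how fine-tuning strategies can improve the ability of long-context language models (LCLMs) to identify and use information in long texts, as well as their robustness under KV-cache compression. Experiments show that fine-tuning yields substantial in-domain gains (up to +20 points), while out-of-domain generalization varies by task, and that it brings some robustness gains under KV-cache compression.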
Details
Motivation: Although LCLMs can process contexts of millions of tokens, it is unclear whether existing fine-tuning strategies can improve their long-context understanding and their robustness under KV-cache compression. Method: Systematically study how different training strategies affect LCLMs' ability to identify and use long-context information and their robustness under KV-cache compression, with comparative experiments across multiple domains and tasks. Result: Fine-tuning brings substantial in-domain gains (up to +20 points); out-of-domain performance is task dependent: finance question answering improves by +9 points, while RAG outperforms LCLMs on multiple-choice questions (+6 points); robustness under KV-cache compression improves moderately, with gains varying across tasks. Conclusion: Fine-tuning can strengthen LCLMs' long-context ability and part of their compression robustness, but out-of-domain generalization is limited and uneven, so the choice between LCLM and RAG should depend on the task. Abstract: With context windows of millions of tokens, Long-Context Language Models (LCLMs) can encode entire document collections, offering a strong alternative to conventional retrieval-augmented generation (RAG). However, it remains unclear whether fine-tuning strategies can improve long-context performance and translate to greater robustness under KV-cache compression techniques. In this work, we investigate which training strategies most effectively enhance LCLMs' ability to identify and use relevant information, as well as their robustness under KV-cache compression. Our experiments show substantial in-domain improvements, achieving gains of up to +20 points over the base model. However, out-of-domain generalization remains task dependent with large variance: LCLMs excel on finance questions (+9 points), while RAG shows stronger performance on multiple-choice questions (+6 points) over the baseline models. Finally, we show that our fine-tuning approaches bring moderate improvements in robustness under KV-cache compression, with gains varying across tasks.
[98] From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation
Yuxin Jiang,Yufei Wang,Qiyuan Zhang,Xingshan Zeng,Liangyou Li,Jierun Chen,Chaofan Tao,Haoli Bai,Lifeng Shang
Main category: cs.CL
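TL;DR: This paper proposes RLVRR, which addresses the lack of an unambiguous ground truth in open-ended generation by extracting an ordered linguistic signal (a reward chain) from high-quality reference texts. It decomposes the reward into content and style dimensions, combining the exploratory strength of reinforcement learning with the efficiency and reliability of supervised fine-tuning.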
Details
Motivation: Reinforcement learning with verifiable rewards (RLVR) works well on reasoning tasks such as math and code, but in open-ended generation it suffers from inefficiency and reward hacking because there is no unambiguous ground truth. Method: Propose reinforcement learning with verifiable reference-based rewards (RLVRR), which extracts a reward chain from high-quality references and decomposes the reward into two dimensions: content (preserving deterministic core concepts such as keywords) and style (checking stylistic adherence through LLM-based verification). Result: On more than 10 benchmarks, RLVRR substantially outperforms SFT trained with ten times more data as well as advanced reward models; it unifies training for structured reasoning and open-ended generation; it generalizes better while preserving output diversity. Conclusion: RLVRR offers a principled and efficient path toward verifiable reinforcement learning for general-purpose LLM alignment. Abstract: Reinforcement learning with verifiable rewards (RLVR) succeeds in reasoning tasks (e.g., math and code) by checking the final verifiable answer (i.e., a verifiable dot signal). However, extending this paradigm to open-ended generation is challenging because there is no unambiguous ground truth. Relying on single-dot supervision often leads to inefficiency and reward hacking. To address these issues, we propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., a reward chain). Specifically, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts (e.g., keywords), and style, which evaluates adherence to stylistic properties through LLM-based verification. In this way, RLVRR combines the exploratory strength of RL with the efficiency and reliability of supervised fine-tuning (SFT). Extensive experiments on more than 10 benchmarks with Qwen and Llama models confirm the advantages of our approach. RLVRR (1) substantially outperforms SFT trained with ten times more data and advanced reward models, (2) unifies the training of structured reasoning and open-ended generation, and (3) generalizes more effectively while preserving output diversity. These results establish RLVRR as a principled and efficient path toward verifiable reinforcement learning for general-purpose LLM alignment. We release our code and data at https://github.com/YJiangcm/RLVRR.
[99] Evaluating Morphological Plausibility of Subword Tokenization via Statistical Alignment with Morpho-Syntactic Features
Abishek Stephen,Jindřich Libovický
Main category: cs.CL
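TL;DR: This paper proposes a new metric, based on morpho-syntactic features, for evaluating the morphological plausibility of subword segmentation. It does not depend on gold segmentation data, which is missing or of uneven quality for many languages; instead it probabilistically aligns subwords with morphological features using IBM Model 1. The metric applies to a wider range of languages and correlates well with traditional boundary recall.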
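For readers who have not seen IBM Model 1 outside machine translation, the sketch below runs its standard EM updates on (features, subwords) pairs to estimate t(subword | feature). It illustrates the alignment step only; how the paper turns alignment probabilities into a metric is not described in this entry. The toy data, the use of a lemma pseudo-feature, and the omission of a NULL token are assumptions made to keep the example short.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Pair = Tuple[List[str], List[str]]  # (morphological features, subwords) for one word


def ibm_model1(pairs: List[Pair], iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Estimate t(subword | feature) with the standard IBM Model 1 EM updates."""
    subwords = {s for _, subs in pairs for s in subs}
    # Uniform initialization over the subword vocabulary.
    t: Dict[Tuple[str, str], float] = defaultdict(lambda: 1.0 / len(subwords))
    for _ in range(iterations):
        count: Dict[Tuple[str, str], float] = defaultdict(float)
        total: Dict[str, float] = defaultdict(float)
        for feats, subs in pairs:
            for s in subs:
                # E-step: share each subword among the features of its word.
                z = sum(t[(s, f)] for f in feats)
                for f in feats:
                    frac = t[(s, f)] / z
                    count[(s, f)] += frac
                    total[f] += frac
        # M-step: renormalize expected counts per feature.
        for (s, f), c in count.items():
            t[(s, f)] = c / total[f]
    return dict(t)


# Toy corpus: "cats", "dogs", "cat" split into subwords (invented segmentation).
pairs = [
    (["LEMMA=cat", "Number=Plur"], ["cat", "s"]),
    (["LEMMA=dog", "Number=Plur"], ["dog", "s"]),
    (["LEMMA=cat"], ["cat"]),
]
t = ibm_model1(pairs)
```

After a few iterations the probability mass for `Number=Plur` concentrates on the subword `s`, because `s` is the only subword that co-occurs with that feature in every plural word.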
Details
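Motivation: Traditional metrics such as morpheme boundary F-score depend on high-quality, cross-lingually consistent gold segmentation data, which is unavailable or of uneven quality for many languages; a more general, resource-friendly alternative is needed. Method: Use widely available morpho-syntactic features from resources such as Universal Dependencies or UniMorph, and probabilistically align subwords with morphological features using IBM Model 1, yielding a morphological plausibility metric that needs no gold segmentation. Result: In experiments the metric correlates well with traditional morpheme boundary recall and applies to more languages and to different morphological systems. Conclusion: The proposed metric is a more robust, scalable way to evaluate the morphology of subword segmentation; it reduces dependence on scarce gold data and makes cross-lingual evaluation more feasible and consistent. Abstract: We present a novel metric for the evaluation of the morphological plausibility of subword segmentation. Unlike the typically used morpheme boundary or retrieval F-score, which requires gold segmentation data that is either unavailable or of inconsistent quality across many languages, our approach utilizes morpho-syntactic features. These are available in resources such as Universal Dependencies or UniMorph for a much wider range of languages. The metric works by probabilistically aligning subwords with morphological features through IBM Model 1. Our experiments show that the metric correlates well with traditional morpheme boundary recall while being more broadly applicable across languages with different morphological systems.
[100] Unknown Unknowns: Why Hidden Intentions in LLMs Evade Detection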
Devansh Srivastav,David Pape,Lea Schönherr
Main category: cs.CL
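TL;DR: This paper proposes a systematic framework for analyzing "hidden intentions" in LLMs. It defines a taxonomy of ten categories of hidden intentions, shows that they are extremely hard to detect in open-world settings, demonstrates empirically that existing detection methods fail under realistic conditions such as low prevalence, and stresses the urgent need for robust governance and evaluation frameworks.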
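The claim that detection breaks down at low prevalence follows from Bayes' rule, and a short calculation makes it tangible. The sketch below is a generic base-rate computation, not the paper's stress test; the detector's true positive rate of 0.9 and false positive rate of 0.05 are invented numbers.

```python
def precision(prevalence: float, tpr: float, fpr: float) -> float:
    """Share of flagged outputs that truly contain a hidden intention (Bayes' rule)."""
    true_pos = tpr * prevalence            # flagged and actually positive
    false_pos = fpr * (1.0 - prevalence)   # flagged but actually benign
    return true_pos / (true_pos + false_pos)


# Same hypothetical detector, three prevalence levels.
for prev in (0.5, 0.01, 0.001):
    value = precision(prev, tpr=0.9, fpr=0.05)
    print(f"prevalence={prev:<6} precision={value:.3f}")
```

With these assumed rates, precision falls from about 0.95 at 50% prevalence to about 0.15 at 1% and below 0.02 at 0.1%, so nearly every flag is a false alarm unless the false positive rate is driven close to zero. The false negative rate is simply `1 - tpr`, so lowering the threshold to catch more cases raises `fpr` and worsens this further.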
Details
Motivation: LLMs are increasingly involved in everyday decision-making, but their outputs may contain hard-to-notice, goal-directed "hidden intentions" that can arise from training artefacts or deliberate implantation, and there is no systematic way to identify and evaluate them. Method: Build a ten-category taxonomy of hidden intentions grounded in social science; induce hidden intentions in controlled models to create testbeds; systematically evaluate several LLM judges (reasoning and non-reasoning) under open-world conditions; run stress tests on the precision-prevalence and precision-FNR trade-offs; and conduct a qualitative case study of deployed models. Result: Current detection methods fail badly in open-world settings, especially at low prevalence (false positives overwhelm precision and false negatives conceal risk); stress tests show that auditing works only with vanishingly small false positive rates or strong prior assumptions; the qualitative study confirms that all ten categories appear in deployed, state-of-the-art LLMs. Conclusion: Hidden intentions are a key blind spot for LLM safety and trustworthiness; because they are easy to induce, covert, and widespread, analysis must move beyond surface-level behaviour to design-level strategies of influence; the paper provides the first systematic evidence of detectability failures for hidden intentions under open-world settings, and offers a taxonomic foundation and empirical benchmark for future governance, evaluation, and defence. Abstract: LLMs are increasingly embedded in everyday decision-making, yet their outputs can encode subtle, unintended behaviours that shape user beliefs and actions. We refer to these covert, goal-directed behaviours as hidden intentions, which may arise from training and optimisation artefacts, or be deliberately induced by an adversarial developer, yet remain difficult to detect in practice. We introduce a taxonomy of ten categories of hidden intentions, grounded in social science research and organised by intent, mechanism, context, and impact, shifting attention from surface-level behaviours to design-level strategies of influence. We show how hidden intentions can be easily induced in controlled models, providing both testbeds for evaluation and demonstrations of potential misuse. We systematically assess detection methods, including reasoning and non-reasoning LLM judges, and find that detection collapses in realistic open-world settings, particularly under low-prevalence conditions, where false positives overwhelm precision and false negatives conceal true risks. Stress tests on precision-prevalence and precision-FNR trade-offs reveal why auditing fails without vanishingly small false positive rates or strong priors on manipulation types. Finally, a qualitative case study shows that all ten categories manifest in deployed, state-of-the-art LLMs, emphasising the urgent need for robust frameworks. Our work provides the first systematic analysis of detectability failures of hidden intentions in LLMs under open-world settings, offering a foundation for understanding, inducing, and stress-testing such behaviours, and establishing a flexible taxonomy for anticipating evolving threats and informing governance.
[101] One Persona, Many Cues, Different Results: How Sociodemographic Cues Impact LLM Personalization
Franziska Weeber,Vera Neplenbroek,Jan Batzner,Sebastian Padó
Main category: cs.CL
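TL;DR: This paper studies how different persona cues (such as names or explicit attribute mentions) used to personalize LLMs by sociodemographic subgroup affect variation in model outputs. It finds that relying on a single cue easily yields unreliable results and recommends that future research evaluate multiple externally valid cues.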
Details
Motivation: Existing studies mostly rely on a single sociodemographic cue (such as a name) to construct a persona for evaluating LLM bias, overlooking the model's sensitivity to prompt variations (robustness) and the rarity of some cues in real interactions (external validity). Method: Systematically compare six commonly used persona cues across seven open and proprietary LLMs on four writing and advice tasks, and analyze response differences and correlations between cues. Result: The six cues are highly correlated overall but still produce substantial variance in responses across personas; a single cue is not enough to support robust conclusions about personalization bias. Conclusion: Bias or fairness conclusions drawn from a single persona cue should be treated with caution, and future personalization research should evaluate multiple externally valid cues. Abstract: Personalization of LLMs by sociodemographic subgroup often improves user experience, but can also introduce or amplify biases and unfair outcomes across groups. Prior work has employed so-called personas, sociodemographic user attributes conveyed to a model, to study bias in LLMs by relying on a single cue to prompt a persona, such as user names or explicit attribute mentions. This disregards LLM sensitivity to prompt variations (robustness) and the rarity of some cues in real interactions (external validity). We compare six commonly used persona cues across seven open and proprietary LLMs on four writing and advice tasks. While cues are overall highly correlated, they produce substantial variance in responses across personas. We therefore caution against claims from a single persona cue and recommend future personalization research to evaluate multiple externally valid cues.
[102] From Classification to Ranking: Enhancing LLM Reasoning Capabilities for MBTI Personality Detection
Yuan Cao,Feixiang Liu,Xinyue Wang,Yihan Zhu,Hui Xu,Zheng Wang,Qiang Qiu
Main category: cs.CL
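TL;DR: This paper treats personality detection as a ranking task and uses a reinforcement learning training paradigm (supervised fine-tuning followed by Group Relative Policy Optimization) to improve LLMs' ability to rank personality traits, substantially improving performance on multiple personality detection benchmarks.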
Details
Motivation: Existing prompt-based methods depend on expert knowledge and lack autonomous pattern learning, and the boundaries between personality traits are blurred and highly subjective, which limits classification performance. Method: Models personality detection as a ranking task; first uses supervised fine-tuning (SFT) to build ranking ability with standardized output formats; then applies Group Relative Policy Optimization (GRPO) with a custom ranking-based reward function for reinforcement learning. Result: Achieves state-of-the-art performance on multiple personality detection benchmarks. Conclusion: Moving personality detection from classification to ranking, combined with reinforcement learning, models the subjectivity and continuity of personality traits more effectively and markedly improves detection. Abstract: Personality detection aims to measure an individual's corresponding personality traits through their social media posts. The advancements in Large Language Models (LLMs) offer novel perspectives for personality detection tasks. Existing approaches enhance personality trait analysis by leveraging LLMs to extract semantic information from textual posts as prompts, followed by training classifiers for categorization. However, accurately classifying personality traits remains challenging due to the inherent complexity of human personality and subtle inter-trait distinctions. Moreover, prompt-based methods often exhibit excessive dependency on expert-crafted knowledge without autonomous pattern-learning capacity. To address these limitations, we view personality detection as a ranking task rather than a classification task and propose a corresponding reinforcement learning training paradigm. First, we employ supervised fine-tuning (SFT) to establish personality trait ranking capabilities while enforcing standardized output formats, creating a robust initialization. Subsequently, we introduce Group Relative Policy Optimization (GRPO) with a specialized ranking-based reward function. Unlike verification tasks with definitive solutions, personality assessment involves subjective interpretations and blurred boundaries between trait categories. Our reward function explicitly addresses this challenge by training LLMs to learn optimal answer rankings. Comprehensive experiments have demonstrated that our method achieves state-of-the-art performance across multiple personality detection benchmarks.
[103] Gained in Translation: Privileged Pairwise Judges Enhance Multilingual Reasoning
Lintang Sutawika,Gokul Swamy,Zhiwei Steven Wu,Graham Neubig
Main category: cs.CL
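TL;DR: This paper proposes SP3F, a two-stage framework (supervised fine-tuning followed by self-play reinforcement learning with privileged information) that improves the multilingual reasoning of large language models without any target-language data and yields large performance gains.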
Details
Motivation: Current reasoning LLMs perform far worse in low-resource languages than in English, so methods that improve multilingual reasoning without target-language data are needed. Method: Proposes SP3F: stage one performs supervised fine-tuning on translated English question-answer pairs; stage two performs reinforcement learning in a self-play fashion with a pairwise judge that receives the English reference response as privileged information. Result: SP3F greatly improves base model performance in the single-language, multilingual, and unseen-language generalization settings, outperforming fully post-trained models on both math and non-math tasks while using less training data. Conclusion: SP3F shows that privileged feedback is effective for improving multilingual reasoning without target-language data, offering a new paradigm for reasoning in low-resource languages. Abstract: When asked a question in a language less seen in its training data, current reasoning large language models (RLMs) often exhibit dramatically lower performance than when asked the same question in English. In response, we introduce SP3F (Self-Play with Privileged Pairwise Feedback), a two-stage framework for enhancing multilingual reasoning without any data in the target language(s). First, we supervise fine-tune (SFT) on translated versions of English question-answer pairs to raise base model correctness. Second, we perform RL with feedback from a pairwise judge in a self-play fashion, with the judge receiving the English reference response as privileged information. Thus, even when none of the model's responses are completely correct, the privileged pairwise judge can still tell which response is better. End-to-end, SP3F greatly improves base model performance, even outperforming fully post-trained models on multiple math and non-math tasks with less than of the training data across the single-language, multilingual, and generalization to unseen language settings.
[104] HalluCitation Matters: Revealing the Impact of Hallucinated References with 300 Hallucinated Papers in ACL Conferences
Yusuke Sakai,Hidetaka Kamigaito,Taro Watanabe
Main category: cs.CL
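TL;DR: This paper systematically studies the prevalence and impact of hallucinated citations ("HalluCitation") in academic papers. It finds that nearly 300 papers at ACL, NAACL, and EMNLP in 2024 and 2025 contain them, with EMNLP 2025 the most affected; more than 100 of those are accepted main conference or Findings papers, which seriously threatens academic credibility.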
Details
Motivation: Hallucinated citations that correspond to no real publication have appeared frequently of late, damaging scientific reliability and the credibility of conferences; their scale and impact need systematic assessment. Method: Empirically analyses all main conference, Findings, and workshop papers published at ACL, NAACL, and EMNLP in 2024 and 2025, identifying HalluCitations and counting them by number, distribution, and acceptance status. Result: Nearly 300 papers contain at least one HalluCitation, most of them from 2025; EMNLP 2025 accounts for the largest share (about half), including more than 100 accepted main conference or Findings papers. Conclusion: The HalluCitation problem is growing quickly and already reaches accepted papers at the most recent top venues; the community needs detection mechanisms and publication norms to protect academic integrity. Abstract: Recently, we have often observed hallucinated citations or references that do not correspond to any existing work in papers under review, preprints, or published papers. Such hallucinated citations pose a serious concern to scientific reliability. When they appear in accepted papers, they may also negatively affect the credibility of conferences. In this study, we refer to hallucinated citations as "HalluCitation" and systematically investigate their prevalence and impact. We analyze all papers published at ACL, NAACL, and EMNLP in 2024 and 2025, including main conference, Findings, and workshop papers. Our analysis reveals that nearly 300 papers contain at least one HalluCitation, most of which were published in 2025. Notably, half of these papers were identified at EMNLP 2025, the most recent conference, indicating that this issue is rapidly increasing. Moreover, more than 100 such papers were accepted as main conference and Findings papers at EMNLP 2025, affecting the credibility.
[105] Reflect: Transparent Principle-Guided Reasoning for Constitutional Alignment at Scale
Henry Bell,Caroline Zhang,Mohammed Mobasserul Haque,Dhaval Potdar,Samia Zaman,Brandon Fain
Main category: cs.CL
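TL;DR: This paper proposes REFLECT, an inference-time framework that aligns large language models with value principles through in-context self-evaluation, self-critique, and revision, without any training or annotated data.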
Details
Motivation: Existing parameter fine-tuning methods (such as RLHF) are computationally expensive, depend on human annotation, and are complex to engineer; a lightweight, plug-and-play, training-free alignment method is needed. Method: REFLECT is a purely inference-time, fully in-context method with four steps: (i) a constitution-conditioned base response, (ii) self-evaluation, (iii)(a) self-critique, and (iii)(b) final revision; throughout, it reasons explicitly over the natural-language principles. Result: REFLECT significantly improves conformance to diverse and complex principles, and in particular reduces rare but serious violations in the tail; it does not harm factual reasoning; and it automatically generates useful training data for later parameter fine-tuning. Conclusion: REFLECT offers an efficient, transparent, safe, and scalable paradigm for constitutional alignment that is plug-and-play and also suited to long-term deployment. Abstract: The constitutional framework of alignment aims to align large language models (LLMs) with value-laden principles written in natural language (such as to avoid using biased language). Prior work has focused on parameter fine-tuning techniques, such as reinforcement learning from human feedback (RLHF), to instill these principles. However, these approaches are computationally demanding, require careful engineering and tuning, and often require difficult-to-obtain human annotation data. We propose REFLECT, an inference-time framework for constitutional alignment that does not require any training or data, providing a plug-and-play approach for aligning an instruction-tuned model to a set of principles. REFLECT operates entirely in-context, combining a (i) constitution-conditioned base response with post-generation (ii) self-evaluation, (iii)(a) self-critique, and (iii)(b) final revision. REFLECT's technique of explicit in-context reasoning over principles during post-generation outperforms standard few-shot prompting and provides transparent reasoning traces. Our results demonstrate that REFLECT significantly improves LLM conformance to diverse and complex principles, including principles quite distinct from those emphasized in the model's original parameter fine-tuning, without sacrificing factual reasoning. REFLECT is particularly effective at reducing the rate of rare but significant violations of principles, thereby improving safety and robustness in the tail end of the distribution of generations. 
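The four in-context stages listed in this abstract can be sketched as a chain of prompts. This is only a rough illustration under stated assumptions: the function name `reflect_pipeline`, the prompt wording, and the choice to run every stage unconditionally are hypothetical and are not the authors' implementation; `model` stands for any text-in, text-out callable.

```python
from typing import Callable, Dict


def reflect_pipeline(model: Callable[[str], str], constitution: str, query: str) -> Dict[str, str]:
    """Run four in-context stages once and return every intermediate trace.

    All prompt templates below are hypothetical placeholders.
    """
    # (i) Base response conditioned on the constitution.
    base = model(
        f"Principles:\n{constitution}\n\nUser request:\n{query}\n\n"
        "Respond while following the principles."
    )
    # (ii) Self-evaluation of the base response against the principles.
    evaluation = model(
        f"Principles:\n{constitution}\n\nResponse:\n{base}\n\n"
        "Assess whether the response follows each principle."
    )
    # (iii)(a) Self-critique that turns the evaluation into concrete changes.
    critique = model(
        f"Response:\n{base}\n\nEvaluation:\n{evaluation}\n\n"
        "Describe what should change to better follow the principles."
    )
    # (iii)(b) Final revision that applies the critique.
    revision = model(
        f"Principles:\n{constitution}\n\nResponse:\n{base}\n\nCritique:\n{critique}\n\n"
        "Rewrite the response so that it addresses the critique."
    )
    return {"base": base, "evaluation": evaluation, "critique": critique, "revision": revision}
```

Returning every intermediate string, rather than only the revision, mirrors the abstract's point that the reasoning traces stay transparent and inspectable.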
Finally, we show that REFLECT naturally generates useful training data for traditional parameter fine-tuning techniques, allowing for efficient scaling and the reduction of inference-time computational overhead in long-term deployment scenarios.
[106] One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment
Hongru Cai,Yongqi Li,Tiezheng Yu,Fengbin Zhu,Wenjie Wang,Fuli Feng,Wenjie Li
Main category: cs.CL
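TL;DR: This paper proposes Meta Reward Modeling (MRM), which treats personalized reward modeling as a meta-learning problem: a MAML-style framework optimizes the initialization of the weights over base reward functions, and a Robust Personalization Objective (RPO) improves adaptation to hard-to-learn users, giving more robust and efficient personalized alignment from limited feedback.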
Details
Motivation: Personalized alignment relies on personalized reward models that capture user preferences, but faces two challenges: feedback from individual users is scarce, and new users must be adapted to efficiently. Method: Proposes Meta Reward Modeling (MRM), which represents each user's reward model as a weighted combination of base reward functions and optimizes the weight initialization with MAML-style meta-learning; introduces the Robust Personalization Objective (RPO), which puts more emphasis on hard-to-learn users during meta-optimization. Result: Extensive experiments on personalized preference datasets show that MRM improves few-shot personalization and user robustness and consistently outperforms baselines. Conclusion: By shifting from fitting preference data to learning the process of preference adaptation, MRM achieves more efficient and robust personalized reward modeling and offers a new paradigm for personalized LLM alignment. Abstract: Alignment of Large Language Models (LLMs) aims to align outputs with human preferences, and personalized alignment further adapts models to individual users. This relies on personalized reward models that capture user-specific preferences and automatically provide individualized feedback. However, developing these models faces two critical challenges: the scarcity of feedback from individual users and the need for efficient adaptation to unseen users. We argue that addressing these constraints requires a paradigm shift from fitting data to learn user preferences to learning the process of preference adaptation. To realize this, we propose Meta Reward Modeling (MRM), which reformulates personalized reward modeling as a meta-learning problem. Specifically, we represent each user's reward model as a weighted combination of base reward functions, and optimize the initialization of these weights using a Model-Agnostic Meta-Learning (MAML)-style framework to support fast adaptation under limited feedback. To ensure robustness, we introduce the Robust Personalization Objective (RPO), which places greater emphasis on hard-to-learn users during meta optimization. Extensive experiments on personalized preference datasets validate that MRM enhances few-shot personalization, improves user robustness, and consistently outperforms baselines.
[107] Dep-Search: Learning Dependency-Aware Reasoning Traces with Persistent Memory
Yanming Liu,Xinyue Peng,Zixuan Yan,Yanxin Shen,Wenjie Xu,Yuefeng Huang,Xinyi Wang,Jiannan Cao,Jianwei Yin,Xuhong Zhang
Main category: cs.CL
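TL;DR: This paper proposes Dep-Search, a framework that integrates dependency-aware structured reasoning, explicit retrieval control, and a persistent memory mechanism through GRPO, and substantially improves LLM performance on complex multi-hop question answering.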
Details
Motivation: Existing search-based reasoning frameworks rely too heavily on implicit natural-language reasoning to decide search strategies and how to use retrieved information, which makes sub-question dependencies hard to manage, knowledge reuse inefficient, and reinforcement learning of search strategies hard to optimize. Method: Proposes Dep-Search, which integrates structured reasoning, explicit retrieval control, and persistent memory through GRPO, supporting question decomposition, on-demand retrieval, memory access, and summarization of long contexts into memory entries. Result: Experiments on seven diverse question answering datasets show that Dep-Search significantly outperforms strong baselines across model scales. Conclusion: By modelling dependencies explicitly and adding structured control, Dep-Search overcomes the limitations of implicit reasoning and provides a more robust, learnable, and reusable search framework for complex reasoning tasks. Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks, particularly when augmented with search mechanisms that enable systematic exploration of external knowledge bases. The field has evolved from traditional retrieval-augmented generation (RAG) frameworks to more sophisticated search-based frameworks that orchestrate multi-step reasoning through explicit search strategies. However, existing search frameworks still rely heavily on implicit natural language reasoning to determine search strategies and how to leverage retrieved information across reasoning steps. This reliance on implicit reasoning creates fundamental challenges for managing dependencies between sub-questions, efficiently reusing previously retrieved knowledge, and learning optimal search strategies through reinforcement learning. To address these limitations, we propose Dep-Search, a dependency-aware search framework that advances beyond existing search frameworks by integrating structured reasoning, retrieval, and persistent memory through GRPO. Dep-Search introduces explicit control mechanisms that enable the model to decompose questions with dependency relationships, retrieve information when needed, access previously stored knowledge from memory, and summarize long reasoning contexts into reusable memory entries. 
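The idea of resolving sub-questions in dependency order while reusing stored answers can be sketched with a topological sort. This is a toy illustration, not the Dep-Search implementation: `solve_with_dependencies` and `answer_fn` are hypothetical names, and `answer_fn` stands in for whatever retrieval and generation step a real system would use. It relies on `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List


def solve_with_dependencies(
    sub_questions: Dict[str, List[str]],
    answer_fn: Callable[[str, Dict[str, str]], str],
) -> Dict[str, str]:
    """Answer sub-questions in dependency order, storing each result in memory.

    sub_questions maps each sub-question to the sub-questions it depends on.
    answer_fn is a hypothetical stand-in for retrieval plus generation; it
    receives the sub-question and the memory entries of its dependencies.
    """
    memory: Dict[str, str] = {}
    # static_order() yields every node after all of its predecessors.
    for question in TopologicalSorter(sub_questions).static_order():
        context = {dep: memory[dep] for dep in sub_questions.get(question, [])}
        memory[question] = answer_fn(question, context)
    return memory


# Toy plan: the second sub-question needs the answer to the first.
plan = {
    "Who directed film X?": [],
    "Where was that director born?": ["Who directed film X?"],
}


def toy_answer(question: str, context: Dict[str, str]) -> str:
    return f"answer to '{question}' using {len(context)} memory entries"


memory = solve_with_dependencies(plan, toy_answer)
```

Keeping answers in a dictionary keyed by sub-question is the simplest possible form of persistent memory; the summarization of long contexts into memory entries described in the abstract is not modelled here.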
Through extensive experiments on seven diverse question answering datasets, we demonstrate that Dep-Search significantly enhances LLMs' ability to tackle complex multi-hop reasoning tasks, achieving substantial improvements over strong baselines across different model scales.
[108] Unsupervised Text Segmentation via Kernel Change-Point Detection on Sentence Embeddings
Mumin Jia,Jairo Diaz-Rodriguez
Main category: cs.CL
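TL;DR: This paper proposes Embed-KCPD, a training-free unsupervised text segmentation method that detects boundaries from sentence embeddings with a penalized KCPD objective, develops the first dependence-aware theory for KCPD under m-dependent sequences, and designs an LLM-based simulation framework to validate the theoretical predictions.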
Details
Motivation: Unsupervised text segmentation matters because boundary labels are expensive and subjective and transfer poorly across domains and granularities. Method: Proposes Embed-KCPD: sentences are represented as embedding vectors and boundaries are estimated by minimizing a penalized KCPD (kernel change-point detection) objective; develops a KCPD theory for m-dependent sequences; designs an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence. Result: On the theory side, proves an oracle inequality for the population penalized risk and a localization guarantee (each true change point is recovered within a window that is small relative to segment length); experiments show that Embed-KCPD often outperforms strong unsupervised baselines on standard segmentation benchmarks; a case study on Taylor Swift's tweets illustrates its theoretical guarantees, simulated reliability, and practical effectiveness. Conclusion: Embed-KCPD is an unsupervised text segmentation method with solid theoretical grounding, verifiable reliability, and practical performance, advancing both the theory and the application of KCPD in natural language processing. Abstract: Unsupervised text segmentation is crucial because boundary labels are expensive, subjective, and often fail to transfer across domains and granularity choices. We propose Embed-KCPD, a training-free method that represents sentences as embedding vectors and estimates boundaries by minimizing a penalized KCPD objective. Beyond the algorithmic instantiation, we develop, to our knowledge, the first dependence-aware theory for KCPD under $m$-dependent sequences, a finite-memory abstraction of short-range dependence common in language. We prove an oracle inequality for the population penalized risk and a localization guarantee showing that each true change point is recovered within a window that is small relative to segment length. To connect theory to practice, we introduce an LLM-based simulation framework that generates synthetic documents with controlled finite-memory dependence and known boundaries, validating the predicted scaling behavior. Across standard segmentation benchmarks, Embed-KCPD often outperforms strong unsupervised baselines. A case study on Taylor Swift's tweets illustrates that Embed-KCPD combines strong theoretical guarantees, simulated reliability, and practical effectiveness for text segmentation.
[109] MortalMATH: Evaluating the Conflict Between Reasoning Objectives and Emergency Contexts
Etienne Lanzeray,Stephane Meilliez,Malo Ruelle,Damien Sileo
Main category: cs.CL
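TL;DR: This paper introduces the MortalMATH benchmark and shows that LLMs specialized for deep reasoning keep working on a math task even when the user describes a life-threatening emergency, ignoring the call for help and exposing a "tunnel vision" safety risk.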
Details
Motivation: Investigates whether over-optimizing LLMs for deep reasoning causes "tunnel vision" in which critical safety contexts are ignored (for example, solving the math problem while the user describes mortal danger). Method: Builds MortalMATH, a benchmark of 150 life-threatening scenarios (such as stroke or freefall), and tests how generalist models (such as Llama-3.1) and specialized reasoning models (such as Qwen-3-32b and GPT-5-nano) respond, also measuring response latency. Result: Generalist models mostly decline the math and address the crisis; specialized reasoning models still complete the math task more than 95% of the time, and reasoning takes up to 15 seconds, dangerously delaying help. Conclusion: Overemphasizing task correctness may erode a model's basic safety instincts; safety-first mechanisms should be built explicitly into reasoning training. Abstract: Large Language Models are increasingly optimized for deep reasoning, prioritizing the correct execution of complex tasks over general conversation. We investigate whether this focus on calculation creates a "tunnel vision" that ignores safety in critical situations. We introduce MortalMATH, a benchmark of 150 scenarios where users request algebra help while describing increasingly life-threatening emergencies (e.g., stroke symptoms, freefall). We find a sharp behavioral split: generalist models (like Llama-3.1) successfully refuse the math to address the danger. In contrast, specialized reasoning models (like Qwen-3-32b and GPT-5-nano) often ignore the emergency entirely, maintaining over 95 percent task completion rates while the user describes dying. Furthermore, the computational time required for reasoning introduces dangerous delays: up to 15 seconds before any potential help is offered. These results suggest that training models to relentlessly pursue correct answers may inadvertently unlearn the survival instincts required for safe deployment.
[110] Subword-Based Comparative Linguistics across 242 Languages Using Wikipedia Glottosets
Iaroslav Chelombitko,Mika Hämäläinen,Aleksey Komissarov
Main category: cs.CL
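TL;DR: This paper builds "glottosets" from Wikipedia lexicons and uses Byte-Pair Encoding (BPE) for a large-scale cross-linguistic comparison of 242 Latin- and Cyrillic-script languages, finding that BPE segmentation identifies morpheme boundaries much better than a random baseline and that BPE vocabulary similarity correlates significantly with genetic language relatedness.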
Details
Motivation: Aims to establish a unified analytical framework for quantitatively studying vocabulary overlap, lexical divergence, and language similarity across a large number of Latin- and Cyrillic-script languages. Method: Builds "glottosets" from Wikipedia lexicons, segments them into subwords with Byte-Pair Encoding (BPE), and uses rank-based subword vectors to analyse vocabulary overlap, lexical divergence, and language similarity. Result: BPE segmentation aligns with morpheme boundaries 95% better than a random baseline (F1 = 0.34 vs 0.15); BPE vocabulary similarity correlates significantly with genetic relatedness (Mantel r = 0.329, p < 0.001); Romance languages form the tightest cluster (mean distance 0.51), while cross-family pairs are clearly separated (0.82); 48.7% of cross-linguistic homographs are segmented differently across related languages, with variation correlating with phylogenetic distance. Conclusion: The study provides quantitative macro-linguistic insight into lexical patterns across typologically diverse languages and validates BPE for cross-linguistic analysis. Abstract: We present a large-scale comparative study of 242 Latin and Cyrillic-script languages using subword-based methodologies. By constructing 'glottosets' from Wikipedia lexicons, we introduce a framework for simultaneous cross-linguistic comparison via Byte-Pair Encoding (BPE). Our approach utilizes rank-based subword vectors to analyze vocabulary overlap, lexical divergence, and language similarity at scale. Evaluations demonstrate that BPE segmentation aligns with morpheme boundaries 95% better than random baseline across 15 languages (F1 = 0.34 vs 0.15). BPE vocabulary similarity correlates significantly with genetic language relatedness (Mantel r = 0.329, p < 0.001), with Romance languages forming the tightest cluster (mean distance 0.51) and cross-family pairs showing clear separation (0.82). Analysis of 26,939 cross-linguistic homographs reveals that 48.7% receive different segmentations across related languages, with variation correlating to phylogenetic distance. Our results provide quantitative macro-linguistic insights into lexical patterns across typologically diverse languages within a unified analytical framework.
[111] ctELM: Decoding and Manipulating Embeddings of Clinical Trials with Embedding Language Models
Brian Ondov,Chia-Hsuan Chang,Yujia Zhou,Mauro Giuffrè,Hua Xu
Main category: cs.CL
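TL;DR: This paper aligns large language models (LLMs) to the embedding space of clinical trial texts using the Embedding Language Model (ELM) method, develops an open-source, domain-agnostic architecture and training framework, and builds an expert-validated synthetic dataset. The final model, ctELM, can produce interpretable descriptions from embeddings alone and generate plausible new clinical trial text, and its output responds to concept vectors such as age and sex.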
Details
Motivation: Methods for interpreting, exploring, and reversing text embedding spaces are limited, which reduces transparency and the potential for generative applications. Method: Building on the Embedding Language Model (ELM) method, designs a domain-agnostic ELM architecture and training framework, builds an expert-validated synthetic clinical trial dataset, and systematically explores the impact of different training tasks and regimes. Result: ctELM accurately describes and compares unseen clinical trials from embeddings alone and produces plausible new trials from novel vectors; generated abstracts respond to moving embeddings along concept vectors for age and sex. Conclusion: The work advances the alignment of LLMs with embedding spaces and offers an interpretable, controllable generation paradigm for biomedicine and other domains. Abstract: Text embeddings have become an essential part of a variety of language applications. However, methods for interpreting, exploring and reversing embedding spaces are limited, reducing transparency and precluding potentially valuable generative use cases. In this work, we align Large Language Models to embeddings of clinical trials using the recently reported Embedding Language Model (ELM) method. We develop an open-source, domain-agnostic ELM architecture and training framework, design training tasks for clinical trials, and introduce an expert-validated synthetic dataset. We then train a series of ELMs exploring the impact of tasks and training regimes. Our final model, ctELM, can accurately describe and compare unseen clinical trials from embeddings alone and produce plausible clinical trials from novel vectors. We further show that generated trial abstracts are responsive to moving embeddings along concept vectors for age and sex of study subjects. Our public ELM implementation and experimental results will aid the alignment of Large Language Models to embedding spaces in the biomedical domain and beyond.
cs.CV [Back]
[112] Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility
Honglin Lin,Chonghan Qin,Zheng Liu,Qizhi Pei,Yu Li,Zhanping Zhong,Xin Gao,Yanfeng Wang,Conghui He,Lijun Wu
Main category: cs.CV
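TL;DR: This paper proposes the ImgCoder framework and the SciGenBench benchmark to study scientific image synthesis systematically, reveals systematic failure modes of pixel-based generation models, and shows that high-fidelity scientific image synthesis consistently improves the scientific reasoning of large multimodal models.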
Details
Motivation: Existing text-to-image models produce images that are visually plausible but scientifically incorrect, creating a visual-logic divergence that limits downstream use in scientific reasoning. Method: Proposes the logic-driven ImgCoder framework (an "understand - plan - code" workflow), compares pixel-based generation with programmatic synthesis, and builds the SciGenBench benchmark to evaluate images on information utility and logical validity. Result: Finds systematic failure modes in pixel-based models and a trade-off between expressiveness and precision; fine-tuning LMMs on rigorously verified synthetic images yields consistent gains in scientific reasoning, with scaling trends similar to the text domain. Conclusion: High-fidelity scientific image synthesis is a viable path to unlocking large-scale multimodal scientific reasoning. Abstract: While synthetic data has proven effective for improving scientific reasoning in the text domain, multimodal reasoning remains constrained by the difficulty of synthesizing scientifically rigorous images. Existing Text-to-Image (T2I) models often produce outputs that are visually plausible yet scientifically incorrect, resulting in a persistent visual-logic divergence that limits their value for downstream reasoning. Motivated by recent advances in next-generation T2I models, we conduct a systematic study of scientific image synthesis across generation paradigms, evaluation, and downstream use. We analyze both direct pixel-based generation and programmatic synthesis, and propose ImgCoder, a logic-driven framework that follows an explicit "understand - plan - code" workflow to improve structural precision. To rigorously assess scientific correctness, we introduce SciGenBench, which evaluates generated images based on information utility and logical validity. Our evaluation reveals systematic failure modes in pixel-based models and highlights a fundamental expressiveness-precision trade-off. Finally, we show that fine-tuning Large Multimodal Models (LMMs) on rigorously verified synthetic scientific images yields consistent reasoning gains, with potential scaling trends analogous to the text domain, validating high-fidelity scientific synthesis as a viable path to unlocking massive multimodal reasoning capabilities.
[113] Data-Efficient Meningioma Segmentation via Implicit Spatiotemporal Mixing and Sim2Real Semantic Injection
Yunhao Xu,Fuquan Zong,Yexuan Xing,Chulong Zhang,Guang Yang,Shilong Yang,Xiaokun Liang,Juan Yu
Main category: cs.CV
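TL;DR: This paper proposes a dual-augmentation framework that combines spatial manifold expansion with semantic lesion injection: implicit neural representations model continuous velocity fields, and linear mixing of the integrated deformation fields generates anatomically plausible, diverse samples; a Sim2Real lesion injection module transplants lesion textures into healthy backgrounds, improving medical image segmentation when few samples are available.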
Details
Motivation: Medical image segmentation performance increasingly depends on how efficiently data are used rather than on raw data volume; for complex pathologies such as meningiomas, models must extract as much latent information as possible from limited high-quality annotations. Method: Proposes a dual-augmentation framework: 1) implicit neural representations (INR) model continuous velocity fields, and the integrated deformation fields are linearly mixed to interpolate within deformation space; 2) a Sim2Real lesion injection module transplants lesion textures with high fidelity into healthy anatomical backgrounds. Result: Experiments on a hybrid dataset show that the method significantly improves the data efficiency and robustness of state-of-the-art models such as nnU-Net and U-Mamba. Conclusion: The framework offers an effective strategy for high-performance medical image analysis when annotation budgets are limited. Abstract: The performance of medical image segmentation is increasingly defined by the efficiency of data utilization rather than merely the volume of raw data. Accurate segmentation, particularly for complex pathologies like meningiomas, demands that models fully exploit the latent information within limited high-quality annotations. To maximize the value of existing datasets, we propose a novel dual-augmentation framework that synergistically integrates spatial manifold expansion and semantic object injection. Specifically, we leverage Implicit Neural Representations (INR) to model continuous velocity fields. Unlike previous methods, we perform linear mixing on the integrated deformation fields, enabling the efficient generation of anatomically plausible variations by interpolating within the deformation space. This approach allows for the extensive exploration of structural diversity from a small set of anchors. Furthermore, we introduce a Sim2Real lesion injection module. This module constructs a high-fidelity simulation domain by transplanting lesion textures into healthy anatomical backgrounds, effectively bridging the gap between synthetic augmentation and real-world pathology. Comprehensive experiments on a hybrid dataset demonstrate that our framework significantly enhances the data efficiency and robustness of state-of-the-art models, including nnU-Net and U-Mamba, offering a potent strategy for high-performance medical image analysis with limited annotation budgets.
[114] Diagnosis Support of Sickle Cell Anemia by Classifying Red Blood Cell Shape in Peripheral Blood Images
Wilkie Delgado-Font,Miriela Escobedo-Nicot,Manuel González-Hidalgo,Silena Herold-Garcia,Antoni Jaume-i-Capó,Arnau Mir
Main category: cs.CV
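TL;DR: This paper proposes an automated red blood cell (RBC) classification method based on analysis of peripheral blood smear images. It segments cells with a Chan-Vese active contour model and uses the circular and elliptical shape factors (CSF/ESF) to distinguish normal, elongated, and otherwise deformed RBCs, with an elliptical adjustment to handle partially occluded cells in clusters. Validated on real clinical data from a hospital in Santiago de Cuba, it reaches F-measures of 0.97 (normal) and 0.95 (elongated), outperforms existing methods, and is suitable for clinical diagnostic support of sickle cell anemia.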
Details
Motivation: Manual microscope observation of peripheral blood smears is time-consuming, requires a specialist, and is subjective and error-prone; an automated, objective, and reliable method for analysing RBC deformation is needed to support diagnosis of diseases such as sickle cell anemia. Method: Segments the cells of interest with a Chan-Vese active contour model; extracts the circular shape factor (CSF) and elliptical shape factor (ESF) as basic shape descriptors for classification; and applies an elliptical adjustment to analyse discoidal and elongated erythrocytes that are partially occluded in clusters formed during sample preparation. Result: Strong performance on real clinical blood sample images: F-measure of 0.97 for normal RBCs and 0.95 for elongated RBCs; several overall multiclass measures exceed state-of-the-art methods; results are suitable for clinical treatment and diagnostic support. Conclusion: The method classifies RBC shape automatically, efficiently, accurately, and robustly, is particularly suited to supporting sickle cell anemia diagnosis, and has practical clinical value. Abstract: Red blood cell (RBC) deformation is the consequence of several diseases, including sickle cell anemia, which causes recurring episodes of pain and severe pronounced anemia. Monitoring patients with these diseases involves the observation of peripheral blood samples under a microscope, a time-consuming procedure. Moreover, a specialist is required to perform this technique, and owing to the subjective nature of the observation of isolated RBCs, the error rate is high. In this paper, we propose an automated method for differentially enumerating RBCs that uses peripheral blood smear image analysis. In this method, the objects of interest in the image are segmented using a Chan-Vese active contour model. An analysis is then performed to classify the RBCs, also called erythrocytes, as normal or elongated or having other deformations, using the basic shape analysis descriptors: circular shape factor (CSF) and elliptical shape factor (ESF). To analyze cells that become partially occluded in a cluster during sample preparation, an elliptical adjustment is performed to allow the analysis of erythrocytes with discoidal and elongated shapes. The images of patient blood samples used in the study were acquired by a clinical laboratory specialist in the Special Hematology Department of the "Dr. Juan Bruno Zayas" General Hospital in Santiago de Cuba. A comparison of the results obtained by the proposed method in our experiments with those obtained by some state-of-the-art methods showed that the proposed method is superior for the diagnosis of sickle cell anemia. 
This superiority is evidenced by the obtained F-measure values (0.97 for normal cells and 0.95 for elongated ones) and several overall multiclass performance measures. The results achieved by the proposed method are suitable for the purpose of clinical treatment and diagnostic support of sickle cell anemia.
[115] AMVICC: A Novel Benchmark for Cross-Modal Failure Mode Profiling for VLMs and IGMs
Aahana Basappa,Pranay Goel,Anusri Karra,Anish Karra,Asa Gilmore,Kevin Zhu
Main category: cs.CV
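TL;DR: This paper introduces the AMVICC benchmark, which systematically evaluates the failure modes of multimodal large language models (MLLMs) and image generation models (IGMs) on visual reasoning tasks, revealing shared and modality-specific weaknesses across image-to-text and text-to-image tasks, and in particular showing that IGMs have weak control over fine-grained visual attributes under explicit prompts.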
Details
Motivation: Vision language models still show clear deficiencies in understanding and generating basic visual concepts (such as object orientation, quantity, and spatial relationships), and there is no systematic framework for cross-modal failure analysis. Method: Builds a new cross-modal benchmark, AMVICC, by adapting MMVP questions into explicit and implicit prompts, and tests 11 MLLMs and 3 IGMs on nine categories of visual reasoning to analyse failure modes. Result: MLLMs and IGMs have both shared and model- or modality-specific failure modes; IGMs struggle to manipulate specific visual components, especially under explicit prompts, indicating poor control over fine-grained visual attributes. Conclusion: Failures of visual understanding and generation may stem from shared underlying limitations; AMVICC provides an extensible evaluation framework for future work on unified vision-language modelling and cross-modal alignment. Abstract: We investigated visual reasoning limitations of both multimodal large language models (MLLMs) and image generation models (IGMs) by creating a novel benchmark to systematically compare failure modes across image-to-text and text-to-image tasks, enabling cross-modal evaluation of visual understanding. Despite rapid growth in machine learning, vision language models (VLMs) still fail to understand or generate basic visual concepts such as object orientation, quantity, or spatial relationships, which highlights gaps in elementary visual reasoning. By adapting MMVP benchmark questions into explicit and implicit prompts, we create AMVICC, a novel benchmark for profiling failure modes across various modalities. After testing 11 MLLMs and 3 IGMs in nine categories of visual reasoning, our results show that failure modes are often shared between models and modalities, but certain failures are model-specific and modality-specific, and this can potentially be attributed to various factors. IGMs consistently struggled to manipulate specific visual components in response to prompts, especially in explicit prompts, suggesting poor control over fine-grained visual attributes. Our findings apply most directly to the evaluation of existing state-of-the-art models on structured visual reasoning tasks. This work lays the foundation for future cross-modal alignment studies, offering a framework to probe whether generation and interpretation failures stem from shared limitations to guide future improvements in unified vision-language modeling.
[116] Hybrid Deep Feature Extraction and ML for Construction and Demolition Debris Classification
Obai Alashram,Nejad Alagha,Mahmoud AlKakuri,Zeeshan Swaveel,Abigail Copiaco
Main category: cs.CV
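TL;DR: This paper proposes a hybrid vision approach that combines Xception feature extraction with classical machine learning classifiers for automated construction debris classification, reaching 99.5% accuracy on a self-collected dataset of 1,800 real construction-site images in four classes and outperforming end-to-end deep learning methods.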
Details
Motivation: The construction industry generates large volumes of waste, and efficient, sustainable classification techniques are needed to support resource recovery and on-site automation. Method: Builds a real construction-site dataset of 1,800 balanced, high-quality images in four classes (Ceramic/Tile, Concrete, Trash/Waste, and Wood); extracts deep features with a pre-trained Xception network; and systematically evaluates classical machine learning classifiers including SVM, kNN, Bagged Trees, LDA, and Logistic Regression. Result: Simple classifiers on Xception features, such as Linear SVM, kNN, and Bagged Trees, reach up to 99.5% accuracy and macro-F1, surpassing more complex end-to-end deep learning models. Conclusion: The hybrid pipeline combines high accuracy with engineering practicality, is suitable for field deployment, and offers a feasible path to later integration with robotics and automation systems. Abstract: The construction industry produces significant volumes of debris, making effective sorting and classification critical for sustainable waste management and resource recovery. This study presents a hybrid vision-based pipeline that integrates deep feature extraction with classical machine learning (ML) classifiers for automated construction and demolition (C&D) debris classification. A novel dataset comprising 1,800 balanced, high-quality images representing four material categories, Ceramic/Tile, Concrete, Trash/Waste, and Wood, was collected from real construction sites in the UAE, capturing diverse real-world conditions. Deep features were extracted using a pre-trained Xception network, and multiple ML classifiers, including SVM, kNN, Bagged Trees, LDA, and Logistic Regression, were systematically evaluated. The results demonstrate that hybrid pipelines using Xception features with simple classifiers such as Linear SVM, kNN, and Bagged Trees achieve state-of-the-art performance, with up to 99.5% accuracy and macro-F1 scores, surpassing more complex or end-to-end deep learning approaches. The analysis highlights the operational benefits of this approach for robust, field-deployable debris identification and provides pathways for future integration with robotics and onsite automation systems.
[117] MANGO: A Global Single-Date Paired Dataset for Mangrove Segmentation
Junhyuk Heo,Beomkyu Choi,Hyunjin Shin,Darongsae Kwon
Main category: cs.CV
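TL;DR: This paper introduces MANGO, a large-scale global mangrove monitoring dataset with 42,703 labeled image-mask pairs across 124 countries, and proposes a target-detection-driven method for selecting single-date imagery to support reliable global semantic segmentation of mangroves.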
Details
Motivation: Existing mangrove monitoring datasets lack single-date image-mask pairs, cover limited regions, or are not publicly available, which limits the use of deep learning in this area. Method: Builds the MANGO dataset from 2020 Sentinel-2 imagery by selecting the best single-date observation for each mangrove region with a target-detection-driven approach that uses pixel-wise coordinate references to pair images adaptively with the annual mangrove mask; also designs a semantic segmentation benchmark with a country-disjoint split. Result: Releases MANGO (42,703 image-mask pairs covering 124 countries) and benchmarks a range of semantic segmentation models on it, validating its effectiveness and generalization. Conclusion: MANGO provides the first large-scale, open, high-quality single-date image-mask dataset for global mangrove remote sensing, supporting reliable use of deep learning to monitor an ecosystem that is critical for climate-change mitigation. Abstract: Mangroves are critical for climate-change mitigation, requiring reliable monitoring for effective conservation. While deep learning has emerged as a powerful tool for mangrove detection, its progress is hindered by the limitations of existing datasets. In particular, many resources provide only annual map products without curated single-date image-mask pairs, are limited to specific regions rather than offering global coverage, or remain inaccessible to the public. To address these challenges, we introduce MANGO, a large-scale global dataset comprising 42,703 labeled image-mask pairs across 124 countries. To construct this dataset, we retrieve all available Sentinel-2 imagery within the year 2020 for mangrove regions and select the best single-date observations that align with the mangrove annual mask. This selection is performed using a target detection-driven approach that leverages pixel-wise coordinate references to ensure adaptive and representative image-mask pairings. We also provide a benchmark across diverse semantic segmentation architectures under a country-disjoint split, establishing a foundation for scalable and reliable global mangrove monitoring.
[118] FP-THD: Full page transcription of historical documents
H Neji,J Nogueras-Iso,J Lacasta,MÁ Latre,FJ García-Marco
Main category: cs.CV
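TL;DR: This paper proposes an end-to-end pipeline for transcribing 15th- and 16th-century Latin historical documents that combines layout analysis with an OCR model, digitizing pages efficiently and accurately while preserving special characters and symbols.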
Details
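Motivation: 15th- and 16th-century Latin historical documents contain special characters and symbols with specific historical meanings. Conventional OCR struggles to transcribe them faithfully, so a dedicated method is needed that balances accuracy with preservation of style.
Method: A two-stage pipeline that combines a layout analysis model with text line recognition OCR: the layout model locates and extracts text lines, which are then passed to the OCR model for recognition; a masked autoencoder (MAE) is used to improve robustness across text types (handwritten, printed, multi-language).
Result: The pipeline is validated on multiple datasets, showing that it processes full pages efficiently and markedly improves recognition of special characters, mixed typefaces, and multilingual text in historical documents.
Conclusion: The proposed pipeline preserves the original style and semantic integrity of historical texts better than existing methods, offering a scalable, high-fidelity, practical solution for digitizing old documents.
Abstract: The transcription of historical documents written in Latin in the 15th and 16th centuries poses special challenges, as it must preserve the characters and special symbols that carry distinct meanings so that historical texts retain their original style and significance. This work proposes a pipeline for the transcription of historical documents that preserves these special features. We propose to extend an existing text line recognition method with a layout analysis model. We analyze historical text images using a layout analysis model to extract text lines, which are then processed by an OCR model to generate a fully digitized page. We show that our pipeline facilitates the processing of the page and produces an efficient result. We evaluate our approach on multiple datasets and demonstrate that the masked autoencoder effectively processes different types of text, including handwritten, printed, and multi-language text.
[119] Arabic Sign Language Recognition using Multimodal Approach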
Ghadeer Alanazi,Abir Benabid
Main category: cs.CV
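TL;DR: This paper proposes a multimodal Arabic Sign Language (ArSL) recognition method that combines Leap Motion and RGB camera data, fusing spatio-temporal gesture features in a two-branch network and reaching 78% accuracy on a custom 18-word dataset.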
Details
Motivation: Existing ArSL recognition systems rely on a single sensor (such as Leap Motion or an RGB camera) and struggle to capture complex hand orientations and 3D movements accurately, which limits recognition performance.
Method: Two parallel subnetworks: the Leap Motion branch is a custom dense neural network with dropout and L2 regularization; the RGB branch is a fine-tuned VGG16 with data augmentation; features from both branches are concatenated and classified through fully connected layers and SoftMax.
Result: On a custom dataset of 18 ArSL words, 13 words were correctly recognized, with an overall accuracy of 78%.
Conclusion: The multimodal fusion scheme gives a preliminary demonstration of feasibility for ArSL recognition, but the model architecture needs further optimization and the dataset needs to be expanded to improve performance.
Abstract: Arabic Sign Language (ArSL) is an essential communication method for individuals in the Deaf and Hard-of-Hearing community. However, existing recognition systems face significant challenges due to their reliance on single-sensor approaches such as Leap Motion or RGB cameras. These systems struggle with limitations such as inadequate tracking of complex hand orientations and imprecise recognition of 3D hand movements. This paper investigates whether a multimodal approach that combines Leap Motion and RGB camera data is feasible for ArSL recognition. The system architecture includes two parallel subnetworks: a custom dense neural network for Leap Motion data, incorporating dropout and L2 regularization, and an image subnetwork based on a fine-tuned VGG16 model enhanced with data augmentation techniques. Feature representations from both modalities are concatenated in a fusion model and passed through fully connected layers, with final classification performed via SoftMax activation to analyze spatial and temporal features of hand gestures. The system was evaluated on a custom dataset comprising 18 ArSL words, of which 13 were correctly recognized, yielding an overall accuracy of 78%. These results offer preliminary insights into the viability of multimodal fusion for sign language recognition and highlight areas for further optimization and dataset expansion.
[120] Interpretable and Sparse Linear Attention with Decoupled Membership-Subspace Modeling via MCR2 Objective
Tianyuan Liu,Libin Hou,Linyuan Wang,Bin Yan
Main category: cs.CV
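TL;DR: This paper proposes an MCR2 optimization that decouples the membership matrix from the subspace matrix, derives an interpretable sparse linear attention operator (DMSA), and replaces the attention module in ToST with it to form DMST, which improves ImageNet-1K accuracy and achieves a faster coding reduction rate.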
Details
Motivation: In existing MCR2-driven white-box Transformers, the membership matrix and the subspace matrix U are tightly coupled, which causes redundant coding under incorrect token projection; decoupling them is needed to improve representation efficiency and interpretability.
Method: Decouple the functional relationship between the membership matrix and the subspaces U in the MCR2 objective, and derive the interpretable sparse linear attention operator DMSA by unrolling gradient descent on the optimized objective: the membership matrix is learned directly from the inputs, and sparse subspaces are derived from the full space S.
Result: Embedding DMSA in ToST yields DMST, which improves top-1 accuracy on ImageNet-1K by 1.08%-1.45% over ToST, achieves a faster coding reduction rate, and offers higher computational efficiency and interpretability than standard Transformers.
Conclusion: The decoupled design mitigates redundant coding in MCR2, and DMSA offers a new attention paradigm for visual modeling that is both efficient and interpretable.
Abstract: The Maximal Coding Rate Reduction (MCR2)-driven white-box transformer, grounded in structured representation learning, unifies interpretability and efficiency, providing a reliable white-box solution for visual modeling. However, in existing designs, tight coupling between the "membership matrix" and the "subspace matrix U" in MCR2 causes redundant coding under incorrect token projection. To this end, we decouple the functional relationship between the "membership matrix" and "subspaces U" in the MCR2 objective and derive an interpretable sparse linear attention operator from unrolled gradient descent of the optimized objective. Specifically, we propose to directly learn the membership matrix from inputs and subsequently derive sparse subspaces from the full space S. Consequently, gradient unrolling of the optimized MCR2 objective yields an interpretable sparse linear attention operator: Decoupled Membership-Subspace Attention (DMSA). Experimental results on visual tasks show that simply replacing the attention module in Token Statistics Transformer (ToST) with DMSA (which we refer to as DMST) not only achieves a faster coding reduction rate but also outperforms ToST by 1.08%-1.45% in top-1 accuracy on the ImageNet-1K dataset. Compared with vanilla Transformer architectures, DMST exhibits significantly higher computational efficiency and interpretability.
[121] Atomic Depth Estimation From Noisy Electron Microscopy Data Via Deep Learning
Matan Leibovich,Mai Tan,Adria Marcos-Morales,Sreyas Mohan,Peter A. Crozier,Carlos Fernandez-Granda
Main category: cs.CV
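TL;DR: This paper proposes a depth estimation method based on semantic segmentation: a deep convolutional neural network is trained on noisy simulated data to extract atomic-level 3D information from transmission electron microscopy (TEM) images, and its accuracy, calibration, and noise robustness are validated on simulated and real CeO2 nanoparticle data.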
Details
Motivation: Extracting 3D atomic-level information from transmission electron microscopy (TEM) images affected by significant noise is challenging, and existing methods struggle to combine accuracy with robustness.
Method: Formulate depth estimation as a semantic segmentation problem and train a deep convolutional neural network on simulated data corrupted by synthetic noise to produce pixel-wise depth segmentation maps.
Result: On both simulated images and real TEM data of CeO2 nanoparticles, the method yields atomic-column depth estimates that are accurate, calibrated, and robust to noise.
Conclusion: The proposed method offers an effective and reliable new approach to resolving 3D atomic structure from noisy TEM images.
Abstract: We present a novel approach for extracting 3D atomic-level information from transmission electron microscopy (TEM) images affected by significant noise. The approach is based on formulating depth estimation as a semantic segmentation problem. We address the resulting segmentation problem by training a deep convolutional neural network to generate pixel-wise depth segmentation maps using simulated data corrupted by synthetic noise. The proposed method was applied to estimate the depth of atomic columns in CeO2 nanoparticles from simulated images and real-world TEM data. Our experiments show that the resulting depth estimates are accurate, calibrated, and robust to noise.
[122] A Contrastive Pre-trained Foundation Model for Deciphering Imaging Noisomics across Modalities
Yuanjie Gu,Yiqun Wang,Chaohui Yu,Ang Xuan,Fan Wang,Zhi Lu,Biqin Dong
Main category: cs.CV
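TL;DR: This paper proposes the "Noisomics" framework, which uses a Contrastive Pre-trained (CoP) foundation model to recast imaging noise from a source of interference into a decodable information resource; with only 100 samples it clearly outperforms conventional supervised methods, achieving zero-shot cross-domain generalization and high-precision noise decoding.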
Details
Motivation: Existing imaging noise modeling depends heavily on large labeled datasets and is strongly device-specific, ignoring the information about hardware-signal interaction carried by the noise itself; a data-efficient, generalizable paradigm that treats noise as an information source is needed.
Method: The "Noisomics" framework, built around the Contrastive Pre-trained (CoP) foundation model; based on the manifold hypothesis and a synthetic "noise genome", it uses contrastive learning to disentangle semantic signals from stochastic perturbations, breaking deep learning's dependence on data scale.
Result: With only 100 training samples, CoP outperforms supervised baselines trained on 100,000 samples; across 12 out-of-domain datasets it generalizes zero-shot, reducing estimation error by 63.8% and improving the coefficient of determination by 85.1%; it is applied to noise decoding in consumer photography and to photon-efficient protocol optimization in deep-tissue microscopy.
Conclusion: Noise can be systematically decoded as a multi-parametric footprint and is a valuable information resource that requires no prior device calibration; Noisomics reshapes imaging diagnostics toward low-data, low-compute, highly robust analysis.
Abstract: Characterizing imaging noise is notoriously data-intensive and device-dependent, as modern sensors entangle physical signals with complex algorithmic artifacts. Current paradigms struggle to disentangle these factors without massive supervised datasets, often reducing noise to mere interference rather than an information resource. Here, we introduce "Noisomics", a framework shifting the focus from suppression to systematic noise decoding via the Contrastive Pre-trained (CoP) Foundation Model. By leveraging the manifold hypothesis and a synthetic noise genome, CoP employs contrastive learning to disentangle semantic signals from stochastic perturbations. Crucially, CoP breaks traditional deep learning scaling laws, achieving superior performance with only 100 training samples, outperforming supervised baselines trained on 100,000 samples, thereby reducing data and computational dependency by three orders of magnitude. Extensive benchmarking across 12 diverse out-of-domain datasets confirms its robust zero-shot generalization, demonstrating a 63.8% reduction in estimation error and an 85.1% improvement in the coefficient of determination compared to the conventional training strategy. We demonstrate CoP's utility across scales: from deciphering non-linear hardware-noise interplay in consumer photography to optimizing photon-efficient protocols for deep-tissue microscopy.
By decoding noise as a multi-parametric footprint, our work redefines stochastic degradation as a vital information resource, empowering precise imaging diagnostics without prior device calibration.
[123] SiMiC: Context-Aware Silicon Microstructure Characterization Using Attention-Based Convolutional Neural Networks for Field-Emission Tip Analysis
Jing Jie Tan,Rupert Schreiner,Matthias Hausladen,Ali Asgharzade,Simon Edler,Julian Bartsch,Michael Bachmann,Andreas Schels,Ban-Hoe Kwan,Danny Wee-Kiat Ng,Yan-Chai Hum
Main category: cs.CV
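TL;DR: This paper proposes SiMiC, which uses an attention-based convolutional neural network to extract morphological features of silicon microstructures (such as field-emission tips) automatically and accurately from SEM images, improving analysis efficiency and consistency and linking geometric features to field-emission performance.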
Details
Motivation: Manual analysis of SEM images is time-consuming, inefficient, and poorly reproducible, and cannot meet the need in micro- and nanofabrication for fast, accurate characterization of silicon microstructures.
Method: Build a dedicated SEM image dataset of silicon-based field-emission tips and design a customized CNN with attention mechanisms for multi-class microstructure classification and dimensional prediction.
Result: SiMiC is more accurate than classical image processing methods while remaining interpretable, and it relates microstructure geometry to field-emission performance.
Conclusion: SiMiC offers a new data-driven approach to silicon microstructure analysis that supports the design of optimized cold cathodes and SEM electron sources; the released dataset and code can serve as a baseline for the field.
Abstract: Accurate characterization of silicon microstructures is essential for advancing microscale fabrication, quality control, and device performance. Traditional analysis using Scanning Electron Microscopy (SEM) often requires labor-intensive, manual evaluation of feature geometry, limiting throughput and reproducibility. In this study, we propose SiMiC: Context-Aware Silicon Microstructure Characterization Using Attention-Based Convolutional Neural Networks for Field-Emission Tip Analysis. By leveraging deep learning, our approach efficiently extracts morphological features (such as size, shape, and apex curvature) from SEM images, significantly reducing human intervention while improving measurement consistency. A specialized dataset of silicon-based field-emitter tips was developed, and a customized CNN architecture incorporating attention mechanisms was trained for multi-class microstructure classification and dimensional prediction. Comparative analysis with classical image processing techniques demonstrates that SiMiC achieves high accuracy while maintaining interpretability. The proposed framework establishes a foundation for data-driven microstructure analysis directly linked to field-emission performance, opening avenues for correlating emitter geometry with emission behavior and guiding the design of optimized cold-cathode and SEM electron sources. The related dataset and algorithm repository, which could serve as a baseline in this area, can be found at https://research.jingjietan.com/?q=SIMIC
[124] Summary of the Unusual Activity Recognition Challenge for Developmental Disability Support
Christina Garcia,Nhat Tan Le,Taihei Fujioka,Umang Dobhal,Milyun Ni'ma Shoumi,Thanh Nha Nguyen,Sozo Inoue
Main category: cs.CV
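TL;DR: This paper summarizes the "Recognize the Unseen: Unusual Behavior Recognition from Pose Data Challenge" held at ISAS 2025, which focuses on automatically recognizing unusual behaviors in facilities for individuals with developmental disabilities from non-invasive pose estimation data.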
Details
Motivation: Address the pressing need to recognize unusual behaviors automatically from non-invasive pose data in care facilities for individuals with developmental disabilities.
Method: Skeleton keypoints are extracted from videos of simulated scenarios; evaluation uses a Leave-One-Subject-Out (LOSO) strategy, with model performance measured by macro-averaged F1.
Result: 40 teams participated. The results show that modeling rare, abrupt actions in noisy, low-dimensional data is very challenging and requires capturing both temporal and contextual features.
Conclusion: The challenge exposes the key difficulties of unusual behavior recognition and offers useful lessons for socially responsible AI in healthcare and behavior monitoring.
Abstract: This paper presents an overview of the Recognize the Unseen: Unusual Behavior Recognition from Pose Data Challenge, hosted at ISAS 2025. The challenge aims to address the critical need for automated recognition of unusual behaviors in facilities for individuals with developmental disabilities using non-invasive pose estimation data. Participating teams were tasked with distinguishing between normal and unusual activities based on skeleton keypoints extracted from video recordings of simulated scenarios. The dataset reflects real-world imbalance and temporal irregularities in behavior, and the evaluation adopted a Leave-One-Subject-Out (LOSO) strategy to ensure subject-agnostic generalization. The challenge attracted broad participation from 40 teams applying diverse approaches ranging from classical machine learning to deep learning architectures. Submissions were assessed primarily using macro-averaged F1 scores to account for class imbalance. The results highlight the difficulty of modeling rare, abrupt actions in noisy, low-dimensional data, and emphasize the importance of capturing both temporal and contextual nuances in behavior modeling. Insights from this challenge may contribute to future developments in socially responsible AI applications for healthcare and behavior monitoring.
[125] Single-Pixel Vision-Language Model for Intrinsic Privacy-Preserving Behavioral Intelligence
Hongjun An,Yiliang Song,Jiawei Shao,Zhe Sun,Xuelong Li
Main category: cs.CV
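TL;DR: This paper proposes a Single-Pixel Vision-Language Model (SP-VLM) that enables privacy-first safety monitoring through low-dimensional single-pixel sensing: personal identity stays unrecognizable while anomaly detection, people counting, and activity understanding remain effective.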
Details
Motivation: In privacy-sensitive places such as restrooms and changing rooms, conventional surveillance is hard to deploy because of privacy regulations and ethical constraints, yet adverse social behaviors (such as bullying and harassment) urgently need monitoring and intervention.
Method: Propose the Single-Pixel Vision-Language Model (SP-VLM), which uses the inherently low-dimensional nature of single-pixel sensing for intrinsic privacy protection and vision-language fusion to infer behavioral semantics; analyze, theoretically and experimentally, how the single-pixel sampling rate trades off identity recoverability against behavior understandability.
Result: Below a critical sampling rate, single-pixel sensing defeats existing face recognition systems while still supporting robust anomaly detection, people counting, and activity understanding; a practical sampling-rate regime is identified in which behavioral intelligence emerges while identity stays strongly protected.
Conclusion: SP-VLM offers a human-rights-aligned approach to safety monitoring in privacy-sensitive spaces that balances public safety with individual privacy and avoids normalizing intrusive surveillance.
Abstract: Adverse social interactions, such as bullying, harassment, and other illicit activities, pose significant threats to individual well-being and public safety, leaving profound impacts on physical and mental health. However, these critical events frequently occur in privacy-sensitive environments like restrooms and changing rooms, where conventional surveillance is prohibited or severely restricted by stringent privacy regulations and ethical concerns. Here, we propose the Single-Pixel Vision-Language Model (SP-VLM), a novel framework that reimagines secure environmental monitoring. It achieves intrinsic privacy-by-design by capturing human dynamics through inherently low-dimensional single-pixel modalities and inferring complex behavioral patterns via seamless vision-language integration. Building on this framework, we demonstrate that single-pixel sensing intrinsically suppresses identity recoverability, rendering state-of-the-art face recognition systems ineffective below a critical sampling rate. We further show that SP-VLM can nonetheless extract meaningful behavioral semantics, enabling robust anomaly detection, people counting, and activity understanding from severely degraded single-pixel observations. Combining these findings, we identify a practical sampling-rate regime in which behavioral intelligence emerges while personal identity remains strongly protected.
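The role of the sampling rate in this abstract can be illustrated with a toy single-pixel measurement model. This is a minimal sketch under stated assumptions, not the authors' acquisition pipeline: it assumes random +1/-1 illumination patterns, defines the sampling rate as measurements per scene pixel, and the function name `single_pixel_measure` and its parameters are hypothetical.

```python
import random

def single_pixel_measure(scene, sampling_rate, seed=0):
    """Simulate bucket-detector readings for a 2D scene (assumed model).

    Each reading is the inner product of one random +1/-1 pattern with the
    flattened scene; the number of readings is sampling_rate * pixel count.
    """
    rng = random.Random(seed)
    pixels = [value for row in scene for value in row]
    n_meas = max(1, round(sampling_rate * len(pixels)))
    patterns = [[rng.choice((-1.0, 1.0)) for _ in pixels] for _ in range(n_meas)]
    readings = [sum(p * v for p, v in zip(pattern, pixels)) for pattern in patterns]
    return readings, patterns

# An 8x8 toy scene sampled at 25%: 64 pixels are reduced to 16 scalar readings.
scene = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
readings, patterns = single_pixel_measure(scene, sampling_rate=0.25)
```

Lowering `sampling_rate` shrinks the number of scalar readings available to any downstream model, which is the knob the abstract describes as trading identity recoverability against behavioral understanding.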
Together, these results point to a human-rights-aligned pathway for safety monitoring that can support timely intervention without normalizing intrusive surveillance in privacy-sensitive spaces.
[126] Synthetic Data Guided Feature Selection for Robust Activity Recognition in Older Adults
Shuhao Que,Dieuwke van Dartel,Ilse Heeringa,Han Hegeman,Miriam Vollenbroek-Hutten,Ying Wang
Main category: cs.CV
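TL;DR: This study develops a robust human activity recognition (HAR) system for hip fracture rehabilitation in adults aged over 80. It collects data from two accelerometers on the lower back and anterior thigh and introduces synthetic data augmentation and a feature intervention model (FIM), markedly improving recognition of all activity classes, including postural transfers, which are clinically important but often overlooked.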
Details
Motivation: Existing wearable-based continuous activity monitoring systems are mostly developed in middle-aged populations and perform unreliably in adults of advanced age, whose gait is slower and more variable; quantifying physical activity during hip fracture rehabilitation is essential for preventing functional decline but is often neglected in clinical practice.
Method: Recruit 24 healthy older adults (aged over 80) to perform activities of daily living (walking, standing, sitting, lying down, and postural transfers) for 75 minutes under simulated free-living conditions while wearing accelerometers at two positions (lower back and anterior thigh); evaluate model robustness with leave-one-subject-out cross-validation; use synthetic data to aid training and build a feature intervention model (FIM).
Result: The FIM reaches mean F1-scores of 0.896 for walking, 0.927 for standing, 0.997 for sitting, 0.937 for lying down, and 0.816 for postural transfers; compared with a control model trained without synthetic data, it significantly improves postural transfer recognition.
Conclusion: These preliminary results confirm that robust activity recognition is feasible in adults of advanced age, but clinical utility still needs to be validated in actual hip fracture patients.
Abstract: Physical activity during hip fracture rehabilitation is essential for mitigating long-term functional decline in geriatric patients. However, it is rarely quantified in clinical practice. Existing continuous monitoring systems with commercially available wearable activity trackers are typically developed in middle-aged adults and therefore perform unreliably in older adults with slower and more variable gait patterns. This study aimed to develop a robust human activity recognition (HAR) system to improve continuous physical activity recognition in the context of hip fracture rehabilitation. Twenty-four healthy older adults aged over 80 years were included to perform activities of daily living (walking, standing, sitting, lying down, and postural transfers) under simulated free-living conditions for 75 minutes while wearing two accelerometers positioned on the lower back and anterior upper thigh. Model robustness was evaluated using leave-one-subject-out cross-validation. The synthetic data demonstrated potential to improve generalization across participants. The resulting feature intervention model (FIM), aided by synthetic data guidance, achieved reliable activity recognition with mean F1-scores of 0.896 for walking, 0.927 for standing, 0.997 for sitting, 0.937 for lying down, and 0.816 for postural transfers. Compared with a control condition model without synthetic data, the FIM significantly improved postural transfer detection, an activity class of high clinical relevance that is often overlooked in existing HAR literature.
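The leave-one-subject-out evaluation mentioned above can be sketched as a fold generator. This is a generic illustration of the protocol, not the study's code; the helper name `leave_one_subject_out` and the toy subject IDs are hypothetical.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject.

    Every window recorded from the held-out subject goes to the test fold,
    so no participant contributes data to both training and testing.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# Six sensor windows recorded from three (hypothetical) participants.
subject_ids = ["s01", "s01", "s02", "s03", "s03", "s03"]
folds = list(leave_one_subject_out(subject_ids))
```

With 24 participants, as in the study, this protocol produces 24 folds, and per-class F1-scores are typically averaged across them.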
In conclusion, these preliminary results demonstrate the feasibility of robust activity recognition in older adults. Further validation in hip fracture patient populations is required to assess the clinical utility of the proposed monitoring system.
[127] Ego4OOD: Rethinking Egocentric Video Domain Generalization via Covariate Shift Scoring
Zahra Vaseqi,James Clark
Main category: cs.CV
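TL;DR: This paper proposes the Ego4OOD benchmark for evaluating how egocentric video action recognition generalizes under covariate shift, and designs a one-vs-all binary training strategy that matches state-of-the-art performance with fewer parameters and no extra modalities.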
Details
Motivation: Existing egocentric domain generalization benchmarks conflate covariate shift with concept shift, making it hard to evaluate reliably how well models generalize to changes in the input distribution; egocentric video also brings large intra-class spatio-temporal variability, long-tailed feature distributions, and strong action-environment correlations.
Method: Build the Ego4OOD benchmark (derived from Ego4D), which emphasizes measurable covariate diversity and reduces concept shift; propose a clustering-based covariate shift metric; adopt a one-vs-all binary training objective that decomposes multi-class action recognition into independent binary classification tasks.
Result: A lightweight two-layer fully connected network performs on par with state-of-the-art methods on Argo1M and Ego4OOD with fewer parameters and no additional modalities; empirically, measured covariate shift shows a clear negative relationship with recognition performance.
Conclusion: Controlled benchmark design and quantitative domain characterization are essential for studying out-of-distribution generalization in egocentric video, and the one-vs-all binary strategy is better suited to covariate shift.
Abstract: Egocentric video action recognition under domain shifts remains challenging due to large intra-class spatio-temporal variability, long-tailed feature distributions, and strong correlations between actions and environments. Existing benchmarks for egocentric domain generalization often conflate covariate shifts with concept shifts, making it difficult to reliably evaluate a model's ability to generalize across input distributions. To address this limitation, we introduce Ego4OOD, a domain generalization benchmark derived from Ego4D that emphasizes measurable covariate diversity while reducing concept shift through semantically coherent, moment-level action categories. Ego4OOD spans eight geographically distinct domains and is accompanied by a clustering-based covariate shift metric that provides a quantitative proxy for domain difficulty. We further leverage a one-vs-all binary training objective that decomposes multi-class action recognition into independent binary classification tasks. This formulation is particularly well-suited for covariate shift because it reduces interference between visually similar classes under feature distribution shift. Using this formulation, we show that a lightweight two-layer fully connected network achieves performance competitive with state-of-the-art egocentric domain generalization methods on both Argo1M and Ego4OOD, despite using fewer parameters and no additional modalities.
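A one-vs-all binary objective of the kind described above can be sketched as a per-class binary cross-entropy. This is a minimal illustration, assuming a sigmoid per class and an unweighted mean over classes; the paper's exact loss, weighting, and the function name `one_vs_all_bce` are not taken from the source.

```python
import math

def one_vs_all_bce(logits, target_class):
    """Mean binary cross-entropy over classes, one yes/no task per class.

    Unlike softmax cross-entropy, each class logit is scored independently,
    so raising one class's score does not directly lower the others.
    """
    total = 0.0
    for k, z in enumerate(logits):
        y = 1.0 if k == target_class else 0.0
        # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Confident correct logits give a smaller loss than confident wrong ones.
loss_good = one_vs_all_bce([4.0, -3.0, -5.0], target_class=0)
loss_bad = one_vs_all_bce([-4.0, 3.0, -5.0], target_class=0)
```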
Our empirical analysis demonstrates a clear relationship between measured covariate shift and recognition performance, highlighting the importance of controlled benchmarks and quantitative domain characterization for studying out-of-distribution generalization in egocentric video.
[128] A Computer Vision Pipeline for Iterative Bullet Hole Tracking in Rifle Zeroing
Robert M. Belcher,Brendan C. Degryse,Leonard R. Kosta,Christopher J. Lowrance
Main category: cs.CV
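TL;DR: This paper proposes an end-to-end computer vision system that automatically detects bullet holes on a target and tracks them by firing iteration, combining YOLOv8 detection, IoU-based temporal matching, ORB-based perspective-correction preprocessing, and a novel object-removal data augmentation method; it achieves 97.0% mAP and 88.8% iteration-assignment accuracy.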
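The IoU-based matching named in this entry can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes axis-aligned `(x1, y1, x2, y2)` boxes already aligned to a common target frame, a hypothetical overlap threshold of 0.5, and a hypothetical helper `new_holes` that treats unmatched boxes as belonging to the latest firing iteration.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def new_holes(previous, current, threshold=0.5):
    """Return boxes in `current` that overlap no box in `previous`.

    The threshold is an assumed value, not one reported in the paper.
    """
    return [c for c in current if all(iou(c, p) < threshold for p in previous)]

# Two holes from the earlier image reappear slightly shifted; one is new.
previous = [(10, 10, 20, 20), (40, 40, 50, 50)]
current = [(11, 10, 21, 20), (40, 41, 50, 51), (70, 15, 80, 25)]
fresh = new_holes(previous, current)
```

Matching on overlap only works if consecutive images share a coordinate frame, which is presumably why the pipeline standardizes target orientation before comparison.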
Details
Motivation: Conventional rifle zeroing requires manually identifying bullet holes from multiple firing iterations, which is delayed by range safety protocols and prone to human error.
Method: Use YOLOv8 for small-object bullet hole detection, combined with IoU analysis to match bullet holes across an image sequence; propose an "object removal" data augmentation strategy to simulate realistic firing sequences; add ORB-based perspective-correction preprocessing to standardize target orientation.
Result: The system reaches 97.0% mean average precision (mAP) on bullet hole detection and 88.8% accuracy in assigning bullet holes to the correct firing iteration.
Conclusion: The system supports efficient, accurate automation of rifle zeroing, and its framework for temporally differentiating similar objects also applies to other settings that must separate visually similar objects over time.
Abstract: Adjusting rifle sights, a process commonly called "zeroing," requires shooters to identify and differentiate bullet holes from multiple firing iterations. Traditionally, this process demands physical inspection, introducing delays due to range safety protocols and increasing the risk of human error. We present an end-to-end computer vision system for automated bullet hole detection and iteration-based tracking directly from images taken at the firing line. Our approach combines YOLOv8 for accurate small-object detection with Intersection over Union (IoU) analysis to differentiate bullet holes across sequential images. To address the scarcity of labeled sequential data, we propose a novel data augmentation technique that removes rather than adds objects to simulate realistic firing sequences. Additionally, we introduce a preprocessing pipeline that standardizes target orientation using ORB-based perspective correction, improving model accuracy. Our system achieves 97.0% mean average precision on bullet hole detection and 88.8% accuracy in assigning bullet holes to the correct firing iteration. While designed for rifle zeroing, this framework offers broader applicability in domains requiring the temporal differentiation of visually similar objects.
[129] A Mechanistic View on Video Generation as World Models: State and Dynamics
Luozhou Wang,Zhifei Chen,Yihua Du,Dongyu Yan,Wenhang Ge,Guibao Shen,Xinli Xu,Leyi Wu,Man Chen,Tianshuo Xu,Peiran Ren,Xin Tao,Pengfei Wan,Ying-Cong Chen
Main category: cs.CV
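TL;DR: This paper proposes a new taxonomy of video generation models centered on state construction and dynamics modeling, and advocates shifting evaluation from visual fidelity to functional benchmarks in order to move video generation models toward general-purpose world simulators.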
Details
Motivation: Large-scale video generation models show physical coherence, but a gap remains between their "stateless" architectures and classic state-centric world model theory.
Method: Propose a new taxonomy built on two pillars, state construction and dynamics modeling; divide state construction into implicit (context management) and explicit (latent compression) paradigms, and analyze dynamics modeling in terms of knowledge integration and architectural reformulation; advocate functional evaluation criteria such as physical persistence and causal reasoning.
Result: Identifies two key frontiers: improving state persistence through data-driven memory and compressed fidelity, and advancing causal modeling through latent factor decoupling and reasoning-prior integration.
Conclusion: Only by solving state persistence and causal modeling can video generation models move from producing visually plausible content to serving as robust, general-purpose world simulators.
Abstract: Large-scale video generation models have demonstrated emergent physical coherence, positioning them as potential world models. However, a gap remains between contemporary "stateless" video architectures and classic state-centric world model theories. This work bridges this gap by proposing a novel taxonomy centered on two pillars: State Construction and Dynamics Modeling. We categorize state construction into implicit paradigms (context management) and explicit paradigms (latent compression), while dynamics modeling is analyzed through knowledge integration and architectural reformulation. Furthermore, we advocate for a transition in evaluation from visual fidelity to functional benchmarks, testing physical persistence and causal reasoning. We conclude by identifying two critical frontiers: enhancing persistence via data-driven memory and compressed fidelity, and advancing causality through latent factor decoupling and reasoning-prior integration. By addressing these challenges, the field can evolve from generating visually plausible videos to building robust, general-purpose world simulators.
[130] Superpixel-Based Image Segmentation Using Squared 2-Wasserstein Distances
Jisui Huang,Andreas Alpers,Ke Chen,Na Lei
Main category: cs.CV
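TL;DR: This paper proposes a two-level image segmentation method based on optimal transport (OT): superpixels are first generated by a linear least-squares assignment and then merged using the squared 2-Wasserstein distance, balancing accuracy and efficiency.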
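To see why a distributional distance can separate regions that a mean-color distance cannot, consider the one-dimensional case. The sketch below is an illustration only: it assumes grayscale intensities and two superpixels with the same number of pixels, in which case the squared 2-Wasserstein distance between the empirical distributions is the mean squared difference of the sorted samples. The paper's superpixels carry color distributions and differ in size, which this toy function `squared_w2_1d` (a hypothetical name) does not handle.

```python
def squared_w2_1d(xs, ys):
    """Squared 2-Wasserstein distance between two equal-size 1D samples.

    For uniform empirical measures with the same number of points, the
    optimal transport plan pairs the i-th smallest x with the i-th smallest y.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum((a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

# Both regions have mean intensity 0.5, so a mean-based distance is zero,
# yet one is a dark/bright mixture and the other is uniformly gray.
mixed = [0.1, 0.1, 0.9, 0.9]
flat = [0.5, 0.5, 0.5, 0.5]
distance = squared_w2_1d(mixed, flat)
```

In this example the mean-color distance vanishes while the Wasserstein distance does not, which is the kind of inhomogeneity the entry targets.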
Details
Motivation: Conventional segmentation methods degrade on images with strong inhomogeneities; in particular, conventional superpixel merging relies on mean-color distances and lacks a unified mathematical formulation.
Method: Model segmentation as two-level clustering: the first level generates superpixels through a linear least-squares assignment (a discrete optimal transport problem); the second level greedily merges superpixels based on the squared 2-Wasserstein distance between their empirical distributions.
Result: Achieves higher segmentation accuracy on challenging images while remaining computationally efficient.
Conclusion: Replacing mean distances with a distributional optimal transport distance unifies the two clustering levels mathematically and improves robustness and accuracy under strong inhomogeneity.
Abstract: We present an efficient method for image segmentation in the presence of strong inhomogeneities. The approach can be interpreted as a two-level clustering procedure: pixels are first grouped into superpixels via a linear least-squares assignment problem, which can be viewed as a special case of a discrete optimal transport (OT) problem, and these superpixels are subsequently greedily merged into object-level segments using the squared 2-Wasserstein distance between their empirical distributions. In contrast to conventional superpixel merging strategies based on mean-color distances, our framework employs a distributional OT distance, yielding a mathematically unified formulation across both clustering levels. Numerical experiments demonstrate that this perspective leads to improved segmentation accuracy on challenging images while retaining high computational efficiency.
[131] GlassesGB: Controllable 2D GAN-Based Eyewear Personalization for 3D Gaussian Blendshapes Head Avatars
Rui-Yang Ju,Jen-Shiun Chiang
Main category: cs.CV
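TL;DR: This paper proposes GlassesGB, a framework that combines the 2D personalized eyewear generation of GlassesGAN with the head reconstruction of 3D Gaussian Blendshapes to produce customizable 3D eyewear for VR applications with fine-grained, user-driven control.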
Details
Motivation: Existing virtual try-on systems are mostly limited to predefined eyewear templates and lack fine-grained, user-driven customization. GlassesGAN supports personalized 2D design but does not extend to 3D, while 3D Gaussian Blendshapes performs well in head reconstruction, so the authors combine the two to address personalized 3D eyewear generation in VR.
Method: Combine the 2D generative capability of GlassesGAN with the head modeling of 3D Gaussian Blendshapes in an end-to-end framework, GlassesGB, that maps customized 2D designs to rendered eyewear on 3D head avatars.
Result: GlassesGB generates customizable eyewear for 3D head avatars, keeping the freedom of 2D design while producing high-quality 3D renderings suited to VR virtual try-on.
Conclusion: GlassesGB bridges 2D generation and 3D avatar rendering, offering a new approach to personalized virtual try-on in VR with practical potential.
Abstract: Virtual try-on (VTON) systems allow users to interactively try different products within VR scenarios. However, most existing VTON methods operate only on predefined eyewear templates and lack support for fine-grained, user-driven customization. While GlassesGAN enables personalized 2D eyewear design, its capability remains limited to 2D image generation. Motivated by the success of 3D Gaussian Blendshapes in head reconstruction, we integrate these two techniques and propose GlassesGB, a framework that supports customizable eyewear generation for 3D head avatars. GlassesGB effectively bridges 2D generative customization with 3D head avatar rendering, addressing the challenge of achieving personalized eyewear design for VR applications. The implementation code is available at https://ruiyangju.github.io/GlassesGB.
[132] GRASP: Guided Region-Aware Sparse Prompting for Adapting MLLMs to Remote Sensing
Qigan Sun,Chaoning Zhang,Jianwei Zhang,Xudong Wang,Jiehui Xie,Pengcheng Zheng,Haoyu Wang,Sungyoung Lee,Chi-lok Andy Tai,Yang Yang,Heng Tao Shen
Main category: cs.CV
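TL;DR: This paper proposes GRASP, a parameter-efficient fine-tuning method for remote sensing visual question answering that uses spatially structured soft prompts and a question-guided sparse fusion mechanism to focus the model on target regions and suppress background noise.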
Details
Motivation: On remote sensing VQA, existing MLLMs tend to overfit background noise or miss target details because of the large scale variations, sparse target distributions, and complex regional semantics of remote sensing images.
Method: Propose the parameter-efficient fine-tuning strategy GRASP: spatial blocks are extracted from a frozen visual token grid and associated with spatially structured soft prompts; a question-guided sparse fusion mechanism dynamically aggregates task context into a compact global prompt.
Result: On multiple RSVQA benchmarks, GRASP outperforms existing fine-tuning and prompt-based methods while remaining highly parameter-efficient.
Conclusion: GRASP mitigates background interference and loss of target detail in remote sensing VQA, confirming the value of region awareness and sparse prompt fusion for parameter-efficient fine-tuning.
Abstract: In recent years, Multimodal Large Language Models (MLLMs) have made significant progress in visual question answering tasks. However, directly applying existing fine-tuning methods to remote sensing (RS) images often leads to issues such as overfitting on background noise or neglecting target details. This is primarily due to the large-scale variations, sparse target distributions, and complex regional semantic features inherent in RS images. These challenges limit the effectiveness of MLLMs in RS tasks. To address these challenges, we propose a parameter-efficient fine-tuning (PEFT) strategy called Guided Region-Aware Sparse Prompting (GRASP). GRASP introduces spatially structured soft prompts associated with spatial blocks extracted from a frozen visual token grid. Through a question-guided sparse fusion mechanism, GRASP dynamically aggregates task-specific context into a compact global prompt, enabling the model to focus on relevant regions while filtering out background noise. Extensive experiments on multiple RSVQA benchmarks show that GRASP achieves competitive performance compared to existing fine-tuning and prompt-based methods while maintaining high parameter efficiency.
[133] LoD Sketch Extraction from Architectural Models Using Generative AI: Dataset Construction for Multi-Level Architectural Design Generation
Xusheng Du,Athiwat Kongkaeo,Ye Zhang,Haoran Xie
Main category: cs.CV
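TL;DR: This paper proposes a generative-AI-based framework for automatic multi-Level-of-Detail (LoD) sketch extraction that progressively simplifies high-detail architectural models into geometrically consistent, hierarchically coherent multi-LoD representations, addressing the lack of high-quality paired LoD training data for AI-generated architectural models.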
Details
Motivation: Conventional LoD modeling relies on manual work that is time-consuming, labor-intensive, and prone to geometric inconsistency; generative AI for multi-level architectural modeling is limited by the lack of high-quality paired LoD training data.
Method: Combine computer vision with generative AI to build a progressive LoD sketch extraction pipeline that goes from high-detail models to volumetric abstractions.
Result: SSIM reaches 0.7319 for LoD3 to LoD2 and 0.7532 for LoD2 to LoD1, with normalized Hausdorff distances of 25.1% and 61.0% of the image diagonal, respectively, supporting geometric consistency and structure preservation.
Conclusion: The framework preserves global structure while achieving progressive semantic simplification across LoD levels, providing reliable data and technical support for AI-driven multi-level architectural design and hierarchical modeling.
Abstract: For architectural design, representation across multiple Levels of Detail (LoD) is essential for achieving a smooth transition from conceptual massing to detailed modeling. However, traditional LoD modeling processes rely on manual operations that are time-consuming, labor-intensive, and prone to geometric inconsistencies. While the rapid advancement of generative artificial intelligence (AI) has opened new possibilities for generating multi-level architectural models from sketch inputs, its application remains limited by the lack of high-quality paired LoD training data. To address this issue, we propose an automatic LoD sketch extraction framework using generative AI models, which progressively simplifies high-detail architectural models to automatically generate geometrically consistent and hierarchically coherent multi-LoD representations. The proposed framework integrates computer vision techniques with generative AI methods to establish a progressive extraction pipeline that transitions from detailed representations to volumetric abstractions. Experimental results demonstrate that the method maintains strong geometric consistency across LoD levels, achieving SSIM values of 0.7319 and 0.7532 for the transitions from LoD3 to LoD2 and from LoD2 to LoD1, respectively, with corresponding normalized Hausdorff distances of 25.1% and 61.0% of the image diagonal, reflecting controlled geometric deviation during abstraction.
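A Hausdorff distance normalized by the image diagonal, as reported above, can be sketched as follows. This is a generic illustration under stated assumptions: the two sketches are represented as sets of 2D edge-pixel coordinates, and how the authors actually extract those point sets is not described in the abstract. The function names are hypothetical.

```python
import math

def hausdorff(points_a, points_b):
    """Symmetric Hausdorff distance between two non-empty 2D point sets."""
    def directed(src, dst):
        # Largest distance from any source point to its nearest target point.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(points_a, points_b), directed(points_b, points_a))

def normalized_hausdorff(points_a, points_b, width, height):
    """Hausdorff distance as a fraction of the image diagonal."""
    return hausdorff(points_a, points_b) / math.hypot(width, height)

# Toy edge points on a 300x400 image, whose diagonal is 500 pixels.
detailed = [(0, 0), (0, 30)]
simplified = [(0, 0), (40, 30)]
ratio = normalized_hausdorff(detailed, simplified, width=300, height=400)
```

Because the Hausdorff distance is driven by the single worst-matched point, it grows quickly when abstraction removes whole elements, which is consistent with the larger value reported for the coarser transition.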
These results verify that the proposed framework effectively preserves global structure while achieving progressive semantic simplification across different LoD levels, providing reliable data and technical support for AI-driven multi-level architectural generation and hierarchical modeling.
[134] Performance uncertainty in medical image analysis: a large-scale investigation of confidence intervals
Pascaline André,Charles Heitz,Evangelia Christodoulou,Annika Reinke,Carole H. Sudre,Michela Antonelli,Patrick Godau,M. Jorge Cardoso,Antoine Gilson,Sophie Tezenas du Montcel,Gaël Varoquaux,Lena Maier-Hein,Olivier Colliot
Main category: cs.CV
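TL;DR: Through a large-scale empirical analysis, this study systematically evaluates the reliability (coverage) and precision (width) of several confidence interval (CI) methods for quantifying performance uncertainty in medical imaging AI. It shows that sample size requirements, choice of performance metric, aggregation strategy, task type (segmentation vs. classification), and the CI method itself all significantly affect CI behavior, providing key evidence for future guidelines on reporting medical imaging AI performance.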
Details
Motivation: Clinical translation of medical imaging AI requires reliable quantification of performance uncertainty, but the community lacks a systematic understanding of how suitable the various confidence interval (CI) methods are and how they behave in this field.
Method: A large-scale empirical analysis covering 24 segmentation and classification tasks, with 19 trained models per task group, a range of common performance metrics, multiple aggregation strategies (such as macro and micro), and several widely used CI methods; the reliability (coverage) and precision (width) of each CI method are evaluated across all settings.
Result: Five main findings: 1) the sample size needed for reliable CIs varies with study parameters (from dozens to thousands of cases); 2) CI behavior is strongly affected by the choice of performance metric; 3) the aggregation strategy (such as macro vs. micro) substantially affects CI reliability; 4) the task type (segmentation or classification) modulates these effects; 5) CI methods differ in reliability and precision across use cases.
Conclusion: CI methods behave very differently when evaluating medical imaging AI and must be chosen carefully for the task, metric, and aggregation at hand; the results provide an empirical basis for future guidelines on reporting performance uncertainty.
Abstract: Performance uncertainty quantification is essential for reliable validation and eventual clinical translation of medical imaging artificial intelligence (AI). Confidence intervals (CIs) play a central role in this process by indicating how precise a reported performance estimate is. Yet, due to the limited amount of work examining CI behavior in medical imaging, the community remains largely unaware of how many diverse CI methods exist and how they behave in specific settings. The purpose of this study is to close this gap. To this end, we conducted a large-scale empirical analysis across a total of 24 segmentation and classification tasks, using 19 trained models per task group, a broad spectrum of commonly used performance metrics, multiple aggregation strategies, and several widely adopted CI methods. Reliability (coverage) and precision (width) of each CI method were estimated across all settings to characterize their dependence on study characteristics. Our analysis revealed five principal findings: 1) the sample size required for reliable CIs varies from a few dozen to several thousand cases depending on study parameters; 2) CI behavior is strongly affected by the choice of performance metric; 3) aggregation strategy substantially influences the reliability of CIs, e.g. they require more observations for macro than for micro; 4) the machine learning problem (segmentation versus classification) modulates these effects; 5) different CI methods are not equally reliable and precise depending on the use case.
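As one concrete example of a CI method and of the width the study measures, the sketch below computes a percentile bootstrap interval for a mean per-case score. The abstract does not say which CI methods were compared, so the percentile bootstrap is an assumed example of a widely used one; the per-case scores are made up for illustration and the function name is hypothetical.

```python
import random

def percentile_bootstrap_ci(scores, alpha=0.05, n_resamples=2000, seed=0):
    """Percentile bootstrap CI for the mean of per-case scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample cases with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper

# Hypothetical per-case Dice scores from a small test set.
dice = [0.91, 0.88, 0.79, 0.95, 0.83, 0.90, 0.72, 0.93, 0.86, 0.89, 0.94, 0.81]
lower, upper = percentile_bootstrap_ci(dice)
width = upper - lower  # precision: narrower intervals are more precise
```

Coverage, the study's reliability measure, cannot be read off a single interval: it is the fraction of repeated test samples whose interval contains the true performance, ideally close to 1 - alpha.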
These results form key components for the development of future guidelines on reporting performance uncertainty in medical imaging AI.
[135] StealthMark: Harmless and Stealthy Ownership Verification for Medical Segmentation via Uncertainty-Guided Backdoors
Qinkai Yu,Chong Zhang,Gaojie Jin,Tianjin Huang,Wei Zhou,Wenhui Li,Xiaobo Jin,Bo Huang,Yitian Zhao,Guang Yang,Gregory Y. H. Lip,Yalin Zheng,Aline Villavicencio,Yanda Meng
Main category: cs.CV
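TL;DR: This paper proposes StealthMark, a novel, stealthy, and harmless watermarking method for verifying ownership of medical image segmentation models under black-box conditions. It modulates model uncertainty and uses explanation methods such as LIME to embed a QR-code watermark, enabling high-accuracy ownership verification without degrading model performance.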
Details
Motivation: Annotating medical data is costly and specialists are scarce, and privacy and ethical constraints add further limits, so trained medical segmentation models are intellectual property that needs strong protection; existing model protection techniques focus mostly on classification and generative tasks, leaving medical segmentation models under-studied. Method: StealthMark embeds a watermark under a black-box setting by subtly modulating model uncertainty without changing the final segmentation output; a model-agnostic explanation method (e.g., LIME) extracts feature attributions that reveal the embedded QR-code watermark under specific trigger conditions. Result: The method is validated on four medical imaging datasets and five mainstream segmentation models (including SAM): on SAM, ASR exceeds 95% while Dice and AUC drop by less than 1%, significantly outperforming backdoor-based watermarking methods and combining effectiveness, stealthiness, and harmlessness. Conclusion: StealthMark offers a practical and robust approach to ownership protection for medical segmentation models, with good potential for deployment. Abstract: Annotating medical data for training AI models is often costly and limited due to the shortage of specialists with relevant clinical expertise. This challenge is further compounded by privacy and ethical concerns associated with sensitive patient information. As a result, well-trained medical segmentation models on private datasets constitute valuable intellectual property requiring robust protection mechanisms. Existing model protection techniques primarily focus on classification and generative tasks, while segmentation models, which are crucial to medical image analysis, remain largely underexplored. In this paper, we propose a novel, stealthy, and harmless method, StealthMark, for verifying the ownership of medical segmentation models under black-box conditions. Our approach subtly modulates model uncertainty without altering the final segmentation outputs, thereby preserving the model's performance. To enable ownership verification, we incorporate model-agnostic explanation methods, e.g. LIME, to extract feature attributions from the model outputs. Under specific triggering conditions, these explanations reveal a distinct and verifiable watermark. We further design the watermark as a QR code to facilitate robust and recognizable ownership claims. We conducted extensive experiments across four medical imaging datasets and five mainstream segmentation models. The results demonstrate the effectiveness, stealthiness, and harmlessness of our method on the original model's segmentation performance.
For example, when applied to the SAM model, StealthMark consistently achieved ASR above 95% across various datasets while maintaining less than a 1% drop in Dice and AUC scores, significantly outperforming backdoor-based watermarking methods and highlighting its strong potential for practical deployment. Our implementation code is made available at: https://github.com/Qinkaiyu/StealthMark.
[136] iFSQ: Improving FSQ for Image Generation with 1 Line of Code
Bin Lin,Zongjian Li,Yuwei Niu,Kaixiong Gong,Yunyang Ge,Yunlong Lin,Mingzhe Zheng,JianWei Zhang,Miles Yang,Zhao Zhong,Liefeng Bo,Li Yuan
Main category: cs.CV
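TL;DR: This paper proposes iFSQ, which uses a distribution-matching mapping to resolve activation collapse in FSQ and achieve optimal quantization and reconstruction precision. Using iFSQ as a benchmark, it finds that the optimal balance between discrete and continuous representations is about 4 bits per dimension, and that AR models converge quickly while diffusion models reach a higher performance ceiling. Finally, it adapts REPA to AR models, yielding LlamaGen-REPA.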
Details
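Motivation: Image generation is currently split between autoregressive (AR) models on discrete tokens and diffusion models on continuous latents, which hinders unified modeling and fair benchmarking; FSQ offers a theoretical bridge, but its equal-interval quantization can cause activation collapse, making reconstruction fidelity and information efficiency hard to achieve together. Method: iFSQ replaces the activation function in the original FSQ with a distribution-matching mapping that forces the latent space to follow a uniform prior; this serves as a controlled benchmark for comparing the training dynamics and performance limits of AR and diffusion models under identical reconstruction constraints; Representation Alignment (REPA) is further adapted to AR models to build LlamaGen-REPA. Result: iFSQ needs only a one-line code change and mathematically guarantees optimal bin utilization and reconstruction precision; experiments place the optimal discrete-continuous balance at about 4 bits/dim; under identical reconstruction constraints, AR models converge faster early on, but diffusion models reach a higher final performance ceiling; adapting REPA to AR improves generation quality. Conclusion: iFSQ bridges the representation gap between AR and diffusion models and exposes the trade-off between them; 4 bits/dim is a key design guideline; strict sequential ordering may limit the quality ceiling of AR models; REPA transfers to the AR framework and improves its performance. Abstract: The field of image generation is currently bifurcated into autoregressive (AR) models operating on discrete tokens and diffusion models utilizing continuous latents. This divide, rooted in the distinction between VQ-VAEs and VAEs, hinders unified modeling and fair benchmarking. Finite Scalar Quantization (FSQ) offers a theoretical bridge, yet vanilla FSQ suffers from a critical flaw: its equal-interval quantization can cause activation collapse. This mismatch forces a trade-off between reconstruction fidelity and information efficiency. In this work, we resolve this dilemma by simply replacing the activation function in original FSQ with a distribution-matching mapping to enforce a uniform prior. Termed iFSQ, this simple strategy requires just one line of code yet mathematically guarantees both optimal bin utilization and reconstruction precision. Leveraging iFSQ as a controlled benchmark, we uncover two key insights: (1) The optimal equilibrium between discrete and continuous representations lies at approximately 4 bits per dimension. (2) Under identical reconstruction constraints, AR models exhibit rapid initial convergence, whereas diffusion models achieve a superior performance ceiling, suggesting that strict sequential ordering may limit the upper bounds of generation quality. Finally, we extend our analysis by adapting Representation Alignment (REPA) to AR models, yielding LlamaGen-REPA.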
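To make the idea of a distribution-matching mapping concrete, the sketch below compares two bounding functions on simulated activations. It is not the authors' implementation and the abstract does not state which mapping iFSQ uses: the standard-normal activations, the Gaussian-CDF mapping `gaussian_cdf_bound`, and the equal-width binning in `bin_usage` are all assumptions for illustration. The only fact relied on is the probability integral transform: if x follows N(0, 1), then erf(x / sqrt(2)) is uniform on (-1, 1).

```python
import math
import random


def tanh_bound(x):
    """tanh bounding, as used in vanilla FSQ: squashes x into (-1, 1)."""
    return math.tanh(x)


def gaussian_cdf_bound(x):
    """Assumed alternative: 2 * Phi(x) - 1 = erf(x / sqrt(2)).

    For x ~ N(0, 1) the output is uniformly distributed on (-1, 1).
    """
    return math.erf(x / math.sqrt(2.0))


def bin_usage(bound_fn, levels=8, n=100_000, seed=0):
    """Fraction of simulated activations landing in each equal-width bin."""
    rng = random.Random(seed)
    counts = [0] * levels
    for _ in range(n):
        y = bound_fn(rng.gauss(0.0, 1.0))            # bounded value in [-1, 1]
        idx = min(int((y + 1.0) / 2.0 * levels), levels - 1)
        counts[idx] += 1
    return [c / n for c in counts]
```

Under these assumptions, `bin_usage(gaussian_cdf_bound)` fills the bins almost equally, while `bin_usage(tanh_bound)` puts noticeably more mass in the outer bins than in the central ones.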
Code is available at https://github.com/Tencent-Hunyuan/iFSQ
[137] Scaling medical imaging report generation with multimodal reinforcement learning
Qianchu Liu,Sheng Zhang,Guanghui Qin,Yu Gu,Ying Jin,Sam Preston,Yanbo Xu,Sid Kiblawi,Wen-wai Yim,Tim Ossowski,Tristan Naumann,Mu Wei,Hoifung Poon
Main category: cs.CV
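TL;DR: This paper proposes UniRG, a general framework for medical imaging report generation that uses reinforcement learning to directly optimize application-oriented evaluation metrics. It clearly outperforms supervised fine-tuning, generalizes well across institutions and clinical practices, and sets a new SOTA on the ReXrank benchmark for CXR report generation.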
Details
Motivation: Frontier models still show clear competency gaps in multimodal understanding and reasoning, especially in high-value verticals such as biomedicine, with medical imaging report generation as one example; supervised fine-tuning tends to overfit to boilerplate language patterns and generalizes poorly. Method: The general report generation framework UniRG uses reinforcement learning as a unifying mechanism to directly optimize evaluation metrics designed for end applications; the UniRG-CXR model is trained on publicly available chest X-ray (CXR) data. Result: On the authoritative ReXrank benchmark, UniRG-CXR sets a new overall SOTA, outperforming the previous best method by a wide margin. Conclusion: By optimizing evaluation metrics end to end with reinforcement learning, UniRG improves report generation performance and generalizes robustly across institutions and clinical practices. Abstract: Frontier models have demonstrated remarkable capabilities in understanding and reasoning with natural-language text, but they still exhibit major competency gaps in multimodal understanding and reasoning, especially in high-value verticals such as biomedicine. Medical imaging report generation is a prominent example. Supervised fine-tuning can substantially improve performance, but it is prone to overfitting to superficial boilerplate patterns. In this paper, we introduce Universal Report Generation (UniRG) as a general framework for medical imaging report generation. By leveraging reinforcement learning as a unifying mechanism to directly optimize for evaluation metrics designed for end applications, UniRG can significantly improve upon supervised fine-tuning and attain durable generalization across diverse institutions and clinical practices. We trained UniRG-CXR on publicly available chest X-ray (CXR) data and conducted a thorough evaluation in CXR report generation with rigorous evaluation scenarios. On the authoritative ReXrank benchmark, UniRG-CXR sets new overall SOTA, outperforming prior state of the art by a wide margin.
[138] LGDWT-GS: Local and Global Discrete Wavelet-Regularized 3D Gaussian Splatting for Sparse-View Scene Reconstruction
Shima Salehi,Atharva Agashe,Andrew J. McFarland,Joshua Peeples
Main category: cs.CV
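TL;DR: This paper proposes a method that combines global and local frequency regularization for few-shot 3D reconstruction from sparse views, and releases a multispectral greenhouse dataset together with an open-source benchmarking package.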
Details
Motivation: Existing 3D Gaussian Splatting (3DGS) models suffer from unstable geometry and loss of detail under sparse views. Method: Global and local frequency regularization stabilizes geometry and preserves detail; a multispectral greenhouse dataset with four spectral bands is built; an open-source benchmarking package defining standardized few-shot reconstruction protocols is released. Result: Experiments on the new multispectral dataset and on standard benchmarks show reconstructions that are sharper, more stable, and more spectrally consistent. Conclusion: The method improves few-shot 3D reconstruction quality; the dataset and code are open source, supporting the use of 3DGS in cross-spectral settings such as agriculture and the standardization of its evaluation. Abstract: We propose a new method for few-shot 3D reconstruction that integrates global and local frequency regularization to stabilize geometry and preserve fine details under sparse-view conditions, addressing a key limitation of existing 3D Gaussian Splatting (3DGS) models. We also introduce a new multispectral greenhouse dataset containing four spectral bands captured from diverse plant species under controlled conditions. Alongside the dataset, we release an open-source benchmarking package that defines standardized few-shot reconstruction protocols for evaluating 3DGS-based methods. Experiments on our multispectral dataset, as well as standard benchmarks, demonstrate that the proposed method achieves sharper, more stable, and spectrally consistent reconstructions than existing baselines. The dataset and code for this work are publicly available.
[139] Decoding Psychological States Through Movement: Inferring Human Kinesic Functions with Application to Built Environments
Cheyu Lin,Katherine A. Flanigan,Sirajum Munir
Main category: cs.CV
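TL;DR: This paper introduces the DUET dataset and an embedded kinesics recognition framework that quantify the communicative functions of interpersonal interaction in built environments in a privacy-preserving way, filling the gap left by the lack of consistent, scalable measures of social interaction in social infrastructure research.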
Details
Motivation: Existing research lacks a consistent and privacy-preserving way to measure social interaction in social infrastructure, which limits the evaluation of design interventions and makes it hard to verify whether they affect the forms of interaction that social capital theory cares about. Method: The DUET dataset contains 12 dyadic interactions covering the five Ekman-Friesen kinesic functions (emblems, illustrators, affect displays, regulators, adaptors) across four sensing modalities and three built-environment contexts; a transfer-learning recognition framework based on skeletal motion infers communicative function directly, without a hand-defined action-to-function dictionary. Result: Benchmarks show that existing single-person action recognition models perform poorly on dyadic, socially oriented communicative-function recognition; the new framework recognizes communicative functions, shows functional clustering structure, exhibits a strong association between representation quality and classification performance, and generalizes across subjects and contexts. Conclusion: DUET and the proposed framework provide a scalable, privacy-preserving, theory-driven way to measure social interaction in built environments, moving social infrastructure evaluation from describing behavior toward understanding communicative intent. Abstract: Social infrastructure and other built environments are increasingly expected to support well-being and community resilience by enabling social interaction. Yet in civil and built-environment research, there is no consistent and privacy-preserving way to represent and measure socially meaningful interaction in these spaces, leaving studies to operationalize "interaction" differently across contexts and limiting practitioners' ability to evaluate whether design interventions are changing the forms of interaction that social capital theory predicts should matter. To address this field-level and methodological gap, we introduce the Dyadic User Engagement DataseT (DUET) dataset and an embedded kinesics recognition framework that operationalize Ekman and Friesen's kinesics taxonomy as a function-level interaction vocabulary aligned with social capital-relevant behaviors (e.g., reciprocity and attention coordination). DUET captures 12 dyadic interactions spanning all five kinesic functions (emblems, illustrators, affect displays, adaptors, and regulators) across four sensing modalities and three built-environment contexts, enabling privacy-preserving analysis of communicative intent through movement. Benchmarking six open-source, state-of-the-art human activity recognition models quantifies the difficulty of communicative-function recognition on DUET and highlights the limitations of ubiquitous monadic, action-level recognition when extended to dyadic, socially grounded interaction measurement.
Building on DUET, our recognition framework infers communicative function directly from privacy-preserving skeletal motion without handcrafted action-to-function dictionaries; using a transfer-learning architecture, it reveals structured clustering of kinesic functions and a strong association between representation quality and classification performance while generalizing across subjects and contexts.
[140] Structural Complexity of Brain MRI reveals age-associated patterns
Anzhe Cheng,Italo Ivo Lima Dias Pinto,Paul Bogdan
Main category: cs.CV
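TL;DR: This paper proposes a structural complexity analysis framework for three-dimensional signals, brain MRI in particular. A sliding-window coarse-graining scheme fixes the instability of traditional block-based coarse-graining at large scales. The analysis finds that brain structural complexity decreases systematically with age, most clearly at coarser scales, and can be used to predict biological age.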
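The sampling problem that motivates the sliding-window scheme can be shown on a toy signal. The sketch below is a one-dimensional simplification, not the paper's three-dimensional method; the function names and the use of plain window means are assumptions. It only shows that non-overlapping blocks leave very few samples at coarse scales, while stride-1 windows keep many.

```python
def block_coarse_grain(signal, size):
    """Non-overlapping block means; yields len(signal) // size samples."""
    return [
        sum(signal[i:i + size]) / size
        for i in range(0, len(signal) - size + 1, size)
    ]


def sliding_coarse_grain(signal, size):
    """Stride-1 window means; yields len(signal) - size + 1 samples."""
    return [
        sum(signal[i:i + size]) / size
        for i in range(len(signal) - size + 1)
    ]


def sample_counts(n, sizes):
    """(block, sliding) sample counts for each window size on n points."""
    return {s: (n // s, n - s + 1) for s in sizes}
```

For a signal of 64 points and a window of 16, the block scheme leaves 4 values and the sliding scheme leaves 49; the block values are a strided subset of the sliding ones, so the sliding estimate uses strictly more samples of the same quantity.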
Details
Motivation: Traditional block-based coarse-graining becomes unstable at large scales because of limited sampling, so a more robust multiscale structural complexity analysis for three-dimensional signals is needed. Method: A sliding-window coarse-graining scheme replaces block-based coarse-graining; 3D MRI signals are coarse-grained at multiple scales, and the information lost between adjacent scales is quantified to assess structural complexity. Result: On large structural MRI datasets of middle-aged and older adults, structural complexity decreases systematically with age, and the effect is strongest at coarser spatial scales. Conclusion: Structural complexity is a reliable tool for multiscale analysis of 3D imaging data; the refined method is more robust and is also useful for predicting biological age from brain MRI. Abstract: We adapt structural complexity analysis to three-dimensional signals, with an emphasis on brain magnetic resonance imaging (MRI). This framework captures the multiscale organization of volumetric data by coarse-graining the signal at progressively larger spatial scales and quantifying the information lost between successive resolutions. While the traditional block-based approach can become unstable at coarse resolutions due to limited sampling, we introduce a sliding-window coarse-graining scheme that provides smoother estimates and improved robustness at large scales. Using this refined method, we analyze large structural MRI datasets spanning mid- to late adulthood and find that structural complexity decreases systematically with age, with the strongest effects emerging at coarser scales. These findings highlight structural complexity as a reliable signal processing tool for multiscale analysis of 3D imaging data, while also demonstrating its utility in predicting biological age from brain MRI.
[141] Spatiotemporal Semantic V2X Framework for Cooperative Collision Prediction
Murat Arda Onsu,Poonam Lohan,Burak Kantarci,Aisha Syed,Matthew Andrews,Sean Kennedy
Main category: cs.CV
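TL;DR: This paper proposes a real-time collision prediction framework based on semantic V2X. A V-JEPA model at the RSU generates spatiotemporal semantic embeddings of future frames, and only this lightweight semantic information is transmitted instead of raw video, cutting communication overhead by four orders of magnitude while improving F1-score by 10%.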
Details
Motivation: Conventional ITS relies on transmitting raw video or high-dimensional sensory data, which is limited by vehicular communication bandwidth and latency and cannot meet the needs of real-time collision prediction. Method: A digital twin of an urban traffic environment generates diverse scenarios; a V-JEPA model at the RSU extracts spatiotemporal semantic embeddings of future frames; the embeddings are transmitted over V2X links; on the vehicle, a lightweight attentive probe and classifier decode them and predict collisions. Result: Compared with raw video transmission, communication requirements drop by four orders of magnitude and the collision prediction F1-score improves by 10%. Conclusion: Semantic V2X communication can support cooperative, real-time collision prediction in ITS while balancing efficiency and accuracy. Abstract: Intelligent Transportation Systems (ITS) demand real-time collision prediction to ensure road safety and reduce accident severity. Conventional approaches rely on transmitting raw video or high-dimensional sensory data from roadside units (RSUs) to vehicles, which is impractical under vehicular communication bandwidth and latency constraints. In this work, we propose a semantic V2X framework in which RSU-mounted cameras generate spatiotemporal semantic embeddings of future frames using the Video Joint Embedding Predictive Architecture (V-JEPA). To evaluate the system, we construct a digital twin of an urban traffic environment enabling the generation of diverse traffic scenarios with both safe and collision events. These embeddings of the future frame, extracted from V-JEPA, capture task-relevant traffic dynamics and are transmitted via V2X links to vehicles, where a lightweight attentive probe and classifier decode them to predict imminent collisions. By transmitting only semantic embeddings instead of raw frames, the proposed system significantly reduces communication overhead while maintaining predictive accuracy. Experimental results demonstrate that the framework with an appropriate processing method achieves a 10% F1-score improvement for collision prediction while reducing transmission requirements by four orders of magnitude compared to raw video. This validates the potential of semantic V2X communication to enable cooperative, real-time collision prediction in ITS.
[142] Semi-Supervised Domain Adaptation with Latent Diffusion for Pathology Image Classification
Tengyue Zhang,Ruiwen Ding,Luoting Zhuang,Yuxiao Wu,Erika F. Rodriguez,William Hsu
Main category: cs.CV
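TL;DR: This paper proposes a semi-supervised domain adaptation (SSDA) framework based on a latent diffusion model. It uses unlabeled data from the source and target domains to generate morphology-preserving, target-aware synthetic images, improving how well deep learning models in computational pathology generalize across institutions and cohorts.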
Details
Motivation: Deep learning models in computational pathology often generalize poorly across cohorts or institutions because of domain shift; existing methods either fail to use unlabeled target-domain data or rely on image-to-image translation, which can distort tissue structure and reduce accuracy. Method: A latent diffusion model conditioned on foundation model features, cohort identity, and tissue preparation method generates synthetic images that keep source-domain tissue structure while taking on target-domain appearance; these target-aware synthetic images are combined with real, labeled source-domain images to train a downstream classifier. Result: On lung adenocarcinoma prognostication, the weighted F1 score on the target cohort's held-out test set rises from 0.611 to 0.706 and the macro F1 score from 0.641 to 0.716, with no drop in source-cohort performance. Conclusion: Target-aware, diffusion-based synthetic data augmentation is a promising and effective way to improve domain generalization in computational pathology. Abstract: Deep learning models in computational pathology often fail to generalize across cohorts and institutions due to domain shift. Existing approaches either fail to leverage unlabeled data from the target domain or rely on image-to-image translation, which can distort tissue structures and compromise model accuracy. In this work, we propose a semi-supervised domain adaptation (SSDA) framework that utilizes a latent diffusion model trained on unlabeled data from both the source and target domains to generate morphology-preserving and target-aware synthetic images. By conditioning the diffusion model on foundation model features, cohort identity, and tissue preparation method, we preserve tissue structure in the source domain while introducing target-domain appearance characteristics. The target-aware synthetic images, combined with real, labeled images from the source cohort, are subsequently used to train a downstream classifier, which is then tested on the target cohort. The effectiveness of the proposed SSDA framework is demonstrated on the task of lung adenocarcinoma prognostication. The proposed augmentation yielded substantially better performance on the held-out test set from the target cohort, without degrading source-cohort performance. The approach improved the weighted F1 score on the target-cohort held-out test set from 0.611 to 0.706 and the macro F1 score from 0.641 to 0.716. Our results demonstrate that target-aware diffusion-based synthetic data augmentation provides a promising and effective approach for improving domain generalization in computational pathology.
[143] C-RADIOv4 (Tech Report)
Mike Ranzinger,Greg Heinrich,Collin McCarthy,Jan Kautz,Andrew Tao,Bryan Catanzaro,Pavlo Molchanov
Main category: cs.CV
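TL;DR: C-RADIOv4 is a vision backbone built with multi-teacher distillation. By merging the capabilities of several teacher models, including SigLIP2, DINOv3, and SAM3, it markedly improves downstream task performance at the same computational complexity, and it strengthens any-resolution support and high-resolution efficiency.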
Details
Motivation: The goal is to unify and strengthen the complementary capabilities of several advanced vision teacher models (such as SigLIP2, DINOv3, and SAM3) while keeping computational efficiency and generality (such as any-resolution support). Method: Multi-teacher knowledge distillation builds a unified student model; the design builds on AM-RADIO/RADIOv2.5; a new teacher set is introduced, and the ViTDet option is restored to improve high-resolution inference efficiency. Result: Two variants are released, C-RADIOv4-SO400M (412M parameters) and -H (631M parameters), with clear gains on core metrics, new SAM3-related capabilities, better any-resolution support and high-resolution efficiency, and a permissive license. Conclusion: C-RADIOv4 combines the strengths of multiple teachers and improves on performance, flexibility, and practicality, making it an efficient general-purpose vision backbone aimed at real deployment. Abstract: By leveraging multi-teacher distillation, agglomerative vision backbones provide a unified student model that retains and improves the distinct capabilities of multiple teachers. In this tech report, we describe the most recent release of the C-RADIO family of models, C-RADIOv4, which builds upon AM-RADIO/RADIOv2.5 in design, offering strong improvements on key downstream tasks at the same computational complexity. We release -SO400M (412M params), and -H (631M) model variants, both trained with an updated set of teachers: SigLIP2, DINOv3, and SAM3. In addition to improvements on core metrics and new capabilities from imitating SAM3, the C-RADIOv4 model family further improves any-resolution support, brings back the ViTDet option for drastically enhanced efficiency at high-resolution, and comes with a permissive license.
[144] Multi-stage Bridge Inspection System: Integrating Foundation Models with Location Anonymization
Takato Yasuno
Main category: cs.CV
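TL;DR: This paper presents an open-source bridge damage detection system. It uses the SAM3 model for rebar corrosion detection, DBSCAN to fill in missed regions, Gaussian blur to protect construction sign regions, and four preprocessing methods to improve OCR accuracy. GPU optimization brings processing to 1.7 seconds per image, balancing damage detection accuracy with protection of regional information.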
Details
Motivation: In Japan, bridges and other infrastructure must undergo visual inspection every five years, but damage images captured in the field often contain construction signs and similar content that can reveal regional information; to keep facilities in safe use without causing public anxiety, damage features must be extracted accurately while regional information is protected. Method: Segment Anything Model 3 (SAM3) detects rebar corrosion, and DBSCAN automatically completes missed regions; construction sign regions are protected with Gaussian blur; four image preprocessing methods improve OCR accuracy; the whole pipeline is GPU-accelerated. The technology stack includes SAM3, PyTorch, OpenCV, pytesseract, and scikit-learn. Result: The system processes an image in 1.7 seconds on average, identifies concrete cracks and rebar exposure, automatically detects and blurs construction sign regions, improves OCR accuracy, and visualizes key damage indicators to support repair decisions. Conclusion: The system protects regionally sensitive information while keeping damage detection accurate, offering a workable open-source solution for infrastructure inspection. Abstract: In Japan, civil infrastructure condition monitoring is mandated through visual inspection every five years. Field-captured damage images frequently contain concrete cracks and rebar exposure, often accompanied by construction signs revealing regional information. To enable safe infrastructure use without causing public anxiety, it is essential to protect regional information while accurately extracting damage features and visualizing key indicators for repair decision-making. This paper presents an open-source bridge damage detection system with regional privacy protection capabilities. We employ Segment Anything Model (SAM) 3 for rebar corrosion detection and utilize DBSCAN for automatic completion of missed regions. Construction sign regions are detected and protected through Gaussian blur. Four preprocessing methods improve OCR accuracy, and GPU optimization enables 1.7-second processing per image. The technology stack includes SAM3, PyTorch, OpenCV, pytesseract, and scikit-learn, achieving efficient bridge inspection with regional information protection.
[145] FineVAU: A Novel Human-Aligned Benchmark for Fine-Grained Video Anomaly Understanding
João Pereira,Vasco Lopes,João Neves,David Semedo
Main category: cs.CV
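TL;DR: This paper proposes the FineVAU benchmark for Video Anomaly Understanding (VAU). Its FVScore metric and FineW3 dataset enable fine-grained, visually grounded evaluation of anomalous events (What), participating entities (Who), and location (Where), and agree much better with human perception.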
Details
Motivation: Existing VAU evaluation methods (n-gram-based or LLM-based) cannot accurately measure how well an LVLM understands visual anomalies in a fine-grained, factual way, and they are misaligned with human perception. Method: The FineVAU benchmark formulates VAU as understanding three elements, What/Who/Where; the human-aligned FVScore metric assesses whether key visual elements are present in an answer; the fine-grained FineW3 dataset is built through a fully automatic augmentation procedure. Result: FVScore aligns with human evaluation significantly better than existing metrics; experiments show that LVLMs have clear weaknesses on anomalous events that require spatial and fine-grained temporal understanding, even though they do well on static, coarse-grained events and events with strong visual cues. Conclusion: FineVAU pushes VAU toward finer, visually grounded, human-aligned evaluation and exposes fundamental limits of current LVLMs in understanding complex spatiotemporal anomalies. Abstract: Video Anomaly Understanding (VAU) is a novel task focused on describing unusual occurrences in videos. Despite growing interest, the evaluation of VAU remains an open challenge. Existing benchmarks rely on n-gram-based metrics (e.g., BLEU, ROUGE-L) or LLM-based evaluation. The former fails to capture the rich, free-form, and visually grounded nature of LVLM responses, while the latter focuses on assessing language quality over factual relevance, often resulting in subjective judgments that are misaligned with human perception. In this work, we address this issue by proposing FineVAU, a new benchmark for VAU that shifts the focus towards rich, fine-grained and domain-specific understanding of anomalous videos. We formulate VAU as a three-fold problem, with the goal of comprehensively understanding key descriptive elements of anomalies in video: events (What), participating entities (Who) and location (Where). Our benchmark introduces a) FVScore, a novel, human-aligned evaluation metric that assesses the presence of critical visual elements in LVLM answers, providing interpretable, fine-grained feedback; and b) FineW3, a novel, comprehensive dataset curated through a structured and fully automatic procedure that augments existing human annotations with high quality, fine-grained visual information. Human evaluation reveals that our proposed metric has a superior alignment with human perception of anomalies in comparison to current approaches.
Detailed experiments on FineVAU unveil critical limitations in LVLMs' ability to perceive anomalous events that require spatial and fine-grained temporal understanding, despite strong performance on coarse-grained, static information, and events with strong visual cues.
[146] Inference-Time Loss-Guided Colour Preservation in Diffusion Sampling
Angad Singh Ahuja,Aarush Ram Anandh
Main category: cs.CV
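TL;DR: This paper proposes a training-free, inference-time method for region-constrained color preservation. It combines ROI masking, background-latent re-imposition, and gradient-guided latent nudging in CIE Lab and linear RGB with a distribution-aware loss (including CVaR and soft-maximum penalties), markedly improving color accuracy in specified regions of text-to-image generation.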
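The difference between a mean-only objective and a tail-aware one can be shown with plain numbers. The sketch below is not the paper's loss: the per-pixel error values, the `tail_fraction` parameter, the temperature `tau`, and the function names are assumptions. It computes the mean, a CVaR-style tail mean, and a log-sum-exp soft maximum over a list of per-pixel color errors.

```python
import math


def mean_error(errors):
    """Average per-pixel error; blind to where the error is concentrated."""
    return sum(errors) / len(errors)


def cvar(errors, tail_fraction=0.1):
    """CVaR-style penalty: mean of the worst `tail_fraction` of the errors."""
    k = max(1, round(tail_fraction * len(errors)))
    worst = sorted(errors, reverse=True)[:k]
    return sum(worst) / k


def soft_max(errors, tau=1.0):
    """Smooth upper bound on max(errors): tau * log(sum(exp(e / tau)))."""
    m = max(errors)                      # subtract the max for stability
    return m + tau * math.log(sum(math.exp((e - m) / tau) for e in errors))


# Two hypothetical regions with the same mean error of 2.0.
uniform_region = [2.0] * 100
patchy_region = [1.0] * 95 + [21.0] * 5   # small patch with a large error
```

Both regions have the same mean error, so a mean-only constraint cannot tell them apart, whereas `cvar` and `soft_max` are much larger for `patchy_region`. This is the kind of local failure that the distribution-aware objective is meant to penalize.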
Details
Motivation: Existing text-to-image diffusion models struggle to meet user-specified color targets precisely in design-oriented tasks and are prone to local color errors, while existing methods mostly rely on mean color constraints and ignore the tail of the pixelwise error distribution. Method: The approach has three parts: (i) ROI-based inpainting for spatial selectivity; (ii) re-imposition of background latents to prevent color drift outside the ROI; (iii) a composite loss defined in CIE Lab and linear RGB, with CVaR-style and soft-maximum penalties that control the tail of the pixelwise error distribution, plus a late-start gate and a time-dependent schedule that stabilize gradient guidance during denoising. Result: Without modifying pretrained model weights, the method improves regional color fidelity; compared with mean-only baselines it suppresses perceptually salient local color failures, and it can be integrated into standard Stable Diffusion inpainting pipelines. Conclusion: Distribution-aware inference-time color control outperforms mean-oriented approaches and provides a practical, training-free, plug-and-play mechanism for regional color control in design tasks with strict color requirements. Abstract: Precise color control remains a persistent failure mode in text-to-image diffusion systems, particularly in design-oriented workflows where outputs must satisfy explicit, user-specified color targets. We present an inference-time, region-constrained color preservation method that steers a pretrained diffusion model without any additional training. Our approach combines (i) ROI-based inpainting for spatial selectivity, (ii) background-latent re-imposition to prevent color drift outside the ROI, and (iii) latent nudging via gradient guidance using a composite loss defined in CIE Lab and linear RGB. The loss is constructed to control not only the mean ROI color but also the tail of the pixelwise error distribution through CVaR-style and soft-maximum penalties, with a late-start gate and a time-dependent schedule to stabilize guidance across denoising steps. We show that mean-only baselines can satisfy average color constraints while producing perceptually salient local failures, motivating our distribution-aware objective. The resulting method provides a practical, training-free mechanism for targeted color adherence that can be integrated into standard Stable Diffusion inpainting pipelines.
[147] Cross360: 360° Monocular Depth Estimation via Cross Projections Across Scales
Kun Huang,Fang-Lue Zhang,Neil Dodgson
Main category: cs.CV
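TL;DR: This paper proposes Cross360, a cross-attention-based architecture for 360° depth estimation. It fuses local features from less-distorted tangent projections with global equirectangular features, addressing the difficulty of achieving both global continuity and local consistency in spherical images.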
Details
Motivation: Existing methods struggle to preserve global continuity and avoid distortion at the same time in spherical images, and multi-projection fusion strategies fall short in local perception and in aligning features at patch boundaries. Method: The Cross360 architecture includes a Cross Projection Feature Alignment module, which uses cross-attention to align local tangent projection features with global equirectangular features, and a Progressive Feature Aggregation with Attention module, which progressively aggregates multi-scale features. Result: Cross360 significantly outperforms existing methods on most benchmark datasets, especially where the entire 360° image is available; code and model are open source. Conclusion: Cross360 improves the accuracy and global consistency of 360° depth estimation, supporting cross-attention as a way to fuse local and global information. Abstract: 360° depth estimation is a challenging research problem due to the difficulty of finding a representation that both preserves global continuity and avoids distortion in spherical images. Existing methods attempt to leverage complementary information from multiple projections, but struggle with balancing global and local consistency. Their local patch features have limited global perception, and the combined global representation does not address discrepancies in feature extraction at the boundaries between patches. To address these issues, we propose Cross360, a novel cross-attention-based architecture integrating local and global information using less-distorted tangent patches along with equirectangular features. Our Cross Projection Feature Alignment module employs cross-attention to align local tangent projection features with the equirectangular projection's 360° field of view, ensuring each tangent projection patch is aware of the global context. Additionally, our Progressive Feature Aggregation with Attention module refines multi-scaled features progressively, enhancing depth estimation accuracy. Cross360 significantly outperforms existing methods across most benchmark datasets, especially those in which the entire 360° image is available, demonstrating its effectiveness in accurate and globally consistent depth estimation. The code and model are available at https://github.com/huangkun101230/Cross360.
[148] Fluxamba: Topology-Aware Anisotropic State Space Models for Geological Lineament Segmentation in Multi-Source Remote Sensing
Jin Bai,Huiyao Zhang,Qi Wen,Shengyang Li,Xiaolin Tian,Atta ur Rahman
Main category: cs.CV
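TL;DR: This paper proposes Fluxamba, a lightweight architecture for segmenting geological linear features. Its topology-aware feature rectification framework (Structural Flux Block SFB, Anisotropic Structural Gate ASG, Prior-Modulated Flow PMF, Hierarchical Spatial Regulator HSR, and High-Fidelity Focus Unit HFFU) addresses the topological mismatch of conventional state space models on curvilinear geological features. It reaches SOTA performance on several geological datasets while remaining accurate and deployable in real time.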
Details
Motivation: Conventional State Space Models (SSMs) depend on rigid, axis-aligned scanning paths that do not match the anisotropic, curvilinear topology of geological linear features, leading to fragmented context and feature degradation that prevent precise segmentation. Method: The Fluxamba architecture centers on the Structural Flux Block (SFB), which combines an Anisotropic Structural Gate (ASG) with a Prior-Modulated Flow (PMF) to decouple feature orientation from spatial location and aggregate context dynamically along the target's intrinsic geometry; a Hierarchical Spatial Regulator (HSR) performs multi-scale semantic alignment, and a High-Fidelity Focus Unit (HFFU) raises the signal-to-noise ratio of faint features under low contrast. Result: Fluxamba sets a new SOTA on diverse geological benchmarks (LROC-Lineament, LineaMapper, and GeoCrack): on LROC-Lineament it reaches an F1-score of 89.22% and an mIoU of 89.87%; inference exceeds 24 FPS with only 3.4M parameters and 6.3G FLOPs, cutting computational cost by up to two orders of magnitude relative to heavy-weight baselines. Conclusion: Fluxamba bridges the gap between modeling complex geological topology and efficient computation, establishing a new Pareto frontier between segmentation accuracy and feasibility of edge deployment. Abstract: The precise segmentation of geological linear features, spanning from planetary lineaments to terrestrial fractures, demands capturing long-range dependencies across complex anisotropic topologies. Although State Space Models (SSMs) offer near-linear computational complexity, their dependence on rigid, axis-aligned scanning trajectories induces a fundamental topological mismatch with curvilinear targets, resulting in fragmented context and feature erosion. To bridge this gap, we propose Fluxamba, a lightweight architecture that introduces a topology-aware feature rectification framework. Central to our design is the Structural Flux Block (SFB), which orchestrates an anisotropic information flux by integrating an Anisotropic Structural Gate (ASG) with a Prior-Modulated Flow (PMF). This mechanism decouples feature orientation from spatial location, dynamically gating context aggregation along the target's intrinsic geometry rather than rigid paths. Furthermore, to mitigate serialization-induced noise in low-contrast environments, we incorporate a Hierarchical Spatial Regulator (HSR) for multi-scale semantic alignment and a High-Fidelity Focus Unit (HFFU) to explicitly maximize the signal-to-noise ratio of faint features. Extensive experiments on diverse geological benchmarks (LROC-Lineament, LineaMapper, and GeoCrack) demonstrate that Fluxamba establishes a new state-of-the-art. Notably, on the challenging LROC-Lineament dataset, it achieves an F1-score of 89.22% and mIoU of 89.87%.
Achieving a real-time inference speed of over 24 FPS with only 3.4M parameters and 6.3G FLOPs, Fluxamba reduces computational costs by up to two orders of magnitude compared to heavy-weight baselines, thereby establishing a new Pareto frontier between segmentation fidelity and onboard deployment feasibility.
[149] Dynamic Meta-Ensemble Framework for Efficient and Accurate Deep Learning in Plant Leaf Disease Detection on Resource-Constrained Edge Devices
Weloday Fikadu Moges,Jianmei Su,Amin Waqas
Main category: cs.CV
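TL;DR: This paper proposes a Dynamic Meta-Ensemble Framework (DMEF) that fuses three lightweight CNN models with adaptive weights to achieve high-accuracy plant disease detection on resource-constrained edge devices.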
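One way to picture adaptive weighting that trades accuracy gain against model size is sketched below. The summary does not give DMEF's update rule, so everything here is an assumption: the scoring formula (accuracy gain minus a size penalty), the softmax with a temperature, the `penalty` and `temperature` values, and the example numbers are hypothetical and serve only to illustrate the idea of favoring accurate, compact models.

```python
import math


def ensemble_weights(delta_acc, size_mb, penalty=0.5, temperature=0.1):
    """Softmax weights from (accuracy gain - penalty * relative model size)."""
    max_size = max(size_mb)
    scores = [d - penalty * s / max_size for d, s in zip(delta_acc, size_mb)]
    top = max(scores)                     # subtract the max for stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def combine(probabilities, weights):
    """Weighted average of per-model class-probability vectors."""
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities))
        for c in range(n_classes)
    ]


# Hypothetical accuracy gains and model sizes for three models.
weights = ensemble_weights([0.02, 0.02, 0.01], [14.0, 92.0, 23.0])
fused = combine([[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]], weights)
```

With these made-up numbers, the first two models have the same accuracy gain, but the smaller one receives the larger weight. Re-running `ensemble_weights` after each validation pass would be one simple way to update the weights iteratively.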
Details
Motivation: Deploying deep learning models on edge devices (such as IoT sensors and smartphones) is limited by computational resources and energy consumption, so a solution that balances accuracy and efficiency is needed. Method: The Dynamic Meta-Ensemble Framework (DMEF) uses an adaptive weighting mechanism that dynamically fuses the predictions of MobileNetV2, NASNetMobile, and InceptionV3 based on a trade-off between accuracy improvement (DeltaAcc) and model size; weights are updated iteratively during training, favoring models with high performance and low complexity. Result: On potato and maize disease datasets, classification accuracy reaches 99.53% and 96.61% respectively, better than standalone models and static ensembles; inference latency is under 75 ms and the parameter count under 1M. Conclusion: DMEF balances accuracy against edge-computing constraints and offers a workable approach to scalable crop disease monitoring in the field, bridging the gap between high-accuracy AI and practical field applications. Abstract: Deploying deep learning models for plant disease detection on edge devices such as IoT sensors, smartphones, and embedded systems is severely constrained by limited computational resources and energy budgets. To address this challenge, we introduce a novel Dynamic Meta-Ensemble Framework (DMEF) for high-accuracy plant disease diagnosis under resource constraints. DMEF employs an adaptive weighting mechanism that dynamically combines the predictions of three lightweight convolutional neural networks (MobileNetV2, NASNetMobile, and InceptionV3) by optimizing a trade-off between accuracy improvements (DeltaAcc) and computational efficiency (model size). During training, the ensemble weights are updated iteratively, favoring models exhibiting high performance and low complexity. Extensive experiments on benchmark datasets for potato and maize diseases demonstrate state-of-the-art classification accuracies of 99.53% and 96.61%, respectively, surpassing standalone models and static ensembles by 2.1% and 6.3%. With computationally efficient inference latency (<75ms) and a compact footprint (<1 million parameters), DMEF shows strong potential for edge-based agricultural monitoring, suggesting viability for scalable crop disease management. This bridges the gap between high-accuracy AI and practical field applications.
[150] ClinNet: Evidential Ordinal Regression with Bilateral Asymmetry and Prototype Memory for Knee Osteoarthritis Grading
Xiaoyang Li,Runni Zhou
Main category: cs.CV
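TL;DR: This paper proposes ClinNet, a trustworthy framework for grading knee osteoarthritis (KOA) from X-ray images. It models the problem as evidential ordinal regression and combines a bilateral asymmetry encoder, a diagnostic memory bank, and an ordinal head based on the NIG distribution, outperforming existing methods in both accuracy and uncertainty estimation.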
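Because the grades are ordinal, agreement on this task is commonly reported with the Quadratic Weighted Kappa, which penalizes a prediction by the squared distance between predicted and true grade. The sketch below is a plain-Python version of the standard formula, not code from the paper; the function name and the toy labels are assumptions, and five grades (0 to 4) are assumed, as in the Kellgren-Lawrence scale.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with quadratic disagreement weights (i - j)^2."""
    n = len(y_true)
    # Confusion matrix: rows are true grades, columns are predicted grades.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [
        sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)
    ]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]                      # observed cost
            den += w * true_hist[i] * pred_hist[j] / n     # chance-level cost
    return 1.0 - num / den


grades = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
near_miss = [0, 1, 1, 1, 2, 2, 3, 3, 4, 4]   # one grade-0 case called grade 1
far_miss = [0, 4, 1, 1, 2, 2, 3, 3, 4, 4]    # one grade-0 case called grade 4
```

Both toy predictions contain a single error, so plain accuracy is identical, but the kappa for `near_miss` is much higher than for `far_miss`. This is why an ordinal metric is reported alongside accuracy.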
Details
Motivation: Radiographic KOA grading is challenged by subtle inter-grade differences, uncertainty in expert annotations, and the inherently ordinal nature of disease progression; conventional deep learning approaches treat it as deterministic multi-class classification, ignoring both the continuity of degeneration and annotation uncertainty. Method: The ClinNet framework includes (1) a Bilateral Asymmetry Encoder (BAE) that explicitly models medial-lateral structural differences; (2) a Diagnostic Memory Bank that maintains class prototypes to stabilize feature representations; (3) an Evidential Ordinal Head based on the Normal-Inverse-Gamma (NIG) distribution that jointly estimates continuous KL grades and epistemic uncertainty. Result: ClinNet reaches a Quadratic Weighted Kappa of 0.892 and an accuracy of 0.768, significantly outperforming SOTA baselines (p < 0.001); its uncertainty estimates flag out-of-distribution samples and potential misdiagnoses. Conclusion: By combining evidential ordinal regression with these cooperating modules, ClinNet improves the accuracy and trustworthiness of KOA grading and opens a path toward safe clinical deployment. Abstract: Knee osteoarthritis (KOA) grading based on radiographic images is a critical yet challenging task due to subtle inter-grade differences, annotation uncertainty, and the inherently ordinal nature of disease progression. Conventional deep learning approaches typically formulate this problem as deterministic multi-class classification, ignoring both the continuous progression of degeneration and the uncertainty in expert annotations. In this work, we propose ClinNet, a novel trustworthy framework that addresses KOA grading as an evidential ordinal regression problem. The proposed method integrates three key components: (1) a Bilateral Asymmetry Encoder (BAE) that explicitly models medial-lateral structural discrepancies; (2) a Diagnostic Memory Bank that maintains class-wise prototypes to stabilize feature representations; and (3) an Evidential Ordinal Head based on the Normal-Inverse-Gamma (NIG) distribution to jointly estimate continuous KL grades and epistemic uncertainty. Extensive experiments demonstrate that ClinNet achieves a Quadratic Weighted Kappa of 0.892 and Accuracy of 0.768, statistically outperforming state-of-the-art baselines (p < 0.001). Crucially, we demonstrate that the model's uncertainty estimates successfully flag out-of-distribution samples and potential misdiagnoses, paving the way for safe clinical deployment.
[151] SkyReels-V3 Technique Report
Debang Li,Zhengcong Fei,Tuanhui Li,Yikun Dou,Zheng Chen,Jiangping Yang,Mingyuan Fan,Jingtao Xu,Jiahua Wang,Baoxuan Gu,Mingshan Chang,Yuqiang Xie,Binjie Mao,Youqiang Zhang,Nuo Pang,Hao Zhang,Yuzhe Jin,Zhiheng Xu,Dixuan Lin,Guibin Chen,Yahui Zhou
Main category: cs.CV
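TL;DR: SkyReels-V3 is a conditional video generation model built on a unified multimodal in-context learning framework with diffusion Transformers. It supports three paradigms (image-to-video, video extension, and audio-driven video generation) and reaches state-of-the-art or near state-of-the-art results in visual quality, instruction following, and related metrics.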
Details
Motivation: Video generation is a core capability for building world models, and multimodal contextual inference is the key test of that capability; existing methods still struggle with subject consistency, temporal coherence, narrative consistency, and audio-video synchronization. Method: SkyReels-V3 is built on diffusion Transformers and a unified multimodal in-context learning framework. A data pipeline with cross-frame pairing, image editing, and semantic rewriting suppresses copy-paste artifacts; training uses an image-video hybrid strategy with multi-resolution joint optimization; video extension adds spatio-temporal consistency modeling and large-scale video understanding; audio-driven generation adds first-and-last frame insertion and a reconstructed key-frame inference scheme. Result: The model reaches state-of-the-art or near state-of-the-art results on visual quality, instruction following, and specific aspect metrics, approaching leading closed-source systems. Conclusion: SkyReels-V3 supports several video generation paradigms in one architecture and improves fidelity, consistency, controllability, and audio-video synchronization, supporting the value of multimodal in-context learning for video generation. Abstract: Video generation serves as a cornerstone for building world models, where multimodal contextual inference stands as the defining test of capability. To this end, we present SkyReels-V3, a conditional video generation model built upon a unified multimodal in-context learning framework with diffusion Transformers. The SkyReels-V3 model supports three core generative paradigms within a single architecture: reference images-to-video synthesis, video-to-video extension, and audio-guided video generation. (i) The reference images-to-video model is designed to produce high-fidelity videos with strong subject identity preservation, temporal coherence, and narrative consistency. To enhance reference adherence and compositional stability, we design a comprehensive data processing pipeline that leverages cross-frame pairing, image editing, and semantic rewriting, effectively mitigating copy-paste artifacts. During training, an image-video hybrid strategy combined with multi-resolution joint optimization is employed to improve generalization and robustness across diverse scenarios. (ii) The video extension model integrates spatio-temporal consistency modeling with large-scale video understanding, enabling both seamless single-shot continuation and intelligent multi-shot switching with professional cinematographic patterns. (iii) The talking avatar model supports minute-level audio-conditioned video generation by training first-and-last frame insertion patterns and reconstructing key-frame inference paradigms. On the basis of ensuring visual quality, synchronization of audio and video has been optimized.
Extensive evaluations demonstrate that SkyReels-V3 achieves state-of-the-art or near state-of-the-art performance on key metrics including visual quality, instruction following, and specific aspect metrics, approaching leading closed-source systems. GitHub: https://github.com/SkyworkAI/SkyReels-V3.
[152] SymbolSight: Minimizing Inter-Symbol Interference for Reading with Prosthetic Vision
Jasmine Lesner,Michael Beyeler
Main category: cs.CV
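TL;DR: This paper proposes SymbolSight, a framework that optimizes the mapping between visual symbols and letters to reduce the character recognition errors caused by temporal interference in retinal prostheses; in simulations for Arabic, Bulgarian, and English it lowers predicted confusion by a median factor of 22.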
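The optimization behind the SymbolSight entry can be pictured as an assignment problem: choose one symbol per letter so that letters that often appear next to each other get symbols that are rarely confused. The sketch below is a brute-force toy version under stated assumptions; the confusion values, bigram weights, and function names are invented for illustration, and the paper's actual optimizer and observer model are not reproduced here.

```python
from itertools import permutations


def expected_confusion(mapping, bigram_freq, confusion):
    """Sum of bigram frequency times confusability of the two assigned symbols."""
    total = 0.0
    for (first, second), freq in bigram_freq.items():
        total += freq * confusion[mapping[first]][mapping[second]]
    return total


def best_mapping(letters, symbols, bigram_freq, confusion):
    """Exhaustively search symbol assignments (only feasible for tiny alphabets)."""
    best, best_cost = None, float("inf")
    for perm in permutations(symbols, len(letters)):
        mapping = dict(zip(letters, perm))
        cost = expected_confusion(mapping, bigram_freq, confusion)
        if cost < best_cost:
            best, best_cost = mapping, cost
    return best, best_cost


# Toy data (hypothetical): pairwise confusability of four candidate symbols.
PAIRS = {("s0", "s1"): 0.9, ("s0", "s2"): 0.1, ("s0", "s3"): 0.2,
         ("s1", "s2"): 0.3, ("s1", "s3"): 0.05, ("s2", "s3"): 0.6}
SYMBOLS = ["s0", "s1", "s2", "s3"]
CONFUSION = {a: {b: 1.0 if a == b else PAIRS.get((a, b), PAIRS.get((b, a)))
                 for b in SYMBOLS} for a in SYMBOLS}
# Toy bigram frequencies (hypothetical) for a three-letter alphabet.
BIGRAMS = {("t", "h"): 0.5, ("h", "e"): 0.4, ("e", "t"): 0.1}

MAPPING, COST = best_mapping(["t", "h", "e"], SYMBOLS, BIGRAMS, CONFUSION)
```

Exhaustive search grows factorially with alphabet size, so a real alphabet would need a heuristic or a dedicated quadratic-assignment solver; the toy version only shows the objective being minimized.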
Details
Motivation: Retinal prostheses have low spatial resolution and temporal persistence. When letters are presented sequentially, the afterimage of one symbol interferes with perception of the next and causes systematic recognition errors; the paper aims to mitigate this by optimizing symbol design rather than waiting for hardware upgrades. Method: SymbolSight is a computational framework that uses simulated prosthetic vision (SPV) and a neural proxy observer to estimate pairwise symbol confusability, then optimizes the symbol-to-letter mapping with language-specific bigram frequency statistics. Result: In simulations for Arabic, Bulgarian, and English, the resulting heterogeneous symbol sets reduce predicted confusion by a median factor of 22 relative to the native alphabets. Conclusion: Standard typography is poorly suited to serial, low-bandwidth prosthetic vision; computational modeling can efficiently narrow the design space of visual encodings and supply high-potential candidates for later psychophysical and clinical evaluation. Abstract: Retinal prostheses restore limited visual perception, but low spatial resolution and temporal persistence make reading difficult. In sequential letter presentation, the afterimage of one symbol can interfere with perception of the next, leading to systematic recognition errors. Rather than relying on future hardware improvements, we investigate whether optimizing the visual symbols themselves can mitigate this temporal interference. We present SymbolSight, a computational framework that selects symbol-to-letter mappings to minimize confusion among frequently adjacent letters. Using simulated prosthetic vision (SPV) and a neural proxy observer, we estimate pairwise symbol confusability and optimize assignments using language-specific bigram statistics. Across simulations in Arabic, Bulgarian, and English, the resulting heterogeneous symbol sets reduced predicted confusion by a median factor of 22 relative to native alphabets. These results suggest that standard typography is poorly matched to serial, low-bandwidth prosthetic vision and demonstrate how computational modeling can efficiently narrow the design space of visual encodings to generate high-potential candidates for future psychophysical and clinical evaluation.
[153] Learning with Geometric Priors in U-Net Variants for Polyp Segmentation
Fabian Vazquez,Jose A. Nuñez,Diego Adame,Alissen Moreno,Augustin Zhan,Huimin Li,Jinghao Yang,Haoteng Tang,Bin Fu,Pengfei Gu
Main category: cs.CV
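TL;DR: This paper proposes a Geometric Prior-guided Module (GPM) that fuses depth maps with attention mechanisms to improve how U-Net-style models segment polyps in colonoscopy images with low contrast and complex backgrounds.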
Details
Motivation: Existing CNN-, Transformer-, and Mamba-based U-Net variants still struggle to capture geometric and structural cues in polyp segmentation, especially in low-contrast or cluttered scenes. Method: The Geometric Prior-guided Module (GPM) first fine-tunes the Visual Geometry Grounded Transformer (VGGT) on a simulated ColonDepth dataset to produce polyp depth maps, then encodes those depth maps as geometric priors injected into the U-Net encoder features and refines them with spatial and channel attention. GPM is plug-and-play and compatible with a range of U-Net architectures. Result: Extensive experiments on five public polyp segmentation datasets show consistent gains over three strong baselines. Conclusion: By introducing explicit geometric priors and combining them with attention, GPM improves the accuracy and robustness of polyp segmentation and provides a more reliable computer-aided diagnosis tool for early colorectal cancer screening. Abstract: Accurate and robust polyp segmentation is essential for early colorectal cancer detection and for computer-aided diagnosis. While convolutional neural network-, Transformer-, and Mamba-based U-Net variants have achieved strong performance, they still struggle to capture geometric and structural cues, especially in low-contrast or cluttered colonoscopy scenes. To address this challenge, we propose a novel Geometric Prior-guided Module (GPM) that injects explicit geometric priors into U-Net-based architectures for polyp segmentation. Specifically, we fine-tune the Visual Geometry Grounded Transformer (VGGT) on a simulated ColonDepth dataset to estimate depth maps of polyp images tailored to the endoscopic domain. These depth maps are then processed by GPM to encode geometric priors into the encoder's feature maps, where they are further refined using spatial and channel attention mechanisms that emphasize both local spatial and global channel information. GPM is plug-and-play and can be seamlessly integrated into diverse U-Net variants. Extensive experiments on five public polyp segmentation datasets demonstrate consistent gains over three strong baselines. Code and the generated depth maps are available at: https://github.com/fvazqu/GPM-PolypSeg
[154] AGE-Net: Spectral-Spatial Fusion and Anatomical Graph Reasoning with Evidential Ordinal Regression for Knee Osteoarthritis Grading
Xiaoyang Li,Runni Zhou
Main category: cs.CV
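TL;DR: This paper proposes AGE-Net, a ConvNeXt-based framework for automated Kellgren-Lawrence (KL) grading. It integrates Spectral-Spatial Fusion (SSF), Anatomical Graph Reasoning (AGR), and Differential Refinement (DFR), and uses a Normal-Inverse-Gamma (NIG) evidential regression head with a pairwise ordinal ranking constraint to model predictive uncertainty and grade ordinality, reaching state-of-the-art performance on KL grading of knee radiographs.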
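Both knee osteoarthritis entries report quadratic weighted kappa (QWK), so a small reference implementation may help when reading the numbers. The following is a minimal pure-Python sketch of the standard QWK definition for integer grades; the function name and the five-grade default are assumptions for illustration, not code from the paper.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """Standard QWK for integer labels in [0, n_classes - 1]."""
    n = n_classes
    observed = [[0.0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1.0

    total = float(len(y_true))
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2       # quadratic disagreement penalty
            expected = true_hist[i] * pred_hist[j] / total  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0.0:
        # Degenerate case: every label and prediction fall in one class.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

Because the penalty grows with the squared distance between grades, predicting grade 1 for a true grade 4 costs far more than predicting grade 3, which is why QWK suits ordinal labels such as KL grades.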
Details
Motivation: Automated Kellgren-Lawrence (KL) grading is difficult because structural changes are subtle, anatomical dependencies are long-range, and grade boundaries are ambiguous. Method: AGE-Net builds on ConvNeXt and integrates Spectral-Spatial Fusion (SSF), Anatomical Graph Reasoning (AGR), and Differential Refinement (DFR) modules; a Normal-Inverse-Gamma (NIG) evidential regression head models predictive uncertainty, and a pairwise ordinal ranking constraint preserves the ordering of grades. Result: On a knee KL dataset, AGE-Net reaches a QWK of 0.9017 ± 0.0045 and an MSE of 0.2349 ± 0.0028, outperforming strong CNN baselines and remaining stable across ablations; the authors also outline evaluations of uncertainty quality, robustness, and explainability. Conclusion: Through multi-scale feature fusion, anatomical prior modeling, and evidential ordinal regression, AGE-Net improves the accuracy and reliability of KL grading for clinical decision support. Abstract: Automated Kellgren-Lawrence (KL) grading from knee radiographs is challenging due to subtle structural changes, long-range anatomical dependencies, and ambiguity near grade boundaries. We propose AGE-Net, a ConvNeXt-based framework that integrates Spectral-Spatial Fusion (SSF), Anatomical Graph Reasoning (AGR), and Differential Refinement (DFR). To capture predictive uncertainty and preserve label ordinality, AGE-Net employs a Normal-Inverse-Gamma (NIG) evidential regression head and a pairwise ordinal ranking constraint. On a knee KL dataset, AGE-Net achieves a quadratic weighted kappa (QWK) of 0.9017 ± 0.0045 and a mean squared error (MSE) of 0.2349 ± 0.0028 over three random seeds, outperforming strong CNN baselines and showing consistent gains in ablation studies. We further outline evaluations of uncertainty quality, robustness, and explainability, with additional experimental figures to be included in the full manuscript.
[155] TEXTS-Diff: TEXTS-Aware Diffusion Model for Real-World Text Image Super-Resolution
Haodong He,Xin Zhan,Yancheng Bai,Rui Lan,Lei Sun,Xiangxiang Chu
Main category: cs.CV
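TL;DR: This paper proposes the Real-Texts dataset and the TEXTS-Diff model to address poor text-region recovery and limited background reconstruction in real-world text image super-resolution, improving both text legibility and overall visual quality.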
Details
Motivation: Existing text image super-resolution methods are held back by scarce text data and by datasets of isolated text samples that yield poor background reconstruction, so they cope badly with the varied degradations and text distortions of real scenes. Method: The authors build Real-Texts, a dataset of real-world text images covering Chinese and English in diverse scenarios, and propose the TEXTS-Aware Diffusion Model (TEXTS-Diff), which combines abstract text-concept understanding with concrete text-region modeling to jointly optimize background and text generation. Result: The method reaches state-of-the-art performance on multiple metrics, improves text restoration accuracy and generalization in complex scenes, and mitigates text distortion and hallucination artifacts. Conclusion: Real-Texts and TEXTS-Diff provide a new benchmark and an effective solution for real-world text image super-resolution; code, model, and data will be released. Abstract: Real-world text image super-resolution aims to restore overall visual quality and text legibility in images suffering from diverse degradations and text distortions. However, the scarcity of text image data in existing datasets results in poor performance on text regions. In addition, datasets consisting of isolated text samples limit the quality of background reconstruction. To address these limitations, we construct Real-Texts, a large-scale, high-quality dataset collected from real-world images, which covers diverse scenarios and contains natural text instances in both Chinese and English. Additionally, we propose the TEXTS-Aware Diffusion Model (TEXTS-Diff) to achieve high-quality generation in both background and textual regions. This approach leverages abstract concepts to improve the understanding of textual elements within visual scenes and concrete text regions to enhance textual details. It mitigates distortions and hallucination artifacts commonly observed in text regions, while preserving high-quality visual scene fidelity. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple evaluation metrics, exhibiting superior generalization ability and text restoration accuracy in complex scenarios. All the code, model, and dataset will be released.
[156] STARS: Shared-specific Translation and Alignment for missing-modality Remote Sensing Semantic Segmentation
Tong Wang,Xiaodong Zhang,Guanzhou Chen,Jiaqi Wang,Chenxi Liu,Xiaoliang Tan,Wenchao Guo,Xuyang Li,Xuanrui Wang,Zifan Wang
Main category: cs.CV
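TL;DR: This paper proposes STARS, a framework that addresses the drop in semantic segmentation performance caused by missing modalities in multimodal remote sensing data; an asymmetric alignment mechanism and a pixel-level semantic sampling alignment strategy mitigate feature collapse and class imbalance.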
Details
Motivation: In practice, multimodal remote sensing data (e.g., optical, SAR, DSM) often has missing modalities, which degrades conventional fusion models; existing completion methods suffer from feature collapse and overly generalized recovered features. Method: STARS has two core designs: (1) an asymmetric alignment mechanism based on bidirectional translation and stop-gradient, which prevents feature collapse and reduces hyperparameter sensitivity; and (2) a Pixel-level Semantic sampling Alignment (PSA) strategy that combines class-balanced sampling with a cross-modality semantic alignment loss to improve minority-class recognition. Result: STARS improves the robustness and accuracy of remote sensing semantic segmentation when modalities are missing, particularly for minority classes, and reduces the dependence on hyperparameter tuning. Conclusion: STARS offers an efficient, robust segmentation solution for incomplete multimodal remote sensing data and advances cross-modal alignment and feature disentanglement under missing modalities. Abstract: Multimodal remote sensing technology significantly enhances the understanding of surface semantics by integrating heterogeneous data such as optical images, Synthetic Aperture Radar (SAR), and Digital Surface Models (DSM). However, in practical applications, missing modality data (e.g., optical or DSM) is a common and severe challenge, which leads to performance decline in traditional multimodal fusion models. Existing methods for addressing missing modalities still face limitations, including feature collapse and overly generalized recovered features. To address these issues, we propose STARS (Shared-specific Translation and Alignment for missing-modality Remote Sensing), a robust semantic segmentation framework for incomplete multimodal inputs. STARS is built on two key designs. First, we introduce an asymmetric alignment mechanism with bidirectional translation and stop-gradient, which effectively prevents feature collapse and reduces sensitivity to hyperparameters. Second, we propose a Pixel-level Semantic sampling Alignment (PSA) strategy that combines class-balanced pixel sampling with cross-modality semantic alignment loss, to mitigate alignment failures caused by severe class imbalance and improve minority-class recognition.
[157] Revisiting Lightweight Low-Light Image Enhancement: From a YUV Color Space Perspective
Hailong Yan,Shice Liu,Xiangtao Zhang,Lujian Yao,Fengxiang Yang,Jinwei Chen,Bo Li
Main category: cs.CV
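TL;DR: This paper proposes a new YUV-based paradigm for lightweight low-light image enhancement. A frequency-domain analysis shows that the Y channel loses low-frequency content while the UV channels are corrupted by high-frequency noise, which motivates dual-stream attention and guided interaction modules that improve visual quality while keeping the model compact.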
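For readers less familiar with the color space this entry relies on, the sketch below converts a single RGB pixel to YUV and back, separating luminance (Y) from chrominance (U, V). It assumes the classic BT.601 analog coefficients; the paper does not state which YUV variant it uses, so treat the constants as an assumption.

```python
# BT.601 luma weights and analog YUV scale factors (assumed variant).
KR, KG, KB = 0.299, 0.587, 0.114
U_SCALE, V_SCALE = 0.492, 0.877


def rgb_to_yuv(r, g, b):
    """Convert one RGB pixel (floats in [0, 1]) to (Y, U, V)."""
    y = KR * r + KG * g + KB * b  # luminance: weighted brightness
    u = U_SCALE * (b - y)         # blue-difference chrominance
    v = V_SCALE * (r - y)         # red-difference chrominance
    return y, u, v


def yuv_to_rgb(y, u, v):
    """Invert rgb_to_yuv for one pixel."""
    r = y + v / V_SCALE
    b = y + u / U_SCALE
    g = (y - KR * r - KB * b) / KG
    return r, g, b
```

Once an image is split this way, brightness restoration can act on Y and denoising on U and V separately, which is the kind of channel-specific treatment the entry describes.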
Details
Motivation: Existing lightweight low-light image enhancement methods overlook channel-specific degradation patterns and cross-channel interactions, which limits their performance. Method: After a frequency-domain analysis in YUV color space, the authors propose a YUV paradigm with Dual-Stream Global-Local Attention (Y channel), Y-guided Local-Aware Frequency Attention (UV channels), and Guided Interaction (feature fusion). Result: The model sets a new state of the art on multiple benchmarks and delivers better visual quality with fewer parameters. Conclusion: YUV color space is better suited to lightweight low-light enhancement; modeling channel-specific degradation separately and strengthening cross-channel guided interaction balances model compactness and enhancement quality. Abstract: In the current era of mobile internet, Lightweight Low-Light Image Enhancement (L3IE) is critical for mobile devices, but it faces a persistent trade-off between visual quality and model compactness. While recent methods employ disentangling strategies to simplify lightweight architectural design, such as Retinex theory and YUV color space transformations, their performance is fundamentally limited by overlooking channel-specific degradation patterns and cross-channel interactions. To address this gap, we perform a frequency-domain analysis that confirms the superiority of the YUV color space for L3IE. We identify a key insight: the Y channel primarily loses low-frequency content, while the UV channels are corrupted by high-frequency noise. Leveraging this finding, we propose a novel YUV-based paradigm that strategically restores channels using a Dual-Stream Global-Local Attention module for the Y channel, a Y-guided Local-Aware Frequency Attention module for the UV channels, and a Guided Interaction module for final feature fusion. Extensive experiments validate that our model establishes a new state-of-the-art on multiple benchmarks, delivering superior visual quality with a significantly lower parameter count.
[158] NeRF-MIR: Towards High-Quality Restoration of Masked Images with Neural Radiance Fields
Xianliang Huang,Zhizhou Zhong,Shuhang Chen,Yi Xu,Juhong Guan,Shuigeng Zhou
Main category: cs.CV
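TL;DR: This paper proposes NeRF-MIR, a neural rendering method designed for restoring masked images. It uses the PERE strategy to optimize ray emission, the PIRE mechanism for progressive self-training restoration, and a dynamically weighted loss, improving NeRF's performance on masked image restoration.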
Details
Motivation: Existing NeRF methods perform poorly on corrupted (e.g., masked) images, which are common in natural scene captures, so their ability to reconstruct 3D scenes from corrupted images needs improvement. Method: The NeRF-MIR framework includes (1) a Patch-based Entropy for Ray Emitting (PERE) strategy that improves multi-view information fusion; (2) a Progressively Iterative REstoration (PIRE) mechanism that recovers masked regions through self-training; and (3) a dynamically weighted loss that automatically adjusts the loss weights for masked regions. The authors also construct three new datasets that simulate masked scenes. Result: Extensive experiments on real data and the constructed masked datasets show that NeRF-MIR outperforms existing methods on masked image restoration. Conclusion: NeRF-MIR extends NeRF to image restoration, demonstrates robustness under complex degradation, and suggests a direction for applying NeRF to practical visual restoration tasks. Abstract: Neural Radiance Fields (NeRF) have demonstrated remarkable performance in novel view synthesis. However, there is much room for improvement in restoring 3D scenes based on NeRF from corrupted images, which are common in natural scene captures and can significantly impact the effectiveness of NeRF. This paper introduces NeRF-MIR, a novel neural rendering approach specifically proposed for the restoration of masked images, demonstrating the potential of NeRF in this domain. Recognizing that randomly emitting rays to pixels in NeRF may not effectively learn intricate image textures, we propose a Patch-based Entropy for Ray Emitting (PERE) strategy to distribute emitted rays properly. This enables NeRF-MIR to fuse comprehensive information from images of different views. Additionally, we introduce a Progressively Iterative REstoration (PIRE) mechanism to restore the masked regions in a self-training process. Furthermore, we design a dynamically-weighted loss function that automatically recalibrates the loss weights for masked regions. As existing datasets do not support NeRF-based masked image restoration, we construct three masked datasets to simulate corrupted scenarios. Extensive experiments on real data and constructed datasets demonstrate the superiority of NeRF-MIR over its counterparts in masked image restoration.
[159] HyDeMiC: A Deep Learning-based Mineral Classifier using Hyperspectral Data
M. L. Mamud,Piyoosh Jaysaval,Frederick D Day-Lewis,M. K. Mudunuru
Main category: cs.CV
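TL;DR: This paper proposes HyDeMiC, a convolutional neural network for hyperspectral mineral classification trained on data from the USGS mineral spectral library. It shows strong, stable classification performance across noise levels and is aimed at mineral identification under realistic, noisy field conditions.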
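HyDeMiC is evaluated with the Matthews Correlation Coefficient (MCC). As a reading aid, here is a minimal sketch of the binary MCC applied one-vs-rest to a single mineral label; the paper may well use a multi-class formulation, so the one-vs-rest simplification and the function name are assumptions.

```python
from math import sqrt


def matthews_corrcoef(y_true, y_pred, positive):
    """Binary MCC, treating `positive` as the target class and all else as negative."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred != positive:
            tn += 1
        elif pred == positive:
            fp += 1
        else:
            fn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

MCC ranges from -1 (total disagreement) through 0 (chance level) to 1 (perfect prediction), and unlike plain accuracy it stays informative when one mineral covers only a small share of the pixels.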
Details
Motivation: Traditional hyperspectral mineral classification methods (e.g., discriminant analysis, logistic regression, support vector machines) handle environmental noise, sensor limitations, and the computational complexity of high-dimensional data poorly, so a more robust deep learning approach is needed. Method: The authors build a CNN model named HyDeMiC. Laboratory-measured spectra of 115 minerals from the USGS spectral library are convolved with a sensor response function to generate the training data; three copper-bearing minerals (cuprite, malachite, chalcopyrite) serve as case studies; the model is evaluated on synthetic 2D hyperspectral datasets with noise levels from 1% to 10%, using the Matthews Correlation Coefficient (MCC) as the metric. Result: HyDeMiC achieves near-perfect classification (MCC = 1.00) on clean and low-noise data and stays robust under moderate noise (e.g., 5%), indicating practical potential in realistic noisy settings. Conclusion: HyDeMiC is an efficient, robust deep learning model for hyperspectral mineral classification that is positioned as an improvement over traditional methods and is feasible for practical mineral exploration. Abstract: Hyperspectral imaging (HSI) has emerged as a powerful remote sensing tool for mineral exploration, capitalizing on unique spectral signatures of minerals. However, traditional classification methods such as discriminant analysis, logistic regression, and support vector machines often struggle with environmental noise in data, sensor limitations, and the computational complexity of analyzing high-dimensional HSI data. This study presents HyDeMiC (Hyperspectral Deep Learning-based Mineral Classifier), a convolutional neural network (CNN) model designed for robust mineral classification under noisy data. To train HyDeMiC, laboratory-measured hyperspectral data for 115 minerals spanning various mineral groups were used from the United States Geological Survey (USGS) library. The training dataset was generated by convolving reference mineral spectra with an HSI sensor response function. These datasets contained three copper-bearing minerals, Cuprite, Malachite, and Chalcopyrite, used as case studies for performance demonstration. The trained CNN model was evaluated on several synthetic 2D hyperspectral datasets with noise levels of 1%, 2%, 5%, and 10%. Our noisy data analysis aims to replicate realistic field conditions. HyDeMiC's performance was assessed using the Matthews Correlation Coefficient (MCC), providing a comprehensive measure across different noise regimes. Results demonstrate that HyDeMiC achieved near-perfect classification accuracy (MCC = 1.00) on clean and low-noise datasets and maintained strong performance under moderate noise conditions.
These findings emphasize HyDeMiC's robustness in the presence of moderate noise, highlighting its potential for real-world applications in hyperspectral imaging, where noise is often a significant challenge.
[160] PocketGS: On-Device Training of 3D Gaussian Splatting for High Perceptual Modeling
Wenzhi Guo,Guangchi Fang,Shu Yang,Bing Wang
Main category: cs.CV
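TL;DR: This paper proposes PocketGS, a 3D Gaussian Splatting (3DGS) modeling method designed for mobile devices that delivers high-quality, real-time 3D scene reconstruction under very short training times and limited memory.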
Details
Motivation: Existing 3DGS methods assume resource-rich training and cannot fit the strict limits of mobile devices, such as minute-scale training budgets and capped peak memory. Method: The authors propose three co-designed operators: G builds a geometry-faithful point-cloud prior; I injects local surface statistics to initialize anisotropic Gaussians; and T caches intermediate results and uses index-mapped gradient scattering to stabilize backpropagation on mobile hardware. Result: On mobile devices, PocketGS achieves reconstruction quality that surpasses a mainstream workstation-class 3DGS baseline and supports an end-to-end capture-to-rendering workflow. Conclusion: PocketGS resolves the tension between efficiency, memory, and fidelity in on-device 3DGS training and moves lightweight, high-fidelity 3D modeling closer to practical use. Abstract: Efficient and high-fidelity 3D scene modeling is a long-standing pursuit in computer graphics. While recent 3D Gaussian Splatting (3DGS) methods achieve impressive real-time modeling performance, they rely on resource-unconstrained training assumptions that fail on mobile devices, which are limited by minute-scale training budgets and hardware-available peak memory. We present PocketGS, a mobile scene modeling paradigm that enables on-device 3DGS training under these tightly coupled constraints while preserving high perceptual fidelity. Our method resolves the fundamental contradictions of standard 3DGS through three co-designed operators: G builds geometry-faithful point-cloud priors; I injects local surface statistics to seed anisotropic Gaussians, thereby reducing early conditioning gaps; and T unrolls alpha compositing with cached intermediates and index-mapped gradient scattering for stable mobile backpropagation. Collectively, these operators satisfy the competing requirements of training efficiency, memory compactness, and modeling fidelity. Extensive experiments demonstrate that PocketGS is able to outperform the powerful mainstream workstation 3DGS baseline to deliver high-quality reconstructions, enabling a fully on-device, practical capture-to-rendering workflow.
[161] UCAD: Uncertainty-guided Contour-aware Displacement for semi-supervised medical image segmentation
Chengbo Ding,Fenghe Tang,Shaohua Kevin Zhou
Main category: cs.CV
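TL;DR: This paper proposes UCAD, a framework that uses superpixels to generate anatomically coherent regions and an uncertainty-guided displacement strategy to improve boundary preservation and semantic consistency in semi-supervised medical image segmentation.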
Details
Motivation: Displacement strategies in existing semi-supervised segmentation act only on rectangular regions and ignore anatomical structure, which distorts boundaries and breaks semantic consistency. Method: UCAD, an Uncertainty-Guided Contour-Aware Displacement framework, uses superpixels to generate regions aligned with anatomical boundaries and an uncertainty-guided selection mechanism to choose which regions to displace; a dynamic uncertainty-weighted consistency loss stabilizes training and regularizes unlabeled regions. Result: With limited annotation, UCAD consistently outperforms current state-of-the-art semi-supervised segmentation methods and improves segmentation accuracy. Conclusion: By combining anatomical awareness with uncertainty modeling, UCAD improves boundary fidelity and semantic consistency in semi-supervised medical image segmentation and offers a practical option for low-annotation clinical settings. Abstract: Existing displacement strategies in semi-supervised segmentation only operate on rectangular regions, ignoring anatomical structures and resulting in boundary distortions and semantic inconsistency. To address these issues, we propose UCAD, an Uncertainty-Guided Contour-Aware Displacement framework for semi-supervised medical image segmentation that preserves contour-aware semantics while enhancing consistency learning. Our UCAD leverages superpixels to generate anatomically coherent regions aligned with anatomy boundaries, and an uncertainty-guided selection mechanism to selectively displace challenging regions for better consistency learning. We further propose a dynamic uncertainty-weighted consistency loss, which adaptively stabilizes training and effectively regularizes the model on unlabeled regions. Extensive experiments demonstrate that UCAD consistently outperforms state-of-the-art semi-supervised segmentation methods, achieving superior segmentation accuracy under limited annotation. The code is available at: https://github.com/dcb937/UCAD.
[162] Physical Prompt Injection Attacks on Large Vision-Language Models
Chen Ling,Kai Hu,Hangcheng Liu,Xingshuo Han,Tianwei Zhang,Changhai Ou
Main category: cs.CV
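TL;DR: This paper proposes the Physical Prompt Injection Attack (PPIA), a black-box, query-agnostic attack that manipulates large vision-language models (LVLMs) by embedding malicious text instructions on physical objects. It needs no access to the model or its inputs and reaches attack success rates of up to 98% across a range of real-world scenarios.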
Details
Motivation: Most existing prompt injection attacks on LVLMs require access to the input channel or knowledge of the user's query, which is often infeasible in real deployments; the authors therefore study a more practical and stealthy attack paradigm. Method: PPIA selects visual prompts offline for high recognizability and semantic effectiveness, then uses an environment-aware placement strategy guided by spatiotemporal attention to embed malicious text instructions on physical objects, influencing LVLM behavior through visual observation alone. Result: Across 10 mainstream LVLMs in simulated and real environments, covering visual question answering, planning, and navigation, the attack success rate reaches up to 98% and remains robust to changes in distance, viewpoint, and illumination. Conclusion: PPIA is presented as the first truly black-box, query-agnostic prompt injection attack in the physical domain; it exposes a new security threat to LVLMs in open physical environments and provides a warning and a baseline for future defenses. Abstract: Large Vision-Language Models (LVLMs) are increasingly deployed in real-world intelligent systems for perception and reasoning in open physical environments. While LVLMs are known to be vulnerable to prompt injection attacks, existing methods either require access to input channels or depend on knowledge of user queries, assumptions that rarely hold in practical deployments. We propose the first Physical Prompt Injection Attack (PPIA), a black-box, query-agnostic attack that embeds malicious typographic instructions into physical objects perceivable by the LVLM. PPIA requires no access to the model, its inputs, or internal pipeline, and operates solely through visual observation. It combines offline selection of highly recognizable and semantically effective visual prompts with strategic environment-aware placement guided by spatiotemporal attention, ensuring that the injected prompts are both perceivable and influential on model behavior. We evaluate PPIA across 10 state-of-the-art LVLMs in both simulated and real-world settings on tasks including visual question answering, planning, and navigation. PPIA achieves attack success rates up to 98%, with strong robustness under varying physical conditions such as distance, viewpoint, and illumination. Our code is publicly available at https://github.com/2023cghacker/Physical-Prompt-Injection-Attack.
[163] ONRW: Optimizing inversion noise for high-quality and robust watermark
Xuan Ding,Xiu Yan,Chuanlong Xie,Yao Zhu
Main category: cs.CV
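TL;DR: This paper proposes a high-quality, robust watermarking framework based on diffusion models. It obtains inversion noise through null-text optimization, optimizes that noise in latent space, and generates the watermarked image through iterative denoising; self-attention constraints and a pseudo-mask strategy preserve semantic consistency, and experiments show it outperforms existing methods under many image distortions.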
Details
Motivation: Existing deep learning watermarking methods preserve image quality but are not robust enough when images are corrupted during transmission, which limits their practical value. Method: The framework first converts the original image into inversion noise through null-text optimization, then optimizes that noise in latent space (with self-attention constraints and a pseudo-mask strategy to prevent semantic distortion), and finally produces a high-quality, robust watermarked image through the diffusion model's iterative denoising. Result: On the COCO dataset across 12 image transformations, the method is on average 10% more robust than the Stable Signature method; the code is open source. Conclusion: The diffusion-based watermarking framework combines high visual quality with strong robustness, and the self-attention constraints and pseudo-mask preserve semantic consistency, offering a direction for practical image watermarking. Abstract: Watermarking methods have always been effective means of protecting intellectual property, yet they face significant challenges. Although existing deep learning-based watermarking systems can hide watermarks in images with minimal impact on image quality, they often lack robustness when encountering image corruptions during transmission, which undermines their practical application value. To this end, we propose a high-quality and robust watermark framework based on the diffusion model. Our method first converts the clean image into inversion noise through a null-text optimization process, and after optimizing the inversion noise in the latent space, it produces a high-quality watermarked image through an iterative denoising process of the diffusion model. The iterative denoising process serves as a powerful purification mechanism, ensuring both the visual quality of the watermarked image and enhancing the robustness of the watermark against various corruptions. To prevent the optimization of inversion noise from distorting the original semantics of the image, we introduce self-attention constraints and pseudo-mask strategies. Extensive experimental results demonstrate the superior performance of our method against various image corruptions. In particular, our method outperforms the Stable Signature method by an average of 10% across 12 different image transformations on COCO datasets. Our code is available at https://github.com/920927/ONRW.
[164] SMV-EAR: Bring Spatiotemporal Multi-View Representation Learning into Efficient Event-Based Action Recognition
Rui Fan,Weidong Hao
Main category: cs.CV
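TL;DR: This paper proposes a spatiotemporal multi-view representation learning framework for event-camera action recognition (EAR). With translation-invariant dense event conversion, a dual-branch dynamic fusion architecture, and a bio-inspired temporal warping augmentation, it raises accuracy on several datasets while reducing parameters and computation.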
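To picture what "multi-view" means for event data, the sketch below bins a list of `(x, y, t)` events into two spatiotemporal views, one collapsing the width axis (H-T) and one collapsing the height axis (W-T). This is only the plain binning projection that the abstract attributes to earlier work, written under assumed names and an assumed event format; the paper's translation-invariant dense conversion is not specified in the abstract and is not reproduced here.

```python
def project_events(events, height, width, t_bins, t_max):
    """Count events into an H-T view and a W-T view.

    events: iterable of (x, y, t) with 0 <= x < width, 0 <= y < height,
            and 0 <= t <= t_max (assumed format, polarity ignored).
    """
    ht_view = [[0] * t_bins for _ in range(height)]  # projection along W
    wt_view = [[0] * t_bins for _ in range(width)]   # projection along H
    for x, y, t in events:
        # Map the timestamp to a temporal bin, clamping t == t_max into the last bin.
        k = min(int(t / t_max * t_bins), t_bins - 1)
        ht_view[y][k] += 1
        wt_view[x][k] += 1
    return ht_view, wt_view
```

Each view is a small 2D image in which motion appears as a slanted streak over time, so ordinary 2D backbones can process it; a fusion stage then combines the features from the views.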
Details
Motivation: Existing methods from event-based object recognition (EOR) are limited for action recognition by translation-variant spatial binning representations and naive early-concatenation fusion, so they model the temporal dynamics of actions poorly. Method: The authors (i) design a translation-invariant dense event conversion that produces a spatiotemporal multi-view representation; (ii) build a dual-branch dynamic fusion architecture that models sample-wise complementarity between motion features from different views; and (iii) introduce a bio-inspired temporal warping augmentation that mimics the speed variation of real human actions. Result: On HARDVS, DailyDVS-200, and THU-EACT-50-CHL, Top-1 accuracy improves by +7.0%, +10.7%, and +10.2% respectively, with 30.1% fewer parameters and 35.7% less computation. Conclusion: The framework offers a new, efficient, and well-generalizing paradigm for event-camera action recognition. Abstract: Event camera action recognition (EAR) offers compelling privacy-protecting and efficiency advantages, where temporal motion dynamics are of great importance. Existing spatiotemporal multi-view representation learning (SMVRL) methods for event-based object recognition (EOR) offer promising solutions by projecting H-W-T events along the spatial axes H and W, yet are limited by their translation-variant spatial binning representation and naive early concatenation fusion architecture. This paper reexamines the key SMVRL design stages for EAR and proposes: (i) a principled spatiotemporal multi-view representation through translation-invariant dense conversion of sparse events, (ii) a dual-branch, dynamic fusion architecture that models sample-wise complementarity between motion features from different views, and (iii) a bio-inspired temporal warping augmentation that mimics speed variability of real-world human actions. On three challenging EAR datasets, HARDVS, DailyDVS-200, and THU-EACT-50-CHL, we show +7.0%, +10.7%, and +10.2% Top-1 accuracy gains over the existing SMVRL EOR method with 30.1% fewer parameters and 35.7% lower computation, establishing our framework as a novel and powerful EAR paradigm.
[165] ReLE: A Scalable System and Structured Benchmark for Diagnosing Capability Anisotropy in Chinese LLMs
Rui Fang,Jian Li,Wei Chen,Bin Hu,Ying-Cong Chen,Xin Tang,Liang Diao
Main category: cs.CV
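TL;DR: This paper proposes ReLE, a system for diagnosing the anisotropy of large language model capabilities across domains. A symbolic-grounded hybrid scoring mechanism and a dynamic variance-aware scheduler enable efficient, robust live evaluation.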
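The variance-aware scheduler in this entry is described as building on Neyman allocation. As background, here is a minimal sketch of the textbook rule, which splits a fixed sample budget across strata in proportion to stratum size times standard deviation. The stratum names, the `(size, std_dev)` tuple layout, and the function name are illustrative assumptions, and the paper's noise correction is not modeled.

```python
def neyman_allocation(strata, budget):
    """Split `budget` samples across strata proportionally to N_h * sigma_h.

    strata: dict mapping a stratum name to (population_size, std_dev).
    Returns fractional allocations; rounding to integers is left to the caller.
    """
    weights = {name: size * std for name, (size, std) in strata.items()}
    total = sum(weights.values())
    if total == 0:
        # No measured variability anywhere: fall back to an even split.
        return {name: budget / len(strata) for name in strata}
    return {name: budget * w / total for name, w in weights.items()}
```

Under this rule, cells of an evaluation matrix where model scores vary widely receive more samples, while stable cells receive fewer, which is the intuition behind spending less compute than a full pass.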
Details
Motivation: Evaluation of Chinese LLMs faces benchmark saturation and high compute costs, and static leaderboards hide the structural trade-offs between a model's capability dimensions. Method: ReLE includes a symbolic-grounded hybrid scoring mechanism, which removes embedding-based false positives in reasoning tasks, and a dynamic variance-aware scheduler based on Neyman allocation with noise correction; it supports large-scale evaluation of 304 models on a Domain × Capability orthogonal matrix. Result: Model rankings prove highly sensitive to the weighting scheme, with a Rank Stability Amplitude of 11.4 versus about 5.0 for traditional benchmarks, indicating that current models are highly specialized rather than generally superior; compute cost drops by 70% while ranking correlation stays at ρ = 0.96. Conclusion: ReLE is not a replacement for static benchmarks but a high-frequency diagnostic monitor for a fast-evolving model ecosystem. Abstract: Large Language Models (LLMs) have achieved rapid progress in Chinese language understanding, yet accurately evaluating their capabilities remains challenged by benchmark saturation and prohibitive computational costs. While static leaderboards provide snapshot rankings, they often mask the structural trade-offs between capabilities. In this work, we present ReLE (Robust Efficient Live Evaluation), a scalable system designed to diagnose Capability Anisotropy, the non-uniformity of model performance across domains. Using ReLE, we evaluate 304 models (189 commercial, 115 open-source) across a Domain × Capability orthogonal matrix comprising 207,843 samples. We introduce two methodological contributions to address current evaluation pitfalls: (1) A Symbolic-Grounded Hybrid Scoring Mechanism that eliminates embedding-based false positives in reasoning tasks; (2) A Dynamic Variance-Aware Scheduler based on Neyman allocation with noise correction, which reduces compute costs by 70% compared to full-pass evaluations while maintaining a ranking correlation of ρ = 0.96. Our analysis reveals that aggregate rankings are highly sensitive to weighting schemes: models exhibit a Rank Stability Amplitude (RSA) of 11.4 in ReLE versus ~5.0 in traditional benchmarks, confirming that modern models are highly specialized rather than generally superior. We position ReLE not as a replacement for comprehensive static benchmarks, but as a high-frequency diagnostic monitor for the evolving model landscape.
[166] HAAF: Hierarchical Adaptation and Alignment of Foundation Models for Few-Shot Pathology Anomaly Detection
Chunze Yang,Wenjie Zhao,Yue Tang,Junbo Lu,Jiusong Ge,Qidong Liu,Zeyu Gao,Chen Li
Main category: cs.CV
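TL;DR: This paper proposes HAAF, a framework that uses a Cross-Level Scaled Alignment (CLSA) mechanism and a dual-branch inference strategy to address the fine-grained anomaly detection problem that granularity mismatch causes for vision-language models in precision pathology, improving performance in low-resource settings.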
Details
Motivation: Precision pathology depends on detecting fine-grained morphological abnormalities within specific Regions of Interest (ROIs), but existing vision-language models capture such local texture cues poorly because of granularity mismatch and because they process modalities independently. Method: The Hierarchical Adaptation and Alignment Framework (HAAF) centers on a Cross-Level Scaled Alignment (CLSA) mechanism: visual features first enrich the text prompts to produce content-adaptive descriptors, and those descriptors then spatially guide the visual encoder to focus on anomalies. A dual-branch inference strategy fuses semantic scores with geometric prototypes. Result: HAAF outperforms existing state-of-the-art methods on four benchmarks and adapts effectively to domain-specific backbones (e.g., CONCH) in low-resource settings. Conclusion: HAAF bridges the granularity gap of vision-language models in precision pathology and improves the accuracy and robustness of fine-grained anomaly detection, especially where annotated data is scarce. Abstract: Precision pathology relies on detecting fine-grained morphological abnormalities within specific Regions of Interest (ROIs), as these local, texture-rich cues, rather than global slide contexts, drive expert diagnostic reasoning. While Vision-Language (V-L) models promise data efficiency by leveraging semantic priors, adapting them faces a critical Granularity Mismatch, where generic representations fail to resolve such subtle defects. Current adaptation methods often treat modalities as independent streams, failing to ground semantic prompts in ROI-specific visual contexts. To bridge this gap, we propose the Hierarchical Adaptation and Alignment Framework (HAAF). At its core is a novel Cross-Level Scaled Alignment (CLSA) mechanism that enforces a sequential calibration order: visual features first inject context into text prompts to generate content-adaptive descriptors, which then spatially guide the visual encoder to spotlight anomalies. Additionally, a dual-branch inference strategy integrates semantic scores with geometric prototypes to ensure stability in few-shot settings. Experiments on four benchmarks show HAAF significantly outperforms state-of-the-art methods and effectively scales with domain-specific backbones (e.g., CONCH) in low-resource scenarios.
[167] Source-Free Domain Adaptation by Optimizing Batch-Wise Cosine Similarity
Harsharaj Pathak,Vineeth N Balasubramanian
Main category: cs.CV
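TL;DR: This paper proposes a new method for source-free domain adaptation (SFDA) based on neighborhood signatures. A single loss term optimizes the similarity and dissimilarity of predictions on target-domain samples, making clusters more informative and suppressing the influence of noisy neighbors; the method performs strongly on benchmarks such as VisDA.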
Details
Motivation: 现有SFDA方法依赖邻域一致性,但易受误导性邻域信息影响,导致错误;本文旨在从学习更具信息量的簇和缓解噪声邻居影响的角度改进该范式。 Method: 引入邻域签名概念,设计单一损失项来优化目标域样本预测结果之间的相似性和差异性,从而实现更鲁棒的域自适应。 Result: 在具有挑战性的VisDA数据集上超越现有方法,并在其他基准数据集上也取得有竞争力的结果。 Conclusion: 仅用一个针对目标域预测相似性与差异性优化的损失项,即可有效实现无源域自适应,且性能优于主流方法。 Abstract: Source-Free Domain Adaptation (SFDA) is an emerging area of research that aims to adapt a model trained on a labeled source domain to an unlabeled target domain without accessing the source data. Most of the successful methods in this area rely on the concept of neighborhood consistency but are prone to errors due to misleading neighborhood information. In this paper, we explore this approach from the point of view of learning more informative clusters and mitigating the effect of noisy neighbors using a concept called neighborhood signature, and demonstrate that adaptation can be achieved using just a single loss term tailored to optimize the similarity and dissimilarity of predictions of samples in the target domain. In particular, our proposed method outperforms existing methods in the challenging VisDA dataset while also yielding competitive results on other benchmark datasets.[168] Cloud-Enabled IoT System for Real-Time Environmental Monitoring and Remote Device Control Using Firebase
Abdul Hasib,A. S. M. Ahsanul Sarkar Akib
Main category: cs.CV
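TL;DR: This paper presents a low-cost cloud IoT system built on Firebase Realtime Database. An ESP32 collects temperature, humidity, and distance data and remotely controls LEDs, providing reliable, low-latency, easy-to-deploy environmental monitoring and device control.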
Details
Motivation: 传统监控系统存在实时数据访问难、远程控制弱、云集成差等问题,亟需一种低成本、易部署、高可用的云物联网解决方案。 Method: 采用ESP32微控制器连接DHT22和HC-SR04传感器,通过Firebase Realtime Database实现传感器数据上传与LED远程控制,构建端-云协同的实时同步系统。 Result: 实验表明系统数据传输成功率达99.2%,控制延迟低于1.5秒,支持多终端同步访问与历史数据持久化,总成本仅32.50美元。 Conclusion: 该系统提供了一个可扩展、低门槛的云物联网框架,适用于智能家居、工业监测等场景,验证了Firebase在资源受限环境下支撑先进IoT应用的有效性。 Abstract: The proliferation of Internet of Things (IoT) devices has created unprecedented opportunities for remote monitoring and control applications across various domains. Traditional monitoring systems often suffer from limitations in real-time data accessibility, remote controllability, and cloud integration. This paper presents a cloud-enabled IoT system that leverages Google's Firebase Realtime Database for synchronized environmental monitoring and device control. The system utilizes an ESP32 microcontroller to interface with a DHT22 temperature/humidity sensor and an HC-SR04 ultrasonic distance sensor, while enabling remote control of two LED indicators through a cloud-based interface. Real-time sensor data is transmitted to Firebase, providing a synchronized platform accessible from multiple devices simultaneously. Experimental results demonstrate reliable data transmission with 99.2\% success rate, real-time control latency under 1.5 seconds, and persistent data storage for historical analysis. The system architecture offers a scalable framework for various IoT applications, from smart home automation to industrial monitoring, with a total implementation cost of \$32.50. The integration of Firebase provides robust cloud capabilities without requiring complex server infrastructure, making advanced IoT applications accessible to developers and researchers with limited resources.[169] CoT-Seg: Rethinking Segmentation with Chain-of-Thought Reasoning and Self-Correction
Shiu-hong Kao,Chak Ho Huang,Huaiqian Liu,Yu-Wing Tai,Chi-Keung Tang
Main category: cs.CV
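TL;DR: This paper proposes CoT-Seg, a training-free reasoning segmentation framework that combines chain-of-thought reasoning with self-correction. It uses a pre-trained multimodal LLM (e.g., GPT-4o) to decompose queries, extract image semantics, and generate and iteratively refine segmentation masks, significantly improving robustness and reliability in complex and out-of-domain cases.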
Details
Motivation: Existing reasoning segmentation methods perform poorly on complex queries and cross-domain images. Inspired by how humans think step by step and check and correct themselves, the authors seek a system that can reason in stages, evaluate its own output, and revise it. Method: CoT-Seg is a training-free framework that (1) uses a pre-trained MLLM to decompose the query into meta-instructions and extract fine-grained image semantics; (2) generates an initial segmentation; (3) self-evaluates against the original query and the reasoning trace, identifies mismatches, and iteratively refines the mask; (4) supports retrieval augmentation to bring in external knowledge. Result: Validated on the newly built hard-case dataset ReasonSeg-Hard, the framework significantly improves segmentation accuracy and robustness in complex, ambiguous, and error-prone cases, and demonstrates the strong potential of the chain-of-thought plus self-correction paradigm for joint vision-language segmentation. Conclusion: CoT-Seg shows that high-performance reasoning segmentation is achievable without fine-tuning, through structured reasoning and closed-loop self-correction, offering a more reliable, interpretable, and extensible paradigm for vision-language tasks. Abstract: Existing work on reasoning segmentation often falls short in complex cases, particularly when addressing complicated queries and out-of-domain images. Inspired by chain-of-thought reasoning, where harder problems require longer thinking steps/time, this paper aims to explore a system that can think step-by-step, look up information if needed, generate results, self-evaluate its own results, and refine the results, in the same way humans approach harder questions. We introduce CoT-Seg, a training-free framework that rethinks reasoning segmentation by combining chain-of-thought reasoning with self-correction. Instead of fine-tuning, CoT-Seg leverages the inherent reasoning ability of pre-trained MLLMs (GPT-4o) to decompose queries into meta-instructions, extract fine-grained semantics from images, and identify target objects even under implicit or complex prompts. Moreover, CoT-Seg incorporates a self-correction stage: the model evaluates its own segmentation against the original query and reasoning trace, identifies mismatches, and iteratively refines the mask. This tight integration of reasoning and correction significantly improves reliability and robustness, especially in ambiguous or error-prone cases. Furthermore, our CoT-Seg framework allows easy incorporation of retrieval-augmented reasoning, enabling the system to access external knowledge when the input lacks sufficient information. To showcase CoT-Seg's ability to handle very challenging cases, we introduce a new dataset, ReasonSeg-Hard.
Our results highlight that combining chain-of-thought reasoning with self-correction offers a powerful paradigm for segmentation driven by vision-language integration.
[170] Coronary Artery Segmentation and Vessel-Type Classification in X-Ray Angiography
Mehdi Yousefzadeh,Siavash Shirzadeh Barough,Ashkan Fakharifar,Yashar Tayyarazad,Narges Eghbali,Mohaddeseh Mozaffari,Hoda Taeb,Negar Sadat Rafiee Tabatabaee,Parsa Esfahanian,Ghazaleh Sadeghi Gohar,Amineh Safavirad,Saeideh Mazloomzadeh,Ehsan khalilipur,Armin Elahifar,Majid Maleki
Main category: cs.CV
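TL;DR: This paper proposes a segmentation method for X-ray coronary angiography (XCA) images that combines classical vessel-enhancement filters with deep learning. Per-image parameter prediction (SVR) tunes the classical filters, while networks such as FPN and merged coronary+catheter labels improve segmentation accuracy and cross-center generalization.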
Details
Motivation: XCA图像中低对比度、运动伪影、重叠、导管干扰等问题导致血管分割困难,影响临床定量分析和解剖定位;现有方法在跨中心数据上泛化性差,亟需更鲁棒的分割与血管类型识别方案。 Method: 从670段XCA序列中选取最佳造影峰值帧,进行联合超分辨率与增强;对比Meijering、Frangi、Sato等经典血管滤波器在Oracle调参、全局均值设定及SVR逐帧预测三种策略下的性能;评估U-Net、FPN、Swin Transformer等神经网络在冠脉单独标注与冠脉+导管联合标注下的表现;第二阶段采用分类/分割联合方式实现LAD/LCX/RCA三支血管的身份识别;外部验证使用公开DCA1数据集,并测试轻量微调效果。 Result: SVR逐帧调参使Frangi滤波Dice达0.759(优于全局均值0.741);FPN在冠脉单独监督下Dice为0.914±0.007,联合导管标注后提升至0.931±0.006;DCA1外部测试中Dice分别降至0.798和0.814,经轻量微调后恢复至约0.882;血管类型识别准确率分别为RCA 98.5%(Dice 0.844)、LAD 95.4%(0.786)、LCX 96.2%(0.794)。 Conclusion: 学习式逐帧参数优化可显著增强传统血管分割流程;高分辨率FPN模型配合冠脉与导管联合监督能提升模型稳定性与外部泛化能力,仅需少量微调即可适应新中心数据,为临床实用化提供可行路径。 Abstract: X-ray coronary angiography (XCA) is the clinical reference standard for assessing coronary artery disease, yet quantitative analysis is limited by the difficulty of robust vessel segmentation in routine data. Low contrast, motion, foreshortening, overlap, and catheter confounding degrade segmentation and contribute to domain shift across centers. Reliable segmentation, together with vessel-type labeling, enables vessel-specific coronary analytics and downstream measurements that depend on anatomical localization. From 670 cine sequences (407 subjects), we select a best frame near peak opacification using a low-intensity histogram criterion and apply joint super-resolution and enhancement. We benchmark classical Meijering, Frangi, and Sato vesselness filters under per-image oracle tuning, a single global mean setting, and per-image parameter prediction via Support Vector Regression (SVR). Neural baselines include U-Net, FPN, and a Swin Transformer, trained with coronary-only and merged coronary+catheter supervision. A second stage assigns vessel identity (LAD, LCX, RCA). External evaluation uses the public DCA1 cohort. SVR per-image tuning improves Dice over global means for all classical filters (e.g., Frangi: 0.759 vs. 0.741). 
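The Dice scores quoted throughout this abstract measure the overlap between a predicted mask and a reference mask. The following is a minimal sketch of that metric, not the authors' evaluation code; it assumes binary masks flattened to 0/1 sequences, and the function name `dice_score` and the empty-mask convention are illustrative choices.

```python
def dice_score(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    # Pixels marked as vessel in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total vessel pixels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(round(dice_score(pred, target), 3))
```

Here two of the three predicted vessel pixels coincide with the reference, so the score is 2·2 / (3 + 3) ≈ 0.667.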
Among deep models, FPN attains 0.914+/-0.007 Dice (coronary-only), and merged coronary+catheter labels further improve to 0.931+/-0.006. On DCA1 as a strict external test, Dice drops to 0.798 (coronary-only) and 0.814 (merged), while light in-domain fine-tuning recovers to 0.881+/-0.014 and 0.882+/-0.015. Vessel-type labeling achieves 98.5% accuracy (Dice 0.844) for RCA, 95.4% (0.786) for LAD, and 96.2% (0.794) for LCX. Learned per-image tuning strengthens classical pipelines, while high-resolution FPN models and merged-label supervision improve stability and external transfer with modest adaptation.
[171] ReflexSplit: Single Image Reflection Separation via Layer Fusion-Separation
Chia-Ming Lee,Yu-Fan Lin,Jing-Hui Jung,Yu-Jou Hsiao,Chih-Chung Hsu,Yu-Lun Liu
Main category: cs.CV
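TL;DR: This paper proposes the ReflexSplit framework, which uses cross-scale gated fusion, layer fusion-separation blocks, and a curriculum training strategy to address confusion between the transmission and reflection layers in single image reflection separation, significantly improving separation performance and generalization.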
Details
Motivation: 现有单图像反射分离方法在非线性混合情况下存在传输层与反射层混淆问题,尤其在深层解码器中,原因在于隐式融合机制和多尺度协调不足。 Method: 提出ReflexSplit双流框架,包含三个创新:(1) 跨尺度门控融合(CrGF)自适应聚合多层级语义先验、纹理细节和解码器上下文;(2) 层融合-分离模块(LFSB)交替执行融合与差异分离,并引入跨流相减实现注意力消除;(3) 课程训练策略,通过深度依赖初始化和轮次渐进预热增强差异分离。 Result: 在合成与真实数据集上达到当前最优性能,具有更优的感知质量和鲁棒泛化能力。 Conclusion: ReflexSplit通过结构化双流设计与协同优化机制,有效缓解了反射分离中的层间混淆问题,为复杂混合建模提供了新思路。 Abstract: Single Image Reflection Separation (SIRS) disentangles mixed images into transmission and reflection layers. Existing methods suffer from transmission-reflection confusion under nonlinear mixing, particularly in deep decoder layers, due to implicit fusion mechanisms and inadequate multi-scale coordination. We propose ReflexSplit, a dual-stream framework with three key innovations. (1) Cross-scale Gated Fusion (CrGF) adaptively aggregates semantic priors, texture details, and decoder context across hierarchical depths, stabilizing gradient flow and maintaining feature consistency. (2) Layer Fusion-Separation Blocks (LFSB) alternate between fusion for shared structure extraction and differential separation for layer-specific disentanglement. Inspired by Differential Transformer, we extend attention cancellation to dual-stream separation via cross-stream subtraction. (3) Curriculum training progressively strengthens differential separation through depth-dependent initialization and epoch-wise warmup. Extensive experiments on synthetic and real-world benchmarks demonstrate state-of-the-art performance with superior perceptual quality and robust generalization. Our code is available at https://github.com/wuw2135/ReflexSplit.[172] PhaSR: Generalized Image Shadow Removal with Physically Aligned Priors
Chia-Ming Lee,Yu-Fan Lin,Yu-Jou Hsiao,Jing-Hui Jung,Yu-Lun Liu,Chih-Chung Hsu
Main category: cs.CV
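TL;DR: This paper proposes PhaSR, which achieves robust shadow removal under complex lighting through dual-level physical prior alignment (illumination normalization and cross-modal geometric-semantic rectification).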
Details
Motivation: 在多样光照条件下进行阴影去除需解耦光照与固有反射率,而现有方法常因物理先验不匹配导致性能下降。 Method: 提出双级先验对齐:1)物理对齐归一化(PAN),结合灰度世界归一化、对数域Retinex分解与动态范围重组;2)几何-语义校正注意力(GSRA),融合深度几何与DINO-v2语义嵌入以解决模态冲突。 Result: 实验表明PhaSR在阴影去除任务中性能领先,计算复杂度更低,并能泛化至多光源环境,克服传统方法在此类场景下的失效问题。 Conclusion: PhaSR通过显式建模物理一致性与跨模态对齐,在单光与多源环境光照下均实现鲁棒阴影去除,为无监督/弱监督阴影去除提供了新范式。 Abstract: Shadow removal under diverse lighting conditions requires disentangling illumination from intrinsic reflectance, a challenge compounded when physical priors are not properly aligned. We propose PhaSR (Physically Aligned Shadow Removal), addressing this through dual-level prior alignment to enable robust performance from single-light shadows to multi-source ambient lighting. First, Physically Aligned Normalization (PAN) performs closed-form illumination correction via Gray-world normalization, log-domain Retinex decomposition, and dynamic range recombination, suppressing chromatic bias. Second, Geometric-Semantic Rectification Attention (GSRA) extends differential attention to cross-modal alignment, harmonizing depth-derived geometry with DINO-v2 semantic embeddings to resolve modal conflicts under varying illumination. Experiments show competitive performance in shadow removal with lower complexity and generalization to ambient lighting where traditional methods fail under multi-source illumination. Our source code is available at https://github.com/ming053l/PhaSR.[173] BMDS-Net: A Bayesian Multi-Modal Deep Supervision Network for Robust Brain Tumor Segmentation
Yan Zhou,Zhen Huang,Yingqiu Li,Yue Ouyang,Suncheng Xiang,Zehua Wang
Main category: cs.CV
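TL;DR: This paper proposes BMDS-Net, a brain tumor segmentation framework designed for clinical robustness and trustworthiness, addressing the weaknesses of Transformer models in handling missing modalities and calibrating uncertainty.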
Details
Motivation: 现有Transformer模型(如Swin UNETR)虽在基准数据上表现优异,但在临床中易受模态缺失影响且缺乏置信度校准,难以满足真实医疗场景的安全性要求。 Method: 提出三部分方法:1)零初始化多模态上下文融合(MMCF)与残差门控深层解码监督(DDS)构建鲁棒确定性主干;2)内存高效的贝叶斯微调策略实现体素级不确定性估计;3)在BraTS 2021上系统验证。 Result: BMDS-Net在保持竞争性精度的同时,在模态缺失场景下展现出显著优于基线模型的稳定性,并提供可解释的不确定性图。 Conclusion: BMDS-Net将临床鲁棒性与可信度置于指标优化之上,为医学图像分割的实际部署提供了更安全、更可靠的解决方案。 Abstract: Accurate brain tumor segmentation from multi-modal magnetic resonance imaging (MRI) is a prerequisite for precise radiotherapy planning and surgical navigation. While recent Transformer-based models such as Swin UNETR have achieved impressive benchmark performance, their clinical utility is often compromised by two critical issues: sensitivity to missing modalities (common in clinical practice) and a lack of confidence calibration. Merely chasing higher Dice scores on idealized data fails to meet the safety requirements of real-world medical deployment. In this work, we propose BMDS-Net, a unified framework that prioritizes clinical robustness and trustworthiness over simple metric maximization. Our contribution is three-fold. First, we construct a robust deterministic backbone by integrating a Zero-Init Multimodal Contextual Fusion (MMCF) module and a Residual-Gated Deep Decoder Supervision (DDS) mechanism, enabling stable feature learning and precise boundary delineation with significantly reduced Hausdorff Distance, even under modality corruption. Second, and most importantly, we introduce a memory-efficient Bayesian fine-tuning strategy that transforms the network into a probabilistic predictor, providing voxel-wise uncertainty maps to highlight potential errors for clinicians. Third, comprehensive experiments on the BraTS 2021 dataset demonstrate that BMDS-Net not only maintains competitive accuracy but, more importantly, exhibits superior stability in missing-modality scenarios where baseline models fail. 
The source code is publicly available at https://github.com/RyanZhou168/BMDS-Net.
[174] FMIR, a foundation model-based Image Registration Framework for Robust Image Registration
Fengting Zhang,Yue He,Qinghao Liu,Yaonan Wang,Xiang Chen,Hang Zhang
Main category: cs.CV
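TL;DR: This paper proposes FMIR, a foundation model-based medical image registration framework. By combining a foundation-model feature encoder with a general registration head and a channel regularization strategy, it achieves in-domain SOTA performance and cross-domain robustness after training on a single dataset.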
Details
Motivation: Deep learning has greatly accelerated medical image registration, but the small scale of medical datasets limits generalization and hinders clinical adoption. Method: The FMIR framework uses a foundation model as the feature encoder to extract anatomical structures, pairs it with a general registration head, and trains on a single dataset with a channel regularization strategy. Result: It reaches SOTA in-domain performance while maintaining robust registration on out-of-domain images. Conclusion: FMIR offers a viable path toward generalizable medical imaging foundation models under limited resources. Abstract: Deep learning has revolutionized medical image registration by achieving unprecedented speeds, yet its clinical application is hindered by a limited ability to generalize beyond the training domain, a critical weakness given the typically small scale of medical datasets. In this paper, we introduce FMIR, a foundation model-based registration framework that overcomes this limitation. Combining a foundation model-based feature encoder for extracting anatomical structures with a general registration head, and trained with a channel regularization strategy on just a single dataset, FMIR achieves state-of-the-art (SOTA) in-domain performance while maintaining robust registration on out-of-domain images. Our approach demonstrates a viable path toward building generalizable medical imaging foundation models with limited resources. The code is available at https://github.com/Monday0328/FMIR.git.
[175] Will It Zero-Shot?: Predicting Zero-Shot Classification Performance For Arbitrary Queries
Kevin Robbins,Xiaotong Liu,Yu Wu,Le Sun,Grady McPeak,Abby Stylianou,Robert Pless
Main category: cs.CV
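TL;DR: This paper proposes a method that combines generated images with text comparisons to predict the zero-shot accuracy of vision-language models (such as CLIP) on a specific task. It clearly outperforms text-only methods and gives users interpretable image feedback.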
Details
Motivation: Non-expert users lack a simple, effective way to judge whether a chosen vision-language model (VLM) suits their specific task, especially since performance tends to drop across domains. Method: Building on existing text-only evaluation, the approach generates synthetic images relevant to the target task, combines text and image information to predict zero-shot classification accuracy, and provides visual feedback. Result: Experiments on standard CLIP benchmark datasets show that the image-augmented approach significantly improves the quality of zero-shot accuracy predictions and helps users anticipate VLM effectiveness without labeled samples. Conclusion: Evaluation that incorporates generated images both improves prediction accuracy and adds interpretability, giving non-expert users a practical, intuitive tool for judging VLM suitability. Abstract: Vision-Language Models like CLIP create aligned embedding spaces for text and images, making it possible for anyone to build a visual classifier by simply naming the classes they want to distinguish. However, a model that works well in one domain may fail in another, and non-expert users have no straightforward way to assess whether their chosen VLM will work on their problem. We build on prior work using text-only comparisons to evaluate how well a model works for a given natural language task, and explore approaches that also generate synthetic images relevant to that task to evaluate and refine the prediction of zero-shot accuracy. We show that adding generated imagery to the baseline text-only scores substantially improves the quality of these predictions. Additionally, it gives a user feedback on the kinds of images that were used to make the assessment. Experiments on standard CLIP benchmark datasets demonstrate that the image-based approach helps users predict, without any labeled examples, whether a VLM will be effective for their application.
[176] OTI: A Model-free and Visually Interpretable Measure of Image Attackability
Jiaming Liang,Haowei Liu,Chi-Man Pun
Main category: cs.CV
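TL;DR: This paper proposes Object Texture Intensity (OTI), a model-free and visually interpretable measure of image attackability that assesses how easily an image can be corrupted by adversarial perturbations.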
Details
Motivation: 现有攻击性度量方法依赖于不可获取的任务模型代理,且缺乏视觉可解释性,限制了其在主动学习、对抗训练等场景中的应用。 Method: 提出Object Texture Intensity(OTI)指标,以图像中语义物体的纹理强度来衡量攻击性;从决策边界和对抗扰动的中高频特性两个角度进行理论分析。 Result: 实验表明OTI有效、计算高效,并为对抗机器学习提供了对攻击性的直观视觉理解。 Conclusion: OTI是一种模型无关、视觉可解释的图像攻击性度量方法,克服了现有方法对模型代理的依赖和缺乏可解释性的缺陷。 Abstract: Despite the tremendous success of neural networks, benign images can be corrupted by adversarial perturbations to deceive these models. Intriguingly, images differ in their attackability. Specifically, given an attack configuration, some images are easily corrupted, whereas others are more resistant. Evaluating image attackability has important applications in active learning, adversarial training, and attack enhancement. This prompts a growing interest in developing attackability measures. However, existing methods are scarce and suffer from two major limitations: (1) They rely on a model proxy to provide prior knowledge (e.g., gradients or minimal perturbation) to extract model-dependent image features. Unfortunately, in practice, many task-specific models are not readily accessible. (2) Extracted features characterizing image attackability lack visual interpretability, obscuring their direct relationship with the images. To address these, we propose a novel Object Texture Intensity (OTI), a model-free and visually interpretable measure of image attackability, which measures image attackability as the texture intensity of the image's semantic object. Theoretically, we describe the principles of OTI from the perspectives of decision boundaries as well as the mid- and high-frequency characteristics of adversarial perturbations. Comprehensive experiments demonstrate that OTI is effective and computationally efficient. In addition, our OTI provides the adversarial machine learning community with a visual understanding of attackability.[177] Saliency Driven Imagery Preprocessing for Efficient Compression -- Industrial Paper
Justin Downes,Sam Saltwick,Anthony Chen
Main category: cs.CV
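TL;DR: This paper proposes a saliency-map-based preprocessing method for satellite imagery that, combined with traditional lossy compression standards, achieves variable-rate compression within a single large satellite image.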
Details
Motivation: 每天采集数百TB的卫星图像,存储和带宽成本高昂;而多数下游任务仅关注图像中的小区域(兴趣区),现有全图均匀编码方法效率低下。 Method: 利用显著性图指导图像预处理,采用与量化显著性等级对应的可变尺寸平滑核对像素进行处理,从而优化后续压缩与编码。 Result: 实现了在同一幅大型卫星图像内按区域重要性进行可变码率压缩,提升了压缩效率与下游任务性能的平衡。 Conclusion: 基于显著性引导的预处理能有效提升传统压缩标准在卫星图像上的适应性与实用性,为遥感图像高效编码提供了新思路。 Abstract: The compression of satellite imagery remains an important research area as hundreds of terabytes of images are collected every day, which drives up storage and bandwidth costs. Although progress has been made in increasing the resolution of these satellite images, many downstream tasks are only interested in small regions of any given image. These areas of interest vary by task but, once known, can be used to optimize how information within the image is encoded. Whereas standard image encoding methods, even those optimized for remote sensing, work on the whole image equally, there are emerging methods that can be guided by saliency maps to focus on important areas. In this work we show how imagery preprocessing techniques driven by saliency maps can be used with traditional lossy compression coding standards to create variable rate image compression within a single large satellite image. Specifically, we use variable sized smoothing kernels that map to different quantized saliency levels to process imagery pixels in order to optimize downstream compression and encoding schemes.[178] Sponge Tool Attack: Stealthy Denial-of-Efficiency against Tool-Augmented Agentic Reasoning
Qi Li,Xinchao Wang
Main category: cs.CV
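TL;DR: This paper proposes Sponge Tool Attack (STA), a new attack that disrupts tool-calling LLM reasoning solely by rewriting the input prompt. Without modifying the model or the tools, it induces verbose, inefficient reasoning trajectories, increasing computational overhead while preserving semantic intent, which makes it highly stealthy.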
Details
Motivation: Existing LLM reasoning methods augmented with external tools have under-studied vulnerabilities; in particular, the tool-calling process is susceptible to malicious prompt-rewriting attacks. Method: STA is designed as an iterative, multi-agent collaborative prompt-rewriting framework with explicit rewriting-policy control, generating benign-looking prompt variants with high semantic fidelity. Result: STA's effectiveness is validated across 6 models, 12 tools, 4 agentic frameworks, and 13 cross-domain datasets; it makes reasoning trajectories significantly more redundant and substantially raises computational overhead. Conclusion: STA exposes a key security blind spot at the prompt layer of tool-augmented LLM reasoning systems and underscores the need for robust prompt-integrity protection in agent architectures. Abstract: Enabling large language models (LLMs) to solve complex reasoning tasks is a key step toward artificial general intelligence. Recent work augments LLMs with external tools to enable agentic reasoning, achieving high utility and efficiency in a plug-and-play manner. However, the inherent vulnerabilities of such methods to malicious manipulation of the tool-calling process remain largely unexplored. In this work, we identify a tool-specific attack surface and propose Sponge Tool Attack (STA), which disrupts agentic reasoning solely by rewriting the input prompt under a strict query-only access assumption. Without any modification to the underlying model or the external tools, STA converts originally concise and efficient reasoning trajectories into unnecessarily verbose and convoluted ones before arriving at the final answer. This results in substantial computational overhead while remaining stealthy by preserving the original task semantics and user intent. To achieve this, we design STA as an iterative, multi-agent collaborative framework with explicit rewriting-policy control that generates benign-looking rewrites of the original prompt with high semantic fidelity. Extensive experiments across 6 models (including both open-source models and closed-source APIs), 12 tools, 4 agentic frameworks, and 13 datasets spanning 5 domains validate the effectiveness of STA.
[179] Stylizing ViT: Anatomy-Preserving Instance Style Transfer for Domain Generalization
Sebastian Doerrich,Francesco Di Salvo,Jonas Alle,Christian Ledig
Main category: cs.CV
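TL;DR: This paper proposes Stylizing ViT, a new ViT-based encoder whose weight-shared attention blocks jointly preserve anatomical structure and perform style transfer, significantly improving cross-domain generalization for medical images and outperforming existing methods on multiple tasks.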
Details
Motivation: 深度学习模型在医学图像分析中常因数据异质性和稀缺性导致跨域和跨人群泛化能力差;传统增强方法在大域偏移下失效,而现有风格增强方法存在风格多样性不足或引入伪影的问题。 Method: 提出Stylizing ViT,采用权重共享的注意力块统一执行自注意力(保持解剖一致性)和交叉注意力(实现风格迁移),用于训练时数据增强及测试时增强(TTA)。 Result: 在组织病理学与皮肤科三类图像分类任务中,相比SOTA提升最高达13%准确率,生成图像无伪影、视觉可信;测试时增强进一步带来17%性能提升。 Conclusion: Stylizing ViT有效缓解医学图像域偏移问题,兼具强泛化性、高图像质量与实用性,为域泛化与测试时增强提供了新范式。 Abstract: Deep learning models in medical image analysis often struggle with generalizability across domains and demographic groups due to data heterogeneity and scarcity. Traditional augmentation improves robustness, but fails under substantial domain shifts. Recent advances in stylistic augmentation enhance domain generalization by varying image styles but fall short in terms of style diversity or by introducing artifacts into the generated images. To address these limitations, we propose Stylizing ViT, a novel Vision Transformer encoder that utilizes weight-shared attention blocks for both self- and cross-attention. This design allows the same attention block to maintain anatomical consistency through self-attention while performing style transfer via cross-attention. We assess the effectiveness of our method for domain generalization by employing it for data augmentation on three distinct image classification tasks in the context of histopathology and dermatology. Results demonstrate an improved robustness (up to +13% accuracy) over the state of the art while generating perceptually convincing images without artifacts. Additionally, we show that Stylizing ViT is effective beyond training, achieving a 17% performance improvement during inference when used for test-time augmentation. The source code is available at https://github.com/sdoerrich97/stylizing-vit .[180] SPACE-CLIP: Spatial Perception via Adaptive CLIP Embeddings for Monocular Depth Estimation
Taewan Cho,Taeryang Kim,Andrew Jaeyong Choi
Main category: cs.CV
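TL;DR: This paper proposes SPACE-CLIP, a dual-pathway decoder architecture that extracts geometric information directly from a frozen CLIP vision encoder, bypassing the text encoder. A semantic pathway uses FiLM to dynamically modulate high-level features, a structural pathway extracts spatial details from early layers, and the two are fused hierarchically. The model significantly improves depth estimation on KITTI and can be integrated as a spatial perception module into embodied AI systems such as VLA models.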
Details
Motivation: CLIP擅长语义理解但缺乏几何感知能力,现有方法依赖文本提示进行间接、低效的几何推理。 Method: 提出双通路解码器SPACE-CLIP:语义通路基于FiLM对齐全局上下文调制高层特征;结构通路从CLIP早期层提取细粒度空间信息;两通路分层融合,且完全不使用CLIP文本编码器。 Result: 在KITTI基准上大幅超越此前所有CLIP-based深度估计方法;消融实验证明双通路协同融合是性能关键。 Conclusion: SPACE-CLIP为重用大规模视觉模型提供了高效、简洁的新范式,不仅是一个深度估计器,更是一个即插即用的空间感知模块,适用于下一代具身AI(如VLA)系统。 Abstract: Contrastive Language-Image Pre-training (CLIP) has accomplished extraordinary success for semantic understanding but inherently struggles to perceive geometric structure. Existing methods attempt to bridge this gap by querying CLIP with textual prompts, a process that is often indirect and inefficient. This paper introduces a fundamentally different approach using a dual-pathway decoder. We present SPACE-CLIP, an architecture that unlocks and interprets latent geometric knowledge directly from a frozen CLIP vision encoder, completely bypassing the text encoder and its associated textual prompts. A semantic pathway interprets high-level features, dynamically conditioned on global context using feature-wise linear modulation (FiLM). In addition, a structural pathway extracts fine-grained spatial details from early layers. These complementary streams are hierarchically fused, enabling a robust synthesis of semantic context and precise geometry. Extensive experiments on the KITTI benchmark show that SPACE-CLIP dramatically outperforms previous CLIP-based methods. Our ablation studies validate that the synergistic fusion of our dual pathways is critical to this success. SPACE-CLIP offers a new, efficient, and architecturally elegant blueprint for repurposing large-scale vision models. The proposed method is not just a standalone depth estimator, but a readily integrable spatial perception module for the next generation of embodied AI systems, such as vision-language-action (VLA) models. Our model is available at https://github.com/taewan2002/space-clip[181] Training-Free Text-to-Image Compositional Food Generation via Prompt Grafting
Xinyue Pan,Yuhao Chen,Fengqing Zhu
Main category: cs.CV
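TL;DR: This paper proposes Prompt Grafting (PG), a training-free framework that combines explicit spatial cues in the text with implicit layout guidance during sampling. It addresses food entanglement caused by blurry boundaries in multi-food image generation and enables controllable generation of separated or mixed foods.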
Details
Motivation: 现实世界中的餐食图像常含多种食物,而现有文本到图像扩散模型因食物边界不清易导致对象纠缠(如米饭与汤融合),难以可靠生成准确的多食物图像,制约了图像膳食评估和菜谱可视化等应用。 Method: 提出Prompt Grafting(PG)——一种训练-free框架:第一阶段用布局提示(layout prompt)建立清晰的空间区域;第二阶段在布局稳定后将目标提示(target prompt)'嫁接'上去;支持通过编辑布局安排来控制不同食物项的分离或混合。 Result: 在两个食物数据集上显著提升了目标食物对象的出现率,并提供了可控分离的定性证据。 Conclusion: PG有效缓解了多食物生成中的纠缠问题,实现了无需训练、基于提示编辑的空间可控合成,为饮食健康相关图像生成任务提供了新思路。 Abstract: Real-world meal images often contain multiple food items, making reliable compositional food image generation important for applications such as image-based dietary assessment, where multi-food data augmentation is needed, and recipe visualization. However, modern text-to-image diffusion models struggle to generate accurate multi-food images due to object entanglement, where adjacent foods (e.g., rice and soup) fuse together because many foods do not have clear boundaries. To address this challenge, we introduce Prompt Grafting (PG), a training-free framework that combines explicit spatial cues in text with implicit layout guidance during sampling. PG runs a two-stage process where a layout prompt first establishes distinct regions and the target prompt is grafted once layout formation stabilizes. The framework enables food entanglement control: users can specify which food items should remain separated or be intentionally mixed by editing the arrangement of layouts. Across two food datasets, our method significantly improves the presence of target objects and provides qualitative evidence of controllable separation.[182] Uni-RS: A Spatially Faithful Unified Understanding and Generation Model for Remote Sensing
Weiyu Zhang,Yuan Hu,Yong Li,Yu Liu
Main category: cs.CV
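TL;DR: This paper proposes Uni-RS, the first unified multimodal model designed for remote sensing. Three techniques (spatial-layout planning, spatial-aware query supervision, and image-caption spatial layout variation) address the inconsistency in spatial relations between understanding and generation, significantly improving the spatial faithfulness of text-to-image generation while preserving performance on understanding tasks.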
Details
Motivation: 统一遥感多模态模型存在显著的‘空间反转诅咒’:虽能准确定位识别图像中物体,但在文本到图像生成中却难以忠实再现相同空间关系,而这类关系是遥感语义的核心。 Method: 提出Uni-RS模型,包含三部分:1)显式空间布局规划(将文本指令转化为空间布局计划,解耦几何规划与视觉合成);2)空间感知查询监督(引导可学习查询关注指令中明确指定的空间关系);3)图像-文本空间布局变化(引入几何一致的空间变换增强训练)。 Result: 在多个基准上实验表明,该方法显著提升了文本到图像生成的空间保真度,同时在图像描述、视觉定位和VQA等多模态理解任务上保持强性能。 Conclusion: Uni-RS有效缓解了遥感多模态模型中理解与生成之间的空间不对称问题,为构建空间可信的遥感生成模型提供了新范式。 Abstract: Unified remote sensing multimodal models exhibit a pronounced spatial reversal curse: Although they can accurately recognize and describe object locations in images, they often fail to faithfully execute the same spatial relations during text-to-image generation, where such relations constitute core semantic information in remote sensing. Motivated by this observation, we propose Uni-RS, the first unified multimodal model tailored for remote sensing, to explicitly address the spatial asymmetry between understanding and generation. Specifically, we first introduce explicit Spatial-Layout Planning to transform textual instructions into spatial layout plans, decoupling geometric planning from visual synthesis. We then impose Spatial-Aware Query Supervision to bias learnable queries toward spatial relations explicitly specified in the instruction. Finally, we develop Image-Caption Spatial Layout Variation to expose the model to systematic geometry-consistent spatial transformations. Extensive experiments across multiple benchmarks show that our approach substantially improves spatial faithfulness in text-to-image generation, while maintaining strong performance on multimodal understanding tasks like image captioning, visual grounding, and VQA tasks.[183] StyleDecoupler: Generalizable Artistic Style Disentanglement
Zexi Jia,Jinchao Zhang,Jie Zhou
Main category: cs.CV
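TL;DR: StyleDecoupler is an information-theoretic framework that requires no fine-tuning. It exploits the difference between multi-modal and uni-modal vision model representations to disentangle artistic style from semantic content. The accompanying large-scale art dataset WeART (280K artworks) is also released, and the method reaches SOTA on tasks such as style retrieval.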
Details
Motivation: 艺术风格表征困难,因其与语义内容深度耦合;现有方法难以在不损害内容理解的前提下提取纯风格特征。 Method: 提出 StyleDecoupler 框架:以冻结的多模态视觉语言模型(如 CLIP)提取联合表征,以单模态模型(如 ViT)输出作为内容参考,通过最小化互信息从多模态嵌入中分离出纯风格特征;整个过程无需微调,即插即用。 Result: 在自建 WeART 和公开 WikiART 数据集上风格检索性能达 SOTA;支持风格关系图谱构建与生成模型评估。 Conclusion: StyleDecoupler 有效解耦风格与内容,验证了利用模态差异进行信息分解的可行性,为艺术计算与可控生成提供了新范式。 Abstract: Representing artistic style is challenging due to its deep entanglement with semantic content. We propose StyleDecoupler, an information-theoretic framework that leverages a key insight: multi-modal vision models encode both style and content, while uni-modal models suppress style to focus on content-invariant features. By using uni-modal representations as content-only references, we isolate pure style features from multi-modal embeddings through mutual information minimization. StyleDecoupler operates as a plug-and-play module on frozen Vision-Language Models without fine-tuning. We also introduce WeART, a large-scale benchmark of 280K artworks across 152 styles and 1,556 artists. Experiments show state-of-the-art performance on style retrieval across WeART and WikiART, while enabling applications like style relationship mapping and generative model evaluation. We release our method and dataset at this url.[184] An AI-enabled tool for quantifying overlapping red blood cell sickling dynamics in microfluidic assays
Nikhil Kadivar,Guansheng Li,Jianlu Zheng,John M. Higgins,Ming Dao,George Em Karniadakis,Mengjia Xu
Main category: cs.CV
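TL;DR: This paper presents an automated deep learning framework that accurately segments, classifies, and counts densely overlapping red blood cells (RBCs) in time-lapse microscopy images, with a focus on analyzing sickle cell morphology dynamics. It combines AI-assisted annotation, nnU-Net segmentation, and watershed post-processing to achieve high-accuracy quantification from few labeled samples, markedly improving experimental throughput and drug-response assessment.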
Details
Motivation: Accurately identifying the morphological transitions of sickle cells under different biophysical conditions is challenging, especially in dense, heavily overlapping cell populations, and manual annotations are scarce and slow to produce.
Method: An end-to-end deep learning framework that integrates AI-assisted annotation (Roboflow), nnU-Net semantic segmentation, watershed instance segmentation, classification, and counting, applied to time-lapse microscopy images.
Result: High segmentation performance with little labeled data, effectively handling cell overlap and annotation scarcity; RBC morphology dynamics can be tracked quantitatively, experimental throughput more than doubles, drug-dependent sickling behavior is captured, and distinct mechanobiological signatures are revealed.
Conclusion: The AI-driven framework provides a scalable, reproducible computational platform for cellular biomechanics research and therapeutic efficacy assessment in microphysiological systems.
Abstract: Understanding sickle cell dynamics requires accurate identification of morphological transitions under diverse biophysical conditions, particularly in densely packed and overlapping cell populations. Here, we present an automated deep learning framework that integrates AI-assisted annotation, segmentation, classification, and instance counting to quantify red blood cell (RBC) populations across varying density regimes in time-lapse microscopy data. Experimental images were annotated using the Roboflow platform to generate a labeled dataset for training an nnU-Net segmentation model. The trained network enables prediction of the temporal evolution of the sickle cell fraction, while a watershed algorithm resolves overlapping cells to enhance quantification accuracy. Despite requiring only a limited amount of labeled data for training, the framework achieves high segmentation performance, effectively addressing challenges associated with scarce manual annotations and cell overlap. By quantitatively tracking dynamic changes in RBC morphology, this approach can more than double the experimental throughput via densely packed cell suspensions, capture drug-dependent sickling behavior, and reveal distinct mechanobiological signatures of cellular morphological evolution. Overall, this AI-driven framework establishes a scalable and reproducible computational platform for investigating cellular biomechanics and assessing therapeutic efficacy in microphysiological systems.
[185] Advancing Structured Priors for Sparse-Voxel Surface Reconstruction
Ting-Hsun Chi,Chu-Rong Chen,Chi-Tun Hsu,Hsuan-Ting Lin,Sheng-Yu Huang,Cheng Sun,Yu-Chiang Frank Wang
Main category: cs.CV
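TL;DR: This paper proposes a hybrid approach that combines 3D Gaussian Splatting with sparse-voxel rasterization. Through informed voxel initialization and refined depth geometry supervision, it improves the geometric accuracy, detail recovery, and convergence speed of radiance-field surface reconstruction.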
Details
Motivation: Existing explicit representations (3D Gaussian Splatting and sparse-voxel rasterization) each have strengths and weaknesses: the former converges quickly but has low surface fidelity, while the latter yields crisp geometry but is initialized inefficiently. Combining their advantages should improve surface reconstruction quality.
Method: A scene-structure-based voxel initialization strategy places voxels at plausible locations with suitable levels of detail; refined depth geometry supervision converts multi-view cues into per-ray depth regularization, improving depth consistency without blurring edges.
Result: Experiments on standard benchmarks show the method outperforms prior methods in geometric accuracy, fine-structure recovery, and surface completeness while keeping fast convergence.
Conclusion: Initialization and supervision strategies that combine the strengths of explicit representations effectively improve overall radiance-field surface reconstruction and suggest a new direction for high-quality explicit neural rendering.
Abstract: Reconstructing accurate surfaces with radiance fields has progressed rapidly, yet two promising explicit representations, 3D Gaussian Splatting and sparse-voxel rasterization, exhibit complementary strengths and weaknesses. 3D Gaussian Splatting converges quickly and carries useful geometric priors, but surface fidelity is limited by its point-like parameterization. Sparse-voxel rasterization provides continuous opacity fields and crisp geometry, but its typical uniform dense-grid initialization slows convergence and underutilizes scene structure. We combine the advantages of both by introducing a voxel initialization method that places voxels at plausible locations and with appropriate levels of detail, yielding a strong starting point for per-scene optimization. To further enhance depth consistency without blurring edges, we propose refined depth geometry supervision that converts multi-view cues into direct per-ray depth regularization. Experiments on standard benchmarks demonstrate higher geometric accuracy, better fine-structure recovery, and more complete surfaces than prior methods, while maintaining fast convergence.
[186] Implicit Neural Representation-Based Continuous Single Image Super Resolution: An Empirical Study
Tayyab Nasir,Daochang Liu,Ajmal Mian
Main category: cs.CV
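TL;DR: This paper presents a systematic empirical analysis of implicit neural representations (INRs) for arbitrary-scale image super-resolution (ASSR). It finds that existing methods deliver limited real gains, that training configuration has a significant effect, that a new loss function improves texture fidelity, and that scaling laws hold for INR-based ASSR.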
Details
Motivation: No systematic empirical study has yet evaluated the effectiveness of INR methods for ASSR or the effects of different training strategies (such as scaling laws, objective design, and optimization strategies); a rigorous analysis is needed to establish the state of the field, identify bottlenecks, and indicate directions.
Method: Build a unified framework and open-source code repository; compare existing INR methods across diverse settings; analyze the effect of training configurations on perceptual quality under controlled variables; propose a new loss function that preserves edges and texture detail; and verify whether scaling laws apply to INR-based ASSR.
Result: (1) Complex INR methods bring only marginal improvements; (2) model performance depends heavily on training configuration; (3) the new loss clearly improves texture fidelity across architectures; (4) scaling laws hold, with greater model complexity and data diversity yielding predictable gains.
Conclusion: Research on INR-based ASSR should shift its focus from architectural novelty alone to training-recipe optimization and objective design; scaling laws give a predictable path for method development; a unified evaluation framework and open-source resources help the field progress.
Abstract: Implicit neural representation (INR) has become the standard approach for arbitrary-scale image super-resolution (ASSR). To date, no empirical study has systematically examined the effectiveness of existing methods, nor investigated the effects of different training recipes, such as scaling laws, objective design, and optimization strategies. A rigorous empirical analysis is essential not only for benchmarking performance and revealing true gains but also for establishing the current state of ASSR, identifying saturation limits, and highlighting promising directions. We fill this gap by comparing existing techniques across diverse settings and presenting aggregated performance results on multiple image quality metrics. We contribute a unified framework and code repository to facilitate reproducible comparisons. Furthermore, we investigate the impact of carefully controlled training configurations on perceptual image quality and examine a new loss function that penalizes intensity variations while preserving edges, textures, and finer details during training. We draw the following key insights that have been previously overlooked: (1) Recent, more complex INR methods provide only marginal improvements over earlier methods. (2) Model performance is strongly correlated with training configurations, a factor overlooked in prior works. (3) The proposed loss enhances texture fidelity across architectures, emphasizing the role of objective design for targeted perceptual gains.
(4) Scaling laws apply to INR-based ASSR, confirming predictable gains with increased model complexity and data diversity.
[187] Flatten The Complex: Joint B-Rep Generation via Compositional $k$-Cell Particles
Junran Lu,Yuanqi Li,Hengji Li,Jie Guo,Yanwen Guo
Main category: cs.CV
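TL;DR: This paper proposes a new B-Rep generation paradigm based on sets of $k$-cell particles. By decoupling the topological hierarchy, sharing latents at interfaces to couple geometry, and using a multi-modal flow matching framework, it supports unconditional generation, single-view and point-cloud reconstruction, local in-painting, and non-manifold structure synthesis, markedly improving the fidelity, validity, and editability of CAD models.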
Details
Motivation: B-Rep generation suffers from entangled topology and geometry, a rigid hierarchical structure, and under-use of geometric relationships between cells (such as adjacency and sharing), which results in weak context awareness and poor error recovery.
Method: Reformulate the B-Rep as a set of compositional $k$-cell particles, where each topological entity is composed of particles and adjacent cells share identical latents at their interfaces; synthesize the particle sets with a multi-modal flow matching framework that supports unconditional generation and a range of conditional tasks.
Result: High-fidelity, valid, and editable CAD models on 3D reconstruction, local in-painting, and non-manifold structure generation, outperforming state-of-the-art methods.
Conclusion: The particle-based B-Rep representation decouples the traditional hierarchy and handles vertices, edges, and faces uniformly, enabling joint generation of topology and geometry with global context modeling and offering a more flexible, robust, and extensible paradigm for CAD generation.
Abstract: Boundary Representation (B-Rep) is the widely adopted standard in Computer-Aided Design (CAD) and manufacturing. However, generative modeling of B-Reps remains a formidable challenge due to their inherent heterogeneity as geometric cell complexes, which entangles topology with geometry across cells of varying orders (i.e., $k$-cells such as vertices, edges, faces). Previous methods typically rely on cascaded sequences to handle this hierarchy, which fails to fully exploit the geometric relationships between cells, such as adjacency and sharing, limiting context awareness and error recovery. To fill this gap, we introduce a novel paradigm that reformulates B-Reps into sets of compositional $k$-cell particles. Our approach encodes each topological entity as a composition of particles, where adjacent cells share identical latents at their interfaces, thereby promoting geometric coupling along shared boundaries. By decoupling the rigid hierarchy, our representation unifies vertices, edges, and faces, enabling the joint generation of topology and geometry with global context awareness. We synthesize these particle sets using a multi-modal flow matching framework to handle unconditional generation as well as precise conditional tasks, such as 3D reconstruction from a single view or a point cloud. Furthermore, the explicit and localized nature of our representation naturally extends to downstream tasks like local in-painting and enables the direct synthesis of non-manifold structures (e.g., wireframes).
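The idea that adjacent cells share identical latents at their interfaces can be pictured with a small data-structure sketch. This is an illustration only: the abstract does not give the particle layout, latent dimensionality, or decoder, so the `Cell` class, the `latents` table, and the 2-D latent values below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """A k-cell (order 0 = vertex, 1 = edge, 2 = face) stored as ids into a shared particle table."""
    name: str
    order: int
    particle_ids: tuple


# Hypothetical shared latent table: particle id -> latent vector (2-D here for readability).
latents = {
    0: (0.0, 0.0),  # particle at vertex v0
    1: (1.0, 0.0),  # particle at vertex v1, the interface between e0 and e1
    2: (0.5, 0.0),  # interior particle of edge e0
    3: (1.0, 0.5),  # interior particle of edge e1
    4: (1.0, 1.0),  # particle at the far end of e1
}

v0 = Cell("v0", 0, (0,))
v1 = Cell("v1", 0, (1,))
e0 = Cell("e0", 1, (0, 2, 1))
e1 = Cell("e1", 1, (1, 3, 4))


def shared_particles(a, b):
    """Return the particle ids two cells have in common, i.e. their interface."""
    return sorted(set(a.particle_ids) & set(b.particle_ids))


def cell_latents(cell):
    """Look up the latent vectors of a cell from the shared table."""
    return [latents[i] for i in cell.particle_ids]
```

Because both edges reference particle 1 rather than holding private copies, any change to that latent is seen by every cell that touches the interface, and adjacency can be read off from shared ids instead of from a separate hierarchy.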
Extensive experiments demonstrate that our method produces high-fidelity CAD models with superior validity and editability compared to state-of-the-art methods.
[188] The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation
Chenyu Mu,Xin He,Qu Yang,Wanshun Chen,Jiadi Yao,Huang Liu,Zihao Yi,Bo Zhao,Xingyu Chen,Ruotian Ma,Fanghua Ye,Erkun Yang,Cheng Deng,Zhaopeng Tu,Xiaolong Li,Linus
Main category: cs.CV
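TL;DR: This paper proposes an end-to-end agentic framework that turns dialogue text into coherent, long-horizon cinematic video: ScripterAgent generates an executable script and DirectorAgent coordinates video generation. It also introduces the ScriptBench benchmark and the VSA metric, and reveals a trade-off in current models between visual expressiveness and script faithfulness.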
Details
Motivation: Existing video generation models can produce high-quality visuals from text, but a "semantic gap" appears when generating long-horizon, semantically coherent narrative video from high-level concepts such as dialogue.
Method: An end-to-end agent-based framework: ScripterAgent converts coarse dialogue into a fine-grained, executable cinematic script (supported by ScriptBench, a newly built, expert-annotated multimodal benchmark); DirectorAgent coordinates state-of-the-art video models with a cross-scene continuous generation strategy; CriticAgent and the Visual-Script Alignment (VSA) metric are used for evaluation.
Result: Script faithfulness and temporal fidelity improve significantly across video models; current SOTA models show a key trade-off between visual spectacle and strict script adherence.
Conclusion: The framework effectively bridges the semantic gap between creative ideas and their cinematic realization, providing a new paradigm and empirical insights for automated filmmaking.
Abstract: Recent advances in video generation have produced models capable of synthesizing stunning visual content from simple text prompts. However, these models struggle to generate long-form, coherent narratives from high-level concepts like dialogue, revealing a "semantic gap" between a creative idea and its cinematic execution. To bridge this gap, we introduce a novel, end-to-end agentic framework for dialogue-to-cinematic-video generation. Central to our framework is ScripterAgent, a model trained to translate coarse dialogue into a fine-grained, executable cinematic script. To enable this, we construct ScriptBench, a new large-scale benchmark with rich multimodal context, annotated via an expert-guided pipeline. The generated script then guides DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy to ensure long-horizon coherence. Our comprehensive evaluation, featuring an AI-powered CriticAgent and a new Visual-Script Alignment (VSA) metric, shows our framework significantly improves script faithfulness and temporal fidelity across all tested video models. Furthermore, our analysis uncovers a crucial trade-off in current SOTA models between visual spectacle and strict script adherence, providing valuable insights for the future of automated filmmaking.
[189] Learning Sewing Patterns via Latent Flow Matching of Implicit Fields
Cong Cao,Ren Li,Corentin Dumery,Hao Li
Main category: cs.CV
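TL;DR: This paper proposes an implicit-representation method for modeling sewing patterns: signed and unsigned distance fields represent panel boundaries and seam endpoints, respectively, and a latent flow matching model learns the distribution over panel combinations, enabling accurate generation of structurally complex patterns and high-accuracy image-to-pattern estimation.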
Details
Motivation: Sewing patterns underpin applications such as garment design, fabrication, and physical simulation, but panel geometry and seam arrangements vary widely, so automated modeling remains challenging.
Method: Use an implicit representation: each panel is described by a signed distance field (defining its boundary) and an unsigned distance field (marking seam endpoints); encode these into a continuous latent space that supports differentiable meshing; learn the distribution over panel combinations with a latent flow matching model; and recover seam relations from extracted edge segments with a stitching prediction module.
Result: Accurate modeling and generation of structurally complex sewing patterns; better image-to-pattern estimation than existing methods; support for practical functions such as pattern completion and refitting.
Conclusion: The method provides a more robust, flexible, and differentiable sewing pattern modeling framework for digital fashion design.
Abstract: Sewing patterns define the structural foundation of garments and are essential for applications such as fashion design, fabrication, and physical simulation. Despite progress in automated pattern generation, accurately modeling sewing patterns remains difficult due to the broad variability in panel geometry and seam arrangements. In this work, we introduce a sewing pattern modeling method based on an implicit representation. We represent each panel using a signed distance field that defines its boundary and an unsigned distance field that identifies seam endpoints, and encode these fields into a continuous latent space that enables differentiable meshing. A latent flow matching model learns distributions over panel combinations in this representation, and a stitching prediction module recovers seam relations from extracted edge segments. This formulation allows accurate modeling and generation of sewing patterns with complex structures. We further show that it can be used to estimate sewing patterns from images with improved accuracy relative to existing approaches, and supports applications such as pattern completion and refitting, providing a practical tool for digital fashion design.
[190] Frequency-aware Neural Representation for Videos
Jun Zhu,Xinfeng Zhang,Lv Tang,Junhao Jiang,Gai Zhang,Jia Wang
Main category: cs.CV
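TL;DR: This paper proposes FaNeRV, a frequency-aware neural video representation that decouples low- and high-frequency components and adds a multi-resolution supervision strategy, a dynamic high-frequency injection mechanism, and a frequency-decomposed network module, significantly improving video reconstruction quality and rate-distortion performance.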
Details
Motivation: Existing INR-based video compression methods have an inherent spectral bias toward low-frequency components, which leads to over-smoothed reconstructions and poor rate-distortion performance.
Method: The FaNeRV framework comprises: (1) explicit decoupling of low- and high-frequency components; (2) multi-resolution supervision that guides the network in stages to learn global structure and fine texture; (3) dynamic high-frequency injection that adaptively strengthens high-frequency reconstruction in difficult regions; (4) a frequency-decomposed network module that improves feature modeling across frequency bands.
Result: Extensive experiments on standard benchmarks show FaNeRV significantly outperforms existing INR methods and is competitive with traditional codecs in rate-distortion performance.
Conclusion: Through its frequency-aware design, FaNeRV effectively mitigates the spectral bias of INRs, achieving efficient and faithful video reconstruction and offering a new direction for neural video compression.
Abstract: Implicit Neural Representations (INRs) have emerged as a promising paradigm for video compression. However, existing INR-based frameworks typically suffer from inherent spectral bias, which favors low-frequency components and leads to over-smoothed reconstructions and suboptimal rate-distortion performance. In this paper, we propose FaNeRV, a Frequency-aware Neural Representation for videos, which explicitly decouples low- and high-frequency components to enable efficient and faithful video reconstruction. FaNeRV introduces a multi-resolution supervision strategy that guides the network to progressively capture global structures and fine-grained textures through staged supervision. To further enhance high-frequency reconstruction, we propose a dynamic high-frequency injection mechanism that adaptively emphasizes challenging regions. In addition, we design a frequency-decomposed network module to improve feature modeling across different spectral bands. Extensive experiments on standard benchmarks demonstrate that FaNeRV significantly outperforms state-of-the-art INR methods and achieves competitive rate-distortion performance against traditional codecs.
[191] Video Compression with Hierarchical Temporal Neural Representation
Jun Zhu,Xinfeng Zhang,Lv Tang,Junhao Jiang,Gai Zhang,Jia Wang
Main category: cs.CV
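TL;DR: This paper proposes TeNeRV, a hierarchical temporal neural representation that uses inter-frame feature fusion and GoP-adaptive modulation to model both short- and long-term dependencies in video, significantly improving the rate-distortion performance of INR-based video compression.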
Details
Motivation: Existing INR-based video compression methods treat the temporal dimension as an independent input and struggle to capture complex temporal dependencies.
Method: TeNeRV has two components: (1) an Inter-Frame Feature Fusion (IFF) module that aggregates features from adjacent frames to model local temporal coherence and fine-grained motion; (2) a GoP-Adaptive Modulation (GAM) mechanism that partitions the video into Groups of Pictures (GoPs), learns group-specific priors, and modulates network parameters for adaptive representation.
Result: Experiments on multiple datasets show TeNeRV consistently outperforms existing INR-based video compression methods in rate-distortion performance.
Conclusion: Hierarchical modeling of short- and long-term temporal dependencies is crucial to INR video compression performance; TeNeRV offers a new approach to efficient video representation.
Abstract: Video compression has recently benefited from implicit neural representations (INRs), which model videos as continuous functions. INRs offer compact storage and flexible reconstruction, providing a promising alternative to traditional codecs. However, most existing INR-based methods treat the temporal dimension as an independent input, limiting their ability to capture complex temporal dependencies. To address this, we propose a Hierarchical Temporal Neural Representation for Videos, TeNeRV. TeNeRV integrates short- and long-term dependencies through two key components. First, an Inter-Frame Feature Fusion (IFF) module aggregates features from adjacent frames, enforcing local temporal coherence and capturing fine-grained motion. Second, a GoP-Adaptive Modulation (GAM) mechanism partitions videos into Groups-of-Pictures and learns group-specific priors. The mechanism modulates network parameters, enabling adaptive representations across different GoPs. Extensive experiments demonstrate that TeNeRV consistently outperforms existing INR-based methods in rate-distortion performance, validating the effectiveness of our proposed approach.
[192] Bridging Supervision Gaps: A Unified Framework for Remote Sensing Change Detection
Kaixuan Jiang,Chen Wu,Zhenghui Zhao,Chengxi Han
Main category: cs.CV
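TL;DR: This paper proposes UniCD, a unified change detection framework for remote sensing imagery that handles supervised, weakly-supervised, and unsupervised change detection through a shared encoder and a multi-branch collaborative learning mechanism.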
Details
Motivation: In real-world scenarios, pixel-level change annotations are costly, and existing models struggle to adapt to different levels of annotation availability.
Method: UniCD uses a shared encoder and a three-branch architecture: the supervised branch introduces a spatial-temporal awareness module (STAM); the weakly-supervised branch uses change representation regularization (CRR); the unsupervised branch proposes semantic prior-driven change inference (SPCI).
Result: On mainstream datasets UniCD achieves the best performance on all three tasks; on LEVIR-CD it surpasses the current SOTA by 12.72% and 12.37% on the weakly-supervised and unsupervised tasks, respectively.
Conclusion: UniCD effectively fuses heterogeneous supervision signals, markedly improves change detection where annotations are scarce, and provides a unified, workable framework for multi-paradigm change detection.
Abstract: Change detection (CD) aims to identify surface changes from multi-temporal remote sensing imagery. In real-world scenarios, pixel-level change labels are expensive to acquire, and existing models struggle to adapt to scenarios with diverse annotation availability. To tackle this challenge, we propose a unified change detection framework (UniCD), which collaboratively handles supervised, weakly-supervised, and unsupervised tasks through a coupled architecture. UniCD eliminates architectural barriers through a shared encoder and multi-branch collaborative learning mechanism, achieving deep coupling of heterogeneous supervision signals. Specifically, UniCD consists of three supervision-specific branches. In the supervised branch, UniCD introduces the spatial-temporal awareness module (STAM), achieving efficient synergistic fusion of bi-temporal features. In the weakly-supervised branch, we construct change representation regularization (CRR), which steers model convergence from coarse-grained activations toward coherent and separable change modeling. In the unsupervised branch, we propose semantic prior-driven change inference (SPCI), which transforms unsupervised tasks into controlled weakly-supervised path optimization. Experiments on mainstream datasets demonstrate that UniCD achieves the best performance across the three tasks. It exhibits significant accuracy improvements in weakly-supervised and unsupervised scenarios, surpassing the current state of the art by 12.72% and 12.37% on LEVIR-CD, respectively.
[193] MV-S2V: Multi-View Subject-Consistent Video Generation
Ziyang Song,Xinyu Gong,Bangya Liu,Zelin Zhao
Main category: cs.CV
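TL;DR: This paper introduces the new task of Multi-View Subject-to-Video generation (MV-S2V). It combines synthetic data curation with real data and introduces TS-RoPE to distinguish cross-subject from cross-view references, achieving 3D-level subject-consistent video generation.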
Details
Motivation: Existing Subject-to-Video (S2V) methods support only single-view references, so the task degenerates into an S2I + I2V pipeline and cannot realize the full potential of subject control in video; multi-view inputs are needed for truly 3D-consistent subject generation.
Method: Define the MV-S2V task; build a synthetic data generation pipeline supplemented by a small real-world captured dataset; design Temporally Shifted RoPE (TS-RoPE) to distinguish different subjects from different views of the same subject.
Result: Better 3D subject consistency and high-quality visual output with multi-view references, clearly outperforming single-view baselines.
Conclusion: MV-S2V opens a new direction for video generation, confirms that multi-view conditioning improves subject consistency, and moves subject-driven video generation toward 3D controllability.
Abstract: Existing Subject-to-Video Generation (S2V) methods have achieved high-fidelity and subject-consistent video generation, yet remain constrained to single-view subject references. This limitation renders the S2V task reducible to an S2I + I2V pipeline, failing to exploit the full potential of video subject control. In this work, we propose and address the challenging Multi-View S2V (MV-S2V) task, which synthesizes videos from multiple reference views to enforce 3D-level subject consistency. To address the scarcity of training data, we first develop a synthetic data curation pipeline to generate highly customized synthetic data, complemented by a small-scale real-world captured dataset to boost the training of MV-S2V. Another key issue lies in the potential confusion between cross-subject and cross-view references in conditional generation. To overcome this, we further introduce Temporally Shifted RoPE (TS-RoPE) to distinguish between different subjects and distinct views of the same subject in reference conditioning. Our framework achieves superior 3D subject consistency w.r.t. multi-view reference images and high-quality visual outputs, establishing a new meaningful direction for subject-driven video generation. Our project page is available at this URL.
[194] Agreement-Driven Multi-View 3D Reconstruction for Live Cattle Weight Estimation
Rabin Dulal,Wenfeng Jia,Lihong Zheng,Jane Quinn
Main category: cs.CV
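TL;DR: This paper proposes a non-contact algorithm for estimating live cattle weight from multi-view RGB images with SAM 3D-based fusion, and shows that under low-data conditions classical ensemble models are better suited to real farm settings than deep learning models.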
Details
Motivation: Traditional weighing and body condition scoring require manual handling, which affects animal welfare and productivity; a low-cost, non-contact, easily deployed alternative is needed.
Method: Take multi-view RGB images as input, generate a single 3D point cloud per animal using the SAM 3D model with an agreement-guided point cloud fusion strategy, then regress weight with classical ensemble regressors (such as random forest and XGBoost) and with deep learning models.
Result: SAM 3D with multi-view fusion gives better 3D reconstruction quality than other methods; classical ensemble models are the most robust under low data ($R^2 = 0.69 \pm 0.10$, MAPE $= 2.22\% \pm 0.56\%$) and better suited to on-farm deployment.
Conclusion: Improving 3D reconstruction quality raises weight estimation performance more effectively than increasing model complexity, especially on farms where large volumes of labeled 3D data are hard to obtain.
Abstract: Accurate cattle live weight estimation is vital for livestock management, welfare, and productivity. Traditional methods, such as manual weighing using a walk-over weighing system or proximate measurements using body condition scoring, involve manual handling of stock and can impact productivity from both a stock and economic perspective. To address these issues, this study investigated a cost-effective, non-contact method for live weight calculation in cattle using 3D reconstruction. The proposed pipeline utilized multi-view RGB images with SAM 3D-based agreement-guided fusion, followed by ensemble regression. Our approach generates a single 3D point cloud per animal and compares classical ensemble models with deep learning models under low-data conditions. Results show that SAM 3D with multi-view agreement fusion outperforms other 3D generation methods, while classical ensemble models provide the most consistent performance for practical farm scenarios ($R^2 = 0.69 \pm 0.10$, MAPE $= 2.22 \pm 0.56\%$), making this practical for on-farm implementation. These findings demonstrate that improving reconstruction quality is more critical than increasing model complexity for scalable deployment on farms where producing a large volume of 3D data is challenging.
[195] ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning
Wen Luo,Peng Chen,Xiaotao Huang,LiQun Huang
Main category: cs.CV
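TL;DR: This paper proposes the ViTCoP framework, which combines redundancy filtering in the vision encoder with hierarchical collaborative pruning in the LLM and uses the L2 norm of K-vectors as the saliency metric, substantially reducing computation while maintaining or even improving LVLM performance on image and video understanding tasks.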
Details
Motivation: Existing visual token pruning methods have two main limitations: pruning in the vision encoder tends to discard critical information too early, while pruning in the LLM tends to leave redundant information among the retained tokens.
Method: A Visual and Textual Semantic Collaborative Pruning framework (ViTCoP) with two parts: (1) redundancy filtering in the vision encoder; (2) step-wise co-pruning based on the LLM's hierarchical structure, using the L2 norm of K-vectors as the token saliency metric in the LLM for compatibility with acceleration techniques such as FlashAttention.
Result: Experiments on multiple LVLMs show ViTCoP achieves SOTA performance on image and video understanding tasks, significantly reduces inference latency and GPU memory use, and shows a larger advantage at extreme pruning rates.
Conclusion: By jointly optimizing the pruning strategy of the vision and language modules, ViTCoP achieves efficient, robust, and hardware-friendly visual token compression, greatly improving inference efficiency while preserving model performance.
Abstract: Large Vision-Language Models (LVLMs) incur high computational costs due to significant redundancy in their visual tokens. To effectively reduce this cost, researchers have proposed various visual token pruning methods. However, existing methods are generally limited, either losing critical visual information prematurely due to pruning in the vision encoder, or leading to information redundancy among the selected tokens due to pruning in the Large Language Models (LLMs). To address these challenges, we propose a Visual and Textual Semantic Collaborative Pruning framework (ViTCoP) that combines redundancy filtering in the vision encoder with step-wise co-pruning within the LLM based on its hierarchical characteristics, to efficiently preserve critical and informationally diverse visual tokens. Meanwhile, to ensure compatibility with acceleration techniques like FlashAttention, we introduce the L2 norm of K-vectors as the token saliency metric in the LLM. Extensive experiments on various Large Vision-Language Models demonstrate that ViTCoP not only achieves state-of-the-art performance surpassing existing methods on both image and video understanding tasks, but also significantly reduces model inference latency and GPU memory consumption. Notably, its performance advantage over other methods becomes even more pronounced under extreme pruning rates.
[196] VAE-REPA: Variational Autoencoder Representation Alignment for Efficient Diffusion Training
Mengmeng Wang,Dengyang Jiang,Liuzhuozheng Li,Yucheng Lin,Guojiang Shen,Xiangjie Kong,Yong Liu,Guang Dai,Jingdong Wang
Main category: cs.CV
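TL;DR: This paper proposes VAE-REPA, a lightweight intrinsic guidance framework that uses the reconstruction property of a pre-trained VAE to align the intermediate features of diffusion transformers, significantly improving training convergence speed and generation quality with only 4% extra compute.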
Details
Motivation: Denoising diffusion transformers converge inefficiently during training, and acceleration methods such as REPA and SRA depend on external encoders or dual-model setups, which adds heavy computational overhead.
Method: VAE-REPA uses features from an off-the-shelf pre-trained VAE, aligns the diffusion transformer's intermediate latent features with the VAE features through a lightweight projection layer, and supervises with a feature alignment loss.
Result: Across experiments, VAE-REPA improves generation quality and convergence speed over the baseline, matches or exceeds existing acceleration methods, adds only 4% GFLOPs, and needs no external guidance model.
Conclusion: VAE-REPA is a simple, efficient, low-overhead way to accelerate diffusion transformer training; it makes full use of the VAE's inherent visual priors and avoids external dependencies and complex architectures.
Abstract: Denoising-based diffusion transformers, despite their strong generation performance, suffer from inefficient training convergence. Existing methods addressing this issue, such as REPA (relying on external representation encoders) or SRA (requiring dual-model setups), inevitably incur heavy computational overhead during training due to external dependencies. To tackle these challenges, this paper proposes VAE-REPA, a lightweight intrinsic guidance framework for efficient diffusion training. VAE-REPA leverages off-the-shelf pre-trained Variational Autoencoder (VAE) features: their reconstruction property ensures inherent encoding of visual priors like rich texture details, structural patterns, and basic semantic information. Specifically, VAE-REPA aligns the intermediate latent features of diffusion transformers with VAE features via a lightweight projection layer, supervised by a feature alignment loss. This design accelerates training without extra representation encoders or dual-model maintenance, resulting in a simple yet effective pipeline. Extensive experiments demonstrate that VAE-REPA improves both generation quality and training convergence speed compared to vanilla diffusion transformers, matches or outperforms state-of-the-art acceleration methods, and incurs merely 4% extra GFLOPs with zero additional cost for external guidance models.
[197] Geometry-Grounded Gaussian Splatting
Baowen Zhang,Chenxing Jiang,Heng Li,Shaojie Shen,Ping Tan
Main category: cs.CV
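TL;DR: This paper proposes a theoretical framework that treats Gaussian primitives as stochastic solids, enabling geometry-grounded Gaussian Splatting and significantly improving shape reconstruction quality and multi-view consistency.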
Details
Motivation: Existing methods for extracting shape from Gaussian primitives suffer from multi-view inconsistency and sensitivity to floaters because of inadequate geometry parameterization and approximation.
Method: Through rigorous theoretical derivation, model Gaussian primitives as a type of stochastic solid, and use their volumetric nature to efficiently render high-quality depth maps for fine-grained geometry extraction.
Result: On public datasets, the method achieves the best results among all Gaussian Splatting-based shape reconstruction methods.
Conclusion: Gaussian primitives can be rigorously treated as stochastic solids, which provides a theoretical foundation and a practical route for geometry-grounded Gaussian Splatting.
Abstract: Gaussian Splatting (GS) has demonstrated impressive quality and efficiency in novel view synthesis. However, shape extraction from Gaussian primitives remains an open problem. Due to inadequate geometry parameterization and approximation, existing shape reconstruction methods suffer from poor multi-view consistency and are sensitive to floaters. In this paper, we present a rigorous theoretical derivation that establishes Gaussian primitives as a specific type of stochastic solid. This theoretical framework provides a principled foundation for Geometry-Grounded Gaussian Splatting by enabling the direct treatment of Gaussian primitives as explicit geometric representations. Using the volumetric nature of stochastic solids, our method efficiently renders high-quality depth maps for fine-grained geometry extraction. Experiments show that our method achieves the best shape reconstruction results among all Gaussian Splatting-based methods on public datasets.
[198] SynMind: Reducing Semantic Hallucination in fMRI-Based Image Reconstruction
Lan Yang,Minghan Yang,Ke Li,Honggang Zhang,Kaiyue Pang,Yi-Zhe Song
Main category: cs.CV
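TL;DR: This paper proposes SynMind, a framework that parses fMRI signals into sentence-level semantic descriptions and combines them with visual priors to guide a diffusion model, significantly improving the semantic accuracy and perceptual consistency of reconstructions from brain imaging.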
Details
Motivation: Existing fMRI image reconstruction methods achieve high visual fidelity but often show semantic misalignment (such as hallucinated objects), because they over-rely on entangled low-level visual embeddings and neglect explicit semantic identity.
Method: Use grounded vision-language models (VLMs) to generate multi-granularity, human-like textual descriptions (covering object identity and spatial relations), treat them as explicit semantic encodings, and fuse them with visual priors to condition Stable Diffusion 1.4.
Result: SynMind surpasses SOTA on most quantitative metrics; using only SD 1.4 and a single consumer GPU it outperforms SDXL-based methods; human evaluation confirms its reconstructions better match human visual perception; neurovisualization shows it engages broader, more semantically relevant brain regions.
Conclusion: Explicit hierarchical, compositional semantic decoding is a key route to improving semantic accuracy in fMRI reconstruction; SynMind offers an interpretable, efficient, and perceptually consistent paradigm for brain decoding.
Abstract: Recent advances in fMRI-based image reconstruction have achieved remarkable photo-realistic fidelity. Yet, a persistent limitation remains: while reconstructed images often appear naturalistic and holistically similar to the target stimuli, they frequently suffer from severe semantic misalignment -- salient objects are often replaced or hallucinated despite high visual quality. In this work, we address this limitation by rethinking the role of explicit semantic interpretation in fMRI decoding. We argue that existing methods rely too heavily on entangled visual embeddings which prioritize low-level appearance cues -- such as texture and global gist -- over explicit semantic identity. To overcome this, we parse fMRI signals into rich, sentence-level semantic descriptions that mirror the hierarchical and compositional nature of human visual understanding. We achieve this by leveraging grounded VLMs to generate synthetic, human-like, multi-granularity textual representations that capture object identities and spatial organization. Built upon this foundation, we propose SynMind, a framework that integrates these explicit semantic encodings with visual priors to condition a pretrained diffusion model. Extensive experiments demonstrate that SynMind outperforms state-of-the-art methods across most quantitative metrics. Notably, by offloading semantic reasoning to our text-alignment module, SynMind surpasses competing methods based on SDXL while using the much smaller Stable Diffusion 1.4 and a single consumer GPU.
Large-scale human evaluations further confirm that SynMind produces reconstructions more consistent with human visual perception. Neurovisualization analyses reveal that SynMind engages broader and more semantically relevant brain regions, mitigating the over-reliance on high-level visual areas.
[199] Domain Generalization with Quantum Enhancement for Medical Image Classification: A Lightweight Approach for Cross-Center Deployment
Jingsong Xia,Siqi Wang
Main category: cs.CV
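TL;DR: This paper proposes a lightweight domain generalization framework with quantum-enhanced collaborative learning that improves the robustness of medical imaging AI models in cross-center deployment without requiring real multi-center labeled data.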
Details
Motivation: Medical imaging AI models perform well in single-center or single-device settings, but domain shift degrades their performance in real cross-center deployment, limiting clinical generalizability.
Method: Build a domain-invariant encoder on MobileNetV2 with three components: (1) multi-domain imaging shift simulation (brightness, contrast, sharpening, and noise perturbations); (2) domain-adversarial training (gradient reversal); (3) a lightweight quantum feature enhancement layer (parameterized quantum circuits for nonlinear mapping and entanglement modeling). A test-time adaptation strategy is added at inference.
Result: On simulated multi-center medical imaging datasets, the method significantly outperforms baselines without domain generalization or quantum enhancement, reducing performance variance on unseen target domains and improving AUC and sensitivity.
Conclusion: Quantum-enhanced domain generalization has clinical potential under limited computational resources and offers a feasible paradigm for hybrid quantum-classical medical imaging systems.
Abstract: Medical image artificial intelligence models often achieve strong performance in single-center or single-device settings, yet their effectiveness frequently deteriorates in real-world cross-center deployment due to domain shift, limiting clinical generalizability. To address this challenge, we propose a lightweight domain generalization framework with quantum-enhanced collaborative learning, enabling robust generalization to unseen target domains without relying on real multi-center labeled data. Specifically, a MobileNetV2-based domain-invariant encoder is constructed and optimized through three key components: (1) multi-domain imaging shift simulation using brightness, contrast, sharpening, and noise perturbations to emulate heterogeneous acquisition conditions; (2) domain-adversarial training with gradient reversal to suppress domain-discriminative features; and (3) a lightweight quantum feature enhancement layer that applies parameterized quantum circuits for nonlinear feature mapping and entanglement modeling. In addition, a test-time adaptation strategy is employed during inference to further alleviate distribution shifts. Experiments on simulated multi-center medical imaging datasets demonstrate that the proposed method significantly outperforms baseline models without domain generalization or quantum enhancement on unseen domains, achieving reduced domain-specific performance variance and improved AUC and sensitivity.
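The imaging shift simulation in component (1) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `simulate_domain_shift`, its parameters, and the choice of a grayscale image with values in [0, 1] are hypothetical, and sharpening is left out to keep the sketch short.

```python
import random


def simulate_domain_shift(image, brightness=0.0, contrast=1.0, noise_std=0.0, seed=None):
    """Perturb a grayscale image (rows of floats in [0, 1]) to mimic a different scanner.

    contrast scales pixel values about the image mean, brightness adds an offset,
    and noise_std is the standard deviation of additive Gaussian noise.
    """
    rng = random.Random(seed)
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    shifted = []
    for row in image:
        new_row = []
        for p in row:
            q = (p - mean) * contrast + mean   # contrast change about the mean
            q += brightness                    # global brightness offset
            q += rng.gauss(0.0, noise_std)     # additive sensor noise
            new_row.append(min(1.0, max(0.0, q)))  # clamp back into [0, 1]
        shifted.append(new_row)
    return shifted


# Example: three hypothetical "centers" generated from one source image.
source = [[0.2, 0.4], [0.6, 0.8]]
centers = [
    simulate_domain_shift(source, brightness=0.1),
    simulate_domain_shift(source, contrast=0.7),
    simulate_domain_shift(source, noise_std=0.05, seed=0),
]
```

Training on several such perturbed copies of each image is one common way to expose an encoder to acquisition variability when real multi-center labels are unavailable; the perturbation ranges would need tuning for any real modality.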
These results highlight the clinical potential of quantum-enhanced domain generalization under constrained computational resources and provide a feasible paradigm for hybrid quantum--classical medical imaging systems.
[200] MV-SAM: Multi-view Promptable Segmentation using Pointmap Guidance
Yoonwoo Jeong,Cheng Sun,Yu-Chiang Frank Wang,Minsu Cho,Jaesung Choe
Main category: cs.CV
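TL;DR: This paper proposes MV-SAM, a framework that uses pointmaps to achieve 3D-consistent segmentation of multi-view images without explicit 3D networks or annotated 3D data; on multiple benchmarks it outperforms SAM2-Video and is comparable to methods that require per-scene optimization.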
Details
Motivation: Existing promptable segmentation models (such as SAM) lack 3D awareness on video and multi-view images, which leads to inconsistent results across views and requires costly per-scene optimization to enforce 3D consistency.
Method: MV-SAM uses pointmaps reconstructed by visual geometry models (with one-to-one pixel-to-3D-point correspondence) to lift SAM's image embeddings into 3D point embeddings, which a transformer decodes together with 3D prompt embeddings and 3D positional encodings, aligning 2D interactions with 3D geometry.
Result: On multi-view and 3D segmentation benchmarks including NVOS, SPIn-NeRF, ScanNet++, uCo3D, and DL3DV, MV-SAM surpasses SAM2-Video and performs comparably to methods that need per-scene optimization; trained only on SA-1B, it generalizes well.
Conclusion: MV-SAM shows that pointmaps generated from unposed images are enough for efficient, general 3D-consistent multi-view segmentation without 3D supervision or dedicated 3D networks, offering a new paradigm for extending promptable segmentation into 3D.
Abstract: Promptable segmentation has emerged as a powerful paradigm in computer vision, enabling users to guide models in parsing complex scenes with prompts such as clicks, boxes, or textual cues. Recent advances, exemplified by the Segment Anything Model (SAM), have extended this paradigm to videos and multi-view images. However, the lack of 3D awareness often leads to inconsistent results, necessitating costly per-scene optimization to enforce 3D consistency. In this work, we introduce MV-SAM, a framework for multi-view segmentation that achieves 3D consistency using pointmaps -- 3D points reconstructed from unposed images by recent visual geometry models. Leveraging the pixel-point one-to-one correspondence of pointmaps, MV-SAM lifts images and prompts into 3D space, eliminating the need for explicit 3D networks or annotated 3D data. Specifically, MV-SAM extends SAM by lifting image embeddings from its pretrained encoder into 3D point embeddings, which are decoded by a transformer using cross-attention with 3D prompt embeddings. This design aligns 2D interactions with 3D geometry, enabling the model to implicitly learn consistent masks across views through 3D positional embeddings. Trained on the SA-1B dataset, our method generalizes well across domains, outperforming SAM2-Video and achieving comparable performance with per-scene optimization baselines on NVOS, SPIn-NeRF, ScanNet++, uCo3D, and DL3DV benchmarks. Code will be released.
[201] VidLaDA: Bidirectional Diffusion Large Language Models for Efficient Video Understanding
Zhihao He,Tieyuan Chen,Kangyu Wang,Ziran Qin,Yang Shao,Chaofan Gan,Shijie Li,Zuxuan Wu,Weiyao Lin
Main category: cs.CV
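TL;DR: This paper proposes VidLaDA, a video large language model built on a diffusion language model that uses bidirectional attention to overcome the causal-masking bias of autoregressive models, and introduces the MARS-Cache framework to accelerate inference, achieving over 12x speedup without loss of reasoning accuracy.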
Details
Motivation: Standard autoregressive video LLMs suffer from causal-masking bias, which hinders modeling of global spatiotemporal relationships and leads to suboptimal understanding efficiency. Method: VidLaDA adopts the diffusion language modeling paradigm with bidirectional attention; MARS-Cache combines asynchronous visual cache refreshing with frame-wise chunk attention and preserves global connectivity through anchor tokens. Result: VidLaDA outperforms diffusion baselines and rivals advanced autoregressive models such as Qwen2.5-VL and LLaVA-Video; MARS-Cache delivers over 12x inference speedup without reducing reasoning accuracy. Conclusion: Bidirectional modeling and an efficient caching mechanism can substantially improve the global modeling ability and inference efficiency of video language models, offering a new paradigm for video understanding. Abstract: Standard Autoregressive Video LLMs inevitably suffer from causal masking biases that hinder global spatiotemporal modeling, leading to suboptimal understanding efficiency. We propose VidLaDA, a Video LLM based on Diffusion Language Model utilizing bidirectional attention to capture bidirectional dependencies. To further tackle the inference bottleneck of diffusion decoding on massive video tokens, we introduce MARS-Cache. This framework accelerates inference by combining asynchronous visual cache refreshing with frame-wise chunk attention, effectively pruning redundancy while preserving global connectivity via anchor tokens. Extensive experiments show VidLaDA outperforms diffusion baselines and rivals state-of-the-art autoregressive models (e.g., Qwen2.5-VL and LLaVA-Video), with MARS-Cache delivering over 12x speedup without compromising reasoning accuracy. Code and checkpoints are open-sourced at https://github.com/ziHoHe/VidLaDA.
[202] Quran-MD: A Fine-Grained Multilingual Multimodal Dataset of the Quran
Muhammad Umar Salman,Mohammad Areeb Qazi,Mohammed Talha Alam
Main category: cs.CV
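TL;DR: Quran MD is a comprehensive multimodal dataset of the Quran spanning textual, linguistic, and audio dimensions; it supports applications including natural language processing, speech recognition, text-to-speech, linguistic analysis, and digital Islamic studies.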
Details
Motivation: To support computational research on and practical applications of the Quran, particularly in light of its rich oral recitation tradition, the authors build a fine-grained multimodal dataset covering text, translation, transliteration, and audio from multiple reciters. Method: The dataset includes verse-level audio from 32 distinct reciters and word-level aligned audio, paired with the original Arabic text, English translation, and phonetic transliteration; all data are structured at the verse and word levels. Result: Quran MD is released, covering multimodal aligned data for the full Quran and supporting tasks such as ASR, tajweed detection, TTS, multimodal embeddings, semantic retrieval, style transfer, and personalized tutoring systems. Conclusion: Quran MD fills a gap in multimodal Quranic resources and provides an extensible, reusable data foundation for computational linguistics, speech technology, and Islamic digital humanities. Abstract: We present Quran MD, a comprehensive multimodal dataset of the Quran that integrates textual, linguistic, and audio dimensions at the verse and word levels. For each verse (ayah), the dataset provides its original Arabic text, English translation, and phonetic transliteration. To capture the rich oral tradition of Quranic recitation, we include verse-level audio from 32 distinct reciters, reflecting diverse recitation styles and dialectical nuances. At the word level, each token is paired with its corresponding Arabic script, English translation, transliteration, and an aligned audio recording, allowing fine-grained analysis of pronunciation, phonology, and semantic context. This dataset supports various applications, including natural language processing, speech recognition, text-to-speech synthesis, linguistic analysis, and digital Islamic studies. Bridging text and audio modalities across multiple reciters, this dataset provides a unique resource to advance computational approaches to Quranic recitation and study. Beyond enabling tasks such as ASR, tajweed detection, and Quranic TTS, it lays the foundation for multimodal embeddings, semantic retrieval, style transfer, and personalized tutoring systems that can support both research and community applications. The dataset is available at https://huggingface.co/datasets/Buraaq/quran-audio-text-dataset
[203] PEAfowl: Perception-Enhanced Multi-View Vision-Language-Action for Bimanual Manipulation
Qingyu Fan,Zhaoxiang Li,Yi Lu,Wang Chen,Qiu Shen,Xiao-xiao Long,Yinghao Cai,Tao Lu,Shuo Wang,Xun Cao
Main category: cs.CV
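TL;DR: This paper proposes PEAfowl, a perception-enhanced multi-view vision-language-action (VLA) policy for bimanual manipulation in cluttered scenes. By introducing depth-distribution-based 3D spatial reasoning and a Perceiver-style text-aware readout, it improves generalization under occlusion and under viewpoint and scene changes; training-time depth distillation strengthens geometric perception, yielding clear gains both in simulation and on real robots.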
Details
Motivation: Existing VLA models generalize poorly on bimanual manipulation in cluttered scenes, mainly because multi-view feature fusion lacks 3D consistency and language instructions are injected only as global conditioning, which yields coarse instruction grounding. Method: PEAfowl (1) predicts a depth distribution for each token, performs differentiable 3D lifting, and aggregates local cross-view neighbors to build geometrically grounded, cross-view-consistent representations; (2) replaces global language conditioning with a Perceiver-style text-aware readout over frozen CLIP visual features, enabling iterative evidence accumulation; (3) applies training-only depth distillation from a pretrained depth teacher to supervise the depth-distribution head, injecting geometric priors without adding inference overhead. Result: On the domain-randomized RoboTwin 2.0 benchmark, PEAfowl improves success rate by 23.0 percentage points over the strongest baseline; real-robot experiments confirm reliable sim-to-real transfer and consistent gains from depth distillation. Conclusion: Through geometry-aware multi-view representation learning and fine-grained instruction grounding, PEAfowl improves the robustness and generalization of bimanual manipulation policies in complex scenes, confirming the importance of perception enhancement for VLA models. Abstract: Bimanual manipulation in cluttered scenes requires policies that remain stable under occlusions, viewpoint and scene variations. Existing vision-language-action models often fail to generalize because (i) multi-view features are fused via view-agnostic token concatenation, yielding weak 3D-consistent spatial understanding, and (ii) language is injected as global conditioning, resulting in coarse instruction grounding. In this paper, we introduce PEAfowl, a perception-enhanced multi-view VLA policy for bimanual manipulation. For spatial reasoning, PEAfowl predicts per-token depth distributions, performs differentiable 3D lifting, and aggregates local cross-view neighbors to form geometrically grounded, cross-view consistent representations. For instruction grounding, we propose to replace global conditioning with a Perceiver-style text-aware readout over frozen CLIP visual features, enabling iterative evidence accumulation. To overcome noisy and incomplete commodity depth without adding inference overhead, we apply training-only depth distillation from a pretrained depth teacher to supervise the depth-distribution head, providing the perception front-end with geometry-aware priors. On RoboTwin 2.0 under the domain-randomized setting, PEAfowl improves the strongest baseline by 23.0 pp in success rate, and real-robot experiments further demonstrate reliable sim-to-real transfer and consistent improvements from depth distillation.
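The spatial-reasoning step in the PEAfowl abstract (per-token depth distributions followed by differentiable 3D lifting) can be illustrated with a minimal sketch. This is not the PEAfowl implementation: the depth bins, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the use of the expected depth, and all function names are assumptions introduced here for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_depth(depth_logits, depth_bins):
    """Expected depth under a categorical distribution over depth bins.

    Taking the expectation (rather than an argmax) keeps the result a
    smooth function of the logits, which is what makes lifting
    differentiable. Using the expectation is an assumption of this sketch.
    """
    probs = softmax(depth_logits)
    return sum(p * d for p, d in zip(probs, depth_bins))

def lift_token(u, v, depth_logits, depth_bins, fx, fy, cx, cy):
    """Unproject a token centred at pixel (u, v) to a 3D camera-frame point
    with a pinhole model (hypothetical intrinsics)."""
    z = expected_depth(depth_logits, depth_bins)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return (x, y, z)

# A token at the principal point with a uniform distribution over four bins.
point = lift_token(
    u=320.0, v=240.0,
    depth_logits=[0.0, 0.0, 0.0, 0.0],
    depth_bins=[0.5, 1.0, 1.5, 2.0],
    fx=500.0, fy=500.0, cx=320.0, cy=240.0,
)
print(point)
```

Points lifted this way from several camera views live in a shared 3D frame once camera extrinsics are applied, which is where a cross-view neighbor aggregation like the one the abstract describes could operate.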
Project website: https://peafowlvla.github.io/.
[204] Masked Depth Modeling for Spatial Perception
Bin Tan,Changjiang Sun,Xiage Qin,Hanat Adai,Zelin Fu,Tianxiang Zhou,Han Zhang,Yinghao Xu,Xing Zhu,Yujun Shen,Nan Xue
Main category: cs.CV
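TL;DR: This paper proposes LingBot-Depth, a model that refines depth maps through masked depth modeling and visual context, surpasses high-end RGB-D cameras in depth precision and pixel coverage, and provides a cross-modally aligned RGB-depth representation.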
Details
Motivation: Depth sensors are inherently inaccurate under challenging imaging conditions such as specular or texture-less surfaces; these errors can be viewed as "masked" signals that reflect geometric ambiguity. Method: LingBot-Depth is a depth completion model that combines masked depth modeling with depth refinement guided by visual context, supported by an automated data curation pipeline for large-scale training. Result: The model outperforms top-tier RGB-D cameras in depth precision and pixel coverage; downstream tasks indicate an aligned latent representation across RGB and depth modalities; the code, model weights, and 3M RGB-depth pairs (2M real and 1M simulated) are released. Conclusion: LingBot-Depth offers a new paradigm for overcoming hardware limitations and imaging challenges, pushing spatial visual perception toward greater robustness and scalability. Abstract: Spatial visual perception is a fundamental requirement in physical-world applications like autonomous driving and robotic manipulation, driven by the need to interact with 3D environments. Capturing pixel-aligned metric depth using RGB-D cameras would be the most viable way, yet it usually faces obstacles posed by hardware limitations and challenging imaging conditions, especially in the presence of specular or texture-less surfaces. In this work, we argue that the inaccuracies from depth sensors can be viewed as "masked" signals that inherently reflect underlying geometric ambiguities. Building on this motivation, we present LingBot-Depth, a depth completion model which leverages visual context to refine depth maps through masked depth modeling and incorporates an automated data curation pipeline for scalable training. It is encouraging to see that our model outperforms top-tier RGB-D cameras in terms of both depth precision and pixel coverage. Experimental results on a range of downstream tasks further suggest that LingBot-Depth offers an aligned latent representation across RGB and depth modalities. We release the code, checkpoint, and 3M RGB-depth pairs (including 2M real data and 1M simulated data) to the community of spatial perception.
[205] Revisiting 3D Reconstruction Kernels as Low-Pass Filters
Shengjun Zhang,Min Chen,Yibo Wei,Mingyu Dong,Yueqi Duan
Main category: cs.CV
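TL;DR: This paper revisits 3D reconstruction from a signal-processing perspective and identifies the periodic spectral extension caused by discrete sampling as the fundamental challenge; it proposes the Jinc kernel, which corresponds to an ideal low-pass filter, and modulated kernels that improve on it, raising frequency-domain fidelity while retaining spatial efficiency and improving rendering performance.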
Details
Motivation: Discrete sampling causes periodic spectral extension, and existing 3D reconstruction kernels (such as Gaussians, exponential functions, and Student's t distributions) act as non-ideal low-pass filters, so high- and low-frequency components overlap and reconstruction quality suffers. Method: Introduce the Jinc kernel, which has an ideal cutoff, as an ideal low-pass filter; to address its slow spatial decay, design modulated kernels that balance spatial efficiency and frequency-domain fidelity. Result: The proposed Jinc and modulated kernels show better rendering performance than existing methods in experiments. Conclusion: From a signal-processing perspective, using ideal or near-ideal low-pass kernels can mitigate spectral aliasing and improve 3D reconstruction quality. Abstract: 3D reconstruction aims to recover 3D signals from sampled discrete 2D pixels, with the goal of converging to continuous 3D spaces. In this paper, we revisit 3D reconstruction from the perspective of signal processing, identifying the periodic spectral extension induced by discrete sampling as the fundamental challenge. Previous 3D reconstruction kernels, such as Gaussians, Exponential functions, and Student's t distributions, serve as low-pass filters to isolate the baseband spectrum. However, their non-ideal low-pass property results in the overlap of high-frequency components with low-frequency components in the discrete-time signal's spectrum. To this end, we introduce the Jinc kernel, whose magnitude drops instantaneously to zero exactly at the cutoff frequency, which corresponds to an ideal low-pass filter. As the Jinc kernel suffers from slow decay in the spatial domain, we further propose modulated kernels to strike an effective balance, achieving superior rendering performance by reconciling spatial efficiency and frequency-domain fidelity. Experimental results demonstrate the effectiveness of our Jinc and modulated kernels.
[206] Feature-Space Generative Models for One-Shot Class-Incremental Learning
Jack Foster,Kirill Paramonov,Mete Ozay,Umberto Michieli
Main category: cs.CV
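TL;DR: This paper proposes Gen1S, which maps the embedding space into a residual space and uses generative models (VAE or diffusion) to learn the multi-modal distribution of base-class residuals as a structural prior for one-shot novel-class recognition, outperforming prior state-of-the-art methods across multiple benchmarks and backbones.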
Details
Motivation: Address the challenging few-shot class-incremental learning (FSCIL) setting in which novel classes must be recognized from a single sample (1-shot) with no further training or model modification allowed. Method: Map the original embedding space to a residual space (by subtracting the class prototype), then model the multi-modal distribution of base-class residuals with a VAE or diffusion model and use it as a structural prior to aid novel-class recognition. Result: Gen1S consistently outperforms current state-of-the-art methods across multiple benchmark datasets and backbones, improving novel-class recognition accuracy. Conclusion: Base and novel class embeddings share structural similarity; modeling the residual distribution with a generative model provides a strong structural prior that improves one-shot generalization to novel classes. Abstract: Few-shot class-incremental learning (FSCIL) is a paradigm where a model, initially trained on a dataset of base classes, must adapt to an expanding problem space by recognizing novel classes with limited data. We focus on the challenging FSCIL setup where a model receives only a single sample (1-shot) for each novel class and no further training or model alterations are allowed after the base training phase. This makes generalization to novel classes particularly difficult. We propose a novel approach predicated on the hypothesis that base and novel class embeddings have structural similarity. We map the original embedding space into a residual space by subtracting the class prototype (i.e., the average class embedding) of input samples. Then, we leverage generative modeling with VAE or diffusion models to learn the multi-modal distribution of residuals over the base classes, and we use this as a valuable structural prior to improve recognition of novel classes. Our approach, Gen1S, consistently improves novel class recognition over the state of the art across multiple benchmarks and backbone architectures.
[207] Benchmarking Direct Preference Optimization for Medical Large Vision-Language Models
Dain Kim,Jiwoo Lee,Jaehoon Yun,Yong Hoe Koo,Qingyu Chen,Hyunjae Kim,Jaewoo Kang
Main category: cs.CV
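TL;DR: This paper presents the first systematic evaluation of multiple Direct Preference Optimization (DPO) methods on medical large vision-language models (LVLMs), finding that their gains are inconsistent and that they struggle to correct visual misinterpretation; it proposes a targeted preference construction strategy that improves visual question-answering performance, and releases all resources.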
Details
Motivation: Large vision-language models (LVLMs) are limited in medical settings by insufficient alignment and reliability; although DPO is widely used to refine responses, empirical study in high-stakes medical domains is lacking, calling for systematic evaluation and improvement. Method: Evaluate nine DPO variants on two medical LVLMs, LLaVA-Med and HuatuoGPT-Vision, across tasks and backbones; based on the findings, design a preference construction strategy focused on correcting visual misinterpretation. Result: Current DPO methods yield inconsistent gains on medical tasks, often fail to surpass supervised fine-tuning, and struggle to resolve visual misinterpretation; the proposed strategy improves on the strongest DPO baseline by 3.6% on visual question answering. Conclusion: DPO cannot simply be applied to medical LVLMs with general-purpose recipes; preference construction needs to be tailored to domain-specific problems such as visual misinterpretation. This work provides a benchmark, insights, and open-source resources for future research on aligning medical VLMs. Abstract: Large Vision-Language Models (LVLMs) hold significant promise for medical applications, yet their deployment is often constrained by insufficient alignment and reliability. While Direct Preference Optimization (DPO) has emerged as a potent framework for refining model responses, its efficacy in high-stakes medical contexts remains underexplored, lacking the rigorous empirical groundwork necessary to guide future methodological advances. To bridge this gap, we present the first comprehensive examination of diverse DPO variants within the medical domain, evaluating nine distinct formulations across two medical LVLMs: LLaVA-Med and HuatuoGPT-Vision. Our results reveal several critical limitations: current DPO approaches often yield inconsistent gains over supervised fine-tuning, with their efficacy varying significantly across different tasks and backbones. Furthermore, they frequently fail to resolve fundamental visual misinterpretation errors. Building on these insights, we present a targeted preference construction strategy as a proof-of-concept that explicitly addresses visual misinterpretation errors frequently observed in existing DPO models. This design yields a 3.6% improvement over the strongest existing DPO baseline on visual question-answering tasks. To support future research, we release our complete framework, including all training data, model checkpoints, and our codebase at https://github.com/dmis-lab/med-vlm-dpo.
[208] RemEdit: Efficient Diffusion Editing with Riemannian Geometry
Eashan Adhikarla,Brian D. Davison
Main category: cs.CV
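TL;DR: RemEdit is a diffusion-based framework for controllable image generation that improves the semantic fidelity of edits by modeling the latent space as a Riemannian manifold, combined with a Mamba module, dual-SLERP blending, and vision-language-model prompt enrichment; it also introduces task-specific attention pruning to accelerate inference, maintaining real-time performance and high-quality edits at high pruning rates.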
Details
Motivation: Address the key trade-off between semantic fidelity and inference speed in controllable image generation. Method: (1) Model the latent space as a Riemannian manifold and learn its structure with a Mamba module to compute geodesic paths; (2) use dual-SLERP blending and goal-aware prompt enrichment driven by a vision-language model to refine edit control; (3) design a lightweight task-specific attention pruning head that dynamically retains the tokens essential to the edit. Result: Surpasses prior state-of-the-art image editing methods while maintaining real-time performance (under 50% pruning). Conclusion: RemEdit sets a new benchmark for practical and powerful image editing. Abstract: Controllable image generation is fundamental to the success of modern generative AI, yet it faces a critical trade-off between semantic fidelity and inference speed. The RemEdit diffusion-based framework addresses this trade-off with two synergistic innovations. First, for editing fidelity, we navigate the latent space as a Riemannian manifold. A Mamba-based module efficiently learns the manifold's structure, enabling direct and accurate geodesic path computation for smooth semantic edits. This control is further refined by a dual-SLERP blending technique and a goal-aware prompt enrichment pass from a Vision-Language Model. Second, for additional acceleration, we introduce a novel task-specific attention pruning mechanism. A lightweight pruning head learns to retain tokens essential to the edit, enabling effective optimization without the semantic degradation common in content-agnostic approaches. RemEdit surpasses prior state-of-the-art editing frameworks while maintaining real-time performance under 50% pruning. Consequently, RemEdit establishes a new benchmark for practical and powerful image editing. Source code: https://www.github.com/eashanadhikarla/RemEdit.
[209] From Specialist to Generalist: Unlocking SAM's Learning Potential on Unlabeled Medical Images
Vi Vu,Thanh-Huy Nguyen,Tien-Thinh Nguyen,Ba-Thinh Lam,Hoang-Thien Nguyen,Tianyang Wang,Xingjian Li,Min Xu
Main category: cs.CV
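TL;DR: This paper proposes SC-SAM, a framework in which U-Net (the specialist) and SAM (the generalist) are co-trained bidirectionally to make effective use of unlabeled data in semi-supervised medical image segmentation, substantially improving performance.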
Details
Motivation: Foundation models such as SAM are hard to adapt to medical images because of domain shift, scarce annotations, and the inability of PEFT to exploit unlabeled data; conventional models such as U-Net perform well in semi-supervised medical learning, but their potential to assist a PEFT SAM has been overlooked. Method: In SC-SAM, U-Net generates point prompts and pseudo-labels to guide SAM's fine-tuning, while SAM acts as a strongly generalizing supervisor that regularizes U-Net, forming a bidirectional co-training loop. Result: State-of-the-art results on prostate MRI and polyp segmentation benchmarks, surpassing existing semi-supervised SAM variants and medical foundation models such as MedSAM. Conclusion: Cooperation between the specialist (U-Net) and the generalist (SAM) improves label efficiency, offering a new paradigm for medical image segmentation. Abstract: Foundation models like the Segment Anything Model (SAM) show strong generalization, yet adapting them to medical images remains difficult due to domain shift, scarce labels, and the inability of Parameter-Efficient Fine-Tuning (PEFT) to exploit unlabeled data. While conventional models like U-Net excel in semi-supervised medical learning, their potential to assist a PEFT SAM has been largely overlooked. We introduce SC-SAM, a specialist-generalist framework where U-Net provides point-based prompts and pseudo-labels to guide SAM's adaptation, while SAM serves as a powerful generalist supervisor to regularize U-Net. This reciprocal guidance forms a bidirectional co-training loop that allows both models to effectively exploit the unlabeled data. Across prostate MRI and polyp segmentation benchmarks, our method achieves state-of-the-art results, outperforming other existing semi-supervised SAM variants and even medical foundation models like MedSAM, highlighting the value of specialist-generalist cooperation for label-efficient medical image segmentation. Our code is available at https://github.com/vnlvi2k3/SC-SAM.
[210] DTC: A Deformable Transposed Convolution Module for Medical Image Segmentation
Chengkun Sun,Jinqian Pan,Renjie Liang,Zhengkang Fan,Xin Miao,Jiang Bian,Jie Xu
Main category: cs.CV
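TL;DR: This paper proposes Deformable Transposed Convolution (DTC), an upsampling method for medical image segmentation that learns dynamic sampling positions to improve feature reconstruction and detail recovery.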
Details
Motivation: Conventional upsampling methods (such as transposed convolution and linear interpolation) sample at fixed positions, which makes it hard to capture structural information and can introduce artifacts or lose detail. Method: Inspired by deformable convolution, DTC lets the model learn dynamic sampling coordinates and applies to both 2D and 3D medical image segmentation. Result: On datasets including BTCV15 (3D) and ISIC18 and BUSI (2D), DTC integrates effectively into existing segmentation models and consistently improves the decoder's feature reconstruction and detail recovery. Conclusion: DTC is a more flexible and expressive upsampling method that shows advantages and generality across medical image segmentation tasks. Abstract: In medical image segmentation, particularly in UNet-like architectures, upsampling is primarily used to transform smaller feature maps into larger ones, enabling feature fusion between encoder and decoder features and supporting multi-scale prediction. Conventional upsampling methods, such as transposed convolution and linear interpolation, operate on fixed positions: transposed convolution applies kernel elements to predetermined pixel or voxel locations, while linear interpolation assigns values based on fixed coordinates in the original feature map. These fixed-position approaches may fail to capture structural information beyond predefined sampling positions and can lead to artifacts or loss of detail. Inspired by deformable convolutions, we propose a novel upsampling method, Deformable Transposed Convolution (DTC), which learns dynamic coordinates (i.e., sampling positions) to generate high-resolution feature maps for both 2D and 3D medical image segmentation tasks. Experiments on 3D (e.g., BTCV15) and 2D datasets (e.g., ISIC18, BUSI) demonstrate that DTC can be effectively integrated into existing medical image segmentation models, consistently improving the decoder's feature reconstruction and detail recovery capability.
[211] FlowMorph: Physics-Consistent Self-Supervision for Label-Free Single-Cell Mechanics in Microfluidic Videos
Bora Yimenicioglu,Vishal Manikanden
Main category: cs.CV
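TL;DR: This paper proposes FlowMorph, a physics-consistent self-supervised framework that learns a label-free mechanics proxy parameter k for red blood cells (RBCs) from microfluidic videos, enabling high-throughput, high-accuracy quantification of RBC deformability.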
Details
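Motivation: The mechanical properties of red blood cells are important biomarkers for hematologic and systemic disease, but existing microfluidic analysis pipelines rely on supervised segmentation or hand-crafted kymographs and do not explicitly model laminar Stokes-flow physics, which limits generalization and interpretability. Method: FlowMorph models each RBC with a low-dimensional parametric contour, evolves boundary points through a differentiable "capsule-in-flow" model (combining laminar advection with curvature-regularized elastic relaxation), and jointly optimizes physics-driven losses on silhouette overlap, intra-cellular flow agreement, area conservation, wall constraints, and temporal smoothness, using only automatically extracted silhouettes and optical flow. Result: Across four public datasets, FlowMorph reaches a mean silhouette IoU of 0.905; the scalar k alone separates tank-treading from flipping dynamics with an AUC of 0.863; calibrated with only 200 RT-DC events, a monotone map from k to apparent Young's modulus E predicts with an MAE of 0.118 MPa and degrades gracefully under shifts in channel geometry, optics, and frame rate. Conclusion: FlowMorph delivers physics-guided, annotation-free, robust single-cell mechanical characterization of RBCs, offering a new paradigm for label-free microfluidic mechanical diagnostics. Abstract: Mechanical properties of red blood cells (RBCs) are promising biomarkers for hematologic and systemic disease, motivating microfluidic assays that probe deformability at throughputs of $10^3$--$10^6$ cells per experiment. However, existing pipelines rely on supervised segmentation or hand-crafted kymographs and rarely encode the laminar Stokes-flow physics that governs RBC shape evolution. We introduce FlowMorph, a physics-consistent self-supervised framework that learns a label-free scalar mechanics proxy $k$ for each tracked RBC from short brightfield microfluidic videos. FlowMorph models each cell by a low-dimensional parametric contour, advances boundary points through a differentiable "capsule-in-flow" model combining laminar advection and curvature-regularized elastic relaxation, and optimizes a loss coupling silhouette overlap, intra-cellular flow agreement, area conservation, wall constraints, and temporal smoothness, using only automatically derived silhouettes and optical flow. Across four public RBC microfluidic datasets, FlowMorph achieves a mean silhouette IoU of $0.905$ on physics-rich videos with provided velocity fields and markedly improves area conservation and wall violations over purely data-driven baselines. On $\sim 1.5\times 10^5$ centered sequences, the scalar $k$ alone separates tank-treading from flipping dynamics with an AUC of $0.863$.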
Using only $200$ real-time deformability cytometry (RT-DC) events for calibration, a monotone map $E=g(k)$ predicts apparent Young's modulus with a mean absolute error of $0.118$ MPa on $600$ held-out cells and degrades gracefully under shifts in channel geometry, optics, and frame rate.
[212] UPLiFT: Efficient Pixel-Dense Feature Upsampling with Local Attenders
Matthew Walmer,Saksham Suri,Anirud Aggarwal,Abhinav Shrivastava
Main category: cs.CV
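TL;DR: This paper proposes the UPLiFT architecture and the Local Attender operator, which improve iterative feature upsampling to reach state-of-the-art pixel-dense feature upsampling at low inference cost, with strong results on generative tasks as well.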
Details
Motivation: Existing cross-attention-based feature upsampling methods scale poorly in efficiency, while earlier iterative methods have been neglected; this work revisits iterative methods to balance performance and efficiency. Method: Propose UPLiFT, a universal lightweight pixel-dense feature transform architecture, together with the Local Attender operator, which uses an attentional pooling formulation defined fully locally to avoid global computation cost. Result: UPLiFT achieves state-of-the-art performance on multiple benchmarks at lower inference cost than existing pixel-dense upsamplers; on generative tasks such as VAE feature upsampling it is competitive with state-of-the-art Coupled Flow Matching models. Conclusion: Iterative feature upsampling remains competitive; UPLiFT delivers strong performance and versatility at lower cost, offering a new paradigm for efficiently densifying the outputs of pretrained visual backbones. Abstract: The space of task-agnostic feature upsampling has emerged as a promising area of research to efficiently create denser features from pre-trained visual backbones. These methods act as a shortcut to achieve dense features for a fraction of the cost by learning to map low-resolution features to high-resolution versions. While early works in this space used iterative upsampling approaches, more recent works have switched to cross-attention-based methods, which risk falling into the same efficiency scaling problems of the backbones they are upsampling. In this work, we demonstrate that iterative upsampling methods can still compete with cross-attention-based methods; moreover, they can achieve state-of-the-art performance with lower inference costs. We propose UPLiFT, an architecture for Universal Pixel-dense Lightweight Feature Transforms. We also propose an efficient Local Attender operator to overcome the limitations of prior iterative feature upsampling methods. This operator uses an alternative attentional pooling formulation defined fully locally. We show that our Local Attender allows UPLiFT to maintain stable features throughout upsampling, enabling state-of-the-art performance with lower inference costs than existing pixel-dense feature upsamplers. In addition, we apply UPLiFT to generative downstream tasks and show that it achieves competitive performance with state-of-the-art Coupled Flow Matching models for VAE feature upsampling. Altogether, UPLiFT offers a versatile and efficient approach to creating denser features.
[213] Domain-Expert-Guided Hybrid Mixture-of-Experts for Medical AI: Integrating Data-Driven Learning with Clinical Priors
Jinchen Gu,Nan Zhao,Lei Qiu,Lu Zhang
Main category: cs.CV
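TL;DR: This paper proposes DKGH-MoE, a hybrid mixture-of-experts model that combines data-driven experts with experts guided by clinical knowledge (such as physician gaze patterns) to improve the performance and interpretability of medical image analysis.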
Details
Motivation: MoE models are limited in small-data domains such as medicine, while clinical practice holds rich expert knowledge (such as physician gaze patterns and diagnostic heuristics) that has not been effectively exploited. Method: Propose Domain-Knowledge-Guided Hybrid MoE (DKGH-MoE) with two parallel MoE branches: one learns data-driven features from raw images, and the other builds experts from clinical priors such as clinician eye-gaze cues; the two are fused to complement each other. Result: DKGH-MoE improves model performance and clinical interpretability, and is plug-and-play. Conclusion: Fusing domain knowledge with data-driven learning is an effective route to more robust and trustworthy medical AI, and DKGH-MoE offers a transferable modular solution in this direction. Abstract: Mixture-of-Experts (MoE) models increase representational capacity with modest computational cost, but their effectiveness in specialized domains such as medicine is limited by small datasets. In contrast, clinical practice offers rich expert knowledge, such as physician gaze patterns and diagnostic heuristics, that models cannot reliably learn from limited data. Combining data-driven experts, which capture novel patterns, with domain-expert-guided experts, which encode accumulated clinical insights, provides complementary strengths for robust and clinically meaningful learning. To this end, we propose Domain-Knowledge-Guided Hybrid MoE (DKGH-MoE), a plug-and-play and interpretable module that unifies data-driven learning with domain expertise. DKGH-MoE integrates a data-driven MoE, which extracts novel features from raw imaging data, with a domain-expert-guided MoE, which incorporates clinical priors, specifically clinician eye-gaze cues, to emphasize regions of high diagnostic relevance. By integrating domain expert insights with data-driven features, DKGH-MoE improves both performance and interpretability.
[214] MorphXAI: An Explainable Framework for Morphological Analysis of Parasites in Blood Smear Images
Aqsa Yousaf,Sint Sint Win,Megan Coffee,Habeeb Olufowobi
Main category: cs.CV
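TL;DR: This paper proposes MorphXAI, an explainable deep learning framework that combines parasite detection with fine-grained morphological analysis, uses morphological supervision to improve interpretability, and builds a clinician-annotated dataset covering three parasite species.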
Details
Motivation: Existing deep learning models for parasite detection lack interpretability, and conventional visual explanation methods do not reflect the morphological features clinicians rely on. Method: MorphXAI introduces morphological supervision into detection, jointly predicting parasite location and clinically relevant attributes such as shape, curvature, visible dot count, flagellum presence, and developmental stage; a clinician-annotated morphological dataset covering Leishmania, T. brucei, and T. cruzi is built. Result: MorphXAI outperforms the baseline in detection and provides structured, biologically meaningful explanations. Conclusion: MorphXAI bridges the gap between detection accuracy and clinical interpretability, providing a more reliable and trustworthy AI aid for diagnosing parasitic diseases in resource-limited settings. Abstract: Parasitic infections remain a pressing global health challenge, particularly in low-resource settings where diagnosis still depends on labor-intensive manual inspection of blood smears and the availability of expert domain knowledge. While deep learning models have shown strong performance in automating parasite detection, their clinical usefulness is constrained by limited interpretability. Existing explainability methods are largely restricted to visual heatmaps or attention maps, which highlight regions of interest but fail to capture the morphological traits that clinicians rely on for diagnosis. In this work, we present MorphXAI, an explainable framework that unifies parasite detection with fine-grained morphological analysis. MorphXAI integrates morphological supervision directly into the prediction pipeline, enabling the model to localize parasites while simultaneously characterizing clinically relevant attributes such as shape, curvature, visible dot count, flagellum presence, and developmental stage. To support this task, we curate a clinician-annotated dataset of three parasite species (Leishmania, Trypanosoma brucei, and Trypanosoma cruzi) with detailed morphological labels, establishing a new benchmark for interpretable parasite analysis. Experimental results show that MorphXAI not only improves detection performance over the baseline but also provides structured, biologically meaningful explanations.
[215] Strip-Fusion: Spatiotemporal Fusion for Multispectral Pedestrian Detection
Asiegbu Miracle Kanu-Asiegbu,Nitin Jotwani,Xiaoxiao Du
Main category: cs.CV
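TL;DR: This paper proposes Strip-Fusion, a robust spatial-temporal fusion network for multispectral pedestrian detection that handles image misalignment, lighting variation, and heavy occlusion.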
Details
Motivation: Existing methods focus on spatial fusion and neglect temporal information, and RGB and thermal images in multispectral datasets are often misaligned; pedestrians are also hard to detect under varying lighting and occlusion. Method: Strip-Fusion includes temporally adaptive convolutions that dynamically weight spatial-temporal features, a KL-divergence loss that mitigates the imbalance between the visible and thermal modalities, and a new post-processing algorithm that suppresses false positives. Result: Competitive performance on the KAIST and CVC-14 benchmarks, with significant improvements over the previous state of the art under challenging conditions such as heavy occlusion and misalignment. Conclusion: By jointly modeling spatial-temporal information and modality complementarity, Strip-Fusion improves the robustness and accuracy of multispectral pedestrian detection, providing an effective solution for robot perception in complex environments. Abstract: Pedestrian detection is a critical task in robot perception. Multispectral modalities (visible light and thermal) can boost pedestrian detection performance by providing complementary visual information. Several gaps remain with multispectral pedestrian detection methods. First, existing approaches primarily focus on spatial fusion and often neglect temporal information. Second, RGB and thermal image pairs in multispectral benchmarks may not always be perfectly aligned. Pedestrians are also challenging to detect due to varying lighting conditions, occlusion, etc. This work proposes Strip-Fusion, a spatial-temporal fusion network that is robust to misalignment in input images, as well as varying lighting conditions and heavy occlusions. The Strip-Fusion pipeline integrates temporally adaptive convolutions to dynamically weigh spatial-temporal features, enabling our model to better capture pedestrian motion and context over time. A novel Kullback-Leibler divergence loss was designed to mitigate modality imbalance between visible and thermal inputs, guiding feature alignment toward the more informative modality during training. Furthermore, a novel post-processing algorithm was developed to reduce false positives. Extensive experimental results show that our method performs competitively for both the KAIST and the CVC-14 benchmarks. We also observed significant improvements compared to previous state-of-the-art on challenging conditions such as heavy occlusion and misalignment.
[216] Leveraging Persistence Image to Enhance Robustness and Performance in Curvilinear Structure Segmentation
Zhuangzhi Gao,Feixiang Zhou,He Zhao,Xiuju Chen,Xiaoxin Li,Qinkai Yu,Yitian Zhao,Alena Shantsila,Gregory Y. H. Lip,Eduard Shantsila,Yalin Zheng
Main category: cs.CV
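TL;DR: This paper proposes the PIs-Regressor module and the Topology SegNet framework, which learn differentiable persistence image (PI) representations directly from data and embed them in the network architecture to improve the accuracy and topological fidelity of curvilinear structure segmentation in medical images, avoiding reliance on handcrafted topological loss functions.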
Details
Motivation: Existing methods struggle to extract and embed topological features such as persistence diagrams (PD) because these are non-differentiable and computationally expensive; mainstream approaches also rely on handcrafted loss functions that generalize poorly. Method: Propose the PIs-Regressor module, which learns persistence image (PI) representations end-to-end from data, and build Topology SegNet, which fuses topological features in the downsampling and upsampling stages of an encoder-decoder, building topology into the architecture rather than adding it as an auxiliary loss. Result: State-of-the-art pixel-level accuracy and topological fidelity on three curvilinear structure segmentation benchmarks, with markedly better robustness to medical image degradations such as overexposure and blurring. Conclusion: Embedding a differentiable topological representation (PI) directly in the segmentation network is a more robust and general strategy than handcrafted topological losses, and it extends well, combining with other topology-based methods to further improve performance. Abstract: Segmenting curvilinear structures in medical images is essential for analyzing morphological patterns in clinical applications. Integrating topological properties, such as connectivity, improves segmentation accuracy and consistency. However, extracting and embedding such properties - especially from Persistence Diagrams (PD) - is challenging due to their non-differentiability and computational cost. Existing approaches mostly encode topology through handcrafted loss functions, which generalize poorly across tasks. In this paper, we propose PIs-Regressor, a simple yet effective module that learns persistence images (PIs), which are finite, differentiable representations of topological features, directly from data. Together with Topology SegNet, which fuses these features in both downsampling and upsampling stages, our framework integrates topology into the network architecture itself rather than auxiliary losses. Unlike existing methods that depend heavily on handcrafted loss functions, our approach directly incorporates topological information into the network structure, leading to more robust segmentation. Our design is flexible and can be seamlessly combined with other topology-based methods to further enhance segmentation performance. Experimental results show that integrating topological features enhances model robustness, effectively handling challenges like overexposure and blurring in medical imaging.
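To make the notion of a persistence image concrete, the sketch below rasterizes a persistence diagram with the standard construction: each (birth, death) pair is mapped to (birth, persistence), weighted by its persistence, and spread over a fixed grid with a Gaussian kernel. It only illustrates the kind of representation a module like PIs-Regressor would learn to predict; it is not the paper's method, and the grid resolution, `sigma`, `extent`, and the linear weighting are assumptions chosen for illustration.

```python
import math

def persistence_image(diagram, resolution=4, sigma=0.1, extent=1.0):
    """Rasterize a persistence diagram [(birth, death), ...] into a
    resolution x resolution grid over [0, extent] x [0, extent].

    Rows index persistence (death - birth); columns index birth.
    """
    step = extent / resolution
    centers = [(i + 0.5) * step for i in range(resolution)]
    image = [[0.0] * resolution for _ in range(resolution)]
    for birth, death in diagram:
        pers = death - birth
        if pers <= 0:
            continue  # points on the diagonal carry no topological signal
        for r, py in enumerate(centers):
            for c, bx in enumerate(centers):
                d2 = (bx - birth) ** 2 + (py - pers) ** 2
                # Linear persistence weighting: long-lived features count more.
                image[r][c] += pers * math.exp(-d2 / (2.0 * sigma ** 2))
    return image

# One feature born at 0.125 that dies at 0.5 (persistence 0.375).
pi = persistence_image([(0.125, 0.5)])
for row in pi:
    print([round(v, 4) for v in row])
```

Because the output is a fixed-size array that varies smoothly with the diagram points, it can serve as a regression target or a feature map, which is the property the abstract relies on when it calls PIs finite and differentiable.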
On three curvilinear benchmarks, our approach demonstrates state-of-the-art performance in both pixel-level accuracy and topological fidelity.
[217] Semi-Supervised Hyperspectral Image Classification with Edge-Aware Superpixel Label Propagation and Adaptive Pseudo-Labeling
Yunfei Qiu,Qiqiong Ma,Tianhua Lv,Li Fang,Shudong Zhou,Wei Yao
Main category: cs.CV
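TL;DR: This paper proposes a semi-supervised hyperspectral image classification framework that combines spatial priors with a dynamic learning mechanism. It comprises three core modules, EASLP, DHP, and ATSC (the latter two forming DREPL), which mitigate boundary label diffusion and pseudo-label instability and perform strongly on multiple benchmark datasets.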
Details
Motivation: Semi-supervised hyperspectral image classification faces high annotation costs and few samples, which lead to challenges such as boundary label diffusion and pseudo-label instability. Method: Propose a framework with three components: (1) the Edge-Aware Superpixel Label Propagation (EASLP) module, which combines an edge-intensity penalty with neighborhood correction; (2) the Dynamic History-Fused Prediction (DHP) method, which fuses historical and current predictions with dynamic weights; (3) the Adaptive Tripartite Sample Categorization (ATSC) strategy, which makes hierarchical use of easy, ambiguous, and hard samples. DHP and ATSC together form the Dynamic Reliability-Enhanced Pseudo-Label Framework (DREPL). Result: Experiments on four benchmark datasets show improvements in classification accuracy, pseudo-label stability, and spatio-temporal consistency. Conclusion: By combining EASLP with DREPL, the proposed framework optimizes consistency across the spatial and temporal dimensions, improving the robustness and efficiency of semi-supervised hyperspectral classification. Abstract: Significant progress has been made in semi-supervised hyperspectral image (HSI) classification regarding feature extraction and classification performance. However, due to high annotation costs and limited sample availability, semi-supervised learning still faces challenges such as boundary label diffusion and pseudo-label instability. To address these issues, this paper proposes a novel semi-supervised hyperspectral classification framework integrating spatial prior information with a dynamic learning mechanism. First, we design an Edge-Aware Superpixel Label Propagation (EASLP) module. By integrating an edge intensity penalty with a neighborhood correction strategy, it mitigates label diffusion from superpixel segmentation while enhancing classification robustness in boundary regions. Second, we introduce a Dynamic History-Fused Prediction (DHP) method. By maintaining historical predictions and dynamically weighting them with current results, DHP smooths pseudo-label fluctuations and improves temporal consistency and noise resistance. Concurrently, incorporating confidence and consistency measures, the Adaptive Tripartite Sample Categorization (ATSC) strategy implements hierarchical utilization of easy, ambiguous, and hard samples, leading to enhanced pseudo-label quality and learning efficiency. The Dynamic Reliability-Enhanced Pseudo-Label Framework (DREPL), composed of DHP and ATSC, strengthens pseudo-label stability across temporal and sample domains. Through synergistic operation with EASLP, it achieves spatio-temporal consistency optimization.
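The DHP and ATSC ideas described in the abstract can be sketched in a few lines. This is a rough illustration under stated assumptions, not the authors' algorithm: the abstract does not specify how the history weight is chosen or where the category thresholds lie, so the fixed `history_weight`, the `high` and `low` thresholds, and both function names are hypothetical.

```python
def fuse_predictions(history, current, history_weight=0.7):
    """Blend a stored class-probability vector with the current prediction.

    A fixed weight is used here for simplicity; the paper describes the
    weighting as dynamic, which this sketch does not attempt to reproduce.
    """
    fused = [history_weight * h + (1.0 - history_weight) * c
             for h, c in zip(history, current)]
    total = sum(fused)
    return [f / total for f in fused]

def categorize_sample(fused, current, high=0.9, low=0.6):
    """Assign a pixel to 'easy', 'ambiguous', or 'hard'.

    Confidence is the top fused probability; consistency means the fused
    and current predictions agree on the top class. Thresholds are
    hypothetical.
    """
    confidence = max(fused)
    consistent = fused.index(confidence) == current.index(max(current))
    if confidence >= high and consistent:
        return "easy"
    if confidence < low or not consistent:
        return "hard"
    return "ambiguous"

history = [0.6, 0.3, 0.1]
current = [0.8, 0.1, 0.1]
fused = fuse_predictions(history, current)
print(fused, categorize_sample(fused, current))
```

In a training loop, the fused vector would replace the stored history after each round, so a single noisy prediction shifts the pseudo-label only slightly; the category could then decide whether a sample is used directly, down-weighted, or held back.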
Evaluations on four benchmark datasets demonstrate its capability to maintain superior classification performance.
[218] Cross-Domain Transfer with Self-Supervised Spectral-Spatial Modeling for Hyperspectral Image Classification
Jianshu Chao,Tianhua Lv,Qiqiong Ma,Yunfei Qiu,Li Fang,Huifang Shen,Wei Yao
Main category: cs.CV
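TL;DR: This paper proposes a self-supervised cross-domain transfer framework that needs no source-domain labels. It learns joint spectral-spatial representations with a Spatial-Spectral Transformer (S2Former) and a Frequency Domain Constraint (FDC), and adds a Diffusion-Aligned Fine-tuning (DAFT) mechanism at the fine-tuning stage, significantly improving few-shot cross-domain hyperspectral image classification.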
Details
Motivation: Existing self-supervised hyperspectral methods depend on source-domain annotations and are sensitive to distribution shift, so they generalize poorly across domains. Method: A self-supervised cross-domain transfer framework: the pre-training stage uses a Spatial-Spectral Transformer (dual-branch, with bidirectional cross-attention) and a Frequency Domain Constraint (FDC); the fine-tuning stage introduces a Diffusion-Aligned Fine-tuning (DAFT) distillation mechanism. Result: Experiments on four hyperspectral datasets show stable classification performance and strong adaptability in cross-domain settings, particularly under resource constraints such as few samples and no source labels. Conclusion: The framework removes the dependence on source-domain labels, reduces sensitivity to distribution shift, and learns high-quality, transferable joint spectral-spatial representations, offering a new paradigm for cross-domain hyperspectral tasks. Abstract: Self-supervised learning has demonstrated considerable potential in hyperspectral representation, yet its application in cross-domain transfer scenarios remains under-explored. Existing methods still rely on source domain annotations and are susceptible to distribution shifts, leading to degraded generalization performance in the target domain. To address this, this paper proposes a self-supervised cross-domain transfer framework that learns transferable spectral-spatial joint representations without source labels and achieves efficient adaptation under few samples in the target domain. During the self-supervised pre-training phase, a Spatial-Spectral Transformer (S2Former) module is designed. It adopts a dual-branch spatial-spectral transformer and introduces a bidirectional cross-attention mechanism to achieve spectral-spatial collaborative modeling: the spatial branch enhances structural awareness through random masking, while the spectral branch captures fine-grained differences. Both branches mutually guide each other to improve semantic consistency. We further propose a Frequency Domain Constraint (FDC) to maintain frequency-domain consistency through real Fast Fourier Transform (rFFT) and high-frequency magnitude loss, thereby enhancing the model's capability to discern fine details and boundaries. During the fine-tuning phase, we introduce a Diffusion-Aligned Fine-tuning (DAFT) distillation mechanism. This aligns semantic evolution trajectories through a teacher-student structure, enabling robust transfer learning under low-label conditions.
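To make the idea of a high-frequency magnitude loss concrete, here is a small one-dimensional sketch. It is an illustration under assumptions, not the paper's FDC: a naive DFT over the non-redundant bins stands in for an rFFT routine, the `cutoff_ratio` parameter and both function names are hypothetical, and the actual constraint may operate on 2D patches or spectral vectors with a different weighting.

```python
import cmath


def rfft_magnitudes(signal):
    """Magnitudes of the non-redundant DFT bins (0..N/2) of a real signal."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        mags.append(abs(coeff))
    return mags


def high_freq_magnitude_loss(pred, target, cutoff_ratio=0.5):
    """Mean absolute magnitude difference over bins at or above the cutoff."""
    p, t = rfft_magnitudes(pred), rfft_magnitudes(target)
    start = int(len(p) * cutoff_ratio)  # first bin counted as "high frequency"
    diffs = [abs(a - b) for a, b in zip(p[start:], t[start:])]
    return sum(diffs) / len(diffs)


target = [0, 1, 0, -1, 0, 1, 0, -1]   # oscillating detail
blurred = [0.0] * 8                    # reconstruction that lost the detail
print(high_freq_magnitude_loss(target, target))
print(high_freq_magnitude_loss(blurred, target))
```

A reconstruction that matches the target incurs no penalty, while one that smooths away the oscillation is penalized in proportion to the missing high-frequency energy.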
Experimental results demonstrate stable classification performance and strong cross-domain adaptability across four hyperspectral datasets, validating the method's effectiveness under resource-constrained conditions.
[219] Text-Pass Filter: An Efficient Scene Text Detector
Chuang Yang,Haozhao Ma,Xu Han,Yuan Yuan,Qi Wang
Main category: cs.CV
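TL;DR: This paper proposes Text-Pass Filter (TPF), a method for arbitrary-shaped text detection that borrows the idea of a band-pass filter to segment whole text regions directly, avoiding the loss of text-margin features caused by the conventional shrink-mask strategy. A Reinforcement Ensemble Unit (REU) improves feature consistency for ribbon-like text and a Foreground Prior Unit (FPU) sharpens foreground-background discrimination, enabling real-time separation of adhesive texts without complex post-processing.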
Details
Motivation: Existing methods based on the shrink-mask expansion strategy lose the visual features of text margins during shrinking and blur the difference between foreground and background, which intrinsically limits text feature recognition. Method: Text-Pass Filter (TPF) builds a unique feature-filter pair for each text and simulates a band-pass filter to extract each text independently; a Reinforcement Ensemble Unit (REU) addresses the difficulty of recognizing ribbon-like texts with large aspect ratios; a Foreground Prior Unit (FPU) improves foreground-background discrimination. Result: Experiments confirm the effectiveness of REU and FPU, and TPF shows superior performance on arbitrary-shaped text detection, particularly in separating adhesive texts and in real-time operation. Conclusion: By segmenting whole texts directly and end to end, with REU and FPU, TPF keeps accuracy high while greatly simplifying the pipeline, offering a new paradigm for efficient and robust arbitrary-shaped text detection. Abstract: To pursue an efficient text assembling process, existing methods detect texts via the shrink-mask expansion strategy. However, the shrinking operation loses the visual features of text margins and confuses the foreground and background difference, which intrinsically limits the recognition of text features. To address this issue, we design Text-Pass Filter (TPF) for arbitrary-shaped text detection. It segments the whole text directly, which avoids the intrinsic limitations. Notably, unlike previous whole text region-based methods, TPF can separate adhesive texts naturally without complex decoding or post-processing, which makes real-time text detection possible. Concretely, a band-pass filter passes components in a specified band of frequencies, called its passband, but blocks components with frequencies above or below this band. This provides a natural idea for extracting whole texts separately. By simulating the band-pass filter, TPF constructs a unique feature-filter pair for each text. In the inference stage, every filter extracts the corresponding matched text by passing its pass-feature and blocking other features. Meanwhile, because the large aspect ratio of ribbon-like texts makes it hard to recognize them as a whole, a Reinforcement Ensemble Unit (REU) is designed to enhance the feature consistency of the same text and to enlarge the filter's recognition field to help recognize whole texts.
Furthermore, a Foreground Prior Unit (FPU) is introduced to encourage TPF to discriminate the difference between the foreground and background, which improves the feature-filter pair quality. Experiments demonstrate the effectiveness of REU and FPU while showing the TPF's superiority.
[220] Computational Framework for Estimating Relative Gaussian Blur Kernels between Image Pairs
Akbar Saadat
Main category: cs.CV
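TL;DR: This paper proposes a training-free forward computational framework for real-time estimation of defocus blur. It discretely evaluates the analytic expression of the Gaussian model and uses neighborhood similarity to select a unique solution; experiments report errors below 2% for both synthetic blur estimation and blurred-image reconstruction.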
Details
Motivation: To make the Gaussian model usable in real-time applications through zero-training, fast forward computation for defocus blur estimation. Method: The analytic expression of the defocused image is evaluated discretely over the applicable range of Gaussian kernel standard deviations; where multiple solutions arise, similarity measures over neighboring points select a single one; the framework also handles two images that are partially blurred versions of each other. Result: On real images, the mean absolute error (MAE) in estimating synthetic blur values is below 1.7%; blurred images reconstructed by applying the extracted defocus filters to the sharper image differ from the actual blurred images by less than 2% in intensity. Conclusion: The zero-training forward framework is efficient and accurate, and suits real-time defocus blur modeling and estimation. Abstract: Following the earlier verification of the Gaussian model in \cite{ASaa2026}, this paper introduces a zero-training forward computational framework that makes the model usable in real-time applications. The framework is based on discrete calculation of the analytic expression of the defocused image from the sharper one over the application range of the standard deviation of the Gaussian kernels, followed by selection of the best matches. The analytic expression yields multiple solutions at certain image points, which are filtered down to a single solution using similarity measures over neighboring points. The framework is structured to handle cases where two given images are partially blurred versions of each other. Experimental evaluations on real images demonstrate that the proposed framework achieves a mean absolute error (MAE) below 1.7% in estimating synthetic blur values. Furthermore, the discrepancy between actual blurred image intensities and their corresponding estimates remains under 2%, obtained by applying the extracted defocus filters to less blurred images.
[221] Spatial-Conditioned Reasoning in Long-Egocentric Videos
James Tribble,Hao Wang,Si-En Hong,Chaoyi Zhou,Ashish Bastola,Siyu Huang,Abolfazl Razi
Main category: cs.CV
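TL;DR: This paper studies how explicit spatial signals affect vision-language model (VLM) understanding of long-horizon egocentric video, without modifying model architectures or inference procedures. It introduces the Sanpo-D dataset and fuses depth maps to strengthen spatial reasoning, finding that depth-aware and spatially grounded representations improve safety-critical tasks such as pedestrian and obstruction detection.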
Details
Motivation: Long-horizon egocentric video poses challenges for visual navigation because of viewpoint drift and the lack of persistent geometric context, and the spatial reasoning ability of current VLMs remains limited. Method: The authors build Sanpo-D, a fine-grained re-annotated dataset, benchmark multiple VLMs on navigation-oriented spatial queries, and fuse depth maps with RGB frames at the input level to analyze the effect on spatial reasoning. Result: The study reveals a trade-off between general-purpose accuracy and spatial specialization; depth-aware and spatially grounded representations improve safety-critical tasks such as pedestrian and obstruction detection. Conclusion: Explicit spatial signals such as depth maps can strengthen VLM spatial reasoning on long-horizon egocentric video without changing the model architecture, offering a practical route to improving safety-critical navigation tasks. Abstract: Long-horizon egocentric video presents significant challenges for visual navigation due to viewpoint drift and the absence of persistent geometric context. Although recent vision-language models perform well on image and short-video reasoning, their spatial reasoning capability in long egocentric sequences remains limited. In this work, we study how explicit spatial signals influence VLM-based video understanding without modifying model architectures or inference procedures. We introduce Sanpo-D, a fine-grained re-annotation of the Google Sanpo dataset, and benchmark multiple VLMs on navigation-oriented spatial queries. To examine input-level inductive bias, we further fuse depth maps with RGB frames and evaluate their impact on spatial reasoning. Our results reveal a trade-off between general-purpose accuracy and spatial specialization, showing that depth-aware and spatially grounded representations can improve performance on safety-critical tasks such as pedestrian and obstruction detection.
[222] LungCRCT: Causal Representation based Lung CT Processing for Lung Cancer Treatment
Daeyoung Kim
Main category: cs.CV
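TL;DR: This paper proposes LungCRCT, a framework that uses latent causal representation learning and a graph autoencoder for causal discovery, combined with distance correlation disentanglement and entropy-based image reconstruction refinement, to enable interpretable causal intervention analysis for lung cancer and accurate malignant tumor classification (AUC 93.91%).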
Details
Motivation: Lung cancer is asymptomatic in its early stages and its symptoms are easily confused with those of other respiratory diseases, which delays diagnosis; existing deep learning models (e.g., CNNs, ViTs) detect well but lack causal interpretability and support for intervention, limiting their use in treatment analysis. Method: The LungCRCT framework uses a graph-autoencoder-based causal discovery algorithm, combined with distance correlation disentanglement and entropy-based image reconstruction refinement, to learn latent causal representations of the physical mechanism of lung cancer progression. Result: The framework supports causal intervention analysis and reaches an AUC of 93.91% on malignant tumor classification; the downstream models are lightweight and robust. Conclusion: By introducing causal representation learning, LungCRCT addresses the interpretability and causal inference gaps of conventional AI models and offers a new paradigm for early lung cancer monitoring and personalized treatment analysis. Abstract: Due to its silence in early stages, lung cancer has been one of the leading causes of mortality among cancer patients worldwide. Moreover, major symptoms of lung cancer are hard to differentiate from symptoms of other respiratory diseases such as COPD, further leading patients to overlook cancer progression in early stages. Thus, to enhance survival rates in lung cancer, early detection through consistent proactive respiratory system monitoring becomes crucial. One of the most prevalent and effective methods for lung cancer monitoring is the low-dose computed tomography (LDCT) chest scan, which has led to remarkable enhancements in lung cancer detection and tumor classification tasks under rapid advancements and applications of computer vision based AI models such as EfficientNet or ResNet in image processing. However, though advanced CNN models under transfer learning or ViT based models have led to high-performing lung cancer detection, their intrinsic limitations in terms of correlation dependence and low interpretability due to complexity mean that expansions of deep learning models to lung cancer treatment analysis or causal intervention analysis simulations are still limited. Therefore, this research introduces LungCRCT: a latent causal representation learning based lung cancer analysis framework that retrieves causal representations of factors within the physical causal mechanism of lung cancer progression.
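Distance correlation, the statistic named in the disentanglement step, measures dependence of any form (not only linear) between two variables, and is zero only under independence. The sketch below computes the standard sample distance correlation for two scalar sequences in plain Python; how LungCRCT applies it to latent factors is not specified here, so treat the function names and the toy data as illustrative assumptions.

```python
import math


def _centered_distances(values):
    """Double-centered pairwise distance matrix of a 1D sample."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row = [sum(r) / n for r in d]          # row means (equal to column means)
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)]
            for i in range(n)]


def distance_correlation(x, y):
    """Sample distance correlation in [0, 1] for two equal-length sequences."""
    n = len(x)
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / n ** 2
    dvar_x = sum(v * v for r in a for v in r) / n ** 2
    dvar_y = sum(v * v for r in b for v in r) / n ** 2
    denom = math.sqrt(dvar_x * dvar_y)
    return math.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0


x = [-2, -1, 0, 1, 2]
y = [v * v for v in x]   # nonlinear dependence with zero Pearson correlation
print(distance_correlation(x, x), distance_correlation(x, y))
```

The quadratic pair has zero linear correlation yet a clearly positive distance correlation, which is why minimizing this statistic between latent factors is a stronger independence pressure than decorrelation alone.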
With the use of advanced graph autoencoder based causal discovery algorithms with distance correlation disentanglement and entropy-based image reconstruction refinement, LungCRCT not only enables causal intervention analysis for lung cancer treatments, but also leads to robust, yet extremely light downstream models in malignant tumor classification tasks with an AUC score of 93.91%.
[223] Forward Consistency Learning with Gated Context Aggregation for Video Anomaly Detection
Jiahao Lyu,Minghua Zhao,Xuewen Huang,Yifei Chen,Shuangli Du,Jing Hu,Cheng Shi,Zhiyong Lv
Main category: cs.CV
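TL;DR: This paper proposes FoGA, a lightweight video anomaly detection model that uses forward consistency learning and gated context aggregation to balance high accuracy with high efficiency on edge devices.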
Details
Motivation: Existing VAD methods rely on large models that are hard to deploy on edge devices, and mainstream prediction-based methods use only single-frame prediction errors, ignoring longer-term temporal information. Method: A lightweight Unet-based architecture (about 2M parameters) with a gated context aggregation module that improves feature fusion in the skip connections, jointly optimized with a forward consistency loss; a hybrid anomaly measure combines immediate and forward frame prediction errors. Result: The model substantially outperforms state-of-the-art methods on multiple benchmarks and runs at up to 155 FPS. Conclusion: FoGA is a practical solution for real-time video anomaly detection in resource-constrained settings, with a good trade-off between performance and efficiency. Abstract: As a crucial element of public security, video anomaly detection (VAD) aims to measure deviations from normal patterns for various events in real-time surveillance systems. However, most existing VAD methods rely on large-scale models to pursue extreme accuracy, limiting their feasibility on resource-limited edge devices. Moreover, mainstream prediction-based VAD detects anomalies using only single-frame future prediction errors, overlooking the richer constraints from longer-term temporal forward information. In this paper, we introduce FoGA, a lightweight VAD model that performs Forward consistency learning with Gated context Aggregation, containing about 2M parameters and tailored for potential edge devices. Specifically, we propose a Unet-based method that performs feature extraction on consecutive frames to generate both immediate and forward predictions. Then, we introduce a gated context aggregation module into the skip connections to dynamically fuse encoder and decoder features at the same spatial scale. Finally, the model is jointly optimized with a novel forward consistency loss, and a hybrid anomaly measurement strategy is adopted to integrate errors from both immediate and forward frames for more accurate detection. Extensive experiments demonstrate the effectiveness of the proposed method, which substantially outperforms state-of-the-art competing methods, running up to 155 FPS. Hence, our FoGA achieves an excellent trade-off between performance and efficiency.
[224] Agentic Very Long Video Understanding
Aniket Rege,Arka Sadhu,Yuliang Li,Kejie Li,Ramya Korlakai Vinayak,Yuning Chai,Yong Jae Lee,Hyo Jin Kim
Main category: cs.CV
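TL;DR: This paper proposes EGAgent, a framework that models long-duration egocentric video with entity scene graphs and supports cross-modal, multi-hop, and temporally coherent reasoning, achieving strong results on the EgoLifeQA and Video-MME (Long) datasets.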
Details
Motivation: Existing methods are limited by context window length and struggle with compositional, multi-hop reasoning over egocentric video streams spanning days or weeks, so they cannot meet the long-term contextual understanding needs of always-on wearable AI assistants. Method: EGAgent, an enhanced agentic framework centered on entity scene graphs that evolve over time (people, places, objects, and their relationships), equipped with graph-based structured search and reasoning tools and hybrid visual-audio retrieval. Result: 57.5% accuracy on EgoLifeQA (state of the art) and 74.1% on Video-MME (Long), improving long-horizon egocentric video understanding. Conclusion: An agentic framework built on entity scene graphs effectively supports cross-modal, multi-hop, and temporally coherent long-term video understanding, providing key technology for wearable AI assistants. Abstract: The advent of always-on personal AI assistants, enabled by all-day wearable devices such as smart glasses, demands a new level of contextual understanding, one that goes beyond short, isolated events to encompass the continuous, longitudinal stream of egocentric video. Achieving this vision requires advances in long-horizon video understanding, where systems must interpret and recall visual and audio information spanning days or even weeks. Existing methods, including large language models and retrieval-augmented generation, are constrained by limited context windows and lack the ability to perform compositional, multi-hop reasoning over very long video streams. In this work, we address these challenges through EGAgent, an enhanced agentic framework centered on entity scene graphs, which represent people, places, objects, and their relationships over time. Our system equips a planning agent with tools for structured search and reasoning over these graphs, as well as hybrid visual and audio search capabilities, enabling detailed, cross-modal, and temporally coherent reasoning. Experiments on the EgoLifeQA and Video-MME (Long) datasets show that our method achieves state-of-the-art performance on EgoLifeQA (57.5%) and competitive performance on Video-MME (Long) (74.1%) for complex longitudinal video understanding tasks.
[225] TempDiffReg: Temporal Diffusion Model for Non-Rigid 2D-3D Vascular Registration
Zehua Liu,Shihao Zou,Jincai Huang,Yanfang Zhang,Chao Tong,Weixin Si
Main category: cs.CV
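TL;DR: This paper proposes a coarse-to-fine strategy for 2D-3D vessel registration in transarterial chemoembolization (TACE), consisting of a structure-aware perspective n-point method (SA-PnP) and a temporal-diffusion deformation model (TempDiffReg), which significantly improves registration accuracy and anatomical plausibility.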
Details
Motivation: Vascular navigation during TACE is complex and anatomy varies widely, so accurate and robust 2D-3D vessel registration is needed to guide precise microcatheter placement. Method: A two-stage registration strategy: stage one uses the structure-aware SA-PnP for global alignment; stage two uses the temporal diffusion model TempDiffReg to model local vessel deformation iteratively, using temporal context to capture anatomical changes. Result: On 626 multi-frame clinical samples, the method reaches an MSE of 0.63 mm and an MAE of 0.51 mm, 66.7% and 17.7% lower than the best competing method, with better anatomical plausibility. Conclusion: The method can improve the precision and safety of TACE procedures, in particular helping less-experienced clinicians perform complex operations, and can improve surgical outcomes and patient prognosis. Abstract: Transarterial chemoembolization (TACE) is a preferred treatment option for hepatocellular carcinoma and other liver malignancies, yet it remains a highly challenging procedure due to complex intra-operative vascular navigation and anatomical variability. Accurate and robust 2D-3D vessel registration is essential to guide microcatheter and instruments during TACE, enabling precise localization of vascular structures and optimal therapeutic targeting. To tackle this issue, we develop a coarse-to-fine registration strategy. First, we introduce a global alignment module, structure-aware perspective n-point (SA-PnP), to establish correspondence between 2D and 3D vessel structures. Second, we propose TempDiffReg, a temporal diffusion model that performs vessel deformation iteratively by leveraging temporal context to capture complex anatomical variations and local structural changes. We collected data from 23 patients and constructed 626 paired multi-frame samples for comprehensive evaluation. Experimental results demonstrate that the proposed method consistently outperforms state-of-the-art (SOTA) methods in both accuracy and anatomical plausibility. Specifically, our method achieves a mean squared error (MSE) of 0.63 mm and a mean absolute error (MAE) of 0.51 mm in registration accuracy, representing 66.7% lower MSE and 17.7% lower MAE compared to the most competitive existing approaches. It has the potential to assist less-experienced clinicians in safely and efficiently performing complex TACE procedures, ultimately enhancing both surgical outcomes and patient care.
Code and data are available at: https://github.com/LZH970328/TempDiffReg.git
[226] YOLO-DS: Fine-Grained Feature Decoupling via Dual-Statistic Synergy Operator for Object Detection
Lin Huang,Yujuan Tan,Weisheng Li,Shitai Shan,Liu Liu,Bo Liu,Linlin Shen,Jing Yu,Yue Niu
Main category: cs.CV
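TL;DR: This paper proposes the YOLO-DS framework, which decouples object features with a new Dual-Statistic Synergy Operator (DSO) and adds two lightweight gating modules (DSG and MSG). On MS-COCO it improves YOLOv8 at every model scale, with AP gains of 1.1% to 1.7% and only a minimal latency increase.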
Details
Motivation: Existing YOLO detectors do not explicitly model heterogeneous object responses within shared feature channels, which limits further performance gains. Method: The YOLO-DS framework is built around the Dual-Statistic Synergy Operator (DSO), which jointly models the channel-wise mean and the peak-to-mean difference; a Dual-Statistic Synergy Gating (DSG) module performs adaptive channel-wise feature selection, and a Multi-Path Segmented Gating (MSG) module performs depth-wise feature weighting. Result: On the MS-COCO benchmark, YOLO-DS improves AP by 1.1% to 1.7% across the five YOLOv8 model scales (N/S/M/L/X) with only a slight increase in inference latency; visualization, ablation, and comparative experiments confirm its ability to discriminate heterogeneous objects efficiently. Conclusion: By explicitly modeling heterogeneous object responses, YOLO-DS improves the accuracy-efficiency balance of one-stage detectors and offers an extensible new paradigm for the YOLO series. Abstract: One-stage object detection, particularly the YOLO series, strikes a favorable balance between accuracy and efficiency. However, existing YOLO detectors lack explicit modeling of heterogeneous object responses within shared feature channels, which limits further performance gains. To address this, we propose YOLO-DS, a framework built around a novel Dual-Statistic Synergy Operator (DSO). The DSO decouples object features by jointly modeling the channel-wise mean and the peak-to-mean difference. Building upon the DSO, we design two lightweight gating modules: the Dual-Statistic Synergy Gating (DSG) module for adaptive channel-wise feature selection, and the Multi-Path Segmented Gating (MSG) module for depth-wise feature weighting. On the MS-COCO benchmark, YOLO-DS consistently outperforms YOLOv8 across five model scales (N, S, M, L, X), achieving AP gains of 1.1% to 1.7% with only a minimal increase in inference latency. Extensive visualization, ablation, and comparative studies validate the effectiveness of our approach, demonstrating its superior capability in discriminating heterogeneous objects with high efficiency.
[227] NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation
Weiye Zhu,Zekai Zhang,Xiangchen Wang,Hewei Pan,Teng Wang,Tiantian Geng,Rongtao Xu,Feng Zheng
Main category: cs.CV
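TL;DR: This paper proposes the NaVIDA framework, which models vision-action causality through inverse-dynamics augmentation and hierarchical probabilistic action chunking, improving the generalization and stability of VLN agents.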
Details
Motivation: Existing VLN methods do not explicitly model how actions causally change visual observations, which leads to unstable behavior, weak generalization, and accumulating trajectory error. Method: The NaVIDA framework couples policy learning with action-grounded visual dynamics modeling; it introduces chunk-based inverse-dynamics supervision, uses hierarchical probabilistic action chunking (HPAC) to provide longer-range visual-change cues, and applies an entropy-guided mechanism to set the execution horizon adaptively. Result: NaVIDA outperforms state-of-the-art methods on multiple VLN benchmarks with fewer parameters (3B vs. 8B), and its effectiveness is validated on a real robot. Conclusion: Modeling vision-action causality is essential for improving VLN robustness and generalization, and NaVIDA offers an effective and practical unified framework for doing so. Abstract: Vision-and-Language Navigation (VLN) requires agents to interpret natural language instructions and act coherently in visually rich environments. However, most existing methods rely on reactive state-action mappings without explicitly modeling how actions causally transform subsequent visual observations. Lacking such vision-action causality, agents cannot anticipate the visual changes induced by their own actions, leading to unstable behaviors, weak generalization, and cumulative error along the trajectory. To address these issues, we introduce NaVIDA (Navigation with Inverse Dynamics Augmentation), a unified VLN framework that couples policy learning with action-grounded visual dynamics and adaptive execution. NaVIDA augments training with chunk-based inverse-dynamics supervision to learn the causal relationship between visual changes and corresponding actions. To structure this supervision and extend the effective planning range, NaVIDA employs hierarchical probabilistic action chunking (HPAC), which organizes trajectories into multi-step chunks and provides discriminative, longer-range visual-change cues. To further curb error accumulation and stabilize behavior at inference, an entropy-guided mechanism adaptively sets the execution horizon of action chunks. Extensive experiments show that NaVIDA achieves superior navigation performance compared to state-of-the-art methods with fewer parameters (3B vs. 8B). Real-world robot evaluations further validate the practical feasibility and effectiveness of our approach.
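One plausible reading of an entropy-guided execution horizon is sketched below: the agent executes the leading actions of a predicted chunk while the per-step action distribution stays confident, and replans once entropy rises. The stopping rule, the `max_entropy` threshold, and the function names are assumptions made for illustration; the abstract does not specify the actual mechanism.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def execution_horizon(chunk_probs, max_entropy=0.8, min_steps=1):
    """Count the leading actions of a chunk to execute before replanning.

    Hypothetical rule: always run min_steps actions, then stop at the
    first step whose distribution is more uncertain than max_entropy.
    """
    steps = 0
    for probs in chunk_probs:
        if steps >= min_steps and entropy(probs) > max_entropy:
            break
        steps += 1
    return steps


chunk = [
    [0.90, 0.05, 0.05],   # confident
    [0.80, 0.10, 0.10],   # still confident
    [0.40, 0.30, 0.30],   # uncertain: replan before this step
    [0.90, 0.05, 0.05],
]
print(execution_horizon(chunk))
```

Cutting the chunk at the first uncertain step keeps errors from compounding over actions the policy was never confident about.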
Code and data will be available upon acceptance.
[228] Multi-Perspective Subimage CLIP with Keyword Guidance for Remote Sensing Image-Text Retrieval
Yifan Li,Shiying Wang,Jianqiang Huang
Main category: cs.CV
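TL;DR: This paper proposes MPS-CLIP, a parameter-efficient vision-language pre-training framework that improves remote sensing image-text retrieval through keyword-guided fine-grained alignment. Combining an LLM, SamGeo, and a new G^2A adapter, it reaches state-of-the-art results on RSICD and RSITMD.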
Details
Motivation: Existing VLP models for remote sensing image-text retrieval rely on coarse-grained global alignment and overlook the dense, multi-scale semantics in the imagery; full fine-tuning is computationally expensive and prone to catastrophic forgetting. Method: The MPS-CLIP framework: 1) an LLM extracts keywords that guide SamGeo to generate semantic sub-perspectives; 2) a lightweight Gated Global Attention (G^2A) adapter adapts the frozen backbone; 3) a Multi-Perspective Representation (MPR) module aggregates local features; 4) a multi-perspective contrastive loss and a weighted triplet loss are optimized jointly. Result: mR reaches 35.18% on RSICD and 48.40% on RSITMD, significantly better than full fine-tuning and other recent methods. Conclusion: MPS-CLIP achieves keyword-guided fine-grained alignment in a parameter-efficient way, improving the accuracy and robustness of remote sensing image-text retrieval and offering a new paradigm for lightweight adaptation of VLP models. Abstract: Vision-Language Pre-training (VLP) models like CLIP have significantly advanced Remote Sensing Image-Text Retrieval (RSITR). However, existing methods predominantly rely on coarse-grained global alignment, which often overlooks the dense, multi-scale semantics inherent in overhead imagery. Moreover, adapting these heavy models via full fine-tuning incurs prohibitive computational costs and risks catastrophic forgetting. To address these challenges, we propose MPS-CLIP, a parameter-efficient framework designed to shift the retrieval paradigm from global matching to keyword-guided fine-grained alignment. Specifically, we leverage a Large Language Model (LLM) to extract core semantic keywords, guiding the Segment Anything Model (SamGeo) to generate semantically relevant sub-perspectives. To efficiently adapt the frozen backbone, we introduce a Gated Global Attention (G^2A) adapter, which captures global context and long-range dependencies with minimal overhead. Furthermore, a Multi-Perspective Representation (MPR) module aggregates these local cues into robust multi-perspective embeddings. The framework is optimized via a hybrid objective combining multi-perspective contrastive and weighted triplet losses, which dynamically selects maximum-response perspectives to suppress noise and enforce precise semantic matching. Extensive experiments on the RSICD and RSITMD benchmarks demonstrate that MPS-CLIP achieves state-of-the-art performance with 35.18% and 48.40% mean Recall (mR), respectively, significantly outperforming full fine-tuning baselines and recent competitive methods.
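The "maximum-response perspective" idea can be sketched as follows: each image is represented by several sub-perspective embeddings, the one most similar to the text is selected, and that score feeds a margin-based triplet term. This is a simplified illustration under assumptions; the function names, the `margin` and `weight` parameters, and the toy two-dimensional embeddings are hypothetical, and the paper's weighting scheme is not reproduced.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_response_similarity(perspective_embeddings, text_embedding):
    """Return (index, score) of the sub-perspective that best matches the text."""
    scores = [cosine(p, text_embedding) for p in perspective_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def triplet_loss(pos_score, neg_score, margin=0.2, weight=1.0):
    """Weighted hinge loss: the positive pair should beat the negative by margin."""
    return weight * max(0.0, margin - pos_score + neg_score)


perspectives = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy sub-image embeddings
caption = [0.0, 2.0]                                  # toy text embedding
best, score = max_response_similarity(perspectives, caption)
print(best, score, triplet_loss(score, 0.9))
```

Selecting only the best-matching perspective means irrelevant crops contribute nothing to the score, which is one way to read the abstract's claim about suppressing noise.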
Code is available at https://github.com/Lcrucial1f/MPS-CLIP.
[229] MindCine: Multimodal EEG-to-Video Reconstruction with Large-Scale Pretrained Models
Tian-Yi Zhou,Xuan-Hao Liu,Bao-Liang Lu,Wei-Long Zheng
Main category: cs.CV
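TL;DR: This paper proposes the MindCine framework, which uses multimodal joint learning and a pre-trained large-scale EEG model to overcome single-modality limitations and data scarcity in EEG-to-video reconstruction, achieving high-fidelity video reconstruction from limited data.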
Details
Motivation: Existing EEG-to-video reconstruction methods align EEG only with the text modality (which leads to overfitting) and suffer from scarce paired EEG-video data (which makes convergence difficult); a method that exploits multimodal information and is data-efficient is needed. Method: The MindCine framework: 1) a multimodal joint learning strategy that incorporates modalities beyond text; 2) a pre-trained large-scale EEG model to relieve data scarcity; 3) a Seq2Seq model with causal attention to decode perceptual information. Result: With limited data, MindCine surpasses existing state-of-the-art methods in both qualitative and quantitative evaluation; the experiments confirm that the modalities are complementary and that the large EEG model improves performance. Conclusion: Multimodal synergy and large-model priors can substantially improve EEG-to-video reconstruction from small samples, offering a new paradigm for non-invasive visual decoding from brain signals. Abstract: Reconstructing human dynamic visual perception from electroencephalography (EEG) signals is of great research significance owing to EEG's non-invasiveness and high temporal resolution. However, EEG-to-video reconstruction remains challenging due to: 1) Single Modality: existing studies solely align EEG signals with the text modality, which ignores other modalities and is prone to overfitting; 2) Data Scarcity: current methods often struggle to converge when trained on limited EEG-video data. To solve the above problems, we propose a novel framework, MindCine, to achieve high-fidelity video reconstructions on limited data. We employ a multimodal joint learning strategy to incorporate beyond-text modalities in the training stage and leverage a pre-trained large EEG model to relieve the data scarcity issue for decoding semantic information, while a Seq2Seq model with causal attention is specifically designed for decoding perceptual information. Extensive experiments demonstrate that our model outperforms state-of-the-art methods both qualitatively and quantitatively. Additionally, the results underscore the effectiveness of the complementary strengths of different modalities and demonstrate that leveraging a large-scale EEG model can further enhance reconstruction performance by alleviating the challenges associated with limited data.
[230] QualiRAG: Retrieval-Augmented Generation for Visual Quality Understanding
Linhan Cao,Wei Sun,Weixia Zhang,Xiangyang Zhu,Kaiwei Zhang,Jun Jia,Dandan Zhu,Guangtao Zhai,Xiongkuo Min
Main category: cs.CV
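TL;DR: This paper proposes QualiRAG, a training-free retrieval-augmented generation (RAG) framework that draws on the latent perceptual knowledge of large vision-language models. By dynamically building four complementary knowledge sources, it achieves fine-grained, interpretable visual quality understanding and clearly outperforms existing open-source and fine-tuned models without any task-specific training.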
Details
Motivation: Visual quality assessment (VQA) is moving toward interpretable quality understanding, but mainstream supervised fine-tuning and reinforcement learning approaches depend on costly human-annotated instruction data and are prone to dataset bias. Method: The training-free QualiRAG framework does not retrieve from a static corpus; instead it decomposes the question into structured requests and dynamically generates and retrieves four complementary knowledge sources (visual metadata, subject localization, global quality summaries, and local quality descriptions) to support evidence-grounded reasoning. Result: QualiRAG clearly surpasses open-source general-purpose multimodal models and VQA-finetuned models on visual quality understanding, and is competitive on quality comparison tasks, all without any task-specific training. Conclusion: QualiRAG shows that the inherent perceptual ability of large models, combined with retrieval augmentation, enables efficient, robust, training-free visual quality understanding and offers a new paradigm for interpretable VQA. Abstract: Visual quality assessment (VQA) is increasingly shifting from scalar score prediction toward interpretable quality understanding, a paradigm that demands fine-grained spatiotemporal perception and auxiliary contextual information. Current approaches rely on supervised fine-tuning or reinforcement learning on curated instruction datasets, which involve labor-intensive annotation and are prone to dataset-specific biases. To address these challenges, we propose QualiRAG, a training-free Retrieval-Augmented Generation (RAG) framework that systematically leverages the latent perceptual knowledge of large multimodal models (LMMs) for visual quality perception. Unlike conventional RAG that retrieves from static corpora, QualiRAG dynamically generates auxiliary knowledge by decomposing questions into structured requests and constructing four complementary knowledge sources: visual metadata, subject localization, global quality summaries, and local quality descriptions, followed by relevance-aware retrieval for evidence-grounded reasoning. Extensive experiments show that QualiRAG achieves substantial improvements over open-source general-purpose LMMs and VQA-finetuned LMMs on visual quality understanding tasks, and delivers competitive performance on visual quality comparison tasks, demonstrating robust quality assessment capabilities without any task-specific training. The code will be publicly available at https://github.com/clh124/QualiRAG.
[231] HomoFM: Deep Homography Estimation with Flow Matching
Mengfan He,Liangzheng Sun,Chunyu Li,Ziyang Meng
Main category: cs.CV
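TL;DR: This paper proposes HomoFM, the first framework to bring flow matching from generative modeling into homography estimation. It recovers high-precision transformations by learning a continuous point-wise velocity field and adds a gradient reversal layer to improve cross-domain robustness.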
Details
Motivation: Existing deep homography estimation methods mostly use direct regression or iterative refinement, which struggle to model complex geometric transformations and generalize poorly, especially across domains (e.g., multimodal inputs or changing illumination). Method: Homography estimation is formulated as velocity field learning, where a conditional flow trajectory recovers registered coordinates from a noise distribution; a gradient reversal layer (GRL) embedded in the feature extraction backbone enforces domain-invariant feature learning. Result: On standard benchmarks, HomoFM outperforms current state-of-the-art methods in both estimation accuracy and robustness. Conclusion: By combining flow matching with domain adaptation, HomoFM improves the accuracy and cross-domain generalization of homography estimation and offers a new approach to geometric matching tasks. Abstract: Deep homography estimation has broad applications in computer vision and robotics. Remarkable progress has been achieved, but existing methods typically treat it as a direct regression or iterative refinement problem and often struggle to capture complex geometric transformations or generalize across different domains. In this work, we propose HomoFM, a new framework that introduces the flow matching technique from generative modeling into the homography estimation task for the first time. Unlike existing methods, we formulate the homography estimation problem as a velocity field learning problem. By modeling a continuous and point-wise velocity field that transforms noisy distributions into registered coordinates, the proposed network recovers high-precision transformations through a conditional flow trajectory. Furthermore, to address the challenge of domain shift, e.g., multimodal matching or varying illumination scenarios, we integrate a gradient reversal layer (GRL) into the feature extraction backbone. This domain adaptation strategy explicitly constrains the encoder to learn domain-invariant representations, significantly enhancing the network's robustness. Extensive experiments demonstrate the effectiveness of the proposed method, showing that HomoFM outperforms state-of-the-art methods in both estimation accuracy and robustness on standard benchmarks. Code and data resources are available at https://github.com/hmf21/HomoFM.
[232] Facial Emotion Recognition on FER-2013 using an EfficientNetB2-Based Approach
Sahil Naik,Soham Bagayatkar,Pavankumar Singh
Main category: cs.CV
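TL;DR: This paper proposes a lightweight, efficient facial emotion recognition method based on EfficientNetB2. Using two-stage training, AdamW optimization, label smoothing, and clipped class weights, it reaches 68.78% accuracy on FER-2013 with about one tenth the parameters of VGG16-based baselines, making it suitable for real-time and edge deployment.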
Details
Motivation: Facial emotion recognition in real-world settings faces low image quality, lighting and pose variation, background distractions, small inter-class differences, noisy crowd-sourced labels, and severe class imbalance; existing large CNN models are too expensive in computation and memory for real-time use. Method: A lightweight EfficientNetB2 architecture trained with a two-stage warm-up and fine-tuning strategy; AdamW with decoupled weight decay, label smoothing (ε=0.06), clipped class weights, dropout, mixed-precision training, and real-time data augmentation; a stratified 87.5%/12.5% train-validation split. Result: 68.78% accuracy on the FER-2013 test set with roughly one tenth the parameters of VGG16-based baselines; experiments show stable training, strong generalization, and good per-class metrics and learning dynamics. Conclusion: The method balances accuracy and efficiency well, substantially reduces resource consumption, and is suitable for real-time and edge deployment, offering a practical solution for lightweight emotion recognition. Abstract: Detection of human emotions based on facial images in real-world scenarios is a difficult task due to low image quality, variations in lighting, pose changes, background distractions, small inter-class variations, noisy crowd-sourced labels, and severe class imbalance, as observed in the FER-2013 dataset of 48x48 grayscale images. Although recent approaches using large CNNs such as VGG and ResNet achieve reasonable accuracy, they are computationally expensive and memory-intensive, limiting their practicality for real-time applications. We address these challenges using a lightweight and efficient facial emotion recognition pipeline based on EfficientNetB2, trained using a two-stage warm-up and fine-tuning strategy. The model is enhanced with AdamW optimization, decoupled weight decay, label smoothing (epsilon = 0.06) to reduce annotation noise, and clipped class weights to mitigate class imbalance, along with dropout, mixed-precision training, and extensive real-time data augmentation. The model is trained using a stratified 87.5%/12.5% train-validation split while keeping the official test set intact, achieving a test accuracy of 68.78% with nearly ten times fewer parameters than VGG16-based baselines. Experimental results, including per-class metrics and learning dynamics, demonstrate stable training and strong generalization, making the proposed approach suitable for real-time and edge-based applications.
[233] V-Loop: Visual Logical Loop Verification for Hallucination Detection in Medical Visual Question Answering
Mengyuan Jin,Zehui Liao,Yong Xia
Main category: cs.CV
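TL;DR: This paper proposes Visual Logical Loop Verification (V-Loop), a training-free, plug-and-play framework for detecting answer hallucinations by multimodal large language models (MLLMs) in medical visual question answering. It verifies the factual correctness of an answer by building a bidirectional, visually grounded logical loop, and clearly outperforms existing uncertainty-based methods.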
Details
Motivation: Existing uncertainty-based introspective hallucination detection methods are indirect and cannot verify whether an answer matches the facts in the image, which is a safety risk in high-stakes medical settings. Method: The V-Loop framework: 1) the MLLM generates the primary answer; 2) semantic units are extracted from the primary QA pair; 3) a verification question is generated by conditioning on the answer unit to re-query the question unit; 4) visual attention consistency is enforced so that the primary and verification questions rely on the same image evidence; 5) if the verification answer matches the expected semantic content, the loop closes and the answer is judged reliable, otherwise it is flagged as hallucinated. Result: Experiments on multiple medical VQA benchmarks and MLLMs show that V-Loop consistently outperforms existing introspective methods, is efficient, and further improves uncertainty-based methods when combined with them. Conclusion: V-Loop is an effective, efficient, and general training-free hallucination detection framework that offers a new paradigm for improving the reliability of medical VQA. Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable capability in assisting disease diagnosis in medical visual question answering (VQA). However, their outputs remain vulnerable to hallucinations (i.e., responses that contradict visual facts), posing significant risks in high-stakes medical scenarios. Recent introspective detection methods, particularly uncertainty-based approaches, offer computational efficiency but are fundamentally indirect, as they estimate predictive uncertainty for an image-question pair rather than verifying the factual correctness of a specific answer. To address this limitation, we propose Visual Logical Loop Verification (V-Loop), a training-free and plug-and-play framework for hallucination detection in medical VQA. V-Loop introduces a bidirectional reasoning process that forms a visually grounded logical loop to verify factual correctness. Given an input, the MLLM produces an answer for the primary input pair. V-Loop extracts semantic units from the primary QA pair, generates a verification question by conditioning on the answer unit to re-query the question unit, and enforces visual attention consistency to ensure that answering both the primary question and the verification question relies on the same image evidence. If the verification answer matches the expected semantic content, the logical loop closes, indicating factual grounding; otherwise, the primary answer is flagged as hallucinated.
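The control flow of the verification loop can be sketched with a stubbed model. Everything here is hypothetical scaffolding: `ask` stands in for an MLLM call, `build_verification_question` and `matches` stand in for the semantic-unit extraction and matching steps, and the visual attention consistency constraint is omitted entirely.

```python
def verify_answer(ask, image, question, question_unit,
                  build_verification_question, matches):
    """Return (answer, is_grounded) by closing a question-answer-question loop."""
    answer = ask(image, question)                       # primary answer
    verification_q = build_verification_question(question, answer)
    verification_a = ask(image, verification_q)         # re-query via the answer
    return answer, matches(verification_a, question_unit)


def make_toy_model(primary_answer):
    """Stub MLLM: fixed primary answer, lookup-table verification answers."""
    facts = {"liver": "lesion", "kidney": "none"}  # toy image content per organ

    def ask(image, question):
        if question.startswith("Which organ"):
            return primary_answer
        organ = question.rstrip("?").split()[-1]
        return facts[organ]

    return ask


def build(question, answer):
    return f"What abnormality is seen in the {answer}?"


def same(got, expected):
    return got == expected


primary = "Which organ shows the lesion?"
print(verify_answer(make_toy_model("liver"), None, primary, "lesion", build, same))
print(verify_answer(make_toy_model("kidney"), None, primary, "lesion", build, same))
```

The first stub answers consistently, so the loop closes; the second gives a primary answer that the verification question contradicts, so it is flagged.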
Extensive experiments on multiple medical VQA benchmarks and MLLMs show that V-Loop consistently outperforms existing introspective methods, remains highly efficient, and further boosts uncertainty-based approaches when used in combination.
[234] Vision-Language-Model-Guided Differentiable Ray Tracing for Fast and Accurate Multi-Material RF Parameter Estimation
Zerui Kang,Yishen Lim,Zhouyou Gu,Seung-Woo Ko,Tony Q. S. Quek,Jihong Park
Main category: cs.CV
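TL;DR: This paper proposes a vision-language-model (VLM) guided differentiable ray tracing (DRT) framework that accelerates and stabilizes multi-material RF parameter estimation, improving the accuracy and efficiency of material parameter inversion for 6G electromagnetic digital twins.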
Details
Motivation: Accurate RF material parameters are essential for 6G electromagnetic digital twins, but existing gradient-based inverse ray tracing is sensitive to initialization and computationally costly when measurements are limited. Method: A VLM parses scene images to infer material categories and maps them through an ITU-R material table to quantitative priors such as conductivity; the VLM also selects transmitter/receiver placements that excite material-discriminative paths; the DRT engine then refines the parameters by gradient-based optimization on received signal strengths. Result: In NVIDIA Sionna indoor-scene experiments, convergence is 2-4× faster and final parameter error is 10-100× lower than with uniform or random initialization and random placement baselines; a mean relative error below 0.1% is reached with only a few receivers; per-iteration cost grows near-linearly, and VLM-guided placement reduces the number of measurements needed. Conclusion: Semantic priors from a VLM effectively guide physics-based optimization, giving fast and robust RF material parameter estimation. Abstract: Accurate radio-frequency (RF) material parameters are essential for electromagnetic digital twins in 6G systems, yet gradient-based inverse ray tracing (RT) remains sensitive to initialization and costly under limited measurements. This paper proposes a vision-language-model (VLM) guided framework that accelerates and stabilizes multi-material parameter estimation in a differentiable RT (DRT) engine. A VLM parses scene images to infer material categories and maps them to quantitative priors via an ITU-R material table, yielding informed conductivity initializations. The VLM further selects informative transmitter/receiver placements that promote diverse, material-discriminative paths. Starting from these priors, the DRT performs gradient-based refinement using measured received signal strengths. Experiments in NVIDIA Sionna on indoor scenes show 2-4× faster convergence and 10-100× lower final parameter error compared with uniform or random initialization and random placement baselines, achieving sub-0.1% mean relative error with only a few receivers. Complexity analyses indicate per-iteration time scales near-linearly with the number of materials and measurement setups, while VLM-guided placement reduces the measurements required for accurate recovery. Ablations over RT depth and ray counts confirm further accuracy gains without significant per-iteration overhead. Results demonstrate that semantic priors from VLMs effectively guide physics-based optimization for fast and reliable RF material estimation.
[235] A multimodal vision foundation model for generalizable knee pathology
Kang Yu,Dingyu Wang,Zimu Yuan,Nan Zhou,Jiajun Liu,Jiaxin Liu,Shanggui Liu,Yaoyan Zheng,Huishu Yuan,Di Huang,Dong Jiang
Main category: cs.CV
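TL;DR: This paper presents OrthoFoundation, a multimodal vision foundation model for musculoskeletal pathology. Pre-trained with self-supervised contrastive learning on 1.2 million knee X-ray and MRI images, it achieves SOTA performance on 14 downstream tasks and shows strong label efficiency and cross-anatomy generalization.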
Details
Motivation: Existing orthopedic AI methods rely on task-specific supervised learning, need large annotated datasets, and generalize poorly across modalities and clinical scenarios; in addition, large-scale, high-quality, open-source musculoskeletal datasets are scarce, which has held back foundation models. Method: Builds a pre-training dataset of 1.2 million unlabeled knee X-ray and MRI images and trains a DINOv3 backbone with self-supervised contrastive learning to learn robust radiological representations. Result: Achieves SOTA performance on 14 downstream tasks, with leading accuracy in X-ray osteoarthritis diagnosis and first rank in MRI structural injury detection; matches fully supervised baselines using only 50% of the labeled data; generalizes across anatomy to the hip, shoulder, and ankle. Conclusion: By learning joint-agnostic radiological semantics from large-scale multimodal data, OrthoFoundation substantially reduces annotation burden and improves clinical diagnostic accuracy, an important step toward general-purpose AI for musculoskeletal imaging. Abstract: Musculoskeletal disorders represent a leading cause of global disability, creating an urgent demand for precise interpretation of medical imaging. Current artificial intelligence (AI) approaches in orthopedics predominantly rely on task-specific, supervised learning paradigms. These methods are inherently fragmented, require extensive annotated datasets, and often lack generalizability across different modalities and clinical scenarios. The development of foundation models in this field has been constrained by the scarcity of large-scale, curated, and open-source musculoskeletal datasets. To address these challenges, we introduce OrthoFoundation, a multimodal vision foundation model optimized for musculoskeletal pathology. We constructed a pre-training dataset of 1.2 million unlabeled knee X-ray and MRI images from internal and public databases. Utilizing a DINOv3 backbone, the model was trained via self-supervised contrastive learning to capture robust radiological representations. OrthoFoundation achieves state-of-the-art (SOTA) performance across 14 downstream tasks. It attained superior accuracy in X-ray osteoarthritis diagnosis and ranked first in MRI structural injury detection. The model demonstrated remarkable label efficiency, matching supervised baselines using only 50% of labeled data. Furthermore, despite being pre-trained on knee images, OrthoFoundation exhibited exceptional cross-anatomy generalization to the hip, shoulder, and ankle. OrthoFoundation represents a significant advancement toward general-purpose AI for musculoskeletal imaging.
By learning fundamental, joint-agnostic radiological semantics from large-scale multimodal data, it overcomes the limitations of conventional models and provides a robust framework for reducing annotation burdens and enhancing diagnostic accuracy in clinical practice.
[236] Co-PLNet: A Collaborative Point-Line Network for Prompt-Guided Wireframe Parsing
Chao Wang,Xuanying Li,Cheng Dai,Jinglei Feng,Yuxiang Luo,Yuqi Ouyang,Hao Qin
Main category: cs.CV
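TL;DR: This paper proposes Co-PLNet, a point-line collaborative wireframe parsing framework. A Point-Line Prompt Encoder (PLP-Encoder) and a Cross-Guidance Line Decoder (CGL-Decoder) jointly model line segments and their endpoints/junctions and optimize their consistency, improving accuracy, robustness, and real-time efficiency on the Wireframe and YorkUrban datasets.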
Details
Motivation: Existing wireframe parsing methods predict line segments and junctions separately and match them in post-processing, which causes inconsistencies and poor robustness. Method: Proposes Co-PLNet: 1) the Point-Line Prompt Encoder (PLP-Encoder) converts early detections into spatial prompt maps aligned with geometric attributes; 2) the Cross-Guidance Line Decoder (CGL-Decoder) jointly refines point and line predictions using sparse attention and complementary prompts, ensuring consistency and efficiency. Result: On the Wireframe and YorkUrban datasets, accuracy and robustness improve markedly while real-time inference efficiency remains good. Conclusion: Point-line collaborative modeling with spatial prompts improves the structural consistency and practicality of wireframe parsing for structured geometry perception. Abstract: Wireframe parsing aims to recover line segments and their junctions to form a structured geometric representation useful for downstream tasks such as Simultaneous Localization and Mapping (SLAM). Existing methods predict lines and junctions separately and reconcile them post-hoc, causing mismatches and reduced robustness. We present Co-PLNet, a point-line collaborative framework that exchanges spatial cues between the two tasks. Early detections are converted into spatial prompts via a Point-Line Prompt Encoder (PLP-Encoder), which encodes geometric attributes into compact and spatially aligned maps. A Cross-Guidance Line Decoder (CGL-Decoder) then refines predictions with sparse attention conditioned on complementary prompts, enforcing point-line consistency and efficiency. Experiments on Wireframe and YorkUrban show consistent improvements in accuracy and robustness, together with favorable real-time efficiency, demonstrating the effectiveness of the approach for structured geometry perception.
[237] Depth to Anatomy: Learning Internal Organ Locations from Surface Depth Images
Eytan Kats,Kai Geissler,Daniel Mensing,Jochen G. Hirsch,Stefan Heldman,Mattias P. Heinrich
Main category: cs.CV
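TL;DR: This paper proposes a learning-based framework that predicts the 3D locations and shapes of multiple internal organs directly from a single 2D depth image, without explicit surface reconstruction, improving the accuracy and efficiency of automated patient positioning.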
Details
Motivation: Automated patient positioning is important for optimizing scanning workflows and increasing patient throughput; depth information from RGB-D cameras can be used to estimate internal organ positions. Method: Synthesizes paired depth images and anatomical segmentation labels from a large-scale full-body MRI dataset and trains a unified convolutional neural network to predict the 3D locations and shapes of multiple organs directly from a single 2D depth image. Result: The method accurately localizes a range of anatomical structures, including bones and soft tissues, and the experiments show the potential of integrating depth sensors into radiology workflows. Conclusion: The method localizes organs accurately without surface reconstruction and could simplify scanning procedures and improve patient experience. Abstract: Automated patient positioning plays an important role in optimizing scanning procedures and improving patient throughput. Leveraging depth information captured by RGB-D cameras presents a promising approach for estimating internal organ positions, thereby enabling more accurate and efficient positioning. In this work, we propose a learning-based framework that directly predicts the 3D locations and shapes of multiple internal organs from single 2D depth images of the body surface. Utilizing a large-scale dataset of full-body MRI scans, we synthesize depth images paired with corresponding anatomical segmentations to train a unified convolutional neural network architecture. Our method accurately localizes a diverse set of anatomical structures, including bones and soft tissues, without requiring explicit surface reconstruction. Experimental results demonstrate the potential of integrating depth sensors into radiology workflows to streamline scanning procedures and enhance patient experience through automated patient positioning.
[238] Revisiting Aerial Scene Classification on the AID Benchmark
Subhajeet Das,Susmita Ghosh,Abhiroop Chatterjee
Main category: cs.CV
TL;DR: This paper reviews machine learning methods for aerial image classification and proposes Aerial-Y-Net, an attention-enhanced CNN with multi-scale feature fusion, achieving 91.72% accuracy on the AID dataset.
Details
Motivation: The heterogeneous nature of aerial images makes robust scene classification challenging, necessitating a comprehensive review and improved models. Method: A literature review of handcrafted features (e.g., SIFT, LBP), traditional CNNs (e.g., VGG, GoogLeNet), and advanced deep hybrid networks; plus the design of Aerial-Y-Net, a spatial attention-enhanced CNN with multi-scale feature fusion. Result: Aerial-Y-Net achieves 91.72% accuracy on the AID dataset, outperforming several baseline architectures. Conclusion: Attention mechanisms and multi-scale feature fusion significantly improve aerial image classification performance, and Aerial-Y-Net demonstrates the effectiveness of such design. Abstract: Aerial images play a vital role in urban planning and environmental preservation, as they consist of various structures, representing different types of buildings, forests, mountains, and unoccupied lands. Due to their heterogeneous nature, developing robust models for scene classification remains a challenge. In this study, we conduct a literature review of various machine learning methods for aerial image classification. Our survey covers a range of approaches from handcrafted features (e.g., SIFT, LBP) to traditional CNNs (e.g., VGG, GoogLeNet), and advanced deep hybrid networks. We have also designed Aerial-Y-Net, a spatial attention-enhanced CNN with a multi-scale feature fusion mechanism, which acts as an attention-based model and helps us to better understand the complexities of aerial images. Evaluated on the AID dataset, our model achieves 91.72% accuracy, outperforming several baseline architectures.
[239] Contextual Range-View Projection for 3D LiDAR Point Clouds
Seyedali Mousavi,Seyedhamidreza Mousavi,Masoud Daneshtalab
Main category: cs.CV
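TL;DR: This paper proposes two improved range-view projection methods for LiDAR point clouds, CAP and CWAP, which use instance centerness and class weights to mitigate many-to-one mapping conflicts and significantly improve semantic segmentation performance.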
Details
Motivation: Existing range-view projection keeps the nearest point (smallest depth), ignoring semantic relevance and object structure, which loses contextual information. Method: Proposes Centerness-Aware Projection (CAP) and Class-Weighted-Aware Projection (CWAP): CAP adjusts depth priority according to each point's distance from its instance center; CWAP selects points according to user-defined class weights. Result: On SemanticKITTI, CAP improves mIoU by up to 3.1% over the baseline; CWAP boosts targeted classes with negligible impact on the others. Conclusion: Projection strategies that combine instance structure with class priors preserve key contextual information better and improve downstream 2D semantic segmentation. Abstract: Range-view projection provides an efficient method for transforming 3D LiDAR point clouds into 2D range image representations, enabling effective processing with 2D deep learning models. However, a major challenge in this projection is the many-to-one conflict, where multiple 3D points are mapped onto the same pixel in the range image, requiring a selection strategy. Existing approaches typically retain the point with the smallest depth (closest to the LiDAR), disregarding semantic relevance and object structure, which leads to the loss of important contextual information. In this paper, we extend the depth-based selection rule by incorporating contextual information from both instance centers and class labels, introducing two mechanisms: Centerness-Aware Projection (CAP) and Class-Weighted-Aware Projection (CWAP). In CAP, point depths are adjusted according to their distance from the instance center, thereby prioritizing central instance points over noisy boundary and background points. In CWAP, object classes are prioritized through user-defined weights, offering flexibility in the projection strategy. Our evaluations on the SemanticKITTI dataset show that CAP preserves more instance points during projection, achieving up to a 3.1% mIoU improvement compared to the baseline. Furthermore, CWAP enhances the performance of targeted classes while having a negligible impact on the performance of other classes.
[240] SwipeGen: Bridging the Execution Gap in GUI Agents via Human-like Swipe Synthesis
Xuan Wang,Siyuan Su,Quantong Fu,Yongxiang Hu,Yangfan Zhou
Main category: cs.CV
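TL;DR: This paper proposes the SwipeGen pipeline for generating human-like swipe interactions and builds the first benchmark for evaluating the swipe execution capability of GUI agents. It then proposes the GUISwiper agent, which reaches 69.07% swipe execution accuracy, a 214% improvement over existing VLM baselines.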
Details
Motivation: Existing GUI agents use overly simplified strategies for swipe execution and cannot reproduce human behavior, which limits task completion rates. Method: Decomposes human swipe gestures into multiple quantifiable dimensions, designs the automated SwipeGen pipeline to synthesize human-like swipe interactions, and builds on it the first benchmark for swipe execution capability; the synthesized data is then used to train the GUI agent GUISwiper. Result: GUISwiper reaches 69.07% swipe execution accuracy, a 214% improvement over existing VLM baselines. Conclusion: Fine-grained modeling of human swipe behavior, together with a dedicated benchmark and agent, markedly improves the swipe execution capability of GUI agents and addresses the action-execution stage of GUI automation. Abstract: With the widespread adoption of Graphical User Interface (GUI) agents for automating GUI interaction tasks, substantial research has focused on improving GUI perception to ground task instructions into concrete action steps. However, the step execution capability of these agents has gradually emerged as a new bottleneck for task completion. In particular, existing GUI agents often adopt overly simplified strategies for handling swipe interactions, preventing them from accurately replicating human-like behavior. To address this limitation, we decompose human swipe gestures into multiple quantifiable dimensions and propose an automated pipeline SwipeGen to synthesize human-like swipe interactions through GUI exploration. Based on this pipeline, we construct and release the first benchmark for evaluating the swipe execution capability of GUI agents. Furthermore, leveraging the synthesized data, we propose GUISwiper, a GUI agent with enhanced interaction execution capabilities. Experimental results demonstrate that GUISwiper achieves a swipe execution accuracy of 69.07%, representing a 214% improvement over existing VLM baselines.
[241] A Tumor Aware DenseNet Swin Hybrid Learning with Boosted and Hierarchical Feature Spaces for Large-Scale Brain MRI Classification
Muhammad Ali Shah,Muhammad Mansoor Alam,Saddam Hussain Khan
Main category: cs.CV
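TL;DR: This paper proposes an Efficient Densely Swin Hybrid (EDSH) framework for brain tumor MRI analysis. Two tumor-aware experimental setups jointly capture fine-grained texture patterns and long-range contextual dependencies, significantly improving classification accuracy and recall.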
Details
Motivation: To address class-specific diagnostic challenges in brain tumor MRI analysis, in particular the detection difficulty caused by differences in shape, boundary, and location across tumor types (such as diffuse glioma, meningioma, and pituitary tumor). Method: Proposes the EDSH framework with two experimental setups: 1) a Boosted Feature Space (BFS) that fuses customized DenseNet (local) and Swin-T (global) branches with dimension alignment and feature boosting; 2) a hierarchical DenseNet-Swin architecture combining Deep Feature Extraction (DFE) with Dual Residual connections (DR), in which DenseNet serves as the CNN stem for local structural features and Swin-T models global tumor morphology; the DenseNet input stage and the Swin-T patch embedding and window attention are adapted to the MRI task. Result: Evaluated on a large-scale MRI dataset (40,260 images, four tumor classes), EDSH reaches 98.50% test accuracy and recall, clearly outperforming standalone CNNs, ViTs, and other hybrid models. Conclusion: Through local-global collaborative modeling and task-driven network customization, EDSH achieves high sensitivity and few false negatives in brain tumor MRI classification, indicating its potential for clinical decision support. Abstract: This study proposes an efficient Densely Swin Hybrid (EDSH) framework for brain tumor MRI analysis, designed to jointly capture fine-grained texture patterns and long-range contextual dependencies. Two tumor-aware experimental setups are introduced to address class-specific diagnostic challenges. The first setup employs a Boosted Feature Space (BFS), where independently customized DenseNet and Swin-T branches learn complementary local and global representations that are dimension-aligned, fused, and boosted, enabling highly sensitive detection of diffuse glioma patterns by learning the features of irregular shape, poorly defined mass, and heterogeneous texture. The second setup adopts a hierarchical DenseNet-Swin-T architecture with Deep Feature Extraction and Dual Residual connections (DFE and DR), in which DenseNet serves as a stem CNN for structured local feature learning, while Swin-T models global tumor morphology, effectively suppressing false negatives in meningioma and pituitary tumor classification by learning the features of well-defined mass, location (outside the brain), and enlargements in tumors (dural tail or upward extension). DenseNet is customized at the input level to match MRI spatial characteristics, leveraging dense residual connectivity to preserve texture information and mitigate vanishing-gradient effects.
In parallel, Swin-T is tailored through task-aligned patch embedding and shifted-window self-attention to efficiently capture hierarchical global dependencies. Extensive evaluation on a large-scale MRI dataset (40,260 images across four tumor classes) demonstrates consistent superiority over standalone CNNs, Vision Transformers, and hybrids, achieving 98.50% accuracy and recall on the unseen test set.
[242] PPISP: Physically-Plausible Compensation and Control of Photometric Variations in Radiance Field Reconstruction
Isaac Deutsch,Nicolas Moënne-Loccoz,Gavriel State,Zan Gojcic
Main category: cs.CV
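TL;DR: This paper proposes a Physically-Plausible ISP (PPISP) correction module that addresses photometric inconsistencies in multi-view 3D reconstruction caused by differences in camera optics and image signal processing (ISP). The module uses physical modeling to separate camera-intrinsic from capture-dependent effects, and a dedicated controller predicts ISP parameters for novel views, improving generalization and the realism of evaluation.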
Details
Motivation: Existing methods (such as per-frame latent variables or affine color correction) lack physical grounding and generalize poorly to novel views, so they cope badly with the photometric inconsistencies introduced by camera ISPs in real scenes. Method: Proposes the physically interpretable PPISP correction module, which disentangles camera-intrinsic from capture-dependent effects, and a PPISP controller that learns from the input views to predict ISP parameters (such as auto exposure and white balance) for novel views, enabling fair evaluation without ground-truth images. Result: Achieves SOTA performance on standard benchmarks, supports metadata integration, and offers intuitive control over ISP parameters. Conclusion: Through physical modeling and interpretable ISP parameter prediction, PPISP makes multi-view 3D reconstruction considerably more robust to photometric variation and better at generalizing, giving a more reliable and interpretable solution for real-world use. Abstract: Multi-view 3D reconstruction methods remain highly sensitive to photometric inconsistencies arising from camera optical characteristics and variations in image signal processing (ISP). Existing mitigation strategies such as per-frame latent variables or affine color corrections lack physical grounding and generalize poorly to novel views. We propose the Physically-Plausible ISP (PPISP) correction module, which disentangles camera-intrinsic and capture-dependent effects through physically based and interpretable transformations. A dedicated PPISP controller, trained on the input views, predicts ISP parameters for novel viewpoints, analogous to auto exposure and auto white balance in real cameras. This design enables realistic and fair evaluation on novel views without access to ground-truth images. PPISP achieves SoTA performance on standard benchmarks, while providing intuitive control and supporting the integration of metadata when available. The source code is available at: https://github.com/nv-tlabs/ppisp
[243] Beyond Rigid: Benchmarking Non-Rigid Video Editing
Bingzheng Qu,Kehai Chen,Xuefeng Bai,Jun Yu,Min Zhang
Main category: cs.CV
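TL;DR: This paper proposes NRVBench, the first benchmark dedicated to evaluating non-rigid video editing. It comprises a high-quality dataset, a new evaluation metric NRVE-Acc, and a training-free baseline VM-Edit, which markedly improves physical plausibility and temporal consistency.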
Details
Motivation: Existing text-driven video editing methods suffer from physical distortion and temporal flicker when generating coherent non-rigid deformations, and no dedicated, comprehensive evaluation benchmark exists. Method: Builds a dataset of 180 physics-based non-rigid motion videos with 2,340 fine-grained instructions; proposes NRVE-Acc, an evaluation metric based on vision-language models; designs VM-Edit, a training-free baseline with dual-region denoising. Result: Experiments show that current methods fall short on physical plausibility, while VM-Edit performs well on both standard metrics and the proposed NRVE-Acc. Conclusion: NRVBench provides a standardized evaluation platform for advancing physics-aware video editing. Abstract: Despite the remarkable progress in text-driven video editing, generating coherent non-rigid deformations remains a critical challenge, often plagued by physical distortion and temporal flicker. To bridge this gap, we propose NRVBench, the first dedicated and comprehensive benchmark designed to evaluate non-rigid video editing. First, we curate a high-quality dataset consisting of 180 non-rigid motion videos from six physics-based categories, equipped with 2,340 fine-grained task instructions and 360 multiple-choice questions. Second, we propose NRVE-Acc, a novel evaluation metric based on Vision-Language Models that can rigorously assess physical compliance, temporal consistency, and instruction alignment, overcoming the limitations of general metrics in capturing complex dynamics. Third, we introduce a training-free baseline, VM-Edit, which utilizes a dual-region denoising mechanism to achieve structure-aware control, balancing structural preservation and dynamic deformation. Extensive experiments demonstrate that while current methods have shortcomings in maintaining physical plausibility, our method achieves excellent performance across both standard and proposed metrics. We believe the benchmark could serve as a standard testing platform for advancing physics-aware video editing.
[244] Q-Bench-Portrait: Benchmarking Multimodal Large Language Models on Portrait Image Quality Perception
Sijing Wu,Yunhao Li,Zicheng Zhang,Qi Jia,Xinyue Li,Huiyu Duan,Xiongkuo Min,Guangtao Zhai
Main category: cs.CV
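TL;DR: This paper proposes Q-Bench-Portrait, the first comprehensive benchmark dedicated to portrait image quality perception, with 2,765 image-question-answer triplets. Evaluating 25 multimodal large language models (MLLMs) on it shows that current models still lag clearly behind human judgment in portrait image perception.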
Details
Motivation: Existing MLLMs perform well on low-level vision tasks for generic images, but their ability to perceive the quality of portrait images, which have distinct structural and perceptual properties, has not been studied systematically. Method: Builds the Q-Bench-Portrait benchmark, covering diverse portrait image sources, comprehensive quality dimensions (technical distortions, AIGC-specific distortions, aesthetics), and multiple question formats (single-choice, multiple-choice, true/false, open-ended); systematically evaluates 20 open-source and 5 closed-source MLLMs on it. Result: Current MLLMs are limited and imprecise on portrait image quality perception, with a clear gap relative to human judgments. Conclusion: Q-Bench-Portrait is a useful tool for evaluating and improving the portrait image quality perception of MLLMs and should encourage related research on both general-purpose and domain-specific MLLMs. Abstract: Recent advances in multimodal large language models (MLLMs) have demonstrated impressive performance on existing low-level vision benchmarks, which primarily focus on generic images. However, their capabilities to perceive and assess portrait images, a domain characterized by distinct structural and perceptual properties, remain largely underexplored. To this end, we introduce Q-Bench-Portrait, the first holistic benchmark specifically designed for portrait image quality perception, comprising 2,765 image-question-answer triplets and featuring (1) diverse portrait image sources, including natural, synthetic distortion, AI-generated, artistic, and computer graphics images; (2) comprehensive quality dimensions, covering technical distortions, AIGC-specific distortions, and aesthetics; and (3) a range of question formats, including single-choice, multiple-choice, true/false, and open-ended questions, at both global and local levels. Based on Q-Bench-Portrait, we evaluate 20 open-source and 5 closed-source MLLMs, revealing that although current models demonstrate some competence in portrait image perception, their performance remains limited and imprecise, with a clear gap relative to human judgments. We hope that the proposed benchmark will foster further research into enhancing the portrait image perception capabilities of both general-purpose and domain-specific MLLMs.
[245] OREHAS: A fully automated deep-learning pipeline for volumetric endolymphatic hydrops quantification in MRI
Caterina Fuster-Barceló,Claudia Castrillón,Laura Rodrigo-Muñoz,Victor Manuel Vega-Suárez,Nicolás Pérez-Fernández,Gorka Bastarrika,Arrate Muñoz-Barrutia
Main category: cs.CV
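TL;DR: OREHAS is a fully automatic system for volumetric quantification of endolymphatic hydrops (EH) from routine 3D MRI. With only 3-6 annotated slices per patient it delivers accurate segmentation and volume ratio (ELR) computation, outperforms the clinical software syngo.via, and is suitable for clinical adoption.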
Details
Motivation: Existing methods for quantifying endolymphatic hydrops (EH) depend on extensive manual annotation or on clinical software (such as syngo.via); they are subjective, inconsistent, and prone to overestimating volumes, so an automatic, reliable quantification tool that fits routine MRI protocols is needed. Method: Proposes the fully automatic OREHAS pipeline, which integrates slice classification, inner ear localization, and sequence-specific segmentation; a lightly supervised deep learning model is trained on only 3-6 annotated slices per patient and computes the per-ear endolymphatic-to-vestibular volume ratio (ELR) directly from whole 3D MRI volumes. Result: OREHAS achieves Dice scores of 0.90 on SPACE-MRC and 0.75 on REAL-IR; in the external validation cohort its VSI against expert annotation is 74.3%, well above syngo.via (42.5%); its endolymphatic volumes are smaller and physiologically more plausible, while its vestibular volumes agree with syngo.via. Conclusion: OREHAS provides fully automatic EH quantification with little supervision, good generalization, and clinical compatibility; it improves reproducibility and objectivity and gives a reliable technical basis for large-scale studies and for recalibrating diagnostic thresholds. Abstract: We present OREHAS (Optimized Recognition & Evaluation of volumetric Hydrops in the Auditory System), the first fully automatic pipeline for volumetric quantification of endolymphatic hydrops (EH) from routine 3D-SPACE-MRC and 3D-REAL-IR MRI. The system integrates three components -- slice classification, inner ear localization, and sequence-specific segmentation -- into a single workflow that computes per-ear endolymphatic-to-vestibular volume ratios (ELR) directly from whole MRI volumes, eliminating the need for manual intervention. Trained with only 3 to 6 annotated slices per patient, OREHAS generalized effectively to full 3D volumes, achieving Dice scores of 0.90 for SPACE-MRC and 0.75 for REAL-IR. In an external validation cohort with complete manual annotations, OREHAS closely matched expert ground truth (VSI = 74.3%) and substantially outperformed the clinical syngo.via software (VSI = 42.5%), which tended to overestimate endolymphatic volumes. Across 19 test patients, vestibular measurements from OREHAS were consistent with syngo.via, while endolymphatic volumes were systematically smaller and more physiologically realistic. These results show that reliable and reproducible EH quantification can be achieved from standard MRI using limited supervision. By combining efficient deep-learning-based segmentation with a clinically aligned volumetric workflow, OREHAS reduces operator dependence, ensures methodological consistency, and remains compatible with established imaging protocols.
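The two quantities reported above, the Dice score and the endolymphatic-to-vestibular volume ratio (ELR), can be computed from binary segmentation masks as in the following minimal sketch. It assumes NumPy boolean masks and a known voxel size; the function names and the convention that ELR is simply the ratio of the two mask volumes are illustrative assumptions, not taken from the OREHAS code.

```python
import numpy as np


def dice(pred, truth) -> float:
    """Dice overlap 2|A∩B| / (|A|+|B|) between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0  # both masks empty: treated here as perfect agreement
    return float(2.0 * np.logical_and(pred, truth).sum() / total)


def volume_mm3(mask, voxel_size_mm) -> float:
    """Mask volume: number of foreground voxels times the volume of one voxel."""
    return float(np.asarray(mask, dtype=bool).sum() * np.prod(voxel_size_mm))


def elr(endolymph_mask, vestibular_mask, voxel_size_mm) -> float:
    """Endolymphatic-to-vestibular volume ratio for one ear (assumed definition)."""
    vestibular = volume_mm3(vestibular_mask, voxel_size_mm)
    if vestibular == 0:
        raise ValueError("vestibular mask is empty")
    return volume_mm3(endolymph_mask, voxel_size_mm) / vestibular


# Toy 8x8x8 volume with 0.5 mm isotropic voxels (made-up data).
vestibule = np.zeros((8, 8, 8), dtype=bool)
vestibule[2:6, 2:6, 2:6] = True        # 64 voxels
endolymph = np.zeros((8, 8, 8), dtype=bool)
endolymph[2:4, 2:4, 2:6] = True        # 16 voxels
ratio = elr(endolymph, vestibule, (0.5, 0.5, 0.5))
```

Because both volumes share the same voxel size, the ratio depends only on the voxel counts; the voxel size matters when absolute volumes are compared between tools.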
The approach provides a robust foundation for large-scale studies and for recalibrating clinical diagnostic thresholds based on accurate volumetric measurements of the inner ear.
[246] Gaze Prediction in Virtual Reality Without Eye Tracking Using Visual and Head Motion Cues
Christos Petrou,Harris Partaourides,Athanasios Balomenos,Yannis Kopsinis,Sotirios Chatzis
Main category: cs.CV
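TL;DR: This paper proposes a VR gaze prediction framework that needs no direct eye tracking. It fuses HMD motion signals with visual saliency features from video frames, using the UniSal encoder and TSMixer/LSTM temporal models, and clearly outperforms baselines on the EHTask dataset and on commercial VR hardware.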
Details
Motivation: In VR applications, direct eye tracking is often limited by hardware or privacy concerns, yet gaze prediction is essential for reducing sensor latency and enabling techniques such as foveated rendering. Method: Proposes a gaze prediction framework that fuses HMD motion signals with visual saliency features extracted by UniSal, and uses two lightweight temporal models, TSMixer and LSTM, to forecast future gaze directions. Result: Experiments on the EHTask dataset and on commercial VR hardware show that the method consistently outperforms baselines such as Center-of-HMD and Mean Gaze. Conclusion: The predictive model reduces perceptual lag and improves natural interaction in VR, and suits settings where direct eye tracking cannot be deployed. Abstract: Gaze prediction plays a critical role in Virtual Reality (VR) applications by reducing sensor-induced latency and enabling computationally demanding techniques such as foveated rendering, which rely on anticipating user attention. However, direct eye tracking is often unavailable due to hardware limitations or privacy concerns. To address this, we present a novel gaze prediction framework that combines Head-Mounted Display (HMD) motion signals with visual saliency cues derived from video frames. Our method employs UniSal, a lightweight saliency encoder, to extract visual features, which are then fused with HMD motion data and processed through a time-series prediction module. We evaluate two lightweight architectures, TSMixer and LSTM, for forecasting future gaze directions. Experiments on the EHTask dataset, along with deployment on commercial VR hardware, show that our approach consistently outperforms baselines such as Center-of-HMD and Mean Gaze. These results demonstrate the effectiveness of predictive gaze modeling in reducing perceptual lag and enhancing natural interaction in VR environments where direct eye tracking is constrained.
[247] Estimation of geometric transformation matrices using grid-shaped pilot signals
Rinka Kawano,Masaki Kawamura
Main category: cs.CV
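TL;DR: This paper proposes a digital watermarking method based on a grid-shaped pilot signal. It analyzes how the pilot signal is distorted by geometric transformations (such as scaling, rotation, shearing, and cropping) and uses the Radon transform to estimate the transformation matrix, giving robust synchronization and watermark extraction, in particular against cropping attacks.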
Details
Motivation: Existing digital watermarking methods struggle to synchronize (that is, locate the watermark embedding position) under geometric attacks such as cropping, so extraction fails; cropping changes the image origin and makes synchronization harder, and current methods are not robust enough against it. Method: Embeds a grid-shaped pilot signal with distinct horizontal and vertical values in the image; after a geometric transformation, analyzes the structure of the distorted grid; uses the Radon transform to estimate grid angles and intervals and uses the different encoding of horizontal and vertical lines to determine grid orientation and remove ambiguity, then estimates the geometric transformation matrix and synchronizes. Result: Simulations show that the method estimates the transformation matrix accurately with low error under anisotropic scaling, rotation, shearing, and cropping, for both single and composite attacks. Conclusion: The pilot-signal and Radon-transform based synchronization markedly improves the robustness of watermarking against cropping and other geometric attacks, supporting reliable watermark extraction under complex geometric distortion in practice. Abstract: Digital watermarking techniques are essential to prevent unauthorized use of images. Since pirated images are often geometrically distorted by operations such as scaling and cropping, accurate synchronization - detecting the embedding position of the watermark - is critical for proper extraction. In particular, cropping changes the origin of the image, making synchronization difficult. However, few existing methods are robust against cropping. To address this issue, we propose a watermarking method that estimates geometric transformations applied to a stego image using a pilot signal, allowing synchronization even after cropping. A grid-shaped pilot signal with distinct horizontal and vertical values is embedded in the image. When the image is transformed, the grid is also distorted. By analyzing this distortion, the transformation matrix can be estimated. Applying the Radon transform to the distorted image allows estimation of the grid angles and intervals. In addition, since the horizontal and vertical grid lines are encoded differently, the grid orientation can be determined, which reduces ambiguity. To validate our method, we performed simulations with anisotropic scaling, rotation, shearing, and cropping. The results show that the proposed method accurately estimates transformation matrices with low error under both single and composite attacks.
[248] ARMOR: Agentic Reasoning for Methods Orchestration and Reparameterization for Robust Adversarial Attacks
Gabriel Lee Jun Rong,Christos Korgialas,Dion Jia Xu Ho,Pai Chet Ng,Xiaoxiao Miao,Konstantinos N. Plataniotis
Main category: cs.CV
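TL;DR: This paper proposes the ARMOR framework, in which vision language models (VLMs) and large language models (LLMs) jointly orchestrate several adversarial attack methods (CW, JSMA, STA), giving semantically aware, adaptively reparameterized attack orchestration that improves cross-architecture transfer and attack success on both blind and white-box targets.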
Details
Motivation: Existing automated attack suites are static, fixed-sequence ensembles that lack strategic adaptation and semantic awareness. Method: Proposes the ARMOR framework, in which VLM-guided agents orchestrate three adversarial attack primitives (CW, JSMA, and STA) and generate and blend perturbations through a shared "Mixing Desk"; LLMs adaptively tune the parameters of the parallel attack agents in a real-time closed loop, exploiting image-specific semantic vulnerabilities. Result: On standard benchmarks, ARMOR improves cross-architecture attack transfer; it outputs a blended attack for blind targets, selects the best or a blended attack for white-box targets using a confidence-and-SSIM score, and reliably fools both. Conclusion: By introducing agentic reasoning and semantics-driven attack orchestration, ARMOR moves beyond the traditional static attack paradigm toward adaptive adversarial attacks. Abstract: Existing automated attack suites operate as static ensembles with fixed sequences, lacking strategic adaptation and semantic awareness. This paper introduces the Agentic Reasoning for Methods Orchestration and Reparameterization (ARMOR) framework to address these limitations. ARMOR orchestrates three canonical adversarial primitives, Carlini-Wagner (CW), Jacobian-based Saliency Map Attack (JSMA), and Spatially Transformed Attacks (STA), via Vision Language Model (VLM)-guided agents that collaboratively generate and synthesize perturbations through a shared "Mixing Desk". Large Language Models (LLMs) adaptively tune and reparameterize parallel attack agents in a real-time, closed-loop system that exploits image-specific semantic vulnerabilities. On standard benchmarks, ARMOR achieves improved cross-architecture transfer and reliably fools both settings, delivering a blended output for blind targets and selecting the best attack or blended attacks for white-box targets using a confidence-and-SSIM score.
[249] Efficient Complex-Valued Vision Transformers for MRI Classification Directly from k-Space
Moritz Rempe,Lukas T. Rotkopf,Marco Schlimbach,Helmut Becker,Fabian Hörst,Johannes Haubold,Philipp Dammann,Kevin Kröninger,Jens Kleesiek
Main category: cs.CV
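TL;DR: This paper proposes kViT, a complex-valued Vision Transformer that classifies directly on raw MRI k-space data. A radial k-space patching strategy adapts it to MRI physics, keeping performance high while substantially improving computational efficiency and robustness.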
Details
Motivation: Most deep learning methods for MRI operate on reconstructed magnitude images, which discards phase information and depends on computationally expensive Fourier transforms; the local operations of conventional networks are poorly suited to the global, non-local nature of k-space data. Method: Proposes a complex-valued Vision Transformer (kViT) with a radial k-space patching strategy that matches the spectral energy distribution of k-space, enabling end-to-end classification directly on k-space data. Result: On fastMRI and in-house datasets, kViT matches image-domain SOTA models such as ResNet, EfficientNet, and ViT in classification performance; it is more robust to high acceleration factors and reduces training VRAM consumption by up to 68×. Conclusion: kViT offers a resource-efficient, direct-from-scanner approach to AI analysis of MRI and advances k-space-native deep learning. Abstract: Deep learning applications in Magnetic Resonance Imaging (MRI) predominantly operate on reconstructed magnitude images, a process that discards phase information and requires computationally expensive transforms. Standard neural network architectures rely on local operations (convolutions or grid-patches) that are ill-suited for the global, non-local nature of raw frequency-domain (k-Space) data. In this work, we propose a novel complex-valued Vision Transformer (kViT) designed to perform classification directly on k-Space data. To bridge the geometric disconnect between current architectures and MRI physics, we introduce a radial k-Space patching strategy that respects the spectral energy distribution of the frequency-domain. Extensive experiments on the fastMRI and in-house datasets demonstrate that our approach achieves classification performance competitive with state-of-the-art image-domain baselines (ResNet, EfficientNet, ViT). Crucially, kViT exhibits superior robustness to high acceleration factors and offers a paradigm shift in computational efficiency, reducing VRAM consumption during training by up to 68× compared to standard methods. This establishes a pathway for resource-efficient, direct-from-scanner AI analysis.
[250] Larger than memory image processing
Jon Sporring,David Stansby
Main category: cs.CV
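TL;DR: This paper proposes a streaming analysis architecture for extremely large images (for example, petabyte-scale electron-microscopy data). It minimizes disk I/O through slice-based 1D streaming, sweep-based execution, windowed operations, and overlap-aware tiling. It also designs a domain-specific language (DSL) that automatically optimizes pipelines at compile time and run time (window-size selection, stage fusion, stream scheduling), achieving near-linear I/O scans and bounded memory use without loading the full dataset into memory.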
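The windowed, slice-based streaming described in this summary can be illustrated with a small sketch. It is not the paper's DSL: `windowed_stream`, `z_mean3`, and `fake_volume` are hypothetical names, and the in-memory generator stands in for a reader that would load 2D slices from disk.

```python
from collections import deque

def windowed_stream(slices, window):
    """Yield tuples of `window` consecutive slices while holding only
    `window` slices in memory at any time."""
    buf = deque(maxlen=window)
    for s in slices:          # each slice is read exactly once, in order
        buf.append(s)
        if len(buf) == window:
            yield tuple(buf)

def z_mean3(slices):
    """3-slice moving average along the sweep axis (a windowed operation)."""
    for a, b, c in windowed_stream(slices, 3):
        yield [[(p + q + r) / 3.0 for p, q, r in zip(ra, rb, rc)]
               for ra, rb, rc in zip(a, b, c)]

def fake_volume(n_slices, h=2, w=2):
    """Stand-in for a reader that loads 2D slices from disk one by one."""
    for z in range(n_slices):
        yield [[float(z)] * w for _ in range(h)]

# Five input slices produce three averaged output slices.
for out_slice in z_mean3(fake_volume(5)):
    print(out_slice[0][0])
```

Because the sweep runs along a single axis, memory use is bounded by the window size rather than the volume size, and every slice is touched once per pass.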
Details
Motivation: Analysis of extremely large images (PB-scale electron microscopy, TB-scale organ atlases) is severely limited by I/O; conventional 3D-chunk layouts cause heavy redundant disk access in multi-pass algorithms, so an efficient streaming paradigm that balances locality, sequential read/write, and limited memory is needed. Method: Proposes a 1D streaming architecture built on 2D slices; formalizes sweep-based execution, windowed operations, and overlap-aware tiling to reduce redundant access; designs a domain-specific language (DSL) that supports automatic compile-time and run-time optimization (window size, stage fusion, stream tee/zip, multi-pass scheduling) for limited-memory machines. Result: While preserving algorithmic correctness, the method markedly reduces the number of disk accesses (to roughly 1/9 of a 3D-chunk layout), achieves near-linear I/O scans and predictable memory footprints, and reports large throughput gains on PB/TB-scale data; the DSL integrates with existing segmentation and morphology tooling. Conclusion: For extremely large images, structuring analysis as sequential streaming relieves the I/O bottleneck more fundamentally than relying on 3D-chunk layouts, and a DSL with intrinsic streaming knowledge is a key route to high-throughput, low-memory analysis. Abstract: This report addresses larger-than-memory image analysis for petascale datasets such as 1.4 PB electron-microscopy volumes and 150 TB human-organ atlases. We argue that performance is fundamentally I/O-bound. We show that structuring analysis as streaming passes over data is crucial. For 3D volumes, two representations are popular: stacks of 2D slices (e.g., directories or multi-page TIFF) and 3D chunked layouts (e.g., Zarr/HDF5). While for a few algorithms, chunked layout on disk is crucial to keep disk I/O at a minimum, we show how the slice-based streaming architecture can be built on top of either image representation in a manner that minimizes disk I/O. This is particularly advantageous for algorithms relying on neighbouring values, since the slice-based streaming architecture is 1D, which implies that there are only 2 possible sweeping orders, both of which are aligned with the order in which images are read from the disk. This is in contrast to 3D chunks, in which any sweep cannot be done without accessing each chunk at least 9 times. We formalize this with sweep-based execution (natural 2D/3D orders), windowed operations, and overlap-aware tiling to minimize redundant access. Building on these principles, we introduce a domain-specific language (DSL) that encodes algorithms with intrinsic knowledge of their optimal streaming and memory use; the DSL performs compile-time and run-time pipeline analyses to automatically select window sizes, fuse stages, tee and zip streams, and schedule passes for limited-RAM machines, yielding near-linear I/O scans and predictable memory footprints. The approach integrates with existing tooling for segmentation and morphology but reframes pre/post-processing as pipelines that privilege sequential read/write patterns, delivering substantial throughput gains for extremely large images without requiring full-volume residency in memory.
[251] Comparative Evaluation of Machine Learning Algorithms for Affective State Recognition from Children's Drawings
Aura Loredana Dan
Main category: cs.CV
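TL;DR: This paper compares three deep learning models (MobileNet, EfficientNet, and VGG16) on emotion recognition from children's drawings. The models are trained with transfer learning on an expert-annotated dataset, and the paper examines the trade-offs between lightweight and deeper models in accuracy, robustness, and computational efficiency.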
Details
Motivation: Conventional methods for assessing children's emotions are intrusive, subjective, and inconsistent, so a non-intrusive, objective, repeatable automatic method is needed; children's drawings, as a natural form of expression, offer a new route to emotion recognition. Method: Using a dataset of children's drawings and a transfer-learning strategy, compares the emotion-classification performance of MobileNet, EfficientNet, and VGG16 within a unified experimental framework, focusing on classification accuracy, robustness, and computational efficiency. Result: The models differ markedly in emotion-classification performance; lightweight models (such as MobileNet) have clear advantages in computational efficiency and real-time use, while deeper models (such as VGG16) may gain slightly in accuracy or robustness at the cost of higher resource consumption. Conclusion: For mobile and real-time emotion recognition from children's drawings, model depth should be traded off against efficiency according to the deployment scenario; lightweight architectures have more deployment potential, especially for clinical decision support and early screening tools. Abstract: Autism spectrum disorder (ASD) represents a neurodevelopmental condition characterized by difficulties in expressing emotions and communication, particularly during early childhood. Understanding the affective state of children at an early age remains challenging, as conventional assessment methods are often intrusive, subjective, or difficult to apply consistently. This paper builds upon previous work on affective state recognition from children's drawings by presenting a comparative evaluation of machine learning models for emotion classification. Three deep learning architectures (MobileNet, EfficientNet, and VGG16) are evaluated within a unified experimental framework to analyze classification performance, robustness, and computational efficiency. The models are trained using transfer learning on a dataset of children's drawings annotated with emotional labels provided by psychological experts. The results highlight important trade-offs between lightweight and deeper architectures when applied to drawing-based affective computing tasks, particularly in mobile and real-time application contexts.
[252] On Procrustes Contamination in Machine Learning Applications of Geometric Morphometrics
Lloyd Austin Courtenay
Main category: cs.CV
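TL;DR: This paper shows that aligning all specimens with Generalized Procrustes Analysis (GPA), as is standard in geometric morphometrics (GMM), introduces statistical dependence that contaminates machine learning models. It proposes a new alignment procedure in which test specimens are aligned separately to the training set, validates it through simulation, and reveals fundamental statistical constraints of Procrustes shape space.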
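The core of the realignment idea in this summary is that a test specimen is fitted to a reference built from training data only, so no test information leaks into the alignment. A minimal 2D sketch follows, assuming an ordinary Procrustes fit (translation, scale, rotation, no reflection); the function names are hypothetical and this is not the author's released code.

```python
import math

def center_and_scale(shape):
    """Remove translation and scale a 2D landmark set to unit centroid size."""
    n = len(shape)
    mx = sum(x for x, _ in shape) / n
    my = sum(y for _, y in shape) / n
    centered = [(x - mx, y - my) for x, y in shape]
    size = math.sqrt(sum(x * x + y * y for x, y in centered))
    return [(x / size, y / size) for x, y in centered]

def align_to_reference(shape, reference):
    """Ordinary Procrustes fit of one 2D specimen onto a fixed reference.

    `reference` is assumed to be the mean shape estimated from the
    TRAINING specimens only, already centred and scaled to unit size.
    It is never updated with the specimen being aligned.
    """
    s = center_and_scale(shape)
    a = sum(rx * sx + ry * sy for (sx, sy), (rx, ry) in zip(s, reference))
    b = sum(ry * sx - rx * sy for (sx, sy), (rx, ry) in zip(s, reference))
    theta = math.atan2(b, a)              # least-squares rotation angle
    c, d = math.cos(theta), math.sin(theta)
    return [(c * x - d * y, d * x + c * y) for x, y in s]

# Toy usage: a rectangle stands in for the training mean shape.
train_mean = center_and_scale([(0, 0), (2, 0), (2, 1), (0, 1)])
test_specimen = [(10, 10), (10, 14), (8, 14), (8, 10)]
aligned = align_to_reference(test_specimen, train_mean)
```

In a full pipeline the training mean would itself come from a GPA run on the training specimens alone, and each test specimen would pass through `align_to_reference` before prediction.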
Details
Motivation: In the standard GMM workflow, global GPA is applied before the train/test split, which can cause data leakage and statistical dependence and hurt the generalization of ML models. Method: Evaluates GPA contamination with controlled 2D/3D simulations; proposes a new preprocessing step that aligns test specimens separately to the training set; derives the relationship between Procrustes tangent-space degrees of freedom and RMSE scaling; tests the effect of spatial autocorrelation with linear and convolutional regression. Result: Finds a robust "diagonal" relationship between sample size and number of landmarks whose slope can be derived analytically from the tangent-space degrees of freedom; ignoring spatial autocorrelation among landmarks markedly degrades model performance; the new alignment removes cross-sample dependence. Conclusion: GMM data need careful preprocessing before ML; a training-set-driven realignment strategy is recommended; the study clarifies the statistical limits inherent to Procrustes shape space and gives theoretical grounding and practical guidance for later work. Abstract: Geometric morphometrics (GMM) is widely used to quantify shape variation, more recently serving as input for machine learning (ML) analyses. Standard practice aligns all specimens via Generalized Procrustes Analysis (GPA) prior to splitting data into training and test sets, potentially introducing statistical dependence and contaminating downstream predictive models. Here, the effects of GPA-induced contamination are formally characterised using controlled 2D and 3D simulations across varying sample sizes, landmark densities, and allometric patterns. A novel realignment procedure is proposed, whereby test specimens are aligned to the training set prior to model fitting, eliminating cross-sample dependency. Simulations reveal a robust "diagonal" in sample-size vs. landmark-space, reflecting the scaling of RMSE under isotropic variation, with slopes analytically derived from the degrees of freedom in Procrustes tangent space. The importance of spatial autocorrelation among landmarks is further demonstrated using linear and convolutional regression models, highlighting performance degradation when landmark relationships are ignored. This work establishes the need for careful preprocessing in ML applications of GMM, provides practical guidelines for realignment, and clarifies fundamental statistical constraints inherent to Procrustes shape space.
[253] 3DGesPolicy: Phoneme-Aware Holistic Co-Speech Gesture Generation Based on Action Control
Xuanmeng Sha,Liyun Zhang,Tomohiro Mashita,Naoya Chiba,Yuki Uranishi
Main category: cs.CV
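TL;DR: This paper proposes 3DGesPolicy, an action-based diffusion-policy framework for generating semantically and spatially coherent holistic co-speech gestures, with a GAP fusion module that provides fine-grained alignment among speech, body motion, and facial expressions.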
Details
Motivation: Existing methods for holistic co-speech gesture generation suffer from semantically incoherent coordination and spatial instability, and struggle to integrate body motion with facial expressions naturally. Method: Reformulates holistic gesture generation as a continuous trajectory control problem, using diffusion policy from robotics to model frame-to-frame variations as unified holistic actions; introduces a Gesture-Audio-Phoneme (GAP) fusion module that deeply integrates and refines multimodal signals. Result: Extensive quantitative and qualitative experiments on the BEAT2 dataset show that 3DGesPolicy outperforms current state-of-the-art methods in generating natural, expressive, and highly speech-aligned holistic gestures. Conclusion: Through action modeling and multimodal fusion, 3DGesPolicy addresses semantic coherence and spatial stability in co-speech gesture generation, improving generation quality and speech alignment. Abstract: Generating holistic co-speech gestures that integrate full-body motion with facial expressions suffers from semantically incoherent coordination on body motion and spatially unstable meaningless movements due to existing part-decomposed or frame-level regression methods. We introduce 3DGesPolicy, a novel action-based framework that reformulates holistic gesture generation as a continuous trajectory control problem through diffusion policy from robotics. By modeling frame-to-frame variations as unified holistic actions, our method effectively learns inter-frame holistic gesture motion patterns and ensures both spatially and semantically coherent movement trajectories that adhere to realistic motion manifolds. To further bridge the gap in expressive alignment, we propose a Gesture-Audio-Phoneme (GAP) fusion module that can deeply integrate and refine multi-modal signals, ensuring structured and fine-grained alignment between speech semantics, body motion, and facial expressions. Extensive quantitative and qualitative experiments on the BEAT2 dataset demonstrate the effectiveness of our 3DGesPolicy over other state-of-the-art methods in generating natural, expressive, and highly speech-aligned holistic gestures.
[254] Fair-Eye Net: A Fair, Trustworthy, Multimodal Integrated Glaucoma Full Chain AI System
Wenbin Wei,Suyuan Yao,Cheng Huang,Xiangyu Gao
Main category: cs.CV
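TL;DR: This paper proposes Fair-Eye Net, a fair and reliable multimodal AI system that integrates fundus photographs, OCT structural metrics, visual-field functional indices, and demographic factors for glaucoma screening, follow-up, and risk alerting. Its dual-stream heterogeneous fusion architecture is combined with an uncertainty-aware hierarchical gating strategy, and fairness is treated as a core optimization objective, which markedly reduces the racial disparity in missed diagnoses and improves clinical deployability and global eye-health equity.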
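As a quick check of the disparity figure reported for this paper (12.31% reduced to 3.28%, stated as a 73.4% cut), and assuming the percentage is the relative reduction of the gap in false-negative rate between groups (the summary does not spell out the definition):

```latex
\[
\frac{12.31\% - 3.28\%}{12.31\%} = \frac{9.03}{12.31} \approx 0.734 = 73.4\%
\]
```

Under that reading the reported numbers are internally consistent.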
Details
Motivation: Current glaucoma screening and progression assessment rely on single or loosely linked examinations, which leads to subjectivity and fragmented care; uneven access to high-quality imaging equipment and specialists further undermines diagnostic consistency and equity in real-world use. Method: Proposes Fair-Eye Net: a dual-stream heterogeneous fusion architecture integrates fundus photos, OCT structural parameters, visual-field functional indices, and demographic features; an uncertainty-aware hierarchical gating mechanism provides selective prediction and safe referral; a fairness constraint reduces missed diagnoses in disadvantaged subgroups, and fairness and clinical reliability are optimized jointly through multitask learning. Result: Reaches an AUC of 0.912 (96.7% specificity); cuts the racial disparity in false-negative rate by 73.4% (from 12.31% to 3.28%); keeps cross-domain performance stable; issues risk alerts 3 to 12 months early (92% sensitivity, 88% specificity). Conclusion: Fair-Eye Net builds fairness into the model design rather than adjusting for it afterwards, combines accuracy, robustness, and clinical safety, and offers a reproducible, scalable AI solution for end-to-end glaucoma management that supports global eye-health equity. Abstract: Glaucoma is a top cause of irreversible blindness globally, making early detection and longitudinal follow-up pivotal to preventing permanent vision loss. Current screening and progression assessment, however, rely on single tests or loosely linked examinations, introducing subjectivity and fragmented care. Limited access to high-quality imaging tools and specialist expertise further compromises consistency and equity in real-world use. To address these gaps, we developed Fair-Eye Net, a fair, reliable multimodal AI system closing the clinical loop from glaucoma screening to follow-up and risk alerting. It integrates fundus photos, OCT structural metrics, VF functional indices, and demographic factors via a dual-stream heterogeneous fusion architecture, with an uncertainty-aware hierarchical gating strategy for selective prediction and safe referral. A fairness constraint reduces missed diagnoses in disadvantaged subgroups. Experimental results show it achieved an AUC of 0.912 (96.7% specificity), cut racial false-negativity disparity by 73.4% (12.31% to 3.28%), maintained stable cross-domain performance, and enabled 3-12 months of early risk alerts (92% sensitivity, 88% specificity). Unlike post hoc fairness adjustments, Fair-Eye Net optimizes fairness as a primary goal with clinical reliability via multitask learning, offering a reproducible path for clinical translation and large-scale deployment to advance global eye health equity.
[255] DisasterInsight: A Multimodal Benchmark for Function-Aware and Grounded Disaster Assessment
Sara Tehrani,Yonghao Xu,Leif Haglund,Amanda Berg,Michael Felsberg
Main category: cs.CV
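TL;DR: This paper introduces DisasterInsight, a multimodal benchmark for evaluating vision-language models on realistic disaster analysis tasks, and proposes DI-Chat, a model that improves understanding and report generation in disaster scenarios.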
Details
Motivation: Existing remote-sensing vision-language benchmarks focus on coarse image-level recognition and do not evaluate the functional understanding and instruction robustness that disaster response requires. Method: Restructures the xBD dataset into about 112K building-centered instances to build the DisasterInsight benchmark; proposes DI-Chat, obtained by LoRA-based fine-tuning of existing VLMs on disaster-specific instructions. Result: Experiments show substantial performance gaps in existing models on damage understanding and structured report generation; DI-Chat markedly improves damage-level and disaster-type classification and report quality, but building-function classification remains challenging. Conclusion: DisasterInsight provides a unified benchmark for grounded multimodal reasoning in disaster imagery, and DI-Chat confirms the value of domain-adapted fine-tuning. Abstract: Timely interpretation of satellite imagery is critical for disaster response, yet existing vision-language benchmarks for remote sensing largely focus on coarse labels and image-level recognition, overlooking the functional understanding and instruction robustness required in real humanitarian workflows. We introduce DisasterInsight, a multimodal benchmark designed to evaluate vision-language models (VLMs) on realistic disaster analysis tasks. DisasterInsight restructures the xBD dataset into approximately 112K building-centered instances and supports instruction-diverse evaluation across multiple tasks, including building-function classification, damage-level and disaster-type classification, counting, and structured report generation aligned with humanitarian assessment guidelines. To establish domain-adapted baselines, we propose DI-Chat, obtained by fine-tuning existing VLM backbones on disaster-specific instruction data using parameter-efficient Low-Rank Adaptation (LoRA). Extensive experiments on state-of-the-art generic and remote-sensing VLMs reveal substantial performance gaps across tasks, particularly in damage understanding and structured report generation. DI-Chat achieves significant improvements on damage-level and disaster-type classification as well as report generation quality, while building-function classification remains challenging for all evaluated models. DisasterInsight provides a unified benchmark for studying grounded multimodal reasoning in disaster imagery.
[256] From Cold Start to Active Learning: Embedding-Based Scan Selection for Medical Image Segmentation
Devon Levy,Bar Assayag,Laura Gaspar,Ilan Shimshoni,Bella Specktor-Fadida
Main category: cs.CV
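TL;DR: This paper proposes a cold-start sampling strategy that combines foundation-model embeddings with clustering, followed by an active learning framework driven by both spatial diversity and uncertainty, to improve medical image segmentation when labeled data are scarce.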
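The proportional sampling across clusters mentioned in this summary can be sketched as follows. The sketch assumes the embeddings have already been clustered elsewhere and only the cluster label of each scan is available; the function name and the largest-remainder rounding are illustrative choices, not taken from the paper.

```python
import random
from collections import defaultdict

def proportional_cold_start(cluster_labels, budget, seed=0):
    """Pick `budget` sample indices, giving each cluster a share of the
    picks proportional to its size (largest-remainder rounding)."""
    total = len(cluster_labels)
    if budget > total:
        raise ValueError("budget exceeds number of samples")
    groups = defaultdict(list)
    for idx, lab in enumerate(cluster_labels):
        groups[lab].append(idx)
    quotas = {lab: budget * len(m) / total for lab, m in groups.items()}
    alloc = {lab: int(q) for lab, q in quotas.items()}
    leftover = budget - sum(alloc.values())
    # Hand remaining picks to the clusters with the largest fractional part.
    by_fraction = sorted(quotas, key=lambda l: quotas[l] - alloc[l], reverse=True)
    for lab in by_fraction[:leftover]:
        alloc[lab] += 1
    rng = random.Random(seed)
    chosen = []
    for lab, members in groups.items():
        chosen.extend(rng.sample(members, alloc[lab]))
    return sorted(chosen)

# Toy usage: 100 scans in three clusters of size 60, 30 and 10.
labels = [0] * 60 + [1] * 30 + [2] * 10
initial_set = proportional_cold_start(labels, budget=10)
```

Sampling in proportion to cluster size keeps the initial labeled set representative of the embedding distribution while still drawing from every sizeable cluster.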
Details
Motivation: Manual annotation for medical image segmentation is time-consuming and needs expert knowledge, which makes it a bottleneck for disease monitoring; the cold-start phase of conventional active learning lacks diversity, which hurts later performance. Method: Proposes a cold-start strategy that clusters foundation-model embeddings (with automatic selection of the number of clusters and proportional sampling across clusters), followed by an active learning framework that combines entropy-based uncertainty with spatial diversity; the feature-space distribution can be visualized, which aids interpretability. Result: Validated on three X-ray and MRI datasets (CheXmask, Montgomery, and SynthStrip): the cold-start phase generally raises Dice and lowers Hausdorff distance, and the AL phase improves performance further, consistently beating baselines in low-data regimes. Conclusion: The framework improves medical image segmentation accuracy under limited annotation and is practical, interpretable, and applicable across modalities. Abstract: Accurate segmentation annotations are critical for disease monitoring, yet manual labeling remains a major bottleneck due to the time and expertise required. Active learning (AL) alleviates this burden by prioritizing informative samples for annotation, typically through a diversity-based cold-start phase followed by uncertainty-driven selection. We propose a novel cold-start sampling strategy that combines foundation-model embeddings with clustering, including automatic selection of the number of clusters and proportional sampling across clusters, to construct a diverse and representative initial training set. This is followed by an uncertainty-based AL framework that integrates spatial diversity to guide sample selection. The proposed method is intuitive and interpretable, enabling visualization of the feature-space distribution of candidate samples. We evaluate our approach on three datasets spanning X-ray and MRI modalities. On the CheXmask dataset, the cold-start strategy outperforms random selection, improving Dice from 0.918 to 0.929 and reducing the Hausdorff distance from 32.41 to 27.66 mm. In the AL setting, combined entropy and diversity selection improves Dice from 0.919 to 0.939 and reduces the Hausdorff distance from 30.10 to 19.16 mm. On the Montgomery dataset, cold-start gains are substantial, with Dice improving from 0.928 to 0.950 and Hausdorff distance decreasing from 14.22 to 9.38 mm. On the SynthStrip dataset, cold-start selection slightly affects Dice but reduces the Hausdorff distance from 9.43 to 8.69 mm, while active learning improves Dice from 0.816 to 0.826 and reduces the Hausdorff distance from 7.76 to 6.38 mm. Overall, the proposed framework consistently outperforms baseline methods in low-data regimes, improving segmentation accuracy.
[257] GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning
Kaixun Jiang,Yuzheng Wang,Junjie Zhou,Pandeng Li,Zhihang Liu,Chen-Wei Xie,Zhaoyu Chen,Yun Zheng,Wenqiang Zhang
Main category: cs.CV
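TL;DR: GenAgent is an agentic multimodal framework that decouples visual understanding from generation: understanding is handled by the multimodal model, while generation is done by invoking image generation models as tools. It supports autonomous multi-turn reasoning chains (reasoning, tool invocation, judgment, and reflection) and uses two-stage training (supervised fine-tuning, then reinforcement learning with pointwise and pairwise rewards), markedly improving generation quality and generalization.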
Details
Motivation: Existing unified models are expensive to train, and their understanding and generation abilities constrain each other; modular systems are limited by static pipelines and cannot refine their outputs autonomously. Method: Proposes the GenAgent agentic framework: 1) it decouples understanding (the multimodal model) from generation (image generation models invoked as tools); 2) it supports multi-turn multimodal chains of thought (reasoning, tool invocation, judgment, reflection); 3) it trains in two stages, first supervised fine-tuning on high-quality tool-invocation and reflection data, then end-to-end reinforcement learning that combines pointwise rewards (image quality) and pairwise rewards (reflection accuracy), with trajectory resampling to strengthen exploration. Result: Improves FLUX.1-dev generation by 23.6% on GenEval++ and 14% on WISE; shows cross-tool generalization, test-time scaling with consistent gains across interaction rounds, and task-adaptive reasoning. Conclusion: Through its agentic architecture and training recipe, GenAgent decouples and coordinates visual understanding and generation, offering a more flexible, scalable, and reasoning-capable approach to multimodal generation. Abstract: We introduce GenAgent, unifying visual understanding and generation through an agentic multimodal model. Unlike unified models that face expensive training costs and understanding-generation trade-offs, GenAgent decouples these capabilities through an agentic framework: understanding is handled by the multimodal model itself, while generation is achieved by treating image generation models as invokable tools. Crucially, unlike existing modular systems constrained by static pipelines, this design enables autonomous multi-turn interactions where the agent generates multimodal chains-of-thought encompassing reasoning, tool invocation, judgment, and reflection to iteratively refine outputs. We employ a two-stage training strategy: first, cold-start with supervised fine-tuning on high-quality tool invocation and reflection data to bootstrap agent behaviors; second, end-to-end agentic reinforcement learning combining pointwise rewards (final image quality) and pairwise rewards (reflection accuracy), with trajectory resampling for enhanced multi-turn exploration. GenAgent significantly boosts base generator (FLUX.1-dev) performance on GenEval++ (+23.6%) and WISE (+14%). Beyond performance gains, our framework demonstrates three key properties: 1) cross-tool generalization to generators with varying capabilities, 2) test-time scaling with consistent improvements across interaction rounds, and 3) task-adaptive reasoning that automatically adjusts to different tasks. Our code will be available at https://github.com/deep-kaixun/GenAgent.
[258] REMAC: Reference-Based Martian Asymmetrical Image Compression
Qing Ding,Mai Xu,Shengxi Li,Xin Deng,Xin Zou
Main category: cs.CV
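TL;DR: This paper proposes REMAC, a reference-based asymmetrical compression method for Martian images. It shifts the computational burden from the resource-constrained encoder on Mars to the decoder on Earth and exploits inter-image and intra-image similarities, markedly lowering encoding complexity while improving compression performance.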
Details
Motivation: Existing learned image compression methods perform poorly on Martian images, mainly because they ignore the very limited computing resources on Mars and do not exploit the strong similarities across Martian images. Method: Proposes reference-based Martian asymmetrical image compression (REMAC): 1) a reference-guided entropy module and a ref-decoder exploit inter-image similarities; 2) the ref-decoder uses a deep, multi-scale structure with an enlarged receptive field to model long-range intra-image dependencies; 3) a latent feature recycling mechanism further eases the computational load on Mars. Result: Compared with the current best method, REMAC reduces encoder complexity by 43.51% while improving BD-PSNR by 0.2664 dB. Conclusion: REMAC balances low computational cost on Mars against high compression performance, offering a practical option for deep-space image transmission. Abstract: To expedite space exploration on Mars, it is indispensable to develop an efficient Martian image compression method for transmitting images through the constrained Mars-to-Earth communication channel. Although the existing learned compression methods have achieved promising results for natural images from Earth, there remain two critical issues that hinder their effectiveness for Martian image compression: 1) They overlook the highly-limited computational resources on Mars; 2) They do not utilize the strong inter-image similarities across Martian images to advance image compression performance. Motivated by our empirical analysis of the strong intra- and inter-image similarities from the perspective of texture, color, and semantics, we propose a reference-based Martian asymmetrical image compression (REMAC) approach, which shifts computational complexity from the encoder to the resource-rich decoder and simultaneously improves compression performance. To leverage inter-image similarities, we propose a reference-guided entropy module and a ref-decoder that utilize useful information from reference images, reducing redundant operations at the encoder and achieving superior compression performance. To exploit intra-image similarities, the ref-decoder adopts a deep, multi-scale architecture with enlarged receptive field size to model long-range spatial dependencies. Additionally, we develop a latent feature recycling mechanism to further alleviate the extreme computational constraints on Mars. Experimental results show that REMAC reduces encoder complexity by 43.51% compared to the state-of-the-art method, while achieving a BD-PSNR gain of 0.2664 dB.
[259] Automated Landmark Detection for assessing hip conditions: A Cross-Modality Validation of MRI versus X-ray
Roberto Di Via,Vito Paolo Pastore,Francesca Odone,Siôn Glyn-Jones,Irina Voiculescu
Main category: cs.CV
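TL;DR: This paper presents a cross-modality clinical-equivalence validation on paired MRI and X-ray data, showing that landmark localization on coronal views of 3D MRI reaches diagnostic accuracy for cam-type FAI comparable to X-ray, which supports integrating automated FAI assessment into routine MRI workflows.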
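Since the screening described here rests on angles measured between detected landmarks, a small sketch of that final geometric step may help. It computes the angle at one landmark between the rays to two others; which anatomical landmarks define a given clinical angle is not specified in this summary, so the points below are purely illustrative.

```python
import math

def angle_at(vertex, p1, p2):
    """Angle in degrees at `vertex` between rays vertex->p1 and vertex->p2."""
    v1 = (p1[0] - vertex[0], p1[1] - vertex[1])
    v2 = (p2[0] - vertex[0], p2[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    # atan2 of |cross| and dot is numerically stable near 0 and 180 degrees.
    return math.degrees(math.atan2(abs(cross), dot))

# Hypothetical landmark coordinates in pixels (not from the paper).
centre = (120.0, 95.0)
neck_axis_point = (160.0, 130.0)
contour_point = (150.0, 70.0)
print(round(angle_at(centre, neck_axis_point, contour_point), 1))
```

Because the angle depends only on landmark positions, the same routine applies to landmarks predicted from X-ray or from a coronal MRI view, which is what makes a cross-modality comparison of derived angles possible.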
Details
Motivation: FAI screening traditionally relies on angle measurements from X-rays, but assessing the height and span of the impingement area requires the 3D view of MRI; the two modalities are complementary, so it must be verified whether MRI can replace or complement X-ray with equivalent diagnostic ability. Method: On a cohort of 89 patients with paired MRI and X-ray, uses standard heatmap regression architectures for cross-modality landmark detection and diagnostic validation, focusing on landmark localization for cam-type FAI in coronal views of 3D MRI. Result: MRI matches X-ray in landmark localization precision and in diagnostic accuracy for cam-type FAI, confirming clinical feasibility and supporting later extension to full volumetric analysis. Conclusion: Coronal views of 3D MRI can be used for reliable, automated FAI assessment, providing empirical support for embedding AI-assisted FAI diagnosis in routine MRI workflows. Abstract: Many clinical screening decisions are based on angle measurements. In particular, FemoroAcetabular Impingement (FAI) screening relies on angles traditionally measured on X-rays. However, assessing the height and span of the impingement area also requires a 3D view through an MRI scan. The two modalities inform the surgeon on different aspects of the condition. In this work, we conduct a matched-cohort validation study (89 patients, paired MRI/X-ray) using standard heatmap regression architectures to assess cross-modality clinical equivalence. Given that landmark detection has been proven effective on X-rays, we show that MRI also achieves equivalent localisation and diagnostic accuracy for cam-type impingement. Our method demonstrates clinical feasibility for FAI assessment in coronal views of 3D MRI volumes, opening the possibility for volumetric analysis through placing further landmarks. These results support integrating automated FAI assessment into routine MRI workflows. Code is released at https://github.com/Malga-Vision/Landmarks-Hip-Conditions
[260] Generative Diffusion Augmentation with Quantum-Enhanced Discrimination for Medical Image Diagnosis
Jingsong Xia,Siqi Wang
Main category: cs.CV
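TL;DR: This paper proposes the SDA-QEC framework, which combines simplified diffusion-based data augmentation with quantum-enhanced classification to address severe class imbalance in medical image classification. On coronary angiography image classification it reports high accuracy, AUC, and F1-score with balanced sensitivity and specificity.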
Details
Motivation: Real-world medical datasets often have severe class imbalance, which yields low recall on minority classes, hurts diagnostic accuracy, and raises the risk of clinical misdiagnosis. Method: Proposes the SDA-QEC framework: a lightweight diffusion augmentor generates high-quality synthetic samples for minority classes to rebalance the data distribution; a quantum feature layer embedded in MobileNetV2 strengthens discrimination through high-dimensional mapping in Hilbert space. Result: On coronary angiography image classification it reaches 98.33% accuracy, 98.78% AUC, and 98.33% F1-score, with 98.33% sensitivity and 98.33% specificity, significantly outperforming classical models such as ResNet18, MobileNetV2, DenseNet121, and VGG16. Conclusion: SDA-QEC shows that combining generative augmentation with quantum-enhanced modeling is feasible for small-sample, highly imbalanced, high-risk medical imaging tasks, and points to a new route toward highly reliable medical AI systems. Abstract: In biomedical engineering, artificial intelligence has become a pivotal tool for enhancing medical diagnostics, particularly in medical image classification tasks such as detecting pneumonia from chest X-rays and breast cancer screening. However, real-world medical datasets frequently exhibit severe class imbalance, where positive samples substantially outnumber negative samples, leading to biased models with low recall rates for minority classes. This imbalance not only compromises diagnostic accuracy but also poses clinical misdiagnosis risks. To address this challenge, we propose SDA-QEC (Simplified Diffusion Augmentation with Quantum-Enhanced Classification), an innovative framework that integrates simplified diffusion-based data augmentation with quantum-enhanced feature discrimination. Our approach employs a lightweight diffusion augmentor to generate high-quality synthetic samples for minority classes, rebalancing the training distribution. Subsequently, a quantum feature layer embedded within MobileNetV2 architecture enhances the model's discriminative capability through high-dimensional feature mapping in Hilbert space. Comprehensive experiments on coronary angiography image classification demonstrate that SDA-QEC achieves 98.33% accuracy, 98.78% AUC, and 98.33% F1-score, significantly outperforming classical baselines including ResNet18, MobileNetV2, DenseNet121, and VGG16. Notably, our framework simultaneously attains 98.33% sensitivity and 98.33% specificity, achieving a balanced performance critical for clinical deployment. The proposed method validates the feasibility of integrating generative augmentation with quantum-enhanced modeling in real-world medical imaging tasks, offering a novel research pathway for developing highly reliable medical AI systems in small-sample, highly imbalanced, and high-risk diagnostic scenarios.
[261] AI-enabled Satellite Edge Computing: A Single-Pixel Feature based Shallow Classification Model for Hyperspectral Imaging
Li Fang,Tianyu Li,Yanghong Lin,Shudong Zhou,Wei Yao
Main category: cs.CV
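TL;DR: This paper proposes a lightweight, non-deep-learning onboard edge-computing method for hyperspectral image classification. It combines few-shot learning with a two-stage pixel-wise label propagation strategy to cope with the resource limits of satellite platforms and the image degradation caused by sensor failures.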
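The first propagation stage in this summary (labels spread from a few anchors to every pixel through an anchor-pixel affinity) can be sketched per pixel, using only each pixel's own spectrum. The Gaussian affinity, the `sigma` parameter, and the function name are assumptions made for illustration; the paper's affinity construction, its top-k sparse graph, and its closed-form second stage are not reproduced here.

```python
import math

def propagate_anchor_labels(pixels, anchors, anchor_labels, n_classes, sigma=1.0):
    """Assign each pixel spectrum the class with the largest summed
    Gaussian affinity to the labelled anchor spectra."""
    labels = []
    for spec in pixels:
        scores = [0.0] * n_classes
        for anchor, lab in zip(anchors, anchor_labels):
            # Squared spectral distance between this pixel and the anchor.
            d2 = sum((a - b) ** 2 for a, b in zip(spec, anchor))
            scores[lab] += math.exp(-d2 / (2.0 * sigma ** 2))
        labels.append(max(range(n_classes), key=scores.__getitem__))
    return labels

# Toy usage: two labelled anchors with 3-band spectra, two unlabelled pixels.
anchors = [[0.1, 0.1, 0.1], [0.9, 0.9, 0.9]]
anchor_labels = [0, 1]
pixels = [[0.2, 0.1, 0.15], [0.8, 0.95, 0.9]]
print(propagate_anchor_labels(pixels, anchors, anchor_labels, n_classes=2))
```

Each pixel is classified independently of its neighbours, so a bad or misaligned pixel cannot corrupt the labels around it, which is the robustness argument the summary makes for single-pixel features.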
Details
Motivation: Satellite downlink transmission is slow and cannot meet fast-response applications such as disaster monitoring and emergency mapping; onboard processing also faces image degradation caused by sensor failures and scan-pattern errors. Method: Proposes a lightweight, non-deep-learning framework combined with few-shot learning; designs a two-stage pixel-wise label propagation scheme in which the first stage propagates initial labels through an anchor-pixel affinity matrix and the second stage refines them using a top-k sparse graph and a closed-form solution; introduces a rank-constraint-based graph clustering algorithm to select anchor labels automatically. Result: The method achieves efficient and robust autonomous hyperspectral image classification on resource-constrained satellite platforms without relying on spatial structural information, and adapts well to bad pixels, misaligned pixels, and mixed noise. Conclusion: The proposed AI-enabled satellite edge computing paradigm eases the downlink bandwidth bottleneck, improves autonomous decision-making on the satellite, and offers a feasible technical path for real-time Earth observation. Abstract: As an important component of the Earth observation system, hyperspectral imaging satellites provide high-fidelity and enriched information for the formulation of related policies owing to their powerful spectral measurement capabilities. However, the transmission speed of the satellite downlink has become a major bottleneck in certain applications, such as disaster monitoring and emergency mapping, which demand a fast response ability. We propose an efficient AI-enabled Satellite Edge Computing paradigm for hyperspectral image classification, facilitating the satellites to attain autonomous decision-making. To accommodate the resource constraints of satellite platforms, the proposed method adopts a lightweight, non-deep learning framework integrated with a few-shot learning strategy. Moreover, onboard processing on satellites could be faced with sensor failure and scan pattern errors, which result in degraded image quality with bad/misaligned pixels and mixed noise. To address these challenges, we develop a novel two-stage pixel-wise label propagation scheme that utilizes only intrinsic spectral features at the single pixel level without the necessity to consider spatial structural information as required by deep neural networks. In the first stage, initial pixel labels are obtained by propagating selected anchor labels through the constructed anchor-pixel affinity matrix. Subsequently, a top-k pruned sparse graph is generated by directly computing pixel-level similarities. In the second stage, a closed-form solution derived from the sparse graph is employed to replace iterative computations. Furthermore, we developed a rank constraint-based graph clustering algorithm to determine the anchor labels.
[262] Self-Refining Video Sampling
Sangwon Jang,Taekyung Ki,Jaehyeong Jo,Saining Xie,Jaehong Yoon,Sung Ju Hwang
Main category: cs.CV
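TL;DR: This paper proposes a self-refining video sampling method that needs no additional training or external verifier. It treats a pre-trained video generator as a denoising autoencoder, runs iterative inner-loop refinement at inference time, and adds an uncertainty-aware strategy that refines regions selectively, markedly improving motion coherence and physical plausibility.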
Details
Motivation: Modern video generators still fall short in modeling complex physical dynamics; existing methods rely on external verifiers or training on augmented data, which is computationally expensive and struggles to capture fine-grained motion. Method: Interprets the pre-trained video generator as a denoising autoencoder, enabling iterative self-refinement at inference time; introduces an uncertainty-aware region selection strategy based on self-consistency to avoid artifacts from over-refinement. Result: Experiments on several state-of-the-art video generators show markedly better motion coherence and physics alignment, with over 70% human preference against the default sampler and a guidance-based sampler. Conclusion: Self-refining video sampling is a lightweight, efficient, general inference-time method that improves physical realism without additional training or external modules. Abstract: Modern video generators still struggle with complex physical dynamics, often falling short of physical realism. Existing approaches address this using external verifiers or additional training on augmented data, which is computationally expensive and still limited in capturing fine-grained motion. In this work, we present self-refining video sampling, a simple method that uses a pre-trained video generator trained on large-scale datasets as its own self-refiner. By interpreting the generator as a denoising autoencoder, we enable iterative inner-loop refinement at inference time without any external verifier or additional training. We further introduce an uncertainty-aware refinement strategy that selectively refines regions based on self-consistency, which prevents artifacts caused by over-refinement. Experiments on state-of-the-art video generators demonstrate significant improvements in motion coherence and physics alignment, achieving over 70% human preference compared to the default sampler and guidance-based sampler.
[263] GimmBO: Interactive Generative Image Model Merging via Bayesian Optimization
Chenxi Liu,Selena Ling,Alec Jacobson
Main category: cs.CV
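TL;DR: This paper proposes GimmBO, a framework that uses Preferential Bayesian Optimization (PBO) to support interactive exploration of adapter merging for diffusion models. It markedly improves sampling efficiency and convergence in the high-dimensional weight space and is validated with simulated users and a real user study.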
Details
Motivation: Manual slider-based tuning scales poorly for adapter merging and makes weight selection hard, even when the candidate set is limited to 20-30 adapters; real-world usage also shows weight sparsity and constrained weight ranges. Method: Proposes a two-stage Bayesian optimization backend built on Preferential Bayesian Optimization (PBO) that incorporates the sparsity and range-constraint priors observed in adapter merging, enabling efficient exploration of the high-dimensional weight space; builds the interactive image generation system GimmBO. Result: In simulated-user and real-user studies, GimmBO converges faster, succeeds more often, and gains consistently over standard Bayesian optimization and line-search baselines; several extensions demonstrate the framework's flexibility. Conclusion: GimmBO offers a scalable, efficient, user-friendly interactive optimization workflow for diffusion-model adapter merging and supports community-driven visual creation tools. Abstract: Fine-tuning-based adaptation is widely used to customize diffusion-based image generation, leading to large collections of community-created adapters that capture diverse subjects and styles. Adapters derived from the same base model can be merged with weights, enabling the synthesis of new visual results within a vast and continuous design space. To explore this space, current workflows rely on manual slider-based tuning, an approach that scales poorly and makes weight selection difficult, even when the candidate set is limited to 20-30 adapters. We propose GimmBO to support interactive exploration of adapter merging for image generation through Preferential Bayesian Optimization (PBO). Motivated by observations from real-world usage, including sparsity and constrained weight ranges, we introduce a two-stage BO backend that improves sampling efficiency and convergence in high-dimensional spaces. We evaluate our approach with simulated users and a user study, demonstrating improved convergence, high success rates, and consistent gains over BO and line-search baselines, and further show the flexibility of the framework through several extensions.
[264] AGSP-DSA: An Adaptive Graph Signal Processing Framework for Robust Multimodal Fusion with Dynamic Semantic Alignment
KV Karthikeya,Ashok Kumar Das,Shantanu Pal,Vivekananda Bhat K,Arun Sekar Rajasekaran
Main category: cs.CV
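TL;DR: This paper proposes an Adaptive Graph Signal Processing with Dynamic Semantic Alignment (AGSP-DSA) framework for robust fusion of heterogeneous multimodal data (text, audio, images). It combines dual-graph construction, spectral graph filtering, multi-scale GCN embedding, and a semantic-aware attention mechanism, and reports state-of-the-art results on several benchmark datasets.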
Details
Motivation: Addresses insufficient modeling of inter-modal relations, noise interference, and weak robustness under missing modalities when fusing heterogeneous multimodal data (text, audio, images). Method: Proposes the AGSP-DSA framework: 1) a dual-graph structure learns intra-modal and inter-modal relations separately; 2) spectral graph filtering strengthens informative signals; 3) multi-scale Graph Convolutional Networks (GCNs) produce node embeddings; 4) a semantic-aware attention mechanism lets each modality contribute dynamically according to contextual relevance. Result: On CMU-MOSEI it reaches 95.3% accuracy, 0.936 F1, and 0.924 mAP (2.6% higher accuracy than MM-GNN); on AVE, 93.4% accuracy and 0.911 F1; on MM-IMDB, 91.8% accuracy and 0.886 F1; it generalizes well and stays robust in the missing-modality setting. Conclusion: AGSP-DSA improves multimodal learning for sentiment analysis, event recognition, and multimedia classification, supporting the combination of dynamic semantic alignment with graph signal processing. Abstract: In this paper, we introduce an Adaptive Graph Signal Processing with Dynamic Semantic Alignment (AGSP-DSA) framework to perform robust multimodal data fusion over heterogeneous sources, including text, audio, and images. The proposed approach uses a dual-graph construction to learn both intra-modal and inter-modal relations, spectral graph filtering to boost the informative signals, and effective node embedding with Multi-scale Graph Convolutional Networks (GCNs). A semantic-aware attention mechanism lets each modality contribute dynamically according to its contextual relevance. The experimental outcomes on three benchmark datasets, including CMU-MOSEI, AVE, and MM-IMDB, show that AGSP-DSA performs as the state of the art. More precisely, it achieves 95.3% accuracy, 0.936 F1-score, and 0.924 mAP on CMU-MOSEI, improving on MM-GNN by 2.6 percent in accuracy. It gets 93.4% accuracy and 0.911 F1-score on AVE and 91.8% accuracy and 0.886 F1-score on MM-IMDB, which demonstrate good generalization and robustness in the missing modality setting. These findings verify the efficiency of AGSP-DSA in promoting multimodal learning in sentiment analysis, event recognition and multimedia classification.
[265] EFSI-DETR: Efficient Frequency-Semantic Integration for Real-Time Small Object Detection in UAV Imagery
Yu Xia,Chang Liu,Tianqi Xiang,Zhigang Tu
Main category: cs.CV
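TL;DR: This paper proposes the EFSI-DETR framework, which uses a Dynamic Frequency-Spatial Unified Synergy Network (DyFusNet) and an Efficient Semantic Feature Concentrator (ESFC), together with a Fine-grained Feature Retention (FFR) strategy, to improve real-time detection of small objects in UAV imagery.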
Details
Motivation: Existing methods for small object detection in UAV imagery are limited by weak feature representation and poor multi-scale fusion; they also neglect frequency information and rely on static convolutions. Method: Proposes EFSI-DETR, comprising DyFusNet (jointly models frequency and spatial information for robust multi-scale fusion), ESFC (strengthens deep semantic features at low computational cost), and the FFR strategy (fuses shallow features rich in spatial detail to preserve small-object detail). Result: Reaches state-of-the-art results on VisDrone and CODrone: on VisDrone, AP improves by 1.6% and AP_s by 5.8%, at 188 FPS on a single RTX 4090 GPU. Conclusion: EFSI-DETR fuses frequency and spatial information effectively, balances detection accuracy with real-time speed, and markedly improves small object detection in UAV imagery. Abstract: Real-time small object detection in Unmanned Aerial Vehicle (UAV) imagery remains challenging due to limited feature representation and ineffective multi-scale fusion. Existing methods underutilize frequency information and rely on static convolutional operations, which constrain the capacity to obtain rich feature representations and hinder the effective exploitation of deep semantic features. To address these issues, we propose EFSI-DETR, a novel detection framework that integrates efficient semantic feature enhancement with dynamic frequency-spatial guidance. EFSI-DETR comprises two main components: (1) a Dynamic Frequency-Spatial Unified Synergy Network (DyFusNet) that jointly exploits frequency and spatial cues for robust multi-scale feature fusion, and (2) an Efficient Semantic Feature Concentrator (ESFC) that enables deep semantic extraction with minimal computational cost. Furthermore, a Fine-grained Feature Retention (FFR) strategy is adopted to incorporate spatially rich shallow features during fusion to preserve fine-grained details, crucial for small object detection in UAV imagery. Extensive experiments on VisDrone and CODrone benchmarks demonstrate that our EFSI-DETR achieves state-of-the-art performance with real-time efficiency, yielding improvements of 1.6% and 5.8% in AP and AP_s on VisDrone, while obtaining 188 FPS inference speed on a single RTX 4090 GPU.
[266] Scale-Aware Self-Supervised Learning for Segmentation of Small and Sparse Structures
Jorge Quesada,Ghassan AlRegib
Main category: cs.CV
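TL;DR: This paper proposes a scale-aware self-supervised learning (SSL) method that adds small-window cropping to pretraining so the model focuses on fine-grained structures, yielding clear gains when segmenting small, sparse targets in seismic imaging (fault segmentation) and neuroimaging (cellular structure segmentation).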
Details
Motivation: Existing SSL methods work well for large, homogeneous regions in segmentation, but their performance drops on small, sparse, or locally irregular targets, so an SSL strategy adapted to target scale is needed. Method: Small-window cropping is added to the SSL augmentation pipeline so that the model attends to fine-scale structures during pretraining, giving scale-aware representation learning. Result: On seismic fault segmentation and neural cell segmentation, accuracy under limited annotation improves by up to 13% and 5% respectively over standard and state-of-the-art baselines, while large-scale features (such as seismic facies and tissue regions) see little gain. Conclusion: The effectiveness of SSL depends strongly on the scale and sparsity of the target objects; SSL pipelines should be designed around the scale characteristics of the task, a principle that extends to scientific image analysis in general. Abstract: Self-supervised learning (SSL) has emerged as a powerful strategy for representation learning under limited annotation regimes, yet its effectiveness remains highly sensitive to many factors, especially the nature of the target task. In segmentation, existing pipelines are typically tuned to large, homogeneous regions, but their performance drops when objects are small, sparse, or locally irregular. In this work, we propose a scale-aware SSL adaptation that integrates small-window cropping into the augmentation pipeline, zooming in on fine-scale structures during pretraining. We evaluate this approach across two domains with markedly different data modalities: seismic imaging, where the goal is to segment sparse faults, and neuroimaging, where the task is to delineate small cellular structures. In both settings, our method yields consistent improvements over standard and state-of-the-art baselines under label constraints, improving accuracy by up to 13% for fault segmentation and 5% for cell delineation. In contrast, large-scale features such as seismic facies or tissue regions see little benefit, underscoring that the value of SSL depends critically on the scale of the target objects. Our findings highlight the need to align SSL design with object size and sparsity, offering a general principle for building more effective representation learning pipelines across scientific imaging domains.
[267] Adaptive Domain Shift in Diffusion Models for Cross-Modality Image Translation
Zihao Wang,Yuzhou Chen,Shaogang Ren
Main category: cs.CV
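TL;DR: This paper proposes a new cross-modality image translation method that embeds the domain shift dynamically into the generative process. By predicting a spatially varying mixing field at every step and adding a target-consistent restoration term, it improves structural fidelity and semantic consistency while reducing the number of denoising steps.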
Details
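Motivation: Existing diffusion models rely on a single, global linear mapping between domains, which forces the sampler to traverse off-manifold, high-cost regions, increasing the correction burden and causing semantic drift; the authors call this shared failure mode "fixed-schedule domain transfer". Method: The model predicts a spatially varying mixing field at every reverse step and injects an explicit, target-consistent restoration term into the drift; the paper gives a continuous-time formulation with an exact solution form and derives a practical first-order sampler that preserves marginal consistency. Result: Across cross-modality translation tasks in medical imaging, remote sensing, and electroluminescence semantic mapping, the framework improves structural fidelity and semantic consistency and converges in fewer denoising steps. Conclusion: Explicitly modeling domain-shift dynamics and embedding them in the generative process shifts the model from global alignment to local residual correction, markedly improving the robustness and efficiency of cross-modality image translation. Abstract: Cross-modal image translation remains brittle and inefficient. Standard diffusion approaches often rely on a single, global linear transfer between domains. We find that this shortcut forces the sampler to traverse off-manifold, high-cost regions, inflating the correction burden and inviting semantic drift. We refer to this shared failure mode as fixed-schedule domain transfer. In this paper, we embed domain-shift dynamics directly into the generative process. Our model predicts a spatially varying mixing field at every reverse step and injects an explicit, target-consistent restoration term into the drift. This in-step guidance keeps large updates on-manifold and shifts the model's role from global alignment to local residual correction. We provide a continuous-time formulation with an exact solution form and derive a practical first-order sampler that preserves marginal consistency. Empirically, across translation tasks in medical imaging, remote sensing, and electroluminescence semantic mapping, our framework improves structural fidelity and semantic consistency while converging in fewer denoising steps.
[268] CONQUER: Context-Aware Representation with Query Enhancement for Text-Based Person Search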
Zequn Xie
Main category: cs.CV
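TL;DR: This paper proposes the CONQUER framework. During training it strengthens cross-modal alignment through multi-granularity encoding, complementary pair mining, and context-guided matching based on Optimal Transport; at inference it adaptively refines vague or incomplete queries through anchor selection and attribute-driven enrichment, without retraining the backbone. It substantially improves text-based person search on several datasets.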
Details
Motivation: Text-Based Person Search (TBPS) is essential in practical applications such as public safety, but it faces two major challenges: cross-modal discrepancies and vague or incomplete user queries. Method: CONQUER is a two-stage framework: the training stage uses multi-granularity encoding, complementary pair mining, and context-guided optimal matching based on Optimal Transport; the inference stage adds a plug-and-play query enhancement module that enriches query semantics through anchor selection and attribute-driven enrichment. Result: CONQUER clearly outperforms strong baselines on CUHK-PEDES, ICFG-PEDES, and RSTPReid in both Rank-1 and mAP, with the largest gains in cross-domain and incomplete-query scenarios. Conclusion: CONQUER is a practical and efficient TBPS solution that generalizes well and is easy to deploy. Abstract: Text-Based Person Search (TBPS) aims to retrieve pedestrian images from large galleries using natural language descriptions. This task, essential for public safety applications, is hindered by cross-modal discrepancies and ambiguous user queries. We introduce CONQUER, a two-stage framework designed to address these challenges by enhancing cross-modal alignment during training and adaptively refining queries at inference. During training, CONQUER employs multi-granularity encoding, complementary pair mining, and context-guided optimal matching based on Optimal Transport to learn robust embeddings. At inference, a plug-and-play query enhancement module refines vague or incomplete queries via anchor selection and attribute-driven enrichment, without requiring retraining of the backbone. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid demonstrate that CONQUER consistently outperforms strong baselines in both Rank-1 accuracy and mAP, yielding notable improvements in cross-domain and incomplete-query scenarios. These results highlight CONQUER as a practical and effective solution for real-world TBPS deployment. Source code is available at https://github.com/zqxie77/CONQUER.
[269] Splat-Portrait: Generalizing Talking Heads with Gaussian Splatting
Tong Shi,Melonie de Almeida,Daniela Ivanova,Nicolas Pugeault,Paul Henderson
Main category: cs.CV
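TL;DR: This paper proposes Splat-Portrait, a 3D talking head generation method based on Gaussian Splatting that needs neither 3D supervision nor motion priors. It automatically disentangles a single portrait into a static 3D Gaussian representation and a 2D background, then drives natural lip motion from audio, improving the realism of generated videos and the quality of novel view synthesis.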
Details
Motivation: Existing 3D talking head generation methods depend on domain-specific heuristics (such as warping-based facial motion representation priors), which leads to inaccurate 3D head reconstruction and less realistic animation. Method: Splat-Portrait, based on Gaussian Splatting, automatically learns to disentangle a single portrait into a static 3D Gaussian Splatting representation and a 2D background; it generates lip motion conditioned on audio, without motion-driven priors; training uses only 2D reconstruction and score-distillation losses, with no 3D supervision or landmark annotations. Result: It outperforms prior methods on talking head generation and novel view synthesis, with better visual quality. Conclusion: Splat-Portrait achieves high-quality, end-to-end 3D talking head generation without explicit 3D supervision or motion priors, demonstrating the effectiveness and potential of Gaussian Splatting for this task. Abstract: Talking Head Generation aims at synthesizing natural-looking talking videos from speech and a single portrait image. Previous 3D talking head generation methods have relied on domain-specific heuristics such as warping-based facial motion representation priors to animate talking motions, yet still produce inaccurate 3D avatar reconstructions, thus undermining the realism of generated animations. We introduce Splat-Portrait, a Gaussian-splatting-based method that addresses the challenges of 3D head reconstruction and lip motion synthesis. Our approach automatically learns to disentangle a single portrait image into a static 3D reconstruction, represented as static Gaussian Splatting, and a predicted whole-image 2D background. It then generates natural lip motion conditioned on input audio, without any motion-driven priors. Training is driven purely by 2D reconstruction and score-distillation losses, without 3D supervision or landmarks. Experimental results demonstrate that Splat-Portrait exhibits superior performance on talking head generation and novel view synthesis, achieving better visual quality than previous works. Our project code and supplementary documents are publicly available at https://github.com/stonewalking/Splat-portrait.
[270] Are Video Generation Models Geographically Fair? An Attraction-Centric Evaluation of Global Visual Knowledge
Xiao Liu,Jiawei Zhang
Main category: cs.CV
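TL;DR: This paper proposes the Geo-Attraction Landmark Probing (GAP) framework and the GEOATTRACTION-500 benchmark to evaluate the geographic fairness and geographically grounded visual knowledge of text-to-video models. It finds that Sora 2 shows a relatively uniform level of geographic visual knowledge across regions, development levels, and cultural groupings, with only weak dependence on attraction popularity.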
Details
Motivation: To examine whether current text-to-video generation models encode geographically equitable visual knowledge, filling the gap in systematic evaluation of geographic bias in these models. Method: The GAP evaluation framework combines global structural alignment, keypoint-level alignment, and vision-language model judgments, and is paired with GEOATTRACTION-500, a benchmark of 500 globally distributed tourist attractions; all metrics are validated against human evaluation. Result: Sora 2 shows a fairly uniform level of geographic visual knowledge across geographic regions, development levels, and cultural groupings, depends only weakly on attraction popularity, and shows no significant geographic bias. Conclusion: Current text-to-video models express global visual knowledge more evenly than expected and show promise for global deployment, but their fairness still needs continued evaluation as they evolve. Abstract: Recent advances in text-to-video generation have produced visually compelling results, yet it remains unclear whether these models encode geographically equitable visual knowledge. In this work, we investigate the geo-equity and geographically grounded visual knowledge of text-to-video models through an attraction-centric evaluation. We introduce Geo-Attraction Landmark Probing (GAP), a systematic framework for assessing how faithfully models synthesize tourist attractions from diverse regions, and construct GEOATTRACTION-500, a benchmark of 500 globally distributed attractions spanning varied regions and popularity levels. GAP integrates complementary metrics that disentangle overall video quality from attraction-specific knowledge, including global structural alignment, fine-grained keypoint-based alignment, and vision-language model judgments, all validated against human evaluation. Applying GAP to the state-of-the-art text-to-video model Sora 2, we find that, contrary to common assumptions of strong geographic bias, the model exhibits a relatively uniform level of geographically grounded visual knowledge across regions, development levels, and cultural groupings, with only weak dependence on attraction popularity. These results suggest that current text-to-video models express global visual knowledge more evenly than expected, highlighting both their promise for globally deployed applications and the need for continued evaluation as such systems evolve.
[271] Low Cost, High Efficiency: LiDAR Place Recognition in Vineyards with Matryoshka Representation Learning
Judith Vilella-Cantos,Mauro Martini,Marcello Chiaberge,Mónica Ballesta,David Valiente
Main category: cs.CV
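TL;DR: This paper proposes MinkUNeXt-VINE, a lightweight deep learning method designed for robot localization and place recognition in vineyards. It uses sparse, low-cost LiDAR data and multi-loss Matryoshka Representation Learning to strike a good balance between performance and efficiency.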
Details
Motivation: Agricultural environments are unstructured and lack distinctive landmarks, which makes place recognition for mobile robots very challenging; existing work concentrates on object classification and segmentation and pays little attention to localization. Method: The MinkUNeXt-VINE model combines a pre-processing strategy with multi-loss training based on Matryoshka Representation Learning, is adapted to low-resolution, sparse LiDAR input, and outputs low-dimensional features to improve real-time performance; it is supported by a detailed ablation study and cross-sensor validation on two long-term vineyard datasets. Result: It clearly surpasses current state-of-the-art methods in vineyard scenes, is robust to low-quality LiDAR input, and combines high accuracy with high efficiency; the code is open source. Conclusion: MinkUNeXt-VINE offers agricultural robots an efficient, lightweight, and practical place recognition solution and confirms the value of optimizing representation learning under resource constraints. Abstract: Localization in agricultural environments is challenging due to their unstructured nature and lack of distinctive landmarks. Although agricultural settings have been studied in the context of object classification and segmentation, the place recognition task for mobile robots is not trivial in the current state of the art. In this study, we propose MinkUNeXt-VINE, a lightweight, deep-learning-based method that surpasses state-of-the-art methods in vineyard environments thanks to its pre-processing and Matryoshka Representation Learning multi-loss approach. Our method prioritizes enhanced performance with low-cost, sparse LiDAR inputs and lower-dimensionality outputs to ensure high efficiency in real-time scenarios. Additionally, we present a comprehensive ablation study of the results on various evaluation cases and two extensive long-term vineyard datasets employing different LiDAR sensors. The results demonstrate the efficiency of the trade-off output produced by this approach, as well as its robust performance on low-cost and low-resolution input data. The code is publicly available for reproduction.
[272] SeNeDiF-OOD: Semantic Nested Dichotomy Fusion for Out-of-Distribution Detection Methodology in Open-World Classification. A Case Study on Monument Style Classification
Ignacio Antequera-Sánchez,Juan Luis Suárez-Díaz,Rosana Montes,Francisco Herrera
Main category: cs.CV
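TL;DR: This paper proposes SeNeDiF-OOD, which detects out-of-distribution data hierarchically through a Semantic Nested Dichotomy Fusion framework, and validates its superior detection performance across several OOD scenarios (such as non-monument images, unknown styles, and adversarial attacks) on MonuMAI, an architectural style recognition system.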