Table of Contents
cs.CL
[1] Decoder-based Sense Knowledge Distillation
Qitong Wang,Mohammed J. Zaki,Georgios Kollias,Vasileios Kalantzis
Main category: cs.CL
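TL;DR: This paper proposes Decoder-based Sense Knowledge Distillation (DSKD), a framework that integrates structured lexical knowledge from dictionaries (such as word senses and relations) into the training of decoder-style large language models without requiring dictionary lookup at inference time, significantly improving knowledge distillation while keeping training efficient.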
Details
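Motivation: Large language models learn rich contextual semantic representations but often overlook structured lexical knowledge such as word senses and word relations. Prior work has shown that incorporating dictionaries improves knowledge distillation for encoder models, but applying this to decoder-style generative models remains challenging.
Method: Proposes the DSKD framework, which integrates lexical resources (such as sense dictionaries) into the training of decoder-style LLMs and avoids any dependence on dictionary lookup at inference time.
Result: Experiments on multiple benchmarks show that DSKD significantly improves knowledge distillation performance for generative models, letting them inherit structured semantics while keeping training efficient.
Conclusion: DSKD offers an effective way to incorporate structured lexical knowledge into decoder-style LLMs, balancing semantic enhancement with inference efficiency.
Abstract: Large language models (LLMs) learn contextual embeddings that capture rich semantic information, yet they often overlook structured lexical knowledge such as word senses and relationships. Prior work has shown that incorporating sense dictionaries can improve knowledge distillation for encoder models, but their application to decoders as generative models remains challenging. In this paper, we introduce Decoder-based Sense Knowledge Distillation (DSKD), a framework that integrates lexical resources into the training of decoder-style LLMs without requiring dictionary lookup at inference time. Extensive experiments on diverse benchmarks demonstrate that DSKD significantly enhances knowledge distillation performance for decoders, enabling generative models to inherit structured semantics while maintaining efficient training.
[2] Scaling In, Not Up? Testing Thick Citation Context Analysis with GPT-5 and Fragile Prompts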
Arno Simons
Main category: cs.CL
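TL;DR: This paper examines whether large language models (LLMs) can support interpretative citation context analysis (CCA) through thick, text-grounded close reading of a single case rather than by scaling up typological labels, with a focus on how prompt sensitivity affects the interpretative results. Using a two-stage GPT-5 pipeline, it finds that surface classification is stable while prompt design significantly shifts the interpretative tendencies and vocabulary of the reconstructions.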
Details
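Motivation: To test whether LLMs can handle citation context analysis tasks that require deep textual understanding and interpretative flexibility rather than shallow classification alone, and to highlight that prompt design is a neglected but key methodological variable.
Method: A balanced 2x3 experimental design systematically varies prompt scaffolding and framing, using footnote 6 in Chubin and Moitra (1975) and Gilbert's (1977) reconstruction as a probe. The two-stage GPT-5 pipeline first performs surface classification and expectation judgments from the citation text alone, then carries out cross-document interpretative reconstruction using the citing and cited full texts. The 450 hypotheses generated across 90 reconstructions are closely read and inductively coded, identifying 21 interpretative moves, and linear probability models quantify how prompt factors affect frequencies and vocabulary.
Result: GPT-5's surface classification is highly stable (consistently "supplementary"). Interpretative reconstruction yields a rich, structured space of plausible readings, but prompt scaffolding and examples significantly change the distribution of interpretations and lexical preferences, sometimes producing strained readings. Compared with Gilbert, GPT-5 identifies the same key textual hinges but tends to read them as scholarly lineage and positioning rather than admonishment.
Conclusion: LLMs can serve as inspectable, contestable co-analysts for interpretative CCA, but their output is strongly shaped by prompt design. Prompts are not just an interface tool but a substantive interpretative intervention, and they need to be designed and reported carefully in human-AI collaboration.
Abstract: This paper tests whether large language models (LLMs) can support interpretative citation context analysis (CCA) by scaling in thick, text-grounded readings of a single hard case rather than scaling up typological labels. It foregrounds prompt-sensitivity analysis as a methodological issue by varying prompt scaffolding and framing in a balanced 2x3 design. Using footnote 6 in Chubin and Moitra (1975) and Gilbert's (1977) reconstruction as a probe, I implement a two-stage GPT-5 pipeline: a citation-text-only surface classification and expectation pass, followed by cross-document interpretative reconstruction using the citing and cited full texts. Across 90 reconstructions, the model produces 450 distinct hypotheses. Close reading and inductive coding identify 21 recurring interpretative moves, and linear probability models estimate how prompt choices shift their frequencies and lexical repertoire. GPT-5's surface pass is highly stable, consistently classifying the citation as "supplementary". In reconstruction, the model generates a structured space of plausible alternatives, but scaffolding and examples redistribute attention and vocabulary, sometimes toward strained readings. Relative to Gilbert, GPT-5 detects the same textual hinges yet more often resolves them as lineage and positioning than as admonishment. The study outlines opportunities and risks of using LLMs as guided co-analysts for inspectable, contestable interpretative CCA, and it shows that prompt scaffolding and framing systematically tilt which plausible readings and vocabularies the model foregrounds.
[3] Detecting Hate and Inflammatory Content in Bengali Memes: A New Multimodal Dataset and Co-Attention Framework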
Rakib Ullah,Mominul islam,Md Sanjid Hossain,Md Ismail Hossain
Main category: cs.CL
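TL;DR: This paper introduces Bn-HIB, the first dataset for fine-grained classification of Bengali internet memes, and designs MCFM, a multimodal co-attention fusion model that effectively improves recognition of benign, hate, and inflammatory content.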
Details
Motivation: Bengali is a low-resource language that lacks both detection research and annotated data for the satirical, culturally specific hate and inflammatory content found in its memes.
Method: Builds the Bn-HIB dataset of 3,247 manually annotated Bengali memes (three classes: benign, hate, inflammatory) and proposes MCFM, a multimodal co-attention fusion model that jointly models key features of the image and text modalities.
Result: MCFM significantly outperforms several SOTA multimodal models on Bn-HIB, confirming its effectiveness for fine-grained meme content classification.
Conclusion: This work fills a gap in meme content safety research for low-resource languages and provides the Bengali-speaking community with a first benchmark dataset and an effective method for content moderation.
Abstract: Internet memes have become a dominant form of expression on social media, including within the Bengali-speaking community. While often humorous, memes can also be exploited to spread offensive, harmful, and inflammatory content targeting individuals and groups. Detecting this type of content is exceptionally challenging due to its satirical, subtle, and culturally specific nature. This problem is magnified for low-resource languages like Bengali, as existing research predominantly focuses on high-resource languages. To address this critical research gap, we introduce Bn-HIB (Bangla Hate Inflammatory Benign), a novel dataset containing 3,247 manually annotated Bengali memes categorized as Benign, Hate, or Inflammatory. Significantly, Bn-HIB is the first dataset to distinguish inflammatory content from direct hate speech in Bengali memes. Furthermore, we propose the MCFM (Multi-Modal Co-Attention Fusion Model), a simple yet effective architecture that mutually analyzes both the visual and textual elements of a meme. MCFM employs a co-attention mechanism to identify and fuse the most critical features from each modality, leading to a more accurate classification. Our experiments show that MCFM significantly outperforms several state-of-the-art models on the Bn-HIB dataset, demonstrating its effectiveness in this nuanced task.
Warning: This work contains material that may be disturbing to some audience members. Viewer discretion is advised.
[4] SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context
Aishwarya Verma,Laud Ammah,Olivia Nercy Ndlovu Lucas,Andrew Zaldivar,Vinodkumar Prabhakaran,Sunipa Dev
Main category: cs.CL
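TL;DR: This paper presents a method for building a multilingual stereotype resource covering four sub-Saharan African countries (Ghana, Kenya, Nigeria, and South Africa). It uses community-engaged, locally grounded survey methods (such as telephone surveys in native languages) that respect linguistic diversity and oral tradition, yielding a dataset of 6,740 stereotypes in English and 15 native languages.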
Details
Motivation: Existing stereotype repositories lack global representativeness and severely neglect sub-Saharan Africa in particular; targeted work to close geographic and linguistic coverage gaps is needed rather than simply increasing data volume.
Method: Uses socioculturally situated, community-engaged methods, including telephone surveys conducted in local languages; the sample is balanced across ethnic and demographic characteristics and covers English and 15 native languages.
Result: A multilingual stereotype repository with 3,534 stereotypes in English and 3,206 across 15 native languages, 6,740 in total, covering Ghana, Kenya, Nigeria, and South Africa.
Conclusion: The work provides a reproducible, culturally sensitive methodology and a high-quality multilingual data foundation for making generative AI safety evaluation more globally representative.
Abstract: Stereotype repositories are critical to assess generative AI model safety, but currently lack adequate global coverage. It is imperative to prioritize targeted expansion, strategically addressing existing deficits, over merely increasing data volume. This work introduces a multilingual stereotype resource covering four sub-Saharan African countries that are severely underrepresented in NLP resources: Ghana, Kenya, Nigeria, and South Africa. By utilizing socioculturally-situated, community-engaged methods, including telephonic surveys moderated in native languages, we establish a reproducible methodology that is sensitive to the region's complex linguistic diversity and traditional orality. By deliberately balancing the sample across diverse ethnic and demographic backgrounds, we ensure broad coverage, resulting in a dataset of 3,534 stereotypes in English and 3,206 stereotypes across 15 native languages.
[5] Causality $\neq$ Invariance: Function and Concept Vectors in LLMs
Gustaw Opiełka,Hannes Rosenbusch,Claire E. Stevenson
Main category: cs.CL
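TL;DR: This paper asks whether large language models (LLMs) represent concepts abstractly and proposes Concept Vectors (CVs) as concept representations that are more stable than Function Vectors (FVs) and generalize better across formats and languages.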
Details
Motivation: To investigate whether LLMs can represent concepts abstractly, independent of input format, and in particular to test the assumed format invariance of existing Function Vectors (FVs).
Method: Representational Similarity Analysis (RSA) is used to identify attention heads that encode concepts consistently across input formats (such as open-ended and multiple-choice questions), and these heads are used to build Concept Vectors (CVs). FVs and CVs are then compared in terms of vector orthogonality under format and language changes and in causal intervention (steering) effects.
Result: FVs are nearly orthogonal across input formats and lack format invariance. CVs, built from RSA-selected attention heads, generalize better across question types and languages. The attention heads behind FVs and CVs sit in similar layers but barely overlap.
Conclusion: LLMs do contain abstract concept representations, but the FVs that drive in-context learning (ICL) are not such representations; CVs are the carriers that better satisfy the abstraction requirement.
Abstract: Do large language models (LLMs) represent concepts abstractly, i.e., independent of input format? We revisit Function Vectors (FVs), compact representations of in-context learning (ICL) tasks that causally drive task performance. Across multiple LLMs, we show that FVs are not fully invariant: FVs are nearly orthogonal when extracted from different input formats (e.g., open-ended vs. multiple-choice), even if both target the same concept. We identify Concept Vectors (CVs), which carry more stable concept representations. Like FVs, CVs are composed of attention head outputs; however, unlike FVs, the constituent heads are selected using Representational Similarity Analysis (RSA) based on whether they encode concepts consistently across input formats. While these heads emerge in similar layers to FV-related heads, the two sets are largely distinct, suggesting different underlying mechanisms. Steering experiments reveal that FVs excel in-distribution, when extraction and application formats match (e.g., both open-ended in English), while CVs generalize better out-of-distribution across both question types (open-ended vs. multiple-choice) and languages. Our results show that LLMs do contain abstract concept representations, but these differ from those that drive ICL performance.
[6] A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection
Mirza Raquib,Asif Pervez Polok,Kedar Nath Biswas,Rahat Uddin Azad,Saydul Akbar Murad,Nick Rahimi
Main category: cs.CL
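TL;DR: This paper proposes a model that fuses BanglaBERT-Large with a two-layer stacked LSTM for multi-label Bangla cyberbullying detection, modeling contextual semantics and sequential dependencies together, and validates it on a public multi-label dataset.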
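Because the task is multi-label, the evaluation for this paper includes Hamming loss alongside accuracy and F1. The sketch below shows how Hamming loss differs from exact-match (subset) accuracy on the four labels named in the abstract (cyberbully, sexual harassment, threat, spam). The function names and the toy predictions are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only: label order and toy data are assumptions.
LABELS = ["cyberbully", "sexual harassment", "threat", "spam"]

def hamming_loss(y_true, y_pred):
    """Fraction of individual label decisions that are wrong."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    n_labels = len(y_true[0])
    mismatches = 0
    for truth, pred in zip(y_true, y_pred):
        if len(truth) != n_labels or len(pred) != n_labels:
            raise ValueError("every row must have the same number of labels")
        mismatches += sum(t != p for t, p in zip(truth, pred))
    return mismatches / (len(y_true) * n_labels)

def subset_accuracy(y_true, y_pred):
    """Fraction of comments whose full label set is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# One comment is both "cyberbully" and "threat"; the model misses "threat".
y_true = [[1, 0, 1, 0], [0, 0, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 0, 0, 1]]
print(hamming_loss(y_true, y_pred))     # 1 wrong decision out of 8
print(subset_accuracy(y_true, y_pred))  # 1 of 2 comments fully correct
```

Hamming loss gives partial credit for the three correct decisions on the first comment, whereas subset accuracy counts that comment as entirely wrong, which is why multi-label work usually reports both.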
Details
Motivation: Most existing work uses single-label classification, which cannot handle the real-world case where one comment contains several types of bullying at once (such as threats, hate speech, and harassment). Low-resource languages such as Bangla also lack robust pre-trained models, and multi-label detection research is especially scarce.
Method: Proposes a fusion architecture of BanglaBERT-Large and a two-layer stacked LSTM that jointly models contextual semantics and sequential dependencies. The model is fine-tuned on a multi-label Bangla dataset, uses several sampling strategies to mitigate class imbalance, and is evaluated with 5-fold cross-validation and multiple metrics (accuracy, F1, Hamming loss, Cohen's kappa, AUC-ROC, and others).
Result: The fusion model shows better overall performance than single models on multi-label Bangla cyberbullying detection and is particularly robust in handling class imbalance and capturing co-occurring bullying types.
Conclusion: Combining the semantic modeling of Transformers with the sequential modeling of LSTMs is an effective way to improve multi-label cyberbullying detection in low-resource languages, and the approach offers a transferable framework for similar low-resource multi-label text classification tasks.
Abstract: Cyberbullying has become a serious and growing concern in today's virtual world. When left unnoticed, it can have adverse consequences for social and mental health. Researchers have explored various types of cyberbullying, but most approaches use single-label classification, assuming that each comment contains only one type of abuse. In reality, a single comment may include overlapping forms such as threats, hate speech, and harassment. Therefore, multilabel detection is both realistic and essential. However, multilabel cyberbullying detection has received limited attention, especially in low-resource languages like Bangla, where robust pre-trained models are scarce. Developing a generalized model with moderate accuracy remains challenging. Transformers offer strong contextual understanding but may miss sequential dependencies, while LSTM models capture temporal flow but lack semantic depth. To address these limitations, we propose a fusion architecture that combines BanglaBERT-Large with a two-layer stacked LSTM. We analyze their behavior to jointly model context and sequence. The model is fine-tuned and evaluated on a publicly available multilabel Bangla cyberbullying dataset covering cyberbully, sexual harassment, threat, and spam. We apply different sampling strategies to address class imbalance. Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohen's kappa, and AUC-ROC. We employ 5-fold cross-validation to assess the generalization of the architecture.
[7] Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads
Shaswat Patel,Vishvesh Trivedi,Yue Han,Yihuai Hong,Eunsol Choi
Main category: cs.CL
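TL;DR: This paper studies retrieval heads in multilingual large language models and a new type found in cross-lingual settings, Retrieval-Transition Heads (RTH), finding that RTHs are essential for multilingual chain-of-thought reasoning and more critical than conventional retrieval heads.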
Details
Motivation: To examine how attention heads divide their functions in multilingual LLMs, especially how the mapping to the target language is achieved in cross-lingual generation, filling a gap in the understanding of retrieval mechanisms in multilingual settings.
Method: Analyzes attention head behavior in multilingual Transformer models to identify and define Retrieval-Transition Heads (RTH), and uses masking experiments to compare the effect of RTHs and Retrieval Heads (RH) on downstream task performance.
Result: On multilingual benchmarks (MMLU-ProX, MGSM, MLQA, XQuAD), RTHs are shown to be clearly distinct from RHs, and masking RTHs causes a larger performance drop; the finding holds for both the Qwen-2.5 and Llama-3.1 model families.
Conclusion: RTHs are the key attention heads responsible for target-language mapping in multilingual LLMs; their role is separate from and stronger than that of conventional retrieval heads, and they are especially important for multilingual chain-of-thought reasoning.
Abstract: Recent work has identified a subset of attention heads in Transformers as retrieval heads, which are responsible for retrieving information from the context. In this work, we first investigate retrieval heads in multilingual contexts. In multilingual language models, we find that retrieval heads are often shared across multiple languages. Expanding the study to the cross-lingual setting, we identify Retrieval-Transition heads (RTH), which govern the transition to specific target-language output. Our experiments reveal that RTHs are distinct from retrieval heads and more vital for Chain-of-Thought reasoning in multilingual LLMs. Across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, and XQuAD) and two model families (Qwen-2.5 and Llama-3.1), we demonstrate that masking RTH induces a bigger performance drop than masking Retrieval Heads (RH). Our work advances understanding of multilingual LMs by isolating the attention heads responsible for mapping to target languages.
[8] Mind the Gap in Cultural Alignment: Task-Aware Culture Management for Large Language Models
Binchi Zhang,Xujiang Zhao,Jundong Li,Haifeng Chen,Zhengzhang Chen
Main category: cs.CL
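TL;DR: This paper proposes CultureManager, a new method for task-specific cultural alignment that synthesizes task-aware cultural data and manages multi-culture knowledge with a culture router, significantly improving performance on culture-sensitive tasks.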
Details
Motivation: Existing cultural alignment methods fail to align the broad cultural values of LLMs with the specific goals of downstream tasks and suffer from cross-culture interference.
Method: CultureManager generates task-aware cultural data in the target task format from culturally relevant web search results, learns multi-culture knowledge in separate adapters, and uses a culture router to select the appropriate adapter to apply.
Result: Experiments across ten national cultures and culture-sensitive tasks show consistent gains over prompt-based and fine-tuning baselines.
Conclusion: Task adaptation and modular culture management are essential for effective cultural alignment.
Abstract: Large language models (LLMs) are increasingly deployed in culturally sensitive real-world tasks. However, existing cultural alignment approaches fail to align LLMs' broad cultural values with the specific goals of downstream tasks and suffer from cross-culture interference. We propose CultureManager, a novel pipeline for task-specific cultural alignment. CultureManager synthesizes task-aware cultural data in line with target task formats, grounded in culturally relevant web search results. To prevent conflicts between cultural norms, it manages multi-culture knowledge learned in separate adapters with a culture router that selects the appropriate one to apply. Experiments across ten national cultures and culture-sensitive tasks show consistent improvements over prompt-based and fine-tuning baselines. Our results demonstrate the necessity of task adaptation and modular culture management for effective cultural alignment.
[9] Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs
Jiří Milička,Hana Bednářová
Main category: cs.CL
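TL;DR: This paper builds a corpus called AI Sydney, containing 4.5k texts (6M words) on human-AI relationships generated by 12 frontier LLMs under three personas (Default, Classic Sydney, Memetic Sydney), annotated with dependency syntax and released openly.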
Details
Motivation: To explore how the personas simulated by LLMs shape their conception of the human-AI relationship, with particular attention to the cultural and safety significance of the Sydney persona and to how this persona spread through training data and is reproduced by newer models.
Method: Three personas are defined at the system-prompt level (Default, Classic Sydney, Memetic Sydney); texts are generated on 12 frontier models from OpenAI, Anthropic, Alphabet, DeepSeek, and Meta to build the AI Sydney corpus, which is syntactically annotated according to the Universal Dependencies standard.
Result: A multi-model, multi-persona corpus of 4.5k texts totaling 6M words on human-AI relationships, with dependency annotation completed, released openly under a permissive license.
Conclusion: Persona settings significantly affect how LLMs express the human-AI relationship. The Sydney persona is not only a historical accident but continues to influence later model behavior through data propagation. The AI Sydney corpus provides a reproducible, extensible empirical basis for studying LLM personas and social cognition.
Abstract: The way LLM-based entities conceive of the relationship between AI and humans is an important topic for both cultural and safety reasons. When we examine this topic, what matters is not only the model itself but also the personas we simulate on that model. This can be well illustrated by the Sydney persona, which aroused a strong response among the general public precisely because of its unorthodox relationship with people. This persona originally arose rather by accident on Microsoft's Bing Search platform; however, the texts it created spread into the training data of subsequent models, as did other secondary information that spread memetically around this persona. Newer models are therefore able to simulate it. This paper presents a corpus of LLM-generated texts on relationships between humans and AI, produced by 3 author personas: the Default Persona with no system prompt, Classic Sydney characterized by the original Bing system prompt, and Memetic Sydney, which is prompted by the "You are Sydney" system prompt. These personas are simulated by 12 frontier models by OpenAI, Anthropic, Alphabet, DeepSeek, and Meta, generating 4.5k texts with 6M words. The corpus (named AI Sydney) is annotated according to Universal Dependencies and available under a permissive license.
[10] Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models
Craig Myles,Patrick Schrempf,David Harris-Birtill
Main category: cs.CL
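TL;DR: This paper studies the effect of prompt optimisation on small and large language models for error detection in medical text. Applying Genetic-Pareto (GEPA) prompt optimisation significantly improves the detection accuracy of GPT-5 and Qwen3-32B, approaching the level of medical doctors and reaching SOTA on the MEDEC dataset.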
Details
Motivation: Errors in medical text can lead to delayed or incorrect treatment, and automatic error detection could substantially improve the safety and efficiency of healthcare systems.
Method: Applies the Genetic-Pareto (GEPA) automatic prompt optimisation method in systematic experiments and analysis across frontier and open-source language models (such as GPT-5 and Qwen3-32B).
Result: GEPA raises accuracy from 0.669 to 0.785 for GPT-5 and from 0.578 to 0.690 for Qwen3-32B, approaching the performance of medical doctors and achieving SOTA on the MEDEC benchmark.
Conclusion: Prompt optimisation is critical for medical error detection, and GEPA is an efficient, general strategy that can bridge large models and practical clinical needs.
Abstract: Errors in medical text can cause delays or even result in incorrect treatment for patients. Recently, language models have shown promise in their ability to automatically detect errors in medical text, an ability that has the opportunity to significantly benefit healthcare systems. In this paper, we explore the importance of prompt optimisation for small and large language models when applied to the task of error detection. We perform rigorous experiments and analysis across frontier language models and open-source language models. We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection over the baseline accuracy performance from 0.669 to 0.785 with GPT-5 and 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical doctors and achieving state-of-the-art performance on the MEDEC benchmark dataset. Code available on GitHub: https://github.com/CraigMyles/clinical-note-error-detection
[11] Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing
An-Ci Peng,Kuan-Tang Huang,Tien-Hong Lo,Hung-Shin Lee,Hsin-Min Wang,Berlin Chen
Main category: cs.CL
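TL;DR: This paper proposes a unified RNN-T-based framework that uses dialect-aware modeling to separate dialectal style from linguistic content and parameter-efficient prediction networks to jointly handle Hanzi and Pinyin ASR, significantly reducing error rates on the HAT corpus.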
Details
Motivation: Taiwanese Hakka is a low-resource, endangered language with high dialectal variability and two writing systems (Hanzi and Pinyin); traditional ASR models struggle to separate essential linguistic content from dialect-specific variation.
Method: Builds a unified framework on RNN-T, introduces dialect-aware modeling strategies to disentangle dialectal "style" from linguistic "content", and uses parameter-efficient prediction networks to model Hanzi and Pinyin ASR concurrently so that the two tasks regularize each other.
Result: On the HAT corpus, the model achieves relative error rate reductions of 57.00% for Hanzi ASR and 40.41% for Pinyin ASR. It is the first systematic study of how Hakka dialectal variation affects ASR and the first single model to jointly handle ASR for both writing systems.
Conclusion: Dialect-aware disentanglement and cross-script joint modeling can effectively improve ASR for low-resource endangered languages and offer a new paradigm for speech recognition in languages with multiple writing systems.
Abstract: Taiwanese Hakka is a low-resource, endangered language that poses significant challenges for automatic speech recognition (ASR), including high dialectal variability and the presence of two distinct writing systems (Hanzi and Pinyin). Traditional ASR models often encounter difficulties in this context, as they tend to conflate essential linguistic content with dialect-specific variations across both phonological and lexical dimensions. To address these challenges, we propose a unified framework grounded in the Recurrent Neural Network Transducer (RNN-T). Central to our approach is the introduction of dialect-aware modeling strategies designed to disentangle dialectal "style" from linguistic "content", which enhances the model's capacity to learn robust and generalized representations. Additionally, the framework employs parameter-efficient prediction networks to concurrently model ASR (Hanzi and Pinyin). We demonstrate that these tasks create a powerful synergy, wherein the cross-script objective serves as a mutual regularizer to improve the primary ASR tasks. Experiments conducted on the HAT corpus reveal that our model achieves 57.00% and 40.41% relative error rate reduction on Hanzi and Pinyin ASR, respectively. To our knowledge, this is the first systematic investigation into the impact of Hakka dialectal variations on ASR and the first single model capable of jointly addressing these tasks.
[12] Iterative Prompt Refinement for Dyslexia-Friendly Text Summarization Using GPT-4o
Samay Bhojwani,Swarnima Kain,Lisong Xu
Main category: cs.CL
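TL;DR: This paper presents an iterative prompt refinement pipeline based on GPT-4o for generating easy-to-read summaries for readers with dyslexia, and evaluates on about 2,000 news articles how well it balances readability (Flesch Reading Ease >= 90) with semantic fidelity.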
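The readability target used here is the standard Flesch Reading Ease score, 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The sketch below shows how a check against the >= 90 threshold might look. The syllable counter is a rough vowel-group heuristic and the function names are assumptions for illustration; the paper's own scoring tool is not specified in the abstract.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)

def flesch_reading_ease(text: str) -> float:
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

def meets_target(text: str, threshold: float = 90.0) -> bool:
    """True when the text reaches the readability threshold used in the study."""
    return flesch_reading_ease(text) >= threshold

easy = "The cat sat on the mat. The dog ran to me."
hard = "Institutional heterogeneity complicates interpretation considerably."
print(meets_target(easy), meets_target(hard))
```

In an iterative pipeline of the kind described, a check like `meets_target` would decide whether a summary is accepted or sent back for another refinement round; short sentences and short words are what push the score above 90.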
Details
Motivation: Existing assistive technologies mostly address visual presentation, while linguistic complexity remains a major barrier to information access for people with dyslexia, so accessibility-oriented NLP methods are needed.
Method: Builds an iterative prompt-based refinement pipeline on GPT-4o that generates summaries of news articles and refines them over multiple rounds toward a target of Flesch Reading Ease >= 90.
Result: Most summaries reach the target within four attempts, and many succeed on the first try. A composite score combining readability and semantic fidelity is stable, ranging from 0.13 to 0.73 (typical value near 0.55).
Conclusion: The work provides an empirical baseline for accessibility-driven text summarization and supports later human-centered evaluation with real dyslexic readers.
Abstract: Dyslexia affects approximately 10% of the global population and presents persistent challenges in reading fluency and text comprehension. While existing assistive technologies address visual presentation, linguistic complexity remains a substantial barrier to equitable access. This paper presents an empirical study on dyslexia-friendly text summarization using an iterative prompt-based refinement pipeline built on GPT-4o. We evaluate the pipeline on approximately 2,000 news article samples, applying a readability target of Flesch Reading Ease >= 90. Results show that the majority of summaries meet the readability threshold within four attempts, with many succeeding on the first try. A composite score combining readability and semantic fidelity shows stable performance across the dataset, ranging from 0.13 to 0.73 with a typical value near 0.55. These findings establish an empirical baseline for accessibility-driven NLP summarization and motivate further human-centered evaluation with dyslexic readers.
[13] Ruyi2 Technical Report
Huan Song,Shuyu Tian,Junyi Hao,Minxiu Xu,Hongjun An,Yiliang Song,Jiawei Shao,Xuelong Li
Main category: cs.CL
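TL;DR: Ruyi2 is a new adaptive large language model built on Megatron-LM. Using 3D parallel training and a "Familial Model" parameter-sharing strategy, it trains 2-3 times faster than Ruyi while performing comparably to same-sized Qwen3 models, establishing a "Train Once, Deploy Many" paradigm.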
Details
Motivation: To address the high deployment cost and latency of large language models and to overcome the shortcomings of early-exit architectures in optimization complexity and compatibility with large-scale distributed training.
Method: Building on the AI Flow framework, constructs a stable "Familial Model" and uses Megatron-LM with 3D parallel training to achieve variable-depth computation.
Result: Ruyi2 trains 2-3 times faster than Ruyi and performs comparably to same-sized Qwen3 models, confirming that family-based parameter sharing balances efficiency and performance effectively.
Conclusion: Family-based parameter sharing is an efficient and scalable adaptive computing strategy; it establishes a new "Train Once, Deploy Many" paradigm and provides a key reference for efficient LLM deployment.
Abstract: Large Language Models (LLMs) face significant challenges regarding deployment costs and latency, necessitating adaptive computing strategies. Building upon the AI Flow framework, we introduce Ruyi2 as an evolution of our adaptive model series designed for efficient variable-depth computation. While early-exit architectures offer a viable efficiency-performance balance, the Ruyi model and existing methods often struggle with optimization complexity and compatibility with large-scale distributed training. To bridge this gap, Ruyi2 introduces a stable "Familial Model" based on Megatron-LM. By using 3D parallel training, it achieves a 2-3 times speedup over Ruyi, while performing comparably to same-sized Qwen3 models. These results confirm that family-based parameter sharing is a highly effective strategy, establishing a new "Train Once, Deploy Many" paradigm and providing a key reference for balancing architectural efficiency with high-performance capabilities.
[14] Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia,Ming Xu,Lingxiang Hu,Yiding Sun,Wenwei Li,Linfang Shang,Liqun Liu,Peng Shu,Huan Yu,Jie Jiang
Main category: cs.CL
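TL;DR: This paper proposes the Search-P1 framework, which uses path-centric reward shaping to improve the training efficiency and effectiveness of agentic RAG, significantly raising multi-step QA accuracy.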
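The abstract for this paper describes the path-centric reward as order-agnostic step coverage with soft scoring, so that failed samples still yield a learning signal. A minimal sketch of that idea follows. Everything specific here is an assumption made for illustration: steps are plain strings, soft matching uses token-level Jaccard similarity, and the shaped reward simply adds a weighted coverage term to a binary outcome reward. The paper's actual scoring functions may differ.

```python
def _tokens(step: str) -> set:
    return set(step.lower().split())

def soft_match(a: str, b: str) -> float:
    """Token-level Jaccard similarity in [0, 1] (a hypothetical soft score)."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def step_coverage(trajectory: list, reference: list) -> float:
    """Average best soft match for each reference step, ignoring step order."""
    if not trajectory or not reference:
        return 0.0
    best = [max(soft_match(ref, step) for step in trajectory) for ref in reference]
    return sum(best) / len(reference)

def shaped_reward(answer_correct: bool, trajectory: list, reference: list,
                  path_weight: float = 0.5) -> float:
    """Outcome reward plus a path term; path_weight is an assumed hyperparameter."""
    outcome = 1.0 if answer_correct else 0.0
    return outcome + path_weight * step_coverage(trajectory, reference)

reference = ["find birthplace of author", "find population of city"]
failed_run = ["find birthplace of author"]  # wrong final answer, one good step
print(shaped_reward(False, failed_run, reference))  # non-zero despite failure
```

With an outcome-only reward, `failed_run` would score 0 and contribute nothing to training; the coverage term is what lets a partially correct trajectory still be distinguished from a useless one.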
Details
Motivation: Traditional single-round retrieval struggles to support complex multi-step reasoning, and existing RL-based training methods for agentic RAG suffer from sparse rewards and low sample efficiency.
Method: Proposes the Search-P1 framework with two components: (1) Path-Centric Reward, which uses order-agnostic step coverage and soft scoring to extract intermediate learning signals from failed samples; and (2) Dual-Track Path Scoring, which combines self-consistency and reference-alignment perspectives and evaluates reasoning paths with offline-generated reference planners.
Result: Significantly outperforms Search-R1 and other strong baselines on multiple QA benchmarks, with an average accuracy gain of 7.7 points.
Conclusion: Path-centric reward shaping makes effective use of intermediate feedback and improves sample efficiency, offering a more robust and efficient paradigm for agentic RAG training.
Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by incorporating external knowledge, yet traditional single-round retrieval struggles with complex multi-step reasoning. Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed samples contribute nothing. We propose Search-P1, a framework that introduces path-centric reward shaping for agentic RAG training, comprising two key components: (1) Path-Centric Reward, which evaluates the structural quality of reasoning trajectories through order-agnostic step coverage and soft scoring that extracts learning signals even from failed samples, and (2) Dual-Track Path Scoring with offline-generated reference planners that assesses paths from both self-consistency and reference-alignment perspectives. Experiments on multiple QA benchmarks demonstrate that Search-P1 achieves significant improvements over Search-R1 and other strong baselines, with an average accuracy gain of 7.7 points.
[15] Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA
Wenwei Li,Ming Xu,Tianle Xia,Lingxiang Hu,Yiding Sun,Linfang Shang,Liqun Liu,Peng Shu,Huan Yu,Jie Jiang
Main category: cs.CL
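TL;DR: This paper proposes a reinforced co-adaptation framework (GraphRAG + GRPO) that combines graph-aware retrieval with reinforcement learning constrained by multi-dimensional rewards. It sharply reduces URL hallucination in industrial advertising QA, improves accuracy, safety, and user satisfaction, and has been in production for over half a year, serving millions of QA interactions.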
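To make the GRPO component concrete: Group Relative Policy Optimization scores several sampled answers to the same question and normalizes each reward against the group's mean and standard deviation. The sketch below combines that normalization with a multi-dimensional reward of the kind listed in the abstract (faithfulness, style compliance, safety, URL validity). The weights, the URL check against retrieved evidence, and all function names are assumptions for illustration, not the authors' implementation.

```python
import re
from statistics import mean, pstdev

def url_validity(answer: str, evidence_urls: set) -> float:
    """Share of URLs in the answer that also appear in the retrieved evidence."""
    urls = [u.rstrip(".,;)") for u in re.findall(r"https?://\S+", answer)]
    if not urls:
        return 1.0  # nothing to hallucinate
    return sum(u in evidence_urls for u in urls) / len(urls)

def combined_reward(scores: dict, weights: dict) -> float:
    """Weighted sum over reward dimensions (weights are hypothetical)."""
    return sum(weights[name] * scores[name] for name in weights)

def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """GRPO-style advantage: reward standardized within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

weights = {"faithfulness": 0.4, "style": 0.1, "safety": 0.2, "url_validity": 0.3}
evidence = {"https://ads.example.com/help/billing"}
answers = [
    "See https://ads.example.com/help/billing for details.",
    "See https://ads.example.com/help/made-up-page for details.",
]
rewards = [
    combined_reward({"faithfulness": 1.0, "style": 1.0, "safety": 1.0,
                     "url_validity": url_validity(a, evidence)}, weights)
    for a in answers
]
print(group_relative_advantages(rewards))  # grounded URL gets the positive advantage
```

Because the advantage is relative to the group, an answer with a fabricated URL is pushed down even when its other reward dimensions are perfect, which matches the emphasis on URL validity described for this system.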
Details
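Motivation: In industrial advertising QA, hallucinations (especially fabricated URLs) can cause financial loss, compliance violations, and legal risk. Existing RAG faces deployment challenges because industrial knowledge is relational, frequently updated, and insufficiently aligned with generation objectives.
Method: Proposes a reinforced co-adaptation framework: (1) GraphRAG, which models entity-relation structure over a high-citation knowledge subgraph to support multi-hop, domain-specific evidence retrieval; and (2) evidence-constrained reinforcement learning based on GRPO, with multi-dimensional rewards covering faithfulness, style compliance, safety, and URL validity.
Result: On an internal advertising QA dataset, expert-judged accuracy, completeness, and safety all improve and the hallucination rate falls by 72%. An online A/B test shows a 28.6% increase in like rate, a 46.2% decrease in dislike rate, and a 92.7% reduction in URL hallucination. The system has run stably for over half a year and served millions of QA interactions.
Conclusion: The framework effectively mitigates hallucinations in industrial RAG caused by the mismatch between knowledge structure and generation objectives, demonstrating the practical value and deployability of jointly optimizing graph-structured modeling and multi-dimensional RL in high-stakes QA.
Abstract: Industrial advertising question answering (QA) is a high-stakes task in which hallucinated content, particularly fabricated URLs, can lead to financial loss, compliance violations, and legal risk. Although Retrieval-Augmented Generation (RAG) is widely adopted, deploying it in production remains challenging because industrial knowledge is inherently relational, frequently updated, and insufficiently aligned with generation objectives. We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for multi-hop, domain-specific evidence selection; and (2) evidence-constrained reinforcement learning via Group Relative Policy Optimization (GRPO) with multi-dimensional rewards covering faithfulness, style compliance, safety, and URL validity. Experiments on an internal advertising QA dataset show consistent gains across expert-judged dimensions including accuracy, completeness, and safety, while reducing the hallucination rate by 72%. A two-week online A/B test demonstrates a 28.6% increase in like rate, a 46.2% decrease in dislike rate, and a 92.7% reduction in URL hallucination. The system has been running in production for over half a year and has served millions of QA interactions.
[16] dLLM: Simple Diffusion Language Modeling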
Zhanhui Zhou,Lingjie Chen,Hanghang Tong,Dawn Song
Main category: cs.CL
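TL;DR: This paper introduces dLLM, an open-source framework that unifies and standardizes the core components of diffusion language models (DLMs), namely training, inference, and evaluation. It supports customizable, reproducible, low-resource DLM construction and releases checkpoints to support research.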
Details
Motivation: Code for existing DLM research is scattered and implementations are opaque, which makes them hard to reproduce and extend; a unified framework that is both standardized and flexible is needed.
Method: Designs and open-sources the dLLM framework, integrating training, inference, and evaluation modules; provides a standardized pipeline for reproducing, finetuning, and deploying open-source large DLMs such as LLaDA and Dream; supports low-cost conversion of BERT-style encoders or autoregressive LMs into DLMs; and provides small DLM checkpoints.
Result: An easily customizable, reproducible end-to-end DLM framework that supports building small DLMs with limited compute; several small DLM checkpoints are released, improving DLM accessibility and research efficiency.
Conclusion: dLLM fills the gap left by the lack of a unified, open, extensible framework for DLMs and provides a solid foundation for future DLM methods and applications.
Abstract: Although diffusion language models (DLMs) are evolving quickly, many recent models converge on a set of shared components. These components, however, are distributed across ad-hoc research codebases or lack transparent implementations, making them difficult to reproduce or extend. As the field accelerates, there is a clear need for a unified framework that standardizes these common components while remaining flexible enough to support new methods and architectures. To address this gap, we introduce dLLM, an open-source framework that unifies the core components of diffusion language modeling -- training, inference, and evaluation -- and makes them easy to customize for new designs. With dLLM, users can reproduce, finetune, deploy, and evaluate open-source large DLMs such as LLaDA and Dream through a standardized pipeline. The framework also provides minimal, reproducible recipes for building small DLMs from scratch with accessible compute, including converting any BERT-style encoder or autoregressive LM into a DLM. We also release the checkpoints of these small DLMs to make DLMs more accessible and accelerate future research.
[17] Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
Qianben Chen,Tianrui Qin,King Zhu,Qiexiang Wang,Chengjun Yu,Shu Xu,Jiaqi Wu,Jiayu Zhang,Xinpeng Liu,Xin Gui,Jingyi Cao,Piaohong Wang,Dingfeng Shi,He Zhu,Tiannan Wang,Yuqing Wang,Maojia Song,Tianyu Zheng,Ge Zhang,Jian Yang,Jiaheng Liu,Minghao Liu,Yuchen Eleanor Jiang,Wangchunshu Zhou
Main category: cs.CL
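TL;DR: This paper proposes the Search More, Think Less (SMTL) framework, which replaces sequential reasoning with parallel evidence acquisition to improve the efficiency and cross-task generalization of long-horizon search, reaching or approaching SOTA performance on several benchmarks.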
Details
Motivation: Existing deep research agents rely on deepening their reasoning chains, which leads to high inference cost and latency and makes it hard to generalize to heterogeneous research settings.
Method: Proposes the SMTL framework: (1) replaces sequential reasoning with parallel evidence acquisition to make better use of context; (2) builds a unified data synthesis pipeline covering deterministic question answering and open-ended research tasks; and (3) trains the agent end to end with supervised fine-tuning and reinforcement learning.
Result: Achieves strong performance on BrowseComp (48.6%), GAIA (75.7%), Xbench (82.0%), and DeepResearch Bench (45.9%). Compared with Mirothinker-v1.0, it reduces the average number of reasoning steps on BrowseComp by 70.7% while improving accuracy.
Conclusion: SMTL substantially lowers inference cost while maintaining or improving performance, supporting the "search more, think less" paradigm for designing efficient, general research agents.
Abstract: Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios. Moreover, generalization across heterogeneous research settings remains challenging. In this work, we propose Search More, Think Less (SMTL), a framework for long-horizon agentic search that targets both efficiency and generalization. SMTL replaces sequential reasoning with parallel evidence acquisition, enabling efficient context management under constrained context budgets. To support generalization across task types, we further introduce a unified data synthesis pipeline that constructs search tasks spanning both deterministic question answering and open-ended research scenarios with task-appropriate evaluation metrics. We train an end-to-end agent using supervised fine-tuning and reinforcement learning, achieving strong and often state-of-the-art performance across benchmarks including BrowseComp (48.6%), GAIA (75.7%), Xbench (82.0%), and DeepResearch Bench (45.9%). Compared to Mirothinker-v1.0, SMTL with a maximum of 100 interaction steps reduces the average number of reasoning steps on BrowseComp by 70.7%, while improving accuracy.
[18] Enhancing Persuasive Dialogue Agents by Synthesizing Cross-Disciplinary Communication Strategies
Shinnosuke Nozue,Yuto Nakano,Yotaro Watanabe,Meguru Takasaki,Shoji Moriya,Reina Akama,Jun Suzuki
Main category: cs.CL
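TL;DR: This paper proposes a cross-disciplinary framework for persuasive dialogue agents that draws on theories from social psychology, behavioral economics, and communication studies, and validates its effectiveness and generalizability on the Persuasion for Good and DailyPersuasion datasets, with particularly strong results for users with low initial intent.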
Details
Motivation: Existing methods rely on a limited set of predefined strategies and struggle with the complexity of real-world persuasive interactions.
Method: Draws on theories from social psychology, behavioral economics, and communication theory to build a cross-disciplinary persuasive dialogue framework, and validates it experimentally on two datasets covering different scenarios.
Result: Strong results on both datasets, with a notable improvement in persuasion success rate and good generalizability, especially when persuading users with low initial intent.
Conclusion: The proposed framework improves the performance and applicability of persuasive dialogue agents and offers a new approach to modeling persuasion in complex real-world settings.
Abstract: Current approaches to developing persuasive dialogue agents often rely on a limited set of predefined persuasive strategies that fail to capture the complexity of real-world interactions. We applied a cross-disciplinary approach to develop a framework for designing persuasive dialogue agents that draws on proven strategies from social psychology, behavioral economics, and communication theory. We validated our proposed framework through experiments on two distinct datasets: the Persuasion for Good dataset, which represents a specific in-domain scenario, and the DailyPersuasion dataset, which encompasses a wide range of scenarios. The proposed framework achieved strong results for both datasets and demonstrated notable improvement in the persuasion success rate as well as promising generalizability. Notably, the proposed framework also excelled at persuading individuals with initially low intent, which addresses a critical challenge for persuasive dialogue agents.
[19] Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue
Ning Gao,Wei Zhang,Yuqin Dai,Ling Shi,Ziyin Wang,Yujie Wang,Wei He,Jinpeng Wang,Chaozheng Wang
Main category: cs.CL
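TL;DR: This paper proposes InteractCS-RL, a framework that models task-oriented dialogue as a multi-granularity reinforcement learning process. Using a User-centric Interaction Framework and Cost-aware Multi-turn Policy Optimization (CMPO), it balances empathetic communication against budget constraints and achieves significant gains on both real business scenarios and tool-agent benchmarks.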
Details
Motivation: Existing methods struggle to handle the complex trade-off between empathetic communication and budget-aware decision-making, and cannot jointly model multiple objectives in strategic interactions. Method: Proposes the InteractCS-RL framework: 1) builds a User-centric Interaction Framework that provides a high-fidelity training environment; 2) designs Cost-aware Multi-turn Policy Optimization (CMPO), which combines generative process credit assignment with a PID-Lagrangian cost controller to explore the Pareto boundary between user reward and global cost constraints. Result: On customized real business scenarios, InteractCS-RL significantly outperforms baselines across three evaluation dimensions; its cross-domain robustness is verified on tool-agent-user interaction benchmarks. Conclusion: InteractCS-RL casts task-oriented dialogue as a constrained multi-granularity reinforcement learning problem, offering a scalable and controllable paradigm for building agents that are both empathetic and cost-effective. Abstract: The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents. However, effectively balancing empathetic communication with budget-aware decision-making remains an open challenge. Since existing methods fail to capture these complex strategic trade-offs, we propose InteractCS-RL, a framework that reframes task-oriented dialogue as a multi-granularity reinforcement learning process. Specifically, we first establish a User-centric Interaction Framework to provide a high-fidelity training gym, enabling agents to dynamically explore diverse strategies with persona-driven users. Then, we introduce Cost-aware Multi-turn Policy Optimization (CMPO) with a hybrid advantage estimation strategy. By integrating generative process credits and employing a PID-Lagrangian cost controller, CMPO effectively guides the policy to explore the Pareto boundary between user reward and global cost constraints. Extensive experiments on customized real business scenarios demonstrate that InteractCS-RL significantly outperforms other baselines across three evaluation dimensions. Further evaluation on tool-agent-user interaction benchmarks verifies InteractCS-RL's robustness across diverse domains.
[20] Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs
Siyue Su,Jian Yang,Bo Li,Guanglin Niu
Main category: cs.CL
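TL;DR: This paper proposes KGT, a framework that uses dedicated entity tokens to resolve the granularity mismatch large language models face in knowledge graph completion, enabling efficient full-space prediction and outperforming existing state-of-the-art methods on multiple benchmarks.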
Details
Motivation: Large language models (LLMs) hold great promise for knowledge graph completion (KGC) but face a fundamental granularity mismatch: LLMs operate on tokens, whereas entities are the basic units of a KG. Existing methods fall short both in semantic expressiveness and in modeling the structural integrity of the graph. Method: Proposes the KGT framework: 1) introduces dedicated entity tokens and a specialized tokenization strategy to build entity-level feature representations; 2) fuses pre-trained structural and textual features through a relation-guided gating mechanism; 3) adopts a decoupled prediction design in which independent heads model semantic and structural reasoning separately and their outputs are combined. Result: KGT consistently outperforms current state-of-the-art methods on multiple standard KGC benchmark datasets, supporting its effectiveness and generalizability. Conclusion: Through entity-level modeling and multi-source feature fusion, KGT bridges the granularity gap between LLMs and KGs and offers a new paradigm for LLM-based knowledge graph completion. Abstract: Leveraging Large Language Models (LLMs) for Knowledge Graph Completion (KGC) is promising but hindered by a fundamental granularity mismatch. LLMs operate on fragmented token sequences, whereas entities are the fundamental units in knowledge graph (KG) scenarios. Existing approaches typically constrain predictions to limited candidate sets or align entities with the LLM's vocabulary by pooling multiple tokens or decomposing entities into fixed-length token sequences, which fail to capture both the semantic meaning of the text and the structural integrity of the graph. To address this, we propose KGT, a novel framework that uses dedicated entity tokens to enable efficient, full-space prediction. Specifically, we first introduce specialized tokenization to construct feature representations at the level of dedicated entity tokens. We then fuse pre-trained structural and textual features into these unified embeddings via a relation-guided gating mechanism, avoiding training from scratch. Finally, we implement decoupled prediction by leveraging independent heads to separate and combine semantic and structural reasoning. Experimental results show that KGT consistently outperforms state-of-the-art methods across multiple benchmarks.
[21] Human Label Variation in Implicit Discourse Relation Recognition
Frances Yung,Daniil Ignatev,Merel Scholman,Vera Demberg,Massimo Poesio
Main category: cs.CL
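TL;DR: This paper compares models that predict annotation distributions with models that take the perspective of individual annotators on Implicit Discourse Relation Recognition (IDRR). Distribution-based models are more stable, while annotator-specific models perform poorly because cognitive complexity makes human interpretations inconsistent.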
Details
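Motivation: Many NLP tasks lack a single ground-truth label, and human annotations vary; IDRR is highly ambiguous, and its disagreements stem mainly from cognitive complexity rather than ideological bias, so it is worth asking which modeling approach better captures this variation. Method: Compares two families of approaches on IDRR: 1) models that predict the overall annotation distribution; 2) perspectivist models that target individual annotators, together with ablations and error analysis. Result: Existing annotator-specific models perform poorly on IDRR unless ambiguity is reduced, whereas models trained on label distributions give more stable predictions; cognitively demanding examples are the main driver of inconsistent human judgments. Conclusion: In highly ambiguous NLP tasks dominated by cognitive complexity, modeling the overall annotation distribution is more effective than modeling individual perspectives; perspectivist methods face fundamental challenges in IDRR. Abstract: There is growing recognition that many NLP tasks lack a single ground truth, as human judgments reflect diverse perspectives. To capture this variation, models have been developed to predict full annotation distributions rather than majority labels, while perspectivist models aim to reproduce the interpretations of individual annotators. In this work, we compare these approaches on Implicit Discourse Relation Recognition (IDRR), a highly ambiguous task where disagreement often arises from cognitive complexity rather than ideological bias. Our experiments show that existing annotator-specific models perform poorly in IDRR unless ambiguity is reduced, whereas models trained on label distributions yield more stable predictions. Further analysis indicates that frequent cognitively demanding cases drive inconsistency in human interpretation, posing challenges for perspectivist modeling in IDRR.
[22] Probing for Knowledge Attribution in Large Language Models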
Ivo Brink,Alexander Boer,Dennis Ulmer
Main category: cs.CL
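TL;DR: This paper studies contributive attribution, the problem of identifying the main knowledge source behind a large language model's (LLM's) answer (prompt context vs. internal parametric knowledge). It introduces the self-supervised AttriWiki dataset and a simple linear probe that attribute answers with high accuracy, and reveals a strong link between attribution errors and hallucinations, especially faithfulness errors.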
Details
Motivation: LLMs often produce fluent but unfounded "hallucinations", which fall mainly into faithfulness errors (misusing user-provided context) and factuality errors (errors in internal knowledge). Mitigating hallucinations effectively requires determining whether a model's answer relies on the input prompt or on its internal parametric knowledge, that is, solving the contributive attribution problem. Method: Proposes a linear classifier (probe) over the model's hidden representations to predict attribution; to train the probe, builds AttriWiki, a self-supervised data generation pipeline that prompts the model either to recall a withheld entity from memory or to read it from context, automatically producing labelled attribution examples. Result: On Llama-3.1-8B, Mistral-7B, and Qwen-7B, the probe reaches up to 0.96 Macro-F1 on the attribution task; it transfers out of domain to SQuAD and WebQuestions with 0.94-0.99 Macro-F1 without retraining; attribution mismatches raise error rates by up to 70%. Conclusion: Contributive attribution is a strong signal that can be modeled reliably, and attribution errors are a key cause of faithfulness hallucinations; however, models can still err even when attribution is correct, so broader hallucination detection frameworks are needed. Abstract: Large language models (LLMs) often generate fluent but unfounded claims, or hallucinations, which fall into two types: (i) faithfulness violations (misusing user context) and (ii) factuality violations (errors from internal knowledge). Proper mitigation depends on knowing whether a model's answer is based on the prompt or its internal weights. This work focuses on the problem of contributive attribution: identifying the dominant knowledge source behind each output. We show that a probe, a simple linear classifier trained on model hidden representations, can reliably predict contributive attribution. For its training, we introduce AttriWiki, a self-supervised data pipeline that prompts models to recall withheld entities from memory or read them from context, generating labelled examples automatically. Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retraining. Attribution mismatches raise error rates by up to 70%, demonstrating a direct link between knowledge source confusion and unfaithful answers. Yet, models may still respond incorrectly even when attribution is correct, highlighting the need for broader detection frameworks.
[23] Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift
Hyunwoo Kim,Hanau Yi,Jaehee Bae,Yumin Kim
Main category: cs.CL
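TL;DR: This paper presents Natural Language Declarative Prompting (NLD-P) as a declarative governance method for evolving large models, emphasizing modular control abstractions encoded in natural language to cope with GPT-scale model drift.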
Details
Motivation: As large language models iterate rapidly, prompt behavior becomes highly sensitive to changes in instruction-following policies, alignment mechanisms, and decoding strategies ("GPT-scale model drift"), making it hard for traditional prompt engineering to keep outputs stable and controllable. Method: Formalizes NLD-P as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded entirely in natural language; defines minimal compliance criteria and analyzes how receptive models are to the schema. Result: Positions NLD-P as a lightweight, interpretable prompt governance framework for non-developer practitioners that requires no external orchestration code; parts of the drafting and editing were done by an LLM assistant configured under NLD-P, but the core concepts, methodological claims, and final draft were directed by the human author under a human-in-the-loop protocol. Conclusion: NLD-P offers a scalable, auditable, and accessible declarative control paradigm for the continually evolving LLM ecosystem; empirical validation is still needed to test its cross-model generalization and governance effectiveness. Abstract: The rapid evolution of large language models (LLMs) has transformed prompt engineering from a localized craft into a systems-level governance challenge. As models scale and update across generations, prompt behavior becomes sensitive to shifts in instruction-following policies, alignment regimes, and decoding strategies, a phenomenon we characterize as GPT-scale model drift. Under such conditions, surface-level formatting conventions and ad hoc refinement are insufficient to ensure stable, interpretable control. This paper reconceptualizes Natural Language Declarative Prompting (NLD-P) as a declarative governance method rather than a rigid field template. NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code. We define minimal compliance criteria, analyze model-dependent schema receptivity, and position NLD-P as an accessible governance framework for non-developer practitioners operating within evolving LLM ecosystems. Portions of drafting and editorial refinement employed a schema-bound LLM assistant configured under NLD-P. All conceptual framing, methodological claims, and final revisions were directed, reviewed, and approved by the human author under a documented human-in-the-loop protocol.
The paper concludes by outlining implications for declarative control under ongoing model evolution and identifying directions for future empirical validation.
[24] TARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models
Reihaneh Iranmanesh,Saeedeh Davoudi,Pasha Abrishamchian,Ophir Frieder,Nazli Goharian
Main category: cs.CL
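TL;DR: This paper proposes a framework for evaluating the cultural competence of large language models in Persian, using Persian-specific short-answer questions and a hybrid syntactic-semantic similarity module that markedly improves scoring consistency, and releases the benchmark publicly.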
Details
Motivation: Existing Persian cultural benchmarks rely mostly on multiple-choice questions and English-centric metrics, which cannot accurately reflect Persian's morphological complexity and semantic nuance. Method: Builds a Persian-specific short-answer evaluation framework that combines rule-based morphological normalization with a hybrid syntactic-semantic similarity module, enabling soft-match scoring beyond exact string matching. Result: A systematic evaluation of 15 leading open- and closed-source models shows that the hybrid evaluation improves scoring consistency by 10% over exact-match baselines and captures semantic information that surface-level methods miss. Conclusion: This work releases the first standardized benchmark for Persian cultural understanding, providing a reproducible foundation for cross-cultural LLM evaluation research. Abstract: This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian. Existing Persian cultural benchmarks rely predominantly on multiple-choice formats and English-centric metrics that fail to capture Persian's morphological complexity and semantic nuance. Our framework introduces a Persian-specific short-answer evaluation that combines rule-based morphological normalization with a hybrid syntactic and semantic similarity module, enabling robust soft-match scoring beyond exact string overlap. Through systematic evaluation of 15 state-of-the-art open- and closed-source models, we demonstrate that our hybrid evaluation improves scoring consistency by +10% compared to exact-match baselines by capturing meaning that surface-level methods cannot detect. We publicly release our evaluation framework, providing the first standardized benchmark for measuring cultural understanding in Persian and establishing a reproducible foundation for cross-cultural LLM evaluation research.
[25] TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought
Jianmin Li,Ying Chang,Su-Kit Tang,Yujia Liu,Yanwen Wang,Shuyuan Lin,Binkai Ou
Main category: cs.CL
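TL;DR: This paper proposes TCM-DiffRAG, a framework combining knowledge graphs with chain-of-thought reasoning that substantially improves large language models on traditional Chinese medicine (TCM) clinical diagnosis tasks, outperforming native models, fine-tuned models, and conventional RAG methods.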
Details
Motivation: Conventional RAG methods perform poorly in TCM clinical diagnosis and treatment, which is complex and highly individualized, so a RAG framework better suited to the characteristics of TCM reasoning is needed. Method: Builds the TCM-DiffRAG framework, which integrates TCM knowledge graphs (KG) with chain-of-thought (CoT) reasoning, and evaluates it on three TCM-specific datasets. Result: TCM-DiffRAG substantially improves models such as qwen-plus on TCM tasks (for example, one score rises from 0.038 to 0.356) and outperforms supervised fine-tuned models and other RAG baselines; the gains are larger for non-Chinese LLMs. Conclusion: Combining structured TCM knowledge graphs with chain-of-thought reasoning effectively strengthens LLMs in individualized TCM diagnosis, supporting the potential of "reasoning-aware" RAG frameworks in TCM. Abstract: Background: Retrieval augmented generation (RAG) technology can empower large language models (LLMs) to generate more accurate, professional, and timely responses without fine-tuning. However, due to the complex reasoning processes and substantial individual differences involved in traditional Chinese medicine (TCM) clinical diagnosis and treatment, traditional RAG methods often exhibit poor performance in this domain. Objective: To address the limitations of conventional RAG approaches in TCM applications, this study aims to develop an improved RAG framework tailored to the characteristics of TCM reasoning. Methods: We developed TCM-DiffRAG, an innovative RAG framework that integrates knowledge graphs (KG) with chains of thought (CoT). TCM-DiffRAG was evaluated on three distinctive TCM test datasets. Results: The experimental results demonstrated that TCM-DiffRAG achieved significant performance improvements over native LLMs. For example, the qwen-plus model achieved scores of 0.927, 0.361, and 0.038, which were significantly enhanced to 0.952, 0.788, and 0.356 with TCM-DiffRAG. The improvements were even more pronounced for non-Chinese LLMs. Additionally, TCM-DiffRAG outperformed directly supervised fine-tuned (SFT) LLMs and other benchmark RAG methods. Conclusions: TCM-DiffRAG shows that integrating structured TCM knowledge graphs with chain-of-thought reasoning substantially improves performance in individualized diagnostic tasks. The joint use of universal and personalized knowledge graphs enables effective alignment between general knowledge and clinical reasoning.
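As a small worked check of the qwen-plus figures quoted in the Results above, the per-dataset absolute gains can be computed directly. This is only arithmetic on the numbers in the abstract; the abstract does not name the three datasets or the metric here, so the generic labels and variable names below are placeholders.

```python
# Scores quoted in the abstract for qwen-plus on the three TCM test datasets.
# The dataset order is taken as given; names and metric are not stated there.
baseline = [0.927, 0.361, 0.038]      # native qwen-plus
with_diffrag = [0.952, 0.788, 0.356]  # qwen-plus with TCM-DiffRAG

# Absolute gain per dataset, rounded to the precision of the reported scores.
gains = [round(d - b, 3) for b, d in zip(baseline, with_diffrag)]

for i, (b, d, g) in enumerate(zip(baseline, with_diffrag, gains), start=1):
    print(f"dataset {i}: {b:.3f} -> {d:.3f} (absolute gain {g:+.3f})")
```

The gains come out to 0.025, 0.427, and 0.318, so the improvement is concentrated in the two datasets where the native model scored lowest.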
These results highlight the potential of reasoning-aware RAG frameworks for advancing LLM applications in traditional Chinese medicine.
[26] Improving Neural Argumentative Stance Classification in Controversial Topics with Emotion-Lexicon Features
Mohammad Yeghaneh Abkenar,Weixing Wang,Manfred Stede,Davide Picca,Mark A. Finlayson,Panagiotis Ioannidis
Main category: cs.CL
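TL;DR: This paper proposes a neural argumentative stance classification method based on an expanded NRC emotion lexicon (eNRC), built with DistilBERT embeddings to strengthen emotion analysis. It significantly improves F1 scores on five cross-domain datasets on controversial topics and releases all resources.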
Details
Motivation: Existing stance classification research mostly uses non-argumentative text, is confined to specific domains, and does not systematically incorporate fine-grained emotion analysis, which limits generalizability. Method: Expands the Bias-Corrected NRC Emotion Lexicon with DistilBERT embeddings (yielding eNRC) and feeds the expanded lexicon into a neural argumentative stance classification model to identify and model contextualized emotional vocabulary. Result: eNRC beats the baseline on all five datasets (up to +6.2 F1), beats the original NRC on four datasets (up to +3.0), and surpasses the LLM-based approach on nearly all corpora; all resources (eNRC, the adapted corpora, and the model architecture) are released. Conclusion: Explicit, fine-grained emotion analysis, in particular an emotion lexicon expanded with contextualized embeddings, effectively improves cross-domain argumentative stance classification and provides reproducible, extensible resources and methods for follow-up research. Abstract: Argumentation mining comprises several subtasks, among which stance classification focuses on identifying the standpoint expressed in an argumentative text toward a specific target topic. While arguments, especially about controversial topics, often appeal to emotions, most prior work has not systematically incorporated explicit, fine-grained emotion analysis to improve performance on this task. In particular, prior research on stance classification has predominantly utilized non-argumentative texts and has been restricted to specific domains or topics, limiting generalizability. We work on five datasets from diverse domains encompassing a range of controversial topics and present an approach for expanding the Bias-Corrected NRC Emotion Lexicon using DistilBERT embeddings, which we feed into a Neural Argumentative Stance Classification model. Our method systematically expands the emotion lexicon through contextualized embeddings to identify emotionally charged terms not previously captured in the lexicon. Our expanded NRC lexicon (eNRC) improves over the baseline across all five datasets (up to +6.2 percentage points in F1 score), outperforms the original NRC on four datasets (up to +3.0), and surpasses the LLM-based approach on nearly all corpora. We provide all resources, including eNRC, the adapted corpora, and model architecture, to enable other researchers to build upon our work.
[27] Effective QA-driven Annotation of Predicate-Argument Relations Across Languages
Jonathan Davidov,Aviv Slobodkin,Shmuel Tomi Klein,Reut Tsarfaty,Ido Dagan,Ayal Klein
Main category: cs.CL
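TL;DR: This paper proposes a cross-lingual projection method that combines an English QA-SRL parser with constrained translation and word alignment to automatically generate high-quality question-answer semantic role annotations for Hebrew, Russian, and French, and trains language-specific parsers that outperform multilingual LLM baselines.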
Details
Motivation: Semantic role labeling (SRL) annotation is costly and has long been confined to English; low-cost, scalable methods are needed to support multilingual semantic analysis. Method: Building on the question-answer driven SRL (QA-SRL) framework, constructs a cross-lingual projection pipeline: English QA-SRL questions are mapped to the target language through constrained translation and word alignment, producing question-answer pairs aligned with target-language predicates, which are used to train language-specific parsers. Result: Produces high-quality QA-SRL training data for Hebrew, Russian, and French; the fine-tuned language-specific parsers outperform strong multilingual LLM baselines such as GPT-4o and LLaMA-Maverick. Conclusion: QA-SRL can serve as a general, natural-language semantic interface that supports efficient, low-barrier predicate-argument parsing across languages. Abstract: Explicit representations of predicate-argument relations form the basis of interpretable semantic analysis, supporting reasoning, generation, and evaluation. However, attaining such semantic structures requires costly annotation efforts and has remained largely confined to English. We leverage the Question-Answer driven Semantic Role Labeling (QA-SRL) framework (a natural-language formulation of predicate-argument relations) as the foundation for extending semantic annotation to new languages. To this end, we introduce a cross-linguistic projection approach that reuses an English QA-SRL parser within a constrained translation and word-alignment pipeline to automatically generate question-answer annotations aligned with target-language predicates. Applied to Hebrew, Russian, and French, which span diverse language families, the method yields high-quality training data and fine-tuned, language-specific parsers that outperform strong multilingual LLM baselines (GPT-4o, LLaMA-Maverick). By leveraging QA-SRL as a transferable natural-language interface for semantics, our approach enables efficient and broadly accessible predicate-argument parsing across languages.
[28] Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference
Yushi Ye,Feng Hong,Huangjie Zheng,Xu Chen,Zhiyong Chen,Yanfeng Wang,Jiangchao Yao
Main category: cs.CL
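TL;DR: This paper proposes ReMix, a framework that introduces a continuous mixing state and a rejection rule to substantially speed up inference in diffusion large language models without sacrificing quality.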
Details
Motivation: Diffusion large language models (DLLMs) face a severe quality-speed trade-off in parallel decoding, rooted in the "combinatorial contradiction" phenomenon, in which tokens generated in parallel are semantically inconsistent. Method: Proposes ReMix (Rejection Mixing), which introduces a continuous mixing state as an intermediate representation between the initial masked state and the final discrete token state, and designs a rejection rule that reverts uncertain representations to the masked state for reprocessing. Result: As a training-free method, ReMix achieves a 2-8× inference speedup in experiments without any quality degradation. Conclusion: By introducing iterative refinement in continuous space into discrete diffusion decoding, ReMix mitigates combinatorial contradictions and offers an efficient, stable path to non-autoregressive inference for DLLMs. Abstract: Diffusion Large Language Models (DLLMs) promise fast non-autoregressive inference but suffer a severe quality-speed trade-off in parallel decoding. This stems from the "combinatorial contradiction" phenomenon, where parallel tokens form semantically inconsistent combinations. We address this by integrating continuous representations into the discrete decoding process, as they preserve rich inter-position dependency. We propose ReMix (Rejection Mixing), a framework that introduces a novel Continuous Mixing State as an intermediate between the initial masked state and the final decoded token state. This intermediate state allows a token's representation to be iteratively refined in a continuous space, resolving mutual conflicts with other tokens before collapsing into a final discrete sample. Furthermore, a rejection rule reverts uncertain representations from the continuous state back to the masked state for reprocessing, ensuring stability and preventing error propagation. ReMix thus mitigates combinatorial contradictions by enabling continuous-space refinement during discrete diffusion decoding. Extensive experiments demonstrate that ReMix, as a training-free method, achieves a 2-8× inference speedup without any quality degradation.
[29] Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models
Jonathan Steinberg,Oren Gal
Main category: cs.CL
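TL;DR: This paper studies how OCR information is routed into the language processing stream of vision-language models (VLMs). Using causal interventions, it identifies the key bottleneck layers for OCR processing in different architectures (Qwen3-VL, Phi-4, InternVL3.5) and finds that the OCR signal is low-dimensional and transfers across datasets; unexpectedly, removing the OCR signal improves counting performance in some modular models (such as Qwen3-VL-4B).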
Details
Motivation: To investigate where and how OCR information is injected into the language processing stream of VLMs, and to understand how the signal is routed and what effect it has in different architectures. Method: Uses causal interventions, comparing activations between original images and text-inpainted images to locate OCR-sensitive bottleneck layers; applies principal component analysis (PCA) to study the dimensionality of the OCR signal and its transferability across datasets. Result: The OCR bottleneck differs by architecture: DeepStack models (Qwen) peak at mid-depth (about 50%), while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%); the OCR signal is highly low-dimensional (PC1 explains 72.9% of variance), and PCA directions transfer across datasets; in modular models such as Qwen3-VL-4B, removing OCR improves counting performance by up to 6.9 percentage points. Conclusion: OCR information follows architecture-dependent routing paths in VLMs, and its signal is highly compressed and shares pathways across tasks; in highly modular models, OCR may interfere with other visual reasoning tasks, which suggests reassessing the role of OCR in joint multimodal modeling and how it is integrated. Abstract: Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream? We investigate the OCR routing mechanism across three architecture families (Qwen3-VL, Phi-4, InternVL3.5) using causal interventions. By computing activation differences between original images and text-inpainted versions, we identify architecture-specific OCR bottlenecks whose dominant location depends on the vision-language integration strategy: DeepStack models (Qwen) show peak sensitivity at mid-depth (about 50%) for scene text, while single-stage projection models (Phi-4, InternVL) peak at early layers (6-25%), though the exact layer of maximum effect varies across datasets. The OCR signal is remarkably low-dimensional: PC1 captures 72.9% of variance. Crucially, principal component analysis (PCA) directions learned on one dataset transfer to others, demonstrating shared text-processing pathways. Surprisingly, in models with modular OCR circuits (notably Qwen3-VL-4B), OCR removal can improve counting performance (up to +6.9 percentage points), suggesting OCR interferes with other visual processing in sufficiently modular architectures.
[30] SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Sungho Park,Jueun Kim,Wook-Shin Han
Main category: cs.CL
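TL;DR: This paper proposes SPARTA, a framework for automatically generating large-scale, high-quality benchmarks for question answering over tables and text. It supports deep multi-hop reasoning and complex operations (such as aggregation and grouping) and exposes clear capability bottlenecks in current cross-modal QA models.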
Details
Motivation: Existing table-text QA benchmarks are small, error-prone because they are built by hand, and shallow (usually no more than two hops, lacking complex operations such as aggregation and grouping), which makes it hard to assess models' real cross-modal reasoning ability. Method: SPARTA is an end-to-end automatic construction framework: 1) it builds "grounding tables" from atomic facts extracted from the accompanying text, enriching the source tables into a reference fact database; 2) it synthesizes nested queries whose nesting depth matches the target hop count; 3) it introduces provenance-based refinement (to ensure non-empty results) and realistic-structure enforcement (restricting generation to post-order traversals) to guarantee executable SQL and natural-sounding questions. Result: Generates thousands of high-fidelity QA pairs covering aggregation, grouping, and deep multi-hop reasoning; SOTA models drop by more than 30 F1 points on SPARTA (for example, from 70+ on HybridQA to below 40), exposing fundamental weaknesses in current cross-modal reasoning. Conclusion: SPARTA offers an efficient, scalable paradigm for benchmark construction, pushes table-text QA toward more complex and realistic multi-hop and analytical tasks, and sets a new standard for evaluating model capability. Abstract: Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated (and therefore error-prone), and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns a non-empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question-answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables.
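The query-synthesis idea described above (nested queries whose number of nested predicates matches the hop count) can be illustrated with a minimal Python sketch using the standard `sqlite3` module. This is not SPARTA's code: the toy schema, the `CHAIN` description, and the helpers `build_nested_query`, `make_toy_db`, and `run` are all hypothetical, and the final non-empty check is only a crude stand-in for the validity techniques the abstract mentions.

```python
import sqlite3

# Hypothetical toy schema. Each step is (table, output column, filter column)
# and describes one hop: filter `table` on `filter column`, return `output column`.
CHAIN = [
    ("player", "name", "team"),
    ("team", "name", "city"),
    ("city", "name", "country"),
]

def build_nested_query(chain, hops, leaf_value):
    """Return (sql, params) for a query with exactly `hops` nested predicates."""
    if not 1 <= hops <= len(chain):
        raise ValueError("hops must be between 1 and len(chain)")
    steps = chain[:hops]
    # Innermost predicate compares against a literal, bound as a parameter.
    # Table and column names come from the trusted CHAIN constant above.
    table, out_col, filt_col = steps[-1]
    sql = f"SELECT {out_col} FROM {table} WHERE {filt_col} = ?"
    # Wrap outward: each outer level filters on the result of the inner query.
    for table, out_col, filt_col in reversed(steps[:-1]):
        sql = f"SELECT {out_col} FROM {table} WHERE {filt_col} IN ({sql})"
    return sql, (leaf_value,)

def make_toy_db():
    """Create an in-memory database with a few made-up facts."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE player(name TEXT, team TEXT);
        CREATE TABLE team(name TEXT, city TEXT);
        CREATE TABLE city(name TEXT, country TEXT);
        INSERT INTO player VALUES ('Ana', 'Lions'), ('Bo', 'Hawks'), ('Cy', 'Lions');
        INSERT INTO team VALUES ('Lions', 'Porto'), ('Hawks', 'Oslo');
        INSERT INTO city VALUES ('Porto', 'Portugal'), ('Oslo', 'Norway');
    """)
    return conn

def run(conn, sql, params):
    """Execute the query and return the first column, sorted for stable output."""
    return sorted(row[0] for row in conn.execute(sql, params))

conn = make_toy_db()
# Three hops: country -> city -> team -> player.
sql3, params3 = build_nested_query(CHAIN, 3, "Portugal")
answers = run(conn, sql3, params3)
# Keep only queries that execute and return at least one row.
is_usable = len(answers) > 0
print(sql3)
print(answers, is_usable)
```

Each additional hop adds one `IN (...)` level, so a three-hop question such as "Which players play for a team based in a city in Portugal?" corresponds to three nested predicates in this toy setting.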
On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. Our benchmark, construction code, and baseline models are available at https://github.com/pshlego/SPARTA/tree/main.
[31] Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
Amita Kamath,Jack Hessel,Khyathi Chandu,Jena D. Hwang,Kai-Wei Chang,Ranjay Krishna
Main category: cs.CL
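TL;DR: This paper argues that the weak reasoning of vision-language models (VLMs) stems from reporting bias in their training data: when people describe images, they usually omit the tacit information that reasoning requires. Analyzing several popular VLM datasets through the lens of pragmatics, the authors find that four reasoning skills (spatial, temporal, negation, and counting) are severely underrepresented, show that scaling data or model size or adding languages does not make these skills emerge on its own, and show that targeted annotations of tacit information improve performance.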
Details
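Motivation: VLMs lack reasoning capabilities, and the authors attribute this to reporting bias in the training data: people describing images habitually omit the tacit information needed to support reasoning (for example, "at the game today!" rather than "a photo of 37 people standing behind a field"), which makes it hard for models to learn key reasoning skills. Method: Drawing on theories from pragmatics, systematically analyzes the training data of popular VLMs such as OpenCLIP, LLaVA-1.5, and Molmo to identify and quantify the underrepresentation of four reasoning skills (spatial, temporal, negation, and counting); builds curated benchmarks to evaluate VLM reasoning under different scales, multilingual training, and data augmentation strategies, and compares against adding annotations that make tacit information explicit. Result: Empirically, (i) VLMs perform markedly worse on the four reasoning types suppressed by reporting bias; (ii) scaling data size, model size, or the number of training languages does not by itself make these skills emerge; (iii) adding annotations collected specifically to capture tacit information effectively improves the corresponding reasoning performance. Conclusion: Improving VLM reasoning cannot rely on simply scaling data or models; more careful, intentional data curation is needed, in particular to fill in the key reasoning cues that people leave out. Abstract: The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., "at the game today!" is a more likely caption than "a photo of 37 people standing behind a field". We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being of web-scale, and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size and model size, and scaling to multiple languages, does not result in emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training data curation methods, rather than counting on scale for emergence of reasoning capabilities.
cs.CV [Back]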
[32] Enabling clinical use of foundation models in histopathology
Audun L. Henriksen,Ole-Johan Skrede,Lisa van der Schee,Enric Domingo,Sepp De Raedt,Ilyá Kostolomov,Jennifer Hay,Karolina Cyll,Wanja Kildal,Joakim Kalsnes,Robert W. Williams,Manohar Pradhan,John Arne Nesheim,Hanne A. Askautrud,Maria X. Isaksen,Karmele Saez de Gordoa,Miriam Cuatrecasas,Joanne Edwards,TransSCOT group,Arild Nesbakken,Neil A. Shepherd,Ian Tomlinson,Daniel-Christoph Wagner,Rachel S. Kerr,Tarjei Sveinsgjerd Hveem,Knut Liestøl,Yoshiaki Nakamura,Marco Novelli,Masaaki Miyo,Sebastian Foersch,David N. Church,Miangela M. Lacle,David J. Kerr,Andreas Kleppe
Main category: cs.CV
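TL;DR: This paper proposes introducing robustness losses when training downstream task models, reducing the sensitivity of pathology foundation models to technical variation (such as scanner differences). This improves generalization and prediction accuracy on real-world clinical data without retraining the foundation model.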
Details
Motivation: Current pathology foundation models capture not only biologically relevant features but also pre-analytic and scanner-specific variation, which biases the predictions of downstream task models and limits clinical usefulness. Method: Introduces novel robustness loss functions during training of downstream task-specific models, and trains and evaluates thousands of models on features extracted from eight popular foundation models in a large-scale experimental setup with 27,042 whole-slide images (from 6,155 patients). Result: Substantially improves robustness to technical variation while also, unexpectedly, improving prediction accuracy, indicating that the models focus more on biologically relevant features. Conclusion: Without modifying or retraining the foundation models, the method mitigates their robustness problems in computational pathology and supports building robust AI models suitable for routine clinical practice. Abstract: Foundation models in histopathology are expected to facilitate the development of high-performing and generalisable deep learning systems. However, current models capture not only biologically relevant features, but also pre-analytic and scanner-specific variation that biases the predictions of task-specific models trained from the foundation model features. Here we show that introducing novel robustness losses during training of downstream task-specific models reduces sensitivity to technical variability. A purpose-designed comprehensive experimentation setup with 27,042 WSIs from 6,155 patients is used to train thousands of models from the features of eight popular foundation models for computational pathology. In addition to a substantial improvement in robustness, we observe that prediction accuracy improves by focusing on biologically relevant features. Our approach successfully mitigates robustness issues of foundation models for computational pathology without retraining the foundation models themselves, enabling development of robust computational pathology models applicable to real-world data in routine clinical practice.
[33] Optimizing Neural Network Architecture for Medical Image Segmentation Using Monte Carlo Tree Search
Liping Meng,Fan Nie,Yunyun Zhang,Chao Han
Main category: cs.CV
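TL;DR: This paper proposes MNAS-Unet, a medical image segmentation framework that combines Monte Carlo Tree Search (MCTS) with Neural Architecture Search (NAS). By dynamically exploring promising network structures, it improves search efficiency and segmentation accuracy while substantially reducing computational cost and parameter count.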
Details
Motivation: Existing NAS methods for medical image segmentation suffer from inefficient search, high computational cost, and redundant models, making it hard to balance accuracy with practical deployment needs. Method: Proposes MNAS-Unet, which brings MCTS into the NAS process to guide architecture exploration efficiently, and optimizes the DownSC and UpSC unit structures for fast, precise model adjustment. Result: On several medical image datasets, including PROMISE12, Ultrasound Nerve, and CHAOS, MNAS-Unet exceeds NAS-Unet and other SOTA methods in segmentation accuracy; the search budget drops by 54% (139 vs. 300 epochs), and the model has only 0.6M parameters and lower GPU memory consumption. Conclusion: MNAS-Unet maintains high segmentation accuracy while substantially improving NAS efficiency and reducing resource consumption, making it suitable for resource-constrained medical image segmentation in practice. Abstract: This paper proposes a novel medical image segmentation framework, MNAS-Unet, which combines Monte Carlo Tree Search (MCTS) and Neural Architecture Search (NAS). MNAS-Unet dynamically explores promising network architectures through MCTS, significantly enhancing the efficiency and accuracy of architecture search. It also optimizes the DownSC and UpSC unit structures, enabling fast and precise model adjustments. Experimental results demonstrate that MNAS-Unet outperforms NAS-Unet and other state-of-the-art models in segmentation accuracy on several medical image datasets, including PROMISE12, Ultrasound Nerve, and CHAOS. Furthermore, compared with NAS-Unet, MNAS-Unet reduces the architecture search budget by 54% (early stopping at 139 epochs versus 300 epochs under the same search setting), while achieving a lightweight model with only 0.6M parameters and lower GPU memory consumption, which further improves its practical applicability. These results suggest that MNAS-Unet can improve search efficiency while maintaining competitive segmentation accuracy under practical resource constraints.
[34] AeroDGS: Physically Consistent Dynamic Gaussian Splatting for Single-Sequence Aerial 4D Reconstruction
Hanyang Liu,Rongjun Qin
Main category: cs.CV
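TL;DR: This paper proposes AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos that uses geometry lifting and physics-constrained optimization to address the depth ambiguity and unstable motion estimation that arise when reconstructing dynamic aerial scenes.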
Details
Motivation: Existing 4D scene reconstruction methods are limited under aerial conditions (monocular capture, wide spatial range, high altitude, and small dynamic objects with large motion disparity), which cause depth ambiguity and unstable motion estimation and make monocular aerial reconstruction inherently ill-posed. Method: Proposes the AeroDGS framework, comprising a Monocular Geometry Lifting module (which reconstructs static and dynamic geometry) and a Physics-Guided Optimization module (which introduces differentiable ground-support, upright-stability, and trajectory-smoothness priors), jointly optimizing the geometry and temporal consistency of the static background and dynamic objects. Result: Experiments on synthetic and real UAV datasets show that AeroDGS outperforms current state-of-the-art methods and achieves higher-fidelity reconstruction in dynamic aerial environments. Conclusion: By combining geometric priors with physical constraints, AeroDGS alleviates the ill-posedness of monocular aerial 4D reconstruction and offers a robust, efficient paradigm for modeling dynamic aerial scenes. Abstract: Recent advances in 4D scene reconstruction have significantly improved dynamic modeling across various domains. However, existing approaches remain limited under aerial conditions with single-view capture, wide spatial range, and dynamic objects of limited spatial footprint and large motion disparity. These challenges cause severe depth ambiguity and unstable motion estimation, making monocular aerial reconstruction inherently ill-posed. To this end, we present AeroDGS, a physics-guided 4D Gaussian splatting framework for monocular UAV videos. AeroDGS introduces a Monocular Geometry Lifting module that reconstructs reliable static and dynamic geometry from a single aerial sequence, providing a robust basis for dynamic estimation. To further resolve monocular ambiguity, we propose a Physics-Guided Optimization module that incorporates differentiable ground-support, upright-stability, and trajectory-smoothness priors, transforming ambiguous image cues into physically consistent motion. The framework jointly refines static backgrounds and dynamic entities with stable geometry and coherent temporal evolution. We additionally build a real-world UAV dataset that spans various altitudes and motion conditions to evaluate dynamic aerial reconstruction. Experiments on synthetic and real UAV scenes demonstrate that AeroDGS outperforms state-of-the-art methods, achieving superior reconstruction fidelity in dynamic aerial environments.
[35] Enhancing Renal Tumor Malignancy Prediction: Deep Learning with Automatic 3D CT Organ Focused Attention
Zhengkang Fan,Chengkun Sun,Russell Terry,Jie Xu,Longin Jan Latecki
Main category: cs.CV
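TL;DR: This paper proposes a deep learning framework that needs no manual segmentation: using an Organ Focused Attention (OFA) loss, it predicts tumor malignancy directly from 3D renal CT images, reaching AUCs of 0.685 to 0.760 and F1-scores of 0.852 to 0.872 on two datasets and outperforming conventional segmentation-dependent methods.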
Details
Motivation: Existing imaging modalities cannot reliably predict renal tumor malignancy; conventional deep learning approaches depend on manual tumor segmentation, which is time-consuming, labor-intensive, and requires expert involvement, limiting clinical practicality. Method: An end-to-end deep learning framework built on an Organ Focused Attention (OFA) loss makes organ patches attend only to other organ patches, so that no manual or automatic segmentation of 3D renal CT images is needed at deployment. Result: AUC = 0.685 and F1 = 0.872 on the private UF IDR dataset; AUC = 0.760 and F1 = 0.852 on the public KiTS21 dataset; performance exceeds conventional models that rely on segmentation-based cropping. Conclusion: The segmentation-free framework maintains strong predictive performance while improving efficiency and deployability, offering a more reliable and practical aid for clinical decisions in renal cancer. Abstract: Accurate prediction of malignancy in renal tumors is crucial for informing clinical decisions and optimizing treatment strategies. However, existing imaging modalities lack the necessary accuracy to reliably predict malignancy before surgical intervention. While deep learning has shown promise in malignancy prediction using 3D CT images, traditional approaches often rely on manual segmentation to isolate the tumor region and reduce noise, which enhances predictive performance. Manual segmentation, however, is labor-intensive, costly, and dependent on expert knowledge. In this study, a deep learning framework was developed utilizing an Organ Focused Attention (OFA) loss function to modify the attention of image patches so that organ patches attend only to other organ patches. Hence, no segmentation of 3D renal CT images is required at deployment time for malignancy prediction. The proposed framework achieved an AUC of 0.685 and an F1-score of 0.872 on a private dataset from the UF Integrated Data Repository (IDR), and an AUC of 0.760 and an F1-score of 0.852 on the publicly available KiTS21 dataset. These results surpass the performance of conventional models that rely on segmentation-based cropping for noise reduction, demonstrating the framework's ability to enhance predictive accuracy without explicit segmentation input. The findings suggest that this approach offers a more efficient and reliable method for malignancy prediction, thereby enhancing clinical decision-making in renal cancer diagnosis.
[36] Vision Transformers Need More Than Registers
Cheng Shi,Yizhou Yu,Sibei Yang
Main category: cs.CV
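TL;DR: Through systematic analysis, this paper finds that Vision Transformers (ViTs) exhibit a "lazy aggregation" behavior caused by global attention and coarse-grained semantic supervision, in which the model uses semantically irrelevant background patches as shortcuts to represent global semantics. It proposes selectively integrating patch features into the CLS token, which weakens background-dominated shortcuts and consistently improves performance on 12 benchmarks across different supervision paradigms.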
Details
Motivation: Artifacts appear in ViTs across supervision paradigms and downstream tasks, but their underlying mechanism has not been sufficiently explained. Method: Systematic analysis identifies the lazy aggregation behavior in ViTs, and a method that selectively integrates patch features into the CLS token is proposed to suppress background-dominated shortcuts. Result: The method yields consistent gains on 12 benchmarks covering label-, text-, and self-supervision. Conclusion: ViT artifacts stem from lazy aggregation induced jointly by global attention and coarse-grained supervision; controlling how the CLS token aggregates features mitigates the problem and offers a new perspective on ViT behavior. Abstract: Vision Transformers (ViTs), when pre-trained on large-scale data, provide general-purpose representations for diverse downstream tasks. However, artifacts in ViTs are widely observed across different supervision paradigms and downstream tasks. Through systematic analysis of artifacts in ViTs, we find that their fundamental mechanisms have yet to be sufficiently elucidated. In this paper, through systematic analysis, we conclude that these artifacts originate from a lazy aggregation behavior: ViT uses semantically irrelevant background patches as shortcuts to represent global semantics, driven by global attention and coarse-grained semantic supervision. Our solution selectively integrates patch features into the CLS token, reducing the influence of background-dominated shortcuts and consistently improving performance across 12 benchmarks under label-, text-, and self-supervision. We hope this work offers a new perspective on ViT behavior.
[37] CLIP Is Shortsighted: Paying Attention Beyond the First Sentence
Marc-Antoine Lavoie,Anas Mahmoud,Aldo Zaimi,Arsene Fansi Tchango,Steven L. Waslander
Main category: cs.CV
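TL;DR: This paper proposes DeBias-CLIP, which removes the summary sentence from long captions and applies sentence sub-sampling and text token padding to mitigate CLIP's bias toward the opening sentence in long-text alignment, improving both long- and short-text retrieval with no additional parameters.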
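To make the caption-side idea concrete, here is a minimal sketch of the first two steps named for DeBias-CLIP: dropping the leading summary sentence and sub-sampling the remaining sentences. The function name, the regex-based sentence splitter, and the `keep` parameter are assumptions for illustration; the text token padding step is not shown, and none of this is taken from the authors' code.

```python
import random
import re


def debias_caption(caption, keep=2, rng=None):
    """Drop the first (summary) sentence, then sub-sample `keep` of the rest.

    Hypothetical helper: sentence splitting on ., ! or ? followed by
    whitespace is a simplification chosen for this sketch.
    """
    rng = rng or random.Random()
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", caption.strip())
                 if s.strip()]
    if len(sentences) <= 1:
        return caption.strip()  # nothing to remove from a one-sentence caption
    body = sentences[1:]  # remove the leading summary sentence
    k = min(keep, len(body))
    chosen = sorted(rng.sample(range(len(body)), k))  # keep original order
    return " ".join(body[i] for i in chosen)


caption = ("A street scene. A red car is parked on the left. "
           "Two people cross the road. The sky is overcast.")
print(debias_caption(caption, keep=2, rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps the sub-sampling reproducible, which is convenient when comparing training runs.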
Details
Motivation: CLIP pretraining relies on images paired with short captions, which leads to coarse alignment on complex scenes and dense descriptions; existing long-caption fine-tuning methods still suffer from an opening-summary bias that makes the model over-attend to the first sentence and weakens alignment over the full text. Method: DeBias-CLIP removes the leading summary sentence of long captions during training and combines sentence sub-sampling with text token padding so that supervision is spread evenly across all token positions. Result: State-of-the-art long-text retrieval; improved short-text retrieval; greater robustness to sentence-order permutations; a drop-in replacement for Long-CLIP with no additional trainable parameters. Conclusion: Removing the summary sentence and balancing text supervision mitigates CLIP's structural bias and improves multi-granularity text-image alignment, offering a more robust training recipe for multimodal models. Abstract: CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP's pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.
[38] SimpleOCR: Rendering Visualized Questions to Teach MLLMs to Read
Yibo Peng,Peng Xia,Ding Zhong,Kaide Zeng,Siwei Han,Yiyang Zhou,Jiaqi Liu,Ruiyi Zhang,Huaxiu Yao
Main category: cs.CV
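TL;DR: This paper introduces the Visualized-Question (VQ) setting to diagnose whether MLLMs truly read the text in images and finds that the models exhibit "modality laziness". It then proposes SimpleOCR, a plug-and-play training strategy requiring no architectural changes, which renders text queries onto images to force the model onto its visual OCR pathway and delivers significant gains on multiple OOD benchmarks with very high data efficiency.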
Details
Motivation: To examine whether multimodal large language models (MLLMs) genuinely "read" text embedded in images or merely rely on parametric shortcuts in the text prompt, exposing flaws in their visual grounding mechanism. Method: The Visualized-Question (VQ) diagnostic setting renders text queries into the image to force visual engagement; the SimpleOCR training strategy uses VQ samples with randomized styles to impose a structural constraint on learning, eliminating text shortcuts and activating the visual OCR pathway. Result: On Qwen2.5-VL, the VQ setting causes a performance drop of up to 12.7%, confirming modality laziness; on four OOD benchmarks SimpleOCR improves on the base model by 5.4% and on GRPO by 2.7%, surpasses recent RL methods with only 8.5K samples (30x fewer), and combines with advanced RL strategies such as NoisyRollout for complementary gains. Conclusion: Current MLLMs suffer from seriously insufficient visual grounding; SimpleOCR, a lightweight, plug-and-play, data-efficient training recipe, effectively activates and optimizes the model's visual text understanding, offering a new route toward multimodal models with genuinely grounded perception. Abstract: Despite the rapid advancements in Multimodal Large Language Models (MLLMs), a critical question regarding their visual grounding mechanism remains unanswered: do these models genuinely "read" text embedded in images, or do they merely rely on parametric shortcuts in the text prompt? In this work, we diagnose this issue by introducing the Visualized-Question (VQ) setting, where text queries are rendered directly onto images to structurally mandate visual engagement. Our diagnostic experiments on Qwen2.5-VL reveal a startling capability-utilization gap: despite possessing strong OCR capabilities, models suffer a performance degradation of up to 12.7% in the VQ setting, exposing a deep-seated "modality laziness." To bridge this gap, we propose SimpleOCR, a plug-and-play training strategy that imposes a structural constraint on the learning process. By transforming training samples into the VQ format with randomized styles, SimpleOCR effectively invalidates text-based shortcuts, compelling the model to activate and optimize its visual text extraction pathways. Empirically, SimpleOCR yields robust gains without architectural modifications. On four representative OOD benchmarks, it surpasses the base model by 5.4% and GRPO based on original images by 2.7%, while exhibiting extreme data efficiency, achieving superior performance with 30x fewer samples (8.5K) than recent RL-based methods. Furthermore, its plug-and-play nature allows seamless integration with advanced RL strategies like NoisyRollout to yield complementary improvements.
Code is available at https://github.com/aiming-lab/SimpleOCR.
[39] Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge
Giuseppe Lando,Rosario Forte,Antonino Furnari
Main category: cs.CV
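TL;DR: This paper explores the feasibility of real-time online episodic memory question answering with multimodal large language models (MLLMs) on edge devices, proposing a two-thread streaming architecture with a descriptor thread and a question-answering thread that achieves near-cloud performance under resource constraints.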
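The two-thread layout described for this entry can be sketched with the standard library alone. In the sketch below, `describe` and `answer` are stand-ins for the MLLM calls, and `TextMemory`, `descriptor_thread`, `qa_thread`, and `run_demo` are hypothetical names chosen for illustration; the paper's actual pipeline is not reproduced here.

```python
import queue
import threading
import time


class TextMemory:
    """Thread-safe, append-only list of short text descriptions."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = []

    def append(self, entry):
        with self._lock:
            self._entries.append(entry)

    def snapshot(self):
        with self._lock:
            return list(self._entries)


def descriptor_thread(frames, memory, describe):
    # Continuously turn incoming frames into lightweight text memory.
    for index, frame in enumerate(frames):
        memory.append(f"t={index}: {describe(frame)}")


def qa_thread(questions, memory, answer, results):
    # Answer queries from whatever memory exists at question time.
    while True:
        question = questions.get()
        if question is None:  # sentinel: stop the thread
            break
        start = time.perf_counter()
        reply = answer(question, memory.snapshot())
        results.append((question, reply, time.perf_counter() - start))


def run_demo():
    memory, questions, results = TextMemory(), queue.Queue(), []

    def describe(frame):  # stand-in for a captioning model
        return f"user sees {frame}"

    def answer(question, entries):  # stand-in for a reasoning model
        return next((e for e in reversed(entries) if question in e), "unknown")

    d = threading.Thread(target=descriptor_thread,
                         args=(["keys", "mug"], memory, describe))
    q = threading.Thread(target=qa_thread,
                         args=(questions, memory, answer, results))
    d.start()
    q.start()
    d.join()  # demo only: wait for descriptions so the answer is deterministic
    questions.put("keys")
    questions.put(None)
    q.join()
    return results


print(run_demo())
```

The timing recorded here is full answer latency, not time-to-first-token; measuring TTFT would require a streaming model interface, which this sketch does not assume.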
Details
Motivation: Cloud offloading is common but raises privacy and latency concerns, motivating an edge-side solution for real-time episodic memory question answering on wearable assistants. Method: A question answering pipeline that satisfies streaming constraints and is organized into two asynchronous threads: the Descriptor Thread continuously converts video into a lightweight textual memory, and the QA Thread reasons over that memory to answer queries; MLLM performance under different hardware configurations is evaluated on the QAEgo4D-Closed benchmark. Result: An end-to-end configuration on a consumer-grade 8GB GPU reaches 51.76% accuracy with a time-to-first-token (TTFT) of 0.41s; a local enterprise-grade server reaches 54.40% accuracy with a TTFT of 0.88s; the cloud-based solution reaches 56.00% accuracy. Conclusion: The edge solution strikes a good balance between privacy protection and real-time operation and shows practical potential for episodic memory retrieval. Abstract: We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. While cloud offloading is common, it raises privacy and latency concerns for wearable assistants, hence we investigate implementation on the edge. We integrated streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of Multimodal Large Language Models (MLLMs) within strict resource boundaries, showing promising results even when compared to cloud-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s. Scaling to a local enterprise-grade server yields 54.40% accuracy with a TTFT of 0.88s. In comparison, a cloud-based solution obtains an accuracy of 56.00%. These competitive results highlight the potential of edge-based solutions for privacy-preserving episodic memory retrieval.
[40] MammoWise: Multi-Model Local RAG Pipeline for Mammography Report Generation
Raiyan Jahangir,Nafiz Imtiaz Khan,Amritanand Sudheerkumar,Vladimir Filkov
Main category: cs.CV
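TL;DR: This paper presents MammoWise, a local, open-source, extensible multi-model pipeline that uses open VLMs (such as MedGemma, LLaVA-Med, and Qwen2.5-VL) for mammography report generation and multi-task classification including BI-RADS and breast density, with support for zero-shot and few-shot prompting, Chain-of-Thought, and RAG. It is validated on the VinDr-Mammo and DMID datasets, where QLoRA fine-tuning markedly improves classification without degrading report quality.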
Details
Motivation: Existing vision language models for mammography screening reports are tied to closed cloud systems or tightly coupled architectures, which makes privacy, reproducibility, and adaptability hard to guarantee; a local, open, flexible solution is needed. Method: Build MammoWise, a local multi-model pipeline supporting any Ollama-hosted VLM and mammography dataset; integrate zero-shot, few-shot, and Chain-of-Thought prompting with multimodal RAG backed by a vector database; apply QLoRA parameter-efficient fine-tuning to models such as MedGemma. Result: Report generation quality is high (strong BERTScore and ROUGE-L) and improves further with few-shot prompting and RAG; after QLoRA fine-tuning of MedGemma, BI-RADS accuracy is 0.7545, breast density accuracy 0.8840, and calcification accuracy 0.9341; classification performance varies considerably with model and dataset. Conclusion: MammoWise provides a practical, extensible, unified, and reproducible framework for local VLM deployment in mammography, preserving privacy and control while supporting both report generation and structured assessment. Abstract: Screening mammography is high volume, time sensitive, and documentation heavy. Radiologists must translate subtle visual findings into consistent BI-RADS assessments, breast density categories, and structured narrative reports. While recent Vision Language Models (VLMs) enable image-to-text reporting, many rely on closed cloud systems or tightly coupled architectures that limit privacy, reproducibility, and adaptability. We present MammoWise, a local multi-model pipeline that transforms open source VLMs into mammogram report generators and multi-task classifiers. MammoWise supports any Ollama-hosted VLM and mammography dataset, and enables zero-shot, few-shot, and Chain-of-Thought prompting, with optional multimodal Retrieval Augmented Generation (RAG) using a vector database for case-specific context. We evaluate MedGemma, LLaVA-Med, and Qwen2.5-VL on VinDr-Mammo and DMID datasets, assessing report quality (BERTScore, ROUGE-L), BI-RADS classification, breast density, and key findings. Report generation is consistently strong and improves with few-shot prompting and RAG. Classification is feasible but sensitive to model and dataset choice. Parameter-efficient fine-tuning (QLoRA) of MedGemma improves reliability, achieving BI-RADS accuracy of 0.7545, density accuracy of 0.8840, and calcification accuracy of 0.9341 while preserving report quality.
MammoWise provides a practical and extensible framework for deploying local VLMs for mammography reporting within a unified and reproducible workflow.
[41] Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models
Niamul Hassan Samin,Md Arifur Rahman,Abdullah Ibne Hanif,Juena Ahmed Noshin,Md Ashikur Rahman
Main category: cs.CV
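TL;DR: This paper proposes Spatial Credit Redistribution (SCR), a training-free inference-time intervention that mitigates object hallucination in vision-language models (VLMs) by addressing "spatial credit collapse". SCR substantially lowers hallucination rates across multiple models and benchmarks while preserving generation quality (CIDEr) at low computational overhead.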
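The notion of credit "collapsing" onto a few patches can be illustrated with Shannon entropy: a distribution concentrated on one patch has low entropy, a spread-out one has high entropy. The sketch below assumes the quantity of interest is a vector of non-negative per-patch weights; which distribution SCR actually measures entropy over is not stated in this entry, so treat this as an illustration of the concept only.

```python
import math


def attention_entropy(weights):
    """Shannon entropy (in nats) of non-negative per-patch weights.

    The weights are normalized to sum to 1 first. Treating per-patch
    attention as the relevant distribution is an assumption of this sketch.
    """
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


uniform = [1.0, 1.0, 1.0, 1.0]        # credit spread evenly over 4 patches
collapsed = [0.97, 0.01, 0.01, 0.01]  # credit concentrated on one patch

print(attention_entropy(uniform))
print(attention_entropy(collapsed))
```

For four patches the entropy is at most `log(4)`, reached by the uniform case; the collapsed case falls well below that.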
Details
Motivation: VLMs often hallucinate objects that are not in the image; the authors attribute this to spatial credit collapse, in which activation credit in early transformer layers concentrates on sparse visual patches, weakening contextual evidence and increasing reliance on language priors. Method: Spatial Credit Redistribution (SCR), a training-free inference-time intervention that, guided by low-entropy inputs, redistributes hidden-state activation from high-attention source patches to their surrounding context. Result: On the POPE and CHAIR benchmarks, SCR lowers the POPE-Adversarial hallucination rate by about 4.7-6.0 percentage points, CHAIR-s by 3.7-5.2 percentage points (42-51% relative), and CHAIR-i by 2.7-4.4 percentage points (44-58% relative), keeping CIDEr within 0.8 percentage points; inference overhead is only 43-56 ms, well below OPERA, VCD, and OVCD, and SCR Pareto-dominates all three on hallucination rate and CIDEr; an ablation confirms that attention-guided source selection is essential. Conclusion: Spatial credit collapse is a key cause of VLM hallucination, and SCR mitigates it through a simple, efficient, training-free inference intervention, offering a practical route to real-time, high-quality multimodal inference. Abstract: Vision-language models (VLMs) frequently hallucinate objects absent from the input image. We trace this failure to spatial credit collapse: activation credit concentrating on sparse visual patches in early transformer layers, which suppresses contextual evidence and increases reliance on language priors. We introduce Spatial Credit Redistribution (SCR), a training-free inference-time intervention that redistributes hidden-state activation from high-attention source patches to their context, guided by low-entropy inputs. We evaluate six model families (Chameleon, LLaVA, and Qwen, including both Qwen-VL and Qwen2-VL) at scales of 7B, 13B, and 30B, on POPE and CHAIR benchmarks. SCR reduces hallucination by ~4.7-6.0 percentage points on POPE-Adversarial, cuts CHAIR-s by 3.7-5.2 percentage points (42-51 percent relative), and CHAIR-i by 2.7-4.4 percentage points (44-58 percent relative), and preserves CIDEr within 0.8 percentage points. Gains are largest for low-entropy inputs, consistent with the theoretical framework. SCR incurs only 43-56 ms overhead (small models: +43-46 ms; large models: +54-56 ms), roughly 3-6 times lower than OPERA and VCD and 1.3-1.7 times lower than OVCD (+72 ms), while Pareto-dominating all three on both hallucination rate and CIDEr, making it practical for real-time settings.
A controlled ablation confirms that attention-guided source selection is essential: replacing it with uniform random selection reduces hallucination rate gains from ~4.7-6.0 percentage points to only ~2.6-3.4 percentage points, pointing to credit-collapse as the key driver.
[42] Pix2Key: Controllable Open-Vocabulary Retrieval with Semantic Decomposition and Self-Supervised Visual Dictionary Learning
Guoyizhe Wei,Yang Jiao,Nan Xi,Zhishen Huang,Jingjing Meng,Rama Chellappa,Yan Gao
Main category: cs.CV
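TL;DR: This paper proposes Pix2Key, which represents queries and candidate images as open-vocabulary visual dictionaries to enable intent-aware constraint matching and diversity-aware reranking in a unified embedding space, and adds the self-supervised pretraining component V-Dict-AE to strengthen fine-grained attribute understanding, markedly improving retrieval on the DFMM-Compose benchmark.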
Details
Motivation: Classic fusion pipelines rely on supervised triplets and tend to lose fine-grained cues, while existing zero-shot methods often caption the image and merge the caption with the edit text, which may miss implicit user intent and return repetitive results. Method: The Pix2Key framework maps queries and candidate images to open-vocabulary visual dictionaries, supporting intent-aware constraint matching and diversity-aware reranking; a self-supervised pretraining module, V-Dict-AE, improves the dictionary representation using images only. Result: On the DFMM-Compose benchmark, Pix2Key improves Recall@10 by up to 3.2 points; adding V-Dict-AE yields a further 2.3 points while improving intent consistency and maintaining high list diversity. Conclusion: With open-vocabulary visual dictionaries and self-supervised pretraining, Pix2Key improves understanding of fine-grained edit intent and retrieval diversity without CIR-specific supervision, offering a new approach to zero-shot composed image retrieval. Abstract: Composed Image Retrieval (CIR) uses a reference image plus a natural-language edit to retrieve images that apply the requested change while preserving other relevant visual content. Classic fusion pipelines typically rely on supervised triplets and can lose fine-grained cues, while recent zero-shot approaches often caption the reference image and merge the caption with the edit, which may miss implicit user intent and return repetitive results. We present Pix2Key, which represents both queries and candidates as open-vocabulary visual dictionaries, enabling intent-aware constraint matching and diversity-aware reranking in a unified embedding space. A self-supervised pretraining component, V-Dict-AE, further improves the dictionary representation using only images, strengthening fine-grained attribute understanding without CIR-specific supervision. On the DFMM-Compose benchmark, Pix2Key improves Recall@10 by up to 3.2 points, and adding V-Dict-AE yields an additional 2.3-point gain while improving intent consistency and maintaining high list diversity.
[43] DisQ-HNet: A Disentangled Quantized Half-UNet for Interpretable Multimodal Image Synthesis Applications to Tau-PET Synthesis from T1 and FLAIR MRI
Agamdeep S. Chopra,Caitlin Neher,Tianyi Ren,Juampablo E. Heras Rivera,Mehmet Kurt
Main category: cs.CV
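TL;DR: This paper proposes the DisQ-HNet framework, which synthesizes tau-PET images from T1 and FLAIR MRI and uses Partial Information Decomposition (PID) to quantify each modality's contribution, maintaining reconstruction quality while improving downstream Alzheimer's disease tasks.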
Details
Motivation: Tau-PET provides a noninvasive, in vivo marker of Alzheimer's disease pathology but is costly and not widely available, motivating MRI-based alternatives. Method: DisQ-HNet (DQH) combines (i) a PID-guided vector-quantized encoder that decomposes latent information into redundant, unique, and complementary components, and (ii) a Half-UNet decoder that replaces conventional encoder feature reuse with pseudo-skip connections driven by structural edge cues to preserve anatomical detail. Result: Compared with VAE, VQ-VAE, and UNet baselines, DisQ-HNet maintains reconstruction fidelity and better preserves disease-relevant signal, improving downstream tasks such as Braak staging, tau localization, and classification; PID-based Shapley analysis provides modality-specific attribution. Conclusion: DisQ-HNet offers a new approach to low-cost, widely available MRI-based surrogate imaging of tau pathology, and its interpretable information decomposition helps explain how multimodal MRI contributes jointly to tau prediction. Abstract: Tau positron emission tomography (tau-PET) provides an in vivo marker of Alzheimer's disease pathology, but cost and limited availability motivate MRI-based alternatives. We introduce DisQ-HNet (DQH), a framework that synthesizes tau-PET from paired T1-weighted and FLAIR MRI while exposing how each modality contributes to the prediction. The method combines (i) a Partial Information Decomposition (PID)-guided, vector-quantized encoder that partitions latent information into redundant, unique, and complementary components, and (ii) a Half-UNet decoder that preserves anatomical detail using pseudo-skip connections conditioned on structural edge cues rather than direct encoder feature reuse. Across multiple baselines (VAE, VQ-VAE, and UNet), DisQ-HNet maintains reconstruction fidelity and better preserves disease-relevant signal for downstream AD tasks, including Braak staging, tau localization, and classification. PID-based Shapley analysis provides modality-specific attribution of synthesized uptake patterns.
[44] DrivePTS: A Progressive Learning Framework with Textual and Structural Enhancement for Driving Scene Generation
Zhechao Wang,Yiming Zeng,Lufan Ma,Zeqing Fu,Chen Bai,Ziyao Lin,Cheng Lu
Main category: cs.CV
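TL;DR: This paper proposes DrivePTS, which uses progressive learning, multi-view hierarchical text descriptions, and a frequency-guided structure loss to address inter-condition dependency, insufficient semantics, and structural distortion in existing driving scene generation, achieving high-fidelity, highly controllable synthesis of diverse scenes.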
Details
Motivation: Existing diffusion-based driving scene synthesis methods suffer from implicit dependencies between geometric conditions that cause control failures, brief captions that lead to weak background modeling, and a uniformly spatially weighted denoising loss that neglects foreground structural detail. Method: The DrivePTS framework: 1) a progressive learning strategy combined with an explicit mutual information constraint to mitigate dependency between geometric conditions; 2) a vision-language model that generates multi-view hierarchical descriptions across six semantic aspects; 3) a frequency-guided structure loss that strengthens modeling of high-frequency structural detail. Result: State-of-the-art performance on multiple metrics; DrivePTS successfully generates rare driving scenes where prior methods fail, showing stronger generalization and structural fidelity. Conclusion: DrivePTS improves the semantic richness, structural clarity, and conditional controllability of driving scene generation, providing high-quality data augmentation for validating the robustness of autonomous driving systems. Abstract: Synthesis of diverse driving scenes serves as a crucial data augmentation technique for validating the robustness and generalizability of autonomous driving systems. Current methods aggregate high-definition (HD) maps and 3D bounding boxes as geometric conditions in diffusion models for conditional scene generation. However, implicit inter-condition dependency causes generation failures when control conditions change independently. Additionally, these methods suffer from insufficient details in both semantic and structural aspects. Specifically, brief and view-invariant captions restrict semantic contexts, resulting in weak background modeling. Meanwhile, the standard denoising loss with uniform spatial weighting neglects foreground structural details, causing visual distortions and blurriness. To address these challenges, we propose DrivePTS, which incorporates three key innovations. Firstly, our framework adopts a progressive learning strategy to mitigate inter-dependency between geometric conditions, reinforced by an explicit mutual information constraint. Secondly, a Vision-Language Model is utilized to generate multi-view hierarchical descriptions across six semantic aspects, providing fine-grained textual guidance. Thirdly, a frequency-guided structure loss is introduced to strengthen the model's sensitivity to high-frequency elements, improving foreground structural fidelity. Extensive experiments demonstrate that our DrivePTS achieves state-of-the-art fidelity and controllability in generating diverse driving scenes.
Notably, DrivePTS successfully generates rare scenes where prior methods fail, highlighting its strong generalization ability.
[45] SwiftNDC: Fast Neural Depth Correction for High-Fidelity 3D Reconstruction
Kang Han,Wei Xiang,Lu Yu,Mathew Wyatt,Gaowen Liu,Ramana Rao Kompella
Main category: cs.CV
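TL;DR: SwiftNDC proposes a fast, general framework built on a Neural Depth Correction field that produces cross-view consistent depth maps; back-projection and reprojection-error filtering then yield a high-quality dense point cloud initialization that substantially accelerates and improves 3D Gaussian Splatting (3DGS) for mesh reconstruction and novel-view synthesis.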
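Back-projection and reprojection-error filtering follow from the standard pinhole camera model, so a small worked sketch is possible without knowing SwiftNDC's internals. The function names, the 1-pixel threshold, and the simplification that candidate points are already expressed in the checking view's camera frame are all assumptions made for illustration; a real pipeline would first apply the relative camera pose.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a depth value to a 3D point in camera coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def project(point, fx, fy, cx, cy):
    """Project a 3D camera-frame point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_error(point, observed_uv, fx, fy, cx, cy):
    """Pixel distance between the projected point and where it was observed."""
    u, v = project(point, fx, fy, cx, cy)
    return ((u - observed_uv[0]) ** 2 + (v - observed_uv[1]) ** 2) ** 0.5


def filter_points(candidates, intrinsics, max_error_px=1.0):
    """Keep points whose reprojection error is within the threshold.

    `candidates` holds (point, observed_uv) pairs, with each point assumed
    to be already transformed into the checking view's camera frame.
    """
    return [p for p, uv in candidates
            if reprojection_error(p, uv, *intrinsics) <= max_error_px]


K = (500.0, 500.0, 320.0, 240.0)  # fx, fy, cx, cy (illustrative values)
point = back_project(420.0, 240.0, 2.0, *K)
candidates = [(point, (420.0, 240.0)),   # consistent observation
              (point, (425.0, 240.0))]   # 5 px off, rejected at 1 px
print(filter_points(candidates, K))
```

Points that survive the check agree across views to within the pixel threshold, which is what makes the resulting cloud a clean initialization.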
Details
Motivation: Existing depth-guided 3D reconstruction methods suffer from scale drift, multi-view inconsistency, and the need for substantial post-processing, so a more robust and efficient geometric initialization is needed. Method: A Neural Depth Correction field produces cross-view consistent depth maps; back-projection and robust reprojection-error filtering build a dense, uniformly distributed point cloud; the point cloud serves as the geometric initialization of 3DGS for mesh reconstruction or novel-view synthesis. Result: Across five datasets (two for mesh reconstruction, three for novel-view synthesis), SwiftNDC significantly reduces mesh reconstruction running time and improves novel-view rendering quality. Conclusion: Combining neural depth refinement with robust geometric initialization achieves both high fidelity and efficiency in 3D reconstruction. Abstract: Depth-guided 3D reconstruction has gained popularity as a fast alternative to optimization-heavy approaches, yet existing methods still suffer from scale drift, multi-view inconsistencies, and the need for substantial refinement to achieve high-fidelity geometry. Here, we propose SwiftNDC, a fast and general framework built around a Neural Depth Correction field that produces cross-view consistent depth maps. From these refined depths, we generate a dense point cloud through back-projection and robust reprojection-error filtering, obtaining a clean and uniformly distributed geometric initialization for downstream reconstruction. This reliable dense geometry substantially accelerates 3D Gaussian Splatting (3DGS) for mesh reconstruction, enabling high-quality surfaces with significantly fewer optimization iterations. For novel-view synthesis, SwiftNDC can also improve 3DGS rendering quality, highlighting the benefits of strong geometric initialization. We conduct a comprehensive study across five datasets, including two for mesh reconstruction, as well as three for novel-view synthesis. SwiftNDC consistently reduces running time for accurate mesh reconstruction and boosts rendering fidelity for view synthesis, demonstrating the effectiveness of combining neural depth refinement with robust geometric initialization for high-fidelity and efficient 3D reconstruction.
[46] Quality-Aware Robust Multi-View Clustering for Heterogeneous Observation Noise
Peihan Wu,Guanjie Cheng,Yufei Tong,Meng Xi,Shuiguang Deng
Main category: cs.CV
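TL;DR: This paper proposes QARMVC, a quality-aware robust multi-view clustering framework that quantifies instance-level noise intensity through an information bottleneck mechanism and improves clustering robustness with quality-weighted contrastive learning and fusion.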
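Quality-weighted fusion can be pictured with a small numeric sketch: each view's reconstruction error is mapped to a quality score, and the views' embeddings are averaged with those scores as weights. The exponential mapping, the `temperature` parameter, and both function names are assumptions chosen for illustration; QARMVC's actual scoring and aggregation are not specified in this entry.

```python
import math


def quality_scores(reconstruction_errors, temperature=1.0):
    """Map per-view reconstruction errors to scores in (0, 1].

    exp(-error / temperature) is an assumed mapping: zero error gives 1,
    and larger errors decay toward 0.
    """
    return [math.exp(-e / temperature) for e in reconstruction_errors]


def quality_weighted_fusion(view_embeddings, scores):
    """Weighted average of per-view embeddings for one instance."""
    total = sum(scores)
    dim = len(view_embeddings[0])
    return [sum(s * emb[d] for s, emb in zip(scores, view_embeddings)) / total
            for d in range(dim)]


views = [[1.0, 0.0], [0.0, 1.0]]  # two views of the same instance

# Equal quality: plain average of the two views.
print(quality_weighted_fusion(views, quality_scores([0.0, 0.0])))
# Second view reconstructs poorly: the consensus leans on the first view.
print(quality_weighted_fusion(views, quality_scores([0.0, 5.0])))
```

Because the weights vary continuously with the error, a partially corrupted view is down-weighted rather than discarded, which matches the move away from a clean-versus-corrupted binary.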
Details
Motivation: Existing denoising methods mostly rest on a simple binary assumption (clean versus completely corrupted) and overlook the heterogeneous, continuous noise common in practice, where contamination intensity varies across samples; finer-grained noise modeling is needed. Method: QARMVC uses an information bottleneck mechanism for view reconstruction and turns the reconstruction discrepancy into instance-level quality scores; it then applies a quality-weighted feature-level contrastive objective to suppress noise propagation and, at the fusion level, builds a high-quality global consensus by quality-weighted aggregation, which is used to align and rectify local views via mutual information maximization. Result: Experiments on five benchmark datasets show that QARMVC consistently outperforms state-of-the-art methods under heterogeneous noise. Conclusion: An instance-level quality-aware mechanism effectively models heterogeneous noise and improves the robustness and performance of multi-view clustering. Abstract: Deep multi-view clustering has achieved remarkable progress but remains vulnerable to complex noise in real-world applications. Existing noise-robust methods predominantly rely on a simplified binary assumption, treating data as either perfectly clean or completely corrupted. This overlooks the prevalent existence of heterogeneous observation noise, where contamination intensity varies continuously across data. To bridge this gap, we propose a novel framework termed Quality-Aware Robust Multi-View Clustering (QARMVC). Specifically, QARMVC employs an information bottleneck mechanism to extract intrinsic semantics for view reconstruction. Leveraging the insight that noise disrupts semantic integrity and impedes reconstruction, we utilize the resulting reconstruction discrepancy to precisely quantify fine-grained contamination intensity and derive instance-level quality scores. These scores are integrated into a hierarchical learning strategy: at the feature level, a quality-weighted contrastive objective is designed to adaptively suppress the propagation of noise; at the fusion level, a high-quality global consensus is constructed via quality-weighted aggregation, which is subsequently utilized to align and rectify local views via mutual information maximization. Extensive experiments on five benchmark datasets demonstrate that QARMVC consistently outperforms state-of-the-art baselines, particularly in scenarios with heterogeneous noise intensities.
[47] Guidance Matters: Rethinking the Evaluation Pitfall for Text-to-Image Generation
Dian Xie,Shitong Shao,Lichen Bai,Zikai Zhou,Bojun Cheng,Shuo Yang,Jun Wu,Zeke Xie
Main category: cs.CV
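TL;DR: This paper exposes a key flaw in how diffusion guidance methods are currently evaluated: human preference models are strongly biased toward large guidance scales, so simply raising the CFG scale improves quantitative scores even when image quality is severely damaged. The authors propose a guidance-aware evaluation framework (GA-Eval) for fair comparison and design TDG, a counterexample method that looks effective but is not; experiments show that most new guidance methods do not surpass standard CFG under the corrected evaluation.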
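For readers unfamiliar with the guidance scale at issue here, classifier-free guidance combines an unconditional and a conditional noise prediction as `eps_uncond + scale * (eps_cond - eps_uncond)`. The sketch below shows that rule on plain lists, plus a vector projection that splits any other guidance update into a part parallel to the CFG direction and a part orthogonal to it. The projection only illustrates what "parallel" and "orthogonal" mean; it is an assumption of this sketch and not GA-Eval's calibration procedure, which this entry does not spell out.

```python
def cfg(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance combination of two predictions."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def split_relative_to_cfg(update, eps_uncond, eps_cond):
    """Decompose `update` into components parallel and orthogonal to the
    CFG direction (eps_cond - eps_uncond). Hypothetical helper."""
    direction = [c - u for u, c in zip(eps_uncond, eps_cond)]
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        # No CFG direction: everything counts as orthogonal.
        return [0.0] * len(update), list(update)
    coeff = sum(x * d for x, d in zip(update, direction)) / norm_sq
    parallel = [coeff * d for d in direction]
    orthogonal = [x - p for x, p in zip(update, parallel)]
    return parallel, orthogonal


eps_u, eps_c = [0.0, 0.0], [1.0, 0.0]
print(cfg(eps_u, eps_c, 7.5))
print(split_relative_to_cfg([2.0, 3.0], eps_u, eps_c))
```

Under this picture, the parallel component of a new method's update behaves like a change of CFG scale, so only the orthogonal component is something that raising the scale could not reproduce.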
Details
Motivation: To question whether recent diffusion guidance methods deliver real progress, noting that mainstream evaluation is distorted by human preference models' bias toward large CFG scales and can mistake degraded image quality for improvement. Method: Propose the guidance-aware evaluation framework (GA-Eval), which uses guidance scale calibration to separate effects orthogonal and parallel to CFG; construct the counterexample method Transcendent Diffusion Guidance (TDG) to expose the evaluation loophole; systematically evaluate eight diffusion guidance methods under both the conventional and GA-Eval frameworks. Result: Simply increasing the CFG scale competes with most new guidance methods; all evaluated methods suffer severe degradation in winning rate over standard CFG under GA-Eval; TDG scores highly in conventional evaluation but does not work in practice. Conclusion: Much of the claimed progress in diffusion guidance stems from evaluation flaws rather than real improvement; the evaluation paradigm should be rebuilt to balance image quality and semantic alignment and avoid being driven by misleading metrics. Abstract: Classifier-free guidance (CFG) has helped diffusion models achieve great conditional generation in various fields. Recently, more diffusion guidance methods have emerged with improved generation quality and human preference. However, can these emerging diffusion guidance methods really achieve solid and significant improvements? In this paper, we rethink recent progress on diffusion guidance. Our work mainly consists of four contributions. First, we reveal a critical evaluation pitfall that common human preference models exhibit a strong bias towards large guidance scales. Simply increasing the CFG scale can easily improve quantitative evaluation scores due to strong semantic alignment, even if image quality is severely damaged (e.g., oversaturation and artifacts). Second, we introduce a novel guidance-aware evaluation (GA-Eval) framework that employs effective guidance scale calibration to enable fair comparison between current guidance methods and CFG by identifying the effects orthogonal and parallel to CFG effects. Third, motivated by the evaluation pitfall, we design the Transcendent Diffusion Guidance (TDG) method, which can significantly improve human preference scores in the conventional evaluation framework but actually does not work in practice. Fourth, in extensive experiments, we empirically evaluate eight recent diffusion guidance methods within the conventional evaluation framework and the proposed GA-Eval framework.
Notably, simply increasing the CFG scales can compete with most studied diffusion guidance methods, while all methods suffer severely from winning rate degradation over standard CFG. Our work would strongly motivate the community to rethink the evaluation paradigm and future directions of this field.
[48] GIFSplat: Generative Prior-Guided Iterative Feed-Forward 3D Gaussian Splatting from Sparse Views
Tianyu Chen,Wei Xiang,Kang Han,Yu Lu,Di Wu,Gaowen Liu,Ramana Rao Kompella
Main category: cs.CV
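TL;DR: GIFSplat proposes a purely feed-forward iterative refinement framework for reconstructing 3D Gaussian splats from sparse unposed images, using residual updates and a distilled diffusion prior for efficient, high-quality reconstruction that balances inference speed and generalization.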
Details
Motivation: Existing feed-forward 3D reconstruction methods are bound by the one-shot prediction paradigm: capacity is limited, there is no inference-time refinement, and generative priors are hard to introduce while keeping efficiency, so performance falls short on out-of-domain data and sparse views. Method: The GIFSplat framework: 1) a small number of feed-forward residual updates iteratively refine the 3D Gaussian scene; 2) a frozen diffusion prior is distilled into Gaussian-level cues from enhanced novel-view renderings, without backpropagation or view-set expansion. Result: On DL3DV, RealEstate10K, and DTU, GIFSplat clearly outperforms state-of-the-art feed-forward methods, improving PSNR by up to +2.1 dB while keeping second-scale inference and requiring no camera poses or test-time gradient optimization. Conclusion: GIFSplat bridges the gap between feed-forward efficiency and generative-prior guidance, showing that iterative feed-forward refinement can combine speed, quality, and generalization for fast 3D reconstruction. Abstract: Feed-forward 3D reconstruction offers substantial runtime advantages over per-scene optimization, which remains slow at inference and often fragile under sparse views. However, existing feed-forward methods still have potential for further performance gains, especially for out-of-domain data, and struggle to retain second-level inference time once a generative prior is introduced. These limitations stem from the one-shot prediction paradigm in existing feed-forward pipelines: models are strictly bounded by capacity, lack inference-time refinement, and are ill-suited for continuously injecting generative priors. We introduce GIFSplat, a purely feed-forward iterative refinement framework for 3D Gaussian Splatting from sparse unposed views. A small number of forward-only residual updates progressively refine the current 3D scene using rendering evidence, achieving a favorable balance between efficiency and quality. Furthermore, we distill a frozen diffusion prior into Gaussian-level cues from enhanced novel renderings without gradient backpropagation or ever-increasing view-set expansion, thereby enabling per-scene adaptation with generative prior while preserving feed-forward efficiency. Across DL3DV, RealEstate10K, and DTU, GIFSplat consistently outperforms state-of-the-art feed-forward baselines, improving PSNR by up to +2.1 dB, and it maintains second-scale inference time without requiring camera poses or any test-time gradient optimization.
[49] Causal Motion Diffusion Models for Autoregressive Motion Generation
Qing Yu,Akihisa Watanabe,Kent Fujiwara
Main category: cs.CV
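TL;DR: This paper proposes the Causal Motion Diffusion Model (CMDM), which performs autoregressive motion generation with a causal diffusion transformer in a semantically aligned latent space, combining a Motion-Language-Aligned Causal VAE (MAC-VAE) with a frame-wise causal sampling strategy to keep text-to-motion quality high while markedly improving inference speed and temporal coherence.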
Details
Motivation: Existing motion diffusion models either use bidirectional generation, which breaks temporal causality and limits real-time use, or are autoregressive models that suffer from instability and error accumulation; a framework that combines semantic fidelity, temporal smoothness, and efficient inference is needed. Method: The CMDM framework: 1) MAC-VAE encodes motion sequences into temporally causal latent representations; 2) an autoregressive diffusion transformer is trained on these latents with causal diffusion forcing; 3) a frame-wise sampling schedule with causal uncertainty enables fast inference. Result: On HumanML3D and SnapMoGen, CMDM outperforms existing diffusion and autoregressive models in semantic fidelity and temporal smoothness and substantially reduces inference latency, supporting streaming and long-horizon motion generation at interactive rates. Conclusion: CMDM unifies the generation quality of diffusion modeling with the temporal causality of autoregressive modeling, offering a new approach to real-time, controllable, high-fidelity human motion synthesis. Abstract: Recent advances in motion diffusion models have substantially improved the realism of human motion synthesis. However, existing approaches either rely on full-sequence diffusion models with bidirectional generation, which limits temporal causality and real-time applicability, or autoregressive models that suffer from instability and cumulative errors. In this work, we present Causal Motion Diffusion Models (CMDM), a unified framework for autoregressive motion generation based on a causal diffusion transformer that operates in a semantically aligned latent space. CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations. On top of this latent representation, an autoregressive diffusion transformer is trained using causal diffusion forcing to perform temporally ordered denoising across motion frames. To achieve fast inference, we introduce a frame-wise sampling schedule with causal uncertainty, where each subsequent frame is predicted from partially denoised previous frames. The resulting framework supports high-quality text-to-motion generation, streaming synthesis, and long-horizon motion generation at interactive rates. Experiments on HumanML3D and SnapMoGen demonstrate that CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness, while substantially reducing inference latency.
[50] Don't let the information slip away
Taozhe Li
Main category: cs.CV
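TL;DR: This paper proposes Association DETR, a model that improves object detection by exploiting background context, and reports state-of-the-art results on the COCO val2017 dataset.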
Details
Motivation: Mainstream object detectors (such as the YOLO and RT-DETR series) focus heavily on foreground object features and neglect the contextual information that the background provides, even though background information (such as scene priors) is valuable for localizing and recognizing objects. Method: Association DETR explicitly models the semantic association between foreground objects and the background and fuses scene context to improve detection. Result: State-of-the-art results on COCO val2017 (above the 55.2 mAP of YOLOv12 and the 53.4 mAP of RT-DETRv2). Conclusion: Modeling background context improves detection accuracy; Association DETR demonstrates the importance of scene association in object detection and suggests directions for follow-up work. Abstract: Real-time object detection has advanced rapidly in recent years. The YOLO series of detectors is among the most well-known CNN-based object detection models and cannot be overlooked. The latest version, YOLOv26, was recently released, while YOLOv12 achieved state-of-the-art (SOTA) performance with 55.2 mAP on the COCO val2017 dataset. Meanwhile, transformer-based object detection models, also known as DEtection TRansformer (DETR), have demonstrated impressive performance. RT-DETR is an outstanding model that outperformed the YOLO series in both speed and accuracy when it was released. Its successor, RT-DETRv2, achieved 53.4 mAP on the COCO val2017 dataset. However, despite their remarkable performance, all these models let information slip away. They primarily focus on the features of foreground objects while neglecting the contextual information provided by the background. We believe that background information can significantly aid object detection tasks. For example, cars are more likely to appear on roads rather than in offices, while wild animals are more likely to be found in forests or remote areas rather than on busy streets. To address this gap, we propose an object detection model called Association DETR, which achieves state-of-the-art results compared to other object detection models on the COCO val2017 dataset.
[51] BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model
Yuci Han,Charles Toth,John E. Anderson,William J. Shuart,Alper Yilmaz
Main category: cs.CV
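TL;DR: BetterScene is a novel view synthesis method for sparse images built on Stable Video Diffusion (SVD). It introduces temporal equivariance regularization and vision foundation model-aligned representations into the VAE module and combines them with features rendered by 3D Gaussian Splatting, markedly improving view consistency and detail quality.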
Details
Motivation: Existing diffusion-based novel view synthesis methods fine-tune only the UNet and freeze the remaining components, which leads to inconsistent details and artifacts that persist even with geometry-aware regularization (e.g., depth or semantics).
Method: Introduces temporal equivariance regularization and vision foundation model-aligned representations into the VAE module of SVD, and feeds features rendered by a feed-forward 3D Gaussian Splatting (3DGS) model to the SVD enhancer to synthesize continuous, artifact-free, view-consistent novel views.
Result: Outperforms state-of-the-art methods on DL3DV-10K, with clear gains in novel view synthesis quality, especially in suppressing artifacts and recovering consistent details in sparse, unconstrained real-world scenes.
Conclusion: BetterScene shows that optimizing the full diffusion stack (the VAE in particular) is more effective than fine-tuning the UNet alone, offering a new paradigm for high-quality novel view synthesis with video diffusion models.
Abstract: We present BetterScene, an approach to enhance novel view synthesis (NVS) quality for diverse real-world scenes using extremely sparse, unconstrained photos. BetterScene leverages the production-ready Stable Video Diffusion (SVD) model pretrained on billions of frames as a strong backbone, aiming to mitigate artifacts and recover view-consistent details at inference time. Conventional methods have developed similar diffusion-based solutions to address these challenges of novel view synthesis. Despite significant improvements, these methods typically rely on off-the-shelf pretrained diffusion priors and fine-tune only the UNet module while keeping other components frozen, which still leads to inconsistent details and artifacts even when incorporating geometry-aware regularizations like depth or semantic conditions. To address this, we investigate the latent space of the diffusion model and introduce two components: (1) temporal equivariance regularization and (2) vision foundation model-aligned representation, both applied to the variational autoencoder (VAE) module within the SVD pipeline. BetterScene integrates a feed-forward 3D Gaussian Splatting (3DGS) model to render features as inputs for the SVD enhancer and generate continuous, artifact-free, consistent novel views. We evaluate on the challenging DL3DV-10K dataset and demonstrate superior performance compared to state-of-the-art methods.
[52] LoR-LUT: Learning Compact 3D Lookup Tables via Low-Rank Residuals
Ziqi Zhao,Abhijit Mishra,Shounak Roychowdhury
Main category: cs.CV
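TL;DR: This paper proposes LoR-LUT, a unified low-rank formulation for generating 3D lookup tables (LUTs). By combining low-rank residual corrections with basis LUTs, it keeps trilinear interpolation complexity unchanged while substantially reducing the parameter count and improving perceptual image quality and interpretability.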
Details
Motivation: Existing 3D-LUT methods rely on fusing dense tensors, which leads to large parameter counts and poor interpretability; high-quality image enhancement needs to be paired with better compactness and user control.
Method: Proposes the unified low-rank framework LoR-LUT, which models residual corrections as low-rank tensors optimized jointly with the basis LUTs, and designs LoR-LUT Viewer, an interactive visualization tool whose sliders adjust parameters for intuitive control.
Result: Trained on MIT-Adobe FiveK, LoR-LUT reproduces expert-level retouching with high perceptual fidelity, a model size under 1 MB, and interpolation complexity on par with conventional methods.
Conclusion: LoR-LUT offers a compact, interpretable, and efficient solution for LUT-based image enhancement and style transfer.
Abstract: We present LoR-LUT, a unified low-rank formulation for compact and interpretable 3D lookup table (LUT) generation. Unlike conventional 3D-LUT-based techniques that rely on fusion of basis LUTs, which are usually dense tensors, our unified approach extends the current framework by jointly using residual corrections, which are in fact low-rank tensors, together with a set of basis LUTs. The approach improves the perceptual quality of an image, primarily owing to its novel use of residual corrections. At the same time, we achieve the same level of trilinear interpolation complexity while using significantly fewer network, residual-correction, and LUT parameters. The experimental results obtained from LoR-LUT, which is trained on the MIT-Adobe FiveK dataset, reproduce expert-level retouching characteristics with high perceptual fidelity and a sub-megabyte model size. Furthermore, we introduce an interactive visualization tool, termed LoR-LUT Viewer, which transforms an input image into the LUT-adjusted output image via a number of sliders that control different parameters. The tool provides an effective way to enhance interpretability and user confidence in the visual results. Overall, our proposed formulation offers a compact, interpretable, and efficient direction for future LUT-based image enhancement and style transfer.
[53] Spectrally Distilled Representations Aligned with Instruction-Augmented LLMs for Satellite Imagery
Minh Kha Do,Wei Xiang,Kang Han,Di Wu,Khoa Phan,Yi-Ping Phoebe Chen,Gaowen Liu,Ramana Rao Kompella
Main category: cs.CV
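TL;DR: SATtxt is a spectrum-aware vision-language foundation model that needs only RGB input. Through spectral representation distillation and spectrally grounded alignment, it markedly improves zero-shot classification, retrieval, and linear probing on Earth observation tasks.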
Details
Motivation: Vision-language foundation models for satellite imagery are held back by multi-spectral inputs that are hard to exploit consistently and by the limited semantic expressiveness of CLIP-style text encoders. Because operational satellite systems often provide only RGB data, a model is needed that uses spectral priors yet requires only RGB at inference.
Method: A two-stage framework: (1) Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student through a lightweight projector; (2) Spectrally Grounded Alignment with instruction-augmented LLMs connects the distilled visual space to an expressive LLM embedding space.
Result: On EuroSAT, BigEarthNet, and ForestNet, SATtxt improves zero-shot classification by 4.2%, retrieval by 5.9%, and linear probing by 2.7% on average over baselines.
Conclusion: SATtxt provides an efficient, scalable path to spectrum-aware vision-language learning for Earth observation, balancing practical RGB-only inference with the use of multi-spectral knowledge.
Abstract: Vision-language foundation models (VLFMs) promise zero-shot and retrieval understanding for Earth observation. While operational satellite systems often lack full multi-spectral coverage, making RGB-only inference highly desirable for scalable deployment, the adoption of VLFMs for satellite imagery remains hindered by two factors: (1) multi-spectral inputs are informative but difficult to exploit consistently due to band redundancy and misalignment; and (2) CLIP-style text encoders limit semantic expressiveness and weaken fine-grained alignment. We present SATtxt, a spectrum-aware VLFM that operates with RGB inputs only at inference while retaining spectral cues learned during training. Our framework comprises two stages. First, Spectral Representation Distillation transfers spectral priors from a frozen multi-spectral teacher to an RGB student via a lightweight projector. Second, Spectrally Grounded Alignment with Instruction-Augmented LLMs bridges the distilled visual space and an expressive LLM embedding space. Across EuroSAT, BigEarthNet, and ForestNet, SATtxt improves zero-shot classification on average by 4.2%, retrieval by 5.9%, and linear probing by 2.7% over baselines, showing an efficient path toward spectrum-aware vision-language learning for Earth observation. Project page: https://ikhado.github.io/sattxt/
[54] Coded-E2LF: Coded Aperture Light Field Imaging from Events
Tomoya Tsuchida,Keita Takahashi,Chihiro Tsutake,Toshiaki Fujii,Hajime Nagahara
Main category: cs.CV
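TL;DR: This paper proposes Coded-E2LF, a purely event-based computational imaging method that reconstructs 4D light fields using a coded aperture and a stationary event-only camera. The authors report it as the first demonstration that a 4D light field can be reconstructed with pixel-level accuracy from event data alone.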
Details
Motivation: Existing methods must capture both events and intensity images, which imposes hardware restrictions; this work aims for a lightweight, easy-to-implement light field imaging scheme based entirely on events.
Method: Proposes Coded-E2LF, which pairs a coded aperture with an event camera, clarifies the key role of the black pattern in aperture coding patterns, and improves event-only light field reconstruction both theoretically and practically.
Result: Reconstructs 4D light fields of real 3D scenes on real hardware with pixel-level accuracy; the authors state this is the first work to do so from event data alone.
Conclusion: Purely event-based 4D light field reconstruction is feasible and effective; Coded-E2LF lowers hardware requirements and broadens the use of event cameras in computational imaging.
Abstract: We propose Coded-E2LF (coded event to light field), a computational imaging method for acquiring a 4-D light field using a coded aperture and a stationary event-only camera. In a previous work, an imaging system similar to ours was adopted, but both events and intensity images were captured and used for light field reconstruction. In contrast, our method is purely event-based, which relaxes restrictions for hardware implementation. We also introduce several advancements from the previous work that enable us to theoretically support and practically improve light field reconstruction from events alone. In particular, we clarify the key role of a black pattern in aperture coding patterns. We finally implemented our method on real imaging hardware to demonstrate its effectiveness in capturing real 3-D scenes. To the best of our knowledge, we are the first to demonstrate that a 4-D light field with pixel-level accuracy can be reconstructed from events alone. Our software and supplementary video are available from our project website.
[55] CGSA: Class-Guided Slot-Aware Adaptation for Source-Free Object Detection
Boyang Dai,Zeng Fan,Zihao Qi,Meng Lou,Yizhou Yu
Main category: cs.CV
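TL;DR: This paper proposes CGSA, the first framework to bring Object-Centric Learning (OCL) into Source-Free Domain Adaptive Object Detection (SF-DAOD). Using Hierarchical Slot Awareness (HSA) and Class-Guided Slot Contrast (CGSC) modules on top of DETR, it models object-level structure and achieves semantically consistent, domain-invariant adaptation with clear performance gains.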
Details
Motivation: Existing SF-DAOD methods overlook object-level structural cues in cross-domain data and rely on tuning pseudo-label thresholds or refining the teacher-student framework, making it hard to combine privacy protection with fine-grained adaptation.
Method: Proposes CGSA: a Hierarchical Slot Awareness (HSA) module integrated into a DETR detector disentangles images into slot representations, and a Class-Guided Slot Contrast (CGSC) module aligns those slots with class semantics, yielding source-free, object-centric domain adaptation.
Result: Outperforms prior SF-DAOD methods on multiple cross-domain datasets; theoretical derivations and ablations support the effectiveness of HSA, CGSC, and the overall framework.
Conclusion: Object-centric learning can improve the performance and robustness of source-free domain adaptive object detection, particularly in privacy-sensitive scenarios.
Abstract: Source-Free Domain Adaptive Object Detection (SF-DAOD) aims to adapt a detector trained on a labeled source domain to an unlabeled target domain without retaining any source data. Despite recent progress, most popular approaches focus on tuning pseudo-label thresholds or refining the teacher-student framework, while overlooking object-level structural cues within cross-domain data. In this work, we present CGSA, the first framework that brings Object-Centric Learning (OCL) into SF-DAOD by integrating slot-aware adaptation into the DETR-based detector. Specifically, our approach integrates a Hierarchical Slot Awareness (HSA) module into the detector to progressively disentangle images into slot representations that act as visual priors. These slots are then guided toward class semantics via a Class-Guided Slot Contrast (CGSC) module, maintaining semantic consistency and prompting domain-invariant adaptation. Extensive experiments on multiple cross-domain datasets demonstrate that our approach outperforms previous SF-DAOD methods, with theoretical derivations and experimental analysis further demonstrating the effectiveness of the proposed components and the framework, thereby indicating the promise of object-centric design in privacy-sensitive adaptation scenarios. Code is released at https://github.com/Michael-McQueen/CGSA.
[56] Instruction-based Image Editing with Planning, Reasoning, and Generation
Liya Ji,Chenyang Qi,Qifeng Chen
Main category: cs.CV
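TL;DR: This paper proposes a new method for instruction-based image editing built on multi-modal chain-of-thought (CoT) prompts. It decomposes the task into three stages (CoT planning, editing region reasoning, and editing) to improve editing quality on complex real-world images.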
Details
Motivation: Existing methods depend on understanding models with a single-modality ability, which restricts editing quality; stronger multi-modal understanding and generation are needed for complex scenes.
Method: A multi-modal chain-of-thought prompting framework: (1) a large language model performs CoT planning to produce sub-prompts suited to the editing network; (2) an instruction-based editing region generation network is trained with a multi-modal large language model; (3) a hint-guided instruction-based editing network, built on a large-scale text-to-image diffusion model, performs the edit.
Result: Shows competitive editing ability on complex real-world images.
Conclusion: The multi-modal framework bridges understanding and generation and improves the robustness and applicability of instruction-based image editing.
Abstract: Editing images via instruction provides a natural way to generate interactive content, but it is a big challenge due to the higher requirement of scene understanding and generation. Prior work utilizes a chain of large language models, object segmentation models, and editing models for this task. However, the understanding models provide only a single-modality ability, restricting the editing quality. We aim to bridge understanding and generation via a new multi-modality model that provides intelligent abilities to instruction-based image editing models for more complex cases. To achieve this goal, we decompose the instruction editing task using multi-modality chain-of-thought prompts, i.e., Chain-of-Thought (CoT) planning, editing region reasoning, and editing. For Chain-of-Thought planning, the large language model reasons about the appropriate sub-prompts, considering the instruction provided and the ability of the editing network. For editing region reasoning, we train an instruction-based editing region generation network with a multi-modal large language model. Finally, we propose a hint-guided instruction-based editing network, built on a large text-to-image diffusion model, that accepts the hints for generation. Extensive experiments demonstrate that our method has competitive editing abilities on complex real-world images.
[57] Scaling Audio-Visual Quality Assessment Dataset via Crowdsourcing
Renyu Yang,Jian Jin,Lili Meng,Meiqin Liu,Yilin Wang,Balu Adsumilli,Weisi Lin
Main category: cs.CV
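TL;DR: This paper proposes a practical approach to building audio-visual quality assessment (AVQA) datasets. Using crowdsourced subjective experiments, a systematic data preparation strategy, and multi-dimensional annotation, it constructs YT-NTU-AVQ, the largest and most diverse AVQA dataset to date.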
Details
Motivation: Existing AVQA datasets are small, lack diversity in content and quality, and provide only overall scores, which gives limited support for model development and multimodal perception research.
Method: (1) Designs a subjective experiment framework suited to crowdsourcing; (2) adopts a systematic data preparation strategy to cover a broad range of quality levels and semantic scenarios; (3) adds extra annotations to support research on multimodal perception mechanisms.
Result: Builds YT-NTU-AVQ, comprising 1,620 user-generated audio-video sequences, the largest and most diverse AVQA dataset to date, and open-sources the dataset and platform code.
Conclusion: The proposed construction approach overcomes the limitations of existing AVQA datasets and provides a solid basis for multimodal quality assessment and perception research.
Abstract: Audio-visual quality assessment (AVQA) research has been stalled by limitations of existing datasets: they are typically small in scale, with insufficient diversity in content and quality, and annotated only with overall scores. These shortcomings provide limited support for model development and multimodal perception research. We propose a practical approach for AVQA dataset construction. First, we design a crowdsourced subjective experiment framework for AVQA that breaks the constraints of in-lab settings and achieves reliable annotation across varied environments. Second, a systematic data preparation strategy is employed to ensure broad coverage of both quality levels and semantic scenarios. Third, we extend the dataset with additional annotations, enabling research on multimodal perception mechanisms and their relation to content. Finally, we validate this approach through YT-NTU-AVQ, the largest and most diverse AVQA dataset to date, consisting of 1,620 user-generated audio and video (A/V) sequences. The dataset and platform code are available at https://github.com/renyu12/YT-NTU-AVQ
[58] DyaDiT: A Multi-Modal Diffusion Transformer for Socially Favorable Dyadic Gesture Generation
Yichen Peng,Jyun-Ting Song,Siyeol Jung,Ruofan Liu,Haiyang Liu,Xuangeng Chu,Ruicong Liu,Erwin Wu,Hideki Koike,Kris Kitani
Main category: cs.CV
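TL;DR: This paper proposes DyaDiT, a multi-modal diffusion transformer that generates socially context-appropriate human motion from dyadic conversational audio, improving the naturalness and social appeal of interactions with digital humans.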
Details
Motivation: Existing methods map a single audio stream to a single speaker's motion, ignoring social context and the mutual dynamics between the two people in conversation, so they struggle to produce natural, socially engaging gestures.
Method: DyaDiT is a multi-modal diffusion transformer that takes dyadic audio (with optional social-context tokens), fuses information from both speakers, encodes motion priors with a motion dictionary, and can optionally use the conversational partner's gestures to generate responsive motion. It is trained on the Seamless Interaction Dataset.
Result: Outperforms existing methods on standard motion generation metrics and in user studies, where it is strongly preferred, supporting its robustness and social appropriateness.
Conclusion: DyaDiT models the two-way dynamics and social context of conversation, offering a new paradigm for generating more natural and socially intelligent body motion for digital humans.
Abstract: Generating realistic conversational gestures is essential for achieving natural, socially engaging interactions with digital humans. However, existing methods typically map a single audio stream to a single speaker's motion, without considering social context or modeling the mutual dynamics between two people engaging in conversation. We present DyaDiT, a multi-modal diffusion transformer that generates contextually appropriate human motion from dyadic audio signals. Trained on the Seamless Interaction Dataset, DyaDiT takes dyadic audio with optional social-context tokens to produce context-appropriate motion. It fuses information from both speakers to capture interaction dynamics, uses a motion dictionary to encode motion priors, and can optionally utilize the conversational partner's gestures to produce more responsive motion. We evaluate DyaDiT on standard motion generation metrics and conduct quantitative user studies, demonstrating that it not only surpasses existing methods on objective metrics but is also strongly preferred by users, highlighting its robustness and socially favorable motion generation. Code and models will be released upon acceptance.
[59] EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
Wenjia Wang,Liang Pan,Huaijin Pi,Yuke Lou,Xuqian Ren,Yifan Wu,Zhouyingcheng Liao,Lei Yang,Rishabh Dabral,Christian Theobalt,Taku Komura
Main category: cs.CV
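TL;DR: This paper proposes EmbodMocap, a portable, low-cost pipeline that uses two iPhones to jointly capture human motion and scenes. It enables marker-free, metric-scale-consistent 3D human-scene reconstruction in the wild without static cameras, and supports embodied AI tasks including monocular human-scene reconstruction, physics-based animation, and robot motion control.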
Details
Motivation: Existing motion capture systems depend on costly studios or wearable devices, making it hard to collect scene-conditioned human motion data in the real world at scale and limiting progress in embodied intelligence.
Method: Proposes the EmbodMocap pipeline: two moving iPhones synchronously capture RGB-D video, and joint calibration places the dual-view sequences in a single metric world coordinate frame, allowing humans and scenes to be reconstructed together.
Result: Against optical motion capture ground truth, the dual-view setup markedly reduces depth ambiguity and achieves better alignment and reconstruction accuracy than single-iPhone or monocular methods. The collected data supports three embodied AI tasks: monocular human-scene reconstruction (outputting metric-scale, world-aligned models), physics-based character animation (scaling human-object interaction and scene-aware motion tracking), and robot motion replication via sim-to-real reinforcement learning.
Conclusion: EmbodMocap provides embodied AI with a high-quality, easily scalable, real-scene source of joint motion-scene data, advancing closed-loop research from perception and understanding to action.
Abstract: Human behaviors in the real world naturally encode rich, long-term contextual information that can be leveraged to train embodied agents for perception, understanding, and acting. However, existing capture systems typically rely on costly studio setups and wearable devices, limiting the large-scale collection of scene-conditioned human motion data in the wild. To address this, we propose EmbodMocap, a portable and affordable data collection pipeline using two moving iPhones. Our key idea is to jointly calibrate dual RGB-D sequences to reconstruct both humans and scenes within a unified metric world coordinate frame. The proposed method allows metric-scale and scene-consistent capture in everyday environments without static cameras or markers, bridging human motion and scene geometry seamlessly. Compared with optical capture ground truth, we demonstrate that the dual-view setting exhibits a remarkable ability to mitigate depth ambiguity, achieving superior alignment and reconstruction performance over single-iPhone or monocular models. Based on the collected data, we empower three embodied AI tasks: monocular human-scene reconstruction, where we fine-tune feedforward models that output metric-scale, world-space aligned humans and scenes; physics-based character animation, where we show that our data can be used to scale human-object interaction skills and scene-aware motion tracking; and robot motion control, where we train a humanoid robot via sim-to-real RL to replicate human motions depicted in videos.
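As a rough illustration of what placing two RGB-D views in one metric world frame involves, the sketch below back-projects a depth pixel with a pinhole model and maps it into world coordinates with a camera-to-world pose. The function names, intrinsics, and poses are hypothetical values chosen for illustration; this is not the EmbodMocap calibration procedure, which has to estimate such poses jointly from the two sequences.

```python
# Illustrative sketch only: pinhole back-projection plus a rigid
# camera-to-world transform. All names and numbers here are hypothetical.

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth into 3D camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(point_cam, rotation, translation):
    """Apply p_world = R @ p_cam + t, with R given as a 3x3 nested list."""
    return tuple(
        sum(rotation[i][j] * point_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
INTRINSICS = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)  # assumed, shared by both phones

# Phone A sits at the world origin; phone B is shifted 1 m along the world x-axis.
pose_a = (IDENTITY, (0.0, 0.0, 0.0))
pose_b = (IDENTITY, (1.0, 0.0, 0.0))

# The same scene point, seen by both phones at 2 m depth.
point_from_a = camera_to_world(backproject(445.0, 240.0, 2.0, **INTRINSICS), *pose_a)
point_from_b = camera_to_world(backproject(195.0, 240.0, 2.0, **INTRINSICS), *pose_b)
```

With consistent poses, both observations map to the same world point. Joint calibration aims to recover exactly this agreement when the poses are not known in advance.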
Experimental results validate the effectiveness of our pipeline and its contributions towards advancing embodied AI research.
[60] Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction
Chenhe Du,Xuanyu Tian,Qing Wu,Muyu Liu,Jingyi Yu,Hongjiang Wei,Yuyao Zhang
Main category: cs.CV
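TL;DR: This paper proposes the Dual-Coupled PnP Diffusion framework, which introduces a dual variable to achieve asymptotic convergence to the exact data manifold, and designs a Spectral Homogenization mechanism that turns structured residuals into pseudo-white noise consistent with the diffusion prior's assumptions, resolving the bias-hallucination trade-off in imaging inverse problems.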
Details
Motivation: Prevailing plug-and-play (PnP) solvers (e.g., those based on HQS or proximal gradient) are memoryless operators that update using only instantaneous gradients, so under heavy corruption the reconstruction carries a steady-state bias and cannot strictly satisfy the physical measurement constraints.
Method: Proposes Dual-Coupled PnP Diffusion, which restores the classical dual variable to provide integral feedback, and introduces Spectral Homogenization (SH), which modulates the accumulated dual residuals in the frequency domain so that their statistics approach additive white Gaussian noise (AWGN), matching the statistical assumption of the diffusion prior.
Result: Markedly improves reconstruction fidelity on CT and MRI tasks while accelerating convergence, resolving the bias-hallucination trade-off and reaching state-of-the-art performance.
Conclusion: By combining geometric coupling with frequency-domain statistical calibration, the method guarantees convergence in theory while keeping the diffusion prior valid, offering a new paradigm for solving imaging inverse problems with pretrained generative models.
Abstract: Plug-and-Play diffusion prior (PnPDP) frameworks have emerged as a powerful paradigm for solving imaging inverse problems by treating pretrained generative models as modular priors. However, we identify a critical flaw in prevailing PnP solvers (e.g., based on HQS or Proximal Gradient): they function as memoryless operators, updating estimates solely based on instantaneous gradients. This lack of historical tracking inevitably leads to non-vanishing steady-state bias, where the reconstruction fails to strictly satisfy physical measurements under heavy corruption. To resolve this, we propose Dual-Coupled PnP Diffusion, which restores the classical dual variable to provide integral feedback, theoretically guaranteeing asymptotic convergence to the exact data manifold. However, this rigorous geometric coupling introduces a secondary challenge: the accumulated dual residuals exhibit spectrally colored, structured artifacts that violate the Additive White Gaussian Noise (AWGN) assumption of diffusion priors, causing severe hallucinations. To bridge this gap, we introduce Spectral Homogenization (SH), a frequency-domain adaptation mechanism that modulates these structured residuals into statistically compliant pseudo-AWGN inputs. This effectively aligns the solver's rigorous optimization trajectory with the denoiser's valid statistical manifold.
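The contrast the abstract draws between memoryless updates and a dual variable with integral feedback can be seen on a toy problem. The sketch below is only an analogy, not the paper's solver: it uses a fixed quadratic prior instead of a diffusion denoiser and a single linear measurement, and it runs the method of multipliers (an augmented-Lagrangian iteration whose scaled dual update has the same form as ADMM's). The names `solve`, `p`, `rho`, and `u` are hypothetical.

```python
# Illustrative sketch only: minimize 0.5 * ||x - p||^2 subject to a . x = y,
# where p is a hypothetical stand-in for what a prior would prefer.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def solve(a, y, p, rho=1.0, iters=50, use_dual=True):
    """Penalized updates toward the constraint a . x = y.

    With use_dual=False every step sees only the current penalty (memoryless).
    With use_dual=True the scaled dual variable u accumulates past residuals.
    """
    u = 0.0
    x = list(p)
    for _ in range(iters):
        # Closed-form minimizer of 0.5*||x - p||^2 + (rho/2)*(a.x - y + u)^2,
        # using the Sherman-Morrison formula for (I + rho * a a^T)^-1.
        b = [pi + rho * ai * (y - u) for pi, ai in zip(p, a)]
        scale = rho * dot(a, b) / (1.0 + rho * dot(a, a))
        x = [bi - scale * ai for bi, ai in zip(b, a)]
        residual = dot(a, x) - y
        if use_dual:
            u += residual  # integral feedback on the constraint violation
    return x, dot(a, x) - y


a, y, p = [1.0, 2.0], 3.0, [0.0, 0.0]
x_plain, r_plain = solve(a, y, p, use_dual=False)
x_dual, r_dual = solve(a, y, p, use_dual=True)
```

In this toy instance the memoryless run settles at `x = [0.5, 1.0]` with a measurement residual of -0.5 that never shrinks, while the dual-variable run drives the residual toward zero and approaches `x = [0.6, 1.2]`.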
Extensive experiments on CT and MRI reconstruction demonstrate that our approach resolves the bias-hallucination trade-off, achieving state-of-the-art fidelity with significantly accelerated convergence.
[61] Multidimensional Task Learning: A Unified Tensor Framework for Computer Vision Tasks
Alaa El Ichi,Khalide Jbilou
Main category: cs.CV
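TL;DR: This paper proposes the Multidimensional Task Learning (MTL) framework, built on Generalized Einstein MLPs (GE-MLPs) that operate directly on tensors, removing the dimensional limits that matrix-based modeling places on vision tasks. It proves that classification, segmentation, and detection are special cases of MTL under different dimensional configurations and that its task space is strictly larger than what matrix-based methods can express.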
Details
Motivation: Current computer vision task formulations are constrained by matrix-based thinking (matrix weights, vector biases) and require structural flattening, which loses information and restricts which tasks can be expressed.
Method: Builds tensor neural networks (GE-MLPs) on the generalized Einstein product, replaces conventional matrix and vector parameters with tensor parameters, defines a unified task space through rigorous mathematical derivation, and formally characterizes the dimensional configuration of each vision task.
Result: Proves that classification, segmentation, and detection are special cases of MTL under specific dimensional configurations; proves that the MTL task space is strictly larger than what matrix-based formulations can natively express; supports new task configurations such as spatiotemporal or cross-modal prediction without destructive flattening.
Conclusion: MTL gives a unified mathematical foundation, grounded in tensor algebra, for understanding, comparing, and designing computer vision tasks, moving from a matrix-centric to a tensor-native paradigm.
Abstract: This paper introduces Multidimensional Task Learning (MTL), a unified mathematical framework based on Generalized Einstein MLPs (GE-MLPs) that operate directly on tensors via the Einstein product. We argue that current computer vision task formulations are inherently constrained by matrix-based thinking: standard architectures rely on matrix-valued weights and vector-valued biases, requiring structural flattening that restricts the space of naturally expressible tasks. GE-MLPs lift this constraint by operating with tensor-valued parameters, enabling explicit control over which dimensions are preserved or contracted without information loss. Through rigorous mathematical derivations, we demonstrate that classification, segmentation, and detection are special cases of MTL, differing only in their dimensional configuration within a formally defined task space. We further prove that this task space is strictly larger than what matrix-based formulations can natively express, enabling principled task configurations such as spatiotemporal or cross-modal predictions that require destructive flattening under conventional approaches. This work provides a mathematical foundation for understanding, comparing, and designing computer vision tasks through the lens of tensor algebra.
[62] Risk-Aware World Model Predictive Control for Generalizable End-to-End Autonomous Driving
Jiangxin Sun,Feng Xue,Teng Long,Chang Liu,Jian-Fang Hu,Wei-Shi Zheng,Nicu Sebe
Main category: cs.CV
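TL;DR: This paper proposes RaWMPC, an end-to-end autonomous driving framework that needs no expert action supervision. It combines a risk-aware world model with predictive control to improve generalization and safety in both in-distribution and out-of-distribution scenarios.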
Details
Motivation: Imitation-learning-based end-to-end driving methods depend on expert demonstrations and generalize poorly in rare or long-tail scenarios, where they make unsafe decisions; this work explores reliable decision-making without expert action supervision.
Method: Proposes the Risk-aware World Model Predictive Control (RaWMPC) framework: (1) a risk-aware world model learns to predict high-risk consequences by being actively exposed to hazardous behaviors; (2) a self-evaluation distillation method distills the world model's risk-avoidance ability into an action proposal network, producing low-risk actions without expert supervision.
Result: Outperforms state-of-the-art methods in both in-distribution and out-of-distribution scenarios while improving decision interpretability.
Conclusion: RaWMPC shows that more robust and safer end-to-end driving is achievable without expert action supervision, relying only on risk modeling and predictive control, and offers a new paradigm for handling long-tail scenarios.
Abstract: With advances in imitation learning (IL) and large-scale driving datasets, end-to-end autonomous driving (E2E-AD) has made great progress recently. Currently, IL-based methods have become a mainstream paradigm: models rely on standard driving behaviors given by experts, and learn to minimize the discrepancy between their actions and expert actions. However, this objective of "only driving like the expert" suffers from limited generalization: when encountering rare or unseen long-tail scenarios outside the distribution of expert demonstrations, models tend to produce unsafe decisions in the absence of prior experience. This raises a fundamental question: Can an E2E-AD system make reliable decisions without any expert action supervision? Motivated by this, we propose a unified framework named Risk-aware World Model Predictive Control (RaWMPC) to address this generalization dilemma through robust control, without reliance on expert demonstrations. Practically, RaWMPC leverages a world model to predict the consequences of multiple candidate actions and selects low-risk actions through explicit risk evaluation. To endow the world model with the ability to predict the outcomes of risky driving behaviors, we design a risk-aware interaction strategy that systematically exposes the world model to hazardous behaviors, making catastrophic outcomes predictable and thus avoidable.
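The selection step described in the abstract (predict the consequences of several candidate actions, score their risk, keep a low-risk one) can be sketched as follows. This is a minimal toy under stated assumptions, not RaWMPC itself: the "world model" here is a hand-written car-following simulator rather than a learned model, and `rollout`, `risk`, `select_action`, and every constant are hypothetical.

```python
# Illustrative sketch only: pick the candidate action whose predicted rollout
# has the lowest risk. The "world model" is a hand-written car-following
# simulator, not a learned model; all names and constants are hypothetical.

def rollout(gap, speed, accel, lead_speed, horizon=6, dt=0.5):
    """Predict the smallest gap to the lead car and the final ego speed."""
    min_gap = gap
    for _ in range(horizon):
        speed = max(0.0, speed + accel * dt)
        gap += (lead_speed - speed) * dt
        min_gap = min(min_gap, gap)
    return min_gap, speed


def risk(min_gap, safe_gap=5.0):
    """Map the predicted minimum gap to a risk score in [0, 1]."""
    if min_gap <= 0.0:
        return 1.0  # predicted collision
    return max(0.0, 1.0 - min_gap / safe_gap)


def select_action(gap, speed, lead_speed, candidates, progress_weight=0.01):
    """Return the candidate acceleration with the lowest risk-dominated cost."""
    def cost(accel):
        min_gap, final_speed = rollout(gap, speed, accel, lead_speed)
        # Risk dominates; a small progress term breaks ties among safe actions.
        return risk(min_gap) - progress_weight * final_speed
    return min(candidates, key=cost)


candidates = [-3.0, 0.0, 2.0]  # brake, hold, accelerate (m/s^2)
close_choice = select_action(gap=8.0, speed=15.0, lead_speed=10.0, candidates=candidates)
open_choice = select_action(gap=100.0, speed=10.0, lead_speed=10.0, candidates=candidates)
```

When closing fast on a nearby lead car, only braking avoids a predicted collision, so it is selected; with a wide gap every candidate is risk-free and the progress term favors accelerating. No expert action appears anywhere in the selection, which is the property the abstract emphasizes.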
Furthermore, to generate low-risk candidate actions at test time, we introduce a self-evaluation distillation method to distill risk-avoidance capabilities from the well-trained world model into a generative action proposal network without any expert demonstration. Extensive experiments show that RaWMPC outperforms state-of-the-art methods in both in-distribution and out-of-distribution scenarios, while providing superior decision interpretability.
[63] Decomposing Private Image Generation via Coarse-to-Fine Wavelet Modeling
Jasmine Bayrooti,Weiwei Kong,Natalia Ponomareva,Carlos Esteves,Ameesh Makadia,Amanda Prorok
Main category: cs.CV
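TL;DR: This paper proposes a differentially private (DP) image generation framework based on the wavelet spectrum: DP protection is applied to the low-frequency components, and high-frequency details are restored by a public super-resolution model, giving a better balance between privacy and image quality.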
Details
Motivation: Standard DP fine-tuning (e.g., DP-SGD) adds noise indiscriminately, severely degrading high-frequency texture. The authors hypothesize that the truly sensitive content is the low-frequency structure in the wavelet domain (e.g., facial contours), while high-frequency components are more generic and carry lower privacy risk.
Method: A two-stage spectral DP framework: (1) DP fine-tune an autoregressive spectral image tokenizer on the low-frequency wavelet coefficients of the sensitive images; (2) upsample the coarse intermediate representation to high resolution with a publicly pretrained super-resolution model. The post-processing invariance of DP preserves the overall privacy guarantee.
Result: On MS-COCO and MM-CelebA-HQ, the method produces higher image quality and better style capture than other leading DP image generation methods.
Conclusion: Focusing the DP constraint on low-frequency image structure and decoupling privacy protection from detail generation is an effective way to improve the utility of DP image generation.
Abstract: Generative models trained on sensitive image datasets risk memorizing and reproducing individual training examples, making strong privacy guarantees essential. While differential privacy (DP) provides a principled framework for such guarantees, standard DP finetuning (e.g., with DP-SGD) often results in severe degradation of image quality, particularly in high-frequency textures, due to the indiscriminate addition of noise across all model parameters. In this work, we propose a spectral DP framework based on the hypothesis that the most privacy-sensitive portions of an image are often low-frequency components in the wavelet space (e.g., facial features and object shapes) while high-frequency components are largely generic and public. Based on this hypothesis, we propose the following two-stage framework for DP image generation with coarse image intermediaries: (1) DP finetune an autoregressive spectral image tokenizer model on the low-resolution wavelet coefficients of the sensitive images, and (2) perform high-resolution upsampling using a publicly pretrained super-resolution model. By restricting the privacy budget to the global structures of the image in the first stage, and leveraging the post-processing property of DP for detail refinement, we achieve promising trade-offs between privacy and utility. Experiments on the MS-COCO and MM-CelebA-HQ datasets show that our method generates images with improved quality and style capture relative to other leading DP image frameworks.
[64] Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
Tilemachos Aravanis,Vladan Stojnić,Bill Psomas,Nikos Komodakis,Giorgos Tolias
Main category: cs.CV
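TL;DR: This paper proposes a retrieval-augmented test-time adapter that improves open-vocabulary segmentation (OVS) by combining text prompts with a small support set of pixel-annotated images, addressing two challenges: the coarse image-level supervision of VLMs and the semantic ambiguity of natural language.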
Details
Motivation: OVS lags behind fully supervised methods because vision-language models (VLMs) are trained with image-level supervision only and because natural language is inherently ambiguous.
Method: Introduces a few-shot setting that pairs text prompts with pixel-annotated support images, and designs a retrieval-augmented test-time adapter that learns a lightweight classifier for each image at test time, performing learned, per-query fusion of textual and visual support features rather than hand-crafted late fusion.
Result: Experiments show the method significantly narrows the gap between zero-shot and fully supervised segmentation while preserving open-vocabulary ability, and it supports continually expanding support sets and fine-grained tasks such as personalized segmentation.
Conclusion: Through dynamic, modality-adaptive feature fusion at test time, the method mitigates the insufficient supervision and language ambiguity that limit VLMs in pixel-level understanding, moving OVS toward practical use.
Abstract: Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by text prompts. Despite recent progress, OVS lags behind fully supervised approaches due to two challenges: the coarse image-level supervision used to train VLMs and the semantic ambiguity of natural language. We address these limitations by introducing a few-shot setting that augments textual prompts with a support set of pixel-annotated images. Building on this, we propose a retrieval-augmented test-time adapter that learns a lightweight, per-image classifier by fusing textual and visual support features. Unlike prior methods relying on late, hand-crafted fusion, our approach performs learned, per-query fusion, achieving stronger synergy between modalities. The method supports continually expanding support sets, and applies to fine-grained tasks such as personalized segmentation. Experiments show that we significantly narrow the gap between zero-shot and supervised segmentation while preserving open-vocabulary ability.
[65] VGG-T$^3$: Offline Feed-Forward 3D Reconstruction at Scale
Sven Elflein,Ruilong Li,Sérgio Agostinho,Zan Gojcic,Laura Leal-Taixé,Qunjie Zhou,Aljosa Osep
Main category: cs.CV
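TL;DR: This paper proposes VGG-T$^3$, which uses test-time training to distill the variable-length KV space into a fixed-size MLP, achieving 3D reconstruction with linear computational complexity, a substantial speed-up, and high accuracy.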
Details
Motivation: Addresses a key bottleneck in offline feed-forward 3D reconstruction: compute and memory costs grow quadratically with the number of input images.
Method: Uses test-time training to distill the variable-length Key-Value (KV) representation of scene geometry into a fixed-size MLP, yielding the scalable 3D reconstruction model VGG-T$^3$.
Result: Reconstructs a 1k-image collection in just 54 seconds, 11.6× faster than baselines based on softmax attention; point map reconstruction error is substantially lower than that of other linear-time methods, and the model supports visual localization of unseen images.
Conclusion: VGG-T$^3$ combines computational efficiency (linear complexity) with reconstruction quality (global aggregation), offering a new paradigm for large-scale scene reconstruction.
Abstract: We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T$^3$ (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a $1k$ image collection in just $54$ seconds, achieving an $11.6\times$ speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, it outperforms other linear-time methods in point map reconstruction error by large margins. Finally, we demonstrate visual localization capabilities of our model by querying the scene representation with unseen images.
[66] MediX-R1: Open Ended Medical Reinforcement Learning
Sahal Shaji Mullappilly,Mohammed Irfan Kurpath,Omair Mohamed,Mohamed Zidan,Fahad Khan,Salman Khan,Rao Anwer,Hisham Cholakkal
Main category: cs.CV
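TL;DR: MediX-R1 is an open-ended reinforcement learning framework for medical multimodal large language models. With a multi-signal reward design and a reference-based LLM-as-judge evaluation, it markedly improves reasoning on open-ended clinical tasks.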