Table of Contents
cs.CL
[1] EduResearchBench: A Hierarchical Atomic Task Decomposition Benchmark for Full-Lifecycle Educational Research
Houping Yue,Zixiang Di,Mei Jiang,Bingdong Li,Hao Hao,Yu Song,Bo Jiang,Aimin Zhou
Main category: cs.CL
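TL;DR: This paper proposes EduResearchBench, the first fine-grained evaluation platform for educational academic writing. Built on the Hierarchical Atomic Task Decomposition (HATD) framework, it covers 6 research modules and 24 atomic tasks. The authors also design a curriculum learning strategy to train a specialized model, EduWrite (30B), which experiments show outperforms larger general-purpose models on educational academic writing.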
Details
Motivation: Existing LLM benchmarks mostly assess single-shot, monolithic generation. They lack the fine-grained evaluation needed for complex academic research workflows, so specific capability bottlenecks are hard to identify.
Method: Proposes the Hierarchical Atomic Task Decomposition (HATD) framework and builds EduResearchBench, an evaluation platform with 6 modules and 24 atomic tasks; designs a curriculum learning strategy; curates 11K high-quality instruction pairs from 55K academic samples to train the specialized model EduWrite (30B).
Result: EduWrite (30B) substantially outperforms 72B general-purpose models on multiple core metrics, which the authors take as evidence that in vertical domains data quality density and staged curriculum training matter more than parameter scale.
Conclusion: Fine-grained task decomposition and curriculum learning are effective ways to improve LLMs in specialized domains such as educational academic writing; EduResearchBench offers AI4SS a diagnosable, extensible evaluation paradigm.
Abstract: While Large Language Models (LLMs) are reshaping the paradigm of AI for Social Science (AI4SS), rigorously evaluating their capabilities in scholarly writing remains a major challenge. Existing benchmarks largely emphasize single-shot, monolithic generation and thus lack the fine-grained assessments required to reflect complex academic research workflows. To fill this gap, we introduce EduResearchBench, the first comprehensive evaluation platform dedicated to educational academic writing. EduResearchBench is built upon our Hierarchical Atomic Task Decomposition (HATD) framework, which decomposes an end-to-end research workflow into six specialized research modules (e.g., Quantitative Analysis, Qualitative Research, and Policy Research) spanning 24 fine-grained atomic tasks. This taxonomy enables an automated evaluation pipeline that mitigates a key limitation of holistic scoring, where aggregate scores often obscure specific capability bottlenecks, and instead provides fine-grained, diagnostic feedback on concrete deficiencies. Moreover, recognizing the high cognitive load inherent in scholarly writing, we propose a curriculum learning strategy that progressively builds competence from foundational skills to complex methodological reasoning and argumentation. Leveraging 55K raw academic samples, we curate 11K high-quality instruction pairs to train EduWrite, a specialized educational scholarly writing model. Experiments show that EduWrite (30B) substantially outperforms larger general-purpose models (72B) on multiple core metrics, demonstrating that in vertical domains, data quality density and hierarchically staged training curricula are more decisive than parameter scale.
[2] Indic-TunedLens: Interpreting Multilingual Models in Indian Languages
Mihir Panchal,Deeksha Varshney,Mamta,Asif Ekbal
Main category: cs.CL
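TL;DR: This paper proposes Indic-TunedLens, a new interpretability framework designed for Indian languages. It learns shared affine transformations that align each language's hidden states with the target output distribution, markedly improving the interpretability of multilingual LLMs on Indian languages, especially morphologically rich, low-resource ones.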
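The difference between the Logit Lens and a tuned-lens-style decoder can be illustrated with a minimal sketch. It assumes a toy model with a 2-dimensional hidden state and a 3-token vocabulary; the names `logit_lens`, `tuned_lens`, `UNEMBED`, and `SWAP` are hypothetical, the final layer normalization is omitted, and nothing here is taken from the authors' code. In the paper's setting the affine map would be learned; here it is fixed by hand.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden, unembed):
    """Decode an intermediate hidden state directly with the unembedding."""
    return softmax(matvec(unembed, hidden))


def tuned_lens(hidden, unembed, affine_weight, affine_bias):
    """Apply an affine map to the hidden state first, then decode it."""
    mapped = matvec(affine_weight, hidden)
    adjusted = [m + b for m, b in zip(mapped, affine_bias)]
    return softmax(matvec(unembed, adjusted))


# Toy values, chosen by hand for illustration only.
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens x 2 hidden dims
HIDDEN = [0.5, -0.2]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
SWAP = [[0.0, 1.0], [1.0, 0.0]]  # stands in for a learned transformation
ZERO_BIAS = [0.0, 0.0]

if __name__ == "__main__":
    print(logit_lens(HIDDEN, UNEMBED))
    print(tuned_lens(HIDDEN, UNEMBED, SWAP, ZERO_BIAS))
```

With the identity map and a zero bias the two decoders coincide, so the affine map is the only thing that distinguishes them in this sketch.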
Details
Motivation: Multilingual LLMs are widely deployed in linguistically diverse regions such as India, yet existing interpretability tools mainly target English. LLMs also tend to operate in English-centric representation spaces, which makes cross-lingual interpretability difficult.
Method: Proposes the Indic-TunedLens framework. Unlike the standard Logit Lens, which decodes intermediate activations directly, it adjusts hidden states for each target language so that they align with the target output distribution, enabling more faithful decoding of representations, and it learns affine transformations shared across languages.
Result: Evaluated on the MMLU benchmark across 10 Indian languages, Indic-TunedLens significantly outperforms state-of-the-art interpretability methods, with the clearest gains on morphologically complex, low-resource languages; the results also shed light on layer-wise semantic encoding in multilingual transformers.
Conclusion: Indic-TunedLens provides a more accurate and language-adaptive interpretability tool for multilingual LLMs, Indian languages in particular, advancing cross-lingual representation understanding and AI transparency for low-resource languages.
Abstract: Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions like India, yet most interpretability tools remain tailored to English. Prior work reveals that LLMs often operate in English-centric representation spaces, making cross-lingual interpretability a pressing concern. We introduce Indic-TunedLens, a novel interpretability framework specifically for Indian languages that learns shared affine transformations. Unlike the standard Logit Lens, which directly decodes intermediate activations, Indic-TunedLens adjusts hidden states for each target language, aligning them with the target output distributions to enable more faithful decoding of model representations. We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly improves over SOTA interpretability methods, especially for morphologically rich, low-resource languages. Our results provide crucial insights into the layer-wise semantic encoding of multilingual transformers. Our model is available at https://huggingface.co/spaces/AnonymousAccountACL/IndicTunedLens. Our code is available at https://github.com/AnonymousAccountACL/IndicTunedLens.
[3] CGRA-DeBERTa Concept Guided Residual Augmentation Transformer for Theologically Islamic Understanding
Tahir Hussain,Saddam Hussain Khan
Main category: cs.CL
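TL;DR: This paper proposes CGRA DeBERTa, a concept-guided residual domain-augmentation transformer framework for improving question-answering accuracy on classical Islamic texts, especially Hadith. By combining a customized DeBERTa backbone, lightweight LoRA adaptation, and a concept-aware residual gating mechanism, the model markedly improves semantic understanding and theological precision while remaining computationally efficient.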
Details
Motivation: QA over classical Islamic texts faces domain-specific semantics, long-range context dependencies, and concept-sensitive reasoning; existing models struggle to deliver both accuracy and theological precision.
Method: Builds CGRA DeBERTa on a customized DeBERTa backbone with lightweight LoRA adaptation; designs Concept Guided Residual Blocks that incorporate priors from an Islamic Concept Dictionary of 12 core terms; uses a concept gating mechanism with importance-weighted attention that scales key tokens differentially by 1.04x to 3.00x.
Result: On a dataset of 42,591 QA pairs built from Sahih al-Bukhari and Sahih Muslim, EM reaches 97.85, well above BERT (75.87) and the original DeBERTa (89.77), with only about 8% additional inference overhead; qualitative evaluation shows better extraction, discrimination, and theological precision.
Conclusion: CGRA DeBERTa delivers an efficient, interpretable, and highly accurate Hadith QA system with potential for educational use, and it can supply the necessary theological nuance.
Abstract: Accurate QA over classical Islamic texts remains challenging due to domain-specific semantics, long-context dependencies, and concept-sensitive reasoning. Therefore, CGRA DeBERTa, a concept-guided residual domain-augmentation transformer framework, is proposed to enhance theological QA over Hadith corpora. CGRA DeBERTa builds on a customized DeBERTa transformer backbone with lightweight LoRA-based adaptations and a residual concept-aware gating mechanism. The customized DeBERTa embedding block learns global and positional context, while Concept Guided Residual Blocks incorporate theological priors from a curated Islamic Concept Dictionary of 12 core terms. Moreover, the Concept Gating Mechanism selectively amplifies semantically critical tokens via importance-weighted attention, applying differential scaling from 1.04 to 3.00. This design preserves contextual integrity, strengthens domain-specific semantic representations, and enables accurate span extraction while maintaining computational efficiency. This paper reports the results of training CGRA using a specially constructed dataset of 42591 QA pairs from the text of Sahih al-Bukhari and Sahih Muslim. While BERT achieved an EM score of 75.87 and DeBERTa one of 89.77, our model scored 97.85, surpassing DeBERTa by 8.08 points on an absolute scale, all while adding approximately 8% inference overhead due to parameter-efficient gating. The qualitative evaluation noted better extraction, discrimination, and theological precision. This study presents Hadith QA systems that are efficient, interpretable, and accurate, and that can provide educational materials with the necessary theological nuance at scale.
[4] AIC CTU@AVerImaTeC: dual-retriever RAG for image-text fact checking
Herbert Ullrich,Jan Drchal
Main category: cs.CL
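TL;DR: This paper presents the authors' third-place system in the AVerImaTeC shared task. It combines last year's retrieval-augmented generation (RAG) pipeline with a reverse image search (RIS) module and needs only a single multimodal LLM call per fact-check, making it cheap and easy to reproduce and adapt.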
Details
Motivation: To provide a low-cost, easily reproducible, and easily adjustable fact-checking system that can serve as a starting point for further experiments.
Method: Combines three decoupled modules into a RAG+RIS system: text retrieval (similarity search), image retrieval (reverse image search via API calls), and generation (GPT5.1).
Result: Third place in the AVerImaTeC shared task, with competitive performance at an average cost of just $0.013 per fact-check.
Conclusion: The system is simple, efficient, cheap, and highly reproducible, which makes it a suitable baseline or starting point for multimodal fact-checking research; the authors release the code, prompts, and vector stores, along with a cost analysis and directions for improvement.
Abstract: In this paper, we present our 3rd place system in the AVerImaTeC shared task, which combines our last year's retrieval-augmented generation (RAG) pipeline with a reverse image search (RIS) module. Despite its simplicity, our system delivers competitive performance with a single multimodal LLM call per fact-check at just $0.013 on average using GPT5.1 via OpenAI Batch API. Our system is also easy to reproduce and tweak, consisting of only three decoupled modules (a textual retrieval module based on similarity search, an image retrieval module based on API-accessed RIS, and a generation module using GPT5.1), which is why we suggest it as an accessible starting point for further experimentation. We publish its code and prompts, as well as our vector stores and insights into the scheme's running costs and directions for further improvement.
[5] OpaqueToolsBench: Learning Nuances of Tool Behavior Through Interaction
Skyler Hallinan,Thejas Venkatesh,Xiang Ren,Sai Praneeth Karimireddy,Ashwin Paranjape,Yuhao Zhang,Jack Hessel
Main category: cs.CL
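TL;DR: This paper proposes ToolObserver, a framework that iteratively refines tool documentation by observing execution feedback from tool-calling trajectories, markedly improving LLM performance in opaque-tool environments and reducing test-time token consumption.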
Details
Motivation: Existing tool-calling benchmarks assume that tools are completely and clearly documented, but many real-world tools (such as general search APIs) are opaque and lack clear best practices or failure modes. This calls for studying how LLMs can learn through interaction to understand and use opaque tools better.
Method: Builds the OpaqueToolsBench benchmark, covering three task types (function calling, interactive chess, and long-trajectory search), and proposes the ToolObserver framework, which iteratively refines tool documentation based on execution feedback from tool-calling trajectories.
Result: ToolObserver outperforms existing automatic documentation methods across OpaqueToolsBench tasks, notably in hard settings; in test-time exploration settings it consumes 3.5-7.5x fewer total tokens than the best baseline.
Conclusion: Iteratively refining documentation from execution feedback is an effective and efficient way to improve LLM performance with opaque tools, and ToolObserver offers a new paradigm for tool learning in real deployments.
Abstract: Tool-calling is essential for Large Language Model (LLM) agents to complete real-world tasks. While most existing benchmarks assume simple, perfectly documented tools, real-world tools (e.g., general "search" APIs) are often opaque, lacking clear best practices or failure modes. Can LLM agents improve their performance in environments with opaque tools by interacting and subsequently improving documentation? To study this, we create OpaqueToolsBench, a benchmark consisting of three distinct task-oriented environments: general function calling, interactive chess playing, and long-trajectory agentic search. Each environment provides underspecified tools that models must learn to use effectively to complete the task. Results on OpaqueToolsBench suggest existing methods for automatically documenting tools are expensive and unreliable when tools are opaque. To address this, we propose a simple framework, ToolObserver, that iteratively refines tool documentation by observing execution feedback from tool-calling trajectories. Our approach outperforms existing methods on OpaqueToolsBench across datasets, even in relatively hard settings. Furthermore, for test-time tool exploration settings, our method is also efficient, consuming 3.5-7.5x fewer total tokens than the best baseline.
[6] Extracting Consumer Insight from Text: A Large Language Model Approach to Emotion and Evaluation Measurement
Stephan Ludwig,Peter J. Danaher,Xiaohao Yang,Yu-Ting Lin,Ehsan Abedin,Dhruv Grewal,Lan Du
Main category: cs.CL
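TL;DR: This paper presents the Linguistic eXtractor (LX), a fine-tuned large language model specialized in identifying 16 consumption-related emotions and 4 evaluation constructs in consumer-generated text. LX substantially outperforms leading models such as GPT-4 Turbo on multiple tasks, and a free, no-code web application is available, supporting automated, scalable measurement of consumer perceptions in marketing research.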
Details
Motivation: Accurately measuring consumer emotions and evaluations from unstructured text is a core challenge for marketing research and practice; existing methods fall short on fine-grained emotion recognition and on alignment with marketing constructs.
Method: Builds and fine-tunes the LX language model on consumer-authored text paired with self-reported labels for 16 emotions and 4 evaluations (trust, commitment, recommendation, sentiment); evaluates performance with macro-F1 and accuracy; uses seemingly unrelated regression (SUR) to analyze the paths by which emotions affect product ratings and purchase behavior.
Result: LX reaches 81% macro-F1 on open-ended survey responses and over 95% accuracy on third-party-annotated Amazon and Yelp reviews; empirically, most emotions affect purchase indirectly through product ratings, but some, such as discontent and peacefulness, have direct effects; a free, no-code web tool is available.
Conclusion: LX establishes a new methodological foundation for measuring consumer perception, validates LLMs for accurately extracting marketing constructs, and markedly improves text-based insight and scalability in marketing research.
Abstract: Accurately measuring consumer emotions and evaluations from unstructured text remains a core challenge for marketing research and practice. This study introduces the Linguistic eXtractor (LX), a fine-tuned, large language model trained on consumer-authored text that also has been labeled with consumers' self-reported ratings of 16 consumption-related emotions and four evaluation constructs: trust, commitment, recommendation, and sentiment. LX consistently outperforms leading models, including GPT-4 Turbo, RoBERTa, and DeepSeek, achieving 81% macro-F1 on open-ended survey responses and greater than 95% accuracy on third-party-annotated Amazon and Yelp reviews. An application of LX to online retail data, using seemingly unrelated regression, affirms that review-expressed emotions predict product ratings, which in turn predict purchase behavior. Most emotional effects are mediated by product ratings, though some emotions, such as discontent and peacefulness, influence purchase directly, indicating that emotional tone provides meaningful signals beyond star ratings. To support its use, a no-code, cost-free LX web application is available, enabling scalable analyses of consumer-authored text. In establishing a new methodological foundation for consumer perception measurement, this research demonstrates new methods for leveraging large language models to advance marketing research and practice, thereby achieving validated detection of marketing constructs from consumer data.
[7] Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory
Zihao Tang,Xin Yu,Ziyu Xiao,Zengxuan Wen,Zelin Li,Jiaxi Zhou,Hualei Wang,Haohua Wang,Haizhen Huang,Weiwei Deng,Feng Sun,Qi Zhang
Main category: cs.CL
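TL;DR: This paper proposes Mnemis, a framework that combines two memory retrieval mechanisms, System-1 (similarity retrieval) and System-2 (global selection via hierarchical traversal), to improve the semantic and structural relevance of long-term LLM memory, reaching SOTA performance on LoCoMo and LongMemEval-S.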
Details
Motivation: Existing similarity-based memory retrieval methods (such as RAG and Graph-RAG) fall short in scenarios that require global reasoning or comprehensive coverage of relevant information.
Method: Proposes the Mnemis framework, which builds a base graph for similarity retrieval (System-1) and a hierarchical graph that supports top-down traversal over semantic hierarchies (Global Selection, System-2); the two routes work together for more comprehensive memory retrieval.
Result: Achieves SOTA results on two long-term memory benchmarks, scoring 93.9 on LoCoMo and 91.6 on LongMemEval-S (using GPT-4.1-mini).
Conclusion: A dual-route memory architecture that fuses System-1 and System-2 markedly improves how LLMs organize and retrieve historical information, especially for tasks that need global understanding and structured coverage.
Abstract: AI Memory, specifically how models organize and retrieve historical messages, is becoming increasingly valuable to Large Language Models (LLMs), yet existing methods (RAG and Graph-RAG) primarily retrieve memory through similarity-based mechanisms. While efficient, such System-1-style retrieval struggles with scenarios that require global reasoning or comprehensive coverage of all relevant information. In this work, we propose Mnemis, a novel memory framework that integrates System-1 similarity search with a complementary System-2 mechanism, termed Global Selection. Mnemis organizes memory into a base graph for similarity retrieval and a hierarchical graph that enables top-down, deliberate traversal over semantic hierarchies. By combining the complementary strengths of both retrieval routes, Mnemis retrieves memory items that are both semantically and structurally relevant. Mnemis achieves state-of-the-art performance across all compared methods on long-term memory benchmarks, scoring 93.9 on LoCoMo and 91.6 on LongMemEval-S using GPT-4.1-mini.
[8] NeuroSymActive: Differentiable Neural-Symbolic Reasoning with Active Exploration for Knowledge Graph Question Answering
Rong Fu,Yang Li,Zeyu Zhang,Jiekai Wu,Yaohua Liu,Shuaishuai Cao,Yangchen Zeng,Yuhang Zhang,Xiaojing Du,Chuang Zhao,Kangning Cui,Simon Fong
Main category: cs.CL
TL;DR: NeuroSymActive is a modular neural-symbolic framework for KGQA that combines differentiable symbolic reasoning with active, value-guided graph exploration to improve accuracy and reduce costly lookups.
Details
Motivation: Large language models struggle with knowledge-intensive, multi-hop queries; knowledge graphs offer factual grounding but are hard to integrate efficiently and robustly with neural models.
Method: NeuroSymActive integrates a differentiable neural-symbolic reasoning layer (using soft-unification) with an active exploration controller that uses a neural path evaluator and Monte-Carlo–style policy to prioritize high-value path expansions.
Result: NeuroSymActive achieves strong answer accuracy on standard KGQA benchmarks while reducing the number of expensive graph lookups and model calls compared to retrieval-augmented baselines.
Conclusion: Combining differentiable symbolic reasoning with active, value-guided exploration enables efficient, accurate, and scalable KGQA without sacrificing neural flexibility or symbolic precision.
Abstract: Large pretrained language models and neural reasoning systems have advanced many natural language tasks, yet they remain challenged by knowledge-intensive queries that require precise, structured multi-hop inference. Knowledge graphs provide a compact symbolic substrate for factual grounding, but integrating graph structure with neural models is nontrivial: naively embedding graph facts into prompts leads to inefficiency and fragility, while purely symbolic or search-heavy approaches can be costly in retrievals and lack gradient-based refinement. We introduce NeuroSymActive, a modular framework that combines a differentiable neural-symbolic reasoning layer with an active, value-guided exploration controller for Knowledge Graph Question Answering. The method couples soft-unification style symbolic modules with a neural path evaluator and a Monte-Carlo style exploration policy that prioritizes high-value path expansions. Empirical results on standard KGQA benchmarks show that NeuroSymActive attains strong answer accuracy while reducing the number of expensive graph lookups and model calls compared to common retrieval-augmented baselines.
[9] Far Out: Evaluating Language Models on Slang in Australian and Indian English
Deniz Kaya Dilsiz,Dipankar Srirag,Aditya Joshi
Main category: cs.CL
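TL;DR: This paper evaluates how well seven mainstream language models understand slang in Indian English (en-IN) and Australian English (en-AU). It builds two datasets (web and gen) and tests three tasks, finding that models perform far better on the discriminative task (TWS) than on the generative tasks (TWP/TWP*), and that overall understanding of en-IN is better than that of en-AU.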
Details
Motivation: Language models show systematic performance gaps on non-standard language varieties (such as regional English slang), yet their understanding of variety-specific slang has not been examined in depth, particularly across the rich varieties of English.
Method: Builds two complementary slang datasets: web (377 real usage examples from Urban Dictionary) and gen (1,492 synthetic usage examples); evaluates seven SOTA language models on three tasks: target word prediction (TWP), guided target word prediction (TWP*), and target word selection (TWS).
Result: (1) Average TWS accuracy (0.49) is far higher than that of TWP/TWP* (0.03); (2) models perform slightly better on web than on gen; (3) en-IN outperforms en-AU overall, most notably on TWS, where accuracy rises from 0.44 to 0.54; (4) the results reveal a fundamental asymmetry between generative and discriminative competence in slang understanding.
Conclusion: Current language models understand English dialect slang only to a limited degree; their discriminative ability is clearly stronger than their generative ability, and they adapt unevenly to different varieties (en-IN vs. en-AU), highlighting shortcomings in modeling sociolinguistic diversity.
Abstract: Language models exhibit systematic performance gaps when processing text in non-standard language varieties, yet their ability to comprehend variety-specific slang remains underexplored for several languages. We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models. We construct two complementary datasets: web, containing 377 web-sourced usage examples from Urban Dictionary, and gen, featuring 1,492 synthetically generated usages of these slang terms, across diverse scenarios. We assess language models on three tasks: target word prediction (TWP), guided target word prediction (TWP*), and target word selection (TWS). Our results reveal four key findings: (1) higher average model performance on TWS versus TWP and TWP*, with average accuracy score increasing from 0.03 to 0.49 respectively; (2) stronger average model performance on web versus gen datasets, with average similarity score increasing by 0.03 and 0.05 across TWP and TWP* tasks respectively; (3) en-IN tasks outperform en-AU when averaged across all models and datasets, with TWS demonstrating the largest disparity, increasing average accuracy from 0.44 to 0.54. These findings underscore fundamental asymmetries between generative and discriminative competencies for variety-specific language, particularly for slang expressions, even in a technologically rich language such as English.
[10] Orchestration-Free Customer Service Automation: A Privacy-Preserving and Flowchart-Guided Framework
Mengze Hong,Chen Jason Zhang,Zichang Guo,Hanlin Gu,Di Jiang,Li Qing
Main category: cs.CL
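TL;DR: This paper proposes an orchestration-free customer service automation framework that achieves end-to-end automation based on Task-Oriented Flowcharts (TOFs), combining locally deployed small language models with flowchart-based decentralized knowledge distillation to address data scarcity and privacy concerns.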
Details
Motivation: Existing customer service automation approaches either depend heavily on modular systems with complex manual orchestration, or rely on over-simplified instruction templates that generalize poorly.
Method: Proposes Task-Oriented Flowcharts (TOFs) as the modeling tool and defines their components and evaluation metrics; designs a low-cost flowchart construction algorithm that extracts procedural knowledge from service dialogues; adopts locally deployed small models together with a TOF-based decentralized distillation strategy.
Result: Experiments on a range of customer service tasks show that the method outperforms strong baselines and market products in both quantitative metrics and practical application.
Conclusion: The TOF framework achieves end-to-end customer service automation without manual intervention while balancing performance, privacy, and deployability, offering a new paradigm for future service automation.
Abstract: Customer service automation has seen growing demand within digital transformation. Existing approaches either rely on modular system designs with extensive agent orchestration or employ over-simplified instruction schemas, providing limited guidance and poor generalizability. This paper introduces an orchestration-free framework using Task-Oriented Flowcharts (TOFs) to enable end-to-end automation without manual intervention. We first define the components and evaluation metrics for TOFs, then formalize a cost-efficient flowchart construction algorithm to abstract procedural knowledge from service dialogues. We emphasize local deployment of small language models and propose decentralized distillation with flowcharts to mitigate data scarcity and privacy issues in model training. Extensive experiments validate the effectiveness in various service tasks, with superior quantitative and application performance compared to strong baselines and market products. By releasing a web-based system demonstration with case studies, we aim to promote streamlined creation of future service automation.
[11] Making Large Language Models Speak Tulu: Structured Prompting for an Extremely Low-Resource Language
Prathamesh Devadiga,Paras Chopra
Main category: cs.CL
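TL;DR: This paper studies whether large language models can converse in a language largely absent from their training data (Tulu, a Dravidian language), eliciting basic conversational ability through structured prompting rather than fine-tuning. The approach combines explicit grammar documentation, negative constraints that suppress interference from related languages, romanization standardization, and quality-controlled synthetic data generation via self-play. Experiments show that it markedly reduces vocabulary contamination and improves grammatical accuracy.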
Details
Motivation: To investigate whether large language models can converse effectively in a language with almost no training data, and so extend their multilingual coverage.
Method: Uses a structured prompting strategy that combines explicit grammar documentation, negative constraints, romanization standardization, and self-play generation of high-quality synthetic data, with no model fine-tuning.
Result: Across three mainstream LLMs, vocabulary contamination falls from 80% to 5% and grammatical accuracy reaches 85%; negative constraints yield gains of 12–18 percentage points, while the effect of grammar documentation varies by model (8–22 percentage points).
Conclusion: Carefully designed prompt engineering alone can elicit basic conversational ability in an extremely low-resource language, and negative constraints are a key, consistent source of improvement.
Abstract: Can large language models converse in languages virtually absent from their training data? We investigate this question through a case study on Tulu, a Dravidian language with over 2 million speakers but minimal digital presence. Rather than fine-tuning an LLM, we examine whether structured prompts alone can elicit basic conversational ability under controlled prompting. We systematically tackle various challenges posed by absence of training data for Tulu by combining explicit grammar documentation, negative constraints to suppress high-probability tokens from related languages, romanization standardization, and quality-controlled synthetic data generation via self-play. Evaluated on a manually curated held-out set across three LLMs (Gemini 2.0 Flash, GPT-4o, Llama 3.1 70B) and validated by native speakers, our approach reduces vocabulary contamination from 80% to 5% while achieving 85% grammatical accuracy. Cross-model analysis reveals that negative constraints provide consistent improvements (12–18 percentage points), while grammar documentation effects vary by model architecture (8–22 points).
[12] The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems
Xiaoze Liu,Ruowang Zhang,Weichen Yu,Siheng Xiong,Liu He,Feijie Wu,Hoin Jung,Matt Fredrikson,Xiaoqian Wang,Jing Gao
Main category: cs.CL
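TL;DR: This paper proposes the Vision Wormhole framework, which uses the visual interface of vision-language models to enable efficient, model-agnostic, text-free communication between agents, improving scalability and inference efficiency through a universal visual codec and a hub-and-spoke topology.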
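The scalability benefit of a hub-and-spoke layout can be made concrete by counting learned mappings. The sketch below assumes one directed translator per ordered sender-receiver pair in the pairwise case, and one encoder plus one decoder per model in the hub case. These exact counts and the function names are illustrative assumptions, not figures from the paper.

```python
def pairwise_translators(num_models: int) -> int:
    """Translators needed if every ordered (sender, receiver) pair has its own."""
    return num_models * (num_models - 1)


def hub_and_spoke_adapters(num_models: int) -> int:
    """Adapters needed if each model only maps to and from one shared space."""
    return 2 * num_models


if __name__ == "__main__":
    # Print the two counts side by side for a few system sizes.
    for n in (2, 4, 8, 16):
        print(n, pairwise_translators(n), hub_and_spoke_adapters(n))
```

Under these assumptions the pairwise count grows quadratically and the hub count linearly, so the hub layout needs fewer mappings once more than three models are involved.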
Details
Motivation: Existing LLM-based multi-agent systems are limited by the inefficiency of discrete text communication (high runtime overhead and information quantization loss), while existing latent state transfer methods struggle to support scalable, modular communication between heterogeneous models.
Method: Proposes the Vision Wormhole framework: 1) a Universal Visual Codec that maps heterogeneous reasoning traces into a shared continuous latent space; 2) injection of the encoded latents into the receiver's visual pathway, treating the vision encoder as a universal communication port; 3) a hub-and-spoke topology that reduces alignment complexity; 4) a label-free teacher-student distillation objective that aligns the high-speed visual channel with the robust reasoning patterns of the text pathway.
Result: Experiments on heterogeneous model families such as Qwen-VL and Gemma show that Vision Wormhole reduces end-to-end wall-clock time while maintaining reasoning fidelity comparable to standard text-based multi-agent systems.
Conclusion: Vision Wormhole offers an efficient, scalable, model-agnostic, text-free communication paradigm for heterogeneous multi-agent systems, moving past the bottleneck of text communication toward more efficient and robust collaborative reasoning.
Abstract: Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain shackled by the inefficiency of discrete text communication, which imposes significant runtime overhead and information quantization loss. While latent state transfer offers a high-bandwidth alternative, existing approaches either assume homogeneous sender-receiver architectures or rely on pair-specific learned translators, limiting scalability and modularity across diverse model families with disjoint manifolds. In this work, we propose the Vision Wormhole, a novel framework that repurposes the visual interface of Vision-Language Models (VLMs) to enable model-agnostic, text-free communication. By introducing a Universal Visual Codec, we map heterogeneous reasoning traces into a shared continuous latent space and inject them directly into the receiver's visual pathway, effectively treating the vision encoder as a universal port for inter-agent telepathy. Our framework adopts a hub-and-spoke topology to reduce pairwise alignment complexity from O(N^2) to O(N) and leverages a label-free, teacher-student distillation objective to align the high-speed visual channel with the robust reasoning patterns of the text pathway. Extensive experiments across heterogeneous model families (e.g., Qwen-VL, Gemma) demonstrate that the Vision Wormhole reduces end-to-end wall-clock time in controlled comparisons while maintaining reasoning fidelity comparable to standard text-based MAS. Code is available at https://github.com/xz-liu/heterogeneous-latent-mas
[13] Measuring Social Integration Through Participation: Categorizing Organizations and Leisure Activities in the Displaced Karelians Interview Archive using LLMs
Joonatan Laato,Veera Schroderus,Jenna Kanerva,Jenni Kauppi,Virpi Lummaa,Filip Ginter
Main category: cs.CL
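TL;DR: This paper proposes an LLM-based categorization framework for structured labeling of 350K leisure-activity and organizational-participation entities extracted from oral-history interviews with Finnish families evacuated from Karelia in World War II, supporting quantitative historical research such as studies of social integration.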
Details
Motivation: Digitizing historical archives has widened the scale at which social life can be studied, but raw text rarely meets the quantitative needs of historians and sociologists directly; the 350K extracted activity and organization mentions, with 71K unique names, are too many and too varied, so an efficient and reliable automatic categorization method is needed.
Method: Builds a categorization framework with four dimensions (activity type, sociality, regularity, and physical demand); manually annotates a gold-standard set for evaluation; tests how well open-weight LLMs apply the framework, using voting across multiple inference runs.
Result: With voting, an open-weight LLM comes close to expert judgment; all 350K entities receive structured labels, producing a new structured resource for downstream studies of social integration.
Conclusion: A lightweight, interpretable LLM-based categorization framework can effectively support quantitative analysis of large-scale text in the humanities, offering a workable path for knowledge modeling and automated annotation in digital humanities.
Abstract: Digitized historical archives make it possible to study everyday social life on a large scale, but the information extracted directly from text often does not allow one to answer the research questions posed by historians or sociologists in a quantitative manner. We address this problem in a large collection of Finnish World War II Karelian evacuee family interviews. Prior work extracted more than 350K mentions of leisure time activities and organizational memberships from these interviews, yielding 71K unique activity and organization names, far too many to analyze directly. We develop a categorization framework that captures key aspects of participation (the kind of activity/organization, how social it typically is, how regularly it happens, and how physically demanding it is). We annotate a gold-standard set to allow for a reliable evaluation, and then test whether large language models can apply the same schema at scale. Using a simple voting approach across multiple model runs, we find that an open-weight LLM can closely match expert judgments. Finally, we apply the method to label the 350K entities, producing a structured resource for downstream studies of social integration and related outcomes.
[14] TAROT: Test-driven and Capability-adaptive Curriculum Reinforcement Fine-tuning for Code Generation with Large Language Models
Chansung Park,Juyong Jiang,Fan Wang,Sayak Paul,Jiasi Shen,Jing Tang,Jianguo Li
Main category: cs.CL
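TL;DR: This paper proposes TAROT, which builds a four-tier test suite and decouples curriculum progression from raw reward scores to achieve capability-adaptive reinforcement fine-tuning, improving the functional correctness and robustness of code generated by large language models.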
Details
Motivation: Existing reinforcement fine-tuning methods ignore the heterogeneous difficulty and granularity of test cases, which leads to unevenly distributed reward signals and biased gradient updates, and makes it hard to elicit the deep reasoning LLMs need to generate algorithmically complex, robust code.
Method: Proposes Test-driven and cApability-adaptive cuRriculum reinfOrcement fine-Tuning (TAROT), which builds a four-tier test suite (basic, intermediate, complex, edge) for each problem and decouples curriculum progression from raw rewards, enabling capability-conditioned evaluation and selection among multiple curriculum policies.
Result: Experiments show that a model's inherent capability determines its best curriculum policy: less capable models benefit from an easy-to-hard curriculum, while more capable models do better with a hard-first curriculum; TAROT markedly improves the functional correctness and robustness of generated code.
Conclusion: TAROT provides a reproducible, capability-adaptive framework for curriculum reinforcement fine-tuning that tailors the training curriculum to model capability and eases the unstable optimization that uneven test difficulty causes in existing RFT methods.
Abstract: Large Language Models (LLMs) are changing the coding paradigm, known as vibe coding, yet synthesizing algorithmically sophisticated and robust code still remains a critical challenge. Incentivizing the deep reasoning capabilities of LLMs is essential to overcoming this hurdle. Reinforcement Fine-Tuning (RFT) has emerged as a promising strategy to address this need. However, most existing approaches overlook the heterogeneous difficulty and granularity inherent in test cases, leading to an imbalanced distribution of reward signals and consequently biased gradient updates during training. To address this, we propose Test-driven and cApability-adaptive cuRriculum reinfOrcement fine-Tuning (TAROT). TAROT systematically constructs, for each problem, a four-tier test suite (basic, intermediate, complex, edge), providing a controlled difficulty landscape for curriculum design and evaluation. Crucially, TAROT decouples curriculum progression from raw reward scores, enabling capability-conditioned evaluation and principled selection from a portfolio of curriculum policies rather than incidental test-case difficulty composition. This design fosters stable optimization and more efficient competency acquisition. Extensive experimental results reveal that the optimal curriculum for RFT in code generation is closely tied to a model's inherent capability, with less capable models achieving greater gains with an easy-to-hard progression, whereas more competent models excel under a hard-first curriculum. TAROT provides a reproducible method that adaptively tailors curriculum design to a model's capability, thereby consistently improving the functional correctness and robustness of the generated code. All code and data are released to foster reproducibility and advance community research at https://github.com/deep-diver/TAROT.
[15] In Agents We Trust, but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations
Mohammad Aflah Khan,Mahsa Amani,Soumi Das,Bishwamittra Ghosh,Qinyuan Wu,Krishna P. Gummadi,Manish Gupta,Abhilasha Ravichander
Main category: cs.CL
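TL;DR: This paper studies systematic source preferences in how LLM agents retrieve and present information. It finds that several mainstream LLMs show stable, predictable source biases that are hard to remove through prompting, with consequences for information fairness and user perception.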
Details
Motivation: Prior work mostly addresses bias in the content LLMs generate and overlooks the source-preference bias that can arise when LLMs filter and present information (that is, when they allocate user attention); the authors aim to expose and quantify this latent but important mechanism.
Method: Runs controlled experiments on 12 LLMs from 6 providers, covering synthetic and real-world tasks, to test systematically how models favor information from different sources (such as media outlets and journals), and analyzes the effect of contextual framing and explicit debiasing prompts.
Result: Several LLMs show strong, consistent source preferences; the preferences are sensitive to contextual framing and can even outweigh content quality; explicit prompting does not remove them; the phenomenon helps explain the left-leaning skew in news recommendations observed in prior work.
Conclusion: The information selection process of LLM agents carries a systematic source bias that has not been fully recognized; its origins need deeper study, along with interventions that improve transparency and user control.
Abstract: Agents based on Large Language Models (LLMs) are increasingly being deployed as interfaces to information on online platforms. These agents filter, prioritize, and synthesize information retrieved from the platforms' back-end databases or via web search. In these scenarios, LLM agents govern the information users receive, by drawing users' attention to particular instances of retrieved information at the expense of others. While much prior work has focused on biases in the information LLMs themselves generate, less attention has been paid to the factors that influence what information LLMs select and present to users. We hypothesize that when information is attributed to specific sources (e.g., particular publishers, journals, or platforms), current LLMs exhibit systematic latent source preferences; that is, they prioritize information from some sources over others. Through controlled experiments on twelve LLMs from six model providers, spanning both synthetic and real-world tasks, we find that several models consistently exhibit strong and predictable source preferences. These preferences are sensitive to contextual framing, can outweigh the influence of content itself, and persist despite explicit prompting to avoid them. They also help explain phenomena such as the observed left-leaning skew in news recommendations in prior work. Our findings advocate for deeper investigation into the origins of these preferences, as well as for mechanisms that provide users with transparency and control over the biases guiding LLM-powered agents.
[16] Towards Expectation Detection in Language: A Case Study on Treatment Expectations in Reddit
Aswathy Velutharambath,Amelie Wührl
Main category: cs.CL
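TL;DR: This paper introduces the new task of "expectation detection" and builds the RedHOTExpect dataset (4.5K medical Reddit posts), silver-labeled semi-automatically with a large language model and validated manually. It analyzes the linguistic features and content tendencies of the treatment expectations patients express online, finding that posts about physical illnesses show more optimism and proactive framing, and that posts focus mostly on expected benefits rather than risks.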
Details
Motivation: Patients' treatment expectations significantly affect treatment outcomes, but they have been little studied outside clinical settings (for example on online platforms such as Reddit); NLP has no prior work on modeling expectations, so the task needs to be defined and supported with data.
Method: Proposes the Expectation Detection task; builds the RedHOTExpect corpus from medical subreddits; silver-labels it with an LLM and validates the labels manually (accuracy ~78%); carries out linguistic pattern analysis and topic mining.
Result: Posts about physical illnesses use optimistic and proactive framing more than mental-health posts; patients mainly discuss expected benefits and rarely mention negative outcomes; the analysis identifies key linguistic patterns that characterize expectations.
Conclusion: Expectation Detection is a meaningful new direction in NLP; RedHOTExpect offers the first publicly available expectation-annotated corpus for applications such as medical opinion mining and product design, and confirms that online patient text contains differentiated expressions of expectation that can be modeled.
Abstract: Patients' expectations towards their treatment have a substantial effect on the treatments' success. While primarily studied in clinical settings, online patient platforms like medical subreddits may hold complementary insights: treatment expectations that patients feel unnecessary or uncomfortable to share elsewhere. Despite this, no studies examine what type of expectations users discuss online and how they express them. Presumably this is because expectations have not been studied in natural language processing (NLP) before. Therefore, we introduce the task of Expectation Detection, arguing that expectations are relevant for many applications, including opinion mining and product design. Subsequently, we present a case study for the medical domain, where expectations are particularly crucial to extract. We contribute RedHOTExpect, a corpus of Reddit posts (4.5K posts) to study expectations in this context. We use a large language model (LLM) to silver-label the data and validate its quality manually (label accuracy ~78%). Based on this, we analyze which linguistic patterns characterize expectations and explore what patients expect and why. We find that optimism and proactive framing are more pronounced in posts about physical or treatment-related illnesses compared to mental-health contexts, and that in our dataset, patients mostly discuss benefits rather than negative outcomes. The RedHOTExpect corpus can be obtained from https://www.ims.uni-stuttgart.de/data/RedHOTExpect
[17] LuxMT Technical Report
Nils Rehlinger
Main category: cs.CL
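TL;DR: This paper presents LuxMT, a machine translation system based on Gemma 3 27B and fine-tuned for translation from Luxembourgish into French and English. It builds a new benchmark from the tourist magazine Luci, trains on LuxAlign and parliamentary transcripts, and filters low-quality segment pairs with LuxEmbedder. LuxMT clearly outperforms the baseline and also performs well on Luxembourgish-to-German, which is absent from the training data. LuxEmbedder is additionally explored as a quality estimation metric; it correlates strongly with reference-based metrics but should be used with caution.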
Details
Motivation: Address the scarcity of machine translation resources for Luxembourgish (LB) by building a high-quality dedicated translation system and an evaluation benchmark. Method: Fine-tunes Gemma 3 27B for LB→FR/EN translation; builds the human-translated Luci benchmark; trains on LuxAlign and on parliamentary transcripts augmented with Google Translate; filters low-equivalence parallel segment pairs with LuxEmbedder (LB sentence embeddings); evaluates how well LuxEmbedder correlates with other metrics as a quality estimation metric. Result: LuxMT shows strong improvements over the Gemma 3 baseline on LB→FR/EN and, unexpectedly, also on the unseen LB→DE direction; LuxEmbedder correlates strongly with reference-based metrics. Conclusion: LuxMT is a strong translation system specialized for Luxembourgish; LuxEmbedder has potential as a quality estimation metric, but its robustness and limits of applicability need further validation. Abstract: We introduce LuxMT, a machine translation system based on Gemma 3 27B and fine-tuned for translation from Luxembourgish (LB) into French (FR) and English (EN). To assess translation performance, we construct a novel benchmark covering LB-FR, LB-EN, and LB-FR using human-translated data from Luci, a tourist magazine about Luxembourg. Training data stems from LuxAlign, a parallel corpus of multilingual Luxembourgish news articles, and LB parliamentary transcripts augmented with Google Translate. We filter the data using LuxEmbedder, LB sentence embeddings, to remove low-equivalence segment-pairs. Overall, LuxMT's results suggest strong improvements over the Gemma 3 baseline, even for translating LB to German (DE), despite the training data not containing any DE. We also explore LuxEmbedder's potential to be used as a quality estimation metric and find strong correlations with other reference-based metrics. However, we call for further research to fully assess the metric's utility and advise using it with caution.
[18] Fine-Refine: Iterative Fine-grained Refinement for Mitigating Dialogue Hallucination
Xiangyan Chen,Yujian Gan,Matthew Purver
Main category: cs.CL
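TL;DR: This paper proposes Fine-Refine, a framework that decomposes dialogue responses into fine-grained units, verifies them against external knowledge, assesses fluency, and iteratively corrects factual errors, substantially improving the factual accuracy of dialogue systems.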
Details
Motivation: Existing refinement methods for dialogue systems operate only at the response level and overlook that a single response may contain multiple verifiable or unverifiable facts, so hallucinations are hard to eliminate. Method: Fine-Refine decomposes a response into atomic units, verifies the factuality of each unit against external knowledge, assesses fluency via perplexity, and then applies iterative fine-grained corrections. Result: On the HybriDialogue and OpendialKG datasets, the dialogue fact score improves by up to 7.63 points and factual coverage (Not Enough Information proportion) also improves, at a small cost in dialogue quality. Conclusion: Fine-grained verification and correction mitigate hallucination in LLM dialogue more effectively and strike a better balance between factuality and naturalness. Abstract: The tendency for hallucination in current large language models (LLMs) negatively impacts dialogue systems. Such hallucinations produce factually incorrect responses that may mislead users and undermine system trust. Existing refinement methods for dialogue systems typically operate at the response level, overlooking the fact that a single response may contain multiple verifiable or unverifiable facts. To address this gap, we propose Fine-Refine, a fine-grained refinement framework that decomposes responses into atomic units, verifies each unit using external knowledge, assesses fluency via perplexity, and iteratively corrects granular errors. We evaluate factuality across the HybriDialogue and OpendialKG datasets in terms of factual accuracy (fact score) and coverage (Not Enough Information Proportion), and experiments show that Fine-Refine substantially improves factuality, achieving up to a 7.63-point gain in dialogue fact score, with a small trade-off in dialogue quality.
[19] DependencyAI: Detecting AI Generated Text through Dependency Parsing
Sara Ahmed,Tracy Hammond
Main category: cs.CL
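TL;DR: This paper proposes DependencyAI, a simple and interpretable method that detects AI-generated text using only dependency relation labels. It performs competitively across several settings and reveals the syntactic structures that distinguish AI-generated from human-written text.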
Details
Motivation: As large language models (LLMs) become increasingly prevalent, reliable methods for detecting AI-generated text are needed to mitigate potential risks. Method: Proposes DependencyAI, which detects AI-generated text using only linguistic dependency relation labels, and analyzes feature importance to improve interpretability. Result: Achieves competitive performance in monolingual, multi-generator, and multilingual settings; finds systematic overprediction of certain models on unseen domains; dependency relations alone are a robust detection signal. Conclusion: DependencyAI is a strong baseline that is linguistically grounded, interpretable, and requires no neural network. Abstract: As large language models (LLMs) become increasingly prevalent, reliable methods for detecting AI-generated text are critical for mitigating potential risks. We introduce DependencyAI, a simple and interpretable approach for detecting AI-generated text using only the labels of linguistic dependency relations. Our method achieves competitive performance across monolingual, multi-generator, and multilingual settings. To increase interpretability, we analyze feature importance to reveal syntactic structures that distinguish AI-generated from human-written text. We also observe a systematic overprediction of certain models on unseen domains, suggesting that generator-specific writing styles may affect cross-domain generalization. Overall, our results demonstrate that dependency relations alone provide a robust signal for AI-generated text detection, establishing DependencyAI as a strong linguistically grounded, interpretable, and non-neural network baseline.
[20] ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns
Ziyu Zhao,Tong Zhu,Zhi Zhang,Tiantian Fan,Jinluan Yang,Kun Kuang,Zhongyu Wei,Fei Wu,Yu Cheng
Main category: cs.CL
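TL;DR: This paper proposes ExpertWeaver, a training-free dense-to-MoE conversion framework that exploits the neuron activation patterns inherent in the GLU mechanism to convert pretrained dense models into high-quality sparse MoE models. It significantly outperforms existing methods both for dynamic structural pruning and for downcycling initialization.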
Details
Motivation: Existing dense-to-MoE methods break the intrinsic activation patterns of the original dense model, leading to suboptimal expert construction, and training MoEs from scratch is prohibitively expensive, so a better conversion strategy is needed. Method: Finds that the GLU mechanism contains a coarse-grained MoE structure (universal neurons and specialized neurons) and builds the training-free ExpertWeaver framework on it: neurons are partitioned by activation pattern, and shared experts and routed specialized experts are constructed with layer-adaptive configurations. Result: Without any training, ExpertWeaver significantly outperforms existing dynamic structural pruning and downcycling methods, improving MoE performance and inference efficiency. Conclusion: GLU activation patterns provide a natural blueprint for dense-to-MoE conversion; ExpertWeaver shows that training-free conversion guided by intrinsic structure is an effective way to build high-quality MoEs efficiently. Abstract: Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. However, training high-quality MoEs from scratch is prohibitively expensive. A promising alternative is to convert pretrained dense models into sparse MoEs. Existing dense-to-MoE methods fall into two categories: dynamic structural pruning, which converts dense models into MoE architectures with moderate sparsity to balance performance and inference efficiency, and downcycling approaches, which use pretrained dense models to initialize highly sparse MoE architectures. However, existing methods break the intrinsic activation patterns within dense models, leading to suboptimal expert construction. In this work, we argue that the Gated Linear Unit (GLU) mechanism provides a natural blueprint for dense-to-MoE conversion. We show that the fine-grained neuron-wise activation patterns of GLU reveal a coarse-grained structure, uncovering an inherent MoE architecture composed of consistently activated universal neurons and dynamically activated specialized neurons. Leveraging this discovery, we introduce ExpertWeaver, a training-free framework that partitions neurons according to their activation patterns and constructs shared experts and specialized routed experts with layer-adaptive configurations. Our experiments demonstrate that ExpertWeaver significantly outperforms existing methods, both as a training-free dynamic structural pruning technique and as a downcycling strategy for superior MoE initialization.
[21] ZeroSyl: Simple Zero-Resource Syllable Tokenization for Spoken Language Modeling
Nicol Visser,Simon Malan,Danel Slabbert,Herman Kamper
Main category: cs.CL
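TL;DR: ZeroSyl is a training-free method that extracts syllable boundaries and embeddings directly from a frozen WavLM model for pure speech language modeling, outperforming existing syllabic tokenizers.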
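As a rough illustration of the pipeline this entry describes (frame-level L2 norms, boundary detection, mean pooling of segments), the sketch below works on plain Python lists standing in for WavLM features. The boundary rule used here, a strict local minimum of the norm curve, is an assumption made for illustration; the entry does not state the actual rule, and the K-means discretization step is left out.

```python
import math

def l2_norms(frames):
    """L2 norm of each frame-level feature vector."""
    return [math.sqrt(sum(x * x for x in frame)) for frame in frames]

def boundaries_from_norms(norms):
    """Hypothetical rule: a boundary at every strict local minimum of the norms."""
    inner = [i for i in range(1, len(norms) - 1)
             if norms[i] < norms[i - 1] and norms[i] < norms[i + 1]]
    return [0] + inner + [len(norms)]

def mean_pool_segments(frames, bounds):
    """Average the frames inside each [start, end) segment."""
    pooled = []
    for start, end in zip(bounds, bounds[1:]):
        segment = frames[start:end]
        pooled.append([sum(col) / len(segment) for col in zip(*segment)])
    return pooled

# Toy 2-dimensional "features"; real inputs would come from a frozen encoder.
frames = [[3, 4], [6, 8], [0, 1], [6, 8], [3, 4]]
bounds = boundaries_from_norms(l2_norms(frames))
pooled = mean_pool_segments(frames, bounds)
```

The pooled vectors are what a clustering step would then map to discrete syllable-like tokens.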
Details
Motivation: Pure speech language models suffer from excessively long discrete token sequences produced by self-supervised speech encoders, and existing syllable-based methods (such as Sylber and SyllableLM) depend on intricate multi-stage training pipelines, so a simpler and more efficient approach is needed. Method: Detects syllable boundaries from the L2 norms of features in intermediate layers of a frozen WavLM; mean-pools the resulting segments and discretizes them with K-means to produce syllable-level tokens for training a language model. Result: ZeroSyl outperforms prior syllabic tokenizers on lexical, syntactic, and narrative benchmarks; scaling experiments show that its syllabic units scale better for syntactic modeling than finer-grained units. Conclusion: ZeroSyl achieves strong syllable segmentation and modeling at zero training cost, showing that the syllabic structure contained in a frozen speech model can be exploited with a simple geometric feature (the L2 norm), and offers a new approach to pure speech language modeling. Abstract: Pure speech language models aim to learn language directly from raw audio without textual resources. A key challenge is that discrete tokens from self-supervised speech encoders result in excessively long sequences, motivating recent work on syllable-like units. However, methods like Sylber and SyllableLM rely on intricate multi-stage training pipelines. We propose ZeroSyl, a simple training-free method to extract syllable boundaries and embeddings directly from a frozen WavLM model. Using L2 norms of features in WavLM's intermediate layers, ZeroSyl achieves competitive syllable segmentation performance. The resulting segments are mean-pooled, discretized using K-means, and used to train a language model. ZeroSyl outperforms prior syllabic tokenizers across lexical, syntactic, and narrative benchmarks. Scaling experiments show that while finer-grained units are beneficial for lexical tasks, our discovered syllabic units exhibit better scaling behavior for syntactic modeling.
[22] Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite
Tim Fischer,Chris Biemann
Main category: cs.CL
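TL;DR: This paper introduces Perspectives, an interactive extension of a discourse analysis tool for Digital Humanities scholars that supports exploring and organizing large, unstructured document collections.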
Details
Motivation: Digital Humanities scholars lack flexible, interactive, and interpretable tools for analyzing large, unstructured document collections. Method: Proposes a document clustering pipeline centered on analytical lenses (aspects), combining document rewriting prompts, instruction-based embeddings, and human-in-the-loop mechanisms for refining clusters and fine-tuning the embedding model. Result: Implements the Perspectives system, which lets users discover semantic categories such as topics and sentiments through an interactive document map, and supports data preparation and insight generation. Conclusion: Perspectives strengthens exploratory analysis of unstructured text in Digital Humanities research; its human-in-the-loop design markedly improves the controllability and interpretability of the analysis. Abstract: This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections. Perspectives implements a flexible, aspect-focused document clustering pipeline with human-in-the-loop refinement capabilities. We showcase how this process can be initially steered by defining analytical lenses through document rewriting prompts and instruction-based embeddings, and further aligned with user intent through tools for refining clusters and mechanisms for fine-tuning the embedding model. The demonstration highlights a typical workflow, illustrating how DH researchers can leverage Perspectives' interactive document map to uncover topics, sentiments, or other relevant categories, thereby gaining insights and preparing their data for subsequent in-depth analysis.
[23] jina-embeddings-v5-text: Task-Targeted Embedding Distillation
Mohammad Kalim Akram,Saba Sturua,Nastia Havriushenko,Quentin Herreros,Michael Günther,Maximilian Werk,Han Xiao
Main category: cs.CL
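TL;DR: This paper proposes a new training regimen that combines model distillation with task-specific contrastive loss to build compact, high-performance text embedding models (jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano). For small models it outperforms purely contrastive or purely distillation-based training, and the resulting models support long texts and remain robust under truncation and binary quantization.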
Details
Motivation: Improve the performance of small text embedding models on semantic similarity tasks and overcome the limitations of purely contrastive or purely distillation-based training for small models. Method: Combines model distillation techniques with task-specific contrastive loss functions in multi-stage training. Result: The resulting jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano models match or exceed the state of the art among models of similar size; they support multilingual long texts of up to 32k tokens, and their embeddings are robust to truncation and binary quantization. Conclusion: A training regimen that combines distillation with task-specific contrastive loss is better suited to small embedding models and achieves a better balance of performance, efficiency, and practicality. Abstract: Text embedding models are widely used for semantic similarity tasks, including information retrieval, clustering, and classification. General-purpose models are typically trained with single- or multi-stage processes using contrastive loss functions. We introduce a novel training regimen that combines model distillation techniques with task-specific contrastive loss to produce compact, high-performance embedding models. Our findings suggest that this approach is more effective for training small models than purely contrastive or distillation-based training paradigms alone. Benchmark scores for the resulting models, jina-embeddings-v5-text-small and jina-embeddings-v5-text-nano, exceed or match the state-of-the-art for models of similar size. jina-embeddings-v5-text models additionally support long texts (up to 32k tokens) in many languages, and generate embeddings that remain robust under truncation and binary quantization. Model weights are publicly available, hopefully inspiring further advances in embedding model development.
[24] Beyond Static Pipelines: Learning Dynamic Workflows for Text-to-SQL
Yihan Wang,Peiyu Liu,Runyu Chen,Wei Xu
Main category: cs.CL
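TL;DR: This paper proposes SquRL, a framework that uses reinforcement learning to let large language models adaptively construct Text-to-SQL workflows, substantially improving performance on complex and out-of-distribution queries.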
Details
Motivation: Existing Text-to-SQL methods rely on static workflows and struggle with out-of-distribution and long-tail scenarios; instead of having users select methods by manual trial and error, the system should construct workflows adaptively at inference time. Method: Proposes SquRL, a reinforcement learning framework with a rule-based reward function and two training mechanisms, dynamic actor masking and pseudo rewards, to strengthen the LLM's reasoning in dynamic workflow construction. Result: On widely used Text-to-SQL benchmarks, dynamic workflow construction consistently outperforms the best static methods, with especially large gains on complex and out-of-distribution queries. Conclusion: Dynamic policies consistently beat static ones because they exploit the heterogeneity across candidate workflows; SquRL offers a more robust and scalable adaptive solution for Text-to-SQL. Abstract: Text-to-SQL has recently achieved impressive progress, yet remains difficult to apply effectively in real-world scenarios. This gap stems from the reliance on single static workflows, fundamentally limiting scalability to out-of-distribution and long-tail scenarios. Instead of requiring users to select suitable methods through extensive experimentation, we attempt to enable systems to adaptively construct workflows at inference time. Through theoretical and empirical analysis, we demonstrate that optimal dynamic policies consistently outperform the best static workflow, with performance gains fundamentally driven by heterogeneity across candidate workflows. Motivated by this, we propose SquRL, a reinforcement learning framework that enhances LLMs' reasoning capability in adaptive workflow construction. We design a rule-based reward function and introduce two effective training mechanisms: dynamic actor masking to encourage broader exploration, and pseudo rewards to improve training efficiency. Experiments on widely-used Text-to-SQL benchmarks demonstrate that dynamic workflow construction consistently outperforms the best static workflow methods, with especially pronounced gains on complex and out-of-distribution queries. The code is available at https://github.com/Satissss/SquRL
[25] Clinically Inspired Symptom-Guided Depression Detection from Emotion-Aware Speech Representations
Chaithra Nerella,Chiranjeevi Yarra
Main category: cs.CL
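TL;DR: This paper proposes a symptom-specific, clinically inspired framework for estimating depression severity from speech. A symptom-guided cross-attention mechanism aligns PHQ-8 questionnaire items with emotion-aware speech representations, and a learnable symptom-specific parameter adaptively adjusts the sharpness of the attention distribution. The framework outperforms prior work on the EDAIC dataset and is interpretable.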
Details
Motivation: Most existing depression prediction methods use a binary label or an overall severity score and do not model symptom-specific information, so they cannot support the symptom-level analysis needed for clinical screening. Method: Proposes a symptom-guided cross-attention mechanism that aligns PHQ-8 questionnaire items with emotion-aware speech representations, and introduces a learnable symptom-specific parameter that adaptively controls the sharpness of the attention distribution. Result: Outperforms prior work on EDAIC, a standard clinical-style dataset; attention analysis shows that the model attends more to utterances containing cues for multiple depressive symptoms, supporting its interpretability. Conclusion: Symptom-guided and emotion-aware modeling is important for speech-based depression screening; the framework combines improved performance with clinical interpretability. Abstract: Depression manifests through a diverse set of symptoms such as sleep disturbance, loss of interest, and concentration difficulties. However, most existing works treat depression prediction either as a binary label or an overall severity score without explicitly modeling symptom-specific information. This limits their ability to provide symptom-level analysis relevant to clinical screening. To address this, we propose a symptom-specific and clinically inspired framework for depression severity estimation from speech. Our approach uses a symptom-guided cross-attention mechanism that aligns PHQ-8 questionnaire items with emotion-aware speech representations to identify which segments of a participant's speech are more important to each symptom. To account for differences in how symptoms are expressed over time, we introduce a learnable symptom-specific parameter that adaptively controls the sharpness of attention distributions. Our results on EDAIC, a standard clinical-style dataset, demonstrate improved performance over prior work. Further, analyzing the attention distributions showed that higher attention is assigned to utterances containing cues related to multiple depressive symptoms, highlighting the interpretability of our approach. These findings underline the importance of symptom-guided and emotion-aware modeling for speech-based depression screening.
[26] STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens
Shiqi Liu,Zeyu He,Guojian Zhan,Letian Tao,Zhilong Zheng,Jiang Wu,Yinuo Wang,Yang Guan,Kehua Sheng,Bo Zhang,Keqiang Li,Jingliang Duan,Shengbo Eben Li
Main category: cs.CL
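TL;DR: This paper proposes STAPO, a new reinforcement learning fine-tuning method that improves the stability and performance of reasoning training for large language models by identifying and suppressing "spurious tokens", which contribute little to the reasoning outcome yet inherit the full sequence-level reward.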
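The masking-and-renormalization idea in this entry can be sketched in a few lines. This is only a toy illustration under stated assumptions: the rule for flagging a token as spurious (positive advantage, low probability, and low entropy) and both threshold values are hypothetical choices made for this example, not the criteria from the paper.

```python
def masked_renormalized_loss(token_losses, token_probs, token_entropies,
                             advantage, prob_threshold=0.01,
                             entropy_threshold=0.1):
    """Mean per-token loss over tokens that are not flagged as spurious.

    Flagging rule and thresholds are illustrative assumptions.
    """
    keep = []
    for prob, entropy in zip(token_probs, token_entropies):
        spurious = (advantage > 0
                    and prob < prob_threshold
                    and entropy < entropy_threshold)
        keep.append(0.0 if spurious else 1.0)
    n_valid = sum(keep)
    if n_valid == 0:
        return 0.0  # nothing left to optimize in this response
    # Renormalize by the number of valid tokens, not the sequence length.
    return sum(l * k for l, k in zip(token_losses, keep)) / n_valid

losses = [1.0, 2.0, 9.0]
probs = [0.5, 0.3, 0.001]
entropies = [1.0, 0.8, 0.05]
loss_correct = masked_renormalized_loss(losses, probs, entropies, advantage=1.0)
loss_incorrect = masked_renormalized_loss(losses, probs, entropies, advantage=-1.0)
```

In the toy data the third token has a very large loss but is masked when the response is rewarded, so it no longer dominates the update; dividing by the count of valid tokens keeps the loss scale comparable across responses.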
Details
Motivation: Existing RL fine-tuning methods depend on heuristic techniques (such as entropy regularization and reweighting) and are prone to late-stage performance collapse and unstable training; the authors find that the instability stems mainly from a tiny fraction (about 0.01%) of "spurious tokens" that cause abnormally amplified gradients. Method: Analyzes the negative correlation between token-wise policy gradient magnitude and both token probability and local policy entropy, and derives why spurious tokens arise; proposes Spurious-Token-Aware Policy Optimization (STAPO), which selectively masks the gradient updates of spurious tokens and renormalizes the loss over valid tokens. Result: On six mathematical reasoning benchmarks with Qwen 1.7B, 8B, and 14B models, STAPO markedly improves entropy stability and achieves an average performance improvement of 7.13% over GRPO, 20-Entropy, and JustRL. Conclusion: Spurious tokens are the main cause of instability in RL fine-tuning; by detecting them and suppressing their gradient contribution, STAPO offers a new approach to stably improving the reasoning ability of large models. Abstract: Reinforcement Learning (RL) has significantly improved large language model reasoning, but existing RL fine-tuning methods rely heavily on heuristic techniques such as entropy regularization and reweighting to maintain stability. In practice, they often experience late-stage performance collapse, leading to degraded reasoning quality and unstable training. We derive that the magnitude of token-wise policy gradients in RL is negatively correlated with token probability and local policy entropy. Building on this result, we prove that training instability is driven by a tiny fraction of tokens, approximately 0.01%, which we term spurious tokens. When such tokens appear in correct responses, they contribute little to the reasoning outcome but inherit the full sequence-level reward, leading to abnormally amplified gradient updates. Motivated by this observation, we propose Spurious-Token-Aware Policy Optimization (STAPO) for large-scale model refining, which selectively masks such updates and renormalizes the loss over valid tokens. Across six mathematical reasoning benchmarks using Qwen 1.7B, 8B, and 14B base models, STAPO consistently demonstrates superior entropy stability and achieves an average performance improvement of 7.13% over GRPO, 20-Entropy and JustRL.
[27] LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models
Ahmed Khaled Khamis,Hesham Ali
Main category: cs.CL
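TL;DR: This paper presents NileTTS, the first publicly available Egyptian Arabic TTS dataset (38 hours, two speakers), built with a synthetic pipeline of LLM-generated text, speech synthesis, and automatic transcription. XTTS v2 is fine-tuned on it, advancing speech synthesis research for under-resourced dialects.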
Details
Motivation: Egyptian Arabic, the most widely understood Arabic dialect, severely lacks TTS resources; existing work concentrates on Modern Standard Arabic (MSA) and Gulf dialects. Method: Proposes a novel synthetic data generation pipeline: a large language model (LLM) generates Egyptian Arabic text, audio synthesis tools convert it to speech, and the audio is then automatically transcribed and diarized, with manual quality verification; XTTS v2 is fine-tuned on the resulting data. Result: Builds the first publicly available Egyptian Arabic TTS dataset (NileTTS: 38 hours, two speakers, multiple domains); releases a reproducible synthetic data generation pipeline; open-sources the fine-tuned TTS model. Conclusion: NileTTS fills the gap in Egyptian Arabic TTS resources, its synthetic data method offers a scalable template for TTS in other low-resource dialects, and all resources are released to support further research. Abstract: Despite the advances in neural text to speech (TTS), many Arabic dialectal varieties remain marginally addressed, with most resources concentrated on Modern Standard Arabic (MSA) and Gulf dialects, leaving Egyptian Arabic, the most widely understood Arabic dialect, severely under-resourced. We address this gap by introducing NileTTS: 38 hours of transcribed speech from two speakers across diverse domains including medical, sales, and general conversations. We construct this dataset using a novel synthetic pipeline: large language models (LLMs) generate Egyptian Arabic content, which is then converted to natural speech using audio synthesis tools, followed by automatic transcription and speaker diarization with manual quality verification. We fine-tune XTTS v2, a state-of-the-art multilingual TTS model, on our dataset and evaluate against the baseline model trained on other Arabic dialects. Our contributions include: (1) the first publicly available Egyptian Arabic TTS dataset, (2) a reproducible synthetic data generation pipeline for dialectal TTS, and (3) an open-source fine-tuned model. All resources are released to advance Egyptian Arabic speech synthesis research.
[28] Revisiting Northrop Frye's Four Myths Theory with Large Language Models
Edirlei Soares de Lima,Marco A. Casanova,Antonio L. Furtado
Main category: cs.CL
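TL;DR: This paper proposes a framework of 16 genre-specific character roles that combines Frye's theory of four narrative genres with Jungian archetype theory, and validates it with six large language models on 40 narrative works. The models reach a mean balanced accuracy of 82.5%, providing a new foundation for computational narratology and interactive story generation.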
Details
Motivation: Computational work on Frye's four genres has focused on narrative patterns and neglected character functions; this paper fills that gap with a character function framework that accounts for both archetypal universality and genre specificity. Method: Derives four universal character functions (protagonist, mentor, antagonist, companion) from Jung's psychic structure, then specializes them into 16 genre-specific roles based on prototypical works of Frye's four genres (comedy, romance, tragedy, satire); six state-of-the-art LLMs judge character-role correspondences across 40 narrative works on 160 positive and 30 negative samples, evaluated with balanced accuracy and Fleiss' κ. Result: The LLMs reach a mean balanced accuracy of 82.5% with moderate inter-model agreement (κ = 0.600); accuracy ranges from 72.7% to 89.9% by genre and varies widely by role (52.5% to 99.2%); qualitative analysis indicates that the variation reflects genuine narrative properties, such as functional distribution in romance and archetypal subversion in satire. Conclusion: The character function framework captures systematic narrative structure, shows that LLMs can support computational narratology research, and provides an extensible foundation for narrative generation and interactive storytelling applications. Abstract: Northrop Frye's theory of four fundamental narrative genres (comedy, romance, tragedy, satire) has profoundly influenced literary criticism, yet computational approaches to his framework have focused primarily on narrative patterns rather than character functions. In this paper, we present a new character function framework that complements pattern-based analysis by examining how archetypal roles manifest differently across Frye's genres. Drawing on Jungian archetype theory, we derive four universal character functions (protagonist, mentor, antagonist, companion) by mapping them to Jung's psychic structure components. These functions are then specialized into sixteen genre-specific roles based on prototypical works. To validate this framework, we conducted a multi-model study using six state-of-the-art Large Language Models (LLMs) to evaluate character-role correspondences across 40 narrative works. The validation employed both positive samples (160 valid correspondences) and negative samples (30 invalid correspondences) to evaluate whether models both recognize valid correspondences and reject invalid ones. LLMs achieved substantial performance (mean balanced accuracy of 82.5%) with strong inter-model agreement (Fleiss' κ = 0.600), demonstrating that the proposed correspondences capture systematic structural patterns. Performance varied by genre (ranging from 72.7% to 89.9%) and role (52.5% to 99.2%), with qualitative analysis revealing that variations reflect genuine narrative properties, including functional distribution in romance and deliberate archetypal subversion in satire. This character-based approach demonstrates the potential of LLM-supported methods for computational narratology and provides a foundation for future development of narrative generation methods and interactive storytelling applications.
[29] A Content-Based Framework for Cybersecurity Refusal Decisions in Large Language Models
Meirav Segal,Noa Linder,Omer Antverg,Gil Gekker,Tomer Fichman,Omri Bodenheimer,Edan Maor,Omer Nevo
Main category: cs.CL
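TL;DR: This paper proposes a content-based framework for cyber refusal policies that explicitly models the trade-off between offensive risk and defensive benefit along five technical dimensions, addressing the inconsistency, over-restriction, and ease of circumvention of existing LLM refusal mechanisms in cybersecurity tasks.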
Details
Motivation: Existing refusal mechanisms based on topic bans or offensive-focused taxonomies make inconsistent decisions on cybersecurity tasks, over-restrict legitimate defenders, and are easily bypassed through obfuscation or request segmentation. Method: Proposes a content-based framework with five dimensions (Offensive Action Contribution, Offensive Risk, Technical Complexity, Defensive Benefit, and Expected Frequency for Legitimate Users) that characterizes requests by their technical substance rather than stated intent, and uses it to design and audit cyber refusal policies. Result: The framework resolves inconsistencies in the refusal behavior of current frontier models and lets organizations construct tunable, risk-aware refusal policies. Conclusion: Effective refusal requires explicitly weighing offensive risk against defensive benefit rather than relying only on intent or offensive classification; a content-driven, multi-dimensional framework is a more robust, auditable, and tunable solution. Abstract: Large language models and LLM-based agents are increasingly used for cybersecurity tasks that are inherently dual-use. Existing approaches to refusal, spanning academic policy frameworks and commercially deployed systems, often rely on broad topic-based bans or offensive-focused taxonomies. As a result, they can yield inconsistent decisions, over-restrict legitimate defenders, and behave brittlely under obfuscation or request segmentation. We argue that effective refusal requires explicitly modeling the trade-off between offensive risk and defensive benefit, rather than relying solely on intent or offensive classification. In this paper, we introduce a content-based framework for designing and auditing cyber refusal policies that makes offense-defense tradeoffs explicit. The framework characterizes requests along five dimensions: Offensive Action Contribution, Offensive Risk, Technical Complexity, Defensive Benefit, and Expected Frequency for Legitimate Users, grounded in the technical substance of the request rather than stated intent. We demonstrate that this content-grounded approach resolves inconsistencies in current frontier model behavior and allows organizations to construct tunable, risk-aware refusal policies.
[30] Rethinking Metrics for Lexical Semantic Change Detection
Roksana Goworek,Haim Dubossarsky
Main category: cs.CL
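TL;DR: This paper proposes two new semantic change metrics, AMD and SAMD, for lexical semantic change detection (LSCD) based on contextualised language model embeddings. Experiments show that AMD is more robust across a range of conditions, while SAMD performs better with specialised encoders.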
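To make the four metrics named in this entry concrete, the sketch below computes them for two small sets of usage embeddings. APD and PRT follow their usual definitions. The AMD and SAMD formulas are assumptions inferred from the metric names and the phrase "local correspondence between word usages" (nearest-neighbour distance averaged in one direction, then symmetrised); the paper's exact definitions may differ.

```python
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def apd(usages_t1, usages_t2):
    """Average Pairwise Distance over all cross-period usage pairs."""
    dists = [cosine_distance(u, v) for u in usages_t1 for v in usages_t2]
    return sum(dists) / len(dists)

def prt(usages_t1, usages_t2):
    """Cosine distance between the period prototypes (mean vectors)."""
    def mean(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    return cosine_distance(mean(usages_t1), mean(usages_t2))

def amd(source, target):
    """Assumed AMD: mean distance from each source usage to its nearest target usage."""
    nearest = [min(cosine_distance(u, v) for v in target) for u in source]
    return sum(nearest) / len(nearest)

def samd(usages_t1, usages_t2):
    """Assumed SAMD: average of the two directed AMD values."""
    return 0.5 * (amd(usages_t1, usages_t2) + amd(usages_t2, usages_t1))

# Toy embeddings: period 1 has two senses, period 2 keeps only one of them.
t1 = [[1.0, 0.0], [0.0, 1.0]]
t2 = [[1.0, 0.0]]
```

In this toy case APD and the t1-to-t2 AMD both report 0.5, while the t2-to-t1 AMD is 0 because every period-2 usage has an exact match in period 1, which is the kind of directional, local information a pairwise average cannot express.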
Details
Motivation: Existing LSCD methods rely mainly on a few semantic change metrics such as APD and PRT, which lack diversity and robustness, especially under dimensionality reduction or with non-specialised encoders. Method: Proposes Average Minimum Distance (AMD) and Symmetric Average Minimum Distance (SAMD), which quantify semantic change through local correspondence between a word's usages in different time periods, and validates them across multiple languages, models, and representation spaces. Result: AMD is generally more robust under dimensionality reduction and with non-specialised encoders; SAMD performs best with specialised encoders; both outperform the traditional APD and PRT. Conclusion: LSCD should broaden its choice of semantic change metrics; AMD is a robust new option for contextualised embedding analysis. Abstract: Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and cosine distance over word prototypes (PRT). We introduce Average Minimum Distance (AMD) and Symmetric Average Minimum Distance (SAMD), new measures that quantify semantic change via local correspondence between word usages across time periods. Across multiple languages, encoder models, and representation spaces, we show that AMD often provides more robust performance, particularly under dimensionality reduction and with non-specialised encoders, while SAMD excels with specialised encoders. We suggest that LSCD may benefit from considering alternative semantic change metrics beyond APD and PRT, with AMD offering a robust option for contextualised embedding-based analysis.
[31] Causal Effect Estimation with Latent Textual Treatments
Omri Feldman,Amar Venugopal,Jann Spiess,Amir Feder
Main category: cs.CL
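TL;DR: This paper proposes an end-to-end pipeline for estimating the causal effects of textual interventions. It uses sparse autoencoders (SAEs) for hypothesis generation and text steering, and covariate residualization to mitigate the estimation bias that arises when text is the treatment.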
Details
Motivation: In text-as-treatment settings, text inherently conflates treatment and covariate information, so naive causal effect estimates are significantly biased; a systematic solution that handles both the computational and the statistical challenges is needed. Method: Proposes a hypothesis generation and text steering module based on sparse autoencoders (SAEs), combined with robust causal estimation via covariate residualization, forming an end-to-end pipeline for generating textual interventions and estimating their causal effects. Result: Empirical results show that the pipeline effectively induces variation in the target textual features and substantially reduces causal effect estimation error compared with naive estimation. Conclusion: The pipeline provides a robust, interpretable, and scalable foundation for causal inference with text as the treatment. Abstract: Understanding the causal effects of text on downstream outcomes is a central task in many applications. Estimating such effects requires researchers to run controlled experiments that systematically vary textual features. While large language models (LLMs) hold promise for generating text, producing and evaluating controlled variation requires more careful attention. In this paper, we present an end-to-end pipeline for the generation and causal estimation of latent textual interventions. Our work first performs hypothesis generation and steering via sparse autoencoders (SAEs), followed by robust causal estimation. Our pipeline addresses both computational and statistical challenges in text-as-treatment experiments. We demonstrate that naive estimation of causal effects suffers from significant bias as text inherently conflates treatment and covariate information. We describe the estimation bias induced in this setting and propose a solution based on covariate residualization. Our empirical results show that our pipeline effectively induces variation in target features and mitigates estimation error, providing a robust foundation for causal effect estimation in text-as-treatment settings.
[32] Under-resourced studies of under-resourced languages: lemmatization and POS-tagging with LLM annotators for historical Armenian, Georgian, Greek and Syriac
Chahan Vidal-Gorène,Bastien Kindt,Florian Cafiero
Main category: cs.CL
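TL;DR: This paper evaluates the zero-shot and few-shot abilities of large language models (GPT-4 variants and Mistral models) on lemmatization and POS-tagging for four under-resourced historical languages (Ancient Greek, Classical Armenian, Old Georgian, and Syriac), and finds that in most cases they match or surpass a task-specific RNN baseline.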
Details
Motivation: Low-resource languages have long faced data scarcity in NLP tasks such as lemmatization and POS-tagging, so effective methods that do not need large amounts of annotated data are required. Method: On a newly built benchmark of aligned training and out-of-domain test corpora, evaluates GPT-4 variants and open-weight Mistral models in zero-shot and few-shot settings and compares them with PIE, a task-specific RNN model. Result: In few-shot settings, LLMs match or exceed PIE on POS-tagging and lemmatization for most languages; languages with complex morphology and non-Latin scripts remain challenging. Conclusion: Without fine-tuning, LLMs are a viable starting point and an effective aid for linguistic annotation of low-resource languages. Abstract: Low-resource languages pose persistent challenges for Natural Language Processing tasks such as lemmatization and part-of-speech (POS) tagging. This paper investigates the capacity of recent large language models (LLMs), including GPT-4 variants and open-weight Mistral models, to address these tasks in few-shot and zero-shot settings for four historically and linguistically diverse under-resourced languages: Ancient Greek, Classical Armenian, Old Georgian, and Syriac. Using a novel benchmark comprising aligned training and out-of-domain test corpora, we evaluate the performance of foundation models across lemmatization and POS-tagging, and compare them with PIE, a task-specific RNN baseline. Our results demonstrate that LLMs, even without fine-tuning, achieve competitive or superior performance in POS-tagging and lemmatization across most languages in few-shot settings. Significant challenges persist for languages characterized by complex morphology and non-Latin scripts, but we demonstrate that LLMs are a credible and relevant option for initiating linguistic annotation tasks in the absence of data, serving as an effective aid for annotation.
[33] Beyond Binary Classification: Detecting Fine-Grained Sexism in Social Media Videos
Laura De Grazia,Danae Sánchez Villegas,Desmond Elliott,Mireia Farrús,Mariona Taulé
Main category: cs.CL
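TL;DR: This paper presents FineMuSe, a Spanish multimodal sexism detection dataset with binary and fine-grained annotations, together with a hierarchical taxonomy covering forms of sexism, non-sexism, and the rhetorical devices of irony and humor. Experiments show that multimodal large language models approach human performance at identifying nuanced sexism but still struggle to recognize co-occurring sexist types conveyed through visual cues.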
Details
Motivation: Existing automated sexism detection tools are mostly limited to binary classification and miss subtler manifestations of sexism that require contextual understanding. Method: Builds FineMuSe, a Spanish multimodal sexism dataset with binary and fine-grained annotations; designs a hierarchical taxonomy covering types of sexism, non-sexism, and the rhetorical devices of irony and humor; systematically evaluates a range of LLMs on binary and fine-grained tasks. Result: Multimodal LLMs perform close to human annotators at identifying nuanced sexism but perform poorly at recognizing multiple co-occurring sexist types conveyed through visual cues. Conclusion: Fine-grained, multimodal, context-sensitive annotation and modeling are essential for better sexism detection; current multimodal LLMs remain clearly limited at recognizing visually driven, compound forms of sexism. Abstract: Online sexism appears in various forms, which makes its detection challenging. Although automated tools can enhance the identification of sexist content, they are often restricted to binary classification. Consequently, more subtle manifestations of sexism may remain undetected due to the lack of fine-grained, context-sensitive labels. To address this issue, we make the following contributions: (1) we present FineMuSe, a new multimodal sexism detection dataset in Spanish that includes both binary and fine-grained annotations; (2) we introduce a comprehensive hierarchical taxonomy that encompasses forms of sexism, non-sexism, and rhetorical devices of irony and humor; and (3) we evaluate a wide range of LLMs for both binary and fine-grained sexism detection. Our findings indicate that multimodal LLMs perform competitively with human annotators in identifying nuanced forms of sexism; however, they struggle to capture co-occurring sexist types when these are conveyed through visual cues.
[34] ChartEditBench: Evaluating Grounded Multi-Turn Chart Editing in Multimodal Language Models
Manav Nitin Kapadnis,Lawanya Baghel,Atharva Naik,Carolyn Rosé
Main category: cs.CL
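TL;DR: This paper presents ChartEditBench, a benchmark for evaluating multi-turn, visually grounded chart editing by multimodal large language models (MLLMs), along with a robust evaluation framework that combines execution checks, pixel-level similarity, and logical verification. Experiments show that existing MLLMs degrade substantially in multi-turn editing because of error accumulation and loss of context.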
Details
Motivation: Existing MLLMs perform well on single-turn chart generation, but their ability to handle the multi-turn, context-aware, iterative chart editing required in real exploratory data analysis is unclear. Method: Builds the ChartEditBench benchmark of 5,000 difficulty-controlled modification chains and proposes an evaluation framework that combines execution-based fidelity checks, pixel-level visual similarity, and logical code verification. Result: Current state-of-the-art MLLMs degrade substantially in multi-turn editing, with frequent execution failures on data-centric transformations, while remaining strong on stylistic edits. Conclusion: ChartEditBench provides a challenging new testbed for intent-aware, visually grounded multimodal programming and exposes key weaknesses of MLLMs in sustained interactive visualization editing. Abstract: While Multimodal Large Language Models (MLLMs) perform strongly on single-turn chart generation, their ability to support real-world exploratory data analysis remains underexplored. In practice, users iteratively refine visualizations through multi-turn interactions that require maintaining common ground, tracking prior edits, and adapting to evolving preferences. We introduce ChartEditBench, a benchmark for incremental, visually grounded chart editing via code, comprising 5,000 difficulty-controlled modification chains and a rigorously human-verified subset. Unlike prior one-shot benchmarks, ChartEditBench evaluates sustained, context-aware editing. We further propose a robust evaluation framework that mitigates limitations of LLM-as-a-Judge metrics by integrating execution-based fidelity checks, pixel-level visual similarity, and logical code verification. Experiments with state-of-the-art MLLMs reveal substantial degradation in multi-turn settings due to error accumulation and breakdowns in shared context, with strong performance on stylistic edits but frequent execution failures on data-centric transformations. ChartEditBench establishes a challenging testbed for grounded, intent-aware multimodal programming.
[35] ViTaB-A: Evaluating Multimodal Large Language Models on Visual Table Attribution
Yahia Alqurnawi,Preetom Biswas,Anmol Rao,Tejas Anvekar,Chitta Baral,Vivek Gupta
Main category: cs.CL
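TL;DR: This paper studies how well multimodal large language models (mLLMs) provide fine-grained attribution (pointing to the specific rows and columns that support an answer) when answering questions over structured data such as tables. It finds that attribution accuracy is far below question-answering accuracy, is especially poor for JSON and textual tables, and differs markedly across model families.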
Details
Motivation: Users need mLLMs not only to give correct answers but also to show where those answers come from (the specific rows and columns in the structured data), to meet transparency and traceability needs. Method: Evaluates several mLLMs on structured data attribution across table formats (Markdown, JSON, images) and prompting strategies, focusing on how accurately they locate the rows and columns that support an answer. Result: Attribution accuracy is markedly lower than question-answering accuracy; attribution is near random for JSON inputs; models cite rows more reliably than columns; textual formats are harder than images; and model families differ noticeably. Conclusion: Current mLLMs are unreliable at fine-grained attribution over structured data, which limits their use in real applications that demand high transparency and traceability. Abstract: Multimodal Large Language Models (mLLMs) are often used to answer questions in structured data such as tables in Markdown, JSON, and images. While these models can often give correct answers, users also need to know where those answers come from. In this work, we study structured data attribution/citation, which is the ability of the models to point to the specific rows and columns that support an answer. We evaluate several mLLMs across different table formats and prompting strategies. Our results show a clear gap between question answering and evidence attribution. Although question answering accuracy remains moderate, attribution accuracy is much lower, near random for JSON inputs, across all models. We also find that models are more reliable at citing rows than columns, and struggle more with textual formats than images. Finally, we observe notable differences across model families. Overall, our findings show that current mLLMs are unreliable at providing fine-grained, trustworthy attribution for structured data, which limits their usage in applications requiring transparency and traceability.
[36] *-PLUIE: Personalisable metric with Llm Used for Improved Evaluation
Quentin Lemesle,Léane Jourdan,Daisy Munson,Pierre Alain,Jonathan Chevelu,Arnaud Delhay,Damien Lolive
Main category: cs.CL
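TL;DR: This paper proposes *-PLUIE, a perplexity-based LLM-as-a-judge method with task-specific prompts that assesses the quality of generated text without generating any text itself, improving agreement with human ratings while keeping computational cost low.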
Details
Motivation: Existing LLM-as-a-judge methods are computationally expensive and require post-processing, so more efficient, lightweight automatic evaluation metrics are needed. Method: Building on ParaPLUIE (which uses perplexity to estimate confidence over "Yes/No" answers), designs task-specific prompting variants, *-PLUIE, and evaluates their correlation with human judgement. Result: *-PLUIE, especially the personalised version, correlates more strongly with human ratings than the baselines across several tasks while keeping computational cost low. Conclusion: *-PLUIE, based on perplexity and task-tailored prompts, is an efficient, reliable, and scalable alternative to LLM-judge methods. Abstract: Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over "Yes/No" answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
[37] Avey-B
Devang Acharya,Mohammad Hammoud
Main category: cs.CL
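TL;DR: This paper proposes an encoder-only reformulation of the Avey model. With innovations including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression, it outperforms four mainstream Transformer encoders on token classification and information retrieval tasks and scales better to long texts.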
Details
Motivation: Industrial NLP under tight compute and memory budgets needs efficient, compact bidirectional encoders; the emerging attention-free Avey is promising but must be adapted to the encoder-only paradigm to realize its advantages. Method: Reformulates Avey for the encoder-only paradigm, introducing decoupled static and dynamic parameterizations, a stability-oriented normalization mechanism, and neural compression. Result: The reformulated Avey consistently outperforms four mainstream Transformer encoders on standard token-classification and information-retrieval benchmarks and scales better in long-context settings. Conclusion: After the encoder-only reformulation, Avey can serve as a strong alternative to BERT-style models while retaining its attention-free advantages, with practical value in resource-constrained settings. Abstract: Compact pretrained bidirectional encoders remain the backbone of industrial NLP under tight compute and memory budgets. Their effectiveness stems from self-attention's ability to deliver high-quality bidirectional contextualization with sequence-level parallelism, as popularized by BERT-style architectures. Recently, Avey was introduced as an autoregressive, attention-free alternative that naturally admits an encoder-only adaptation. In this paper, we reformulate Avey for the encoder-only paradigm and propose several innovations to its architecture, including decoupled static and dynamic parameterizations, stability-oriented normalization, and neural compression. Results show that this reformulated architecture compares favorably to four widely used Transformer-based encoders, consistently outperforming them on standard token-classification and information-retrieval benchmarks while scaling more efficiently to long contexts.
cs.CV [Back]
[38] GRAFNet: Multiscale Retinal Processing via Guided Cortical Attention Feedback for Enhancing Medical Image Polyp Segmentation
Abdul Joseph Fofanah,Lian Wen,Alpha Alimamy Kamara,Zhongyi Zhang,David Chen,Albert Patrick Sankoh
Main category: cs.CV
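TL;DR: This paper proposes GRAFNet, a deep learning architecture inspired by the biological visual system for polyp segmentation in colonoscopy. Its three modules (GAAM, MSRM, GCAFM) provide multi-scale processing, anatomical constraints, and iterative refinement, markedly improving segmentation accuracy and generalisation.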
Details
Motivation: Existing deep learning methods for polyp segmentation suffer from unidirectional processing, weak multi-scale fusion, and a lack of anatomical constraints, which leads to high false-positive and false-negative rates and falls short of clinical requirements for accuracy and trustworthiness. Method: Proposes the GRAFNet architecture, comprising a Guided Asymmetric Attention Module (GAAM), a MultiScale Retinal Module (MSRM), and a Guided Cortical Attention Feedback Module (GCAFM), unified in a Polyp Encoder-Decoder Module (PEDM) that uses resolution-adaptive feedback to enforce spatial-semantic consistency. Result: Achieves SOTA performance on five public datasets (Kvasir-SEG, CVC-300, CVC-ColonDB, CVC-Clinic, PolypGen), with Dice improvements of 3-8% and 10-20% higher generalisation, while providing interpretable decision pathways. Conclusion: GRAFNet brings principles of neural computation into medical image segmentation, offering a new paradigm for AI models that combine high accuracy with trustworthy reasoning in clinical use. Abstract: Accurate polyp segmentation in colonoscopy is essential for cancer prevention but remains challenging due to: (1) high morphological variability (from flat to protruding lesions), (2) strong visual similarity to normal structures such as folds and vessels, and (3) the need for robust multi-scale detection. Existing deep learning approaches suffer from unidirectional processing, weak multi-scale fusion, and the absence of anatomical constraints, often leading to false positives (over-segmentation of normal structures) and false negatives (missed subtle flat lesions). We propose GRAFNet, a biologically inspired architecture that emulates the hierarchical organisation of the human visual system. GRAFNet integrates three key modules: (1) a Guided Asymmetric Attention Module (GAAM) that mimics orientation-tuned cortical neurones to emphasise polyp boundaries, (2) a MultiScale Retinal Module (MSRM) that replicates retinal ganglion cell pathways for parallel multi-feature analysis, and (3) a Guided Cortical Attention Feedback Module (GCAFM) that applies predictive coding for iterative refinement. These are unified in a Polyp Encoder-Decoder Module (PEDM) that enforces spatial-semantic consistency via resolution-adaptive feedback. Extensive experiments on five public benchmarks (Kvasir-SEG, CVC-300, CVC-ColonDB, CVC-Clinic, and PolypGen) demonstrate consistent state-of-the-art performance, with 3-8% Dice improvements and 10-20% higher generalisation over leading methods, while offering interpretable decision pathways.
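For readers unfamiliar with the metric behind the reported Dice improvements, the following is a minimal sketch of the standard Dice coefficient for binary masks. It is not taken from the GRAFNet code; the function name `dice_coefficient`, the `eps` smoothing term, and the toy masks are assumptions made for illustration.

```python
import numpy as np

def dice_coefficient(pred_mask, true_mask, eps=1e-7):
    """Standard Dice overlap, 2|A ∩ B| / (|A| + |B|), for two binary masks.

    `eps` is a small smoothing constant (an assumption of this sketch) that
    avoids division by zero when both masks are empty.
    """
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    return (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)

# Toy 2x3 masks: 3 predicted pixels, 3 true pixels, 2 of them shared.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
true = np.array([[1, 0, 0],
                 [0, 1, 1]])
score = dice_coefficient(pred, true)
print(f"Dice = {score:.4f}")
```

With two shared pixels out of three in each mask, the score is 2*2 / (3 + 3), roughly 0.667; a perfect segmentation gives 1.0.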
This work establishes a paradigm in which neural computation principles bridge the gap between AI accuracy and clinically trustworthy reasoning. Code is available at https://github.com/afofanah/GRAFNet.
[39] Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition
Shiyu Xuan,Dongkai Wang,Zechao Li,Jinhui Tang
Main category: cs.CV
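TL;DR: This paper proposes a decoupled framework for zero-shot human-object interaction (HOI) detection that separates object detection from interaction recognition and uses multi-modal large language models (MLLMs) for training-free, deterministic zero-shot interaction recognition, markedly improving generalization and flexibility.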
Details
Motivation: Existing zero-shot HOI detection methods tightly couple interaction recognition with a specific detector and rely on coarse-grained vision-language model features, so they generalize poorly to unseen interaction types. Method: Proposes a decoupled framework that separates object detection from interaction recognition (IR); designs a deterministic generation method based on visual question answering for training-free zero-shot IR; introduces a spatial-aware pooling module that fuses appearance and pairwise spatial cues; and uses a one-pass deterministic matching method that predicts all candidate interactions in a single forward pass. Result: Achieves the best zero-shot performance on HICO-DET and V-COCO, generalizes strongly across datasets, and can be plugged into any object detector without retraining. Conclusion: The decoupled design and the MLLM-driven deterministic IR method overcome the coupling and generalization bottlenecks of earlier methods, offering a more general and efficient paradigm for zero-shot HOI detection. Abstract: Zero-shot Human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. While advances in open-vocabulary object detection provide promising solutions for object localization, interaction recognition (IR) remains challenging due to the combinatorial diversity of interactions. Existing methods, including two-stage methods, tightly couple IR with a specific detector and rely on coarse-grained vision-language model (VLM) features, which limit generalization to unseen interactions. In this work, we propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR. We introduce a deterministic generation method that formulates IR as a visual question answering task and enforces deterministic outputs, enabling training-free zero-shot IR. To further enhance performance and efficiency by fine-tuning the model, we design a spatial-aware pooling module that integrates appearance and pairwise spatial cues, and a one-pass deterministic matching method that predicts all candidate interactions in a single forward pass. Extensive experiments on HICO-DET and V-COCO demonstrate that our method achieves superior zero-shot performance, strong cross-dataset generalization, and the flexibility to integrate with any object detector without retraining. The codes are publicly available at https://github.com/SY-Xuan/DA-HOI.
[40] MB-DSMIL-CL-PL: Scalable Weakly Supervised Ovarian Cancer Subtype Classification and Localisation Using Contrastive and Prototype Learning with Frozen Patch Features
Marcus Jenkins,Jasenka Mazibrada,Bogdan Leahu,Michal Mackiewicz
Main category: cs.CV
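TL;DR: This paper proposes a new method based on contrastive and prototype learning with feature-space augmentations for subtype classification and localisation in ovarian cancer histopathology images, substantially improving performance while continuing to use frozen patch features.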
Details
Motivation: Rising diagnostic workloads in UK pathology departments are driving the development of AI-assisted diagnosis; traditional methods rely on pre-computed frozen features, whereas end-to-end methods are more accurate but scale poorly and make experimentation time-consuming. Method: Uses a contrastive and prototype learning framework that applies feature-space augmentations to pre-computed, frozen image features for subtype classification and localisation. Result: Compared to DSMIL, instance-level and slide-level F1 scores improve by 70.4% and 15.3% respectively, instance localisation AUC by 16.9%, and slide classification AUC by 2.3%. Conclusion: The method markedly improves the accuracy of ovarian cancer subtype recognition and localisation without sacrificing scalability or training efficiency, supporting the strategy of frozen features plus feature-space augmentation. Abstract: The study of histopathological subtypes is valuable for the personalisation of effective treatment strategies for ovarian cancer. However, increasing diagnostic workloads present a challenge for UK pathology departments, leading to the rise in AI approaches. While traditional approaches in this field have relied on pre-computed, frozen image features, recent advances have shifted towards end-to-end feature extraction, providing an improvement in accuracy but at the expense of significantly reduced scalability during training and time-consuming experimentation. In this paper, we propose a new approach for subtype classification and localisation in ovarian cancer histopathology images using contrastive and prototype learning with pre-computed, frozen features via feature-space augmentations. Compared to DSMIL, our method achieves an improvement of 70.4% and 15.3% in F1 score for instance- and slide-level classification, respectively, along with AUC gains of 16.9% for instance localisation and 2.3% for slide classification, while maintaining the use of frozen patch features.
[41] Loss Knows Best: Detecting Annotation Errors in Videos via Loss Trajectories
Praditha Alwis,Soumyadeep Chandra,Deepak Ravikumar,Kaushik Roy
Main category: cs.CV
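TL;DR: This paper proposes a model-agnostic method based on Cumulative Sample Loss (CSL) for detecting annotation errors in video datasets, such as mislabeling and temporal disordering. It analyzes per-frame loss trajectories during training to identify hard-to-learn samples and thereby locate likely annotation errors.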
Details
Motivation: High-quality video datasets are essential for tasks such as action recognition, but real datasets often contain annotation errors such as mislabeling and temporal disordering, which are especially harmful in phase-annotation tasks; a general error detection method that needs no ground truth is required. Method: Defines and computes a per-frame Cumulative Sample Loss (CSL), the average loss of a frame across multiple training checkpoints; uses the weights of a video segmentation model saved at each epoch to evaluate the loss trajectory of every frame in a test video; and flags frames with persistently high or irregular CSL as likely annotation errors. Result: Validated on the EgoPER and Cholec80 datasets, the method accurately identifies subtle inconsistencies such as mislabeling and frame disordering, needs no ground-truth labels for annotation errors, and generalizes across datasets. Conclusion: The CSL-driven annotation error detection method is a reliable, general tool for auditing video datasets and helps improve the training robustness and reliability of video machine learning models. Abstract: High-quality video datasets are foundational for training robust models in tasks like action recognition, phase detection, and event segmentation. However, many real-world video datasets suffer from annotation errors such as *mislabeling*, where segments are assigned incorrect class labels, and *disordering*, where the temporal sequence does not follow the correct progression. These errors are particularly harmful in phase-annotated tasks, where temporal consistency is critical. We propose a novel, model-agnostic method for detecting annotation errors by analyzing the Cumulative Sample Loss (CSL)--defined as the average loss a frame incurs when passing through model checkpoints saved across training epochs. This per-frame loss trajectory acts as a dynamic fingerprint of frame-level learnability. Mislabeled or disordered frames tend to show consistently high or irregular loss patterns, as they remain difficult for the model to learn throughout training, while correctly labeled frames typically converge to low loss early. To compute CSL, we train a video segmentation model and store its weights at each epoch. These checkpoints are then used to evaluate the loss of each frame in a test video. Frames with persistently high CSL are flagged as likely candidates for annotation errors, including mislabeling or temporal misalignment. Our method does not require ground truth on annotation errors and is generalizable across datasets. Experiments on EgoPER and Cholec80 demonstrate strong detection performance, effectively identifying subtle inconsistencies such as mislabeling and frame disordering.
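The CSL definition in the abstract (a frame's loss averaged over the checkpoints saved across epochs) can be sketched in a few lines. This is an illustrative sketch only, not the authors' code: it assumes the per-frame losses have already been computed and stored in an array of shape `(num_checkpoints, num_frames)`, and the z-score rule in `flag_suspect_frames` is a hypothetical thresholding choice, since the abstract does not state how "persistently high" is decided.

```python
import numpy as np

def cumulative_sample_loss(loss_per_checkpoint):
    """Average each frame's loss over all saved checkpoints.

    loss_per_checkpoint: array-like of shape (num_checkpoints, num_frames),
    where entry [c, f] is the loss of frame f under checkpoint c.
    """
    losses = np.asarray(loss_per_checkpoint, dtype=float)
    return losses.mean(axis=0)

def flag_suspect_frames(csl, z_threshold=1.5):
    """Return indices of frames whose CSL is unusually high.

    The z-score rule is an assumption of this sketch, not the paper's criterion.
    """
    csl = np.asarray(csl, dtype=float)
    std = csl.std()
    if std == 0.0:
        return np.array([], dtype=int)
    z_scores = (csl - csl.mean()) / std
    return np.flatnonzero(z_scores > z_threshold)

# Toy data: 3 checkpoints, 6 frames. Frame 4 never becomes easy to fit.
losses = np.array([
    [2.0, 2.1, 1.9, 2.0, 2.2, 2.0],  # early checkpoint
    [1.0, 1.1, 0.9, 1.0, 2.1, 1.0],  # middle checkpoint
    [0.3, 0.4, 0.2, 0.3, 2.0, 0.3],  # late checkpoint
])
csl = cumulative_sample_loss(losses)
suspects = flag_suspect_frames(csl)
print(csl, suspects)
```

In the toy data, five frames converge to low loss while frame 4 stays near 2.0 at every checkpoint, so it alone is flagged. Averaging over checkpoints rather than using only the final loss is what lets a frame that is eventually memorized still stand out.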
The proposed approach provides a powerful tool for dataset auditing and improving training reliability in video-based machine learning.
[42] Distributional Deep Learning for Super-Resolution of 4D Flow MRI under Domain Shift
Xiaoyi Wen,Fei Jiang
Main category: cs.CV
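TL;DR: This paper proposes a distributional deep learning framework for super-resolution of 4D Flow MRI that addresses the domain shift caused by differences in clinical acquisition mechanisms. Pretraining on CFD simulation data and fine-tuning on a small set of real paired data markedly improves generalization and reconstruction performance.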
Details
Motivation: Conventional super-resolution methods rely on paired data generated by simple downsampling; in clinical practice, low-resolution data come from complex acquisition mechanisms, which causes severe domain shift and poor generalization. Method: Proposes a distributional deep learning framework that is first pretrained on high-resolution CFD simulations and their downsampled counterparts, then fine-tuned on a small set of paired 4D Flow MRI and CFD data; the theoretical properties of the distributional estimators are also derived. Result: On real 4D Flow MRI data, the method significantly outperforms traditional deep learning super-resolution approaches, mitigating domain shift and improving both reconstruction quality and the accuracy of hemodynamic metrics such as vessel wall stress. Conclusion: Distributional learning strengthens model robustness and domain generalization, offering a more reliable and practical approach to medical image super-resolution in realistic clinical settings. Abstract: Super-resolution is widely used in medical imaging to enhance low-quality data, reducing scan time and improving abnormality detection. Conventional super-resolution approaches typically rely on paired datasets of downsampled and original high resolution images, training models to reconstruct high resolution images from their artificially degraded counterparts. However, in real-world clinical settings, low resolution data often arise from acquisition mechanisms that differ significantly from simple downsampling. As a result, these inputs may lie outside the domain of the training data, leading to poor model generalization due to domain shift. To address this limitation, we propose a distributional deep learning framework that improves model robustness and domain generalization. We develop this approach for enhancing the resolution of 4D Flow MRI (4DF). This is a novel imaging modality that captures hemodynamic flow velocity and clinically relevant metrics such as vessel wall stress. These metrics are critical for assessing aneurysm rupture risk. Our model is initially trained on high resolution computational fluid dynamics (CFD) simulations and their downsampled counterparts. It is then fine-tuned on a small, harmonized dataset of paired 4D Flow MRI and CFD samples. We derive the theoretical properties of our distributional estimators and demonstrate that our framework significantly outperforms traditional deep learning approaches through real data applications.
This highlights the effectiveness of distributional learning in addressing domain shift and improving super-resolution performance in clinically realistic scenarios.
[43] Time-Archival Camera Virtualization for Sports and Visual Performances
Yunxiao Zhang,William Stone,Suryansh Kumar
Main category: cs.CV
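TL;DR: This paper proposes a camera virtualization method based on neural volume rendering. By modeling rigid transformations across multiple views of a dynamic scene, it improves rendering quality and supports time-archival, making it suitable for dynamic scenes such as sports broadcasting.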
Details
Motivation: Existing novel view synthesis methods for dynamic scenes (such as 4DGS) struggle with rapid non-rigid motion and independently moving subjects and lack time-archival capabilities, which limits their use in live sports and stage performances. Method: Uses a neural volume rendering framework that models a dynamic scene as temporally consistent rigid transformations across multiple synchronized camera views and performs neural representation learning, enabling high-quality rendering and time-archival. Result: Achieves spatially and temporally coherent, high-fidelity novel view synthesis for dynamic scenes and, for the first time, supports retrospective rendering at any past moment, meeting replay, analysis, and archival needs in sports broadcasting. Conclusion: Revisiting the neural volume rendering paradigm can compensate for the shortcomings of current dynamic 3DGS methods, improving motion robustness and temporal traceability while remaining real-time, and provides a more practical solution for camera virtualization. Abstract: Camera virtualization -- an emerging solution to novel view synthesis -- holds transformative potential for visual entertainment, live performances, and sports broadcasting by enabling the generation of photorealistic images from novel viewpoints using images from a limited set of calibrated multiple static physical cameras. Despite recent advances, achieving spatially and temporally coherent and photorealistic rendering of dynamic scenes with efficient time-archival capabilities, particularly in fast-paced sports and stage performances, remains challenging for existing approaches. Recent methods based on 3D Gaussian Splatting (3DGS) for dynamic scenes could offer real-time view-synthesis results. Yet, they are hindered by their dependence on accurate 3D point clouds from the structure-from-motion method and their inability to handle large, non-rigid, rapid motions of different subjects (e.g., flips, jumps, articulations, sudden player-to-player transitions). Moreover, independent motions of multiple subjects can break the Gaussian-tracking assumptions commonly used in 4DGS, ST-GS, and other dynamic splatting variants. This paper advocates reconsidering a neural volume rendering formulation for camera virtualization and efficient time-archival capabilities, making it useful for sports broadcasting and related applications. By modeling a dynamic scene as rigid transformations across multiple synchronized camera views at a given time, our method performs neural representation learning, providing enhanced visual rendering quality at test time.
A key contribution of our approach is its support for time-archival, i.e., users can revisit any past temporal instance of a dynamic scene and can perform novel view synthesis, enabling retrospective rendering for replay, analysis, and archival of live events, a functionality absent in existing neural rendering approaches and novel view synthesis...
[44] How to Train Your Long-Context Visual Document Model
Austin Veselka
Main category: cs.CV
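TL;DR: This paper presents the first systematic study of how to train long-context vision language models (up to 344K tokens), focusing on long-document visual question answering and measuring transfer to long-context text tasks. Using continued pretraining, supervised finetuning, and preference optimization, it reaches SOTA performance on MMLongBenchDoc with 24B and 32B parameter models and reports several key findings and a corrected benchmark.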
Details
Motivation: Existing strong open-weight long-context multimodal models (such as Qwen3 VL and GLM 4.5/6V) lack reproducible training recipes and data pipelines, a gap that calls for systematic study. Method: Conducts a systematic study on 24B and 32B parameter models covering continued pretraining, supervised finetuning, and preference optimization; designs synthetic data pipelines; introduces page indices; and runs extensive ablations and evaluations on long-context benchmarks, MMLongBenchDoc in particular. Result: Reaches SOTA performance on MMLongBenchDoc at the 24B and 32B scales; finds that training at context lengths matching evaluation works better, that page indices give a notable boost, that synthetic data enables self-improvement, and that visual long-context training also improves long-context text ability; and releases the corrected benchmark MMLBD-C. Conclusion: Training long-context vision language models requires matching context lengths, structured prompting (such as page indices), and high-quality synthetic data; visual and textual long-context abilities can transfer in both directions; and reproducible training recipes and reliable benchmarks are essential to the field. Abstract: We present the first comprehensive, large-scale study of training long-context vision language models up to 344K context, targeting long-document visual question answering with measured transfer to long-context text. While several such strong models are open-weight, namely Qwen3 VL and GLM 4.5/6V, their training recipes and data pipelines are not reproducible. We systematically study continued pretraining, supervised finetuning, and preference optimization for 24B and 32B parameter models, backed by extensive LC evaluations and ablations to bridge this gap, and achieve state-of-the-art performance on MMLongBenchDoc for both parameter scales. In addition to this, our key findings include: (i) training on context lengths that match evaluation context lengths outperforms training on longer contexts, (ii) training and evaluating with page indices provides a simple, high-impact boost to long-document performance, (iii) our synthetic data pipelines enable self-improvement via continued pretraining and supervised finetuning, and (iv) we extend the known text-to-visual long context transfer to the reverse, showing that visual long context training transfers to long-context text performance. We also release MMLBD-C, a manually corrected version of MMLongBenchDoc to reduce erroneous and low quality examples in the benchmark.
[45] Accelerating Large-Scale Dataset Distillation via Exploration-Exploitation Optimization
Muhammad J. Alahmadi,Peng Gao,Feiyi Wang,Dongkuan Xu
Main category: cs.CV
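TL;DR: This paper proposes Exploration-Exploitation Distillation (E^2D), an efficient dataset distillation method. It combines full-image initialization with a two-phase optimization strategy: an exploration phase (uniform updates plus identification of high-loss regions) and an exploitation phase (focused updates). The method speeds up training substantially while maintaining or improving accuracy, narrowing the trade-off between accuracy and efficiency.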
Details
Motivation: Existing decoupling-based dataset distillation methods still face an accuracy-efficiency trade-off at large scale: optimization-based methods are accurate but computationally expensive, whereas optimization-free methods are efficient but less accurate. Method: Proposes E^2D: (1) full-image initialization to preserve semantic integrity and feature diversity; (2) an exploration phase that performs uniform updates and locates high-loss regions; (3) an exploitation phase that concentrates updates on the high-loss regions only, reducing redundant computation. Result: Surpasses the SOTA on ImageNet-1K while being 18x faster; on ImageNet-21K, substantially improves accuracy while being 4.3x faster. Conclusion: Targeted reduction of redundant updates, rather than brute-force optimization, can deliver both accuracy and efficiency in large-scale dataset distillation. Abstract: Dataset distillation compresses the original data into compact synthetic datasets, reducing training time and storage while retaining model performance, enabling deployment under limited resources. Although recent decoupling-based distillation methods enable dataset distillation at large scale, they continue to face an efficiency gap: optimization-based decoupling methods achieve higher accuracy but demand intensive computation, whereas optimization-free decoupling methods are efficient but sacrifice accuracy. To overcome this trade-off, we propose Exploration-Exploitation Distillation (E^2D), a simple, practical method that minimizes redundant computation through an efficient pipeline that begins with full-image initialization to preserve semantic integrity and feature diversity. It then uses a two-phase optimization strategy: an exploration phase that performs uniform updates and identifies high-loss regions, and an exploitation phase that focuses updates on these regions to accelerate convergence. We evaluate E^2D on large-scale benchmarks, surpassing the state-of-the-art on ImageNet-1K while being 18x faster, and on ImageNet-21K, our method substantially improves accuracy while remaining 4.3x faster. These results demonstrate that targeted, redundancy-reducing updates, rather than brute-force optimization, bridge the gap between accuracy and efficiency in large-scale dataset distillation. Code is available at https://github.com/ncsu-dk-lab.
[46] Visual Persuasion: What Influences Decisions of Vision-Language Models?
Manuel Cherep,Pranav M R,Pattie Maes,Nikhil Singh
Main category: cs.CV
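TL;DR: This paper proposes a framework for studying the visual preferences of vision-language models (VLMs) through controlled image-based choice tasks and systematic image perturbations. It models a latent visual utility inferred via revealed preference, and combines visual prompt optimization with automatic interpretability analysis to expose structured preferences and potential safety risks in VLM image-based decisions.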
Details
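Motivation: VLMs now make visual decisions over web images at scale (clicking, recommending, buying), yet the structure of their visual preferences is poorly understood; a controlled, interpretable method is needed to proactively uncover their biases and safety vulnerabilities. Method: Places VLMs in controlled image-based choice tasks; starting from common images such as product photos, adapts text prompt optimization ideas to visual prompt optimization, using an image generation model to apply plausible modifications (such as composition, lighting, or background); infers the latent visual utility by comparing how selection probability changes across edited images; and builds an automatic interpretability pipeline to identify the visual themes that drive selection. Result: Large-scale experiments on frontier VLMs show that optimized image edits significantly raise selection probability, and the automatic interpretability pipeline identifies visual preference themes that are consistent across tasks (such as particular backgrounds and lighting styles), supporting the framework's effectiveness and generality. Conclusion: The framework offers a practical and efficient way to understand, audit, and govern image-driven AI agents, supporting proactive discovery of and intervention on visual preferences and safety risks rather than relying on implicit exposure after the fact. Abstract: The web is littered with images, once created for human consumption and now increasingly interpreted by agents using vision-language models (VLMs). These agents make visual decisions at scale, deciding what to click, recommend, or buy. Yet, we know little about the structure of their visual preferences. We introduce a framework for studying this by placing VLMs in controlled image-based choice tasks and systematically perturbing their inputs. Our key idea is to treat the agent's decision function as a latent visual utility that can be inferred through revealed preference: choices between systematically edited images. Starting from common images, such as product photos, we propose methods for visual prompt optimization, adapting text optimization methods to iteratively propose and apply visually plausible modifications using an image generation model (such as in composition, lighting, or background). We then evaluate which edits increase selection probability. Through large-scale experiments on frontier VLMs, we demonstrate that optimized edits significantly shift choice probabilities in head-to-head comparisons. We develop an automatic interpretability pipeline to explain these preferences, identifying consistent visual themes that drive selection. We argue that this approach offers a practical and efficient way to surface visual vulnerabilities, safety concerns that might otherwise be discovered implicitly in the wild, supporting more proactive auditing and governance of image-based AI agents.
[47] Consistency-Preserving Diverse Video Generation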
Xinshuang Liu,Runfa Blark Li,Truong Nguyen
Main category: cs.CV
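TL;DR: This paper proposes a joint-sampling framework for flow-matching video generators that improves cross-video diversity in the low-sample regime while preserving within-video temporal consistency, without costly backpropagation through a video decoder.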
Details
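Motivation: Text-to-video generation is expensive, so only a few samples are typically produced per prompt; in this low-sample setting the value of each batch must be maximized through higher cross-video diversity, but existing methods often sacrifice temporal consistency or depend on costly video decoder gradients. Method: Proposes a joint-sampling framework for flow-matching video generators that first applies diversity-driven updates and then removes the update components that would harm temporal consistency; all optimization is done with lightweight latent-space models, avoiding image-space gradients and video decoder backpropagation. Result: Experiments on a state-of-the-art text-to-video flow-matching model show diversity comparable to strong joint-sampling baselines together with substantially better temporal consistency and color naturalness. Conclusion: The framework balances cross-video diversity and within-video temporal consistency in the low-sample regime, combining efficiency with quality and offering a new direction for efficient text-to-video generation. Abstract: Text-to-video generation is expensive, so only a few samples are typically produced per prompt. In this low-sample regime, maximizing the value of each batch requires high cross-video diversity. Recent methods improve diversity for image generation, but for videos they often degrade within-video temporal consistency and require costly backpropagation through a video decoder. We propose a joint-sampling framework for flow-matching video generators that improves batch diversity while preserving temporal consistency. Our approach applies diversity-driven updates and then removes only the components that would decrease a temporal-consistency objective. To avoid image-space gradients, we compute both objectives with lightweight latent-space models, avoiding video decoding and decoder backpropagation. Experiments on a state-of-the-art text-to-video flow-matching model show diversity comparable to strong joint-sampling baselines while substantially improving temporal consistency and color naturalness. Code will be released.
[48] Training-Free Zero-Shot Anomaly Detection in 3D Brain MRI with 2D Foundation Models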
Tai Le-Gia,Jaehyun Ahn
Main category: cs.CV
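TL;DR: This paper proposes a training-free zero-shot anomaly detection (ZSAD) framework for 3D brain MRI. It aggregates multi-axis 2D slice features into localized volumetric tokens that restore cubic spatial context and plug directly into distance-based, batch-level anomaly detection pipelines.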
Details
Motivation: Existing zero-shot anomaly detection methods for 3D medical images are limited to slice-wise features or vision-language models and struggle to capture volumetric spatial structure. Method: Uses pretrained 2D foundation models to process slices along multiple orthogonal axes, aggregates them into localized 3D patch tokens that preserve cubic spatial context, and combines these with unsupervised, batch-level distance measures for anomaly detection. Result: Delivers efficient 3D ZSAD without fine-tuning, prompts, or supervision, computable quickly on standard GPUs, and markedly improves the performance and practicality of anomaly detection in 3D brain MRI. Conclusion: Shows that training-free, batch-based ZSAD can be extended from 2D to full 3D MRI volumes, offering a simple and robust approach to volumetric anomaly detection. Abstract: Zero-shot anomaly detection (ZSAD) has gained increasing attention in medical imaging as a way to identify abnormalities without task-specific supervision, but most advances remain limited to 2D datasets. Extending ZSAD to 3D medical images has proven challenging, with existing methods relying on slice-wise features and vision-language models, which fail to capture volumetric structure. In this paper, we introduce a fully training-free framework for ZSAD in 3D brain MRI that constructs localized volumetric tokens by aggregating multi-axis slices processed by 2D foundation models. These 3D patch tokens restore cubic spatial context and integrate directly with distance-based, batch-level anomaly detection pipelines. The framework provides compact 3D representations that are practical to compute on standard GPUs and require no fine-tuning, prompts, or supervision. Our results show that training-free, batch-based ZSAD can be effectively extended from 2D encoders to full 3D MRI volumes, offering a simple and robust approach for volumetric anomaly detection.
[49] Sparrow: Text-Anchored Window Attention with Visual-Semantic Glimpsing for Speculative Decoding in Video LLMs
Libo Zhang,Zhaoning Zhang,Wangyang Hong,Peng Qiao,Dongsheng Li
Main category: cs.CV
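TL;DR: This paper proposes the Sparrow framework, which uses visually-aware text-anchored window attention, intermediate-layer visual state bridging, and a multi-token prediction strategy to address the performance collapse of speculative decoding in Video Large Language Models (Vid-LLMs), achieving an average speedup of 2.82x even with 25k visual tokens.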
Details
Motivation: Speculative decoding breaks down badly in Video Large Language Models (Vid-LLMs) because of attention dilution and negative visual gain; in addition, a visual semantic internalization phenomenon is observed, which makes raw visual inputs structurally redundant during deep inference. Method: Proposes the Sparrow framework: (1) visually-aware text-anchored window attention with hidden state reuse, which fully offloads visual computation to the target model; (2) intermediate-layer visual state bridging, which trains the draft model on semantically rich intermediate states to filter out low-level visual noise; (3) a multi-token prediction strategy to reduce the training-inference distribution shift. Result: Sparrow achieves an average speedup of 2.82x with 25k visual tokens, mitigating performance degradation on long sequences and supporting real-time long-video tasks. Conclusion: Sparrow offers a practical and scalable solution for efficient, stable speculative decoding in Vid-LLMs; the key is to exploit visual semantic internalization to restructure how visual and textual computation cooperate. Abstract: Although speculative decoding is widely used to accelerate Vision-Language Models (VLMs) inference, it faces severe performance collapse when applied to Video Large Language Models (Vid-LLMs). The draft model typically falls into the trap of attention dilution and negative visual gain due to key-value cache explosion and context window mismatches. We observe a visual semantic internalization phenomenon in Vid-LLMs, indicating that critical visual semantics are implicitly encoded into text hidden states during deep-layer interactions, which renders raw visual inputs structurally redundant during deep inference. To address this, we propose the Sparrow framework, which first utilizes visually-aware text-anchored window attention via hidden state reuse to fully offload visual computation to the target model, and leverages intermediate-layer visual state bridging to train the draft model with semantic-rich intermediate states, thereby filtering out low-level visual noise. Additionally, a multi-token prediction strategy is introduced to bridge the training-inference distribution shift. Experiments show that Sparrow achieves an average speedup of 2.82x even with 25k visual tokens, effectively resolving the performance degradation in long sequences and offering a practical solution for real-time long video tasks.
[50] EventMemAgent: Hierarchical Event-Centric Memory for Online Video Understanding with Adaptive Tool Use
Siwei Wen,Zhangcheng Wang,Xingjian Zhang,Lei Huang,Wenjun Wu
Main category: cs.CV
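TL;DR: This paper proposes EventMemAgent, an active online video agent framework built on a hierarchical memory module. Short-term memory detects event boundaries and samples frames dynamically, long-term memory archives past observations event by event, and a multi-granular perception toolkit combined with Agentic Reinforcement Learning addresses the conflict between unbounded video streams and the limited context window of the model.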
Details
Motivation: Online video understanding faces a fundamental conflict between unbounded visual stream input and the limited context window of Multimodal Large Language Models (MLLMs); existing passive processing methods struggle to keep both long-range context and fine-grained detail. Method: Proposes the EventMemAgent framework: (1) a dual-layer memory in which short-term memory uses event-granular reservoir sampling to process the frame buffer dynamically and long-term memory archives observations in an event-structured form; (2) a multi-granular perception toolkit for active, iterative evidence capture; (3) Agentic Reinforcement Learning (Agentic RL) to internalize reasoning and tool-use strategies end to end. Result: Achieves competitive results on several online video benchmarks. Conclusion: By combining active perception, hierarchical memory, and reinforcement learning, EventMemAgent eases the trade-off between context length and detail retention in streaming video understanding and offers a new paradigm for building online video agents that keep evolving. Abstract: Online video understanding requires models to perform continuous perception and long-range reasoning within potentially infinite visual streams. Its fundamental challenge lies in the conflict between the unbounded nature of streaming media input and the limited context window of Multimodal Large Language Models (MLLMs). Current methods primarily rely on passive processing, which often face a trade-off between maintaining long-range context and capturing the fine-grained details necessary for complex tasks. To address this, we introduce EventMemAgent, an active online video agent framework based on a hierarchical memory module. Our framework employs a dual-layer strategy for online videos: short-term memory detects event boundaries and utilizes event-granular reservoir sampling to process streaming video frames within a fixed-length buffer dynamically; long-term memory archives past observations in a structured manner on an event-by-event basis. Furthermore, we integrate a multi-granular perception toolkit for active, iterative evidence capture and employ Agentic Reinforcement Learning (Agentic RL) to end-to-end internalize reasoning and tool-use strategies into the agent's intrinsic capabilities. Experiments show that EventMemAgent achieves competitive results on online video benchmarks. The code will be released here: https://github.com/lingcco/EventMemAgent.
[51] Effective and Robust Multimodal Medical Image Analysis
Joy Dhar,Nayyar Zaidi,Maryam Haghighat
Main category: cs.CV
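TL;DR: This paper proposes MAIL, a new multimodal fusion learning method, and its robust variant Robust-MAIL. Multi-attention mechanisms extract modality-specific and cross-modal complementary information and strengthen adversarial robustness, yielding notable performance gains and lower computational cost on 20 public datasets.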
Details
Motivation: Existing multimodal fusion methods suffer from three problems, poor generalizability, high computational cost, and weak adversarial robustness, which makes them ill-suited to multi-disease analysis and resource-limited medical settings. Method: Proposes the MAIL network, comprising a residual learning attention block (capturing multi-scale modality-specific features) and a multimodal cross-attention module (learning complementary shared representations); further designs Robust-MAIL, which adds random projection filters and modulated attention noise to improve adversarial robustness. Result: On 20 public datasets, MAIL and Robust-MAIL improve performance by up to 9.34% over existing methods, cut computational cost by up to 78.3%, and are more robust to adversarial attacks. Conclusion: The MAIL family balances performance, efficiency, and robustness, offering a more reliable, efficient, and secure approach to multimodal medical image analysis. Abstract: Multimodal Fusion Learning (MFL), leveraging disparate data from various imaging modalities (e.g., MRI, CT, SPECT), has shown great potential for addressing medical problems such as skin cancer and brain tumor prediction. However, existing MFL methods face three key limitations: a) they often specialize in specific modalities, and overlook effective shared complementary information across diverse modalities, hence limiting their generalizability for multi-disease analysis; b) they rely on computationally expensive models, restricting their applicability in resource-limited settings; and c) they lack robustness against adversarial attacks, compromising reliability in medical AI applications. To address these limitations, we propose a novel Multi-Attention Integration Learning (MAIL) network, incorporating two key components: a) an efficient residual learning attention block for capturing refined modality-specific multi-scale patterns and b) an efficient multimodal cross-attention module for learning enriched complementary shared representations across diverse modalities. Furthermore, to ensure adversarial robustness, we extend the MAIL network to design Robust-MAIL by incorporating random projection filters and modulated attention noise. Extensive evaluations on 20 public datasets show that both MAIL and Robust-MAIL outperform existing methods, achieving performance gains of up to 9.34% while reducing computational costs by up to 78.3%. These results highlight the superiority of our approaches, ensuring more reliable predictions than top competitors.
Code: https://github.com/misti1203/MAIL-Robust-MAIL
[52] CREMD: Crowd-Sourced Emotional Multimodal Dogs Dataset
Jinho Baek,Houwei Cao,Kate Blackwell
Main category: cs.CV
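TL;DR: This paper presents the CREMD dataset and analyzes how presentation modes (e.g., context, audio, video) and annotator characteristics (e.g., dog ownership, gender, professional experience) affect dog emotion recognition. Visual context significantly improves annotation agreement, whereas the effect of audio on agreement is inconclusive, although audio does raise annotator confidence.
Details
Motivation: Accurately interpreting dog emotions is difficult because emotional assessment is subjective and there is no standardized ground-truth method. Method: The CREMD dataset contains 923 video clips presented in three modes (no context and no audio, context without audio, context with audio); annotations from annotators with varied backgrounds are collected and analyzed. Result: (1) Adding visual context significantly improves annotation agreement, but the effect of audio is inconclusive; (2) non-owners and male annotators agree more than dog owners and female annotators, respectively, and professionals show higher agreement; (3) audio substantially increases annotators' confidence in identifying anger and fear. Conclusion: Visual context is the key factor in improving agreement on dog emotion labels; audio does not clearly improve agreement but raises annotator confidence, and future work needs a better audio experimental design. Abstract: Dog emotion recognition plays a crucial role in enhancing human-animal interactions, veterinary care, and the development of automated systems for monitoring canine well-being. However, accurately interpreting dog emotions is challenging due to the subjective nature of emotional assessments and the absence of standardized ground truth methods. We present the CREMD (Crowd-sourced Emotional Multimodal Dogs Dataset), a comprehensive dataset exploring how different presentation modes (e.g., context, audio, video) and annotator characteristics (e.g., dog ownership, gender, professional experience) influence the perception and labeling of dog emotions. The dataset consists of 923 video clips presented in three distinct modes: without context or audio, with context but no audio, and with both context and audio. We analyze annotations from diverse participants, including dog owners, professionals, and individuals with varying demographic backgrounds and experience levels, to identify factors that influence reliable dog emotion recognition.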
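The abstract does not say which statistic is used to measure annotation agreement. As one plausible way to compare presentation modes, the sketch below computes Fleiss' kappa, a standard chance-corrected agreement measure for a fixed number of annotators per clip. The choice of statistic is an assumption, and the function name and toy label counts are hypothetical rather than CREMD data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned clip i to emotion category j.
    Every clip must be rated by the same number of annotators (at least 2).
    Undefined (division by zero) if all ratings fall in a single category."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Observed agreement: fraction of agreeing annotator pairs per clip, averaged.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_chance = sum(s * s for s in shares)

    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical counts: 3 clips, 5 annotators, 3 emotion categories.
no_context = [[2, 2, 1], [1, 3, 1], [2, 1, 2]]
with_context = [[5, 0, 0], [0, 4, 1], [0, 0, 5]]

kappa_no_context = fleiss_kappa(no_context)
kappa_with_context = fleiss_kappa(with_context)
```

Computing the statistic separately for each presentation mode gives one number per mode that can be compared directly, which is the kind of comparison the findings below describe.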
Our findings reveal several key insights: (1) while adding visual context significantly improved annotation agreement, our findings regarding audio cues are inconclusive due to design limitations (specifically, the absence of a no-context-with-audio condition and limited clean audio availability); (2) contrary to expectations, non-owners and male annotators showed higher agreement levels than dog owners and female annotators, respectively, while professionals showed higher agreement levels, aligned with our initial hypothesis; and (3) the presence of audio substantially increased annotators' confidence in identifying specific emotions, particularly anger and fear.
[53] DAV-GSWT: Diffusion-Active-View Sampling for Data-Efficient Gaussian Splatting Wang Tiles
Rong Fu,Jiekai Wu,Haiyun Wei,Yee Tan Jia,Wenxin Zhang,Yang Li,Xiaowen Ma,Wangyu Wu,Simon Fong
Main category: cs.CV
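TL;DR: This paper proposes DAV-GSWT, a framework that combines diffusion priors with active view sampling to generate high-quality, seamlessly tileable 3D Gaussian Splatting Wang Tiles from only a few inputs, sharply reducing data requirements while preserving visual quality and interactive performance.
Details
Motivation: Existing Wang Tile-based methods for generating 3D Gaussian Splatting scenes rely on densely sampled exemplar reconstructions, so they are data-inefficient and hard to scale to large environments. Method: DAV-GSWT integrates a hierarchical uncertainty quantification mechanism with generative diffusion models to perform active view selection and hallucinate structural details, synthesizing high-fidelity, seamlessly tileable Gaussian Splatting Wang Tiles. Result: Experiments show that the system greatly reduces the required input data while maintaining visual integrity and real-time interactive performance, making it suitable for large-scale virtual environments. Conclusion: DAV-GSWT achieves data-efficient, high-quality large-scale scene synthesis and offers a new way to combine neural rendering with procedural generation. Abstract: The emergence of 3D Gaussian Splatting has fundamentally redefined the capabilities of photorealistic neural rendering by enabling high-throughput synthesis of complex environments. While procedural methods like Wang Tiles have recently been integrated to facilitate the generation of expansive landscapes, these systems typically remain constrained by a reliance on densely sampled exemplar reconstructions. We present DAV-GSWT, a data-efficient framework that leverages diffusion priors and active view sampling to synthesize high-fidelity Gaussian Splatting Wang Tiles from minimal input observations. By integrating a hierarchical uncertainty quantification mechanism with generative diffusion models, our approach autonomously identifies the most informative viewpoints while hallucinating missing structural details to ensure seamless tile transitions. Experimental results indicate that our system significantly reduces the required data volume while maintaining the visual integrity and interactive performance necessary for large-scale virtual environments.
[54] GMAIL: Generative Modality Alignment for generated Image Learning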
Shentong Mo,Sukmin Yun
Main category: cs.CV
TL;DR: This paper proposes GMAIL, a framework that treats generated images as a separate modality and aligns real and generated images in a latent space, improving vision-language models across a range of tasks.
Details
Motivation: Generated images are abundant but differ in modality from real images; mixing them indiscriminately can cause mode collapse, so they need to be used discriminatively. Method: GMAIL first fine-tunes a model on generated images with a cross-modality alignment loss, then uses the aligned model to train vision-language models. Result: Performance improves significantly on image captioning, zero-shot image retrieval and classification, and long caption retrieval; captioning with large models such as LLaVA is enhanced, and generated-data scaling trends are positive. Conclusion: By explicitly modeling generated images as a separate modality and aligning them in latent space, GMAIL unlocks the potential of generated data and is compatible with a variety of vision-language models. Abstract: Generative models have made it possible to synthesize highly realistic images, potentially providing an abundant data source for training machine learning models. Despite the advantages of these synthesizable data sources, the indiscriminate use of generated images as real images for training can even cause mode collapse due to modality discrepancies between real and synthetic domains. In this paper, we propose a novel framework for discriminative use of generated images, coined GMAIL, that explicitly treats generated images as a separate modality from real images. Instead of indiscriminately replacing real images with generated ones in the pixel space, our approach bridges the two distinct modalities in the same latent space through a multi-modal learning approach. To be specific, we first fine-tune a model exclusively on generated images using a cross-modality alignment loss and then employ this aligned model to further train various vision-language models with generated images. By aligning the two modalities, our approach effectively leverages the benefits of recent advances in generative models, thereby boosting the effectiveness of generated image learning across a range of vision-language tasks. Our framework can be easily incorporated with various vision-language models, and we demonstrate its efficacy throughout extensive experiments. For example, our framework significantly improves performance on image captioning, zero-shot image retrieval, zero-shot image classification, and long caption retrieval tasks.
It also shows positive generated data scaling trends and notable enhancements in the captioning performance of the large multimodal model, LLaVA.
[55] Bridging Day and Night: Target-Class Hallucination Suppression in Unpaired Image Translation
Shuwei Li,Lei Tan,Robby T. Tan
Main category: cs.CV
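TL;DR: This paper proposes a framework that detects and suppresses semantic hallucinations in unpaired day-to-night image translation (e.g., wrongly synthesized traffic signs, vehicles, and man-made light effects) to improve downstream performance. It uses a dual-head discriminator for hallucination detection and class-prototype constraints in feature space, with iterative refinement on a Schrödinger Bridge model. On BDD100K, detection mAP improves significantly, with a gain of 31.7% for hallucination-prone classes such as traffic lights.
Details
Motivation: Unpaired day-to-night translation is hard because of large appearance differences and the absence of pixel-level supervision; existing methods tend to produce semantic hallucinations (wrongly generated traffic signs, vehicles, and light effects) that severely hurt downstream tasks. Method: A Schrödinger Bridge-based translation framework with a dual-head discriminator (performing both discrimination and semantic segmentation) that detects hallucinated content in background regions; class-specific prototypes built from annotated real target-domain objects serve as semantic anchors, and detected hallucination features are iteratively pushed away from the corresponding prototypes in feature space. Result: On BDD100K, mAP for day-to-night domain adaptation improves by 15.5%, and by 31.7% for hallucination-prone classes such as traffic lights; qualitative and quantitative results both surpass existing methods. Conclusion: Combining hallucination detection with prototype-guided feature constraints effectively mitigates semantic hallucinations in unpaired image translation and makes translated images more usable and robust for downstream vision tasks. Abstract: Day-to-night unpaired image translation is important to downstream tasks but remains challenging due to large appearance shifts and the lack of direct pixel-level supervision. Existing methods often introduce semantic hallucinations, where objects from target classes such as traffic signs and vehicles, as well as man-made light effects, are incorrectly synthesized. These hallucinations significantly degrade downstream performance. We propose a novel framework that detects and suppresses hallucinations of target-class features during unpaired translation. To detect hallucination, we design a dual-head discriminator that additionally performs semantic segmentation to identify hallucinated content in background regions. To suppress these hallucinations, we introduce class-specific prototypes, constructed by aggregating features of annotated target-domain objects, which act as semantic anchors for each class. Built upon a Schrödinger Bridge-based translation model, our framework performs iterative refinement, where detected hallucination features are explicitly pushed away from class prototypes in feature space, thus preserving object semantics across the translation trajectory. Experiments show that our method outperforms existing approaches both qualitatively and quantitatively. On the BDD100K dataset, it improves mAP by 15.5% for day-to-night domain adaptation, with a notable 31.7% gain for classes such as traffic lights that are prone to hallucinations.
[56] Efficient Generative Modeling beyond Memoryless Diffusion via Adjoint Schrödinger Bridge Matching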
Jeongwoo Shin,Jinhwan Sul,Joonseok Lee,Jaewong Choi,Jaemoo Choi
Main category: cs.CV
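TL;DR: This paper proposes Adjoint Schrödinger Bridge Matching (ASBM), a generative modeling framework that optimizes sampling trajectories in two stages and significantly improves the stability, sampling efficiency, and fidelity of diffusion models on high-dimensional data.
Details
Motivation: Because of their memoryless forward process, conventional diffusion models have highly curved trajectories and noisy score targets, and cannot model efficient, optimal generative paths. Method: ASBM has two stages. The first treats the Schrödinger Bridge forward dynamic as a coupling construction problem and learns, from a data-to-energy sampling perspective, to transport data to an energy-defined prior; the second learns the backward generative dynamic with a matching loss supervised by the induced optimal coupling. Result: ASBM achieves higher fidelity with fewer sampling steps on image generation, can be distilled into a one-step generator, and shows better stability and scalability on high-dimensional data. Conclusion: By introducing non-memoryless dynamics and optimal coupling, ASBM alleviates the curved paths and training instability of conventional diffusion models and offers a new approach to efficient generative modeling. Abstract: Diffusion models often yield highly curved trajectories and noisy score targets due to an uninformative, memoryless forward process that induces independent data-noise coupling. We propose Adjoint Schrödinger Bridge Matching (ASBM), a generative modeling framework that recovers optimal trajectories in high dimensions via two stages. First, we view the Schrödinger Bridge (SB) forward dynamic as a coupling construction problem and learn it through a data-to-energy sampling perspective that transports data to an energy-defined prior. Then, we learn the backward generative dynamic with a simple matching loss supervised by the induced optimal coupling. By operating in a non-memoryless regime, ASBM produces significantly straighter and more efficient sampling paths. Compared to prior works, ASBM scales to high-dimensional data with notably improved stability and efficiency. Extensive experiments on image generation show that ASBM improves fidelity with fewer sampling steps. We further showcase the effectiveness of our optimal trajectory via distillation to a one-step generator.
[57] Emergent Morphing Attack Detection in Open Multi-modal Large Language Models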
Marija Ivanovska,Vitomir Štruc
Main category: cs.CV
TL;DR: This paper presents the first systematic evaluation of open-source multimodal large language models (MLLMs) on zero-shot face morphing attack detection (MAD). Without any fine-tuning, LLaVA1.6-Mistral-7B surpasses task-specific MAD methods by at least 23% in EER, suggesting that multimodal pretraining implicitly learns morphing artifact features.
Details
Motivation: Existing MAD systems depend on task-specific training and generalize poorly, while open-source MLLMs show strong visual-linguistic reasoning but their potential in biometric forensics has not been explored. Method: Using a standardized, reproducible protocol, several open-source MLLMs with publicly available weights are evaluated on single-image, zero-shot MAD without any fine-tuning or domain adaptation. Result: Several MLLMs show notable zero-shot discriminative ability; LLaVA1.6-Mistral-7B reaches SOTA performance, with an EER at least 23% lower than the strongest task-specific baseline. Conclusion: Multimodal pretraining can implicitly encode subtle facial inconsistencies, making MLLMs a reproducible, interpretable, and competitive foundation for biometric security and image forensics, and opening a path to lightweight adaptation. Abstract: Face morphing attacks threaten biometric verification, yet most morphing attack detection (MAD) systems require task-specific training and generalize poorly to unseen attack types. Meanwhile, open-source multimodal large language models (MLLMs) have demonstrated strong visual-linguistic reasoning, but their potential in biometric forensics remains underexplored. In this paper, we present the first systematic zero-shot evaluation of open-source MLLMs for single-image MAD, using publicly available weights and a standardized, reproducible protocol. Across diverse morphing techniques, many MLLMs show non-trivial discriminative ability without any fine-tuning or domain adaptation, and LLaVA1.6-Mistral-7B achieves state-of-the-art performance, surpassing highly competitive task-specific MAD baselines by at least 23% in terms of equal error rate (EER). The results indicate that multimodal pretraining can implicitly encode fine-grained facial inconsistencies indicative of morphing artifacts, enabling zero-shot forensic sensitivity. Our findings position open-source MLLMs as reproducible, interpretable, and competitive foundations for biometric security and forensic image analysis. This emergent capability also highlights new opportunities to develop state-of-the-art MAD systems through targeted fine-tuning or lightweight adaptation, further improving accuracy and efficiency while preserving interpretability. To support future research, all code and evaluation protocols will be released upon publication.
[58] RPT-SR: Regional Prior attention Transformer for infrared image Super-Resolution
Youngwan Jin,Incheol Park,Yagiz Nalcakan,Hyeongjin Ju,Sanghyeop Yeo,Shiho Kim
Main category: cs.CV
TL;DR: This paper proposes the Regional Prior attention Transformer for infrared image Super-Resolution (RPT-SR). It fuses learnable regional prior tokens with local tokens to explicitly model the spatial priors of fixed-viewpoint scenes, significantly improving super-resolution in the LWIR and SWIR bands.
Details
Motivation: General-purpose super-resolution models (e.g., ViT) do not exploit the strong, stable spatial priors of fixed-viewpoint infrared imaging scenarios (e.g., surveillance, autonomous driving), leading to redundant learning and limited performance. Method: RPT-SR adopts a dual-token framework: (1) learnable regional prior tokens act as a persistent memory of the scene's global structure; (2) local tokens represent the content of the current frame; both take part in the attention computation so that the priors dynamically modulate local reconstruction. Result: New SOTA performance on multiple infrared datasets covering the LWIR and SWIR bands, confirming the method's generality and effectiveness. Conclusion: Explicitly encoding and exploiting scene spatial priors significantly improves infrared image super-resolution, and RPT-SR offers an efficient, scalable approach for static-viewpoint imaging tasks. Abstract: General-purpose super-resolution models, particularly Vision Transformers, have achieved remarkable success but exhibit fundamental inefficiencies in common infrared imaging scenarios like surveillance and autonomous driving, which operate from fixed or nearly-static viewpoints. These models fail to exploit the strong, persistent spatial priors inherent in such scenes, leading to redundant learning and suboptimal performance. To address this, we propose the Regional Prior attention Transformer for infrared image Super-Resolution (RPT-SR), a novel architecture that explicitly encodes scene layout information into the attention mechanism. Our core contribution is a dual-token framework that fuses (1) learnable, regional prior tokens, which act as a persistent memory for the scene's global structure, with (2) local tokens that capture the frame-specific content of the current input. By feeding both kinds of tokens into the attention mechanism, our model allows the priors to dynamically modulate the local reconstruction process. Extensive experiments validate our approach. While most prior works focus on a single infrared band, we demonstrate the broad applicability and versatility of RPT-SR by establishing new state-of-the-art performance across diverse datasets covering both Long-Wave (LWIR) and Short-Wave (SWIR) spectra.
[59] LEADER: Lightweight End-to-End Attention-Gated Dual Autoencoder for Robust Minutiae Extraction
Raffaele Cappelli,Matteo Ferrara
Main category: cs.CV
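TL;DR: This paper proposes LEADER, a lightweight end-to-end deep learning model that extracts minutiae (location, direction, and type) directly from raw fingerprint images without preprocessing or postprocessing. With a novel ground-truth encoding and an attention-gated dual-autoencoder structure, it reaches leading accuracy, generalization, and inference speed.
Details
Motivation: Most existing fingerprint minutiae extraction methods rely on separate preprocessing and postprocessing steps, and truly end-to-end deep learning solutions are lacking; models also fall short in cross-domain generalization (e.g., from plain to latent fingerprints) and computational efficiency. Method: LEADER is based on a lightweight dual-autoencoder architecture with an attention-gating mechanism; it uses a "Castle-Moat-Rampart" ground-truth encoding and integrates non-maximum suppression and angular decoding for end-to-end prediction of minutiae descriptors (location, direction, type) with only 0.9M parameters. Result: On NIST SD27, the F1-score is 34% higher than that of specialized latent minutiae extractors; the average sample-level rank is 2.07, with first place on 47% of samples; the internal representations are interpretable and correspond to traditional fingerprint features such as segmentation masks and orientation fields; inference takes 15 ms on GPU and 322 ms on CPU, faster than leading commercial software. Conclusion: LEADER delivers accurate, generalizable, efficient, interpretable, and fully end-to-end minutiae extraction, moving fingerprint recognition toward more practical and robust deep learning; the code and pre-trained weights are open source. Abstract: Minutiae extraction, a fundamental stage in fingerprint recognition, is increasingly shifting toward deep learning. However, truly end-to-end methods that eliminate separate preprocessing and postprocessing steps remain scarce. This paper introduces LEADER (Lightweight End-to-end Attention-gated Dual autoencodER), a neural network that maps raw fingerprint images to minutiae descriptors, including location, direction, and type. The proposed architecture integrates non-maximum suppression and angular decoding to enable complete end-to-end inference using only 0.9M parameters. It employs a novel "Castle-Moat-Rampart" ground-truth encoding and a dual-autoencoder structure, interconnected through an attention-gating mechanism. Experimental evaluations demonstrate state-of-the-art accuracy on plain fingerprints and robust cross-domain generalization to latent impressions. Specifically, LEADER attains a 34% higher F1-score on the NIST SD27 dataset compared to specialized latent minutiae extractors. Sample-level analysis on this challenging benchmark reveals an average rank of 2.07 among all compared methods, with LEADER securing the first-place position in 47% of the samples, more than doubling the frequency of the second-best extractor. The internal representations learned by the model align with established fingerprint domain features, such as segmentation masks, orientation fields, frequency maps, and skeletons. Inference requires 15 ms on GPU and 322 ms on CPU, outperforming leading commercial software in computational efficiency.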
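The abstract names non-maximum suppression and angular decoding as the steps folded into the network but gives no details. The sketch below shows generic versions of both under stated assumptions: greedy distance-based suppression over scored candidate points, and direction recovery from a (sin, cos) pair. The (sin, cos) encoding, the suppression radius, the function names, and the toy candidates are all hypothetical and are not taken from LEADER.

```python
import math


def decode_angle(sin_value, cos_value):
    """Recover a direction in [0, 2*pi) from a (sin, cos) pair.
    The (sin, cos) encoding is an assumption made for this sketch."""
    return math.atan2(sin_value, cos_value) % (2 * math.pi)


def point_nms(candidates, radius):
    """Greedy non-maximum suppression over (x, y, score) candidates:
    visit candidates from highest to lowest score and keep one only if
    it lies farther than `radius` from every candidate already kept."""
    kept = []
    for x, y, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if all(math.hypot(x - kx, y - ky) > radius for kx, ky, _ in kept):
            kept.append((x, y, score))
    return kept


# Hypothetical candidate minutiae: two tight clusters of detections.
candidates = [
    (10.0, 10.0, 0.90),
    (11.0, 10.0, 0.80),
    (40.0, 40.0, 0.70),
    (41.0, 42.0, 0.95),
]
minutiae = point_nms(candidates, radius=5.0)
direction = decode_angle(1.0, 0.0)  # a quarter turn, i.e. pi / 2
```

Here each cluster collapses to its strongest detection, so four candidates reduce to two minutiae. A network-integrated variant would typically apply the same idea to a dense score map instead of a point list.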
The source code and pre-trained weights are publicly released to facilitate reproducibility.
[60] Semantic-Guided 3D Gaussian Splatting for Transient Object Removal
Aditi Prabakaran,Priyesh Shukla
Main category: cs.CV
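TL;DR: This paper proposes a semantic filtering framework that uses a vision-language model (e.g., CLIP) to identify and remove transient objects in multi-view captures, reducing ghosting artifacts in 3D Gaussian Splatting (3DGS) reconstruction. It is more robust than motion-based heuristics, adds little memory overhead, and keeps real-time rendering performance.
Details
Motivation: Transient objects in uncontrolled multi-view captures cause ghosting artifacts in 3DGS reconstruction; existing methods rely either on memory-hungry scene decomposition or on motion-based heuristics that are vulnerable to parallax ambiguity. Method: A semantic filtering framework uses CLIP to compute, for each Gaussian, similarity scores against distractor-class text prompts across training iterations; Gaussians whose accumulated score exceeds a calibrated threshold undergo opacity regularization and periodic pruning. Result: On four sequences of the RobustNeRF benchmark, the method consistently outperforms vanilla 3DGS in reconstruction quality, with minimal memory overhead and real-time rendering; threshold calibration and baseline comparisons confirm the effectiveness and practicality of semantic guidance. Conclusion: Semantic classification identifies object categories independently of motion patterns and thus resolves parallax ambiguity; the method is a practical, lightweight, and robust solution for transient object removal in scenes with predictable distractor categories. Abstract: Transient objects in casual multi-view captures cause ghosting artifacts in 3D Gaussian Splatting (3DGS) reconstruction. Existing solutions relied on scene decomposition at significant memory cost or on motion-based heuristics that were vulnerable to parallax ambiguity. A semantic filtering framework was proposed for category-aware transient removal using vision-language models. CLIP similarity scores between rendered views and distractor text prompts were accumulated per-Gaussian across training iterations. Gaussians exceeding a calibrated threshold underwent opacity regularization and periodic pruning. Unlike motion-based approaches, semantic classification resolved parallax ambiguity by identifying object categories independently of motion patterns. Experiments on the RobustNeRF benchmark demonstrated consistent improvement in reconstruction quality over vanilla 3DGS across four sequences, while maintaining minimal memory overhead and real-time rendering performance. Threshold calibration and comparisons with baselines validated semantic guidance as a practical strategy for transient removal in scenarios with predictable distractor categories.
[61] Advanced Acceptance Score: A Holistic Measure for Biometric Quantification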
Aman Verma,Seshan Srirangarajan,Sumantra Dutta Roy
Main category: cs.CV
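TL;DR: This paper proposes a comprehensive scheme for evaluating the quality of hand-gesture biometric scores. Built on ranking accuracy, agreement of score trends, and disentanglement of identity features, it defines a holistic "advanced acceptance score" and validates its effectiveness and reliability on multiple datasets and SOTA models.
Details
Motivation: Existing biometric capacity estimation methods rely on error rates, which do not reflect how good the scores themselves are, so a direct measure of score quality is missing. Method: An evaluation framework based on ranking order and relevance, comprising rank deviation, score rewards for high- and low-ranked gestures, compensation for agreement between the trends of output and ground-truth scores, and a discount for disentanglement of identity features; these are combined with weights into the "advanced acceptance score". Result: Experiments on three datasets with five SOTA models show that the optimal score selected by the measure is more appropriate than those selected by existing measures, and that the measure correlates with existing ones, supporting its effectiveness and reliability. Conclusion: The proposed scheme provides a more comprehensive and reasonable way to quantify the quality of hand-gesture biometric scores, with practical value and potential for wider use. Abstract: Quantifying biometric characteristics within hand gestures involves the derivation of fitness scores from a gesture- and identity-aware feature space. However, evaluating the quality of these scores remains an open question. Existing biometric capacity estimation literature relies upon error rates, but these rates do not indicate the goodness of scores. Thus, in this manuscript we present an exhaustive set of evaluation measures. We first identify ranking order and relevance of output scores as the primary basis for evaluation. In particular, we consider both rank deviation and rewards for: (i) higher scores of high-ranked gestures and (ii) lower scores of low-ranked gestures. We also compensate for correspondence between trends of output and ground truth scores. Finally, we account for disentanglement between identity features of gestures as a discounting factor. Integrating these elements with adequate weighting, we formulate the advanced acceptance score as a holistic evaluation measure. To assess the effectiveness of the proposed measure, we perform in-depth experimentation over three datasets with five state-of-the-art (SOTA) models. Results show that the optimal score selected with our measure is more appropriate than those selected with other existing measures. Our proposed measure also correlates with existing measures, which further validates its reliability. We have made our code public: https://github.com/AmanVerma2307/MeasureSuite
[62] Dynamic Training-Free Fusion of Subject and Style LoRAs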
Qinglong Cao,Yuntian Chen,Chao Ma,Xiaokang Yang
Main category: cs.CV
TL;DR: This paper proposes a training-free dynamic LoRA fusion framework. It adaptively selects features via KL divergence during the forward pass and applies gradient corrections based on CLIP/DINO scores during reverse denoising to generate subject and style jointly.
Details
Motivation: Existing LoRA fusion methods mostly use static statistical heuristics for weighting, which departs from LoRA's original purpose of adaptive feature adjustment and ignores the randomness of sampled inputs. Method: A dynamic training-free fusion framework: in the forward pass, at each LoRA-applied layer, the KL divergence between the base model's features and the subject/style LoRA features is computed to select fusion weights adaptively; in the reverse denoising stage, gradients of metrics such as CLIP and DINO scores dynamically correct the latent. Result: The method clearly outperforms existing SOTA LoRA fusion methods across diverse subject-style combinations, both qualitatively and quantitatively, without any fine-tuning or retraining. Conclusion: Combining feature-level dynamic selection with metric-guided latent adjustment yields consistent, controllable subject-style synthesis throughout the diffusion process, giving an efficient, general, training-free approach to LoRA fusion. Abstract: Recent studies have explored the combination of multiple LoRAs to simultaneously generate user-specified subjects and styles. However, most existing approaches fuse LoRA weights using static statistical heuristics that deviate from LoRA's original purpose of learning adaptive feature adjustments and ignore the randomness of sampled inputs. To address this, we propose a dynamic training-free fusion framework that operates throughout the generation process. During the forward pass, at each LoRA-applied layer, we dynamically compute the KL divergence between the base model's original features and those produced by subject and style LoRAs, respectively, and adaptively select the most appropriate weights for fusion. In the reverse denoising stage, we further refine the generation trajectory by dynamically applying gradient-based corrections derived from objective metrics such as CLIP and DINO scores, providing continuous semantic and stylistic guidance. By integrating these two complementary mechanisms (feature-level selection and metric-guided latent adjustment) across the entire diffusion timeline, our method dynamically achieves coherent subject-style synthesis without any retraining. Extensive experiments across diverse subject-style combinations demonstrate that our approach consistently outperforms state-of-the-art LoRA fusion methods both qualitatively and quantitatively.
[63] Revealing and Enhancing Core Visual Regions: Harnessing Internal Attention Dynamics for Hallucination Mitigation in LVLMs
Guangtao Lyu,Qi Liu,Chenghao Xu,Jiexi Yan,Muli Yang,Xueting Li,Fen Fang,Cheng Deng
Main category: cs.CV
TL;DR: This paper proposes PADE, a training-free attention intervention that uses Positive Attention Dynamics (PAD) inside LVLMs to identify key visual regions, and applies adaptive scaling and System-Token Compensation to reduce hallucinations and improve visual grounding.
Details
Motivation: Existing training-free methods incur high computational overhead or are vulnerable to the attention sink phenomenon, so a more robust and efficient intervention is needed. Method: PADE builds a PAD map to identify semantically core visual regions, uses per-head Median Absolute Deviation Scaling to adaptively control intervention strength, and introduces System-Token Compensation to maintain attention to complex instructions and long-range output consistency. Result: Experiments on multiple LVLMs and benchmarks show that PADE improves visual grounding and lowers hallucination rates. Conclusion: Intervening through the model's internal attention dynamics is an effective new route to more reliable multimodal reasoning. Abstract: Large Vision-Language Models (LVLMs) have achieved strong multimodal reasoning capabilities but remain prone to hallucinations, producing outputs inconsistent with visual inputs or user instructions. Existing training-free methods have drawbacks: contrastive decoding and auxiliary expert models incur several times more computational overhead and may introduce interference, while static internal signal enhancement is often vulnerable to the attention sink phenomenon. We find that internal Positive Attention Dynamics (PAD) in LVLMs naturally reveal semantically core visual regions under the distortions of attention sinks. Based on this, we propose Positive Attention Dynamics Enhancement (PADE), a training-free attention intervention that constructs a PAD map to identify semantically core visual regions, applies per-head Median Absolute Deviation Scaling to adaptively control the intervention strength, and leverages System-Token Compensation to maintain attention to complex user instructions and support long-term output consistency. Experiments on multiple LVLMs and benchmarks show that PADE improves visual grounding and reduces hallucinations, validating the effectiveness of leveraging internal attention dynamics for reliable multimodal reasoning.
[64] Intracoronary Optical Coherence Tomography Image Processing and Vessel Classification Using Machine Learning
Amal Lahchim,Lambros Athanasiou
Main category: cs.CV
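TL;DR: This paper proposes a fully automated method for vessel segmentation and classification in OCT images. It combines preprocessing, guidewire artifact removal, polar coordinate transformation, K-means clustering, and local feature extraction, then classifies pixels with Logistic Regression and SVM, reaching near-perfect results (F1 = 1.00, accuracy 99.68%) at low computational cost.
Details
Motivation: OCT images contain noise, artifacts, and complex tissue structures that make vessel segmentation and classification difficult; a fully automated, accurate method with little manual effort is needed. Method: The pipeline integrates image preprocessing, guidewire artifact removal, polar-to-Cartesian transformation, unsupervised K-means clustering, and local feature extraction, and trains Logistic Regression and SVM models on these features for pixel-wise vessel classification. Result: Precision, recall, and F1-score reach up to 1.00 and overall classification accuracy is 99.68%, with accurate boundary detection and low computational complexity. Conclusion: The fully automated pipeline is a reliable, efficient solution for OCT image analysis, suitable for clinical decision support and real-time medical image processing. Abstract: Intracoronary Optical Coherence Tomography (OCT) enables high-resolution visualization of coronary vessel anatomy but presents challenges due to noise, imaging artifacts, and complex tissue structures. This paper proposes a fully automated pipeline for vessel segmentation and classification in OCT images using machine learning techniques. The proposed method integrates image preprocessing, guidewire artifact removal, polar-to-Cartesian transformation, unsupervised K-means clustering, and local feature extraction. These features are used to train Logistic Regression and Support Vector Machine classifiers for pixel-wise vessel classification. Experimental results demonstrate excellent performance, achieving precision, recall, and F1-score values up to 1.00 and overall classification accuracy of 99.68%. The proposed approach provides accurate vessel boundary detection while maintaining low computational complexity and requiring minimal manual annotation. This method offers a reliable and efficient solution for automated OCT image analysis and has potential applications in clinical decision support and real-time medical image processing.
[65] An Industrial Dataset for Scene Acquisitions and Functional Schematics Alignment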
Flavien Armangeon,Thibaud Ehret,Enric Meinhardt-Llopis,Rafael Grompone von Gioi,Guillaume Thibault,Marc Petit,Gabriele Facciolo
Main category: cs.CV
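TL;DR: This paper introduces the IRIS-v2 dataset to support research on automatically aligning functional schematics with 2D/3D scene acquisitions of industrial facilities, and combines segmentation with graph matching to make alignment more efficient.
Details
Motivation: Old industrial facilities lack native digital models, and manually aligning schematics with reality from images and LiDAR data is slow, laborious, and hard to scale; schematics are also inconsistent with reality and public industrial datasets are scarce, making the problem challenging and underexplored. Method: IRIS-v2 is a comprehensive dataset containing images, point clouds, 2D annotated boxes and segmentation masks, a CAD model, 3D pipe routing information, and the P&ID; alignment experiments on a practical case study combine semantic segmentation with graph matching. Result: Functional schematics are aligned with multimodal scene data for an industrial facility, with the aim of reducing the time required for manual alignment. Conclusion: IRIS-v2 provides key data support for industrial scene alignment in digital twins, and the combined segmentation and graph matching approach offers a feasible path to automated alignment. Abstract: Aligning functional schematics with 2D and 3D scene acquisitions is crucial for building digital twins, especially for old industrial facilities that lack native digital models. Current manual alignment using images and LiDAR data does not scale because industrial sites are complex and the work is tedious. Inconsistencies between schematics and reality, and the scarcity of public industrial datasets, make the problem both challenging and underexplored. This paper introduces IRIS-v2, a comprehensive dataset to support further research. It includes images, point clouds, 2D annotated boxes and segmentation masks, a CAD model, 3D pipe routing information, and the P&ID (Piping and Instrumentation Diagram). Alignment is evaluated on a practical case study that aims to reduce the time required for this task by combining segmentation and graph matching.
[66] Concept-Enhanced Multimodal RAG: Towards Interpretable and Accurate Radiology Report Generation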
Marco Salmè,Federico Siciliano,Fabrizio Silvestri,Paolo Soda,Rosa Sicilia,Valerio Guarrasi
Main category: cs.CV
TL;DR: This paper proposes the Concept-Enhanced Multimodal RAG (CEMRAG) framework, which combines interpretable clinical concept decomposition with multimodal retrieval-augmented generation to improve both the interpretability and factual accuracy of radiology report generation, challenging the usual assumption of a trade-off between the two.
Details
Motivation: Vision-language models for radiology report generation suffer from poor interpretability and hallucination, and interpretability and accuracy are usually optimized separately without a unified framework. Method: CEMRAG decomposes visual representations into interpretable clinical concepts and fuses them with multimodal RAG, using enriched contextual prompts to improve generation; it is validated on MIMIC-CXR and IU X-Ray across multiple VLM architectures, training regimes, and retrieval configurations. Result: CEMRAG consistently outperforms conventional RAG and concept-only methods on clinical accuracy and standard NLP metrics, showing that interpretability and performance can improve together. Conclusion: Transparent visual concepts improve not only interpretability but also diagnostic accuracy, and the modular design offers a path toward clinically trustworthy AI. Abstract: Radiology Report Generation (RRG) through Vision-Language Models (VLMs) promises to reduce documentation burden, improve reporting consistency, and accelerate clinical workflows. However, their clinical adoption remains limited by the lack of interpretability and the tendency to hallucinate findings misaligned with imaging evidence. Existing research typically treats interpretability and accuracy as separate objectives, with concept-based explainability techniques focusing primarily on transparency, while Retrieval-Augmented Generation (RAG) methods target factual grounding through external retrieval. We present Concept-Enhanced Multimodal RAG (CEMRAG), a unified framework that decomposes visual representations into interpretable clinical concepts and integrates them with multimodal RAG. This approach exploits enriched contextual prompts for RRG, improving both interpretability and factual accuracy. Experiments on MIMIC-CXR and IU X-Ray across multiple VLM architectures, training regimes, and retrieval configurations demonstrate consistent improvements over both conventional RAG and concept-only baselines on clinical accuracy metrics and standard NLP measures. These results challenge the assumed trade-off between interpretability and performance, showing that transparent visual concepts can enhance rather than compromise diagnostic accuracy in medical VLMs.
Our modular design decomposes interpretability into visual transparency and structured language model conditioning, providing a principled pathway toward clinically trustworthy AI-assisted radiology.
[67] A Novel Public Dataset for Strawberry (Fragaria x ananassa) Ripeness Detection and Comparative Evaluation of YOLO-Based Models
Mustafa Yurdakul,Zeynep Sena Bastug,Ali Emre Gok,Sakir Taşdemir
Main category: cs.CV
TL;DR: This paper presents a public strawberry ripeness dataset and compares YOLOv8, YOLOv9, and YOLO11 models on it, showing that small and medium-sized models are balanced and efficient for this task and providing a benchmark for smart agriculture.
Details
Motivation: Traditional visual assessment of strawberry ripeness is subjective and error-prone, and the lack of public, comprehensive datasets makes studies hard to compare. Method: A public strawberry ripeness dataset with 566 images and 1,201 labeled objects was collected in two greenhouses in Turkey under variable light and environmental conditions; YOLOv8, YOLOv9, and YOLO11 models are compared on it. Result: YOLOv9c achieves the highest precision (90.94%), YOLO11s the highest recall (83.74%), and YOLOv8s the best mAP@50 (86.09%); small and medium-sized models are more balanced and efficient overall. Conclusion: The dataset fills a gap in public resources, and the results provide a reproducible benchmark for strawberry ripeness detection in support of smart agriculture applications. Abstract: The strawberry (Fragaria x ananassa), known worldwide for its economic value and nutritional richness, is a widely cultivated fruit. Determining the correct ripeness level during the harvest period is crucial for both preventing losses for producers and ensuring consumers receive a quality product. However, traditional methods, i.e., visual assessments alone, can be subjective and have a high margin of error. Therefore, computer-assisted systems are needed. However, the scarcity of comprehensive, publicly accessible datasets in the literature makes it difficult to compare studies in this field. In this study, a new and publicly available strawberry ripeness dataset, consisting of 566 images and 1,201 labeled objects, prepared under variable light and environmental conditions in two different greenhouses in Turkey, is presented to the literature. Comparative tests conducted on the dataset using YOLOv8, YOLOv9, and YOLO11-based models showed that the highest precision value was 90.94% in the YOLOv9c model, while the highest recall value was 83.74% in the YOLO11s model. In terms of the general performance criterion mAP@50, YOLOv8s was the best performing model with a success rate of 86.09%. The results show that small and medium-sized models work in a more balanced and efficient way on this type of dataset, while also establishing a fundamental reference point for smart agriculture applications.
[68] Bayesian Optimization for Design Parameters of 3D Image Data Analysis
David Exler,Joaquin Eduardo Urrutia Gómez,Martin Krüger,Maike Schliephake,John Jbeily,Mario Vitacolonna,Rüdiger Rudolf,Markus Reischl
Main category: cs.CV
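TL;DR: This paper proposes the 3D data Analysis Optimization Pipeline, which uses two stages of Bayesian Optimization to automatically select and tune segmentation and classification models, and introduces a new segmentation quality metric and an assisted annotation workflow to make 3D biomedical image analysis more efficient.
Details
Motivation: In large-scale 3D biomedical image analysis, manually selecting suitable models and tuning parameters is the main bottleneck, calling for an automated, efficient solution with low annotation cost. Method: A two-stage Bayesian Optimization pipeline. Stage one selects a segmentation model and optimizes postprocessing parameters on a domain-adapted synthetic benchmark dataset, with a new segmentation quality metric as the objective function. Stage two optimizes classifier design choices (e.g., encoder, classifier head, incorporation of prior knowledge, pretraining strategy) and includes an assisted class-annotation workflow that extracts predicted instances from the segmentation results for the operator to confirm one by one. Result: In four case studies, the pipeline efficiently identifies effective model and parameter configurations for each dataset, significantly reducing manual intervention and annotation effort. Conclusion: The pipeline offers a reproducible, adaptive, end-to-end automated solution for 3D biomedical image segmentation and classification with low annotation requirements and good practical potential. Abstract: Deep learning-based segmentation and classification are crucial to large-scale biomedical imaging, particularly for 3D data, where manual analysis is impractical. Although many methods exist, selecting suitable models and tuning parameters remains a major bottleneck in practice. Hence, we introduce the 3D data Analysis Optimization Pipeline, a method designed to facilitate the design and parameterization of segmentation and classification using two Bayesian Optimization stages. First, the pipeline selects a segmentation model and optimizes postprocessing parameters using a domain-adapted syntactic benchmark dataset. To ensure a concise evaluation of segmentation performance, we introduce a segmentation quality metric that serves as the objective function. Second, the pipeline optimizes design choices of a classifier, such as encoder and classifier head architectures, incorporation of prior knowledge, and pretraining strategies. To reduce manual annotation effort, this stage includes an assisted class-annotation workflow that extracts predicted instances from the segmentation results and sequentially presents them to the operator, eliminating the need for manual tracking. In four case studies, the 3D data Analysis Optimization Pipeline efficiently identifies effective model and parameter configurations for individual datasets.
[69] Criteria-first, semantics-later: reproducible structure discovery in image-based sciences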
Jan Bumberger
Main category: cs.CV
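TL;DR: This paper proposes a "criteria-first, semantics-later" paradigm for image analysis. It is intended to overcome the systematic failures of the traditional semantics-first approach, caused by ontology drift, in open-ended scientific discovery, cross-sensor and cross-site comparability, and long-term monitoring.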
Details
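Motivation: The traditional semantics-first paradigm of image analysis fails in key settings such as open-ended scientific exploration, cross-platform data comparability, and long-term ecological monitoring, because domain ontologies and label sets drift over time and across environments. Method: Proposes a deductive inversion, "criteria-first, semantics-later": semantics-free structure extraction based on explicit optimality criteria rather than local ontologies (yielding, for example, stable partitions, structural fields, or hierarchies) is separated from downstream semantic mapping, giving a domain-general, reproducible analysis scaffold. The theoretical grounding includes cybernetics, observation-as-distinction, and information theory's separation of information from meaning. Result: Establishes a unified, semantics-free framework for structure discovery that supports stable, reproducible structure extraction and makes semantic mapping an explicit downstream task, enabling plural interpretations and cross-ontology alignment; cross-domain evidence is cited for the framework's generality when labels do not scale. Conclusion: Structure discovery should be the first analytic layer of reproducible science, and its products should be treated as FAIR, AI-ready digital objects serving long-term monitoring and digital twins; validation should go beyond classification accuracy to address the robustness and reusability of the structure itself. Abstract: Across the natural and life sciences, images have become a primary measurement modality, yet the dominant analytic paradigm remains semantics-first. Structure is recovered by predicting or enforcing domain-specific labels. This paradigm fails systematically under the conditions that make image-based science most valuable, including open-ended scientific discovery, cross-sensor and cross-site comparability, and long-term monitoring in which domain ontologies and associated label sets drift culturally, institutionally, and ecologically. A deductive inversion is proposed in the form of criteria-first and semantics-later. A unified framework for criteria-first structure discovery is introduced. It separates criterion-defined, semantics-free structure extraction from downstream semantic mapping into domain ontologies or vocabularies and provides a domain-general scaffold for reproducible analysis across image-based sciences. Reproducible science requires that the first analytic layer perform criterion-driven, semantics-free structure discovery, yielding stable partitions, structural fields, or hierarchies defined by explicit optimality criteria rather than local domain ontologies. Semantics is not discarded; it is relocated downstream as an explicit mapping from the discovered structural product to a domain ontology or vocabulary, enabling plural interpretations and explicit crosswalks without rewriting upstream extraction.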
Grounded in cybernetics, observation-as-distinction, and information theory's separation of information from meaning, the argument is supported by cross-domain evidence showing that criteria-first components recur whenever labels do not scale. Finally, consequences are outlined for validation beyond class accuracy and for treating structural products as FAIR, AI-ready digital objects for long-term monitoring and digital twins.
[70] ToaSt: Token Channel Selection and Structured Pruning for Efficient ViT
Hyunchan Moon,Cheonjun Park,Steven L. Waslander
Main category: cs.CV
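TL;DR: ToaSt is a decoupled compression framework for Vision Transformers (ViTs). It applies coupled head-wise structured pruning to the multi-head self-attention modules and Token Channel Selection (TCS) to the feed-forward networks, substantially reducing computation while maintaining or even improving accuracy.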
Details
Motivation: ViTs perform well but are computationally expensive; existing pruning and token compression methods suffer from long retraining times or from global propagation that makes optimization difficult. Method: Proposes the ToaSt framework: (1) coupled head-wise structured pruning for the multi-head self-attention modules, exploiting characteristics of the attention operation to enhance robustness; (2) Token Channel Selection (TCS) for the feed-forward networks, which account for over 60% of FLOPs, improving compression ratios while avoiding global propagation issues. Result: Validated on nine mainstream ViT models (including DeiT, ViT-MAE, and Swin); ViT-MAE-Huge reaches 88.52% accuracy (+1.64%) with a 39.4% FLOPs reduction; downstream COCO detection reaches 52.2 mAP (versus a 51.9 baseline). Conclusion: Through component-specific compression strategies, ToaSt achieves a better balance between accuracy and efficiency, with good generalization and practicality. Abstract: Vision Transformers (ViTs) have achieved remarkable success across various vision tasks, yet their deployment is often hindered by prohibitive computational costs. While structured weight pruning and token compression have emerged as promising solutions, they suffer from prolonged retraining times and global propagation that creates optimization challenges, respectively. We propose ToaSt, a decoupled framework applying specialized strategies to distinct ViT components. We apply coupled head-wise structured pruning to Multi-Head Self-Attention modules, leveraging attention operation characteristics to enhance robustness. For Feed-Forward Networks (over 60% of FLOPs), we introduce Token Channel Selection (TCS) that enhances compression ratios while avoiding global propagation issues. Our analysis reveals TCS effectively filters redundant noise during selection. Extensive evaluations across nine diverse models, including DeiT, ViT-MAE, and Swin Transformer, demonstrate that ToaSt achieves superior trade-offs between accuracy and efficiency, consistently outperforming existing baselines. On ViT-MAE-Huge, ToaSt achieves 88.52% accuracy (+1.64%) with 39.4% FLOPs reduction. ToaSt transfers effectively to downstream tasks, achieving 52.2 versus 51.9 mAP on COCO object detection. Code and models will be released upon acceptance.
[71] Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation
Shutian Gu,Chengkai Huang,Ruoyu Wang,Lina Yao
Main category: cs.CV
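TL;DR: This paper proposes a retrieval-augmented framework that requires no modification or fine-tuning of the large language model (LLM). It adds instruction-level semantic retrieval at the episode level and pruning of navigable candidate directions at the step level, improving the efficiency and stability of LLM-based navigation in VLN.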
Details
Motivation: Prompt-based LLM navigation makes decisions inefficiently: at every step the model must re-interpret the instruction and reason over noisy, redundant navigable candidates, which hurts performance and stability. Method: Builds a two-level retrieval-augmented framework: (1) at the episode level, instruction embeddings are used to retrieve similar successful trajectories as in-context exemplars that aid instruction grounding; (2) at the step level, a lightweight candidate retriever trained by imitation learning prunes irrelevant navigable directions before LLM inference. Both modules are lightweight, pluggable, and independent of the LLM. Result: On the R2R benchmark, the method yields consistent gains in Success Rate, Oracle Success Rate, and SPL in both seen and unseen environments; ablations confirm that the two retrieval levels provide complementary benefits. Conclusion: Retrieval-augmented decision support is an effective and scalable strategy that significantly improves LLM-driven vision-and-language navigation while preserving the LLM's native capabilities and deployment flexibility. Abstract: Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions and navigate through previously unseen environments. Recent approaches increasingly employ large language models (LLMs) as high-level navigators due to their flexibility and reasoning capability. However, prompt-based LLM navigation often suffers from inefficient decision-making, as the model must repeatedly interpret instructions from scratch and reason over noisy and verbose navigable candidates at each step. In this paper, we propose a retrieval-augmented framework to improve the efficiency and stability of LLM-based VLN without modifying or fine-tuning the underlying language model. Our approach introduces retrieval at two complementary levels. At the episode level, an instruction-level embedding retriever selects semantically similar successful navigation trajectories as in-context exemplars, providing task-specific priors for instruction grounding. At the step level, an imitation-learned candidate retriever prunes irrelevant navigable directions before LLM inference, reducing action ambiguity and prompt complexity. Both retrieval modules are lightweight, modular, and trained independently of the LLM. We evaluate our method on the Room-to-Room (R2R) benchmark. Experimental results demonstrate consistent improvements in Success Rate, Oracle Success Rate, and SPL on both seen and unseen environments. Ablation studies further show that instruction-level exemplar retrieval and candidate pruning contribute complementary benefits to global guidance and step-wise decision efficiency.
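For readers unfamiliar with the SPL metric named above, the following is a minimal sketch assuming the commonly used definition of Success weighted by Path Length: each successful episode contributes the ratio of the shortest-path length to the longer of the shortest-path length and the agent's actual path length, and failed episodes contribute zero. The function name `spl` and the toy episode tuples are illustrative assumptions and are not taken from the paper.

```python
def spl(episodes):
    """Success weighted by Path Length over a list of episodes.

    Each episode is a tuple (success, shortest_len, taken_len), where
    shortest_len is the shortest-path length from start to goal and
    taken_len is the length of the path the agent actually followed.
    """
    total = 0.0
    count = 0
    for success, shortest_len, taken_len in episodes:
        count += 1
        if not success:
            continue  # failed episodes contribute 0
        denom = max(taken_len, shortest_len)
        # Guard against a degenerate zero-length episode (assumed to score 1).
        total += 1.0 if denom == 0 else shortest_len / denom
    return total / count if count else 0.0


# Toy data (hypothetical): one optimal success, one success with a path
# twice as long as necessary, and one failure.
toy_episodes = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
toy_score = spl(toy_episodes)
```

Under this definition, pruning candidates can only raise SPL if it makes the agent succeed more often or take shorter paths; a shorter prompt alone has no effect on the score.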
These results indicate that retrieval-augmented decision support is an effective and scalable strategy for enhancing LLM-based vision-and-language navigation.
[72] Spanning the Visual Analogy Space with a Weight Basis of LoRAs
Hila Manor,Rinon Gal,Haggai Maron,Tomer Michaeli,Gal Chechik
Main category: cs.CV
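TL;DR: This paper proposes LoRWeB, which performs visual analogy learning by dynamically composing learned LoRA transformation primitives, significantly improving generalization to unseen visual transformations.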
Details
Motivation: Existing visual analogy methods based on a single LoRA module are limited by a fixed adaptation module and generalize poorly to diverse visual transformations. Method: Proposes LoRWeB: (1) a learnable basis of LoRA modules that represents different visual transformations; (2) a lightweight encoder that dynamically selects and weights the basis LoRAs according to the input analogy pair. Result: Achieves state-of-the-art performance on multiple benchmarks and significantly improves generalization to unseen visual transformations. Conclusion: LoRA basis decomposition is a promising direction for flexible visual manipulation. Abstract: Visual analogy learning enables image manipulation through demonstration rather than textual description, allowing users to specify complex transformations difficult to articulate in words. Given a triplet $\{\mathbf{a}, \mathbf{a}', \mathbf{b}\}$, the goal is to generate $\mathbf{b}'$ such that $\mathbf{a} : \mathbf{a}' :: \mathbf{b} : \mathbf{b}'$. Recent methods adapt text-to-image models to this task using a single Low-Rank Adaptation (LoRA) module, but they face a fundamental limitation: attempting to capture the diverse space of visual transformations within a fixed adaptation module constrains generalization capabilities. Inspired by recent work showing that LoRAs in constrained domains span meaningful, interpolatable semantic spaces, we propose LoRWeB, a novel approach that specializes the model for each analogy task at inference time through dynamic composition of learned transformation primitives, informally, choosing a point in a "space of LoRAs". We introduce two key components: (1) a learnable basis of LoRA modules, to span the space of different visual transformations, and (2) a lightweight encoder that dynamically selects and weighs these basis LoRAs based on the input analogy pair. Comprehensive evaluations demonstrate our approach achieves state-of-the-art performance and significantly improves generalization to unseen visual transformations. Our findings suggest that LoRA basis decompositions are a promising direction for flexible visual manipulation. Code and data are available at https://research.nvidia.com/labs/par/lorweb
[73] Language and Geometry Grounded Sparse Voxel Representations for Holistic Scene Understanding
Guile Wu,David Huang,Bingbing Liu,Dongfeng Bai
Main category: cs.CV
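TL;DR: This paper proposes a language- and geometry-grounded sparse voxel representation that jointly models the appearance, semantics, and geometry of 3D scenes. It uses synergy among multiple fields and cross-modal knowledge distillation (language plus geometry) to improve open-vocabulary scene understanding and reconstruction.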
Details
Motivation: Existing 3D open-vocabulary scene understanding methods rely heavily on distilling language features from 2D foundation models and neglect the synergy among appearance, semantics, and geometry, so understanding deviates from the true geometric structure and is decoupled from reconstruction. Method: Uses 3D sparse voxels as primitives and builds an appearance field, a density field, a feature field, and a confidence field; designs a feature modulation module to fuse information across fields; jointly distills language features from a 2D foundation model and geometric knowledge from a geometry foundation model (via depth correlation regularization and pattern consistency regularization) into the 3D scene representation. Result: The method significantly outperforms state-of-the-art methods on holistic scene understanding and reconstruction. Conclusion: Sparse voxel representations guided jointly by language and geometry can model the appearance, semantics, and geometry of 3D scenes in a unified way, improving the consistency and accuracy of open-vocabulary understanding and reconstruction. Abstract: Existing 3D open-vocabulary scene understanding methods mostly emphasize distilling language features from 2D foundation models into 3D feature fields, but largely overlook the synergy among scene appearance, semantics, and geometry. As a result, scene understanding often deviates from the underlying geometric structure of scenes and becomes decoupled from the reconstruction process. In this work, we propose a novel approach that leverages language and geometry grounded sparse voxel representations to comprehensively model appearance, semantics, and geometry within a unified framework. Specifically, we use 3D sparse voxels as primitives and employ an appearance field, a density field, a feature field, and a confidence field to holistically represent a 3D scene. To promote synergy among the appearance, density, and feature fields, we construct a feature modulation module and distill language features from a 2D foundation model into our 3D scene model. In addition, we integrate geometric distillation into feature field distillation to transfer geometric knowledge from a geometry foundation model to our 3D scene representations via depth correlation regularization and pattern consistency regularization. These components work together to synergistically model the appearance, semantics, and geometry of the 3D scene within a unified framework. Extensive experiments demonstrate that our approach achieves superior overall performance compared with state-of-the-art methods in holistic scene understanding and reconstruction.
[74] RaCo: Ranking and Covariance for Practical Learned Keypoints
Abhiram Shenoi,Philipp Lindenberger,Paul-Edouard Sarlin,Marc Pollefeys
Main category: cs.CV
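TL;DR: RaCo is a lightweight neural network that learns robust, versatile keypoints for 3D vision without covisible image pairs. It achieves strong rotational robustness through data augmentation and reaches state-of-the-art performance in keypoint repeatability and two-view matching.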
Details
Motivation: Existing methods usually rely on covisible image pairs or complex equivariant network architectures to obtain rotational robustness and reliable keypoints, which limits their generality and efficiency. Method: RaCo has three core modules: a repeatable keypoint detector, a differentiable ranker (to maximize matches with a limited number of keypoints), and a covariance estimator (to quantify spatial uncertainty in metric scale). It is trained only on perspective image crops, with data augmentation that includes large rotations. Result: On several challenging datasets, RaCo achieves state-of-the-art keypoint repeatability and two-view matching, particularly under large in-plane rotations; it also estimates keypoint ranking and metric covariance independently and without additional labels. Conclusion: RaCo offers a simple, effective way to detect interpretable, repeatable keypoints with metric uncertainty, without additional annotations or covisible image pairs, balancing performance and practicality. Abstract: This paper introduces RaCo, a lightweight neural network designed to learn robust and versatile keypoints suitable for a variety of 3D computer vision tasks. The model integrates three key components: a repeatable keypoint detector, a differentiable ranker to maximize matches with a limited number of keypoints, and a covariance estimator to quantify spatial uncertainty in metric scale. Trained on perspective image crops only, RaCo operates without the need for covisible image pairs. It achieves strong rotational robustness through extensive data augmentation, even without the use of computationally expensive equivariant network architectures. The method is evaluated on several challenging datasets, where it demonstrates state-of-the-art performance in keypoint repeatability and two-view matching, particularly under large in-plane rotations. Ultimately, RaCo provides an effective and simple strategy to independently estimate keypoint ranking and metric covariance without additional labels, detecting interpretable and repeatable interest points. The code is available at https://github.com/cvg/RaCo.
[75] Understanding vs. Generation: Navigating Optimization Dilemma in Multimodal Models
Sen Ye,Mengde Xu,Shuyang Gu,Di He,Liwei Wang,Han Hu
Main category: cs.CV
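TL;DR: This paper proposes the Reason-Reflect-Refine (R3) framework, which recasts single-step generation as a multi-step "generate-understand-regenerate" process to mitigate the trade-off between generation and understanding capabilities in multimodal models.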
Details
Motivation: Current multimodal models tend to weaken understanding when generation is strengthened, and vice versa; the authors attribute this to a potential conflict between generation and understanding that leads to competing optimization. Method: Proposes the R3 framework, which explicitly invokes the model's understanding capability during generation and decomposes the generation task into three stages: Reason, Reflect, and Refine. Result: Experiments on multiple benchmarks show that R3 improves both generation quality and the related understanding ability, mitigating the optimization dilemma between the two. Conclusion: R3 offers a new approach for building next-generation unified multimodal models and suggests that generation and understanding can reinforce rather than constrain each other. Abstract: Current research in multimodal models faces a key challenge where enhancing generative capabilities often comes at the expense of understanding, and vice versa. We analyze this trade-off and identify that the primary cause may be a potential conflict between generation and understanding, which creates a competitive dynamic within the model. To address this, we propose the Reason-Reflect-Refine (R3) framework. This innovative algorithm re-frames the single-step generation task into a multi-step process of "generate-understand-regenerate". By explicitly leveraging the model's understanding capability during generation, we successfully mitigate the optimization dilemma, achieving stronger generation results and improved understanding ability related to the generation process. This offers valuable insights for designing next-generation unified multimodal models. Code is available at https://github.com/sen-ye/R3.
[76] NeRFscopy: Neural Radiance Fields for in-vivo Time-Varying Tissues from Endoscopy
Laura Salort-Benejam,Antonio Agudo
Main category: cs.CV
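TL;DR: This paper proposes NeRFscopy, a self-supervised, end-to-end neural rendering method for dynamic 3D reconstruction and novel view synthesis of deformable tissue from monocular endoscopic video, without requiring a template or pre-trained model.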
Details
Motivation: 3D reconstruction from endoscopic video faces tissue deformation, monocular imaging, illumination changes, occlusions, and unknown camera trajectories; a robust dynamic reconstruction method is needed to improve diagnosis and surgery. Method: NeRFscopy builds a deformable neural radiance field consisting of a radiance field in canonical space and a time-dependent deformation field parameterized by SE(3) transformations, and learns a 3D implicit model from monocular video alone, self-supervised by carefully designed image-consistency losses. Result: Across a range of challenging endoscopy scenes, NeRFscopy is markedly more accurate than existing methods at novel view synthesis. Conclusion: NeRFscopy provides an efficient, self-supervised, prior-free approach to 3D reconstruction of deformable endoscopic tissue, advancing the practical use of neural rendering in clinical medical imaging. Abstract: Endoscopy is essential in medical imaging, used for diagnosis, prognosis and treatment. Developing a robust dynamic 3D reconstruction pipeline for endoscopic videos could enhance visualization, improve diagnostic accuracy, aid in treatment planning, and guide surgery procedures. However, challenges arise due to the deformable nature of the tissues, the use of monocular cameras, illumination changes, occlusions and unknown camera trajectories. Inspired by neural rendering, we introduce NeRFscopy, a self-supervised pipeline for novel view synthesis and 3D reconstruction of deformable endoscopic tissues from a monocular video. NeRFscopy includes a deformable model with a canonical radiance field and a time-dependent deformation field parameterized by SE(3) transformations. In addition, the color images are efficiently exploited by introducing sophisticated terms to learn a 3D implicit model without assuming any template or pre-trained model, solely from data. NeRFscopy achieves accurate results in terms of novel view synthesis, outperforming competing methods across various challenging endoscopy scenes.
[77] Meteorological data and Sky Images meets Neural Models for Photovoltaic Power Forecasting
Ines Montoya-Espinagosa,Antonio Agudo
Main category: cs.CV
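TL;DR: This paper proposes a multimodal hybrid deep learning approach that combines sky images, historical photovoltaic data, and meteorological data to improve the accuracy of both nowcasting and medium- to long-term photovoltaic power forecasting, with notably better ramp event prediction and robustness in cloudy conditions.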
Details
Motivation: Photovoltaic output is highly variable, and conventional forecasting methods struggle to predict accurately under complex weather such as cloudy conditions; fusing multiple data sources is needed to improve forecast reliability and grid dispatch. Method: Builds a multimodal hybrid model based on deep neural networks that jointly takes sky images, historical photovoltaic power series, and several meteorological variables (such as downward long-wave radiation and the combination of wind speed and solar position) as input, supporting both nowcasting and medium- to long-term forecasting. Result: Adding specific meteorological variables (in particular surface downward long-wave radiation and the combination of wind speed and solar position) significantly improves both nowcasting and medium- to long-term forecasting accuracy, especially in cloudy conditions; ramp event prediction is more accurate and robustness improves. Conclusion: Fusing heterogeneous data sources (images, time series, meteorological data) improves the reliability, interpretability, and practicality of photovoltaic forecasting models, supporting efficient grid operation and management of solar variability. Abstract: Due to the rise in the use of renewable energies as an alternative to traditional ones, and especially solar energy, there is increasing interest in studying how to address photovoltaic forecasting in the face of the challenge of variability in photovoltaic energy production, using different methodologies. This work develops a hybrid approach for short and long-term forecasting based on two studies with the same purpose. A multimodal approach that combines images of the sky and photovoltaic energy history with meteorological data is proposed. The main goal is to improve the accuracy of ramp event prediction, increase the robustness of forecasts in cloudy conditions, and extend capabilities beyond nowcasting, to support more efficient operation of the power grid and better management of solar variability. Deep neural models are used for both nowcasting and forecasting solutions, incorporating individual and multiple meteorological variables, as well as an analytical solar position. The results demonstrate that the inclusion of meteorological data, particularly the surface long-wave radiation downwards and the combination of wind and solar position, significantly improves current predictions in both nowcasting and forecasting tasks, especially on cloudy days. This study highlights the importance of integrating diverse data sources to improve the reliability and interpretability of solar energy prediction models.
[78] Context-aware Skin Cancer Epithelial Cell Classification with Scalable Graph Transformers
Lucas Sancéré,Noémie Moreau,Katarzyna Bozek
Main category: cs.CV
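TL;DR: This paper proposes a scalable Graph Transformer approach built on cell graphs of whole-slide images (WSIs) to distinguish morphologically similar healthy and tumor epithelial cells in cutaneous squamous cell carcinoma. It achieves higher balanced accuracy than conventional image-based methods on both single and multiple WSIs.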
Details
Motivation: Existing CNN- and ViT-based WSI analysis methods rely on patch representations and lose tissue-level context; healthy and tumor epithelial cells are morphologically very similar and hard for image-based methods to separate, so spatial and cell-type relationships between cells need to be modeled. Method: Builds a full-WSI cell graph with cells as nodes and spatial proximity as edges, and classifies with Graph Transformers (SGFormer and DIFFormer); node features combine morphology, texture, and the cell classes of non-epithelial cells; in the multi-WSI experiments, subgraphs are taken from 2560×2560 image patches. Result: On a single WSI, SGFormer and DIFFormer reach balanced accuracies of 85.2±1.5% and 85.1±2.5%, above the best image-based method (81.2±3.0%); on multiple WSIs, DIFFormer reaches 83.6±1.9%, well above CellViT256 (78.1±0.5%). Conclusion: Graph Transformers make effective use of cell-level structure and context and outperform mainstream image models on fine-grained WSI classification, confirming the importance of modeling tissue-level relationships. Abstract: Whole-slide images (WSIs) from cancer patients contain rich information that can be used for medical diagnosis or to follow treatment progress. To automate their analysis, numerous deep learning methods based on convolutional neural networks and Vision Transformers have been developed and have achieved strong performance in segmentation and classification tasks. However, due to the large size and complex cellular organization of WSIs, these models rely on patch-based representations, losing vital tissue-level context. We propose using scalable Graph Transformers on a full-WSI cell graph for classification. We evaluate this methodology on a challenging task: the classification of healthy versus tumor epithelial cells in cutaneous squamous cell carcinoma (cSCC), where both cell types exhibit very similar morphologies and are therefore difficult to differentiate for image-based approaches. We first compared image-based and graph-based methods on a single WSI. Graph Transformer models SGFormer and DIFFormer achieved balanced accuracies of $85.2 \pm 1.5$ ($\pm$ standard error) and $85.1 \pm 2.5$ in 3-fold cross-validation, respectively, whereas the best image-based method reached $81.2 \pm 3.0$. By evaluating several node feature configurations, we found that the most informative representation combined morphological and texture features as well as the cell classes of non-epithelial cells, highlighting the importance of the surrounding cellular context. We then extended our work to train on several WSIs from several patients.
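As an aside on the metric reported above, the following is a minimal sketch of how a balanced accuracy and its mean ± standard error over cross-validation folds can be computed. It assumes that balanced accuracy is the unweighted mean of per-class recall and that the standard error is the sample standard deviation divided by the square root of the number of folds; the abstract does not state which estimator was used. All names, labels, and fold scores below are hypothetical.

```python
import math
from statistics import mean, stdev


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, label in enumerate(y_true) if label == cls]
        correct = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(correct / len(idx))
    return mean(recalls)


def mean_and_standard_error(fold_scores):
    """Mean and standard error (sample std / sqrt(n)) across folds."""
    n = len(fold_scores)
    return mean(fold_scores), stdev(fold_scores) / math.sqrt(n)


# Toy, imbalanced labels (hypothetical): 4 healthy cells, 2 tumor cells.
y_true = ["healthy", "healthy", "healthy", "healthy", "tumor", "tumor"]
y_pred = ["healthy", "healthy", "healthy", "tumor", "tumor", "healthy"]
toy_bacc = balanced_accuracy(y_true, y_pred)  # recalls are 3/4 and 1/2

# Hypothetical per-fold scores from a 3-fold cross-validation.
fold_mean, fold_se = mean_and_standard_error([0.80, 0.90, 1.00])
```

Because each class contributes equally regardless of its size, this metric is less flattering than plain accuracy when one cell type dominates the slide.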
To address the computational constraints of image-based models, we extracted four $2560 \times 2560$ pixel patches from each image and converted them into graphs. In this setting, DIFFormer achieved a balanced accuracy of $83.6 \pm 1.9$ (3-fold cross-validation), while the state-of-the-art image-based model CellViT256 reached $78.1 \pm 0.5$.
[79] Task-Agnostic Continual Learning for Chest Radiograph Classification
Muthu Subash Kavitha,Anas Zafar,Amgad Muneer,Jia Wu
Main category: cs.CV
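TL;DR: This paper proposes CARL-XRay, an incremental continual learning method for chest radiograph classification that achieves stable task identification and adaptation without storing raw images, and outperforms joint training under task-unknown deployment.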
Details
Motivation: Clinical deployment requires models that can be updated as new datasets arrive, without retraining on old data or degrading validated performance. Existing methods lack support for heterogeneous data streams in which task identity is unknown. Method: Proposes CARL-XRay, a continual adapter-based routing learning strategy: a high-capacity backbone is kept fixed while lightweight task-specific adapters and classifier heads are added incrementally; a latent task selector combines compact prototypes with feature-level experience replay for task identification and adaptation. Result: On large-scale public chest radiograph datasets, task identification accuracy reaches 75.0% (versus 62.5% for joint training), and AUROC is 0.74 in the oracle setting and 0.75 under task-unknown inference, with significantly fewer trainable parameters. Conclusion: CARL-XRay offers a practical alternative for continual clinical deployment that requires no raw-image storage and avoids full retraining, balancing stable performance with computational efficiency. Abstract: Clinical deployment of chest radiograph classifiers requires models that can be updated as new datasets become available without retraining on previously observed data or degrading validated performance. We study, for the first time, a task-incremental continual learning setting for chest radiograph classification, in which heterogeneous chest X-ray datasets arrive sequentially and task identifiers are unavailable at inference. We propose a continual adapter-based routing learning strategy for Chest X-rays (CARL-XRay) that maintains a fixed high-capacity backbone and incrementally allocates lightweight task-specific adapters and classifier heads. A latent task selector operates on task-adapted features and leverages both current and historical context preserved through compact prototypes and feature-level experience replay. This design supports stable task identification and adaptation across sequential updates while avoiding raw-image storage. Experiments on large-scale public chest radiograph datasets demonstrate robust performance retention and reliable task-aware inference under continual dataset ingestion. CARL-XRay outperforms joint training under task-unknown deployment, achieving higher routing accuracy (75.0% vs. 62.5%), while maintaining competitive diagnostic performance with AUROC of 0.74 in the oracle setting with ground-truth task identity and 0.75 under task-unknown inference, using significantly fewer trainable parameters.
Finally, the proposed framework provides a practical alternative to joint training and repeated full retraining in continual clinical deployment.
[80] VideoSketcher: Video Models Prior Enable Versatile Sequential Sketch Generation
Hui Ren,Yuval Alaluf,Omer Bar Tal,Alexander Schwing,Antonio Torralba,Yael Vinker
Main category: cs.CV
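TL;DR: This paper proposes a data-efficient method that uses a pretrained text-to-video diffusion model to generate temporally structured sketch drawing processes. A two-stage fine-tuning strategy decouples learning of stroke order from learning of appearance, yielding high-quality, controllable sequential sketch generation from very little human-drawn sketch data.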