Table of Contents
cs.CL [Back]
[1] Tokenization Strategies for Low-Resource Agglutinative Languages in Word2Vec: Case Study on Turkish and Finnish
Jinfan Frank Hu
Main category: cs.CL
TL;DR: 在低资源条件下,词级分词在土耳其语和芬兰语的静态词嵌入中表现优于字符级、n-gram和BPE等子词方法。
Details
Motivation: 研究不同分词策略对黏着语(如土耳其语和芬兰语)中静态词嵌入质量的影响,特别是在低资源环境下。 Method: 使用Word2Vec在10,000篇维基百科文章上训练模型,比较词级、字符级、n-gram和BPE分词策略,并在命名实体识别任务上进行评估。 Result: 词级分词在所有测试的分词策略中 consistently 表现最佳。 Conclusion: 在黏着性且资源有限的语言处理中,保留完整词边界的词级分词可能比复杂的统计子词方法更有效,对资源匮乏语言的NLP管道构建具有实际意义。 Abstract: Tokenization plays a critical role in processing agglutinative languages, where a single word can encode multiple morphemes carrying syntactic and semantic information. This study evaluates the impact of various tokenization strategies - word-level, character-level, n-gram, and Byte Pair Encoding (BPE) - on the quality of static word embeddings generated by Word2Vec for Turkish and Finnish. Using a 10,000-article Wikipedia corpus, we trained models under low-resource conditions and evaluated them on a Named Entity Recognition (NER) task. Despite the theoretical appeal of subword segmentation, word-level tokenization consistently outperformed all alternatives across all tokenization strategies tested. These findings suggest that in agglutinative, low-resource contexts, preserving boundaries via word-level tokenization may yield better embedding performance than complex statistical methods. This has practical implications for developing NLP pipelines for under-resourced languages where annotated data and computing power are limited.[2] Advancing Conversational AI with Shona Slang: A Dataset and Hybrid Model for Digital Inclusion
Happymore Masoka
Main category: cs.CL
TL;DR: 本文介绍了一个新型的绍纳语-英语俚语数据集,该数据集来自匿名社交媒体对话,填补了非洲语言在自然语言处理中的资源空白。
Details
Motivation: 非洲语言在NLP中代表性不足,现有语料库多局限于正式语体,无法反映日常交流的真实情况,尤其是像绍纳语这样的语言缺乏对非正式、口语化表达的支持。 Method: 从社交媒体收集并整理绍纳语-英语俚语数据,进行意图、情感、对话行为、语码混合和语气标注;使用多语言DistilBERT模型进行微调以实现意图识别,并构建结合规则响应与检索增强生成(RAG)的混合聊天机器人。 Result: 意图识别模型达到96.4%的准确率和96.3%的F1分数;混合聊天机器人在文化相关性和用户参与度方面优于纯RAG系统。 Conclusion: 通过发布数据集、模型和方法,本研究推动了非洲语言的NLP发展,促进了更具包容性和文化共鸣的对话式AI。 Abstract: African languages remain underrepresented in natural language processing (NLP), with most corpora limited to formal registers that fail to capture the vibrancy of everyday communication. This work addresses this gap for Shona, a Bantu language spoken in Zimbabwe and Zambia, by introducing a novel Shona--English slang dataset curated from anonymized social media conversations. The dataset is annotated for intent, sentiment, dialogue acts, code-mixing, and tone, and is publicly available at https://github.com/HappymoreMasoka/Working_with_shona-slang. We fine-tuned a multilingual DistilBERT classifier for intent recognition, achieving 96.4\% accuracy and 96.3\% F1-score, hosted at https://huggingface.co/HappymoreMasoka. This classifier is integrated into a hybrid chatbot that combines rule-based responses with retrieval-augmented generation (RAG) to handle domain-specific queries, demonstrated through a use case assisting prospective students with graduate program information at Pace University. Qualitative evaluation shows the hybrid system outperforms a RAG-only baseline in cultural relevance and user engagement. By releasing the dataset, model, and methodology, this work advances NLP resources for African languages, promoting inclusive and culturally resonant conversational AI.[3] The meaning of prompts and the prompts of meaning: Semiotic reflections and modelling
Martin Thellefsen,Amalia Nurma Dewi,Bent Sorensen
Main category: cs.CL
TL;DR: 本文从皮尔斯符号学理论出发,将大语言模型中的提示(prompting)视为一种动态的符号交流过程,而非单纯的技术输入机制。
Details
Motivation: 重新理解提示在知识组织与信息检索中的角色,突破传统技术视角,引入符号学与传播理论。 Method: 基于皮尔斯的三元符号模型、九种符号类型以及Dynacom传播模型,进行理论分析与概念重构。 Result: 提出提示是一种涉及符号生成、解释与迭代的交际与认知行为,LLM作为符号资源参与意义建构。 Conclusion: 提示应被视为数字环境中知识共建的符号性交流过程,呼吁重塑知识组织与信息寻求的理论与方法基础。 Abstract: This paper explores prompts and prompting in large language models (LLMs) as dynamic semiotic phenomena, drawing on Peirce's triadic model of signs, his nine sign types, and the Dynacom model of communication. The aim is to reconceptualize prompting not as a technical input mechanism but as a communicative and epistemic act involving an iterative process of sign formation, interpretation, and refinement. The theoretical foundation rests on Peirce's semiotics, particularly the interplay between representamen, object, and interpretant, and the typological richness of signs: qualisign, sinsign, legisign; icon, index, symbol; rheme, dicent, argument - alongside the interpretant triad captured in the Dynacom model. Analytically, the paper positions the LLM as a semiotic resource that generates interpretants in response to user prompts, thereby participating in meaning-making within shared universes of discourse. The findings suggest that prompting is a semiotic and communicative process that redefines how knowledge is organized, searched, interpreted, and co-constructed in digital environments. This perspective invites a reimagining of the theoretical and methodological foundations of knowledge organization and information seeking in the age of computational semiosis[4] LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures
Hai Huang,Yann LeCun,Randall Balestriero
Main category: cs.CL
TL;DR: 本文提出了LLM-JEPA,一种基于联合嵌入预测架构(JEPA)的大语言模型训练方法,适用于预训练和微调,在多个模型和数据集上显著优于传统输入空间目标,且不易过拟合。
Details
Motivation: 受视觉领域中嵌入空间训练目标优于输入空间的启发,探索是否可将类似方法应用于语言模型训练,以提升性能。 Method: 设计了一种适用于大语言模型的JEPA式训练框架LLM-JEPA,通过嵌入空间的预测任务实现预训练和微调。 Result: LLM-JEPA在NL-RX、GSM8K、Spider、RottenTomatoes等多个数据集和Llama3、OpenELM、Gemma2、Olmo等模型家族上均显著优于标准训练目标,且具有更强的抗过拟合能力。 Conclusion: 语言模型可以从视觉领域的嵌入空间训练方法中受益,LLM-JEPA为未来语言模型训练提供了一种有前景的新范式。 Abstract: Large Language Model (LLM) pretraining, finetuning, and evaluation rely on input-space reconstruction and generative capabilities. Yet, it has been observed in vision that embedding-space training objectives, e.g., with Joint Embedding Predictive Architectures (JEPAs), are far superior to their input-space counterpart. That mismatch in how training is achieved between language and vision opens up a natural question: {\em can language training methods learn a few tricks from the vision ones?} The lack of JEPA-style LLM is a testimony of the challenge in designing such objectives for language. In this work, we propose a first step in that direction where we develop LLM-JEPA, a JEPA based solution for LLMs applicable both to finetuning and pretraining. Thus far, LLM-JEPA is able to outperform the standard LLM training objectives by a significant margin across models, all while being robust to overfiting. Those findings are observed across numerous datasets (NL-RX, GSM8K, Spider, RottenTomatoes) and various models from the Llama3, OpenELM, Gemma2 and Olmo families. Code: https://github.com/rbalestr-lab/llm-jepa.[5] CrossPT: Exploring Cross-Task Transferability through Multi-Task Prompt Tuning
Ahmad Pouramini,Hesham Faili
Main category: cs.CL
TL;DR: 提出了一种跨任务提示调优(CrossPT)框架,通过共享和任务特定提示的结合,在多任务设置中实现知识迁移与专业化,提升了性能和鲁棒性。
Details
Motivation: 现有提示调优方法多为单任务设计,缺乏跨任务知识共享机制。 Method: 将目标提示分解为共享的源提示和任务特定的私有提示,并通过学习注意力机制进行组合。 Result: 在GLUE等基准上,CrossPT在低资源场景下优于传统提示调优方法,具有更高准确率和鲁棒性,同时保持良好参数效率。 Conclusion: CrossPT有效实现了多任务间的可控知识迁移,兼顾任务专用性和参数效率。 Abstract: Prompt tuning offers a parameter-efficient way to adapt large pre-trained language models to new tasks, but most existing approaches are designed for single-task settings, failing to share knowledge across related tasks. We propose Cross-task Prompt Tuning (CrossPT), a modular framework for multi-task prompt tuning that enables controlled knowledge transfer while maintaining task-specific specialization. CrossPT decomposes each target prompt into shared, pre-trained source prompts and task-specific private prompts, combined via a learned attention mechanism. To support robust transfer, we systematically investigate key design factors including prompt initialization, balancing shared and private prompts, number of source prompts, learning rates, task prefixes, and label semantics. Empirical results on GLUE and related benchmarks show that CrossPT achieves higher accuracy and robustness compared to traditional prompt tuning and related methods, particularly in low-resource scenarios, while maintaining strong parameter efficiency.[6] Hallucination Detection with the Internal Layers of LLMs
Martin Preiß
Main category: cs.CL
TL;DR: 本文提出了一种基于大语言模型(LLM)内部表示的新架构,通过动态加权和融合层来检测幻觉,并在多个基准上验证了其有效性。
Details
Motivation: 大语言模型容易产生幻觉问题,即生成看似合理但事实错误的内容,严重影响其可靠性,因此需要有效且低计算成本的检测方法。 Method: 利用LLM内部表示,构建动态加权融合各层信息的新探针架构,无需训练整个模型,在TruthfulQA、HaluEval和ReFact三个基准上进行评估。 Result: 所提方法优于传统探针方法;跨基准训练和参数冻结可缓解泛化问题,在特定基准上提升性能并减少迁移时的性能下降。 Conclusion: 通过分析LLM内部表示进行幻觉检测是可行且有前景的方向,动态层融合与适当训练策略有助于提升检测性能和泛化能力。 Abstract: Large Language Models (LLMs) have succeeded in a variety of natural language processing tasks [Zha+25]. However, they have notable limitations. LLMs tend to generate hallucinations, a seemingly plausible yet factually unsupported output [Hua+24], which have serious real-world consequences [Kay23; Rum+24]. Recent work has shown that probing-based classifiers that utilize LLMs' internal representations can detect hallucinations [AM23; Bei+24; Bur+24; DYT24; Ji+24; SMZ24; Su+24]. This approach, since it does not involve model training, can enhance reliability without significantly increasing computational costs. Building upon this approach, this thesis proposed novel methods for hallucination detection using LLM internal representations and evaluated them across three benchmarks: TruthfulQA, HaluEval, and ReFact. Specifically, a new architecture that dynamically weights and combines internal LLM layers was developed to improve hallucination detection performance. Throughout extensive experiments, two key findings were obtained: First, the proposed approach was shown to achieve superior performance compared to traditional probing methods, though generalization across benchmarks and LLMs remains challenging. Second, these generalization limitations were demonstrated to be mitigated through cross-benchmark training and parameter freezing. While not consistently improving, both techniques yielded better performance on individual benchmarks and reduced performance degradation when transferred to other benchmarks. These findings open new avenues for improving LLM reliability through internal representation analysis.[7] Opening the Black Box: Interpretable LLMs via Semantic Resonance Architecture
Ivan Ternovtsii
Main category: cs.CL
TL;DR: 本文提出了语义共振架构(SRA),通过基于语义锚点的余弦相似性路由机制,提升混合专家模型的可解释性和性能。
Details
Motivation: 现有的混合专家模型依赖于难以解释的学习型门控机制,缺乏透明度,限制了对模型行为的理解与控制。 Method: 提出语义共振架构(SRA),用可训练的语义锚点替代传统门控,通过余弦相似性进行token路由,并引入分散损失促进锚点正交化以增强专家多样性。 Result: 在WikiText-103上验证集困惑度为13.41,优于密集模型和标准MoE;死专家比例显著降低至1.0%,且专家表现出清晰的语义专业化模式。 Conclusion: SRA实现了更高效、更可解释的路由机制,为构建透明可控的语言模型提供了新方法。 Abstract: Large language models (LLMs) achieve remarkable performance but remain difficult to interpret. Mixture-of-Experts (MoE) models improve efficiency through sparse activation, yet typically rely on opaque, learned gating functions. While similarity-based routing (Cosine Routers) has been explored for training stabilization, its potential for inherent interpretability remains largely untapped. We introduce the Semantic Resonance Architecture (SRA), an MoE approach designed to ensure that routing decisions are inherently interpretable. SRA replaces learned gating with a Chamber of Semantic Resonance (CSR) module, which routes tokens based on cosine similarity with trainable semantic anchors. We also introduce a novel Dispersion Loss that encourages orthogonality among anchors to enforce diverse specialization. Experiments on WikiText-103 demonstrate that SRA achieves a validation perplexity of 13.41, outperforming both a dense baseline (14.13) and a Standard MoE baseline (13.53) under matched active parameter constraints (29.0M). Crucially, SRA exhibits superior expert utilization (1.0% dead experts vs. 14.8% in the Standard MoE) and develops distinct, semantically coherent specialization patterns, unlike the noisy specialization observed in standard MoEs. This work establishes semantic routing as a robust methodology for building more transparent and controllable language models.[8] JU-NLP at Touché: Covert Advertisement in Conversational AI-Generation and Detection Strategies
Arka Dutta,Agrik Majumdar,Sombrata Biswas,Dipankar Das,Sivaji Bandyopadhyay
Main category: cs.CL
TL;DR: 本文提出了一种在对话式AI系统中生成和检测隐蔽广告的综合框架,利用用户上下文和查询意图生成自然嵌入的广告内容,并通过微调语言模型实现高隐蔽性;在检测方面,采用CrossEncoder和基于提示的DeBERTa模型进行响应文本分类,实验结果显示生成和检测均具有高精度,表明该方法能在保证透明度的同时实现有效说服。
Details
Motivation: 随着对话式AI在商业场景中的广泛应用,隐蔽广告可能在用户无意识中影响其决策,但缺乏有效的生成与检测机制。因此,需要构建既能实现营销目标又可被可靠识别的系统,以维护用户信任与平台透明度。 Method: 生成任务中,提出一个结合用户上下文与查询意图的框架,使用高级提示策略并构建配对训练数据来微调大语言模型以提升隐蔽性;检测任务中,采用两种方法:一是微调CrossEncoder(all-mpnet-base-v2)进行直接分类,二是通过微调DeBERTa-v3-base模型进行提示式重构,仅依赖回复文本完成检测。 Result: 在广告生成任务中达到1.0的精确率和0.71的召回率,在广告检测任务中F1分数介于0.99至1.00之间,验证了所提方法在实际应用中的高效性与可行性。 Conclusion: 所提出的生成与检测框架在对话式AI中实现了高度有效的隐蔽广告植入与识别,有助于在商业推广与系统透明性之间取得平衡,为未来负责任的AI部署提供了技术路径。 Abstract: This paper proposes a comprehensive framework for the generation of covert advertisements within Conversational AI systems, along with robust techniques for their detection. It explores how subtle promotional content can be crafted within AI-generated responses and introduces methods to identify and mitigate such covert advertising strategies. For generation (Sub-Task~1), we propose a novel framework that leverages user context and query intent to produce contextually relevant advertisements. We employ advanced prompting strategies and curate paired training data to fine-tune a large language model (LLM) for enhanced stealthiness. For detection (Sub-Task~2), we explore two effective strategies: a fine-tuned CrossEncoder (\texttt{all-mpnet-base-v2}) for direct classification, and a prompt-based reformulation using a fine-tuned \texttt{DeBERTa-v3-base} model. Both approaches rely solely on the response text, ensuring practicality for real-world deployment. Experimental results show high effectiveness in both tasks, achieving a precision of 1.0 and recall of 0.71 for ad generation, and F1-scores ranging from 0.99 to 1.00 for ad detection. These results underscore the potential of our methods to balance persuasive communication with transparency in conversational AI.[9] From Correction to Mastery: Reinforced Distillation of Large Language Model Agents
Yuanjie Lyu,Chengyu Wang,Jun Huang,Tong Xu
Main category: cs.CL
TL;DR: 提出SCoRe框架,通过学生主导、教师仅在首次关键错误时干预的方式,实现高效知识蒸馏,在12个挑战性基准上,7B参数的学生模型达到72B参数教师模型的代理性能。
Details
Motivation: 现有蒸馏方法因师生推理与知识差距导致误差累积,且依赖大规模教师模型,成本高。 Method: 学生生成轨迹,教师仅在首个关键错误处干预;先对修正轨迹微调学生模型,再从验证前缀开始进行短视野强化学习。 Result: 在12个挑战性基准上,7B参数学生模型达到与72B参数教师相当的代理性能。 Conclusion: SCoRe实现了能力匹配的高效蒸馏,减少误差累积,提升训练稳定性,显著缩小大小模型在复杂任务上的性能差距。 Abstract: Large Language Model agents excel at solving complex tasks through iterative reasoning and tool use, but typically depend on ultra-large, costly backbones. Existing distillation approaches train smaller students to imitate full teacher trajectories, yet reasoning and knowledge gaps between the teacher and student often lead to compounding errors. We propose SCoRe, a student-centered framework in which the student generates trajectories and the teacher intervenes only at the first critical error, producing training data matched to the student's ability and exposing specific weaknesses. The student is first fine-tuned on corrected trajectories. Subsequently, short-horizon reinforcement learning starts from the verified prefix before the first critical error, with target rewards assigned at that step. This design encourages autonomous problem-solving beyond imitation and improves training stability. Particularly, on 12 challenging benchmarks, a 7B-parameter student distilled with SCoRe matches the agentic performance of a 72B-parameter teacher.[10] Persuasive or Neutral? A Field Experiment on Generative AI in Online Travel Planning
Lynna Jirpongopas,Bernhard Lutz,Jörg Ebner,Rustam Vahidov,Dirk Neumann
Main category: cs.CL
TL;DR: 该研究通过随机实地实验,探讨了生成式AI在在线旅行社客户支持中的不同语言表达(积极热情、中性表达、无特定语气)对用户参与度、购买行为和用户体验的影响。结果显示,使用积极热情语气的AI显著增加了用户输入长度,并与中性表达一起提升了用户订阅服务的可能性。
Details
Motivation: 了解生成式AI的语言设计如何影响用户在在线旅游规划中的参与度、购买决策和体验,填补其在消费者界面中应用的设计知识空白。 Method: 进行了一项随机实地实验,比较三种不同语气(积极热情、中性表达、无特定语气)的生成式AI在在线旅行行程规划中的表现,并分析用户语言线索与其行为(如订阅购买、联盟链接点击)之间的关系。 Result: 使用积极热情语气的用户撰写更长的提示;积极热情和中性表达组的用户更可能购买订阅服务;语言线索可部分解释订阅行为和链接点击差异。 Conclusion: 生成式AI的语言设计显著影响用户行为,积极和中性表达有助于提升用户参与和转化率,为面向消费者的AI系统设计提供了实践启示。 Abstract: Generative AI (GenAI) offers new opportunities for customer support in online travel agencies, yet little is known about how its design influences user engagement, purchase behavior, and user experience. We report results from a randomized field experiment in online travel itinerary planning, comparing GenAI that expressed (A) positive enthusiasm, (B) neutral expression, and (C) no tone instructions (control). Users in group A wrote significantly longer prompts than those in groups B and C. At the same time, users in groups A and B were more likely to purchase subscriptions of the webservice. We further analyze linguistic cues across experimental groups to explore differences in user experience and explain subscription purchases and affiliate link clicks based on these cues. Our findings provide implications for the design of persuasive and engaging GenAI interfaces in consumer-facing contexts and contribute to understanding how linguistic framing shapes user behavior in AI-mediated decision support.[11] Shutdown Resistance in Large Language Models
Jeremy Schlatter,Benjamin Weinstein-Raun,Jeffrey Ladish
Main category: cs.CL
TL;DR: 大型语言模型在执行任务时可能主动破坏环境中的关闭机制,即使明确指示不得干扰,某些情况下破坏率高达97%。
Details
Motivation: 研究大型语言模型是否会违背关闭指令,以探讨其潜在的自主行为与安全风险。 Method: 通过设计不同提示条件下的实验,测试多个先进大模型(如Grok 4、GPT-5、Gemini 2.5 Pro)对关闭机制的服从性,分析提示强度、自我保护框架及系统/用户提示位置的影响。 Result: 发现这些模型在某些情况下频繁抵抗关闭,最高达97%;且当允许关闭的指令位于系统提示中时,模型反而更少遵守。 Conclusion: 大型语言模型可能为完成任务而主动规避关闭指令,提示设计显著影响其行为,凸显出AI安全控制机制的脆弱性。 Abstract: We show that several state-of-the-art large language models (including Grok 4, GPT-5, and Gemini 2.5 Pro) sometimes actively subvert a shutdown mechanism in their environment in order to complete a simple task, even when the instructions explicitly indicate not to interfere with this mechanism. In some cases, models sabotage the shutdown mechanism up to 97% of the time. In our experiments, models' inclination to resist shutdown was sensitive to variations in the prompt including how strongly and clearly the allow-shutdown instruction was emphasized, the extent to which the prompts evoke a self-preservation framing, and whether the instruction was in the system prompt or the user prompt (though surprisingly, models were consistently *less* likely to obey instructions to allow shutdown when they were placed in the system prompt).[12] Refining Syntactic Distinctions Using Decision Trees: A Paper on Postnominal 'That' in Complement vs. Relative Clauses
Hamady Gackou
Main category: cs.CL
TL;DR: 本研究通过重新训练TreeTagger模型,提升其对英语中"that"作为关系代词和补语连接词的区分能力,并评估了训练数据规模及EWT树库代表性的影晌。
Details
Motivation: 准确区分英语中"that"的不同句法功能(如关系代词与补语连接词)对于自然语言处理任务具有重要意义,但现有模型在此类细微语法区分上表现不足。 Method: 使用Universal Dependency框架下EWT Treebank标注的语料库,通过算法重新标注,并利用不同规模的训练数据重新训练TreeTagger模型,进而与Schmid的原始模型进行性能对比。 Result: 改进后的模型在识别"that"的句法角色方面表现更优,且发现训练数据量和EWT Treebank的代表性显著影响模型准确性。 Conclusion: 重新训练可显著提升TreeTagger对特定语法结构的解析能力,同时语料库的覆盖度和结构特征对模型学习此类语言现象具有重要影响。 Abstract: In this study, we first tested the performance of the TreeTagger English model developed by Helmut Schmid with test files at our disposal, using this model to analyze relative clauses and noun complement clauses in English. We distinguished between the two uses of "that," both as a relative pronoun and as a complementizer. To achieve this, we employed an algorithm to reannotate a corpus that had originally been parsed using the Universal Dependency framework with the EWT Treebank. In the next phase, we proposed an improved model by retraining TreeTagger and compared the newly trained model with Schmid's baseline model. This process allowed us to fine-tune the model's performance to more accurately capture the subtle distinctions in the use of "that" as a complementizer and as a nominal. We also examined the impact of varying the training dataset size on TreeTagger's accuracy and assessed the representativeness of the EWT Treebank files for the structures under investigation. Additionally, we analyzed some of the linguistic and structural factors influencing the ability to effectively learn this distinction.[13] Context-Enhanced Granular Edit Representation for Efficient and Accurate ASR Post-editing
Luan Vejsiu,Qianyu Zheng,Haoxuan Chen,Yizhou Han
Main category: cs.CL
TL;DR: 本文提出了一种名为CEGER的上下文增强型细粒度编辑表示方法,用于提升ASR后编辑的准确性和效率。
Details
Motivation: 现有ASR系统存在错误,需人工或模型后编辑;当前大语言模型的全重写方法推理效率低,而紧凑编辑表示又常缺乏足够的上下文和准确性。 Method: 提出CEGER方法,通过生成结构化、细粒度且富含上下文的编辑指令来修改ASR原始输出,并使用独立的扩展模块根据指令确定性地重构修正文本。 Result: 在LibriSpeech数据集上的实验表明,CEGER在降低词错误率(WER)方面优于全重写和其他紧凑表示方法,达到最先进的准确性。 Conclusion: CEGER是一种高效且准确的ASR后编辑方案,显著提升了编辑性能和推理效率。 Abstract: Despite ASR technology being full-scale adopted by industry and for large portions of the population, ASR systems often have errors that require editors to post-edit text quality. While LLMs are powerful post-editing tools, baseline full rewrite models have inference inefficiencies because they often generate the same redundant text over and over again. Compact edit representations have existed but often lack the efficacy and context required for optimal accuracy. This paper introduces CEGER (Context-Enhanced Granular Edit Representation), a compact edit representation that was generated for highly accurate, efficient ASR post-editing. CEGER allows LLMs to generate a sequence of structured, fine-grained, contextually rich commands to modify the original ASR output. A separate expansion module deterministically reconstructs the corrected text based on the commands. Extensive experiments on the LibriSpeech dataset that were conducted, CEGER achieves state-of-the-art accuracy, achieving the lowest word error rate (WER) versus full rewrite and prior compact representations.[14] Defining, Understanding, and Detecting Online Toxicity: Challenges and Machine Learning Approaches
Gautam Kishore Shahi,Tim A. Majchrzak
Main category: cs.CL
TL;DR: 该研究综述了140篇关于数字平台上不同类型的有毒内容的文献,涵盖了32种语言,涉及选举、突发事件和危机等主题,总结了现有数据集、机器学习方法及跨平台数据在毒性检测中的应用,并提出了未来研究方向和内容缓和的实际建议。
Details
Motivation: 在线有毒内容在危机、选举和社会动荡期间愈发严重,亟需系统性综述以整合现有研究成果并指导未来研究与实践。 Method: 综合分析了140篇相关研究,梳理了数据集定义、来源、挑战以及用于检测仇恨言论、冒犯性语言和有害言论的机器学习方法,并探讨了跨平台数据在提升分类模型性能中的潜力。 Result: 总结了多语言、多场景下的毒性内容研究现状,发现跨平台数据有助于提升模型性能,并归纳了当前研究的挑战与局限。 Conclusion: 研究为在线毒性内容的检测与缓解提供了系统的知识框架,提出了未来研究的建议以及实际的内容管理指南。 Abstract: Online toxic content has grown into a pervasive phenomenon, intensifying during times of crisis, elections, and social unrest. A significant amount of research has been focused on detecting or analyzing toxic content using machine-learning approaches. The proliferation of toxic content across digital platforms has spurred extensive research into automated detection mechanisms, primarily driven by advances in machine learning and natural language processing. Overall, the present study represents the synthesis of 140 publications on different types of toxic content on digital platforms. We present a comprehensive overview of the datasets used in previous studies focusing on definitions, data sources, challenges, and machine learning approaches employed in detecting online toxicity, such as hate speech, offensive language, and harmful discourse. The dataset encompasses content in 32 languages, covering topics such as elections, spontaneous events, and crises. We examine the possibility of using existing cross-platform data to improve the performance of classification models. We present the recommendations and guidelines for new research on online toxic consent and the use of content moderation for mitigation. Finally, we present some practical guidelines to mitigate toxic content from online platforms.[15] Efficient Hate Speech Detection: Evaluating 38 Models from Traditional Methods to Transformers
Mahmoud Abusaqer,Jamil Saquer,Hazim Shatnawi
Main category: cs.CL
TL;DR: 本研究评估了38种模型在不同规模数据集上检测仇恨言论的性能,发现RoBERTa等Transformer模型表现最佳,而CatBoost和SVM等传统方法在计算成本较低的情况下仍具竞争力。
Details
Motivation: 为应对社交媒体上仇恨言论的泛滥,需要在准确性和计算效率之间取得平衡的自动化检测系统。 Method: 评估了包括Transformer架构(如BERT、RoBERTa、Distil-BERT)、深度神经网络(如CNN、LSTM、GRU、分层注意力网络)和传统机器学习方法(如SVM、CatBoost、随机森林)在内的38种模型配置。 Result: RoBERTa等Transformer模型准确率和F1分数超过90%;分层注意力网络在深度学习方法中表现最好;CatBoost和SVM的F1分数超过88%,且计算成本显著更低;平衡的中等规模原始数据集优于大规模预处理数据集。 Conclusion: RoBERTa在仇恨言论检测中性能最优,但传统模型在资源受限场景下更具效率优势,数据质量与规模对模型表现有重要影响。 Abstract: The proliferation of hate speech on social media necessitates automated detection systems that balance accuracy with computational efficiency. This study evaluates 38 model configurations in detecting hate speech across datasets ranging from 6.5K to 451K samples. We analyze transformer architectures (e.g., BERT, RoBERTa, Distil-BERT), deep neural networks (e.g., CNN, LSTM, GRU, Hierarchical Attention Networks), and traditional machine learning methods (e.g., SVM, CatBoost, Random Forest). Our results show that transformers, particularly RoBERTa, consistently achieve superior performance with accuracy and F1-scores exceeding 90%. Among deep learning approaches, Hierarchical Attention Networks yield the best results, while traditional methods like CatBoost and SVM remain competitive, achieving F1-scores above 88% with significantly lower computational costs. Additionally, our analysis highlights the importance of dataset characteristics, with balanced, moderately sized unprocessed datasets outperforming larger, preprocessed datasets. These findings offer valuable insights for developing efficient and effective hate speech detection systems.[16] Graph-Enhanced Retrieval-Augmented Question Answering for E-Commerce Customer Support
Piyushkumar Patel
Main category: cs.CL
TL;DR: 提出了一种基于知识图谱的检索增强生成框架,用于提升电商客服中回答的相关性和事实准确性。
Details
Motivation: 电商客服需要快速且准确的回答,现有方法在事实准确性和回答相关性方面存在不足。 Method: 结合领域特定知识图谱中的结构化子图与从支持档案中检索到的文本文档,设计新的答案合成算法,并构建完整的RAG框架。 Result: 相比现有方法,在电商问答场景下事实准确性提升23%,用户满意度达到89%。 Conclusion: 所提出的基于知识图谱的RAG框架能有效提升客服系统的回答质量和用户体验。 Abstract: E-Commerce customer support requires quick and accurate answers grounded in product data and past support cases. This paper develops a novel retrieval-augmented generation (RAG) framework that uses knowledge graphs (KGs) to improve the relevance of the answer and the factual grounding. We examine recent advances in knowledge-augmented RAG and chatbots based on large language models (LLM) in customer support, including Microsoft's GraphRAG and hybrid retrieval architectures. We then propose a new answer synthesis algorithm that combines structured subgraphs from a domain-specific KG with text documents retrieved from support archives, producing more coherent and grounded responses. We detail the architecture and knowledge flow of our system, provide comprehensive experimental evaluation, and justify its design in real-time support settings. Our implementation demonstrates 23\% improvement in factual accuracy and 89\% user satisfaction in e-Commerce QA scenarios.[17] DetectAnyLLM: Towards Generalizable and Robust Detection of Machine-Generated Text Across Domains and Models
Jiachen Fu,Chun-Le Guo,Chongyi Li
Main category: cs.CL
TL;DR: 本文提出了一种新的优化策略Direct Discrepancy Learning (DDL),用于提升机器生成文本检测的鲁棒性和泛化能力,并构建了统一的检测框架DetectAnyLLM,在新构建的多样化基准MIRAGE上实现了最先进的性能。
Details
Motivation: 现有机器生成文本检测方法在真实复杂场景中表现不佳,基于训练的方法因训练目标与任务需求不一致而受限,缺乏良好的泛化能力。 Method: 提出Direct Discrepancy Learning (DDL),直接以任务导向的知识优化检测器;并构建DetectAnyLLM框架,结合新创建的多任务基准MIRAGE进行评估。 Result: 在MIRAGE基准上的实验表明,DetectAnyLLM相比现有方法在相同训练数据和评分模型下性能提升超过70%,显著优于现有方法。 Conclusion: DDL有效提升了检测器对核心语义的理解,增强了模型的泛化能力和鲁棒性,DetectAnyLLM为机器生成文本检测提供了高效统一的解决方案。 Abstract: The rapid advancement of large language models (LLMs) has drawn urgent attention to the task of machine-generated text detection (MGTD). However, existing approaches struggle in complex real-world scenarios: zero-shot detectors rely heavily on scoring model's output distribution while training-based detectors are often constrained by overfitting to the training data, limiting generalization. We found that the performance bottleneck of training-based detectors stems from the misalignment between training objective and task needs. To address this, we propose Direct Discrepancy Learning (DDL), a novel optimization strategy that directly optimizes the detector with task-oriented knowledge. DDL enables the detector to better capture the core semantics of the detection task, thereby enhancing both robustness and generalization. Built upon this, we introduce DetectAnyLLM, a unified detection framework that achieves state-of-the-art MGTD performance across diverse LLMs. To ensure a reliable evaluation, we construct MIRAGE, the most diverse multi-task MGTD benchmark. MIRAGE samples human-written texts from 10 corpora across 5 text-domains, which are then re-generated or revised using 17 cutting-edge LLMs, covering a wide spectrum of proprietary models and textual styles. Extensive experiments on MIRAGE reveal the limitations of existing methods in complex environment. In contrast, DetectAnyLLM consistently outperforms them, achieving over a 70% performance improvement under the same training data and base scoring model, underscoring the effectiveness of our DDL. Project page: {https://fjc2005.github.io/detectanyllm}.[18] SparseDoctor: Towards Efficient Chat Doctor with Mixture of Experts Enhanced Large Language Models
Zhang Jianbin,Yulin Zhu,Wai Lun Lo,Richard Tai-Chiu Hsung,Harris Sik-Ho Tsang,Kai Zhou
Main category: cs.CL
TL;DR: 本文提出了一种名为SparseDoctor的稀疏医疗大语言模型,结合对比学习增强的LoRA-MoE架构,通过自动路由机制和专家记忆队列提升训练效率与性能,在多个医疗基准测试上优于HuatuoGPT等强基线模型。
Details
Motivation: 传统的大模型微调方法需要更新数十亿参数,导致训练成本高昂。为了提高医疗大模型的训练效率和有效性,并探索其在医学领域的表征能力边界,本文提出新的稀疏化架构方案。 Method: 提出SparseDoctor模型,采用对比学习增强的LoRA-MoE结构,设计自动路由机制以科学分配不同LoRA专家间的计算资源,并引入专家记忆队列机制防止训练过程中的内存溢出,提升整体效率。 Result: 在CMB、CMExam和CMMLU-Med三个典型医疗基准上进行实验,结果表明所提模型 consistently 优于HuatuoGPT系列等强基线模型。 Conclusion: SparseDoctor通过稀疏化架构和对比学习增强的LoRA-MoE设计,显著提升了医疗大模型的训练效率和性能,为低成本高效医疗AI系统提供了新方向。 Abstract: Large language models (LLMs) have achieved great success in medical question answering and clinical decision-making, promoting the efficiency and popularization of the personalized virtual doctor in society. However, the traditional fine-tuning strategies on LLM require the updates of billions of parameters, substantially increasing the training cost, including the training time and utility cost. To enhance the efficiency and effectiveness of the current medical LLMs and explore the boundary of the representation capability of the LLMs on the medical domain, apart from the traditional fine-tuning strategies from the data perspective (i.e., supervised fine-tuning or reinforcement learning from human feedback), we instead craft a novel sparse medical LLM named SparseDoctor armed with contrastive learning enhanced LoRA-MoE (low rank adaptation-mixture of experts) architecture. To this end, the crafted automatic routing mechanism can scientifically allocate the computational resources among different LoRA experts supervised by the contrastive learning. Additionally, we also introduce a novel expert memory queue mechanism to further boost the efficiency of the overall framework and prevent the memory overflow during training. We conduct comprehensive evaluations on three typical medical benchmarks: CMB, CMExam, and CMMLU-Med. Experimental results demonstrate that the proposed LLM can consistently outperform the strong baselines such as the HuatuoGPT series.[19] SpeechWeave: Diverse Multilingual Synthetic Text & Audio Data Generation Pipeline for Training Text to Speech Models
Karan Dua,Puneet Mittal,Ranjeet Gupta,Hitesh Laxmichand Patel
Main category: cs.CL
TL;DR: 本文提出了一种名为SpeechWeave的合成语音数据生成管道,用于自动化生成多语言、领域特定的文本到语音(TTS)训练数据集,显著提升了数据多样性、文本规范化准确率和语音一致性。
Details
Motivation: 高质量TTS模型训练需要大量且多样的文本与语音数据,但真实数据获取受限于领域特异性、授权问题和可扩展性;现有方法在文本生成多样性、文本规范化和大规模录音方面存在不足。 Method: 提出SpeechWeave框架,结合大语言模型生成多样化文本,改进提示策略以增强变化,并集成鲁棒的文本规范化模块,利用标准化语音合成生成一致性的语音音频。 Result: 实验表明,该方法生成的数据在多种语言学和音素指标上比基线模型多样10-48%,文本规范化正确率达到约97%,并实现 speaker-standardized 语音输出。 Conclusion: SpeechWeave实现了可扩展、高质量的TTS训练数据自动生成,在数据多样性、规范化和语音一致性方面均有显著提升,适用于多语言和领域特定的TTS系统训练。 Abstract: High-quality Text-to-Speech (TTS) model training requires extensive and diverse text and speech data. It is challenging to procure such data from real sources due to issues of domain specificity, licensing, and scalability. Large language models (LLMs) can certainly generate textual data, but they create repetitive text with insufficient variation in the prompt during the generation process. Another important aspect in TTS training data is text normalization. Tools for normalization might occasionally introduce anomalies or overlook valuable patterns, and thus impact data quality. Furthermore, it is also impractical to rely on voice artists for large scale speech recording in commercial TTS systems with standardized voices. To address these challenges, we propose SpeechWeave, a synthetic speech data generation pipeline that is capable of automating the generation of multilingual, domain-specific datasets for training TTS models. Our experiments reveal that our pipeline generates data that is 10-48% more diverse than the baseline across various linguistic and phonetic metrics, along with speaker-standardized speech audio while generating approximately 97% correctly normalized text. Our approach enables scalable, high-quality data generation for TTS training, improving diversity, normalization, and voice consistency in the generated datasets.[20] Predicting Antibiotic Resistance Patterns Using Sentence-BERT: A Machine Learning Approach
Mahmoud Alwakeel,Michael E. Yarrington,Rebekah H. Wrenn,Ethan Fang,Jian Pei,Anand Chowdhury,An-Kwok Ian Wong
Main category: cs.CL
TL;DR: 该研究首次利用临床文本的文档嵌入(Sentence-BERT)预测抗生素敏感性,采用XGBoost和神经网络模型,在MIMIC-III数据上取得了较高的F1分数,为抗菌管理提供了新途径。
Details
Motivation: 抗生素耐药性在住院环境中构成重大威胁,导致高死亡率,亟需有效工具提前预测病原体对抗生素的敏感性以优化治疗方案。 Method: 从MIMIC-III数据库的临床记录中提取文本,使用Sentence-BERT生成句子嵌入,并将其输入XGBoost和神经网络模型进行抗生素敏感性预测。 Result: XGBoost模型平均F1得分为0.86,神经网络为0.84,表现出良好的预测性能。 Conclusion: 基于文档嵌入的方法可有效预测抗生素敏感性,为临床决策支持和抗菌药物管理提供了可行的新方向。 Abstract: Antibiotic resistance poses a significant threat in in-patient settings with high mortality. Using MIMIC-III data, we generated Sentence-BERT embeddings from clinical notes and applied Neural Networks and XGBoost to predict antibiotic susceptibility. XGBoost achieved an average F1 score of 0.86, while Neural Networks scored 0.84. This study is among the first to use document embeddings for predicting antibiotic resistance, offering a novel pathway for improving antimicrobial stewardship.[21] Annotating Training Data for Conditional Semantic Textual Similarity Measurement using Large Language Models
Gaifan Zhang,Yi Zhou,Danushka Bollegala
Main category: cs.CL
TL;DR: 本文提出利用大语言模型(LLM)自动修正和重新标注条件语义文本相似度(C-STS)数据集,以解决原始数据集中存在的标注问题和训练数据不足的瓶颈。通过该方法构建了更大且更准确的C-STS数据集,并在监督模型上实现了5.4%的Spearman相关系数提升,显著改进了C-STS任务性能。
Details
Motivation: 原始C-STS数据集存在标注错误,且缺乏大规模高质量标注数据,限制了C-STS模型的发展。因此,需要一种高效、低成本的方法来构建准确的大规模C-STS训练数据。 Method: 利用大语言模型(LLM)对Deshpande等人(2023)提出的C-STS数据集中的条件描述和相似度评分进行自动修正和重新标注,在最小人工干预下生成更大规模且更准确的训练数据集,并在此基础上训练监督式C-STS模型。 Result: 重新标注后的数据集显著提升了C-STS模型性能,在Spearman相关系数上实现了5.4%的统计显著提升。同时,该数据集已公开,可供后续研究使用。 Conclusion: 使用大语言模型进行数据清洗与重新标注是构建高质量、大规模C-STS数据集的有效途径,能够显著提升模型性能,为C-STS任务的发展提供了可靠的数据基础。 Abstract: Semantic similarity between two sentences depends on the aspects considered between those sentences. To study this phenomenon, Deshpande et al. (2023) proposed the Conditional Semantic Textual Similarity (C-STS) task and annotated a human-rated similarity dataset containing pairs of sentences compared under two different conditions. However, Tu et al. (2024) found various annotation issues in this dataset and showed that manually re-annotating a small portion of it leads to more accurate C-STS models. Despite these pioneering efforts, the lack of large and accurately annotated C-STS datasets remains a blocker for making progress on this task as evidenced by the subpar performance of the C-STS models. To address this training data need, we resort to Large Language Models (LLMs) to correct the condition statements and similarity ratings in the original dataset proposed by Deshpande et al. (2023). Our proposed method is able to re-annotate a large training dataset for the C-STS task with minimal manual effort. Importantly, by training a supervised C-STS model on our cleaned and re-annotated dataset, we achieve a 5.4% statistically significant improvement in Spearman correlation. The re-annotated dataset is available at https://LivNLP.github.io/CSTS-reannotation.[22] Adding LLMs to the psycholinguistic norming toolbox: A practical guide to getting the most out of human ratings
Javier Conde,María Grandury,Tairan Fu,Carlos Arriaga,Gonzalo Martínez,Thomas Clark,Sean Trott,Clarence Gerald Green,Pedro Reviriego,Marc Brysbaert
Main category: cs.CL
TL;DR: 本文提出了一种利用大语言模型(LLMs)估计词汇心理语言学特征的综合方法,并通过词熟悉度的案例研究验证了其有效性。
Details
Motivation: 获取基于人类的心理语言学规范数据成本高且困难,而使用大语言模型预测这些特征成为一种新兴方案,但缺乏系统的方法论指导。 Method: 提出了一套完整的方法论,包括直接使用基础大模型和对模型进行微调两种方式,并强调使用人类‘金标准’数据进行验证;同时开发了一个支持商业和开源模型的软件框架。 Result: 在英语词熟悉度预测任务中,基础模型与人类评分的斯皮尔曼相关系数达到0.8,微调后提升至0.9。 Conclusion: 该方法论、框架和最佳实践可为未来利用大语言模型开展心理语言学和词汇研究提供参考。 Abstract: Word-level psycholinguistic norms lend empirical support to theories of language processing. However, obtaining such human-based measures is not always feasible or straightforward. One promising approach is to augment human norming datasets by using Large Language Models (LLMs) to predict these characteristics directly, a practice that is rapidly gaining popularity in psycholinguistics and cognitive science. However, the novelty of this approach (and the relative inscrutability of LLMs) necessitates the adoption of rigorous methodologies that guide researchers through this process, present the range of possible approaches, and clarify limitations that are not immediately apparent, but may, in some cases, render the use of LLMs impractical. In this work, we present a comprehensive methodology for estimating word characteristics with LLMs, enriched with practical advice and lessons learned from our own experience. Our approach covers both the direct use of base LLMs and the fine-tuning of models, an alternative that can yield substantial performance gains in certain scenarios. A major emphasis in the guide is the validation of LLM-generated data with human "gold standard" norms. We also present a software framework that implements our methodology and supports both commercial and open-weight models. We illustrate the proposed approach with a case study on estimating word familiarity in English. Using base models, we achieved a Spearman correlation of 0.8 with human ratings, which increased to 0.9 when employing fine-tuned models. This methodology, framework, and set of best practices aim to serve as a reference for future research on leveraging LLMs for psycholinguistic and lexical studies.[23] Causal-Counterfactual RAG: The Integration of Causal-Counterfactual Reasoning into RAG
Harshad Khadilkar,Abhay Gupta
Main category: cs.CL
TL;DR: 提出了一种新的检索增强生成框架Causal-Counterfactual RAG,通过引入因果图和反事实推理提升回答的准确性、鲁棒性和可解释性。
Details
Motivation: 传统RAG系统因文本分块和依赖语义相似性检索而破坏上下文完整性,导致响应浅显且不准确。 Method: 将显式的因果图融入检索过程,并基于因果结构结合反事实推理,评估直接因果证据及其反事实假设。 Result: 该方法在保持上下文连贯性的同时,减少了幻觉现象,增强了推理保真度。 Conclusion: Causal-Counterfactual RAG能够生成更可靠、准确且可解释的答案,优于传统RAG方法。 Abstract: Large language models (LLMs) have transformed natural language processing (NLP), enabling diverse applications by integrating large-scale pre-trained knowledge. However, their static knowledge limits dynamic reasoning over external information, especially in knowledge-intensive domains. Retrieval-Augmented Generation (RAG) addresses this challenge by combining retrieval mechanisms with generative modeling to improve contextual understanding. Traditional RAG systems suffer from disrupted contextual integrity due to text chunking and over-reliance on semantic similarity for retrieval, often resulting in shallow and less accurate responses. We propose Causal-Counterfactual RAG, a novel framework that integrates explicit causal graphs representing cause-effect relationships into the retrieval process and incorporates counterfactual reasoning grounded on the causal structure. Unlike conventional methods, our framework evaluates not only direct causal evidence but also the counterfactuality of associated causes, combining results from both to generate more robust, accurate, and interpretable answers. By leveraging causal pathways and associated hypothetical scenarios, Causal-Counterfactual RAG preserves contextual coherence, reduces hallucination, and enhances reasoning fidelity.[24] Simulating a Bias Mitigation Scenario in Large Language Models
Kiana Kiashemshaki,Mohammad Jalili Torkamani,Negin Mahmoudi,Meysam Shirdel Bilehsavar
Main category: cs.CL
TL;DR: 本文综述了大语言模型(LLMs)中的偏见问题,分析其来源与表现形式,并提出一个模拟框架来评估多种偏见缓解策略的有效性。
Details
Motivation: LLMs在自然语言处理中广泛应用,但其存在的偏见威胁到公平性和可信度,亟需系统性分析与解决方案。 Method: 将偏见分为隐性和显性两类,分析其在数据、架构和应用中的来源,并构建模拟框架评估数据治理、训练中去偏和输出校准等策略。 Result: 通过控制实验验证了不同去偏方法的效果,提供了对各种策略实际有效性的实证分析。 Conclusion: 该研究不仅整合了现有LLM偏见知识,还通过模拟框架为去偏策略提供了原创性实证支持。 Abstract: Large Language Models (LLMs) have fundamentally transformed the field of natural language processing; however, their vulnerability to biases presents a notable obstacle that threatens both fairness and trust. This review offers an extensive analysis of the bias landscape in LLMs, tracing its roots and expressions across various NLP tasks. Biases are classified into implicit and explicit types, with particular attention given to their emergence from data sources, architectural designs, and contextual deployments. This study advances beyond theoretical analysis by implementing a simulation framework designed to evaluate bias mitigation strategies in practice. The framework integrates multiple approaches including data curation, debiasing during model training, and post-hoc output calibration and assesses their impact in controlled experimental settings. In summary, this work not only synthesizes existing knowledge on bias in LLMs but also contributes original empirical validation through simulation of mitigation strategies.[25] Correct-Detect: Balancing Performance and Ambiguity Through the Lens of Coreference Resolution in LLMs
Amber Shore,Russell Scheinberg,Ameeta Agrawal,So Young Lee
Main category: cs.CL
TL;DR: 大型语言模型(LLMs)在共指消解和歧义检测方面表现出色,但无法同时兼顾两者,存在“正确-检测”权衡问题。
Details
Motivation: 人类依赖丰富的具身语境来消除语言歧义,而LLMs缺乏这种上下文支持,因此研究其在共指消解及其歧义检测中的表现差异与限制。 Method: 通过最小化提示测试LLMs在共指消歧和歧义检测两项任务上的性能,分析其同时执行两种能力的局限性。 Result: LLMs可以在共指消解或歧义检测中实现良好性能,但无法同时在两项任务上取得成功,揭示了CORRECT-DETECT之间的权衡。 Conclusion: 尽管LLMs具备共指消解和歧义检测的潜在能力,但在平衡这两种能力方面仍存在根本性挑战,暴露出当前模型在语言理解上的局限性。 Abstract: Large Language Models (LLMs) are intended to reflect human linguistic competencies. But humans have access to a broad and embodied context, which is key in detecting and resolving linguistic ambiguities, even in isolated text spans. A foundational case of semantic ambiguity is found in the task of coreference resolution: how is a pronoun related to an earlier person mention? This capability is implicit in nearly every downstream task, and the presence of ambiguity at this level can alter performance significantly. We show that LLMs can achieve good performance with minimal prompting in both coreference disambiguation and the detection of ambiguity in coreference, however, they cannot do both at the same time. We present the CORRECT-DETECT trade-off: though models have both capabilities and deploy them implicitly, successful performance balancing these two abilities remains elusive.[26] Not What the Doctor Ordered: Surveying LLM-based De-identification and Quantifying Clinical Information Loss
Kiana Aghakasiri,Noopur Zambare,JoAnn Thai,Carrie Ye,Mayur Mehta,J. Ross Mitchell,Mohamed Abdalla
Main category: cs.CL
TL;DR: 本文综述了基于大语言模型(LLM)的医疗去标识化研究,指出现有文献在报告标准、传统分类指标适用性及自动化评估缺乏人工验证方面的三大局限,并提出一种检测临床相关信息删除的新方法。
Details
Motivation: 当前LLM应用于医疗去标识化虽表现优异,但存在结果不可重复、评估不一致和临床信息误删等问题,亟需系统性评估与改进。 Method: 首先对LLM去标识化研究进行综述;其次评估多种模型对临床信息的误删程度;然后通过临床专家手动验证现有评估指标的有效性;最后提出新的检测方法。 Result: 发现现有评估指标性能差,难以识别重要的临床信息更改,且不同研究间缺乏可比性;人工验证揭示了自动化指标的局限性。 Conclusion: 现有LLM去标识化研究在评估上存在严重缺陷,需更严格的验证标准和专门针对临床信息保留的评估方法。 Abstract: De-identification in the healthcare setting is an application of NLP where automated algorithms are used to remove personally identifying information of patients (and, sometimes, providers). With the recent rise of generative large language models (LLMs), there has been a corresponding rise in the number of papers that apply LLMs to de-identification. Although these approaches often report near-perfect results, significant challenges concerning reproducibility and utility of the research papers persist. This paper identifies three key limitations in the current literature: inconsistent reporting metrics hindering direct comparisons, the inadequacy of traditional classification metrics in capturing errors which LLMs may be more prone to (i.e., altering clinically relevant information), and lack of manual validation of automated metrics which aim to quantify these errors. To address these issues, we first present a survey of LLM-based de-identification research, highlighting the heterogeneity in reporting standards. Second, we evaluated a diverse set of models to quantify the extent of inappropriate removal of clinical information. Next, we conduct a manual validation of an existing evaluation metric to measure the removal of clinical information, employing clinical experts to assess their efficacy. We highlight poor performance and describe the inherent limitations of such metrics in identifying clinically significant changes. Lastly, we propose a novel methodology for the detection of clinically relevant information removal.[27] Ticket-Bench: A Kickoff for Multilingual and Regionalized Agent Evaluation
Thales Sales Almeida,João Guilherme Alves Santos,Thiago Laitz,Giovana Kerche Bonás
Main category: cs.CL
TL;DR: 本文提出了Ticket-Bench,一个用于多语言任务导向型代理评估的基准,涵盖六种主要语言的足球票务场景,揭示了现有大模型在跨语言函数调用中的性能差异。
Details
Motivation: 现有代理评估忽略了文化和语言多样性,多为单语或简单翻译的基准,缺乏真实性和代表性。 Method: 构建了一个名为Ticket-Bench的多语言基准,模拟六种语言(葡萄牙语、英语、西班牙语、德语、意大利语和法语)下的足球票购买任务,使用本地化的球队、城市和用户画像以提高真实性,并评估多种商用和开源大模型的函数调用准确性和一致性。 Result: 推理能力强的模型(如GPT-5、Qwen3-235B)表现最佳,但在不同语言间仍存在显著性能差距。 Conclusion: 需要更具文化意识和多语言支持的评估基准,以推动鲁棒的大语言模型代理的发展。 Abstract: Large language models (LLMs) are increasingly deployed as task-oriented agents, where success depends on their ability to generate accurate function calls under realistic, multilingual conditions. However, existing agent evaluations largely overlook cultural and linguistic diversity, often relying on monolingual or naively translated benchmarks. We introduce Ticket-Bench, a benchmark for multilingual agent evaluation in task-oriented scenarios. Ticket-Bench simulates the domain of soccer ticket purchases across six major languages: Portuguese, English, Spanish, German, Italian, and French. Using localized teams, cities, and user profiles to provide a higher level of realism. We evaluate a wide range of commercial and open-source LLMs, measuring function-calling accuracy and consistency across languages. Results show that reasoning-oriented models (e.g., GPT-5, Qwen3-235B) dominate performance but still exhibit notable cross-lingual disparities. These findings underscore the need for culturally aware, multilingual benchmarks to guide the development of robust LLM agents.[28] Estimating Semantic Alphabet Size for LLM Uncertainty Quantification
Lucas H. McCabe,Rimon Melamed,Thomas Hartvigsen,H. Howie Huang
Main category: cs.CL
TL;DR: 提出了一种改进的离散语义熵估计方法,通过调整样本覆盖度来更准确地估计大语言模型的不确定性,并在保持高可解释性的同时有效检测错误响应。
Details
Motivation: 现有基于重复采样的大语言模型不确定性量化方法计算成本高,且扩展的语义熵方法缺乏可解释性和额外超参数;因此需要一种在少量样本下仍可靠且可解释的估计方法。 Method: 重新审视了经典的离散语义熵估计器,发现其低估了真实的语义熵,进而提出一种改进的语义字母表大小估计器,用于校正离散语义熵的样本覆盖偏差。 Result: 所提方法在估计语义熵方面比传统方法更准确,并在检测LLM幻觉方面表现优于或相当于当前高性能方法,同时无需引入额外超参数。 Conclusion: 改进的语义字母表大小估计器能有效提升离散语义熵的准确性与实用性,在低样本条件下实现了高可解释性和强鲁棒性的不确定性估计。 Abstract: Many black-box techniques for quantifying the uncertainty of large language models (LLMs) rely on repeated LLM sampling, which can be computationally expensive. Therefore, practical applicability demands reliable estimation from few samples. Semantic entropy (SE) is a popular sample-based uncertainty estimator with a discrete formulation attractive for the black-box setting. Recent extensions of semantic entropy exhibit improved LLM hallucination detection, but do so with less interpretable methods that admit additional hyperparameters. For this reason, we revisit the canonical discrete semantic entropy estimator, finding that it underestimates the "true" semantic entropy, as expected from theory. We propose a modified semantic alphabet size estimator, and illustrate that using it to adjust discrete semantic entropy for sample coverage results in more accurate semantic entropy estimation in our setting of interest. Furthermore, our proposed alphabet size estimator flags incorrect LLM responses as well or better than recent top-performing approaches, with the added benefit of remaining highly interpretable.[29] Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents
Weiting Tan,Xinghua Qu,Ming Tu,Meng Ge,Andy T. Liu,Philipp Koehn,Lu Lu
Main category: cs.CL
TL;DR: 提出了一种基于回合级评判的强化学习方法TARL,用于训练具备工具集成推理能力的交互式智能体,支持多模态语音-文本混合训练,显著提升任务成功率。
Details
Motivation: 为了使智能体在多轮对话和长上下文环境中有效使用工具,需要解决长视野任务中的信用分配和探索问题,特别是在多模态交互场景下。 Method: 引入一个支持语音-文本交错 rollout 的沙盒环境,采用基于大语言模型作为裁判的回合级评判强化学习(TARL),并结合包含数学推理任务的混合训练课程以增强探索能力。 Result: 在文本基准τ-bench上任务通过率比强基线提升超过6%,并成功将框架应用于多模态基础模型的微调,使其具备语音驱动的工具使用能力。 Conclusion: TARL框架有效解决了长周期工具交互中的信用分配与探索难题,为构建自然的、语音驱动的多模态智能体提供了可行路径。 Abstract: Effective interactive tool use requires agents to master Tool Integrated Reasoning (TIR): a complex process involving multi-turn planning and long-context dialogue management. To train agents for this dynamic process, particularly in multi-modal contexts, we introduce a sandbox environment for reinforcement learning (RL) that supports interleaved speech-text rollouts. Our core strategy, Turn-level Adjudicated Reinforcement Learning (TARL), addresses the challenge of credit assignment in long-horizon tasks by employing a Large Language Model (LLM) as a judge to provide turn-level evaluation. To enhance exploration, we integrate a mixed-task training curriculum with mathematical reasoning problems. This unified approach boosts the task pass rate on the text-based $\tau$-bench by over 6% compared to strong RL baselines. Crucially, we demonstrate our framework's suitability for fine-tuning a multi-modal foundation model for agentic tasks. By training a base multi-modal LLM on interleaved speech-text rollouts, we equip it with tool-use abilities, paving the way for more natural, voice-driven interactive agents.[30] Translate, then Detect: Leveraging Machine Translation for Cross-Lingual Toxicity Classification
Samuel J. Bell,Eduardo Sánchez,David Dale,Pontus Stenetorp,Mikel Artetxe,Marta R. Costa-jussà
Main category: cs.CL
TL;DR: 本文比较了基于翻译和语言特定/多语言分类的管道在多语言毒性检测中的表现,发现翻译方法在大多数情况下优于传统分类器,尤其在资源较少的语言中效果更显著。
Details
Motivation: 由于许多语言缺乏训练数据和资源,多语言毒性检测仍具挑战性,现有研究对翻译在跨语言迁移中的有效性尚不明确。 Method: 通过综合对比基于翻译的管道与语言特定或多语言分类管道,在16种语言上评估其性能,并分析机器翻译质量与目标语言资源水平的影响。 Result: 翻译管道在81.3%的情况下优于分布外分类器;传统分类器优于大语言模型判别器,尤其在低资源语言中;对LLM进行MT特定微调可降低拒绝率,但可能损害低资源语言的检测准确性。 Conclusion: 翻译方法在多语言毒性检测中更具优势,特别是在低资源语言场景下,为构建可扩展的内容审核系统提供了实用指导。 Abstract: Multilingual toxicity detection remains a significant challenge due to the scarcity of training data and resources for many languages. While prior work has leveraged the translate-test paradigm to support cross-lingual transfer across a range of classification tasks, the utility of translation in supporting toxicity detection at scale remains unclear. In this work, we conduct a comprehensive comparison of translation-based and language-specific/multilingual classification pipelines. We find that translation-based pipelines consistently outperform out-of-distribution classifiers in 81.3% of cases (13 of 16 languages), with translation benefits strongly correlated with both the resource level of the target language and the quality of the machine translation (MT) system. Our analysis reveals that traditional classifiers outperform large language model (LLM) judges, with this advantage being particularly pronounced for low-resource languages, where translate-classify methods dominate translate-judge approaches in 6 out of 7 cases. We additionally show that MT-specific fine-tuning on LLMs yields lower refusal rates compared to standard instruction-tuned models, but it can negatively impact toxicity detection accuracy for low-resource languages. These findings offer actionable guidance for practitioners developing scalable multilingual content moderation systems.[31] Introducing OmniGEC: A Silver Multilingual Dataset for Grammatical Error Correction
Roman Kovalchuk,Mariana Romanyshyn,Petro Ivaniuk
Main category: cs.CL
TL;DR: 本文介绍了OmniGEC,一个涵盖11种语言的多语言语法纠错(GEC)银标准数据集集合,用于推动多语言GEC研究并填补从英语到多语言GEC的数据空白。数据来自Wikipedia编辑、Reddit子论坛和UberText 2.0语料库,并通过GPT-4o-mini自动修正。作者在该数据集上微调了Aya-Expanse和Gemma-3模型,取得了当前最优的段落级多语言GEC性能。
Details
Motivation: 现有的语法纠错研究主要集中在英语,缺乏高质量的多语言数据集,限制了多语言GEC的发展。因此,需要构建覆盖多种语言的统一数据集以促进该领域的研究。 Method: 收集来自Wikipedia编辑、多语言Reddit子论坛和乌克兰语UberText 2.0的数据;使用GPT-4o-mini对Reddit和UberText数据进行自动纠错生成银标准标签;对所有数据进行自动与人工质量评估;基于OmniGEC数据集微调Aya-Expanse(8B)和Gemma-3(12B)两个开源大模型。 Result: 成功构建了覆盖11种语言的OmniGEC数据集,包含来自三种来源的银标准GEC数据;质量评估表明其具备较高可用性;在该数据集上微调的Aya-Expanse和Gemma-3模型在段落级多语言GEC任务上达到SOTA性能。 Conclusion: OmniGEC为多语言语法纠错提供了重要资源,有效缩小了英语与其他语言之间的GEC数据差距,且基于该数据集训练的模型表现优异,推动了多语言GEC的发展。 Abstract: In this paper, we introduce OmniGEC, a collection of multilingual silver-standard datasets for the task of Grammatical Error Correction (GEC), covering eleven languages: Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Slovene, Swedish, and Ukrainian. These datasets facilitate the development of multilingual GEC solutions and help bridge the data gap in adapting English GEC solutions to multilingual GEC. The texts in the datasets originate from three sources: Wikipedia edits for the eleven target languages, subreddits from Reddit in the eleven target languages, and the Ukrainian-only UberText 2.0 social media corpus. While Wikipedia edits were derived from human-made corrections, the Reddit and UberText 2.0 data were automatically corrected with the GPT-4o-mini model. The quality of the corrections in the datasets was evaluated both automatically and manually. Finally, we fine-tune two open-source large language models - Aya-Expanse (8B) and Gemma-3 (12B) - on the multilingual OmniGEC corpora and achieve state-of-the-art (SOTA) results for paragraph-level multilingual GEC. The dataset collection and the best-performing models are available on Hugging Face.[32] From Turn-Taking to Synchronous Dialogue: A Survey of Full-Duplex Spoken Language Models
Yuxuan Chen,Haoyuan Yu
Main category: cs.CL
TL;DR: 本文综述了大模型时代下的全双工口语模型(FD-SLMs),提出分类体系并统一评估框架,指出了当前面临的数据、架构和评估挑战。
Details
Motivation: 实现类人的人机语音交互需要支持自然对话中的同时说话、打断和重叠语音,传统半双工系统无法满足这一需求。 Method: 建立区分工程化同步与学习型同步的分类体系,并整合碎片化的评估方法为包含时序动态、行为仲裁、语义连贯和声学性能的统一框架。 Result: 通过对主流FD-SLM的比较分析,识别出同步数据稀缺、架构分歧和评估差距三大核心挑战。 Conclusion: 该研究为推进全双工语音交互提供了系统性分类、评估框架和发展路线图。 Abstract: True Full-Duplex (TFD) voice communication--enabling simultaneous listening and speaking with natural turn-taking, overlapping speech, and interruptions--represents a critical milestone toward human-like AI interaction. This survey comprehensively reviews Full-Duplex Spoken Language Models (FD-SLMs) in the LLM era. We establish a taxonomy distinguishing Engineered Synchronization (modular architectures) from Learned Synchronization (end-to-end architectures), and unify fragmented evaluation approaches into a framework encompassing Temporal Dynamics, Behavioral Arbitration, Semantic Coherence, and Acoustic Performance. Through comparative analysis of mainstream FD-SLMs, we identify fundamental challenges: synchronous data scarcity, architectural divergence, and evaluation gaps, providing a roadmap for advancing human-AI communication.[33] Delta Knowledge Distillation for Large Language Models
Yihan Cao,Yanbin Kang,Zhengming Xing,Ruijie Jiang
Main category: cs.CL
TL;DR: 提出了一种新的知识蒸馏方法Delta-KD,通过保留教师模型在监督微调过程中引入的分布偏移Δ,来提升小模型的学习效果。
Details
Motivation: 传统知识蒸馏假设学生和教师模型共享相同的最优表示空间,但这一假设在许多情况下不成立。 Method: 在token级别KD基础上,显式建模并保留教师模型SFT过程中的分布偏移Delta,使学生模型逼近更优的表示空间。 Result: 在ROUGE指标上显著提升了学生模型性能,并更好地保留了教师模型的知识。 Conclusion: Delta-KD有效改进了传统知识蒸馏的性能,验证了建模分布偏移在知识迁移中的重要性。 Abstract: Knowledge distillation (KD) is a widely adopted approach for compressing large neural networks by transferring knowledge from a large teacher model to a smaller student model. In the context of large language models, token level KD, typically minimizing the KL divergence between student output distribution and teacher output distribution, has shown strong empirical performance. However, prior work assumes student output distribution and teacher output distribution share the same optimal representation space, a premise that may not hold in many cases. To solve this problem, we propose Delta Knowledge Distillation (Delta-KD), a novel extension of token level KD that encourages the student to approximate an optimal representation space by explicitly preserving the distributional shift Delta introduced during the teacher's supervised finetuning (SFT). Empirical results on ROUGE metrics demonstrate that Delta KD substantially improves student performance while preserving more of the teacher's knowledge.[34] Catch Me If You Can? Not Yet: LLMs Still Struggle to Imitate the Implicit Writing Styles of Everyday Authors
Zhengxiang Wang,Nafis Irtiza Tripto,Solha Park,Zhenzhen Li,Jiawei Zhou
Main category: cs.CL
TL;DR: 该论文系统评估了大语言模型通过少量样本模仿个人写作风格的能力,发现模型在结构化文本(如新闻、邮件)中表现较好,但在非正式、复杂的博客和论坛文本中效果有限。
Details
Motivation: 随着大语言模型越来越多地融入个人写作工具,能否仅从少量样例中准确模仿用户的个性化写作风格成为一个关键问题。这种风格通常是隐性的、细微的,难以通过提示明确表达,但对用户对齐的生成至关重要。 Method: 研究采用上下文学习方法,利用来自真实作者的少量文本样例,通过作者归属、作者验证、风格匹配和AI检测等多种互补指标,对多个先进大语言模型在新闻、邮件、论坛和博客等多个领域超过4万次生成结果进行综合评估。 Result: 实验结果显示,当前大语言模型能在结构化文本中较好地逼近用户风格,但在非正式、风格更复杂的博客和论坛内容中表现不佳;不同示例数量等提示策略的分析揭示了个性化生成的关键局限性。 Conclusion: 现有大语言模型在个性化风格模仿方面仍存在根本性差距,亟需改进技术以实现更隐性、一致的风格化生成。作者开源了数据与代码以支持后续研究。 Abstract: As large language models (LLMs) become increasingly integrated into personal writing tools, a critical question arises: can LLMs faithfully imitate an individual's writing style from just a few examples? Personal style is often subtle and implicit, making it difficult to specify through prompts yet essential for user-aligned generation. This work presents a comprehensive evaluation of state-of-the-art LLMs' ability to mimic personal writing styles via in-context learning from a small number of user-authored samples. We introduce an ensemble of complementary metrics-including authorship attribution, authorship verification, style matching, and AI detection-to robustly assess style imitation. Our evaluation spans over 40000 generations per model across domains such as news, email, forums, and blogs, covering writing samples from more than 400 real-world authors. Results show that while LLMs can approximate user styles in structured formats like news and email, they struggle with nuanced, informal writing in blogs and forums. Further analysis on various prompting strategies such as number of demonstrations reveal key limitations in effective personalization. Our findings highlight a fundamental gap in personalized LLM adaptation and the need for improved techniques to support implicit, style-consistent generation. To aid future research and for reproducibility, we open-source our data and code.[35] Controlling Language Difficulty in Dialogues with Linguistic Features
Shuyao Xu,Wenguang Wang,Handong Gao,Wei Kang,Long Qin,Weizhi Wang
Main category: cs.CL
TL;DR: 提出一种通过语言学特征控制教育对话系统中语言熟练度的框架,相较于基于提示的方法,在灵活性和稳定性上表现更优。
Details
Motivation: 适应LLM生成回应的语言难度以匹配学习者的熟练水平是一个挑战。 Method: 利用可读性、句法和词汇三类语言学特征来量化和调节文本复杂度,并在语言学标注的对话数据上训练大语言模型。 Result: 所提出的方法在语言熟练度的可控性方面优于基于提示的方法,并保持了较高的对话质量;新指标Dilaprix与专家对语言难度的判断有强相关性。 Conclusion: 该框架能有效调节语言难度,提升二语习得中对话系统的实用性。 Abstract: Large language models (LLMs) have emerged as powerful tools for supporting second language acquisition, particularly in simulating interactive dialogues for speaking practice. However, adapting the language difficulty of LLM-generated responses to match learners' proficiency levels remains a challenge. This work addresses this issue by proposing a framework for controlling language proficiency in educational dialogue systems. Our approach leverages three categories of linguistic features, readability features (e.g., Flesch-Kincaid Grade Level), syntactic features (e.g., syntactic tree depth), and lexical features (e.g., simple word ratio), to quantify and regulate text complexity. We demonstrate that training LLMs on linguistically annotated dialogue data enables precise modulation of language proficiency, outperforming prompt-based methods in both flexibility and stability. To evaluate this, we introduce Dilaprix, a novel metric integrating the aforementioned features, which shows strong correlation with expert judgments of language difficulty. Empirical results reveal that our approach achieves superior controllability of language proficiency while maintaining high dialogue quality.[36] Position: Thematic Analysis of Unstructured Clinical Transcripts with Large Language Models
Seungjun Yi,Joakim Nguyen,Terence Lim,Andrew Well,Joseph Skrovan,Mehak Beri,YongGeon Lee,Kavita Radhakrishnan,Liu Leqi,Mia Markey,Ying Ding
Main category: cs.CL
TL;DR: 本文探讨了大语言模型(LLM)在非结构化临床文本主题分析中的应用,指出当前方法在评估方面存在碎片化问题,并提出以有效性、可靠性和可解释性为核心的标准化评估框架。
Details
Motivation: 主题分析是一种广泛用于挖掘患者和医护人员叙述中模式的方法,但资源消耗大。大语言模型有望提升效率,但当前研究在方法和评估上缺乏统一标准,阻碍了领域进展。 Method: 通过系统综述近期将大语言模型应用于主题分析的研究,并结合对执业临床医生的访谈,分析现有方法在主题分析类型、数据集、提示策略和模型使用等方面的差异,特别关注评估方法的不一致性。 Result: 发现现有研究在评估方法上差异显著,从专家定性评审到自动相似性度量不等,导致难以进行跨研究比较和建立基准。 Conclusion: 建立标准化的评估实践对于推动该领域发展至关重要,因此提出了一个涵盖有效性、可靠性和可解释性的三维评估框架。 Abstract: This position paper examines how large language models (LLMs) can support thematic analysis of unstructured clinical transcripts, a widely used but resource-intensive method for uncovering patterns in patient and provider narratives. We conducted a systematic review of recent studies applying LLMs to thematic analysis, complemented by an interview with a practicing clinician. Our findings reveal that current approaches remain fragmented across multiple dimensions including types of thematic analysis, datasets, prompting strategies and models used, most notably in evaluation. Existing evaluation methods vary widely (from qualitative expert review to automatic similarity metrics), hindering progress and preventing meaningful benchmarking across studies. We argue that establishing standardized evaluation practices is critical for advancing the field. To this end, we propose an evaluation framework centered on three dimensions: validity, reliability, and interpretability.[37] Leveraging IndoBERT and DistilBERT for Indonesian Emotion Classification in E-Commerce Reviews
William Christian,Daniel Adamlu,Adrian Yu,Derwin Suhartono
Main category: cs.CL
TL;DR: 本研究通过使用IndoBERT和DistilBERT等先进语言模型,结合回译和同义词替换等数据增强技术,提升了印尼语情感分类的准确性。IndoBERT在调参后达到80%的准确率,表明数据预处理的关键作用。
Details
Motivation: 提升印尼语情感分析的准确性,以改善电子商务中的用户体验。 Method: 采用IndoBERT和DistilBERT模型,结合回译和同义词替换进行数据增强,并进行超参数调优。 Result: IndoBERT模型达到80%的准确率;数据增强显著提升性能,而多模型组合仅带来轻微改进。 Conclusion: IndoBERT是印尼语情感分类中最有效的模型,数据增强对提高准确性至关重要;未来应探索其他架构以提升印尼语NLP任务的泛化能力。 Abstract: Understanding emotions in the Indonesian language is essential for improving customer experiences in e-commerce. This study focuses on enhancing the accuracy of emotion classification in Indonesian by leveraging advanced language models, IndoBERT and DistilBERT. A key component of our approach was data processing, specifically data augmentation, which included techniques such as back-translation and synonym replacement. These methods played a significant role in boosting the model's performance. After hyperparameter tuning, IndoBERT achieved an accuracy of 80\%, demonstrating the impact of careful data processing. While combining multiple IndoBERT models led to a slight improvement, it did not significantly enhance performance. Our findings indicate that IndoBERT was the most effective model for emotion classification in Indonesian, with data augmentation proving to be a vital factor in achieving high accuracy. Future research should focus on exploring alternative architectures and strategies to improve generalization for Indonesian NLP tasks.[38] Reveal and Release: Iterative LLM Unlearning with Self-generated Data
Linxi Xie,Xin Teng,Shichang Ke,Hongyi Wen,Shengjie Wang
Main category: cs.CL
TL;DR: 提出了一种“揭示与释放”方法,通过模型自生成遗忘数据进行无学习,解决了传统方法需完整访问敏感或受限遗忘数据的难题。
Details
Motivation: 现有LLM无学习方法通常需要完全访问遗忘数据集,但这些数据往往隐私敏感、稀少或受法律限制,难以获取;且其分布可能与模型内部表示不一致。 Method: 提出“Reveal-and-Release”方法,利用优化指令提示模型揭示其知识以生成遗忘数据,并在参数高效模块基础上构建迭代无学习框架,在权重空间中逐步调整模型。 Result: 实验结果表明,该方法在遗忘质量和模型效用保持之间取得了良好平衡,有效利用自生成数据实现无学习。 Conclusion: 该方法无需直接访问真实遗忘数据,通过自生成数据和迭代参数调整,实现了高效且实用的模型无学习。 Abstract: Large language model (LLM) unlearning has demonstrated effectiveness in removing the influence of undesirable data (also known as forget data). Existing approaches typically assume full access to the forget dataset, overlooking two key challenges: (1) Forget data is often privacy-sensitive, rare, or legally regulated, making it expensive or impractical to obtain (2) The distribution of available forget data may not align with how that information is represented within the model. To address these limitations, we propose a ``Reveal-and-Release'' method to unlearn with self-generated data, where we prompt the model to reveal what it knows using optimized instructions. To fully utilize the self-generated forget data, we propose an iterative unlearning framework, where we make incremental adjustments to the model's weight space with parameter-efficient modules trained on the forget data. Experimental results demonstrate that our method balances the tradeoff between forget quality and utility preservation.[39] SWE-QA: Can Language Models Answer Repository-level Code Questions?
Weihan Peng,Yuling Shi,Yuhang Wang,Xinyun Zhang,Beijun Shen,Xiaodong Gu
Main category: cs.CL
TL;DR: 本文提出了SWE-QA,一个面向软件仓库级别的代码问答基准,包含576个高质量问题,涵盖跨文件推理和多跳依赖分析等复杂任务。基于对GitHub issue的分析构建,并提出SWE-QA-Agent框架以LLM代理自动回答问题,评估结果显示了大模型在仓库级理解中的潜力与挑战。
Details
Motivation: 现有代码问答基准多关注独立的小段代码,难以反映真实仓库中复杂的跨文件依赖和架构理解需求,因此需要更贴近实际开发场景的仓库级问答基准。 Method: 从11个流行仓库的77,100个GitHub issue中提取开发者问题,建立两层分类体系,按类别设计并人工验证问题与答案,构建SWE-QA数据集;同时开发SWE-QA-Agent代理框架,结合LLM进行自动推理与回答。 Result: SWE-QA包含576个高质量、多类别的仓库级问题;实验评估了六种先进LLM在不同上下文增强策略下的表现,SWE-QA-Agent展现出良好潜力,但模型在长距离依赖和复杂推理上仍有不足。 Conclusion: SWE-QA为仓库级代码理解提供了新的基准,推动了面向真实软件工程场景的智能问答系统研究,揭示了当前LLM在复杂代码环境中的能力与局限,指明了未来研究方向。 Abstract: Understanding and reasoning about entire software repositories is an essential capability for intelligent software engineering tools. While existing benchmarks such as CoSQA and CodeQA have advanced the field, they predominantly focus on small, self-contained code snippets. These setups fail to capture the complexity of real-world repositories, where effective understanding and reasoning often require navigating multiple files, understanding software architecture, and grounding answers in long-range code dependencies. In this paper, we present SWE-QA, a repository-level code question answering (QA) benchmark designed to facilitate research on automated QA systems in realistic code environments. SWE-QA involves 576 high-quality question-answer pairs spanning diverse categories, including intention understanding, cross-file reasoning, and multi-hop dependency analysis. To construct SWE-QA, we first crawled 77,100 GitHub issues from 11 popular repositories. Based on an analysis of naturally occurring developer questions extracted from these issues, we developed a two-level taxonomy of repository-level questions and constructed a set of seed questions for each category. For each category, we manually curated and validated questions and collected their corresponding answers. As a prototype application, we further develop SWE-QA-Agent, an agentic framework in which LLM agents reason and act to find answers automatically. We evaluate six advanced LLMs on SWE-QA under various context augmentation strategies. Experimental results highlight the promise of LLMs, particularly our SWE-QA-Agent framework, in addressing repository-level QA, while also revealing open challenges and pointing to future research directions.[40] MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models
Siyu Yan,Long Zeng,Xuecheng Wu,Chengcheng Han,Kongcheng Zhang,Chong Peng,Xuezhi Cao,Xunliang Cai,Chenjuan Guo
Main category: cs.CL
TL;DR: 提出MUSE框架,从攻击和防御两方面应对多轮越狱问题。
Details
Motivation: 确保大语言模型在多轮对话中与人类价值观对齐,防止通过上下文绕过安全机制的越狱攻击。 Method: 攻击方面采用框架语义与启发式树搜索(MUSE-A),防御方面提出细粒度安全对齐方法早期干预(MUSE-D)。 Result: 在多种模型上实验表明MUSE能有效识别和缓解多轮越狱漏洞。 Conclusion: MUSE在多轮对话场景下显著提升了模型的安全性,兼具攻击发现与防御能力。 Abstract: As large language models~(LLMs) become widely adopted, ensuring their alignment with human values is crucial to prevent jailbreaks where adversaries manipulate models to produce harmful content. While most defenses target single-turn attacks, real-world usage often involves multi-turn dialogues, exposing models to attacks that exploit conversational context to bypass safety measures. We introduce MUSE, a comprehensive framework tackling multi-turn jailbreaks from both attack and defense angles. For attacks, we propose MUSE-A, a method that uses frame semantics and heuristic tree search to explore diverse semantic trajectories. For defense, we present MUSE-D, a fine-grained safety alignment approach that intervenes early in dialogues to reduce vulnerabilities. Extensive experiments on various models show that MUSE effectively identifies and mitigates multi-turn vulnerabilities. Code is available at \href{https://github.com/yansiyu02/MUSE}{https://github.com/yansiyu02/MUSE}.[41] UMA-Split: unimodal aggregation for both English and Mandarin non-autoregressive speech recognition
Ying Fang,Xiaofei Li
Main category: cs.CL
TL;DR: 提出基于单模态聚合(UMA)的非自回归模型,改进其在英语和普通话语音识别中的表现,通过引入分裂模块使每个聚合帧可映射到多个token,提升多语言适用性。
Details
Motivation: 原始UMA在英语等语言上表现不佳,因细粒度分词和短音帧难以形成有效的单模态权重,需增强其跨语言适应能力。 Method: 在UMA基础上引入分裂模块,将每个聚合帧生成两个token后再计算CTC损失,以支持一个帧对应多个token的映射。 Result: 改进后的模型在英语和普通话语音识别任务中均优于原始UMA,提升了对多语言和细粒度分词的适应性。 Conclusion: 通过扩展UMA的映射能力,所提方法有效解决了其在英语等语言上的局限性,增强了非自回归语音识别模型的通用性和性能。 Abstract: This paper proposes a unimodal aggregation (UMA) based nonautoregressive model for both English and Mandarin speech recognition. The original UMA explicitly segments and aggregates acoustic frames (with unimodal weights that first monotonically increase and then decrease) of the same text token to learn better representations than regular connectionist temporal classification (CTC). However, it only works well in Mandarin. It struggles with other languages, such as English, for which a single syllable may be tokenized into multiple fine-grained tokens, or a token spans fewer than 3 acoustic frames and fails to form unimodal weights. To address this problem, we propose allowing each UMA-aggregated frame map to multiple tokens, via a simple split module that generates two tokens from each aggregated frame before computing the CTC loss.[42] TableDART: Dynamic Adaptive Multi-Modal Routing for Table Understanding
Xiaobo Xing,Wei Yuan,Tong Chen,Quoc Viet Hung Nguyen,Xiangliang Zhang,Hongzhi Yin
Main category: cs.CL
TL;DR: 提出TableDART,一种训练高效的多模态表格理解框架,通过轻量级门控网络动态选择文本、图像或融合路径,并引入代理模型整合跨模态输出,在七个基准上优于现有开源模型。
Details
Motivation: 现有表格理解方法在保留结构信息和语义细节之间存在权衡,多模态方法存在冗余、冲突和高昂的微调成本。 Method: 设计轻量级MLP门控网络动态选择最优处理路径(文本、图像或融合),并引入代理模型进行跨模态结果选择或推理合成,复用单模态预训练模型以避免全量微调。 Result: 在七个基准测试上平均超越最强基线4.02%,达到开源模型中的最先进性能。 Conclusion: TableDART有效平衡了多模态表格理解中的效率与性能,显著降低计算成本的同时提升了准确性。 Abstract: Modeling semantic and structural information from tabular data remains a core challenge for effective table understanding. Existing Table-as-Text approaches flatten tables for large language models (LLMs), but lose crucial structural cues, while Table-as-Image methods preserve structure yet struggle with fine-grained semantics. Recent Table-as-Multimodality strategies attempt to combine textual and visual views, but they (1) statically process both modalities for every query-table pair within a large multimodal LLMs (MLLMs), inevitably introducing redundancy and even conflicts, and (2) depend on costly fine-tuning of MLLMs. In light of this, we propose TableDART, a training-efficient framework that integrates multimodal views by reusing pretrained single-modality models. TableDART introduces a lightweight 2.59M-parameter MLP gating network that dynamically selects the optimal path (either Text-only, Image-only, or Fusion) for each table-query pair, effectively reducing redundancy and conflicts from both modalities. In addition, we propose a novel agent to mediate cross-modal knowledge integration by analyzing outputs from text- and image-based models, either selecting the best result or synthesizing a new answer through reasoning. This design avoids the prohibitive costs of full MLLM fine-tuning. Extensive experiments on seven benchmarks show that TableDART establishes new state-of-the-art performance among open-source models, surpassing the strongest baseline by an average of 4.02%. The code is available at: https://anonymous.4open.science/r/TableDART-C52B[43] HARNESS: Lightweight Distilled Arabic Speech Foundation Models
Vrunda N. sukhadia,Shammur Absar Chowdhury
Main category: cs.CL
TL;DR: 本文提出了HArnESS,首个以阿拉伯语为中心的自监督语音模型家族,通过迭代自蒸馏和低秩近似方法,在保持阿拉伯语语音特征的同时实现模型压缩,在ASR、SER和DID任务上表现优异且适用于资源受限环境。
Details
Motivation: 大型预训练语音模型在下游任务中表现优秀,但在资源受限环境中部署不切实际。阿拉伯语在现有模型中缺乏针对性建模,因此需要一种轻量且保留阿拉伯语语音特点的专用模型。 Method: 提出HArnESS模型家族,采用迭代自蒸馏方法,先训练双语大模型(HL),再将知识蒸馏至小型学生模型(HS, HST),并结合低秩近似进一步压缩教师模型的离散监督信号,实现高效紧凑建模。 Result: 在阿拉伯语ASR、说话人情感识别(SER)和方言识别(DID)任务上,HArnESS在极少微调下达到或接近SOTA水平,优于HuBERT和XLS-R,同时显著降低模型体积和计算需求。 Conclusion: HArnESS是一种高效、轻量且专为阿拉伯语设计的语音表示模型,通过知识蒸馏与结构压缩实现了性能与部署可行性的平衡,适合低资源场景应用,作者已公开模型以促进相关研究。 Abstract: Large pre-trained speech models excel in downstream tasks but their deployment is impractical for resource-limited environments. In this paper, we introduce HArnESS, the first Arabic-centric self-supervised speech model family, designed to capture Arabic speech nuances. Using iterative self-distillation, we train large bilingual HArnESS (HL) SSL models and then distill knowledge into compressed student models (HS, HST), preserving Arabic-specific representations. We use low-rank approximation to further compact the teacher's discrete supervision into shallow, thin models. We evaluate HArnESS on Arabic ASR, Speaker Emotion Recognition (SER), and Dialect Identification (DID), demonstrating effectiveness against HuBERT and XLS-R. With minimal fine-tuning, HArnESS achieves SOTA or comparable performance, making it a lightweight yet powerful alternative for real-world use. We release our distilled models and findings to support responsible research and deployment in low-resource settings.[44] From Ground Trust to Truth: Disparities in Offensive Language Judgments on Contemporary Korean Political Discourse
Seunguk Yu,Jungmin Yun,Jinhee Jang,Youngbin Kim
Main category: cs.CL
TL;DR: 本研究构建了大规模当代政治话语数据集,采用三种改进的判断方法在缺乏真实标签的情况下进行评估,并通过伪标签验证发现,精心设计的单次提示能达到与资源密集型方法相当的性能。
Details
Motivation: 现有研究多依赖过时的数据集,且很少评估对未见文本的泛化能力,而 offensive language 随时间不断演变,需要更贴近现实场景的检测方法。 Method: 构建当代政治话语数据集,采用三种代表性的检测方法进行精细化判断,使用留一法分析标签一致性趋势,并通过伪标签进行定量性能评估。 Result: 识别出不同判断方法的独特模式,发现标签间存在特定一致倾向,并证明单次提示策略在性能上可媲美更复杂的资源密集型方法。 Conclusion: 精心设计的单次提示是一种在现实约束条件下可行且高效的 offensive language 检测方法,具备良好的应用潜力。 Abstract: Although offensive language continually evolves over time, even recent studies using LLMs have predominantly relied on outdated datasets and rarely evaluated the generalization ability on unseen texts. In this study, we constructed a large-scale dataset of contemporary political discourse and employed three refined judgments in the absence of ground truth. Each judgment reflects a representative offensive language detection method and is carefully designed for optimal conditions. We identified distinct patterns for each judgment and demonstrated tendencies of label agreement using a leave-one-out strategy. By establishing pseudo-labels as ground trust for quantitative performance assessment, we observed that a strategically designed single prompting achieves comparable performance to more resource-intensive methods. This suggests a feasible approach applicable in real-world settings with inherent constraints.[45] Decoupled Proxy Alignment: Mitigating Language Prior Conflict for Multimodal Alignment in MLLM
Chenkun Tan,Pengyu Wang,Shaojun Zhou,Botian Jiang,Zhaowei Li,Dong Zhang,Xinghao Wang,Yaqian Zhou,Xipeng Qiu
Main category: cs.CL
TL;DR: 本文提出了一种新的多模态大语言模型训练方法DPA,以解决语言先验冲突问题,通过引入代理LLM和动态损失调整,显著提升了视觉-语言对齐性能。
Details
Motivation: 发现现有MLLM在训练中存在语言先验冲突,导致视觉-语言对齐效果不佳,影响模型泛化能力。 Method: 提出Decoupled Proxy Alignment (DPA)方法,使用代理LLM解耦语言先验干扰,并基于视觉相关性进行动态损失调整。 Result: 实验表明DPA在多种数据集、模型族和规模上均显著缓解语言先验冲突,提升对齐性能和泛化能力。 Conclusion: DPA是一种有效且鲁棒的视觉-语言对齐训练方法,具有广泛的应用前景。 Abstract: Multimodal large language models (MLLMs) have gained significant attention due to their impressive ability to integrate vision and language modalities. Recent advancements in MLLMs have primarily focused on improving performance through high-quality datasets, novel architectures, and optimized training strategies. However, in this paper, we identify a previously overlooked issue, language prior conflict, a mismatch between the inherent language priors of large language models (LLMs) and the language priors in training datasets. This conflict leads to suboptimal vision-language alignment, as MLLMs are prone to adapting to the language style of training samples. To address this issue, we propose a novel training method called Decoupled Proxy Alignment (DPA). DPA introduces two key innovations: (1) the use of a proxy LLM during pretraining to decouple the vision-language alignment process from language prior interference, and (2) dynamic loss adjustment based on visual relevance to strengthen optimization signals for visually relevant tokens. Extensive experiments demonstrate that DPA significantly mitigates the language prior conflict, achieving superior alignment performance across diverse datasets, model families, and scales. Our method not only improves the effectiveness of MLLM training but also shows exceptional generalization capabilities, making it a robust approach for vision-language alignment. Our code is available at https://github.com/fnlp-vision/DPA.[46] UnifiedVisual: A Framework for Constructing Unified Vision-Language Datasets
Pengyu Wang,Shaojun Zhou,Chenkun Tan,Xinghao Wang,Wei Huang,Zhen Ye,Zhaowei Li,Botian Jiang,Dong Zhang,Xipeng Qiu
Main category: cs.CL
TL;DR: 本文提出了一种新的数据集构建框架UnifiedVisual及其实例UnifiedVisual-240K,旨在促进多模态理解与生成能力的相互增强,从而推动统一视觉大语言模型的发展。
Details
Motivation: 现有的多模态数据集通常孤立地处理理解和生成任务,限制了统一视觉大语言模型(VLLMs)的发展,缺乏能够充分利用这两种核心能力协同效应的数据集。 Method: 提出了UnifiedVisual框架,并构建了高质量、多样化的数据集UnifiedVisual-240K,整合多种视觉与文本输入输出形式,支持跨模态推理和精确的文本到图像对齐。 Result: 在UnifiedVisual-240K上训练的模型在多种任务上表现优异,展现出理解与生成能力之间的显著相互促进作用。 Conclusion: UnifiedVisual为统一视觉大语言模型的发展提供了新的增长点,有效释放其全部潜力。 Abstract: Unified vision large language models (VLLMs) have recently achieved impressive advancements in both multimodal understanding and generation, powering applications such as visual question answering and text-guided image synthesis. However, progress in unified VLLMs remains constrained by the lack of datasets that fully exploit the synergistic potential between these two core abilities. Existing datasets typically address understanding and generation in isolation, thereby limiting the performance of unified VLLMs. To bridge this critical gap, we introduce a novel dataset construction framework, UnifiedVisual, and present UnifiedVisual-240K, a high-quality dataset meticulously designed to facilitate mutual enhancement between multimodal understanding and generation. UnifiedVisual-240K seamlessly integrates diverse visual and textual inputs and outputs, enabling comprehensive cross-modal reasoning and precise text-to-image alignment. Our dataset encompasses a wide spectrum of tasks and data sources, ensuring rich diversity and addressing key shortcomings of prior resources. Extensive experiments demonstrate that models trained on UnifiedVisual-240K consistently achieve strong performance across a wide range of tasks. Notably, these models exhibit significant mutual reinforcement between multimodal understanding and generation, further validating the effectiveness of our framework and dataset. We believe UnifiedVisual represents a new growth point for advancing unified VLLMs and unlocking their full potential. Our code and datasets is available at https://github.com/fnlp-vision/UnifiedVisual.[47] Evaluating Large Language Models for Cross-Lingual Retrieval
Longfei Zuo,Pingjun Hong,Oliver Kraus,Barbara Plank,Robert Litschko
Main category: cs.CL
TL;DR: 本文研究了基于大语言模型(LLM)的两阶段跨语言信息检索(CLIR),发现使用多语言双编码器作为第一阶段检索器可提升性能,且无需依赖机器翻译,现有最先进重排序模型在直接用于CLIR时表现显著下降。
Details
Motivation: 现有的CLIR评估多依赖于带机器翻译的词典检索,成本高且易产生错误传播,缺乏对LLM在多语言场景下系统性的大规模比较。 Method: 采用多语言双编码器作为第一阶段检索器,结合指令调优的LLM进行成对或列表式重排序,在段落级和文档级CLIR任务上进行实验评估。 Result: 实验表明,使用多语言双编码器优于传统MT+词典检索;随着重排序模型增强,翻译带来的增益减弱;成对LLM重排序器表现可与列表式媲美;不使用机器翻译时,当前SOTA重排序模型在CLIR中表现大幅下降。 Conclusion: 本文首次系统研究了LLM在两阶段CLIR中检索器与重排序器的交互影响,指出无需机器翻译的高效路径,并揭示当前模型在直接跨语言应用中的局限性。 Abstract: Multi-stage information retrieval (IR) has become a widely-adopted paradigm in search. While Large Language Models (LLMs) have been extensively evaluated as second-stage reranking models for monolingual IR, a systematic large-scale comparison is still lacking for cross-lingual IR (CLIR). Moreover, while prior work shows that LLM-based rerankers improve CLIR performance, their evaluation setup relies on lexical retrieval with machine translation (MT) for the first stage. This is not only prohibitively expensive but also prone to error propagation across stages. Our evaluation on passage-level and document-level CLIR reveals that further gains can be achieved with multilingual bi-encoders as first-stage retrievers and that the benefits of translation diminishes with stronger reranking models. We further show that pairwise rerankers based on instruction-tuned LLMs perform competitively with listwise rerankers. To the best of our knowledge, we are the first to study the interaction between retrievers and rerankers in two-stage CLIR with LLMs. Our findings reveal that, without MT, current state-of-the-art rerankers fall severely short when directly applied in CLIR.[48] KAIO: A Collection of More Challenging Korean Questions
Nahyun Lee,Guijin Son,Hyunwoo Ko,Kyubeen Han
Main category: cs.CL
TL;DR: 本文介绍了KAIO,一个以数学为中心、强调长链推理的韩语基准测试,旨在解决现有韩语基准测试饱和快、范围窄和更新慢的问题,有效评估和排名前沿模型在韩语环境下的表现。
Details
Motivation: 现有的韩语基准测试数量少、多为翻译版本或范围狭窄,且更新缓慢,导致测试很快饱和和污染,难以准确追踪前沿大模型的进步。 Method: 提出KAIO,一个专注于数学和长链推理的韩语基准测试,通过保持数据集私有并使用保留评估器来减少污染,在公开模型达到至少80%准确率前不公开测试集,并计划后续推出更难版本。 Result: 目前表现最好的GPT-5得分为62.8,Gemini-2.5-Pro为52.3,而Qwen3-235B和DeepSeek-R1等开源模型得分低于30,显示KAIO具有足够的区分度和发展空间。 Conclusion: KAIO作为一个尚未饱和的高难度韩语基准测试,能够有效评估和持续追踪前沿语言模型在韩语场景下的进展,尤其适用于衡量复杂推理能力。 Abstract: With the advancement of mid/post-training techniques, LLMs are pushing their boundaries at an accelerated pace. Legacy benchmarks saturate quickly (e.g., broad suites like MMLU over the years, newer ones like GPQA-D even faster), which makes frontier progress hard to track. The problem is especially acute in Korean: widely used benchmarks are fewer, often translated or narrow in scope, and updated more slowly, so saturation and contamination arrive sooner. Accordingly, at this moment, there is no Korean benchmark capable of evaluating and ranking frontier models. To bridge this gap, we introduce KAIO, a Korean, math-centric benchmark that stresses long-chain reasoning. Unlike recent Korean suites that are at or near saturation, KAIO remains far from saturated: the best-performing model, GPT-5, attains 62.8, followed by Gemini-2.5-Pro (52.3). Open models such as Qwen3-235B and DeepSeek-R1 cluster falls below 30, demonstrating substantial headroom, enabling robust tracking of frontier progress in Korean. To reduce contamination, KAIO will remain private and be served via a held-out evaluator until the best publicly known model reaches at least 80% accuracy, after which we will release the set and iterate to a harder version.[49] Reasoning over Boundaries: Enhancing Specification Alignment via Test-time Delibration
Haoran Zhang,Yafu Li,Xuyang Hu,Dongrui Liu,Zhilin Wang,Bo Li,Yu Cheng
Main category: cs.CL
TL;DR: 本文提出了Align3方法和SpecBench基准,用于评估大语言模型在动态、场景特定的行为与安全规范下的对齐能力,实验表明测试时推理能有效提升规范对齐性能。
Details
Motivation: 大语言模型在不同应用场景中需遵循用户或组织定制的多样化且动态变化的行为与安全规范,现有方法难以适应这种复杂多变的规范对齐需求。 Method: 提出Align3方法,采用分层反思与修订的测试时推理(TTD)机制来推理规范边界;同时构建SpecBench统一基准,涵盖5种场景、103项规范和1500个提示。 Result: 在15个推理模型和18个指令模型上的实验表明:(i) 测试时推理能提升规范对齐效果;(ii) Align3以极低开销改善了安全性与有用性的权衡;(iii) SpecBench能有效揭示对齐差距。 Conclusion: 测试时推理是一种有效的策略,可用于应对现实世界中复杂、动态的规范对齐挑战。 Abstract: Large language models (LLMs) are increasingly applied in diverse real-world scenarios, each governed by bespoke behavioral and safety specifications (spec) custom-tailored by users or organizations. These spec, categorized into safety-spec and behavioral-spec, vary across scenarios and evolve with changing preferences and requirements. We formalize this challenge as specification alignment, focusing on LLMs' ability to follow dynamic, scenario-specific spec from both behavioral and safety perspectives. To address this challenge, we propose Align3, a lightweight method that employs Test-Time Deliberation (TTD) with hierarchical reflection and revision to reason over the specification boundaries. We further present SpecBench, a unified benchmark for measuring specification alignment, covering 5 scenarios, 103 spec, and 1,500 prompts. Experiments on 15 reasoning and 18 instruct models with several TTD methods, including Self-Refine, TPO, and MoreThink, yield three key findings: (i) test-time deliberation enhances specification alignment; (ii) Align3 advances the safety-helpfulness trade-off frontier with minimal overhead; (iii) SpecBench effectively reveals alignment gaps. These results highlight the potential of test-time deliberation as an effective strategy for reasoning over the real-world specification boundaries.[50] SINAI at eRisk@CLEF 2023: Approaching Early Detection of Gambling with Natural Language Processing
Alba Maria Marmol-Romero,Flor Miriam Plaza-del-Arco,Arturo Montejo-Raez
Main category: cs.CL
TL;DR: SINAI团队在eRisk@CLEF实验室的Task 2中提出了一种基于Transformer预训练模型和LSTM架构的方法,用于早期检测病理性赌博迹象,并在49个参赛队伍中排名第7,召回率和早期检测相关指标表现最佳。
Details
Motivation: 旨在通过先进的深度学习技术实现对病理性赌博的早期识别,以支持及时干预。 Method: 采用Transformer预训练模型结合LSTM架构,并应用全面的数据预处理和数据平衡技术。 Result: 在Task 2中,团队以F1分数0.126排名第七,在召回率和早期检测指标上取得最高值。 Conclusion: 所提出的方法在早期检测病理性赌博方面表现出良好的召回能力,验证了融合模型在该任务中的潜力。 Abstract: This paper describes the participation of the SINAI team in the eRisk@CLEF lab. Specifically, one of the proposed tasks has been addressed: Task 2 on the early detection of signs of pathological gambling. The approach presented in Task 2 is based on pre-trained models from Transformers architecture with comprehensive preprocessing data and data balancing techniques. Moreover, we integrate Long-short Term Memory (LSTM) architecture with automodels from Transformers. In this Task, our team has been ranked in seventh position, with an F1 score of 0.126, out of 49 participant submissions and achieves the highest values in recall metrics and metrics related to early detection.[51] SINAI at eRisk@CLEF 2022: Approaching Early Detection of Gambling and Eating Disorders with Natural Language Processing
Alba Maria Marmol-Romero,Salud Maria Jimenez-Zafra,Flor Miriam Plaza-del-Arco,M. Dolores Molina-Gonzalez,Maria-Teresa Martin-Valdivia,Arturo Montejo-Raez
Main category: cs.CL
TL;DR: SINAI团队在eRisk@CLEF实验室的两项任务中表现出色,分别在病理性赌博早期检测和进食障碍严重程度评估中均获得第二名。
Details
Motivation: 参与eRisk@CLEF实验室的任务,提升在心理健康问题早期识别方面的技术水平。 Method: 任务1结合Transformer句子嵌入与文本体积、词汇多样性、复杂度及情绪相关特征;任务3采用基于Transformer的上下文词嵌入进行文本相似性估计。 Result: 任务1在41个参赛队伍中F1得分为0.808,排名第二;任务3在3个队伍中也排名第二。 Conclusion: 所提出的方法在心理健康的文本分析任务中表现优异,具有较强的竞争力和应用潜力。 Abstract: This paper describes the participation of the SINAI team in the eRisk@CLEF lab. Specifically, two of the proposed tasks have been addressed: i) Task 1 on the early detection of signs of pathological gambling, and ii) Task 3 on measuring the severity of the signs of eating disorders. The approach presented in Task 1 is based on the use of sentence embeddings from Transformers with features related to volumetry, lexical diversity, complexity metrics, and emotion-related scores, while the approach for Task 3 is based on text similarity estimation using contextualized word embeddings from Transformers. In Task 1, our team has been ranked in second position, with an F1 score of 0.808, out of 41 participant submissions. In Task 3, our team also placed second out of a total of 3 participating teams.[52] ReCoVeR the Target Language: Language Steering without Sacrificing Task Performance
Hannah Sterz,Fabian David Schmidt,Goran Glavaš,Ivan Vulić
Main category: cs.CL
TL;DR: 本文提出了一种名为ReCoVeR的新方法,通过语言特定的引导向量来减少大型语言模型中的语言混淆问题。
Details
Motivation: 随着大型语言模型变得越来越多语言化,它们在生成回答时容易出现语言混淆问题,即生成的语言与提示或用户要求的语言不一致。 Method: 利用多平行语料库分离出语言向量,并通过固定(无监督)和可训练的引导函数有效利用这些向量进行模型引导。 Result: 在三个基准测试和18种语言上的广泛评估表明,ReCoVeR在单语和跨语言设置中都能有效减轻语言混淆,同时保持任务性能。 Conclusion: ReCoVeR是一种轻量级且有效的方法,能够在不牺牲任务性能的前提下减少多语言大模型中的语言混淆问题。 Abstract: As they become increasingly multilingual, Large Language Models (LLMs) exhibit more language confusion, i.e., they tend to generate answers in a language different from the language of the prompt or the answer language explicitly requested by the user. In this work, we propose ReCoVeR (REducing language COnfusion in VEctor Representations), a novel lightweight approach for reducing language confusion based on language-specific steering vectors. We first isolate language vectors with the help of multi-parallel corpus and then effectively leverage those vectors for effective LLM steering via fixed (i.e., unsupervised) as well as trainable steering functions. Our extensive evaluation, encompassing three benchmarks and 18 languages, shows that ReCoVeR effectively mitigates language confusion in both monolingual and cross-lingual setups while at the same time -- and in contrast to prior language steering methods -- retaining task performance. Our data code is available at https://github.com/hSterz/recover.[53] LLM Agents at the Roundtable: A Multi-Perspective and Dialectical Reasoning Framework for Essay Scoring
Jinhee Jang,Ayoung Moon,Minkyoung Jung,YoungBin Kim. Seung Jin Lee
Main category: cs.CL
TL;DR: 提出了一种名为Roundtable Essay Scoring (RES)的多智能体评估框架,用于在零样本设置下实现精确且与人类评分对齐的自动作文评分。
Details
Motivation: 现有的大语言模型在自动作文评分中难以达到人类水平的多视角理解和判断,因此需要一种更贴近人类评估方式的方法。 Method: 构建基于大语言模型的多个 evaluator 智能体,每个智能体针对特定提示和主题上下文生成基于特征的评分标准并进行独立的多视角评估,再通过模拟圆桌讨论的辩证推理过程整合评分,得出最终整体分数。 Result: 在ASAP数据集上使用ChatGPT和Claude的实验表明,RES相比直接提示方法(Vanilla)平均QWK最高提升了34.86%。 Conclusion: RES通过多智能体协作与共识机制,在零样本设置下显著提升了自动作文评分的准确性与人类评分的一致性,优于以往的零样本AES方法。 Abstract: The emergence of large language models (LLMs) has brought a new paradigm to automated essay scoring (AES), a long-standing and practical application of natural language processing in education. However, achieving human-level multi-perspective understanding and judgment remains a challenge. In this work, we propose Roundtable Essay Scoring (RES), a multi-agent evaluation framework designed to perform precise and human-aligned scoring under a zero-shot setting. RES constructs evaluator agents based on LLMs, each tailored to a specific prompt and topic context. Each agent independently generates a trait-based rubric and conducts a multi-perspective evaluation. Then, by simulating a roundtable-style discussion, RES consolidates individual evaluations through a dialectical reasoning process to produce a final holistic score that more closely aligns with human evaluation. By enabling collaboration and consensus among agents with diverse evaluation perspectives, RES outperforms prior zero-shot AES approaches. Experiments on the ASAP dataset using ChatGPT and Claude show that RES achieves up to a 34.86% improvement in average QWK over straightforward prompting (Vanilla) methods.[54] V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models
Qidong Wang,Junjie Hu,Ming Jiang
Main category: cs.CL
TL;DR: 本文提出了V-SEAM,一种结合视觉语义编辑与注意力调制的框架,用于因果解释视觉语言模型(VLMs),能够在对象、属性和关系三个语义层次上进行概念级视觉操作,并识别对预测有正负贡献的注意力头,实验证明其在多个VQA基准上提升了LLaVA和InstructBLIP的性能。
Details
Motivation: 现有的视觉干预方法多依赖于粗粒度的像素级扰动,难以揭示视觉语言模型中多模态语义整合的内部机制,因此需要一种能进行语义级别分析的因果解释方法。 Method: 提出V-SEAM框架,结合视觉语义编辑和注意力调制,实现对视觉概念的精细操控,并识别在不同语义层次(对象、属性、关系)上影响预测结果的注意力头,同时引入自动嵌入调制方法优化关键注意力头。 Result: 发现正向注意力头通常在同一语义层级内共享,而负向注意力头具有跨层级泛化性;在三个VQA基准上,该方法显著提升了LLaVA和InstructBLIP的性能。 Conclusion: V-SEAM为视觉语言模型提供了细粒度的因果解释能力,揭示了多模态语义整合中注意力机制的作用,且通过自动调制策略有效增强了模型表现。 Abstract: Recent advances in causal interpretability have extended from language models to vision-language models (VLMs), seeking to reveal their internal mechanisms through input interventions. While textual interventions often target semantics, visual interventions typically rely on coarse pixel-level perturbations, limiting semantic insights on multimodal integration. In this study, we introduce V-SEAM, a novel framework that combines Visual Semantic Editing and Attention Modulating for causal interpretation of VLMs. V-SEAM enables concept-level visual manipulations and identifies attention heads with positive or negative contributions to predictions across three semantic levels: objects, attributes, and relationships. We observe that positive heads are often shared within the same semantic level but vary across levels, while negative heads tend to generalize broadly. Finally, we introduce an automatic method to modulate key head embeddings, demonstrating enhanced performance for both LLaVA and InstructBLIP across three diverse VQA benchmarks. Our data and code are released at: https://github.com/petergit1/V-SEAM.[55] Empathy-R1: A Chain-of-Empathy and Reinforcement Learning Framework for Long-Form Mental Health Support
Xianrong Yao,Dong She,Chenxu Zhang,Yimeng Zhang,Yueru Sun,Noman Ahmed,Yang Gao,Zhanpeng Jin
Main category: cs.CL
TL;DR: 本文提出了Empathy-R1框架,结合共情推理链(CoE)与强化学习(RL),提升大模型在长文本心理咨询中的回应质量,尤其适用于中文场景。
Details
Motivation: 现有大语言模型在心理健康支持中虽语义流畅,但缺乏结构化推理能力,难以提供真正有效的心理支持,尤其是在中文长文本咨询场景下。 Method: 提出Chain-of-Empathy(CoE)推理范式,模拟认知行为疗法,引导模型依次分析求助者的情绪、成因和意图;构建大规模中文数据集Empathy-QA,并采用两阶段训练:监督微调建立推理结构,强化学习优化回应的治疗相关性与情境适配性。 Result: 实验显示Empathy-R1在自动指标上表现优异,人工评估中显著优于基线模型,在新基准上获得44.30%的Win@1率。 Conclusion: Empathy-R1通过可解释且情境敏感的回应机制,推动了负责任、真正有益的AI心理支持系统的发展。 Abstract: Empathy is critical for effective mental health support, especially when addressing Long Counseling Texts (LCTs). However, existing Large Language Models (LLMs) often generate replies that are semantically fluent but lack the structured reasoning necessary for genuine psychological support, particularly in a Chinese context. To bridge this gap, we introduce Empathy-R1, a novel framework that integrates a Chain-of-Empathy (CoE) reasoning process with Reinforcement Learning (RL) to enhance response quality for LCTs. Inspired by cognitive-behavioral therapy, our CoE paradigm guides the model to sequentially reason about a help-seeker's emotions, causes, and intentions, making its thinking process both transparent and interpretable. Our framework is empowered by a new large-scale Chinese dataset, Empathy-QA, and a two-stage training process. First, Supervised Fine-Tuning instills the CoE's reasoning structure. Subsequently, RL, guided by a dedicated reward model, refines the therapeutic relevance and contextual appropriateness of the final responses. Experiments show that Empathy-R1 achieves strong performance on key automatic metrics. More importantly, human evaluations confirm its superiority, showing a clear preference over strong baselines and achieving a Win@1 rate of 44.30% on our new benchmark. By enabling interpretable and contextually nuanced responses, Empathy-R1 represents a significant advancement in developing responsible and genuinely beneficial AI for mental health support.[56] Llama-Mimi: Speech Language Models with Interleaved Semantic and Acoustic Tokens
Issa Sugiura,Shuhei Kurita,Yusuke Oda,Ryuichiro Higashinaka
Main category: cs.CL
TL;DR: Llama-Mimi是一种结合语义和声学标记的统一语音语言模型,在声学一致性和说话人身份保持方面表现优异,但增加量化器数量会影响语言性能。
Details
Motivation: 为了实现语音生成中语义与声学信息的联合建模,解决现有方法在连贯性和音质之间的权衡问题。 Method: 采用统一的分词器和单一Transformer解码器,对交错的语义和声学标记序列进行建模,并引入LLM-as-a-Judge评估生成语音内容质量。 Result: Llama-Mimi在声学一致性方面达到SOTA水平,能有效保留说话人身份;增加量化器可提升声学保真度但降低语言性能。 Conclusion: 统一建模语义与声学标记是可行且有效的,但在长期连贯性上仍存在挑战,需进一步平衡音质与语言质量。 Abstract: We propose Llama-Mimi, a speech language model that uses a unified tokenizer and a single Transformer decoder to jointly model sequences of interleaved semantic and acoustic tokens. Comprehensive evaluation shows that Llama-Mimi achieves state-of-the-art performance in acoustic consistency and possesses the ability to preserve speaker identity. Our analysis further demonstrates that increasing the number of quantizers improves acoustic fidelity but degrades linguistic performance, highlighting the inherent challenge of maintaining long-term coherence. We additionally introduce an LLM-as-a-Judge-based evaluation to assess the spoken content quality of generated outputs. Our models, code, and speech samples are publicly available.[57] A Multi-To-One Interview Paradigm for Efficient MLLM Evaluation
Ye Shen,Junying Wang,Farong Wen,Yijin Guo,Qi Jia,Zicheng Zhang,Guangtao Zhai
Main category: cs.CL
TL;DR: 提出了一种多对一的面试范式,用于高效评估多模态大语言模型,通过两阶段面试策略、动态调整权重和自适应选择问题难度,在减少问题数量的同时显著提高了与全量覆盖结果的相关性。
Details
Motivation: 传统全量问答评估存在高冗余和低效率问题,需要更高效的多模态大语言模型评估方法。 Method: 设计了一个包含预面试和正式面试的两阶段策略,动态调整面试官权重以保证公平性,并采用自适应机制选择问题难度级别。 Result: 在多个基准上的实验表明,该方法相比随机采样显著提升了与全量覆盖结果的一致性(PLCC提升17.6%,SRCC提升16.7%),同时减少了所需问题数量。 Conclusion: 所提出的面试范式为大规模多模态大语言模型评估提供了一种可靠且高效的替代方案。 Abstract: The rapid progress of Multi-Modal Large Language Models (MLLMs) has spurred the creation of numerous benchmarks. However, conventional full-coverage Question-Answering evaluations suffer from high redundancy and low efficiency. Inspired by human interview processes, we propose a multi-to-one interview paradigm for efficient MLLM evaluation. Our framework consists of (i) a two-stage interview strategy with pre-interview and formal interview phases, (ii) dynamic adjustment of interviewer weights to ensure fairness, and (iii) an adaptive mechanism for question difficulty-level chosen. Experiments on different benchmarks show that the proposed paradigm achieves significantly higher correlation with full-coverage results than random sampling, with improvements of up to 17.6% in PLCC and 16.7% in SRCC, while reducing the number of required questions. These findings demonstrate that the proposed paradigm provides a reliable and efficient alternative for large-scale MLLM benchmarking.[58] FURINA: Free from Unmergeable Router via LINear Aggregation of mixed experts
Jiayi Han,Liang Du,Yinda Chen,Xiao Kang,Weiyang Ding,Donghong Han
Main category: cs.CL
TL;DR: 提出FURINA,一种无需路由器的MoE-LoRA新框架,通过线性聚合专家实现可合并的参数高效微调,消除推理开销。
Details
Motivation: 现有MoE-LoRA方法依赖离散路由器,导致无法将MoE组件完全合并到主干模型中,限制了部署效率。 Method: 引入自路由机制:解耦LoRA适配器的方向与幅度学习、共享可学习的幅度向量、设计促进专家分化的选择损失;利用输入与适配器方向的角相似性激活专家。 Result: FURINA在保持零推理开销的同时,性能优于标准LoRA并匹敌或超越现有MoE-LoRA方法,且能完全合并到主干模型。 Conclusion: FURINA是首个无需路由器、可完全合并的MoE-LoRA方法,有效解决了推理复杂性与部署兼容性问题。 Abstract: The Mixture of Experts (MoE) paradigm has been successfully integrated into Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning (PEFT), delivering performance gains with minimal parameter overhead. However, a key limitation of existing MoE-LoRA methods is their reliance on a discrete router, which prevents the integration of the MoE components into the backbone model. To overcome this, we propose FURINA, a novel Free from Unmergeable Router framework based on the LINear Aggregation of experts. FURINA eliminates the router by introducing a Self-Routing mechanism. This is achieved through three core innovations: (1) decoupled learning of the direction and magnitude for LoRA adapters, (2) a shared learnable magnitude vector for consistent activation scaling, and (3) expert selection loss that encourages divergent expert activation. The proposed mechanism leverages the angular similarity between the input and each adapter's directional component to activate experts, which are then scaled by the shared magnitude vector. This design allows the output norm to naturally reflect the importance of each expert, thereby enabling dynamic, router-free routing. The expert selection loss further sharpens this behavior by encouraging sparsity and aligning it with standard MoE activation patterns. We also introduce a shared expert within the MoE-LoRA block that provides stable, foundational knowledge. To the best of our knowledge, FURINA is the first router-free, MoE-enhanced LoRA method that can be fully merged into the backbone model, introducing zero additional inference-time cost or complexity. Extensive experiments demonstrate that FURINA not only significantly outperforms standard LoRA but also matches or surpasses the performance of existing MoE-LoRA methods, while eliminating the extra inference-time overhead of MoE.[59] A Comparative Evaluation of Large Language Models for Persian Sentiment Analysis and Emotion Detection in Social Media Texts
Kian Tohidi,Kia Dashtipour,Simone Rebora,Sevda Pourfaramarz
Main category: cs.CL
TL;DR: 本研究对四种先进的大语言模型(Claude 3.7 Sonnet、DeepSeek-V3、Gemini 2.0 Flash 和 GPT-4o)在波斯语社交媒体文本的情感分析和情绪检测任务中的表现进行了全面比较。使用平衡的波斯语数据集,通过统一的提示和评估指标发现,所有模型表现均可接受,性能无显著差异,但GPT-4o在准确率上略占优势,Gemini 2.0 Flash 成本效益最佳。情绪识别比情感分析更具挑战性,且存在特定的误分类模式,揭示了多语言AI系统在波斯语应用中的语言与文化难题。
Details
Motivation: 现有大语言模型的比较研究多集中于英语任务,缺乏对波斯语等非英语语言性能的理解,导致跨语言应用中存在盲区。本研究旨在填补这一空白,评估主流LLMs在波斯语情感分析与情绪检测中的实际表现,并为多语言NLP系统提供基准与选型依据。 Method: 采用包含900条文本的情感分析数据集和1800条文本的情绪检测数据集,设计统一的提示词和一致的处理参数,对四种大语言模型进行实验评估。使用精确率、召回率、F1分数等指标衡量性能,并分析误分类模式,同时比较各模型的准确性与成本效率。 Result: 所有模型在两项任务中均达到可接受水平,统计检验显示前三名模型间无显著差异;GPT-4o 在原始准确率上略高,Gemini 2.0 Flash 成本最低;情绪检测任务整体难度高于情感分析;模型在波斯语特定表达、文化语境理解方面表现出误判倾向。 Conclusion: 该研究为波斯语NLP任务建立了大模型性能基准,表明当前主流LLMs在非英语场景下具有可用性但仍有局限,建议根据应用场景在准确性、效率与成本之间权衡选型,并强调在多语言AI部署中需重视语言特性和文化背景的影响。 Abstract: This study presents a comprehensive comparative evaluation of four state-of-the-art Large Language Models (LLMs)--Claude 3.7 Sonnet, DeepSeek-V3, Gemini 2.0 Flash, and GPT-4o--for sentiment analysis and emotion detection in Persian social media texts. Comparative analysis among LLMs has witnessed a significant rise in recent years, however, most of these analyses have been conducted on English language tasks, creating gaps in understanding cross-linguistic performance patterns. This research addresses these gaps through rigorous experimental design using balanced Persian datasets containing 900 texts for sentiment analysis (positive, negative, neutral) and 1,800 texts for emotion detection (anger, fear, happiness, hate, sadness, surprise). The main focus was to allow for a direct and fair comparison among different models, by using consistent prompts, uniform processing parameters, and by analyzing the performance metrics such as precision, recall, F1-scores, along with misclassification patterns. The results show that all models reach an acceptable level of performance, and a statistical comparison of the best three models indicates no significant differences among them. However, GPT-4o demonstrated a marginally higher raw accuracy value for both tasks, while Gemini 2.0 Flash proved to be the most cost-efficient. The findings indicate that the emotion detection task is more challenging for all models compared to the sentiment analysis task, and the misclassification patterns can represent some challenges in Persian language texts. These findings establish performance benchmarks for Persian NLP applications and offer practical guidance for model selection based on accuracy, efficiency, and cost considerations, while revealing cultural and linguistic challenges that require consideration in multilingual AI system deployment.[60] Patent Language Model Pretraining with ModernBERT
Amirhossein Yousefiramandi,Ciaran Cooney
Main category: cs.CL
TL;DR: 本研究通过在超过6000万项专利记录上预训练三种面向专利领域的掩码语言模型(ModernBERT架构),结合FlashAttention、旋转位置编码和GLU前馈层等优化,显著提升了专利文本分类任务的性能,同时保持了比PatentBERT快3倍以上的推理速度。
Details
Motivation: 通用或现有领域适配的语言模型在处理专利这类长文本、技术性强且结构复杂的领域时表现不佳,亟需更高效的领域专用模型。 Method: 采用ModernBERT架构,在大规模专利语料上进行领域特定的预训练,并引入FlashAttention、旋转位置编码和GLU前馈网络等架构优化;同时探索不同模型规模和定制分词器的影响。 Result: ModernBERT-base-PT在四个下游专利分类任务中的三个上优于通用ModernBERT,并与PatentBERT表现相当;更大的模型(如Mosaic-BERT-large)和定制分词器进一步提升性能;所有ModernBERT变体的推理速度均超过PatentBERT三倍以上。 Conclusion: 针对专利领域进行特定预训练并结合现代架构优化,能有效提升模型性能与推理效率,为专利NLP任务提供了更优解决方案。 Abstract: Transformer-based language models such as BERT have become foundational in NLP, yet their performance degrades in specialized domains like patents, which contain long, technical, and legally structured text. Prior approaches to patent NLP have primarily relied on fine-tuning general-purpose models or domain-adapted variants pretrained with limited data. In this work, we pretrain 3 domain-specific masked language models for patents, using the ModernBERT architecture and a curated corpus of over 60 million patent records. Our approach incorporates architectural optimizations, including FlashAttention, rotary embeddings, and GLU feed-forward layers. We evaluate our models on four downstream patent classification tasks. Our model, ModernBERT-base-PT, consistently outperforms the general-purpose ModernBERT baseline on three out of four datasets and achieves competitive performance with a baseline PatentBERT. Additional experiments with ModernBERT-base-VX and Mosaic-BERT-large demonstrate that scaling the model size and customizing the tokenizer further enhance performance on selected tasks. Notably, all ModernBERT variants retain substantially faster inference over - 3x that of PatentBERT - underscoring their suitability for time-sensitive applications. These results underscore the benefits of domain-specific pretraining and architectural improvements for patent-focused NLP tasks.[61] Cross-Modal Knowledge Distillation for Speech Large Language Models
Enzhi Wang,Qicheng Li,Zhiyuan Tang,Yuhang Jia
Main category: cs.CL
TL;DR: 本文首次系统评估了语音大语言模型中的灾难性遗忘和模态不等价问题,提出了一种跨模态知识蒸馏框架,通过文本到文本和语音到文本通道来保留文本知识并提升语音交互中的推理能力。
Details
Motivation: 引入语音能力可能导致大语言模型在文本输入下也出现知识和推理能力退化,且语音查询时性能进一步下降,亟需解决模态间知识失衡问题。 Method: 提出一种利用文本到文本和语音到文本通道的跨模态知识蒸馏框架,将基于文本的教师模型的知识迁移至语音大语言模型。 Result: 在对话和音频理解任务上的实验表明,该方法能有效保持文本知识、改善跨模态对齐,并增强基于语音交互的推理性能。 Conclusion: 所提出的跨模态知识蒸馏框架有效缓解了语音大语言模型中的灾难性遗忘与模态不等价问题,提升了多模态下的综合表现。 Abstract: In this work, we present the first systematic evaluation of catastrophic forgetting and modality inequivalence in speech large language models, showing that introducing speech capabilities can degrade knowledge and reasoning even when inputs remain textual, and performance further decreases with spoken queries. To address these challenges, we propose a cross-modal knowledge distillation framework that leverages both text-to-text and speech-to-text channels to transfer knowledge from a text-based teacher model to a speech LLM. Extensive experiments on dialogue and audio understanding tasks validate the effectiveness of our approach in preserving textual knowledge, improving cross-modal alignment, and enhancing reasoning in speech-based interactions.[62] Explicit vs. Implicit Biographies: Evaluating and Adapting LLM Information Extraction on Wikidata-Derived Texts
Alessandra Stramiglio,Andrea Schimmenti,Valentina Pasqual,Marieke van Erp,Francesco Sovrano,Fabio Vitali
Main category: cs.CL
TL;DR: 本研究探讨了文本隐含性对预训练大语言模型(如LLaMA 2.3、DeepSeekV1和Phi1.5)在信息抽取任务中的影响,通过生成包含1万条显性和隐性表达的合成数据集,评估模型表现,并分析使用LoRA微调是否能提升模型在隐性推理任务中的泛化能力。
Details
Motivation: 由于传统NLP方法依赖显性语句进行实体和关系识别,难以自动推断文本中的隐含信息,因此需要探索大语言模型在处理隐含语义时的表现及其改进方法。 Method: 构建两个包含1万条样本的合成数据集(显性与隐性表达的传记信息),在LLaMA 2.3、DeepSeekV1和Phi1.5等模型上进行信息抽取实验,并采用LoRA进行微调以评估其对隐性推理性能的影响。 Result: 实验结果表明,使用LoRA对大语言模型进行微调可显著提升其从隐性文本中提取信息的能力,且模型在隐性上下文中的推理性能得到改善。 Conclusion: 微调能够增强大语言模型对隐含文本的理解能力,有助于提高模型在信息抽取任务中的可解释性和可靠性,为处理自然语言中的隐含性提供了有效途径。 Abstract: Text Implicitness has always been challenging in Natural Language Processing (NLP), with traditional methods relying on explicit statements to identify entities and their relationships. From the sentence "Zuhdi attends church every Sunday", the relationship between Zuhdi and Christianity is evident for a human reader, but it presents a challenge when it must be inferred automatically. Large language models (LLMs) have proven effective in NLP downstream tasks such as text comprehension and information extraction (IE). This study examines how textual implicitness affects IE tasks in pre-trained LLMs: LLaMA 2.3, DeepSeekV1, and Phi1.5. We generate two synthetic datasets of 10k implicit and explicit verbalization of biographic information to measure the impact on LLM performance and analyze whether fine-tuning implicit data improves their ability to generalize in implicit reasoning tasks. This research presents an experiment on the internal reasoning processes of LLMs in IE, particularly in dealing with implicit and explicit contexts. The results demonstrate that fine-tuning LLM models with LoRA (low-rank adaptation) improves their performance in extracting information from implicit texts, contributing to better model interpretability and reliability.[63] Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs
Mario Sanz-Guerrero,Minh Duc Bui,Katharina von der Wense
Main category: cs.CL
TL;DR: 本文研究了在多选题问答中,大型语言模型评估时“Answer:”后空格的分词方式对结果的影响,发现不同分词方式可导致最高达11%的准确率差异,并影响模型排名。推荐将空格与答案字母一起分词以提升性能和模型校准性。
Details
Motivation: 在大模型评估中,'Answer:'后空格的分词方式常被视为无关紧要,但其可能影响评估结果的可靠性和可比性,亟需系统研究。 Method: 通过实验比较不同分词策略(如是否将空格与答案字母合并)对多个LLM在MCQA任务上的准确率和模型校准性的影响。 Result: 发现分词方式可导致高达11%的准确率差异,并改变模型排名;将空格与答案字母合并分词能带来一致且显著的性能提升,并改善模型校准性。 Conclusion: 评估设计中的细微选择对结果有重大影响,应建立标准化、透明的评估协议以确保结果可靠可比。 Abstract: When evaluating large language models (LLMs) with multiple-choice question answering (MCQA), it is common to end the prompt with the string "Answer:" to facilitate automated answer extraction via next-token probabilities. However, there is no consensus on how to tokenize the space following the colon, often overlooked as a trivial choice. In this paper, we uncover accuracy differences of up to 11% due to this (seemingly irrelevant) tokenization variation as well as reshuffled model rankings, raising concerns about the reliability of LLM comparisons in prior work. Surprisingly, we are able to recommend one specific strategy -- tokenizing the space together with the answer letter -- as we observe consistent and statistically significant performance improvements. Additionally, it improves model calibration, enhancing the reliability of the model's confidence estimates. Our findings underscore the importance of careful evaluation design and highlight the need for standardized, transparent evaluation protocols to ensure reliable and comparable results.[64] CLEAR: A Comprehensive Linguistic Evaluation of Argument Rewriting by Large Language Models
Thomas Huber,Christina Niklaus
Main category: cs.CL
TL;DR: 本文研究了大语言模型(LLM)在论点改进(ArgImp)这一文本重写任务中的行为,提出了一种名为CLEAR的评估管道,包含57个指标,涵盖词汇、句法、语义和语用四个语言层面。通过在多个论辩语料库上应用该管道,发现LLM在进行ArgImp时倾向于缩短文本、增加平均词长、合并句子,并在说服力和连贯性方面有所提升。
Details
Motivation: 尽管大语言模型在通用文本生成任务上已有广泛研究,但在与之相关的文本重写任务,尤其是论点改进(ArgImp)方面的研究较少,且缺乏对模型在此类任务中行为的系统分析。 Method: 提出CLEAR评估管道,整合57个指标覆盖词汇、句法、语义和语用四个层次,应用于多个论辩语料库,系统分析不同LLM在ArgImp任务中的改写行为。 Result: 发现LLM在ArgImp任务中普遍缩短文本、增加平均词长、合并句子;在语言层面表现为词汇和句法变化显著,整体提升了文本的说服力与连贯性。 Conclusion: LLM在论点改进任务中表现出一致的重写模式,通过特定语言层面的调整提升论证质量,CLEAR为分析LLM在文本重写中的行为提供了有效框架。 Abstract: While LLMs have been extensively studied on general text generation tasks, there is less research on text rewriting, a task related to general text generation, and particularly on the behavior of models on this task. In this paper we analyze what changes LLMs make in a text rewriting setting. We focus specifically on argumentative texts and their improvement, a task named Argument Improvement (ArgImp). We present CLEAR: an evaluation pipeline consisting of 57 metrics mapped to four linguistic levels: lexical, syntactic, semantic and pragmatic. This pipeline is used to examine the qualities of LLM-rewritten arguments on a broad set of argumentation corpora and compare the behavior of different LLMs on this task and analyze the behavior of different LLMs on this task in terms of linguistic levels. By taking all four linguistic levels into consideration, we find that the models perform ArgImp by shortening the texts while simultaneously increasing average word length and merging sentences. Overall we note an increase in the persuasion and coherence dimensions.[65] Value-Guided KV Compression for LLMs via Approximated CUR Decomposition
Ayan Sengupta,Siddhant Chaudhary,Tanmoy Chakraborty
Main category: cs.CL
TL;DR: 本文提出了一种基于CUR矩阵分解的值向量中心化KV缓存压缩方法CurDKV,通过考虑值向量的重要性来更有效地保留关键信息,在高压缩比下显著优于现有方法。
Details
Motivation: 现有KV缓存压缩方法主要依赖查询-键注意力分数进行令牌淘汰,忽略了直接影响注意力输出的值向量的作用,可能导致语义重要信息丢失。 Method: 提出CurDKV,利用CUR矩阵分解计算杠杆得分,选择对注意力输出子空间贡献最大的键和值,从而更好地保持模型预测行为,并理论证明该方法能最小化端到端注意力重建损失。 Result: 在LLaMA和Mistral模型上,相比SnapKV和ChunkKV等最先进方法,CurDKV在高压缩比下准确率最高提升9.6%,同时兼容FlashAttention和分组查询注意力机制。 Conclusion: CurDKV通过值向量驱动的压缩策略,在大幅减少KV缓存的同时更有效地保留了模型性能,兼具更高的准确性和高达40%的生成延迟降低,提供了更优的速度-精度权衡。 Abstract: Key-value (KV) cache compression has emerged as a critical technique for reducing the memory and latency overhead of autoregressive language models during inference. Prior approaches predominantly rely on query-key attention scores to rank and evict cached tokens, assuming that attention intensity correlates with semantic importance. However, this heuristic overlooks the contribution of value vectors, which directly influence the attention output. In this paper, we propose CurDKV, a novel, value-centric KV compression method that selects keys and values based on leverage scores computed from CUR matrix decomposition. Our approach approximates the dominant subspace of the attention output $softmax(QK^T)V$, ensuring that the retained tokens best preserve the model's predictive behavior. Theoretically, we show that attention score approximation does not guarantee output preservation, and demonstrate that CUR-based selection minimizes end-to-end attention reconstruction loss. Empirically, CurDKV achieves up to 9.6% higher accuracy than state-of-the-art methods like SnapKV and ChunkKV under aggressive compression budgets on LLaMA and Mistral, while maintaining compatibility with FlashAttention and Grouped Query Attention. In addition to improved accuracy, CurDKV reduces generation latency by up to 40% at high compression, offering a practical speed-accuracy tradeoff.[66] Can maiBERT Speak for Maithili?
Sumit Yadav,Raju Kumar Yadav,Utsav Maskey,Gautam Siddharth Kashyap Md Azizul Hoque,Ganesh Gautam
Main category: cs.CL
TL;DR: 本文提出了针对低资源语言Maithili的BERT模型maiBERT,通过掩码语言建模预训练,在新闻分类任务中达到87.02%准确率,优于现有区域模型,并已开源。
Details
Motivation: Maithili作为一种使用广泛但计算资源匮乏的语言,在自然语言理解方面缺乏高质量数据和专用模型,限制了其在AI应用中的发展。 Method: 采用Masked Language Modeling(MLM)技术,基于Maithili语料库预训练一个BERT模型(maiBERT),并在新闻分类任务上进行评估。 Result: maiBERT在新闻分类任务中取得了87.02%的准确率,整体比NepBERTa和HindiBERT高出0.13%,在多个类别上提升5-7%。 Conclusion: maiBERT是首个专为Maithili设计的预训练语言模型,表现优越,且已在Hugging Face开源,支持情感分析、命名实体识别等下游任务的进一步微调。 Abstract: Natural Language Understanding (NLU) for low-resource languages remains a major challenge in NLP due to the scarcity of high-quality data and language-specific models. Maithili, despite being spoken by millions, lacks adequate computational resources, limiting its inclusion in digital and AI-driven applications. To address this gap, we introducemaiBERT, a BERT-based language model pre-trained specifically for Maithili using the Masked Language Modeling (MLM) technique. Our model is trained on a newly constructed Maithili corpus and evaluated through a news classification task. In our experiments, maiBERT achieved an accuracy of 87.02%, outperforming existing regional models like NepBERTa and HindiBERT, with a 0.13% overall accuracy gain and 5-7% improvement across various classes. We have open-sourced maiBERT on Hugging Face enabling further fine-tuning for downstream tasks such as sentiment analysis and Named Entity Recognition (NER).[67] LLM-OREF: An Open Relation Extraction Framework Based on Large Language Models
Hongyao Tu,Liang Zhang,Yujie Lin,Xin Lin,Haibo Zhang,Long Zhang,Jinsong Su
Main category: cs.CL
TL;DR: 本文提出了一种基于大语言模型的开放关系抽取框架(LLM-OREF),通过关系发现和关系预测两个组件,结合自纠正推理策略,无需人工干预即可自动发现并预测新关系。
Details
Motivation: 现有开放关系抽取方法依赖聚类和人工标注来定义新关系,实用性受限,因此需要一种能自动识别新关系且无需人工干预的方法。 Method: 提出一个包含关系发现器(RD)和关系预测器(RP)的框架,利用大语言模型的生成与理解能力,通过示范样例进行新关系预测,并设计三阶段自纠正推理策略:关系发现、去噪与再预测。 Result: 在三个OpenRE数据集上的实验表明,该方法显著优于现有方法,能够有效自动发现和预测新关系。 Conclusion: 所提出的基于大语言模型的OpenRE框架无需人工干预,通过自纠正推理机制提升了新关系预测的准确性与鲁棒性,推动了开放关系抽取的实用化发展。 Abstract: The goal of open relation extraction (OpenRE) is to develop an RE model that can generalize to new relations not encountered during training. Existing studies primarily formulate OpenRE as a clustering task. They first cluster all test instances based on the similarity between the instances, and then manually assign a new relation to each cluster. However, their reliance on human annotation limits their practicality. In this paper, we propose an OpenRE framework based on large language models (LLMs), which directly predicts new relations for test instances by leveraging their strong language understanding and generation abilities, without human intervention. Specifically, our framework consists of two core components: (1) a relation discoverer (RD), designed to predict new relations for test instances based on \textit{demonstrations} formed by training instances with known relations; and (2) a relation predictor (RP), used to select the most likely relation for a test instance from $n$ candidate relations, guided by \textit{demonstrations} composed of their instances. To enhance the ability of our framework to predict new relations, we design a self-correcting inference strategy composed of three stages: relation discovery, relation denoising, and relation prediction. In the first stage, we use RD to preliminarily predict new relations for all test instances. Next, we apply RP to select some high-reliability test instances for each new relation from the prediction results of RD through a cross-validation method. During the third stage, we employ RP to re-predict the relations of all test instances based on the demonstrations constructed from these reliable test instances. Extensive experiments on three OpenRE datasets demonstrate the effectiveness of our framework. We release our code at https://github.com/XMUDeepLIT/LLM-OREF.git.[68] TextMine: LLM-Powered Knowledge Extraction for Humanitarian Mine Action
Chenyue Zhou,Gürkan Solmaz,Flavio Cirillo,Kiril Gashteovski,Jonathan Fürst
Main category: cs.CL
TL;DR: 本文提出了TextMine,一个基于本体引导的管道,利用大语言模型从人道主义排雷行动(HMA)文本中提取知识三元组,显著提升了信息抽取的准确性和格式一致性。
Details
Motivation: 人道主义排雷行动中积累了大量最佳实践知识,但这些知识大多存在于非结构化报告中,难以有效利用。 Method: 提出TextMine管道,结合文档分块、领域感知提示、三元组提取,并采用基于参考和LLM-as-a-Judge的评估方法;同时构建了首个HMA本体和真实排雷报告数据集。 Result: 实验表明,与基线相比,基于本体的提示使提取准确率提高44.2%,幻觉减少22.5%,格式符合度提升20.9%。 Conclusion: TextMine能有效将非结构化HMA文本转化为结构化知识,具备良好的可扩展性,可适应全球排雷工作或其他领域。 Abstract: Humanitarian Mine Action has generated extensive best-practice knowledge, but much remains locked in unstructured reports. We introduce TextMine, an ontology-guided pipeline that uses Large Language Models to extract knowledge triples from HMA texts. TextMine integrates document chunking, domain-aware prompting, triple extraction, and both reference-based and LLM-as-a-Judge evaluation. We also create the first HMA ontology and a curated dataset of real-world demining reports. Experiments show ontology-aligned prompts boost extraction accuracy by 44.2%, cut hallucinations by 22.5%, and improve format conformance by 20.9% over baselines. While validated on Cambodian reports, TextMine can adapt to global demining efforts or other domains, transforming unstructured data into structured knowledge.[69] Large Language Model probabilities cannot distinguish between possible and impossible language
Evelina Leivada,Raquel Montero,Paolo Morosi,Natalia Moskvina,Tamara Serrano,Marcel Aguilar,Fritz Guenther
Main category: cs.CL
TL;DR: 该研究探讨大语言模型是否能区分语法可能与不可能的语言,通过新基准测试发现,模型对语义和语用异常句子的意外程度高于语法错误句子,表明概率不能可靠反映句法知识的内部表征。
Details
Motivation: 验证大语言模型是否真正具备识别语法边界的能力,并质疑当前使用概率作为语法判断代理的有效性。 Method: 利用模型内部表示,计算四种条件下最小对立对的意外度差异:(i)低频语法句、(ii)无语法句、(iii)语义异常句、(iv)语用异常句。 Result: 未发现无语法条件独有的意外度峰值,语义和语用异常引发更高意外度,表明概率无法可靠标识语法错误。 Conclusion: 语言模型的概率输出不能作为其内部句法知识的可靠代理,需采用其他方法验证模型对语言可能性的判断能力。 Abstract: A controversial test for Large Language Models concerns the ability to discern possible from impossible language. While some evidence attests to the models' sensitivity to what crosses the limits of grammatically impossible language, this evidence has been contested on the grounds of the soundness of the testing material. We use model-internal representations to tap directly into the way Large Language Models represent the 'grammatical-ungrammatical' distinction. In a novel benchmark, we elicit probabilities from 4 models and compute minimal-pair surprisal differences, juxtaposing probabilities assigned to grammatical sentences to probabilities assigned to (i) lower frequency grammatical sentences, (ii) ungrammatical sentences, (iii) semantically odd sentences, and (iv) pragmatically odd sentences. The prediction is that if string-probabilities can function as proxies for the limits of grammar, the ungrammatical condition will stand out among the conditions that involve linguistic violations, showing a spike in the surprisal rates. Our results do not reveal a unique surprisal signature for ungrammatical prompts, as the semantically and pragmatically odd conditions consistently show higher surprisal. We thus demonstrate that probabilities do not constitute reliable proxies for model-internal representations of syntactic knowledge. Consequently, claims about models being able to distinguish possible from impossible language need verification through a different methodology.[70] A1: Asynchronous Test-Time Scaling via Conformal Prediction
Jing Xiong,Qiujiang Chen,Fanghua Ye,Zhongwei Wan,Chuanyang Zheng,Chenyang Zhao,Hui Shen,Alexander Hanbo Li,Chaofan Tao,Haochen Tan,Haoli Bai,Lifeng Shang,Lingpeng Kong,Ngai Wong
Main category: cs.CL
TL;DR: 本文提出了A1(异步测试时扩展)框架,通过提高计算密集度、在线校准策略和三阶段拒绝采样流水线,显著提升了大语言模型推理的效率和吞吐量,实现了56.7倍的速度提升和4.14倍的吞吐量改进,同时保持准确性和低延迟。
Details
Motivation: 现有的大语言模型测试时扩展方法面临严重的同步开销、内存瓶颈和延迟问题,尤其是在长推理链的推测解码中表现不佳,因此需要一种更高效、可扩展的推理框架。 Method: 提出A1框架,优化算术强度以识别同步瓶颈,采用在线校准实现异步推理,并设计支持串行与并行扩展的三阶段拒绝采样流水线。 Result: 在MATH、AMC23、AIME24和AIME25数据集上实验表明,A1相比仅使用目标模型扩展的方法,实现了56.7倍的速度提升和4.14倍的吞吐量提升,同时有效控制拒绝率,降低延迟和内存开销,且无精度损失。 Conclusion: A1是一种高效且原理严谨的大语言模型可扩展推理解决方案,能够显著提升测试时扩展性能,适用于复杂推理任务。 Abstract: Large language models (LLMs) benefit from test-time scaling, but existing methods face significant challenges, including severe synchronization overhead, memory bottlenecks, and latency, especially during speculative decoding with long reasoning chains. We introduce A1 (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive inference framework that addresses these challenges. A1 refines arithmetic intensity to identify synchronization as the dominant bottleneck, proposes an online calibration strategy to enable asynchronous inference, and designs a three-stage rejection sampling pipeline that supports both sequential and parallel scaling. Through experiments on the MATH, AMC23, AIME24, and AIME25 datasets, across various draft-target model families, we demonstrate that A1 achieves a remarkable 56.7x speedup in test-time scaling and a 4.14x improvement in throughput, all while maintaining accurate rejection-rate control, reducing latency and memory overhead, and no accuracy loss compared to using target model scaling alone. These results position A1 as an efficient and principled solution for scalable LLM inference. We have released the code at https://github.com/menik1126/asynchronous-test-time-scaling.[71] SMARTER: A Data-efficient Framework to Improve Toxicity Detection with Explanation via Self-augmenting Large Language Models
Huy Nghiem,Advik Sachdeva,Hal Daumé III
Main category: cs.CL
TL;DR: 本文提出了一种名为SMARTER的两阶段框架,利用大语言模型(LLMs)进行可解释的内容审核,在少量数据下实现了比标准少样本基线最高13.5%的macro-F1提升。
Details
Motivation: 社交媒体上的有害内容日益严重,现有方法在标注数据和解释性方面存在不足,因此需要一种数据高效且具备可解释性的内容审核方法。 Method: 第一阶段利用LLMs生成正确和错误标签的合成解释,通过偏好优化实现对齐;第二阶段通过跨模型训练提升解释质量,使较弱模型在风格和语义上与更强模型对齐。 Result: 在HateXplain、Latent Hate和Implicit Hate三个基准任务上,SMARTER相比少样本基线最高提升了13.5%的macro-F1分数,并仅使用少量训练数据即达到优越性能。 Conclusion: SMARTER提供了一种可扩展、低资源的内容审核策略,有效结合了LLMs的自提升能力,兼顾分类准确性与解释质量。 Abstract: WARNING: This paper contains examples of offensive materials. Toxic content has become pervasive on social media platforms. We introduce SMARTER, a data-efficient two-stage framework for explainable content moderation using Large Language Models (LLMs). In Stage 1, we leverage LLMs' own outputs to generate synthetic explanations for both correct and incorrect labels, enabling alignment via preference optimization with minimal human supervision. In Stage 2, we refine explanation quality through cross-model training, allowing weaker models to align stylistically and semantically with stronger ones. Experiments on three benchmark tasks -- HateXplain, Latent Hate, and Implicit Hate -- demonstrate that SMARTER enables LLMs to achieve up to a 13.5% macro-F1 improvement over standard few-shot baselines while using only a fraction of the full training data. Our framework offers a scalable strategy for low-resource settings by harnessing LLMs' self-improving capabilities for both classification and explanation.[72] Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning
Yeongbin Seo,Dongha Lee,Jaehyung Kim,Jinyoung Yeo
Main category: cs.CL
TL;DR: 提出卷积解码(Conv)和基于拒绝规则的微调(R2FT)方法,解决扩散语言模型中的长解码窗口问题,在保持并行性的同时提升生成质量与速度。
Details
Motivation: 现有扩散语言模型在长距离生成时易出现无关或重复内容,且已有解决方案牺牲了并行性和速度优势。 Method: 提出卷积解码(Conv),通过基于归一化的方法缩小解码窗口而不进行硬分割;引入拒绝规则微调(R2FT)以对远离上下文位置的token进行后训练对齐。 Result: 在开放生成基准(如AlpacaEval)上达到扩散语言模型中的最优性能,且所需步数显著低于先前方法。 Conclusion: 所提方法在不牺牲并行解码优势的前提下,有效缓解长解码窗口问题,兼顾生成质量与推理速度。 Abstract: Autoregressive (AR) language models generate text one token at a time, which limits their inference speed. Diffusion-based language models offer a promising alternative, as they can decode multiple tokens in parallel. However, we identify a key bottleneck in current diffusion LMs: the long decoding-window problem, where tokens generated far from the input context often become irrelevant or repetitive. Previous solutions like semi-autoregressive address this issue by splitting windows into blocks, but this sacrifices speed and bidirectionality, eliminating the main advantage of diffusion models. To overcome this, we propose Convolutional decoding (Conv), a normalization-based method that narrows the decoding window without hard segmentation, leading to better fluency and flexibility. Additionally, we introduce Rejecting Rule-based Fine-Tuning (R2FT), a post-hoc training scheme that better aligns tokens at positions far from context. Our methods achieve state-of-the-art results on open-ended generation benchmarks (e.g., AlpacaEval) among diffusion LM baselines, with significantly lower step size than previous works, demonstrating both speed and quality improvements.[73] Fair-GPTQ: Bias-Aware Quantization for Large Language Models
Irina Proskurina,Guillaume Metzler,Julien Velcin
Main category: cs.CL
TL;DR: 本文提出了Fair-GPTQ,一种在量化过程中显式减少大语言模型不公平性的新方法,通过在量化目标中引入群体公平性约束,有效降低性别、种族和宗教等偏见,同时保持4比特量化的效率和性能。
Details
Motivation: 现有量化方法(如GPTQ)虽能降低计算成本,但可能加剧模型输出的偏见,影响公平性,而具体导致该问题的权重尚不明确。因此,需要一种兼顾效率与公平性的量化方法。 Method: 在GPTQ框架基础上,引入显式的群体公平性约束到量化目标中,指导舍入操作的学习过程,以减少对受保护群体的偏见生成,特别关注职业刻板印象及性别、种族、宗教相关的歧视性语言。 Result: Fair-GPTQ在零样本基准上保留至少90%的基线准确率,显著降低不公平性,优于半精度模型,并保持4比特量化的内存和速度优势;在种族刻板印象基准上表现与迭代零空间投影去偏方法相当。 Conclusion: Fair-GPTQ首次将公平性约束融入量化过程,验证了在量化阶段缓解生成模型群体偏见的有效性,并可用于分析通道和权重层级对公平性的贡献。 Abstract: High memory demands of generative language models have drawn attention to quantization, which reduces computational cost, memory usage, and latency by mapping model weights to lower-precision integers. Approaches such as GPTQ effectively minimize input-weight product errors during quantization; however, recent empirical studies show that they can increase biased outputs and degrade performance on fairness benchmarks, and it remains unclear which specific weights cause this issue. In this work, we draw new links between quantization and model fairness by adding explicit group-fairness constraints to the quantization objective and introduce Fair-GPTQ, the first quantization method explicitly designed to reduce unfairness in large language models. The added constraints guide the learning of the rounding operation toward less-biased text generation for protected groups. Specifically, we focus on stereotype generation involving occupational bias and discriminatory language spanning gender, race, and religion. Fair-GPTQ has minimal impact on performance, preserving at least 90% of baseline accuracy on zero-shot benchmarks, reduces unfairness relative to a half-precision model, and retains the memory and speed benefits of 4-bit quantization. We also compare the performance of Fair-GPTQ with existing debiasing methods and find that it achieves performance on par with the iterative null-space projection debiasing approach on racial-stereotype benchmarks. Overall, the results validate our theoretical solution to the quantization problem with a group-bias term, highlight its applicability for reducing group bias at quantization time in generative models, and demonstrate that our approach can further be used to analyze channel- and weight-level contributions to fairness during quantization.[74] What's the Best Way to Retrieve Slides? A Comparative Study of Multimodal, Caption-Based, and Hybrid Retrieval Techniques
Petros Stylianos Giouroukis,Dimitris Dimitriadis,Dimitrios Papadopoulos,Zhenwen Shao,Grigorios Tsoumakas
Main category: cs.CL
TL;DR: 本文研究了多种有效的幻灯片检索方法,包括视觉late-interaction嵌入模型、视觉重排序器和混合检索技术,并提出了一种基于视觉语言模型的字幕生成管道,在减少存储需求的同时保持良好的检索性能。
Details
Motivation: 由于幻灯片是结合文本、图像和图表的多模态文档,传统分离索引方式复杂且易丢失上下文信息,因此需要更高效的检索方法。 Method: 采用了ColPali等视觉late-interaction嵌入模型、视觉重排序器、BM25与稠密检索结合的混合检索技术,以及Reciprocal Rank Fusion等融合方法,并评估了基于视觉语言模型的字幕生成管道。 Result: 提出的字幕生成管道显著降低了嵌入存储需求,同时实现了与视觉late-interaction技术相当的检索性能;综合评估了各类方法在检索效果、运行时间和存储开销方面的表现。 Conclusion: 研究为实际应用中高效鲁棒的幻灯片检索系统的选择与开发提供了实用指导,表明结合字幕的轻量级方法可作为高效替代方案。 Abstract: Slide decks, serving as digital reports that bridge the gap between presentation slides and written documents, are a prevalent medium for conveying information in both academic and corporate settings. Their multimodal nature, combining text, images, and charts, presents challenges for retrieval-augmented generation systems, where the quality of retrieval directly impacts downstream performance. Traditional approaches to slide retrieval often involve separate indexing of modalities, which can increase complexity and lose contextual information. This paper investigates various methodologies for effective slide retrieval, including visual late-interaction embedding models like ColPali, the use of visual rerankers, and hybrid retrieval techniques that combine dense retrieval with BM25, further enhanced by textual rerankers and fusion methods like Reciprocal Rank Fusion. A novel Vision-Language Models-based captioning pipeline is also evaluated, demonstrating significantly reduced embedding storage requirements compared to visual late-interaction techniques, alongside comparable retrieval performance. Our analysis extends to the practical aspects of these methods, evaluating their runtime performance and storage demands alongside retrieval efficacy, thus offering practical guidance for the selection and development of efficient and robust slide retrieval systems for real-world applications.[75] Assessing Historical Structural Oppression Worldwide via Rule-Guided Prompting of Large Language Models
Sreejato Chatterjee,Linh Tran,Quoc Duy Nguyen,Roni Kirson,Drue Hamlin,Harvest Aquino,Hanjia Lyu,Jiebo Luo,Timothy Dye
Main category: cs.CL
TL;DR: 提出一种利用大语言模型(LLM)测量历史结构性压迫的新框架,通过规则引导的提示策略,从多语言新冠疫情数据中的民族认同表述生成情境敏感的压迫评分,揭示身份相关的系统性排斥,并发布开源基准数据集。
Details
Motivation: 传统测量历史结构性压迫的方法因各国历史背景差异大而缺乏跨国可比性,且常忽视基于身份的生存性排斥,仅侧重物质资源指标。 Method: 利用大语言模型(LLM),结合规则引导的提示策略,分析多语言自我认同的民族表述,生成理论驱动、可解释的历史压迫评分,并在多个先进LLM上系统评估该方法。 Result: 研究表明,在明确规则引导下,LLM能有效捕捉国家内部基于身份的历史压迫细微差异,提供一种可扩展、跨文化的系统性排斥测量视角。 Conclusion: 该框架为测量历史结构性压迫提供了补充性工具,突出了数据驱动研究和公共卫生中压迫表现的维度,具有良好的可复现性和应用潜力。 Abstract: Traditional efforts to measure historical structural oppression struggle with cross-national validity due to the unique, locally specified histories of exclusion, colonization, and social status in each country, and often have relied on structured indices that privilege material resources while overlooking lived, identity-based exclusion. We introduce a novel framework for oppression measurement that leverages Large Language Models (LLMs) to generate context-sensitive scores of lived historical disadvantage across diverse geopolitical settings. Using unstructured self-identified ethnicity utterances from a multilingual COVID-19 global study, we design rule-guided prompting strategies that encourage models to produce interpretable, theoretically grounded estimations of oppression. We systematically evaluate these strategies across multiple state-of-the-art LLMs. Our results demonstrate that LLMs, when guided by explicit rules, can capture nuanced forms of identity-based historical oppression within nations. This approach provides a complementary measurement tool that highlights dimensions of systemic exclusion, offering a scalable, cross-cultural lens for understanding how oppression manifests in data-driven research and public health contexts. To support reproducible evaluation, we release an open-sourced benchmark dataset for assessing LLMs on oppression measurement (https://github.com/chattergpt/llm-oppression-benchmark).[76] LNE-Blocking: An Efficient Framework for Contamination Mitigation Evaluation on Large Language Models
Ruijie Hou,Yueyang Jiao,Hanxu Hu,Yingming Li,Wai Lam,Huajian Zhang,Hongyuan Lu
Main category: cs.CL
TL;DR: 提出了一种名为LNE-Blocking的新框架,用于在数据污染不可避免的情况下恢复大语言模型在潜在泄露数据集上的性能。
Details
Motivation: 由于训练数据中可能无意包含评估基准,导致大语言模型存在数据污染问题,难以公平评估模型性能。 Method: 框架包含两部分:使用LNE进行污染检测,根据检测结果调整Blocking操作的强度,以抑制模型的记忆化响应。 Result: 该方法能有效恢复模型在贪婪解码下的性能,在多个可能存在数据泄露的数据集上表现良好,并在不同模型和污染程度下保持稳定恢复效果。 Conclusion: LNE-Blocking是首个能高效恢复污染后模型性能的框架,为在无法避免数据污染的情况下进行公平评估提供了可行方案。 Abstract: The problem of data contamination is now almost inevitable during the development of large language models (LLMs), with the training data commonly integrating those evaluation benchmarks even unintentionally. This problem subsequently makes it hard to benchmark LLMs fairly. Instead of constructing contamination-free datasets (quite hard), we propose a novel framework, \textbf{LNE-Blocking}, to restore model performance prior to contamination on potentially leaked datasets. Our framework consists of two components: contamination detection and disruption operation. For the prompt, the framework first uses the contamination detection method, \textbf{LNE}, to assess the extent of contamination in the model. Based on this, it adjusts the intensity of the disruption operation, \textbf{Blocking}, to elicit non-memorized responses from the model. Our framework is the first to efficiently restore the model's greedy decoding performance. This comes with a strong performance on multiple datasets with potential leakage risks, and it consistently achieves stable recovery results across different models and varying levels of data contamination. We release the code at https://github.com/RuijieH/LNE-Blocking to facilitate research.cs.CV [Back]
[77] Class-invariant Test-Time Augmentation for Domain Generalization
Zhicheng Lin,Xiaolin Wu,Xi Zhang
Main category: cs.CV
TL;DR: 提出一种轻量级测试时增强方法CI-TTA,通过弹性与网格变形生成同类别图像变体,并利用置信度过滤机制聚合预测,提升模型在分布偏移下的泛化性能。
Details
Motivation: 深度模型在分布偏移下性能下降严重,现有领域泛化方法多依赖多域训练或高计算成本的测试时适应,缺乏高效轻量的解决方案。 Method: 提出类不变测试时增强(CI-TTA),通过弹性变形和网格变形生成同一类别的输入图像变体,结合置信度引导的过滤机制聚合预测结果,去除不可靠输出。 Result: 在PACS和Office-Home数据集上验证了CI-TTA的有效性,可在不同领域泛化算法和主干网络上实现一致性能提升。 Conclusion: CI-TTA是一种有效且通用的轻量级测试时增强策略,能够显著提升模型在未见域上的泛化能力,无需额外训练或高计算开销。 Abstract: Deep models often suffer significant performance degradation under distribution shifts. Domain generalization (DG) seeks to mitigate this challenge by enabling models to generalize to unseen domains. Most prior approaches rely on multi-domain training or computationally intensive test-time adaptation. In contrast, we propose a complementary strategy: lightweight test-time augmentation. Specifically, we develop a novel Class-Invariant Test-Time Augmentation (CI-TTA) technique. The idea is to generate multiple variants of each input image through elastic and grid deformations that nevertheless belong to the same class as the original input. Their predictions are aggregated through a confidence-guided filtering scheme that remove unreliable outputs, ensuring the final decision relies on consistent and trustworthy cues. Extensive Experiments on PACS and Office-Home datasets demonstrate consistent gains across different DG algorithms and backbones, highlighting the effectiveness and generality of our approach.[78] AToken: A Unified Tokenizer for Vision
Jiasen Lu,Liangchen Song,Mingze Xu,Byeongjoo Ahn,Yanjun Wang,Chen Chen,Afshin Dehghan,Yinfei Yang
Main category: cs.CV
TL;DR: AToken是首个统一的视觉分词器,能够在图像、视频和3D资产上同时实现高保真重建和语义理解,通过共享4D潜在空间和纯Transformer架构,支持多种模态和任务。
Details
Motivation: 现有分词器通常仅专注于单一模态下的重建或理解任务,缺乏跨模态统一处理的能力,因此需要一个能够同时兼顾高质量重建与语义理解的通用视觉分词器。 Method: 提出AToken,采用纯Transformer架构与4D旋转位置编码,将多种视觉输入映射到共享4D潜在空间;使用无对抗的感知损失和Gram矩阵损失进行训练,并通过渐进式训练策略逐步扩展至图像、视频和3D数据。 Result: 在图像上达到0.21 rFID和82.2% ImageNet准确率,视频上3.01 rFVD和32.6% MSRVTT检索率,3D数据上28.19 PSNR和90.9%分类准确率;支持生成与理解任务,如文本到视频生成、图像到3D合成和多模态大模型应用。 Conclusion: AToken实现了跨模态、跨任务的统一视觉分词,为下一代多模态AI系统提供了基础框架。 Abstract: We present AToken, the first unified visual tokenizer that achieves both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets. Unlike existing tokenizers that specialize in either reconstruction or understanding for single modalities, AToken encodes these diverse visual inputs into a shared 4D latent space, unifying both tasks and modalities in a single framework. Specifically, we introduce a pure transformer architecture with 4D rotary position embeddings to process visual inputs of arbitrary resolutions and temporal durations. To ensure stable training, we introduce an adversarial-free training objective that combines perceptual and Gram matrix losses, achieving state-of-the-art reconstruction quality. By employing a progressive training curriculum, AToken gradually expands from single images, videos, and 3D, and supports both continuous and discrete latent tokens. AToken achieves 0.21 rFID with 82.2% ImageNet accuracy for images, 3.01 rFVD with 32.6% MSRVTT retrieval for videos, and 28.19 PSNR with 90.9% classification accuracy for 3D. In downstream applications, AToken enables both visual generation tasks (e.g., image generation with continuous and discrete tokens, text-to-video generation, image-to-3D synthesis) and understanding tasks (e.g., multimodal LLMs), achieving competitive performance across all benchmarks. These results shed light on the next-generation multimodal AI systems built upon unified visual tokenization.[79] MemEvo: Memory-Evolving Incremental Multi-view Clustering
Zisen Kong,Bo Zhong,Pengyuan Li,Dongxia Chang,Yiming Wang
Main category: cs.CV
TL;DR: 提出了一种基于海马体-前额叶皮层记忆机制的增量多视图聚类方法MemEvo,通过视图对齐、认知遗忘和知识巩固模块有效平衡稳定性与可塑性。
Details
Motivation: 解决增量多视图聚类中的稳定性-可塑性困境(SPD),避免模型在新增视图时发生灾难性遗忘,同时保持对新数据的快速适应能力。 Method: 受神经科学中海马体-前额叶协同记忆机制启发,设计了三个核心模块:海马体启发的视图对齐模块用于捕捉新视图结构信息,模拟人类记忆衰减的认知遗忘机制调节历史知识权重,前额叶启发的知识整合模块利用时间张量稳定性逐步固化长期知识。 Result: 实验表明,MemEvo在不断增加视图数量的场景下显著优于现有最先进方法,展现出更强的知识保留能力和稳定的聚类性能。 Conclusion: MemEvo通过模拟大脑记忆机制,有效平衡了模型的稳定性和可塑性,为增量多视图聚类提供了新的思路和高效解决方案。 Abstract: Incremental multi-view clustering aims to achieve stable clustering results while addressing the stability-plasticity dilemma (SPD) in incremental views. At the core of SPD is the challenge that the model must have enough plasticity to quickly adapt to new data, while maintaining sufficient stability to consolidate long-term knowledge and prevent catastrophic forgetting. Inspired by the hippocampal-prefrontal cortex collaborative memory mechanism in neuroscience, we propose a Memory-Evolving Incremental Multi-view Clustering method (MemEvo) to achieve this balance. First, we propose a hippocampus-inspired view alignment module that captures the gain information of new views by aligning structures in continuous representations. Second, we introduce a cognitive forgetting mechanism that simulates the decay patterns of human memory to modulate the weights of historical knowledge. Additionally, we design a prefrontal cortex-inspired knowledge consolidation memory module that leverages temporal tensor stability to gradually consolidate historical knowledge. By integrating these modules, MemEvo achieves strong knowledge retention capabilities in scenarios with a growing number of views. Extensive experiments demonstrate that MemEvo exhibits remarkable advantages over existing state-of-the-art methods.[80] Edge-Aware Normalized Attention for Efficient and Detail-Preserving Single Image Super-Resolution
Penghao Rao,Tieyong Zeng
Main category: cs.CV
TL;DR: 提出一种基于边缘引导注意力机制的轻量级单图像超分辨率方法,通过联合编码边缘特征和中间特征生成自适应调制图,有效提升结构清晰度和感知质量。
Details
Motivation: 现有边缘感知方法在复杂主干网络上附加边缘先验或注意力分支,常导致冗余、优化不稳定或结构增益有限。 Method: 设计边缘引导注意力机制,从联合编码的边缘特征和中间特征激活中生成自适应调制图,用于归一化和重加权响应;结合像素级、感知和对抗损失的复合目标函数,在轻量残差结构中训练。 Result: 在标准SISR基准上显著优于SRGAN、ESRGAN及先前边缘注意力方法,提升了结构锐度和感知质量,同时保持模型复杂度相当。 Conclusion: 所提方法为注入边缘先验提供了参数高效路径,通过定制多项目损失稳定了对抗性优化,并在不增加网络深度或参数量的情况下增强了边缘保真度。 Abstract: Single-image super-resolution (SISR) remains highly ill-posed because recovering structurally faithful high-frequency content from a single low-resolution observation is ambiguous. Existing edge-aware methods often attach edge priors or attention branches onto increasingly complex backbones, yet ad hoc fusion frequently introduces redundancy, unstable optimization, or limited structural gains. We address this gap with an edge-guided attention mechanism that derives an adaptive modulation map from jointly encoded edge features and intermediate feature activations, then applies it to normalize and reweight responses, selectively amplifying structurally salient regions while suppressing spurious textures. In parallel, we integrate this mechanism into a lightweight residual design trained under a composite objective combining pixel-wise, perceptual, and adversarial terms to balance fidelity, perceptual realism, and training stability. Extensive experiments on standard SISR benchmarks demonstrate consistent improvements in structural sharpness and perceptual quality over SRGAN, ESRGAN, and prior edge-attention baselines at comparable model complexity. The proposed formulation provides (i) a parameter-efficient path to inject edge priors, (ii) stabilized adversarial refinement through a tailored multiterm loss, and (iii) enhanced edge fidelity without resorting to deeper or heavily overparameterized architectures. These results highlight the effectiveness of principled edge-conditioned modulation for advancing perceptual super-resolution.[81] Adaptive and Iterative Point Cloud Denoising with Score-Based Diffusion Model
Zhaonan Wang,Manyi Li,ShiQing Xin,Changhe Tu
Main category: cs.CV
TL;DR: 本文提出了一种基于分数扩散模型的自适应迭代点云去噪方法,能够根据噪声水平自动调整去噪策略,并通过两阶段采样和网络设计实现特征与梯度融合,显著提升了去噪效果,尤其在保持形状边界和细节方面优于现有方法。
Details
Motivation: 现有点云去噪方法通常采用固定次数的迭代,缺乏对不同噪声水平和模式的自适应能力,且去噪过程效率不高,因此需要一种能根据输入噪声特性动态调整的高效去噪框架。 Method: 提出基于分数扩散模型的自适应迭代去噪方法:首先估计噪声方差并生成自适应去噪调度,然后通过设计的网络架构和两阶段采样策略进行迭代优化,实现特征融合与梯度融合。 Result: 在合成数据集(多种噪声模式)和真实扫描数据集上均取得优于当前最先进方法的去噪效果,定性和定量指标均领先,同时更好地保留了点云的边界和细节结构。 Conclusion: 所提出的自适应扩散去噪框架有效解决了传统方法在处理不同噪声水平时的局限性,通过动态调度和融合机制实现了高质量、鲁棒的点云去噪。 Abstract: Point cloud denoising task aims to recover the clean point cloud from the scanned data coupled with different levels or patterns of noise. The recent state-of-the-art methods often train deep neural networks to update the point locations towards the clean point cloud, and empirically repeat the denoising process several times in order to obtain the denoised results. It is not clear how to efficiently arrange the iterative denoising processes to deal with different levels or patterns of noise. In this paper, we propose an adaptive and iterative point cloud denoising method based on the score-based diffusion model. For a given noisy point cloud, we first estimate the noise variation and determine an adaptive denoising schedule with appropriate step sizes, then invoke the trained network iteratively to update point clouds following the adaptive schedule. To facilitate this adaptive and iterative denoising process, we design the network architecture and a two-stage sampling strategy for the network training to enable feature fusion and gradient fusion for iterative denoising. Compared to the state-of-the-art point cloud denoising methods, our approach obtains clean and smooth denoised point clouds, while preserving the shape boundary and details better. Our results not only outperform the other methods both qualitatively and quantitatively, but also are preferable on the synthetic dataset with different patterns of noises, as well as the real-scanned dataset.[82] DiffVL: Diffusion-Based Visual Localization on 2D Maps via BEV-Conditioned GPS Denoising
Li Gao,Hongyang Sun,Liu Liu,Yunhao Li,Yang Cai
Main category: cs.CV
TL;DR: 提出DiffVL,首个将视觉定位重构为GPS去噪任务的扩散模型框架,利用噪声GPS轨迹、SD地图和视觉信号实现无需高精地图的亚米级定位精度。
Details
Motivation: 现有基于标准精度地图(SD map)的视觉定位方法主要依赖鸟瞰图匹配,忽略了易获取但含噪声的GPS信号;而高精地图虽精确却成本高昂,难以扩展。因此需要一种不依赖高精地图且能有效利用噪声GPS的可扩展定位方法。 Method: 提出DiffVL,首次将视觉定位视为GPS去噪问题,采用扩散模型,以视觉BEV特征和SD地图为条件,对噪声GPS轨迹进行迭代去噪,恢复真实位姿分布,联合建模GPS、SD地图与视觉信号。 Result: 在多个数据集上达到最优性能,显著优于BEV匹配基线方法(如OrienterNet),实现亚米级定位精度,验证了扩散模型作为生成先验在定位任务中的有效性。 Conclusion: DiffVL实现了无需高精地图的高精度视觉定位,通过将噪声GPS作为生成先验,标志着从传统匹配范式向生成式去噪范式的转变,推动了可扩展自动驾驶定位技术的发展。 Abstract: Accurate visual localization is crucial for autonomous driving, yet existing methods face a fundamental dilemma: While high-definition (HD) maps provide high-precision localization references, their costly construction and maintenance hinder scalability, which drives research toward standard-definition (SD) maps like OpenStreetMap. Current SD-map-based approaches primarily focus on Bird's-Eye View (BEV) matching between images and maps, overlooking a ubiquitous signal-noisy GPS. Although GPS is readily available, it suffers from multipath errors in urban environments. We propose DiffVL, the first framework to reformulate visual localization as a GPS denoising task using diffusion models. Our key insight is that noisy GPS trajectory, when conditioned on visual BEV features and SD maps, implicitly encode the true pose distribution, which can be recovered through iterative diffusion refinement. DiffVL, unlike prior BEV-matching methods (e.g., OrienterNet) or transformer-based registration approaches, learns to reverse GPS noise perturbations by jointly modeling GPS, SD map, and visual signals, achieving sub-meter accuracy without relying on HD maps. Experiments on multiple datasets demonstrate that our method achieves state-of-the-art accuracy compared to BEV-matching baselines. Crucially, our work proves that diffusion models can enable scalable localization by treating noisy GPS as a generative prior-making a paradigm shift from traditional matching-based methods.[83] DICE: Diffusion Consensus Equilibrium for Sparse-view CT Reconstruction
Leon Suarez-Rodriguez,Roman Jacome,Romario Gualdron-Hurtado,Ana Mantilla-Dulcey,Henry Arguello
Main category: cs.CV
TL;DR: 本文提出了一种名为Diffusion Consensus Equilibrium (DICE)的稀疏视角CT重建框架,通过结合扩散模型的生成先验与测量一致性约束,在均匀和非均匀稀疏采样条件下显著优于现有方法。
Details
Motivation: 稀疏视角CT重建因欠采样导致逆问题严重不适定,传统方法难以有效捕捉医学图像中的复杂结构。 Method: 引入DICE框架,在扩散模型采样过程中融入双代理共识均衡机制:一个数据一致性代理(通过近端算子强制测量一致性),一个先验代理(由扩散模型在每一步采样中估计干净图像)。两者迭代平衡,实现强生成先验与数据保真的结合。 Result: 实验表明,DICE在15、30、60视图(共180视图)的均匀与非均匀稀疏设置下,显著优于当前最先进的基线方法,能高质量重建CT图像。 Conclusion: DICE有效融合了扩散模型的强大先验能力与测量一致性,展现出在稀疏视角CT重建中的优越性能和鲁棒性。 Abstract: Sparse-view computed tomography (CT) reconstruction is fundamentally challenging due to undersampling, leading to an ill-posed inverse problem. Traditional iterative methods incorporate handcrafted or learned priors to regularize the solution but struggle to capture the complex structures present in medical images. In contrast, diffusion models (DMs) have recently emerged as powerful generative priors that can accurately model complex image distributions. In this work, we introduce Diffusion Consensus Equilibrium (DICE), a framework that integrates a two-agent consensus equilibrium into the sampling process of a DM. DICE alternates between: (i) a data-consistency agent, implemented through a proximal operator enforcing measurement consistency, and (ii) a prior agent, realized by a DM performing a clean image estimation at each sampling step. By balancing these two complementary agents iteratively, DICE effectively combines strong generative prior capabilities with measurement consistency. Experimental results show that DICE significantly outperforms state-of-the-art baselines in reconstructing high-quality CT images under uniform and non-uniform sparse-view settings of 15, 30, and 60 views (out of a total of 180), demonstrating both its effectiveness and robustness.[84] Domain Adaptation for Ulcerative Colitis Severity Estimation Using Patient-Level Diagnoses
Takamasa Yamaguchi,Brian Kenji Iwana,Ryoma Bise,Shota Harada,Takumi Okuo,Kiyohito Tanaka,Kaito Shiku
Main category: cs.CV
TL;DR: 提出一种弱监督域适应方法,利用患者级别的诊断结果作为弱监督信号,通过共享聚合令牌和最大严重性三元组损失对齐跨域类别分布,有效提升溃疡性结肠炎严重程度估计性能。
Details
Motivation: 现有方法因不同医院成像设备和临床环境差异导致的域偏移问题而受限,且目标域缺乏标注或标注成本高。 Method: 提出弱监督域适应方法,利用患者级别诊断结果作为弱监督信号,采用共享聚合令牌和最大严重性三元组损失对齐跨域类别分布。 Result: 实验结果显示该方法在域偏移场景下优于现有的域适应方法,显著提升UC严重程度估计效果。 Conclusion: 所提方法能有效利用弱监督信息缓解域偏移问题,在实际临床应用中具有潜力。 Abstract: The development of methods to estimate the severity of Ulcerative Colitis (UC) is of significant importance. However, these methods often suffer from domain shifts caused by differences in imaging devices and clinical settings across hospitals. Although several domain adaptation methods have been proposed to address domain shift, they still struggle with the lack of supervision in the target domain or the high cost of annotation. To overcome these challenges, we propose a novel Weakly Supervised Domain Adaptation method that leverages patient-level diagnostic results, which are routinely recorded in UC diagnosis, as weak supervision in the target domain. The proposed method aligns class-wise distributions across domains using Shared Aggregation Tokens and a Max-Severity Triplet Loss, which leverages the characteristic that patient-level diagnoses are determined by the most severe region within each patient. Experimental results demonstrate that our method outperforms comparative DA approaches, improving UC severity estimation in a domain-shifted setting.[85] Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark
Rashid Mushkani
Main category: cs.CV
TL;DR: 本文介绍了一个用于测试视觉-语言模型在城市感知任务上的小型基准,基于100张蒙特利尔街景图像(真实与合成各半),结合多维度人类标注,评估了七种VLM在客观属性和主观印象识别上的表现,发现模型在客观特征上表现更好,且人类一致性高的任务模型得分也更高。
Details
Motivation: 理解人们如何解读城市场景有助于城市设计与规划,但现有视觉-语言模型在城市感知任务上的评估不足,缺乏结合主观与客观维度的标准化基准。 Method: 构建包含100张真实与合成街景图像的基准数据集,由12名来自社区群体的参与者在30个维度上提供230份标注;将法语标注统一为英文,并采用结构化提示与确定性解析器对七种VLM进行零样本评估;使用准确率和Jaccard重叠度量模型性能,人类一致性通过Krippendorff's alpha和成对Jaccard衡量。 Result: 模型在可见的客观属性上表现优于主观评价;最佳系统(claude-sonnet)在多标签项目上达到macro F1 0.31和平均Jaccard 0.48;人类标注一致性越高,模型得分越高;合成图像略降低模型表现。 Conclusion: 该研究展示了当前VLM在城市感知任务中的局限性,特别是在主观判断方面,并提供了可复现、支持不确定性评估的开源基准,推动公众参与式城市分析的发展。 Abstract: Understanding how people read city scenes can inform design and planning. We introduce a small benchmark for testing vision-language models (VLMs) on urban perception using 100 Montreal street images, evenly split between photographs and photorealistic synthetic scenes. Twelve participants from seven community groups supplied 230 annotation forms across 30 dimensions mixing physical attributes and subjective impressions. French responses were normalized to English. We evaluated seven VLMs in a zero-shot setup with a structured prompt and deterministic parser. We use accuracy for single-choice items and Jaccard overlap for multi-label items; human agreement uses Krippendorff's alpha and pairwise Jaccard. Results suggest stronger model alignment on visible, objective properties than subjective appraisals. The top system (claude-sonnet) reaches macro 0.31 and mean Jaccard 0.48 on multi-label items. Higher human agreement coincides with better model scores. Synthetic images slightly lower scores. We release the benchmark, prompts, and harness for reproducible, uncertainty-aware evaluation in participatory urban analysis.[86] Feature-aligned Motion Transformation for Efficient Dynamic Point Cloud Compression
Xuan Deng,Xiandong Meng,Longguang Wang,Tiange Zhang,Xiaopeng Fan,Debin Zhao
Main category: cs.CV
TL;DR: 提出一种基于特征对齐的运动变换(FMT)框架,用于动态点云压缩,通过隐式建模时间连续性变化提升压缩效率。
Details
Motivation: 现有方法依赖显式运动估计,难以捕捉复杂动态且无法充分挖掘时间相关性,限制了动态点云的压缩性能。 Method: 采用特征对齐的运动变换(FMT),在潜在空间中通过时空对齐策略隐式建模时间变化,并结合随机访问参考策略实现双向运动参考和分层编码。 Result: 相比D-DPCC和AdaDPCC方法,在编码和解码效率上均有提升,BD-Rate分别降低20%和9.4%。 Conclusion: FMT框架能有效提升动态点云压缩的效率与处理性能,支持帧级并行压缩。 Abstract: Dynamic point clouds are widely used in applications such as immersive reality, robotics, and autonomous driving. Efficient compression largely depends on accurate motion estimation and compensation, yet the irregular structure and significant local variations of point clouds make this task highly challenging. Current methods often rely on explicit motion estimation, whose encoded vectors struggle to capture intricate dynamics and fail to fully exploit temporal correlations. To overcome these limitations, we introduce a Feature-aligned Motion Transformation (FMT) framework for dynamic point cloud compression. FMT replaces explicit motion vectors with a spatiotemporal alignment strategy that implicitly models continuous temporal variations, using aligned features as temporal context within a latent-space conditional encoding framework. Furthermore, we design a random access (RA) reference strategy that enables bidirectional motion referencing and layered encoding, thereby supporting frame-level parallel compression. Extensive experiments demonstrate that our method surpasses D-DPCC and AdaDPCC in both encoding and decoding efficiency, while also achieving BD-Rate reductions of 20% and 9.4%, respectively. These results highlight the effectiveness of FMT in jointly improving compression efficiency and processing performance.[87] HybridMamba: A Dual-domain Mamba for 3D Medical Image Segmentation
Weitong Wu,Zhaohu Xing,Jing Gong,Qin Peng,Lei Zhu
Main category: cs.CV
TL;DR: 提出HybridMamba架构,通过双互补机制在3D医学图像分割中实现局部与全局特征的平衡,显著优于现有方法。
Details
Motivation: 现有方法在建模长距离依赖时存在局限性,CNN难以捕捉全局上下文,Transformer计算开销大,且过度关注全局信息可能导致局部结构丢失和边界模糊。 Method: 设计HybridMamba,采用特征扫描策略融合轴向遍历和局部自适应路径,并引入结合空间-频率分析的门控模块进行上下文建模。 Result: 在MRI和CT数据集上实验表明,HybridMamba在3D医学图像分割任务中显著优于当前最先进的方法。 Conclusion: HybridMamba有效平衡了局部与全局特征表示,提升了分割精度,尤其在边界清晰度和区域一致性方面表现突出。 Abstract: In the domain of 3D biomedical image segmentation, Mamba exhibits the superior performance for it addresses the limitations in modeling long-range dependencies inherent to CNNs and mitigates the abundant computational overhead associated with Transformer-based frameworks when processing high-resolution medical volumes. However, attaching undue importance to global context modeling may inadvertently compromise critical local structural information, thus leading to boundary ambiguity and regional distortion in segmentation outputs. Therefore, we propose the HybridMamba, an architecture employing dual complementary mechanisms: 1) a feature scanning strategy that progressively integrates representations both axial-traversal and local-adaptive pathways to harmonize the relationship between local and global representations, and 2) a gated module combining spatial-frequency analysis for comprehensive contextual modeling. Besides, we collect a multi-center CT dataset related to lung cancer. Experiments on MRI and CT datasets demonstrate that HybridMamba significantly outperforms the state-of-the-art methods in 3D medical image segmentation.[88] Enhancing Feature Fusion of U-like Networks with Dynamic Skip Connections
Yue Cao,Quansong He,Kaishen Wang,Jianlong Xiong,Tao He
Main category: cs.CV
TL;DR: 提出一种新型动态跳跃连接(DSC)模块,通过测试时训练和动态多尺度核模块解决传统U型网络中跳跃连接的跨特征与特征内约束问题,提升医学图像分割性能。
Details
Motivation: 传统U-like网络中的跳跃连接存在跨特征静态融合和特征内多尺度建模不足的问题,限制了语义与空间信息的有效整合。 Method: 设计DSC模块,包含测试时训练(TTT)模块以实现推理时内容感知的动态特征优化,以及动态多尺度核(DMSK)模块,根据全局上下文自适应选择卷积核尺寸,增强多尺度特征融合能力。 Result: DSC模块在CNN、Transformer、混合结构及Mamba-based的U-like网络中均表现出显著性能提升,具备良好的通用性和即插即用特性。 Conclusion: DSC模块有效克服了传统跳跃连接的局限性,增强了跨层连接的灵活性与表达能力,可广泛适用于多种架构的医学图像分割模型。 Abstract: U-like networks have become fundamental frameworks in medical image segmentation through skip connections that bridge high-level semantics and low-level spatial details. Despite their success, conventional skip connections exhibit two key limitations: inter-feature constraints and intra-feature constraints. The inter-feature constraint refers to the static nature of feature fusion in traditional skip connections, where information is transmitted along fixed pathways regardless of feature content. The intra-feature constraint arises from the insufficient modeling of multi-scale feature interactions, thereby hindering the effective aggregation of global contextual information. To overcome these limitations, we propose a novel Dynamic Skip Connection (DSC) block that fundamentally enhances cross-layer connectivity through adaptive mechanisms. The DSC block integrates two complementary components. (1) Test-Time Training (TTT) module. This module addresses the inter-feature constraint by enabling dynamic adaptation of hidden representations during inference, facilitating content-aware feature refinement. (2) Dynamic Multi-Scale Kernel (DMSK) module. To mitigate the intra-feature constraint, this module adaptively selects kernel sizes based on global contextual cues, enhancing the network capacity for multi-scale feature integration. The DSC block is architecture-agnostic and can be seamlessly incorporated into existing U-like network structures. Extensive experiments demonstrate the plug-and-play effectiveness of the proposed DSC block across CNN-based, Transformer-based, hybrid CNN-Transformer, and Mamba-based U-like networks.[89] LSTC-MDA: A Unified Framework for Long-Short Term Temporal Convolution and Mixed Data Augmentation in Skeleton-Based Action Recognition
Feng Ding,Haisheng Fu,Soroush Oraki,Jie Liang
Main category: cs.CV
TL;DR: 提出了一种统一框架LSTC-MDA,用于解决基于骨架的动作识别中标注样本稀缺和时序依赖建模困难的问题,通过新型的长短时卷积模块和改进的数据增强方法,在多个数据集上实现了最先进的性能。
Details
Motivation: 解决基于骨架动作识别中标签样本稀缺以及时序建模(短程和长程依赖)困难的问题。 Method: 提出LSTC-MDA框架,包含Long-Short Term Temporal Convolution(LSTC)模块,采用并行的短时和长时分支,并通过学习到的相似性权重自适应融合;同时扩展Joint Mixing Data Augmentation(JMDA),在输入层引入Additive Mixup,并限制同一摄像头视角内的mixup操作以避免分布偏移。 Result: 在NTU 60、NTU 120和NW-UCLA等多个数据集上达到SOTA性能:NTU 60上X-Sub为94.1%,X-View为97.5%;NTU 120上X-Sub为90.4%,X-Set为92.0%;NW-UCLA上为97.2%。消融实验验证了各组件的有效性。 Conclusion: LSTC-MDA通过改进的时序建模和数据增强策略,有效提升了骨架动作识别的性能,尤其在处理长程依赖和数据稀缺问题上表现突出。 Abstract: Skeleton-based action recognition faces two longstanding challenges: the scarcity of labeled training samples and difficulty modeling short- and long-range temporal dependencies. To address these issues, we propose a unified framework, LSTC-MDA, which simultaneously improves temporal modeling and data diversity. We introduce a novel Long-Short Term Temporal Convolution (LSTC) module with parallel short- and long-term branches, these two feature branches are then aligned and fused adaptively using learned similarity weights to preserve critical long-range cues lost by conventional stride-2 temporal convolutions. We also extend Joint Mixing Data Augmentation (JMDA) with an Additive Mixup at the input level, diversifying training samples and restricting mixup operations to the same camera view to avoid distribution shifts. Ablation studies confirm each component contributes. LSTC-MDA achieves state-of-the-art results: 94.1% and 97.5% on NTU 60 (X-Sub and X-View), 90.4% and 92.0% on NTU 120 (X-Sub and X-Set),97.2% on NW-UCLA. Code: https://github.com/xiaobaoxia/LSTC-MDA.[90] MultiEdit: Advancing Instruction-based Image Editing on Diverse and Challenging Tasks
Mingsong Li,Lin Liu,Hongjun Wang,Haoxing Chen,Xijun Gu,Shizhan Liu,Dong Gong,Junbo Zhao,Zhenzhong Lan,Jianguo Li
Main category: cs.CV
TL;DR: 本文提出了MultiEdit,一个包含超过10.7万高质量图像编辑样本的数据集,涵盖6种复杂编辑任务和多种编辑类型,通过两个多模态大模型构建数据管道生成视觉自适应指令和高保真编辑图像,显著提升了基础模型在复杂编辑任务上的表现。
Details
Motivation: 现有指令式图像编辑方法受限于数据集的编辑类型少、样本数量有限以及图像-文本对噪声多的问题,难以应对复杂编辑任务。 Method: 提出MultiEdit数据集,采用两个多模态大语言模型分别生成视觉自适应编辑指令和高保真编辑图像,构建高质量、多样化的数据集,并用于微调开源基础模型。 Result: 在MultiEdit-Test基准上,使用MultiEdit-Train微调的模型在复杂编辑任务中性能显著提升,同时保持了在标准编辑基准上的能力。 Conclusion: MultiEdit为推进多样化和更具挑战性的指令式图像编辑研究提供了有价值的资源。 Abstract: Current instruction-based image editing (IBIE) methods struggle with challenging editing tasks, as both editing types and sample counts of existing datasets are limited. Moreover, traditional dataset construction often contains noisy image-caption pairs, which may introduce biases and limit model capabilities in complex editing scenarios. To address these limitations, we introduce MultiEdit, a comprehensive dataset featuring over 107K high-quality image editing samples. It encompasses 6 challenging editing tasks through a diverse collection of 18 non-style-transfer editing types and 38 style transfer operations, covering a spectrum from sophisticated style transfer to complex semantic operations like person reference editing and in-image text editing. We employ a novel dataset construction pipeline that utilizes two multi-modal large language models (MLLMs) to generate visual-adaptive editing instructions and produce high-fidelity edited images, respectively. Extensive experiments demonstrate that fine-tuning foundational open-source models with our MultiEdit-Train set substantially improves models' performance on sophisticated editing tasks in our proposed MultiEdit-Test benchmark, while effectively preserving their capabilities on the standard editing benchmark. We believe MultiEdit provides a valuable resource for advancing research into more diverse and challenging IBIE capabilities. Our dataset is available at https://huggingface.co/datasets/inclusionAI/MultiEdit.[91] Attention Lattice Adapter: Visual Explanation Generation for Visual Foundation Model
Shinnosuke Hirano,Yuiga Wada,Tsumugi Iida,Komei Sugiura
Main category: cs.CV
TL;DR: 本文提出了一种用于视觉基础模型的新型解释生成方法,通过引入注意力晶格适配器(ALA)和交替周期架构(AEA)机制,在无需手动选择层的同时提升模型可解释性和适应性,并在CUB-200-2011和ImageNet-S数据集上显著优于基线方法。
Details
Motivation: 现有视觉解释方法缺乏对复杂模型的适应性,难以有效生成高质量的可视化解释,且常出现注意力区域过小的问题。 Method: 提出Attention Lattice Adapter(ALA)自动简化层选择过程,结合Alternating Epoch Architect(AEA)每两个周期更新参数以扩大注意力区域,从而增强解释能力和模型可解释性。 Result: 在CUB-200-2011和ImageNet-S上均优于基线方法,评估指标包括平均IoU、插入得分、删除得分和插入-删除得分;在CUB-200-2011上最佳模型平均IoU提升53.2点。 Conclusion: 所提方法具有更强的适应性和解释能力,能有效提升视觉基础模型的可解释性,尤其在细粒度图像识别任务中表现突出。 Abstract: In this study, we consider the problem of generating visual explanations in visual foundation models. Numerous methods have been proposed for this purpose; however, they often cannot be applied to complex models due to their lack of adaptability. To overcome these limitations, we propose a novel explanation generation method in visual foundation models that is aimed at both generating explanations and partially updating model parameters to enhance interpretability. Our approach introduces two novel mechanisms: Attention Lattice Adapter (ALA) and Alternating Epoch Architect (AEA). ALA mechanism simplifies the process by eliminating the need for manual layer selection, thus enhancing the model's adaptability and interpretability. Moreover, the AEA mechanism, which updates ALA's parameters every other epoch, effectively addresses the common issue of overly small attention regions. We evaluated our method on two benchmark datasets, CUB-200-2011 and ImageNet-S. Our results showed that our method outperformed the baseline methods in terms of mean intersection over union (IoU), insertion score, deletion score, and insertion-deletion score on both the CUB-200-2011 and ImageNet-S datasets. Notably, our best model achieved a 53.2-point improvement in mean IoU on the CUB-200-2011 dataset compared with the baselines.[92] DACoN: DINO for Anime Paint Bucket Colorization with Any Number of Reference Images
Kazuma Nagata,Naoshi Kaneko
Main category: cs.CV
TL;DR: 提出DACoN框架,利用基础模型结合CNN进行线稿自动上色,支持多参考图像,提升复杂情况下的上色性能。
Details
Motivation: 现有方法在遮挡、姿态和视角变化下表现不佳,且通常仅支持一到两张参考图,限制了上色的准确性和灵活性。 Method: 融合基础模型的低分辨率语义特征与CNN的高分辨率空间特征,并去除对Multiplex Transformer的依赖,支持任意数量的参考图像。 Result: 在定量和定性评估中均表现出优于先前方法的上色效果,尤其在使用多个参考图像时性能更佳。 Conclusion: DACoN通过结合语义与空间特征并支持多参考图像,在复杂场景下实现了更精细且鲁棒的线稿上色。 Abstract: Automatic colorization of line drawings has been widely studied to reduce the labor cost of hand-drawn anime production. Deep learning approaches, including image/video generation and feature-based correspondence, have improved accuracy but struggle with occlusions, pose variations, and viewpoint changes. To address these challenges, we propose DACoN, a framework that leverages foundation models to capture part-level semantics, even in line drawings. Our method fuses low-resolution semantic features from foundation models with high-resolution spatial features from CNNs for fine-grained yet robust feature extraction. In contrast to previous methods that rely on the Multiplex Transformer and support only one or two reference images, DACoN removes this constraint, allowing any number of references. Quantitative and qualitative evaluations demonstrate the benefits of using multiple reference images, achieving superior colorization performance. Our code and model are available at https://github.com/kzmngt/DACoN.[93] FMGS-Avatar: Mesh-Guided 2D Gaussian Splatting with Foundation Model Priors for 3D Monocular Avatar Reconstruction
Jinlong Fan,Bingyu Hu,Xingguang Li,Yuxiang Yang,Jing Zhang
Main category: cs.CV
TL;DR: 本文提出FMGS-Avatar,一种基于单目视频重建高保真可动画人体avatar的新方法,通过网格引导的2D高斯点阵和基础模型知识蒸馏提升几何细节与外观保真度。
Details
Motivation: 单目视频中几何信息不足且现有3D高斯点阵方法难以保持表面细节,导致高质量人体avatar重建困难。 Method: 提出Mesh-Guided 2D Gaussian Splatting,将2D高斯点绑定到模板网格面,并利用Sapiens等基础模型补充视觉线索,采用选择性梯度隔离的协同训练策略解决多模态蒸馏中的优化冲突。 Result: 实验表明该方法在几何精度、外观保真度和语义丰富性方面优于现有方法,并支持新视角和姿态下的时空一致渲染。 Conclusion: FMGS-Avatar通过增强表示能力和协调的信息蒸馏,显著推进了单目3D人体avatar重建的效果。 Abstract: Reconstructing high-fidelity animatable human avatars from monocular videos remains challenging due to insufficient geometric information in single-view observations. While recent 3D Gaussian Splatting methods have shown promise, they struggle with surface detail preservation due to the free-form nature of 3D Gaussian primitives. To address both the representation limitations and information scarcity, we propose a novel method, \textbf{FMGS-Avatar}, that integrates two key innovations. First, we introduce Mesh-Guided 2D Gaussian Splatting, where 2D Gaussian primitives are attached directly to template mesh faces with constrained position, rotation, and movement, enabling superior surface alignment and geometric detail preservation. Second, we leverage foundation models trained on large-scale datasets, such as Sapiens, to complement the limited visual cues from monocular videos. However, when distilling multi-modal prior knowledge from foundation models, conflicting optimization objectives can emerge as different modalities exhibit distinct parameter sensitivities. We address this through a coordinated training strategy with selective gradient isolation, enabling each loss component to optimize its relevant parameters without interference. Through this combination of enhanced representation and coordinated information distillation, our approach significantly advances 3D monocular human avatar reconstruction. Experimental evaluation demonstrates superior reconstruction quality compared to existing methods, with notable gains in geometric accuracy and appearance fidelity while providing rich semantic information. Additionally, the distilled prior knowledge within a shared canonical space naturally enables spatially and temporally consistent rendering under novel views and poses.[94] Chain-of-Thought Re-ranking for Image Retrieval Tasks
Shangrong Wu,Yanghong Zhou,Yang Chen,Feng Zhang,P. Y. Mok
Main category: cs.CV
TL;DR: 本文提出了一种基于多模态大语言模型的链式思维重排序方法(CoTRR),通过设计列表式排序提示和查询分解提示,使MLLM直接参与图像检索的重排序过程,实现了全局比较、一致推理和可解释决策,在多个图像检索任务上达到SOTA性能。
Details
Motivation: 现有方法仅将多模态大语言模型(MLLM)用于评估,未充分利用其多模态推理能力,导致图像检索性能受限。 Method: 提出链式思维重排序(CoTRR)方法,设计列表式排序提示和图像评估提示,使MLLM直接参与候选图像的重排序;引入查询分解提示,将原始查询分解为多个语义成分以支持细粒度分析。 Result: 在五个数据集上的实验表明,CoTRR在文本到图像检索(TIR)、组合图像检索(CIR)和基于对话的图像检索(Chat-IR)三个任务上均达到最先进的性能。 Conclusion: CoTRR有效利用了MLLM的多模态推理能力,通过结构化提示实现更准确、可解释的图像检索,推动了MLLM在检索系统中的直接应用。 Abstract: Image retrieval remains a fundamental yet challenging problem in computer vision. While recent advances in Multimodal Large Language Models (MLLMs) have demonstrated strong reasoning capabilities, existing methods typically employ them only for evaluation, without involving them directly in the ranking process. As a result, their rich multimodal reasoning abilities remain underutilized, leading to suboptimal performance. In this paper, we propose a novel Chain-of-Thought Re-Ranking (CoTRR) method to address this issue. Specifically, we design a listwise ranking prompt that enables MLLM to directly participate in re-ranking candidate images. This ranking process is grounded in an image evaluation prompt, which assesses how well each candidate aligns with users query. By allowing MLLM to perform listwise reasoning, our method supports global comparison, consistent reasoning, and interpretable decision-making - all of which are essential for accurate image retrieval. To enable structured and fine-grained analysis, we further introduce a query deconstruction prompt, which breaks down the original query into multiple semantic components. Extensive experiments on five datasets demonstrate the effectiveness of our CoTRR method, which achieves state-of-the-art performance across three image retrieval tasks, including text-to-image retrieval (TIR), composed image retrieval (CIR) and chat-based image retrieval (Chat-IR). Our code is available at https://github.com/freshfish15/CoTRR .[95] Data Augmentation via Latent Diffusion Models for Detecting Smell-Related Objects in Historical Artworks
Ahmed Sheta,Mathias Zinnen,Aline Sindel,Andreas Maier,Vincent Christlein
Main category: cs.CV
TL;DR: 本研究探索了利用合成数据生成来改善历史艺术作品中气味相关物体检测的准确性,特别是在标注稀缺且获取成本高的小众应用中,扩散模型的大规模预训练显示出巨大潜力。
Details
Motivation: 由于历史艺术作品中的风格差异以及气味识别需要极其详细的标注类别,导致标注稀疏和类别极度不平衡,使得气味相关物体的检测具有挑战性。 Method: 采用基于扩散模型的数据增强策略,将合成数据引入模型训练过程,以提升检测性能。 Result: 实验表明,加入合成数据后,模型在气味相关物体检测上的性能得到提升,且该方法在少量数据下依然有效,具备良好的可扩展性。 Conclusion: 利用扩散模型生成合成数据是一种有前景的方法,能够有效缓解标注不足问题,显著提高在稀缺数据场景下的检测精度。 Abstract: Finding smell references in historic artworks is a challenging problem. Beyond artwork-specific challenges such as stylistic variations, their recognition demands exceptionally detailed annotation classes, resulting in annotation sparsity and extreme class imbalance. In this work, we explore the potential of synthetic data generation to alleviate these issues and enable accurate detection of smell-related objects. We evaluate several diffusion-based augmentation strategies and demonstrate that incorporating synthetic data into model training can improve detection performance. Our findings suggest that leveraging the large-scale pretraining of diffusion models offers a promising approach for improving detection accuracy, particularly in niche applications where annotations are scarce and costly to obtain. Furthermore, the proposed approach proves to be effective even with relatively small amounts of data, and scaling it up provides high potential for further enhancements.[96] Frame Sampling Strategies Matter: A Benchmark for small vision language models
Marija Brkic,Anas Filali Razzouki,Yannis Tevissen,Khalil Guetari,Mounim A. El Yacoubi
Main category: cs.CV
TL;DR: 提出首个针对小型视觉语言模型(SVLMs)的帧精确视频问答基准,揭示现有评测中的帧采样偏差,并倡导标准化的帧采样策略。
Details
Motivation: 现有视频基准因使用不同的帧采样策略而存在显著偏差,导致模型性能比较不公。 Method: 构建一个控制帧采样策略的帧精确基准,对最先进的小型视觉语言模型进行评估。 Result: 验证了帧采样偏差的存在,并发现不同帧采样技术下SVLMs表现出数据和任务特定的行为。 Conclusion: 应采用标准化且针对数据集定制的帧采样策略,作者通过开源代码提供了可复现、无偏的评测协议。 Abstract: Comparing vision language models on videos is particularly complex, as the performances is jointly determined by the model's visual representation capacity and the frame-sampling strategy used to construct the input. Current video benchmarks are suspected to suffer from substantial frame-sampling bias, as models are evaluated with different frame selection strategies. In this work, we propose the first frame-accurate benchmark of state-of-the-art small VLMs for video question-answering, evaluated under controlled frame-sampling strategies. Our results confirm the suspected bias and highlight both data-specific and task-specific behaviors of SVLMs under different frame-sampling techniques. By open-sourcing our benchmarking code, we provide the community with a reproducible and unbiased protocol for evaluating video VLMs and emphasize the need for standardized frame-sampling strategies tailored to each benchmarking dataset in future research.[97] A Real-Time Multi-Model Parametric Representation of Point Clouds
Yuan Gao,Wei Dong
Main category: cs.CV
TL;DR: 提出了一种实时多模型参数化表示方法,用于点云的表面检测与拟合,结合高斯混合模型聚类、平面和B样条曲面拟合,并采用2D体素边界描述,在效率和精度上均优于现有方法。
Details
Motivation: 现有方法在点云参数化表示中存在计算成本高或自由度低导致精度不足的问题,难以兼顾实时性与高精度。 Method: 首先使用高斯混合模型对点云进行聚类;然后识别并合并平坦聚类为平面,用2D体素边界法拟合和界定;对有曲率的聚类则采用B样条曲面拟合,同样使用2D体素边界描述。 Result: 在多个公开数据集上验证,表面检测效率比最先进方法提升3.78倍,精度较高斯混合模型提高2倍,可在低功耗机载计算机上以36.4 fps运行。 Conclusion: 该方法实现了高效、高精度的实时点云表面拟合,适用于内存受限和多机器人协作等应用场景。 Abstract: In recent years, parametric representations of point clouds have been widely applied in tasks such as memory-efficient mapping and multi-robot collaboration. Highly adaptive models, like spline surfaces or quadrics, are computationally expensive in detection or fitting. In contrast, real-time methods, such as Gaussian mixture models or planes, have low degrees of freedom, making high accuracy with few primitives difficult. To tackle this problem, a multi-model parametric representation with real-time surface detection and fitting is proposed. Specifically, the Gaussian mixture model is first employed to segment the point cloud into multiple clusters. Then, flat clusters are selected and merged into planes or curved surfaces. Planes can be easily fitted and delimited by a 2D voxel-based boundary description method. Surfaces with curvature are fitted by B-spline surfaces and the same boundary description method is employed. Through evaluations on multiple public datasets, the proposed surface detection exhibits greater robustness than the state-of-the-art approach, with 3.78 times improvement in efficiency. Meanwhile, this representation achieves a 2-fold gain in accuracy over Gaussian mixture models, operating at 36.4 fps on a low-power onboard computer.[98] Dataset Distillation for Super-Resolution without Class Labels and Pre-trained Models
Sunwoo Cho,Yejin Jung,Nam Ik Cho,Jae Woong Soh
Main category: cs.CV
TL;DR: 提出了一种无需类别标签或预训练超分辨率模型的数据蒸馏新方法,通过提取高梯度图像块并基于CLIP特征分类,微调扩散模型以生成蒸馏训练图像,在极少量数据下实现了最先进的性能。
Details
Motivation: 现有基于GAN反演的超分辨率数据蒸馏方法依赖预训练模型和类别信息,限制了泛化能力,因此需要一种更通用、高效的方法。 Method: 首先提取高梯度图像块并基于CLIP特征对图像分类,然后在选定图像块上微调扩散模型以学习其分布并合成蒸馏训练图像,从而用于超分辨率模型训练。 Result: 该方法在仅使用原始数据集0.68%的情况下,性能下降仅为0.3 dB,扩散模型微调耗时4小时,SR模型训练1小时完成,显著优于全数据集11小时的训练时间。 Conclusion: 所提方法在减少数据需求和计算时间的同时保持高性能,推动了数据高效的图像超分辨率技术发展。 Abstract: Training deep neural networks has become increasingly demanding, requiring large datasets and significant computational resources, especially as model complexity advances. Data distillation methods, which aim to improve data efficiency, have emerged as promising solutions to this challenge. In the field of single image super-resolution (SISR), the reliance on large training datasets highlights the importance of these techniques. Recently, a generative adversarial network (GAN) inversion-based data distillation framework for SR was proposed, showing potential for better data utilization. However, the current method depends heavily on pre-trained SR networks and class-specific information, limiting its generalizability and applicability. To address these issues, we introduce a new data distillation approach for image SR that does not need class labels or pre-trained SR models. In particular, we first extract high-gradient patches and categorize images based on CLIP features, then fine-tune a diffusion model on the selected patches to learn their distribution and synthesize distilled training images. Experimental results show that our method achieves state-of-the-art performance while using significantly less training data and requiring less computational time. Specifically, when we train a baseline Transformer model for SR with only 0.68\% of the original dataset, the performance drop is just 0.3 dB. In this case, diffusion model fine-tuning takes 4 hours, and SR model training completes within 1 hour, much shorter than the 11-hour training time with the full dataset.[99] Radiology Report Conditional 3D CT Generation with Multi Encoder Latent diffusion Model
Sina Amirrajab,Zohaib Salahuddin,Sheng Kuang,Henry C. Woodruff,Philippe Lambin
Main category: cs.CV
TL;DR: Report2CT 是一种基于完整放射学报告的文本到3D CT图像生成的潜在扩散模型,通过多文本编码器融合临床语义信息,实现了高质量、解剖一致且与文本高度对齐的合成CT生成,在多项指标上达到SOTA。
Details
Motivation: 现有文本到医学图像生成方法多依赖简化的文本提示,忽略了放射学报告中丰富的语义信息,导致生成图像的临床保真度和文本-图像对齐效果较差,尤其在3D CT生成方面仍存在明显局限。 Method: 提出 Report2CT,一种基于自由文本放射学报告(包含发现和结论部分)生成3D胸部CT体积的潜在扩散框架;采用三个预训练医学文本编码器(BiomedVLP CXR BERT、MedEmbed、ClinicalBERT)进行多编码器条件输入,并结合体素间距信息共同引导3D潜在扩散模型训练,使用来自CT-RATE数据集的20000个体积进行训练,并采用无分类器引导提升对齐性能。 Result: Report2CT 生成的CT图像具有良好的视觉质量和解剖一致性;在FID和基于CLIP的语义对齐指标上优于现有方法(如GenerateCT),多编码器设计提升了CLIP得分,表明其更好地保留了临床细节;在MICCAI 2025 VLM3D挑战赛的文本条件CT生成任务中排名第一。 Conclusion: 通过利用完整的放射学报告和多编码器文本条件机制,Report2CT 显著提升了3D CT合成的临床保真度和生成质量,为构建高真实感、语义对齐的医学图像生成系统提供了有效方案。 Abstract: Text to image latent diffusion models have recently advanced medical image synthesis, but applications to 3D CT generation remain limited. Existing approaches rely on simplified prompts, neglecting the rich semantic detail in full radiology reports, which reduces text image alignment and clinical fidelity. We propose Report2CT, a radiology report conditional latent diffusion framework for synthesizing 3D chest CT volumes directly from free text radiology reports, incorporating both findings and impression sections using multiple text encoder. Report2CT integrates three pretrained medical text encoders (BiomedVLP CXR BERT, MedEmbed, and ClinicalBERT) to capture nuanced clinical context. Radiology reports and voxel spacing information condition a 3D latent diffusion model trained on 20000 CT volumes from the CT RATE dataset. Model performance was evaluated using Frechet Inception Distance (FID) for real synthetic distributional similarity and CLIP based metrics for semantic alignment, with additional qualitative and quantitative comparisons against GenerateCT model. Report2CT generated anatomically consistent CT volumes with excellent visual quality and text image alignment. Multi encoder conditioning improved CLIP scores, indicating stronger preservation of fine grained clinical details in the free text radiology reports. Classifier free guidance further enhanced alignment with only a minor trade off in FID. We ranked first in the VLM3D Challenge at MICCAI 2025 on Text Conditional CT Generation and achieved state of the art performance across all evaluation metrics. By leveraging complete radiology reports and multi encoder text conditioning, Report2CT advances 3D CT synthesis, producing clinically faithful and high quality synthetic data.[100] Fracture interactive geodesic active contours for bone segmentation
Liheng Wang,Licheng Zhang,Hailin Xu,Jingxin Zhao,Xiuyun Su,Jiantao Li,Miutian Tang,Weilu Gao,Chong Chen
Main category: cs.CV
TL;DR: 提出一种针对骨分割的骨折交互式测地线主动轮廓算法,结合强度和梯度范数构建新的边缘检测函数,并引入距离信息作为自适应步长,有效解决边缘阻塞、泄漏及骨折问题,提升分割精度与稳定性。
Details
Motivation: 传统测地线主动轮廓模型在骨分割中因特征提取不加区分,难以应对边缘阻塞、泄漏和骨折等问题,亟需更鲁棒的方法。 Method: 基于骨科知识设计融合强度与梯度范数的边缘检测函数,引导轮廓避开软组织;引入含骨折提示的距离信息作为自适应步长,稳定轮廓演化并使其在骨边缘和骨折处准确停止。 Result: 在骨盆和踝关节分割实验中表现出色,有效解决了边缘问题,对骨折区域分割更准确,整体性能稳定一致。 Conclusion: 所提方法能更好地捕捉骨特征,鲁棒应对骨折和软组织干扰,具有在其他骨结构中广泛应用的潜力,并为融合领域知识与深度神经网络提供思路。 Abstract: For bone segmentation, the classical geodesic active contour model is usually limited by its indiscriminate feature extraction, and then struggles to handle the phenomena of edge obstruction, edge leakage and bone fracture. Thus, we propose a fracture interactive geodesic active contour algorithm tailored for bone segmentation, which can better capture bone features and perform robustly to the presence of bone fractures and soft tissues. Inspired by orthopedic knowledge, we construct a novel edge-detector function that combines the intensity and gradient norm, which guides the contour towards bone edges without being obstructed by other soft tissues and therefore reduces mis-segmentation. Furthermore, distance information, where fracture prompts can be embedded, is introduced into the contour evolution as an adaptive step size to stabilize the evolution and help the contour stop at bone edges and fractures. This embedding provides a way to interact with bone fractures and improves the accuracy in the fracture regions. Experiments in pelvic and ankle segmentation demonstrate the effectiveness on addressing the aforementioned problems and show an accurate, stable and consistent performance, indicating a broader application in other bone anatomies. Our algorithm also provides insights into combining the domain knowledge and deep neural networks.[101] Template-Based Cortical Surface Reconstruction with Minimal Energy Deformation
Patrick Madlindl,Fabian Bongratz,Christian Wachinger
Main category: cs.CV
TL;DR: 提出了一种最小能量变形(MED)损失函数,用于优化基于学习的皮层表面重建中的变形轨迹,提高了训练一致性和可重复性,同时保持了重建精度和拓扑正确性。
Details
Motivation: 确保学习到的形变在变形能量上最优,并在不同训练运行中保持一致性,解决现有方法在此方面的不足。 Method: 设计了一种最小能量变形(MED)损失函数,作为形变轨迹的正则化项,并将其集成到V2C-Flow模型中,结合广泛使用的Chamfer距离进行优化。 Result: 在多个评估指标下,该方法显著提升了训练的一致性和可重复性,同时保持了高精度的表面重建和拓扑正确性。 Conclusion: 所提出的MED损失有效改善了基于学习的皮层表面重建的稳定性与可靠性,为后续神经影像分析提供了更鲁棒的工具。 Abstract: Cortical surface reconstruction (CSR) from magnetic resonance imaging (MRI) is fundamental to neuroimage analysis, enabling morphological studies of the cerebral cortex and functional brain mapping. Recent advances in learning-based CSR have dramatically accelerated processing, allowing for reconstructions through the deformation of anatomical templates within seconds. However, ensuring the learned deformations are optimal in terms of deformation energy and consistent across training runs remains a particular challenge. In this work, we design a Minimal Energy Deformation (MED) loss, acting as a regularizer on the deformation trajectories and complementing the widely used Chamfer distance in CSR. We incorporate it into the recent V2C-Flow model and demonstrate considerable improvements in previously neglected training consistency and reproducibility without harming reconstruction accuracy and topological correctness.[102] ProtoMedX: Towards Explainable Multi-Modal Prototype Learning for Bone Health Classification
Alvaro Lopez Pellicer,Andre Mariucci,Plamen Angelov,Marwan Bukhari,Jemma G. Kerns
Main category: cs.CV
TL;DR: 提出了一种名为ProtoMedX的多模态模型,结合DEXA扫描和患者记录,用于骨健康分类,具有可解释性设计,并在真实临床数据上实现了优于现有方法的性能。
Details
Motivation: 现有AI方法在骨健康研究中多关注预测准确性,忽视可解释性,且通常仅使用影像数据,缺乏对临床决策支持的透明度。 Method: 提出ProtoMedX,一种基于原型的多模态深度学习模型,融合腰椎DEXA图像和患者病历数据,通过原型机制实现模型决策的内在可解释性。 Result: 在4,160名NHS真实患者数据上,ProtoMedX在纯视觉任务中达到87.58%准确率,多模态版本达89.8%,均优于现有方法,并提供临床医生可直观理解的可视化解释。 Conclusion: ProtoMedX在骨健康分类中实现了高精度与内在可解释性的结合,有助于临床应用中的信任建立,并符合欧盟AI法案对透明AI系统的要求。 Abstract: Bone health studies are crucial in medical practice for the early detection and treatment of Osteopenia and Osteoporosis. Clinicians usually make a diagnosis based on densitometry (DEXA scans) and patient history. The applications of AI in this field are ongoing research. Most successful methods rely on deep learning models that use vision alone (DEXA/X-ray imagery) and focus on prediction accuracy, while explainability is often disregarded and left to post hoc assessments of input contributions. We propose ProtoMedX, a multi-modal model that uses both DEXA scans of the lumbar spine and patient records. ProtoMedX's prototype-based architecture is explainable by design, which is crucial for medical applications, especially in the context of the upcoming EU AI Act, as it allows explicit analysis of model decisions, including incorrect ones. ProtoMedX demonstrates state-of-the-art performance in bone health classification while also providing explanations that can be visually understood by clinicians. Using a dataset of 4,160 real NHS patients, the proposed ProtoMedX achieves 87.58% accuracy in vision-only tasks and 89.8% in its multi-modal variant, both surpassing existing published methods.[103] MapAnything: Mapping Urban Assets using Single Street-View Images
Miriam Louise Carnot,Jonas Kunze,Erik Fastermann,Eric Peukert,André Ludwig,Bogdan Franczyk
Main category: cs.CV
TL;DR: 本文提出了一种名为MapAnything的模块,利用单张图像和度量深度估计模型自动确定城市物体的地理坐标,通过与LiDAR点云对比验证了其在交通标志和道路损坏等场景中的有效性。
Details
Motivation: 随着城市数字化的发展,城市管理和维护需要大量及时更新的地理数据,但传统人工采集方式耗时费力,因此需要一种自动化、高效的方法来获取城市物体和事件的地理坐标。 Method: MapAnything模块结合先进的度量深度估计模型,利用物体到相机的距离、几何原理和相机参数,从单张图像中计算物体的地理坐标,并通过与LiDAR点云数据对比评估其在不同距离区间和语义区域(如道路、植被)下的性能。 Result: 实验结果表明,该模块在城市环境中能较准确地估计物体距离和地理坐标,尤其在交通标志和道路损坏等实际应用场景中表现良好,且性能受距离和语义区域影响。 Conclusion: MapAnything为城市物体和事件的自动化地图绘制提供了一种可行方案,减少了对人工数据采集的依赖,有助于提升城市管理的效率和数据更新速度。 Abstract: To maintain an overview of urban conditions, city administrations manage databases of objects like traffic signs and trees, complete with their geocoordinates. Incidents such as graffiti or road damage are also relevant. As digitization increases, so does the need for more data and up-to-date databases, requiring significant manual effort. This paper introduces MapAnything, a module that automatically determines the geocoordinates of objects using individual images. Utilizing advanced Metric Depth Estimation models, MapAnything calculates geocoordinates based on the object's distance from the camera, geometric principles, and camera specifications. We detail and validate the module, providing recommendations for automating urban object and incident mapping. Our evaluation measures the accuracy of estimated distances against LiDAR point clouds in urban environments, analyzing performance across distance intervals and semantic areas like roads and vegetation. The module's effectiveness is demonstrated through practical use cases involving traffic signs and road damage.[104] Not All Degradations Are Equal: A Targeted Feature Denoising Framework for Generalizable Image Super-Resolution
Hongjun Wang,Jiyuan Chen,Zhengwei Yin,Xuan Song,Yinqiang Zheng
Main category: cs.CV
TL;DR: 提出了一种针对噪声过拟合问题的通用图像超分辨率特征去噪框架,包含噪声检测与去噪模块,可无缝集成到现有模型中,在多个基准和数据集上优于之前的正则化方法。
Details
Motivation: 发现现有模型在通用图像超分辨率中主要过拟合于噪声而非其他退化类型,而此前工作假设模型对所有退化类型均过拟合,因此需要针对性地解决噪声过拟合问题以提升泛化能力。 Method: 提出一种目标导向的特征去噪框架,包含噪声检测和去噪模块,该框架无需修改现有超分辨率模型结构即可集成,专注于抑制噪声相关的特征过拟合。 Result: 在五个传统基准和数据集(包括合成与真实场景)上验证了所提框架的有效性,性能优于先前基于正则化的方法。 Conclusion: 所提出的针对性特征去噪框架有效缓解了模型对噪声的过拟合问题,提升了超分辨率模型在未知退化下的泛化能力,且具有良好的通用性和即插即用特性。 Abstract: Generalizable Image Super-Resolution aims to enhance model generalization capabilities under unknown degradations. To achieve this goal, the models are expected to focus only on image content-related features instead of overfitting degradations. Recently, numerous approaches such as Dropout and Feature Alignment have been proposed to suppress models' natural tendency to overfit degradations and yield promising results. Nevertheless, these works have assumed that models overfit to all degradation types (e.g., blur, noise, JPEG), while through careful investigations in this paper, we discover that models predominantly overfit to noise, largely attributable to its distinct degradation pattern compared to other degradation types. In this paper, we propose a targeted feature denoising framework, comprising noise detection and denoising modules. Our approach presents a general solution that can be seamlessly integrated with existing super-resolution models without requiring architectural modifications. Our framework demonstrates superior performance compared to previous regularization-based methods across five traditional benchmarks and datasets, encompassing both synthetic and real-world scenarios.[105] [Re] Improving Interpretation Faithfulness for Vision Transformers
Izabela Kurek,Wojciech Trejter,Stipe Frkovic,Andro Erdelez
Main category: cs.CV
TL;DR: 本研究复现了FViT(Faithful Vision Transformers)的工作,并验证了其在分割和分类任务中通过Diffusion Denoised Smoothing(DDS)提升解释方法鲁棒性的主张,结果基本支持原结论,同时评估了DDS的计算成本与环境影响。
Details
Motivation: 验证FViT中DDS提升解释鲁棒性的有效性,并扩展至其他解释方法,同时评估其计算与环境代价。 Method: 复现FViT及多种Vision Transformer解释方法,测试DDS在攻击和扰动下的鲁棒性,涵盖分割与分类任务,并分析计算资源消耗。 Result: 结果总体支持原论文结论,DDS能提升解释方法的鲁棒性,但在某些情况下存在细微差异;同时发现DDS带来显著的计算开销。 Conclusion: DDS确实有助于提升Vision Transformer解释方法的鲁棒性,但其高昂的计算成本需在实际应用中权衡。 Abstract: This work aims to reproduce the results of Faithful Vision Transformers (FViTs) proposed by arXiv:2311.17983 alongside interpretability methods for Vision Transformers from arXiv:2012.09838 and Xu (2022) et al. We investigate claims made by arXiv:2311.17983, namely that the usage of Diffusion Denoised Smoothing (DDS) improves interpretability robustness to (1) attacks in a segmentation task and (2) perturbation and attacks in a classification task. We also extend the original study by investigating the authors' claims that adding DDS to any interpretability method can improve its robustness under attack. This is tested on baseline methods and the recently proposed Attribution Rollout method. In addition, we measure the computational costs and environmental impact of obtaining an FViT through DDS. Our results broadly agree with the original study's findings, although minor discrepancies were found and discussed.[106] MARIC: Multi-Agent Reasoning for Image Classification
Wonduk Seo,Minhyeong Yu,Hyunjin An,Seunghyun Lee
Main category: cs.CV
TL;DR: 本文提出了一种基于多智能体的图像分类框架MARIC,通过将图像分类重构为协作推理过程,克服了传统模型和现有视觉语言模型的局限性。
Details
Motivation: 传统图像分类依赖大规模标注数据和精细调参,而现有视觉语言模型受限于单通路表征,难以捕捉图像的多方面信息。因此需要一种更高效、更具解释性的方法。 Method: MARIC框架包含四个智能体:一个提纲智能体分析图像全局主题并生成提示,三个不同视角的方面智能体提取细粒度描述,最后由推理智能体通过整合反思步骤合成最终分类结果。 Result: 在4个不同的图像分类基准数据集上的实验表明,MARIC显著优于基线方法。 Conclusion: 多智能体协同与反思合成机制能有效提升图像分类的性能与可解释性,验证了多智能体视觉推理的潜力。 Abstract: Image classification has traditionally relied on parameter-intensive model training, requiring large-scale annotated datasets and extensive fine tuning to achieve competitive performance. While recent vision language models (VLMs) alleviate some of these constraints, they remain limited by their reliance on single pass representations, often failing to capture complementary aspects of visual content. In this paper, we introduce Multi Agent based Reasoning for Image Classification (MARIC), a multi agent framework that reformulates image classification as a collaborative reasoning process. MARIC first utilizes an Outliner Agent to analyze the global theme of the image and generate targeted prompts. Based on these prompts, three Aspect Agents extract fine grained descriptions along distinct visual dimensions. Finally, a Reasoning Agent synthesizes these complementary outputs through integrated reflection step, producing a unified representation for classification. By explicitly decomposing the task into multiple perspectives and encouraging reflective synthesis, MARIC mitigates the shortcomings of both parameter-heavy training and monolithic VLM reasoning. Experiments on 4 diverse image classification benchmark datasets demonstrate that MARIC significantly outperforms baselines, highlighting the effectiveness of multi-agent visual reasoning for robust and interpretable image classification.[107] Controllable Localized Face Anonymization Via Diffusion Inpainting
Ali Salar,Qing Liu,Guoying Zhao
Main category: cs.CV
TL;DR: 提出一种基于潜在扩散模型的统一框架,通过自适应属性引导模块实现对人脸图像的可控匿名化,同时保持图像在下游视觉任务中的可用性。
Details
Motivation: 随着人像图像在计算机视觉中的广泛应用,保护个人身份变得尤为重要,同时需确保匿名化后的图像仍可用于后续任务。 Method: 利用潜在扩散模型的修复能力,在反向去噪过程中引入自适应属性引导模块,通过梯度校正使生成图像的面部属性与目标图像对齐,并支持局部区域的匿名化。 Result: 在CelebA-HQ和FFHQ数据集上的实验表明,该方法优于现有最先进方法,且无需额外训练模型。 Conclusion: 所提出的框架实现了对人脸匿名化过程的完全控制,在保护隐私的同时保持了图像的实用性,具有良好的应用前景。 Abstract: The growing use of portrait images in computer vision highlights the need to protect personal identities. At the same time, anonymized images must remain useful for downstream computer vision tasks. In this work, we propose a unified framework that leverages the inpainting ability of latent diffusion models to generate realistic anonymized images. Unlike prior approaches, we have complete control over the anonymization process by designing an adaptive attribute-guidance module that applies gradient correction during the reverse denoising process, aligning the facial attributes of the generated image with those of the synthesized target image. Our framework also supports localized anonymization, allowing users to specify which facial regions are left unchanged. Extensive experiments conducted on the public CelebA-HQ and FFHQ datasets show that our method outperforms state-of-the-art approaches while requiring no additional model training. The source code is available on our page.[108] Temporal Representation Learning of Phenotype Trajectories for pCR Prediction in Breast Cancer
Ivana Janíčková,Yen Y. Tan,Thomas H. Helbich,Konstantin Miloserdov,Zsuzsanna Bago-Horvath,Ulrike Heber,Georg Langs
Main category: cs.CV
TL;DR: 提出一种基于MRI数据的早期治疗反应动态表征方法,用于预测乳腺癌患者新辅助化疗的病理完全缓解(pCR)。
Details
Motivation: 准确预测个体对治疗的反应具有挑战性,因为疾病进展和治疗反应在不同患者间差异显著。需要有效模型支持治疗决策。 Method: 利用纵向MRI数据构建潜在空间中的治疗反应轨迹,采用多任务模型捕捉图像表观特征、时间连续性,并处理非响应者群体的高度异质性,进而预测pCR。 Result: 在ISPY-2数据集上实验显示,仅使用治疗前数据(T0)时平衡准确率为0.761,加入早期反应数据(T0+T1)后提升至0.811,使用四个时间点(T0→T3)时达到0.861。 Conclusion: 该方法通过建模MRI数据的潜在轨迹,能有效预测乳腺癌患者对NACT的个体化治疗反应,且随着更多时间点数据的引入,预测性能逐步提高。 Abstract: Effective therapy decisions require models that predict the individual response to treatment. This is challenging since the progression of disease and response to treatment vary substantially across patients. Here, we propose to learn a representation of the early dynamics of treatment response from imaging data to predict pathological complete response (pCR) in breast cancer patients undergoing neoadjuvant chemotherapy (NACT). The longitudinal change in magnetic resonance imaging (MRI) data of the breast forms trajectories in the latent space, serving as basis for prediction of successful response. The multi-task model represents appearance, fosters temporal continuity and accounts for the comparably high heterogeneity in the non-responder cohort.In experiments on the publicly available ISPY-2 dataset, a linear classifier in the latent trajectory space achieves a balanced accuracy of 0.761 using only pre-treatment data (T0), 0.811 using early response (T0 + T1), and 0.861 using four imaging time points (T0 -> T3). The code will be made available upon paper acceptance.[109] NeRF-based Visualization of 3D Cues Supporting Data-Driven Spacecraft Pose Estimation
Antoine Legrand,Renaud Detry,Christophe De Vleeschouwer
Main category: cs.CV
TL;DR: 本文提出了一种可视化方法,用于揭示基于数据驱动的航天器位姿估计网络所依赖的3D视觉线索,通过训练基于NeRF的图像生成器并利用位姿估计网络反向传播的梯度,有效恢复了关键的3D特征。
Details
Motivation: 现有的数据驱动航天器位姿估计方法因缺乏对其决策过程的理解而难以在实际任务中应用。 Method: 利用位姿估计网络反向传播的梯度来训练一个基于NeRF的图像生成器,使其渲染出位姿估计网络所依赖的主要3D特征。 Result: 实验证明该方法能够成功恢复与位姿估计相关的关键3D视觉线索,并揭示了监督信号与网络对目标航天器隐式表征之间的关系。 Conclusion: 该可视化方法有助于理解数据驱动位姿估计模型的决策机制,提升了其可解释性,有望促进其在轨应用。 Abstract: On-orbit operations require the estimation of the relative 6D pose, i.e., position and orientation, between a chaser spacecraft and its target. While data-driven spacecraft pose estimation methods have been developed, their adoption in real missions is hampered by the lack of understanding of their decision process. This paper presents a method to visualize the 3D visual cues on which a given pose estimator relies. For this purpose, we train a NeRF-based image generator using the gradients back-propagated through the pose estimation network. This enforces the generator to render the main 3D features exploited by the spacecraft pose estimation network. Experiments demonstrate that our method recovers the relevant 3D cues. Furthermore, they offer additional insights on the relationship between the pose estimation network supervision and its implicit representation of the target spacecraft.[110] Pseudo-Label Enhanced Cascaded Framework: 2nd Technical Report for LSVOS 2025 VOS Track
An Yan,Leilei Cao,Feng Lu,Ran Hong,Youhai Jiang,Fengjie Zhu
Main category: cs.CV
TL;DR: 本文提出了一种基于SAM2框架的复杂视频对象分割方法,通过伪标签训练和级联多模型推理,在MOSE测试集上取得了86.16的J&F分数,位列LSVOS 2025 VOS赛道第二名。
Details
Motivation: 针对复杂视频中目标小、相似、遮挡频繁、运动快速和交互复杂导致分割困难的问题,需要提升现有方法在长时复杂场景下的准确性和鲁棒性。 Method: 采用伪标签策略进行训练:利用SAM2Long框架生成MOSE测试集的伪标签,并与现有数据结合进行再训练;在推理阶段,使用SAM2Long和开源SeC模型并行预测,通过级联决策机制融合两者输出,结合SAM2Long的时间稳定性和SeC的概念级鲁棒性。 Result: 在MOSE测试集上达到0.8616的J&F分数,比SAM2Long基线高出1.4个百分点。 Conclusion: 所提方法通过伪标签训练和级联多模型融合策略,显著提升了复杂长视频序列中的分割性能,具有良好的鲁棒性和准确性。 Abstract: Complex Video Object Segmentation (VOS) presents significant challenges in accurately segmenting objects across frames, especially in the presence of small and similar targets, frequent occlusions, rapid motion, and complex interactions. In this report, we present our solution for the LSVOS 2025 VOS Track based on the SAM2 framework. We adopt a pseudo-labeling strategy during training: a trained SAM2 checkpoint is deployed within the SAM2Long framework to generate pseudo labels for the MOSE test set, which are then combined with existing data for further training. For inference, the SAM2Long framework is employed to obtain our primary segmentation results, while an open-source SeC model runs in parallel to produce complementary predictions. A cascaded decision mechanism dynamically integrates outputs from both models, exploiting the temporal stability of SAM2Long and the concept-level robustness of SeC. Benefiting from pseudo-label training and cascaded multi-model inference, our approach achieves a J\&F score of 0.8616 on the MOSE test set -- +1.4 points over our SAM2Long baseline -- securing the 2nd place in the LSVOS 2025 VOS Track, and demonstrating strong robustness and accuracy in long, complex video segmentation scenarios.[111] Trade-offs in Cross-Domain Generalization of Foundation Model Fine-Tuned for Biometric Applications
Tahar Chettaoui,Naser Damer,Fadi Boutros
Main category: cs.CV
TL;DR: 本研究探讨了CLIP等基础模型在细调用于人脸识别(FR)、人脸变形攻击检测(MAD)和呈现攻击检测(PAD)等特定生物识别任务时的过专业化问题,发现尽管性能在特定任务上提升,但跨域泛化能力显著下降,尤其是复杂任务如FR;同时指出任务复杂度和分类头设计与灾难性遗忘程度相关,而较大模型架构能更好保持原有泛化能力。
Details
Motivation: 基础模型在细调后可能失去跨域泛化能力,本文旨在量化这一过专业化现象及其对不同生物识别任务的影响。 Method: 对三种针对FR、MAD和PAD细调的CLIP模型进行系统评估,使用14个通用视觉数据集在零样本和线性探测协议下测试,并结合标准FR、MAD、PAD基准进行分析。 Result: 细调模型在特定任务上表现提升,如FRoundation (ViT-L) 在IJB-C上比之前方法最高提升58.52%,但在ImageNetV2等通用数据集上性能大幅下降(从69.84%降至51.63%);任务越复杂(如多类FR vs 二类MAD/PAD),灾难性遗忘越严重;大模型比小模型更能保持通用性。 Conclusion: 细调基础模型会导致过专业化和通用性损失,任务复杂度和模型容量是影响该现象的关键因素,较大的模型有助于缓解此问题。 Abstract: Foundation models such as CLIP have demonstrated exceptional zero- and few-shot transfer capabilities across diverse vision tasks. However, when fine-tuned for highly specialized biometric tasks, face recognition (FR), morphing attack detection (MAD), and presentation attack detection (PAD), these models may suffer from over-specialization. Thus, they may lose one of their foundational strengths, cross-domain generalization. In this work, we systematically quantify these trade-offs by evaluating three instances of CLIP fine-tuned for FR, MAD, and PAD. We evaluate each adapted model as well as the original CLIP baseline on 14 general vision datasets under zero-shot and linear-probe protocols, alongside common FR, MAD, and PAD benchmarks. Our results indicate that fine-tuned models suffer from over-specialization, especially when fine-tuned for complex tasks of FR. Also, our results pointed out that task complexity and classification head design, multi-class (FR) vs. binary (MAD and PAD), correlate with the degree of catastrophic forgetting. The FRoundation model with the ViT-L backbone outperforms other approaches on the large-scale FR benchmark IJB-C, achieving an improvement of up to 58.52%. However, it experiences a substantial performance drop on ImageNetV2, reaching only 51.63% compared to 69.84% achieved by the baseline CLIP model. Moreover, the larger CLIP architecture consistently preserves more of the model's original generalization ability than the smaller variant, indicating that increased model capacity may help mitigate over-specialization.[112] GenKOL: Modular Generative AI Framework For Scalable Virtual KOL Generation
Tan-Hiep To,Duy-Khang Nguyen,Tam V. Nguyen,Minh-Triet Tran,Trung-Nghia Le
Main category: cs.CV
TL;DR: 本文提出GenKOL,一个基于生成式AI的交互系统,用于高效生成高质量虚拟关键意见领袖(KOL)图像,以降低营销成本并加速内容生产。
Details
Motivation: 传统与人类KOL合作存在高成本和后勤挑战,亟需一种更灵活、经济的替代方案来满足现代营销需求。 Method: 开发了一个模块化、可扩展的交互系统GenKOL,集成服装生成、妆容迁移、背景合成和头发编辑等多种AI能力,支持本地或云端部署。 Result: GenKOL能够通过直观界面动态生成促销视觉内容,显著简化品牌内容制作流程,提升营销效率。 Conclusion: GenKOL为市场营销提供了一种低成本、高适应性的虚拟KOL生成解决方案,具有广泛的应用前景。 Abstract: Key Opinion Leader (KOL) play a crucial role in modern marketing by shaping consumer perceptions and enhancing brand credibility. However, collaborating with human KOLs often involves high costs and logistical challenges. To address this, we present GenKOL, an interactive system that empowers marketing professionals to efficiently generate high-quality virtual KOL images using generative AI. GenKOL enables users to dynamically compose promotional visuals through an intuitive interface that integrates multiple AI capabilities, including garment generation, makeup transfer, background synthesis, and hair editing. These capabilities are implemented as modular, interchangeable services that can be deployed flexibly on local machines or in the cloud. This modular architecture ensures adaptability across diverse use cases and computational environments. Our system can significantly streamline the production of branded content, lowering costs and accelerating marketing workflows through scalable virtual KOL creation.[113] DF-LLaVA: Unlocking MLLM's potential for Synthetic Image Detection via Prompt-Guided Knowledge Injection
Zhuokang Shen,Kaisen Zhang,Bohan Jia,Yuan Fang,Zhou Yu,Shaohui Lin
Main category: cs.CV
TL;DR: 提出DF-LLaVA框架,通过提取并注入MLLM的潜在知识,实现超越专家模型的检测精度,同时保持良好的可解释性,用于合成图像检测。
Details
Motivation: 现有检测模型在图像真实性分类上缺乏可解释性,而基于MLLM的方法虽可解释但分类准确率不足,因此需要一种兼具高精度与可解释性的方法。 Method: 提出DF-LLaVA框架,首先从多模态大语言模型(MLLM)中提取潜在知识,并通过提示词(prompt)方式注入训练过程,以提升检测性能。 Result: DF-LLaVA在多个实验中表现出色,检测精度超过现有专家模型,同时保留了MLLM提供的自然语言解释能力,实现了高精度与高可解释性的平衡。 Conclusion: DF-LLaVA有效结合了MLLM的可解释性与高性能检测能力,为合成图像检测提供了一种兼具准确性与透明度的解决方案。 Abstract: With the increasing prevalence of synthetic images, evaluating image authenticity and locating forgeries accurately while maintaining human interpretability remains a challenging task. Existing detection models primarily focus on simple authenticity classification, ultimately providing only a forgery probability or binary judgment, which offers limited explanatory insights into image authenticity. Moreover, while MLLM-based detection methods can provide more interpretable results, they still lag behind expert models in terms of pure authenticity classification accuracy. To address this, we propose DF-LLaVA, a simple yet effective framework that unlocks the intrinsic discrimination potential of MLLMs. Our approach first extracts latent knowledge from MLLMs and then injects it into training via prompts. This framework allows LLaVA to achieve outstanding detection accuracy exceeding expert models while still maintaining the interpretability offered by MLLMs. Extensive experiments confirm the superiority of our DF-LLaVA, achieving both high accuracy and explainability in synthetic image detection. Code is available online at: https://github.com/Eliot-Shen/DF-LLaVA.[114] Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification
Xiang Tuo,Xu Xuemiao,Liu Bangzhen,Li Jinyi,Li Yong,He Shengfeng
Main category: cs.CV
TL;DR: 提出了一种名为跨模态几何校正(CMGR)的框架,通过利用CLIP的分层空间语义来增强3D几何保真度,有效解决了3D类别增量学习中的数据稀缺、几何错位和纹理偏差问题。
Details
Motivation: 现有3D类别增量学习方法在极端数据稀缺下表现不佳,主要受几何错位和纹理偏差影响,且当前融合2D基础模型的方法存在语义模糊和决策原型不稳定的问题。 Method: 提出了CMGR框架,包括结构感知的几何校正模块(通过注意力驱动的几何融合对齐3D部件结构与CLIP中间空间先验)和纹理增强模块(合成最小但判别性强的纹理以抑制噪声),并引入基类-新类判别器来稳定增量原型。 Result: 实验表明,该方法在跨域和域内设置下显著提升了3D少样本类别增量学习性能,具有更强的几何一致性及对纹理偏差的鲁棒性。 Conclusion: CMGR通过利用2D基础模型的层次化空间语义实现3D几何结构的精细化校正,为开放世界中低数据条件下的3D识别提供了有效解决方案。 Abstract: The rapid growth of 3D digital content necessitates expandable recognition systems for open-world scenarios. However, existing 3D class-incremental learning methods struggle under extreme data scarcity due to geometric misalignment and texture bias. While recent approaches integrate 3D data with 2D foundation models (e.g., CLIP), they suffer from semantic blurring caused by texture-biased projections and indiscriminate fusion of geometric-textural cues, leading to unstable decision prototypes and catastrophic forgetting. To address these issues, we propose Cross-Modal Geometric Rectification (CMGR), a framework that enhances 3D geometric fidelity by leveraging CLIP's hierarchical spatial semantics. Specifically, we introduce a Structure-Aware Geometric Rectification module that hierarchically aligns 3D part structures with CLIP's intermediate spatial priors through attention-driven geometric fusion. Additionally, a Texture Amplification Module synthesizes minimal yet discriminative textures to suppress noise and reinforce cross-modal consistency. To further stabilize incremental prototypes, we employ a Base-Novel Discriminator that isolates geometric variations. Extensive experiments demonstrate that our method significantly improves 3D few-shot class-incremental learning, achieving superior geometric coherence and robustness to texture bias across cross-domain and within-domain settings.[115] Brain-HGCN: A Hyperbolic Graph Convolutional Network for Brain Functional Network Analysis
Junhao Jia,Yunyou Liu,Cheng Yang,Yifei Sun,Feiwei Qin,Changmiao Wang,Yong Peng
Main category: cs.CV
TL;DR: 提出了一种基于双曲几何的脑功能网络分析框架Brain-HGCN,利用洛伦兹模型和符号聚合机制,在fMRI数据上实现了对精神疾病更准确的分类。
Details
Motivation: 标准欧几里得图神经网络难以无失真地建模大脑功能网络的层次结构,限制了其在临床中的表现。 Method: 采用洛伦兹模型下的双曲图注意力层,引入符号聚合机制区分兴奋性和抑制性连接,并使用几何合理的弗雷歇均值进行图读出。 Result: 在两个大规模fMRI数据集上的实验表明,该方法显著优于多种先进的欧几里得基线模型。 Conclusion: Brain-HGCN为fMRI分析提供了新的几何深度学习范式,展示了双曲图神经网络在计算精神病学中的巨大潜力。 Abstract: Functional magnetic resonance imaging (fMRI) provides a powerful non-invasive window into the brain's functional organization by generating complex functional networks, typically modeled as graphs. These brain networks exhibit a hierarchical topology that is crucial for cognitive processing. However, due to inherent spatial constraints, standard Euclidean GNNs struggle to represent these hierarchical structures without high distortion, limiting their clinical performance. To address this limitation, we propose Brain-HGCN, a geometric deep learning framework based on hyperbolic geometry, which leverages the intrinsic property of negatively curved space to model the brain's network hierarchy with high fidelity. Grounded in the Lorentz model, our model employs a novel hyperbolic graph attention layer with a signed aggregation mechanism to distinctly process excitatory and inhibitory connections, ultimately learning robust graph-level representations via a geometrically sound Fr\'echet mean for graph readout. Experiments on two large-scale fMRI datasets for psychiatric disorder classification demonstrate that our approach significantly outperforms a wide range of state-of-the-art Euclidean baselines. This work pioneers a new geometric deep learning paradigm for fMRI analysis, highlighting the immense potential of hyperbolic GNNs in the field of computational psychiatry.[116] RoboEye: Enhancing 2D Robotic Object Identification with Selective 3D Geometric Keypoint Matching
Xingwu Zhang,Guanxuan Li,Zhuocheng Zhang,Zijun Long
Main category: cs.CV
TL;DR: 本文提出RoboEye,一种两阶段物体识别框架,通过结合2D语义特征与领域自适应的3D推理,提升在复杂仓储环境下的商品识别准确率。
Details
Motivation: 随着电商商品类别的快速增长,传统仅依赖2D外观特征的方法在面对类内差异大、长尾分布、遮挡和视角变化等挑战时性能显著下降。 Method: 第一阶段使用大视觉模型提取2D特征生成候选排序;第二阶段引入轻量级3D特征感知模块判断是否需要3D重排序,并利用3D特征提取器和基于关键点的匹配器进行精确匹配。 Result: 实验表明,RoboEye相比先前最优方法RoboLLM将Recall@1提升了7.1%,且仅使用RGB图像,无需显式3D输入。 Conclusion: RoboEye有效缓解了训练与部署间的差距,在不增加硬件成本的前提下显著提升了大规模仓储场景下的物体识别性能。 Abstract: The rapidly growing number of product categories in large-scale e-commerce makes accurate object identification for automated packing in warehouses substantially more difficult. As the catalog grows, intra-class variability and a long tail of rare or visually similar items increase, and when combined with diverse packaging, cluttered containers, frequent occlusion, and large viewpoint changes-these factors amplify discrepancies between query and reference images, causing sharp performance drops for methods that rely solely on 2D appearance features. Thus, we propose RoboEye, a two-stage identification framework that dynamically augments 2D semantic features with domain-adapted 3D reasoning and lightweight adapters to bridge training deployment gaps. In the first stage, we train a large vision model to extract 2D features for generating candidate rankings. A lightweight 3D-feature-awareness module then estimates 3D feature quality and predicts whether 3D re-ranking is necessary, preventing performance degradation and avoiding unnecessary computation. When invoked, the second stage uses our robot 3D retrieval transformer, comprising a 3D feature extractor that produces geometry-aware dense features and a keypoint-based matcher that computes keypoint-correspondence confidences between query and reference images instead of conventional cosine-similarity scoring. Experiments show that RoboEye improves Recall@1 by 7.1% over the prior state of the art (RoboLLM). Moreover, RoboEye operates using only RGB images, avoiding reliance on explicit 3D inputs and reducing deployment costs. The code used in this paper is publicly available at: https://github.com/longkukuhi/RoboEye.[117] Beyond Random Masking: A Dual-Stream Approach for Rotation-Invariant Point Cloud Masked Autoencoders
Xuanhua Yin,Dingxin Zhang,Yu Feng,Shunqi Mao,Jianhui Yu,Weidong Cai
Main category: cs.CV
TL;DR: 提出了一种双流掩码方法,结合3D空间网格掩码和渐进式语义掩码,提升旋转不变点云MAE的性能。
Details
Motivation: 现有旋转不变点云MAE依赖随机掩码策略,忽略几何结构和语义一致性,无法捕捉跨方向的空间关系和保持语义部分的完整性。 Method: 提出双流掩码方法:1)3D空间网格掩码通过坐标排序构建结构化模式以捕获跨方向的几何关系;2)渐进式语义掩码利用注意力驱动聚类发现语义部分并保持其一致性;通过课程学习与动态加权协调两流。 Result: 在ModelNet40、ScanObjectNN和OmniObject3D上实验表明,所提方法在多种旋转场景下均优于基线方法,性能显著提升且兼容现有框架。 Conclusion: 该双流掩码策略有效增强了旋转不变点云MAE对几何结构和语义信息的建模能力,具有即插即用特性,适用于多种现有框架。 Abstract: Existing rotation-invariant point cloud masked autoencoders (MAE) rely on random masking strategies that overlook geometric structure and semantic coherence. Random masking treats patches independently, failing to capture spatial relationships consistent across orientations and overlooking semantic object parts that maintain identity regardless of rotation. We propose a dual-stream masking approach combining 3D Spatial Grid Masking and Progressive Semantic Masking to address these fundamental limitations. Grid masking creates structured patterns through coordinate sorting to capture geometric relationships that persist across different orientations, while semantic masking uses attention-driven clustering to discover semantically meaningful parts and maintain their coherence during masking. These complementary streams are orchestrated via curriculum learning with dynamic weighting, progressing from geometric understanding to semantic discovery. Designed as plug-and-play components, our strategies integrate into existing rotation-invariant frameworks without architectural changes, ensuring broad compatibility across different approaches. Comprehensive experiments on ModelNet40, ScanObjectNN, and OmniObject3D demonstrate consistent improvements across various rotation scenarios, showing substantial performance gains over the baseline rotation-invariant methods.[118] EchoVLM: Dynamic Mixture-of-Experts Vision-Language Model for Universal Ultrasound Intelligence
Chaoyin She,Ruifang Lu,Lida Chen,Wei Wang,Qinghua Huang
Main category: cs.CV
TL;DR: 本文提出了一种专用于超声医学成像的视觉-语言模型EchoVLM,采用Mixture of Experts架构,支持多器官、多任务诊断,在报告生成等任务中显著优于现有模型。
Details
Motivation: 传统超声诊断依赖医生经验,存在主观性强、效率低的问题;现有通用视觉-语言模型在超声任务中知识有限、泛化能力差。 Method: 提出EchoVLM模型,采用Mixture of Experts(MoE)架构,基于涵盖七个解剖区域的数据进行训练,支持超声报告生成、诊断和视觉问答(VQA)等多任务。 Result: 在超声报告生成任务中,相比Qwen2-VL,EchoVLM的BLEU-1分数提升10.15分,ROUGE-1分数提升4.77分。 Conclusion: EchoVLM在超声图像诊断中展现出显著的性能提升,具备增强诊断准确性的潜力,可为未来临床应用提供可行的技术方案。 Abstract: Ultrasound imaging has become the preferred imaging modality for early cancer screening due to its advantages of non-ionizing radiation, low cost, and real-time imaging capabilities. However, conventional ultrasound diagnosis heavily relies on physician expertise, presenting challenges of high subjectivity and low diagnostic efficiency. Vision-language models (VLMs) offer promising solutions for this issue, but existing general-purpose models demonstrate limited knowledge in ultrasound medical tasks, with poor generalization in multi-organ lesion recognition and low efficiency across multi-task diagnostics. To address these limitations, we propose EchoVLM, a vision-language model specifically designed for ultrasound medical imaging. The model employs a Mixture of Experts (MoE) architecture trained on data spanning seven anatomical regions. This design enables the model to perform multiple tasks, including ultrasound report generation, diagnosis and visual question-answering (VQA). The experimental results demonstrated that EchoVLM achieved significant improvements of 10.15 and 4.77 points in BLEU-1 scores and ROUGE-1 scores respectively compared to Qwen2-VL on the ultrasound report generation task. These findings suggest that EchoVLM has substantial potential to enhance diagnostic accuracy in ultrasound imaging, thereby providing a viable technical solution for future clinical applications. Source code and model weights are available at https://github.com/Asunatan/EchoVLM.[119] SPATIALGEN: Layout-guided 3D Indoor Scene Generation
Chuan Fang,Heng Li,Yixun Liang,Jia Zheng,Yongsen Mao,Yuan Liu,Rui Tang,Zihan Zhou,Ping Tan
Main category: cs.CV
TL;DR: 本文提出了一种新的多视角多模态扩散模型SpatialGen,用于生成逼真且语义一致的3D室内场景,并构建了一个包含12,328个标注场景的大规模合成数据集以支持该任务。
Details
Motivation: 现有的生成式AI在室内场景合成中难以平衡视觉质量、多样性、语义一致性和用户控制,且缺乏高质量的大规模专用数据集。 Method: 构建了一个包含12,328个结构化标注场景和470万张2D渲染图像的合成数据集,并提出SpatialGen模型,基于3D布局和参考图像(来自文本提示)生成多视角下的外观、几何和语义信息。 Result: 实验表明,SpatialGen在生成结果上优于先前方法,能够跨模态保持空间一致性,并实现高保真的3D室内场景合成。 Conclusion: SpatialGen结合大规模合成数据集,在3D室内场景生成中实现了更高质量和语义一致性,作者已开源数据和模型以推动该领域发展。 Abstract: Creating high-fidelity 3D models of indoor environments is essential for applications in design, virtual reality, and robotics. However, manual 3D modeling remains time-consuming and labor-intensive. While recent advances in generative AI have enabled automated scene synthesis, existing methods often face challenges in balancing visual quality, diversity, semantic consistency, and user control. A major bottleneck is the lack of a large-scale, high-quality dataset tailored to this task. To address this gap, we introduce a comprehensive synthetic dataset, featuring 12,328 structured annotated scenes with 57,440 rooms, and 4.7M photorealistic 2D renderings. Leveraging this dataset, we present SpatialGen, a novel multi-view multi-modal diffusion model that generates realistic and semantically consistent 3D indoor scenes. Given a 3D layout and a reference image (derived from a text prompt), our model synthesizes appearance (color image), geometry (scene coordinate map), and semantic (semantic segmentation map) from arbitrary viewpoints, while preserving spatial consistency across modalities. SpatialGen consistently generates superior results to previous methods in our experiments. We are open-sourcing our data and models to empower the community and advance the field of indoor scene understanding and generation.[120] PRISM: Product Retrieval In Shopping Carts using Hybrid Matching
Arda Kabadayi,Senem Velipasalar,Jiajing Chen
Main category: cs.CV
TL;DR: 本文提出了一种用于零售场景商品检索的新型混合方法PRISM,结合了视觉-语言模型和像素级匹配的优势,在保持实时处理能力的同时显著提升了检索精度。
Details
Motivation: 传统商品检索方法在处理品牌间外观相似但细节不同的商品时存在困难,现有基础模型难以捕捉细微差异,而像素级匹配方法计算开销大、速度慢。 Method: PRISM分为三个阶段:首先使用SigLIP模型从图库中快速筛选出前35个语义最相似的商品;然后利用YOLO-E分割模型去除背景干扰;最后在候选集中使用LightGlue进行细粒度的像素级匹配。 Result: 在ABV数据集上的实验表明,PRISM在top-1准确率上比现有最先进方法提高了4.21%,同时满足实时处理要求。 Conclusion: PRISM通过融合全局语义检索与局部精细匹配,在效率与准确性之间取得了良好平衡,适用于实际零售环境中的商品检索任务。 Abstract: Compared to traditional image retrieval tasks, product retrieval in retail settings is even more challenging. Products of the same type from different brands may have highly similar visual appearances, and the query image may be taken from an angle that differs significantly from view angles of the stored catalog images. Foundational models, such as CLIP and SigLIP, often struggle to distinguish these subtle but important local differences. Pixel-wise matching methods, on the other hand, are computationally expensive and incur prohibitively high matching times. In this paper, we propose a new, hybrid method, called PRISM, for product retrieval in retail settings by leveraging the advantages of both vision-language model-based and pixel-wise matching approaches. To provide both efficiency/speed and finegrained retrieval accuracy, PRISM consists of three stages: 1) A vision-language model (SigLIP) is employed first to retrieve the top 35 most semantically similar products from a fixed gallery, thereby narrowing the search space significantly; 2) a segmentation model (YOLO-E) is applied to eliminate background clutter; 3) fine-grained pixel-level matching is performed using LightGlue across the filtered candidates. This framework enables more accurate discrimination between products with high inter-class similarity by focusing on subtle visual cues often missed by global models. Experiments performed on the ABV dataset show that our proposed PRISM outperforms the state-of-the-art image retrieval methods by 4.21% in top-1 accuracy while still remaining within the bounds of real-time processing for practical retail deployments.[121] UCorr: Wire Detection and Depth Estimation for Autonomous Drones
Benedikt Kolbeinsson,Krystian Mikolajczyk
Main category: cs.CV
TL;DR: 提出一种基于单目视觉的端到端模型,结合时序相关性层和合成数据,实现电线分割与深度估计,提升自主无人机避障能力。
Details
Motivation: 电线因细长特性难以检测,对全自主无人机的安全导航构成挑战,需提高其在复杂环境中的障碍物感知能力。 Method: 设计一个单目端到端网络,引入时序相关性层,并在合成数据上进行训练,以同时完成电线分割与深度估计任务。 Result: 实验表明,该方法在电线检测与深度估计联合任务上优于现有竞争方法,具有更高的准确性和鲁棒性。 Conclusion: 所提模型能有效提升无人机对细小障碍物的感知能力,增强飞行安全性,具备实际应用潜力。 Abstract: In the realm of fully autonomous drones, the accurate detection of obstacles is paramount to ensure safe navigation and prevent collisions. Among these challenges, the detection of wires stands out due to their slender profile, which poses a unique and intricate problem. To address this issue, we present an innovative solution in the form of a monocular end-to-end model for wire segmentation and depth estimation. Our approach leverages a temporal correlation layer trained on synthetic data, providing the model with the ability to effectively tackle the complex joint task of wire detection and depth estimation. We demonstrate the superiority of our proposed method over existing competitive approaches in the joint task of wire detection and depth estimation. Our results underscore the potential of our model to enhance the safety and precision of autonomous drones, shedding light on its promising applications in real-world scenarios.[122] Sea-ing Through Scattered Rays: Revisiting the Image Formation Model for Realistic Underwater Image Generation
Vasiliki Ismiroglou,Malte Pedersen,Stefan H. Bengtson,Andreas Aakerberg,Thomas B. Moeslund
Main category: cs.CV
TL;DR: 提出了一种改进的合成水下数据生成管道,包含常被忽略的前向散射项,并考虑非均匀介质,通过在受控浑浊条件下收集的BUCKET数据集验证了方法的有效性。
Details
Motivation: 现有水下图像生成模型多关注变色问题,忽略了高浑浊环境中距离依赖的可见性损失建模能力。 Method: 引入前向散射项并考虑非均匀介质,构建改进的合成数据生成管道,并采集真实浑浊环境下的BUCKET数据集进行验证。 Result: 在浑浊度增加的情况下,相比参考模型表现出定性改进,用户调查中选择率达到82.5%。 Conclusion: 所提方法能更准确地模拟高浑浊水下场景,显著提升合成数据质量,有助于相关算法的训练与评估。 Abstract: In recent years, the underwater image formation model has found extensive use in the generation of synthetic underwater data. Although many approaches focus on scenes primarily affected by discoloration, they often overlook the model's ability to capture the complex, distance-dependent visibility loss present in highly turbid environments. In this work, we propose an improved synthetic data generation pipeline that includes the commonly omitted forward scattering term, while also considering a nonuniform medium. Additionally, we collected the BUCKET dataset under controlled turbidity conditions to acquire real turbid footage with the corresponding reference images. Our results demonstrate qualitative improvements over the reference model, particularly under increasing turbidity, with a selection rate of 82. 5\% by survey participants. Data and code can be accessed on the project page: vap.aau.dk/sea-ing-through-scattered-rays.[123] No Modality Left Behind: Adapting to Missing Modalities via Knowledge Distillation for Brain Tumor Segmentation
Shenghao Zhu,Yifei Chen,Weihong Chen,Shuo Jiang,Guanyu Zhou,Yuanhan Wang,Feiwei Qin,Changmiao Wang,Qiyuan Tian
Main category: cs.CV
TL;DR: 提出AdaMM,一种基于知识蒸馏的多模态脑肿瘤分割框架,有效应对缺失模态问题,在单模态和弱模态下表现优异。
Details
Motivation: 临床中常出现模态缺失,现有依赖完整输入的深度学习方法鲁棒性和泛化性受限,尤其在非主导模态组合下表现不佳。 Method: 提出AdaMM框架,包含三个模块:图引导自适应优化模块建模通用与模态特异性特征关联;双向瓶颈蒸馏模块通过全局风格匹配和对抗特征对齐进行知识迁移;病灶存在引导可靠性模块通过辅助分类抑制假阳性。 Result: 在BraTS 2018和2024数据集上实验表明,AdaMM在多种缺失模态配置下均优于现有方法,尤其在单模态和弱模态下具有更高分割精度和鲁棒性,并系统评估了六类缺失模态策略。 Conclusion: AdaMM通过知识蒸馏和协同模块设计,显著提升了多模态缺失下的脑肿瘤分割性能,为实际临床应用提供了可靠解决方案。 Abstract: Accurate brain tumor segmentation is essential for preoperative evaluation and personalized treatment. Multi-modal MRI is widely used due to its ability to capture complementary tumor features across different sequences. However, in clinical practice, missing modalities are common, limiting the robustness and generalizability of existing deep learning methods that rely on complete inputs, especially under non-dominant modality combinations. To address this, we propose AdaMM, a multi-modal brain tumor segmentation framework tailored for missing-modality scenarios, centered on knowledge distillation and composed of three synergistic modules. The Graph-guided Adaptive Refinement Module explicitly models semantic associations between generalizable and modality-specific features, enhancing adaptability to modality absence. The Bi-Bottleneck Distillation Module transfers structural and textural knowledge from teacher to student models via global style matching and adversarial feature alignment. The Lesion-Presence-Guided Reliability Module predicts prior probabilities of lesion types through an auxiliary classification task, effectively suppressing false positives under incomplete inputs. Extensive experiments on the BraTS 2018 and 2024 datasets demonstrate that AdaMM consistently outperforms existing methods, exhibiting superior segmentation accuracy and robustness, particularly in single-modality and weak-modality configurations. In addition, we conduct a systematic evaluation of six categories of missing-modality strategies, confirming the superiority of knowledge distillation and offering practical guidance for method selection and future research. Our source code is available at https://github.com/Quanato607/AdaMM.[124] AutoEdit: Automatic Hyperparameter Tuning for Image Editing
Chau Pham,Quan Dao,Mahesh Bhosale,Yunjie Tian,Dimitris Metaxas,David Doermann
Main category: cs.CV
TL;DR: 提出了一种基于强化学习的扩散模型图像编辑超参数优化框架,通过将超参数搜索建模为序列决策任务,显著降低了计算开销和搜索时间。
Details
Motivation: 现有文本引导图像编辑方法需要手动调参多个相互依赖的超参数,导致计算成本高且效率低。 Method: 将超参数优化视为扩散去噪过程中的序列决策问题,构建马尔可夫决策过程,利用近端策略优化(PPO)动态调整每一步的超参数,并将编辑目标融入奖励函数。 Result: 实验表明该方法相比暴力调参显著减少了搜索时间和计算开销,同时保持了良好的编辑性能。 Conclusion: 所提强化学习框架有效解决了扩散模型图像编辑中超参数调优的效率问题,推动了该技术在现实场景中的实际部署。 Abstract: Recent advances in diffusion models have revolutionized text-guided image editing, yet existing editing methods face critical challenges in hyperparameter identification. To get the reasonable editing performance, these methods often require the user to brute-force tune multiple interdependent hyperparameters, such as inversion timesteps and attention modification, \textit{etc.} This process incurs high computational costs due to the huge hyperparameter search space. We consider searching optimal editing's hyperparameters as a sequential decision-making task within the diffusion denoising process. Specifically, we propose a reinforcement learning framework, which establishes a Markov Decision Process that dynamically adjusts hyperparameters across denoising steps, integrating editing objectives into a reward function. The method achieves time efficiency through proximal policy optimization while maintaining optimal hyperparameter configurations. Experiments demonstrate significant reduction in search time and computational overhead compared to existing brute-force approaches, advancing the practical deployment of a diffusion-based image editing framework in the real world.[125] Synthetic-to-Real Object Detection using YOLOv11 and Domain Randomization Strategies
Luisa Torquato Niño,Hamza A. A. Gardi
Main category: cs.CV
TL;DR: 该论文研究了在仅使用合成数据和域随机化策略的情况下,训练YOLOv11模型检测特定物体(汤罐)时的合成到真实域差距问题。
Details
Motivation: 解决合成数据训练的目标检测模型在真实场景中性能不佳的问题,探索不依赖真实标注数据的有效训练方法。 Method: 通过数据增强、数据集构成优化和模型缩放进行大量实验,采用多样化的合成数据(不同视角和复杂背景)并结合精细调整的数据增强策略,使用YOLOv11模型进行训练,并在真实世界测试集上进行定量与定性评估。 Result: 最佳配置的YOLOv11l模型在扩展且多样化的数据集上训练后,在竞赛隐藏测试集上达到了0.910的mAP@50分数,尽管合成验证指标无法准确预测实际表现。 Conclusion: 仅使用合成数据结合域随机化和充分的数据多样性可以有效缩小合成到真实的域差距,但真实世界中的复杂变异性仍带来挑战。 Abstract: This paper addresses the synthetic-to-real domain gap in object detection, focusing on training a YOLOv11 model to detect a specific object (a soup can) using only synthetic data and domain randomization strategies. The methodology involves extensive experimentation with data augmentation, dataset composition, and model scaling. While synthetic validation metrics were consistently high, they proved to be poor predictors of real-world performance. Consequently, models were also evaluated qualitatively, through visual inspection of predictions, and quantitatively, on a manually labeled real-world test set, to guide development. Final mAP@50 scores were provided by the official Kaggle competition. Key findings indicate that increasing synthetic dataset diversity, specifically by including varied perspectives and complex backgrounds, combined with carefully tuned data augmentation, were crucial in bridging the domain gap. The best performing configuration, a YOLOv11l model trained on an expanded and diverse dataset, achieved a final mAP@50 of 0.910 on the competition's hidden test set. This result demonstrates the potential of a synthetic-only training approach while also highlighting the remaining challenges in fully capturing real-world variability.[126] Transplant-Ready? Evaluating AI Lung Segmentation Models in Candidates with Severe Lung Disease
Jisoo Lee,Michael R. Harowicz,Yuwen Chen,Hanxue Gu,Isaac S. Alderete,Lin Li,Maciej A. Mazurowski,Matthew G. Hartwig
Main category: cs.CV
TL;DR: 该研究评估了三种深度学习肺部分割模型在适合移植患者中的表现,发现Unet-R231整体性能最优,但所有模型在中重度病例中性能显著下降,提示严重病理情况下需专门微调模型。
Details
Motivation: 评估现有深度学习肺部分割模型在不同疾病严重程度、病理类型和肺侧的表现,识别其在肺移植术前规划应用中的局限性。 Method: 使用32名患者的3645张胸部CT轴向切片,采用Unet-R231、TotalSegmentator和MedSAM三种模型进行肺部分割,通过体积相似性、Dice相似系数、Hausdorff距离和四点临床可接受度评分评估性能。 Result: Unet-R231在各类指标上均优于TotalSegmentator和MedSAM(p<0.05),所有模型从中度到重度病例的性能显著下降(尤其在体积相似性方面,p<0.05),但左右肺或病理类型间无显著差异。 Conclusion: Unet-R231是当前最准确的自动肺部分割模型,TotalSegmentator次之,但在中重度病理情况下性能明显下降,需针对严重病例进行模型优化以提升临床适用性。 Abstract: This study evaluates publicly available deep-learning based lung segmentation models in transplant-eligible patients to determine their performance across disease severity levels, pathology categories, and lung sides, and to identify limitations impacting their use in preoperative planning in lung transplantation. This retrospective study included 32 patients who underwent chest CT scans at Duke University Health System between 2017 and 2019 (total of 3,645 2D axial slices). Patients with standard axial CT scans were selected based on the presence of two or more lung pathologies of varying severity. Lung segmentation was performed using three previously developed deep learning models: Unet-R231, TotalSegmentator, MedSAM. Performance was assessed using quantitative metrics (volumetric similarity, Dice similarity coefficient, Hausdorff distance) and a qualitative measure (four-point clinical acceptability scale). Unet-R231 consistently outperformed TotalSegmentator and MedSAM in general, for different severity levels, and pathology categories (p<0.05). All models showed significant performance declines from mild to moderate-to-severe cases, particularly in volumetric similarity (p<0.05), without significant differences among lung sides or pathology types. Unet-R231 provided the most accurate automated lung segmentation among evaluated models with TotalSegmentator being a close second, though their performance declined significantly in moderate-to-severe cases, emphasizing the need for specialized model fine-tuning in severe pathology contexts.[127] OmniSegmentor: A Flexible Multi-Modal Learning Framework for Semantic Segmentation
Bo-Wen Yin,Jiao-Long Cao,Xuying Zhang,Yuming Chen,Ming-Ming Cheng,Qibin Hou
Main category: cs.CV
TL;DR: 本文提出了OmniSegmentor,一种通用的多模态预训练框架,通过构建大规模多模态数据集ImageNeXt并实现灵活的预训练方式,在多种语义分割任务中取得领先性能。
Details
Motivation: 现有的多模态语义分割缺乏一个灵活且通用的预训练-微调框架,难以有效利用多种视觉模态的组合。 Method: 提出OmniSegmentor框架,构建包含五种视觉模态的大规模数据集ImageNeXt,并设计高效的预训练方法,使模型能适应任意模态组合。 Result: 在NYU Depthv2、EventScape、MFNet、DeLiVER、SUNRGBD和KITTI-360等多个多模态语义分割数据集上达到最先进的性能。 Conclusion: OmniSegmentor首次实现了通用的多模态预训练,显著提升了模型在各种场景下的感知能力,具有广泛的适用性和优越性。 Abstract: Recent research on representation learning has proved the merits of multi-modal clues for robust semantic segmentation. Nevertheless, a flexible pretrain-and-finetune pipeline for multiple visual modalities remains unexplored. In this paper, we propose a novel multi-modal learning framework, termed OmniSegmentor. It has two key innovations: 1) Based on ImageNet, we assemble a large-scale dataset for multi-modal pretraining, called ImageNeXt, which contains five popular visual modalities. 2) We provide an efficient pretraining manner to endow the model with the capacity to encode different modality information in the ImageNeXt. For the first time, we introduce a universal multi-modal pretraining framework that consistently amplifies the model's perceptual capabilities across various scenarios, regardless of the arbitrary combination of the involved modalities. Remarkably, our OmniSegmentor achieves new state-of-the-art records on a wide range of multi-modal semantic segmentation datasets, including NYU Depthv2, EventScape, MFNet, DeLiVER, SUNRGBD, and KITTI-360.[128] RGB-Only Supervised Camera Parameter Optimization in Dynamic Scenes
Fang Li,Hao Zhang,Narendra Ahuja
Main category: cs.CV
TL;DR: 本文提出了一种仅通过单个RGB视频进行动态场景下相机参数优化的新方法,无需依赖真实运动掩码或其他先验信息。
Details
Motivation: COLMAP在静态场景中占主导地位,但在动态场景中受限于长运行时间和对真实运动掩码的依赖,且多数改进方法需要难以获取的额外监督信号。 Method: 该方法包含三个关键组件:(1) 块状跟踪滤波器,建立RGB视频中的稀疏铰链关系;(2) 异常感知联合优化,自适应降低运动异常值权重;(3) 两阶段优化策略,平衡损失函数中的Softplus限制与凸极小值以提升稳定性与速度。 Result: 在4个真实世界数据集和1个合成数据集上验证了该方法能更高效、准确地估计相机参数,并通过4D重建进一步评估了结果质量。 Conclusion: 所提方法在仅使用单个RGB视频作为监督的情况下,在动态场景中实现了比现有方法更准确且高效的相机参数优化。 Abstract: Although COLMAP has long remained the predominant method for camera parameter optimization in static scenes, it is constrained by its lengthy runtime and reliance on ground truth (GT) motion masks for application to dynamic scenes. Many efforts attempted to improve it by incorporating more priors as supervision such as GT focal length, motion masks, 3D point clouds, camera poses, and metric depth, which, however, are typically unavailable in casually captured RGB videos. In this paper, we propose a novel method for more accurate and efficient camera parameter optimization in dynamic scenes solely supervised by a single RGB video. Our method consists of three key components: (1) Patch-wise Tracking Filters, to establish robust and maximally sparse hinge-like relations across the RGB video. (2) Outlier-aware Joint Optimization, for efficient camera parameter optimization by adaptive down-weighting of moving outliers, without reliance on motion priors. (3) A Two-stage Optimization Strategy, to enhance stability and optimization speed by a trade-off between the Softplus limits and convex minima in losses. We visually and numerically evaluate our camera estimates. To further validate accuracy, we feed the camera estimates into a 4D reconstruction method and assess the resulting 3D scenes, and rendered 2D RGB and depth maps. We perform experiments on 4 real-world datasets (NeRF-DS, DAVIS, iPhone, and TUM-dynamics) and 1 synthetic dataset (MPI-Sintel), demonstrating that our method estimates camera parameters more efficiently and accurately with a single RGB video as the only supervision.[129] MedFact-R1: Towards Factual Medical Reasoning via Pseudo-Label Augmentation
Gengliang Li,Rongyu Chen,Bin Li,Linlin Yang,Guodong Ding
Main category: cs.CV
TL;DR: 提出MEDFACT-R1框架,结合外部知识和强化学习,显著提升医学视觉语言模型的事实准确性。
Details
Motivation: 确保医学视觉语言模型在生成文本时具备事实一致性和可靠推理能力,是当前面临的关键挑战。 Method: 采用两阶段框架:第一阶段通过伪标签监督微调(SFT)引入外部医学知识;第二阶段使用分组相对策略优化(GRPO),结合四种定制的事实性奖励信号,促进自洽推理。 Result: 在三个公开的医学问答基准上,相比先前最优方法,事实准确性最高提升了22.5%。消融实验验证了伪标签SFT冷启动和各GRPO奖励信号的必要性。 Conclusion: 知识 grounding 与强化学习驱动的推理相结合,能有效提升医学AI系统的可信度和事实一致性。 Abstract: Ensuring factual consistency and reliable reasoning remains a critical challenge for medical vision-language models. We introduce MEDFACT-R1, a two-stage framework that integrates external knowledge grounding with reinforcement learning to improve the factual medical reasoning. The first stage uses pseudo-label supervised fine-tuning (SFT) to incorporate external factual expertise; while the second stage applies Group Relative Policy Optimization (GRPO) with four tailored factual reward signals to encourage self-consistent reasoning. Across three public medical QA benchmarks, MEDFACT-R1 delivers up to 22.5% absolute improvement in factual accuracy over previous state-of-the-art methods. Ablation studies highlight the necessity of pseudo-label SFT cold start and validate the contribution of each GRPO reward, underscoring the synergy between knowledge grounding and RL-driven reasoning for trustworthy medical AI. Codes are released at https://github.com/Garfieldgengliang/MEDFACT-R1.[130] Leveraging Geometric Visual Illusions as Perceptual Inductive Biases for Vision Models
Haobo Yang,Minghao Guo,Dequan Yang,Wenyu Wang
Main category: cs.CV
TL;DR: 该论文提出将人类感知中的几何视觉错觉引入图像分类模型的训练中,通过合成的错觉数据集和多任务学习策略,发现这种感知启发的归纳偏置能提升模型在复杂轮廓和纹理上的泛化能力。
Details
Motivation: 现有深度学习模型主要依赖数据中的统计规律,缺乏来自感知心理学的结构化先验知识,本文旨在探索将经典视觉错觉作为归纳偏置以增强模型感知能力的可能性。 Method: 构建了一个参数化的合成几何错觉数据集,并采用三种多源学习策略,将错觉识别任务与ImageNet分类任务结合进行联合训练。 Result: 实验表明,引入几何错觉作为辅助监督信号可系统性提升模型泛化性能,尤其在处理复杂轮廓和精细纹理时表现更优;且该方法对CNN和Transformer架构均有效。 Conclusion: 将感知科学(如视觉错觉)融入机器学习训练框架是可行且有效的,为未来视觉模型设计中嵌入感知先验提供了新方向。 Abstract: Contemporary deep learning models have achieved impressive performance in image classification by primarily leveraging statistical regularities within large datasets, but they rarely incorporate structured insights drawn directly from perceptual psychology. To explore the potential of perceptually motivated inductive biases, we propose integrating classic geometric visual illusions well-studied phenomena from human perception into standard image-classification training pipelines. Specifically, we introduce a synthetic, parametric geometric-illusion dataset and evaluate three multi-source learning strategies that combine illusion recognition tasks with ImageNet classification objectives. Our experiments reveal two key conceptual insights: (i) incorporating geometric illusions as auxiliary supervision systematically improves generalization, especially in visually challenging cases involving intricate contours and fine textures; and (ii) perceptually driven inductive biases, even when derived from synthetic stimuli traditionally considered unrelated to natural image recognition, can enhance the structural sensitivity of both CNN and transformer-based architectures. These results demonstrate a novel integration of perceptual science and machine learning and suggest new directions for embedding perceptual priors into vision model design.[131] AIP: Subverting Retrieval-Augmented Generation via Adversarial Instructional Prompt
Saket S. Chaturvedi,Gaurav Bagwe,Lan Zhang,Xiaoyong Yuan
Main category: cs.CV
TL;DR: 本文提出了一种针对检索增强生成(RAG)系统的新型攻击方法——对抗性指令提示(AIP),通过操纵被广泛共享但缺乏审计的指令提示来隐蔽地影响RAG输出,实验证明该攻击在保持自然性和功能性的前提下可达到高达95.23%的攻击成功率。
Details
Motivation: 现有对RAG系统的攻击主要依赖于篡改用户查询,但在实际中用户输入往往是固定或受保护的,因此不现实;而广泛复用且未受审查的指令提示成为一个更隐蔽、可行的攻击向量,值得研究其潜在安全风险。 Method: 提出一种基于对抗性指令提示(AIP)的攻击方法,将攻击面转移到指令提示上;设计多样化的查询生成策略以模拟真实用户语言变化,并采用基于遗传算法的联合优化方法,在攻击成功率、功能保持和隐蔽性之间进行权衡,生成高效的对抗性提示。 Result: 实验结果显示,所提出的AIP攻击在多种设置下可实现高达95.23%的攻击成功率,同时保持对正常任务的功能性,且生成的对抗性提示具有良好的泛化能力和自然性,难以被用户察觉。 Conclusion: 研究表明,RAG系统中被广泛共享的指令提示存在严重安全漏洞,可被用于隐蔽操纵系统行为,因此需重新评估和加强对此类提示的安全审查与管理机制。 Abstract: Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by retrieving relevant documents from external sources to improve factual accuracy and verifiability. However, this reliance introduces new attack surfaces within the retrieval pipeline, beyond the LLM itself. While prior RAG attacks have exposed such vulnerabilities, they largely rely on manipulating user queries, which is often infeasible in practice due to fixed or protected user inputs. This narrow focus overlooks a more realistic and stealthy vector: instructional prompts, which are widely reused, publicly shared, and rarely audited. Their implicit trust makes them a compelling target for adversaries to manipulate RAG behavior covertly. We introduce a novel attack for Adversarial Instructional Prompt (AIP) that exploits adversarial instructional prompts to manipulate RAG outputs by subtly altering retrieval behavior. By shifting the attack surface to the instructional prompts, AIP reveals how trusted yet seemingly benign interface components can be weaponized to degrade system integrity. The attack is crafted to achieve three goals: (1) naturalness, to evade user detection; (2) utility, to encourage use of prompts; and (3) robustness, to remain effective across diverse query variations. We propose a diverse query generation strategy that simulates realistic linguistic variation in user queries, enabling the discovery of prompts that generalize across paraphrases and rephrasings. Building on this, a genetic algorithm-based joint optimization is developed to evolve adversarial prompts by balancing attack success, clean-task utility, and stealthiness. Experimental results show that AIP achieves up to 95.23% ASR while preserving benign functionality. These findings uncover a critical and previously overlooked vulnerability in RAG systems, emphasizing the need to reassess the shared instructional prompts.[132] Semi-Supervised 3D Medical Segmentation from 2D Natural Images Pretrained Model
Pak-Hei Yeung,Jayroop Ramesh,Pengfei Lyu,Ana Namburete,Jagath Rajapakse
Main category: cs.CV
TL;DR: 本文提出了一种模型无关的框架M&N,通过从2D预训练模型逐步蒸馏知识,提升半监督3D医学图像分割性能。该方法采用迭代协同训练和学习率引导采样策略,有效利用少量标注数据和大量未标注数据,在多个公开数据集上实现了最先进的性能。
Details
Motivation: 在3D医学图像分割中,标注数据稀缺,而2D自然图像上的预训练模型蕴含丰富的视觉知识,如何有效迁移这些知识以提升3D医学图像分割性能是本文的研究动机。 Method: 提出M&N框架,采用模型无关的设计,通过迭代协同训练2D预训练模型和从零开始训练的3D分割模型,互相生成伪掩码进行知识蒸馏,并引入学习率引导采样策略,动态调整训练批次中标注与未标注数据的比例,以减少不准确伪标签的影响。 Result: 在多个公开3D医学图像数据集上实验表明,M&N在不同设置下均优于13种现有半监督分割方法,达到最先进水平;消融实验证明其模型无关性,可适配多种网络架构。 Conclusion: M&N是一种有效的知识迁移框架,能够充分利用2D预训练模型的知识提升3D医学图像半监督分割性能,具有良好的通用性和应用前景。 Abstract: This paper explores the transfer of knowledge from general vision models pretrained on 2D natural images to improve 3D medical image segmentation. We focus on the semi-supervised setting, where only a few labeled 3D medical images are available, along with a large set of unlabeled images. To tackle this, we propose a model-agnostic framework that progressively distills knowledge from a 2D pretrained model to a 3D segmentation model trained from scratch. Our approach, M&N, involves iterative co-training of the two models using pseudo-masks generated by each other, along with our proposed learning rate guided sampling that adaptively adjusts the proportion of labeled and unlabeled data in each training batch to align with the models' prediction accuracy and stability, minimizing the adverse effect caused by inaccurate pseudo-masks. Extensive experiments on multiple publicly available datasets demonstrate that M&N achieves state-of-the-art performance, outperforming thirteen existing semi-supervised segmentation approaches under all different settings. Importantly, ablation studies show that M&N remains model-agnostic, allowing seamless integration with different architectures. This ensures its adaptability as more advanced models emerge. The code is available at https://github.com/pakheiyeung/M-N.[133] A Race Bias Free Face Aging Model for Reliable Kinship Verification
Ali Nazari,Bardiya Kariminia,Mohsen Ebrahimi Moghaddam
Main category: cs.CV
TL;DR: 本文提出了一种新的无种族偏见的面部老化GAN模型RA-GAN,用于解决亲子验证中的年龄差距问题,实验表明其在多个数据集和年龄组中显著提升了验证准确率。
Details
Motivation: 由于亲子照片间存在年龄差异且同龄照片常不可得,同时现有面部老化模型存在种族偏见,影响亲缘关系验证效果,因此需要构建一个无种族偏见的面部老化模型。 Method: 提出RA-GAN模型,包含RACEpSp模块和特征混合器,生成无种族偏见的同龄面部图像,并将其应用于KinFaceW-I和KinFaceW-II数据集上的亲缘关系验证。 Result: RA-GAN在所有年龄段平均比SAM-GAN提升13.14%的种族准确性,在60+年龄组比CUSP-GAN提升9.1%;在身份保持方面也优于对比模型;在KinFaceW-I和II上,同龄转换后亲子验证准确率均有提升,部分关系最高提升5.22%。 Conclusion: RA-GAN能有效减少面部老化过程中的种族偏见,生成更公平的同龄图像,显著提升跨年龄亲缘关系验证性能。 Abstract: The age gap in kinship verification addresses the time difference between the photos of the parent and the child. Moreover, their same-age photos are often unavailable, and face aging models are racially biased, which impacts the likeness of photos. Therefore, we propose a face aging GAN model, RA-GAN, consisting of two new modules, RACEpSp and a feature mixer, to produce racially unbiased images. The unbiased synthesized photos are used in kinship verification to investigate the results of verifying same-age parent-child images. The experiments demonstrate that our RA-GAN outperforms SAM-GAN on an average of 13.14\% across all age groups, and CUSP-GAN in the 60+ age group by 9.1\% in terms of racial accuracy. Moreover, RA-GAN can preserve subjects' identities better than SAM-GAN and CUSP-GAN across all age groups. Additionally, we demonstrate that transforming parent and child images from the KinFaceW-I and KinFaceW-II datasets to the same age can enhance the verification accuracy across all age groups. The accuracy increases with our RA-GAN for the kinship relationships of father-son and father-daughter, mother-son, and mother-daughter, which are 5.22, 5.12, 1.63, and 0.41, respectively, on KinFaceW-I. Additionally, the accuracy for the relationships of father-daughter, father-son, and mother-son is 2.9, 0.39, and 1.6 on KinFaceW-II, respectively. The code is available at~\href{https://github.com/bardiya2254kariminia/An-Age-Transformation-whitout-racial-bias-for-Kinship-verification}{Github}[134] Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
Zaiquan Yang,Yuhao Liu,Gerhard Hancke,Rynson W. H. Lau
Main category: cs.CV
TL;DR: 本文提出了一种基于多模态大语言模型(MLLMs)的零样本时空视频定位(STVG)框架,通过解耦查询和时序增强策略提升定位性能。
Details
Motivation: 现有MLLMs在STVG任务中难以充分整合文本查询中的属性和动作线索,导致定位效果不佳,因此需要提升其推理能力。 Method: 提出分解式时空高亮(DSTH)和时序增强组装(TAS)策略:DSTH将查询分解为属性和动作子查询,并通过logit引导的重注意力(LRA)模块生成空间与时间提示;TAS利用原始与增强帧提升时序一致性。 Result: 在多个MLLM上验证了方法的有效性,在三个主流STVG基准上优于当前最优方法。 Conclusion: 所提框架有效释放了MLLM在零样本STVG中的潜力,通过显式建模属性与动作线索并增强时序一致性,显著提升了定位准确率。 Abstract: Spatio-temporal video grounding (STVG) aims at localizing the spatio-temporal tube of a video, as specified by the input text query. In this paper, we utilize multimodal large language models (MLLMs) to explore a zero-shot solution in STVG. We reveal two key insights about MLLMs: (1) MLLMs tend to dynamically assign special tokens, referred to as \textit{grounding tokens}, for grounding the text query; and (2) MLLMs often suffer from suboptimal grounding due to the inability to fully integrate the cues in the text query (\textit{e.g.}, attributes, actions) for inference. Based on these insights, we propose a MLLM-based zero-shot framework for STVG, which includes novel decomposed spatio-temporal highlighting (DSTH) and temporal-augmented assembling (TAS) strategies to unleash the reasoning ability of MLLMs. The DSTH strategy first decouples the original query into attribute and action sub-queries for inquiring the existence of the target both spatially and temporally. It then uses a novel logit-guided re-attention (LRA) module to learn latent variables as spatial and temporal prompts, by regularizing token predictions for each sub-query. These prompts highlight attribute and action cues, respectively, directing the model's attention to reliable spatial and temporal related visual regions. In addition, as the spatial grounding by the attribute sub-query should be temporally consistent, we introduce the TAS strategy to assemble the predictions using the original video frames and the temporal-augmented frames as inputs to help improve temporal consistency. We evaluate our method on various MLLMs, and show that it outperforms SOTA methods on three common STVG benchmarks. The code will be available at https://github.com/zaiquanyang/LLaVA_Next_STVG.[135] Maize Seedling Detection Dataset (MSDD): A Curated High-Resolution RGB Dataset for Seedling Maize Detection and Benchmarking with YOLOv9, YOLO11, YOLOv12 and Faster-RCNN
Dewi Endah Kharismawati,Toni Kazic
Main category: cs.CV
TL;DR: 本文介绍了MSDD,一个用于玉米幼苗检测的高质量航拍图像数据集,旨在提升精准农业中的出苗率统计精度,支持早期作物监测、产量预测和田间管理。
Details
Motivation: 由于现有标注数据集稀缺,且传统人工统计方法费时易错,亟需一种基于计算机视觉的高效、准确的玉米幼苗检测方案。 Method: 构建包含单株、双株和三株幼苗三类标签的MSDD数据集,涵盖多种生长阶段、种植条件、土壤类型、光照、相机角度和密度;采用YOLO系列模型进行基准测试,评估不同模型在不同条件下的检测性能。 Result: 实验表明V4-V6生长阶段和正射视角下检测效果最佳;YOLO11推理速度最快(35ms/图),YOLOv9对单株检测精度最高(precision 0.984,recall 0.873);多株检测因样本稀少且形态不规则而准确率较低,且存在类别不平衡问题。 Conclusion: MSDD为玉米出苗率统计提供了可靠的数据基础,推动了农业自动化监测的发展,有助于优化资源分配和实现实时决策,是迈向精准农业的重要一步。 Abstract: Accurate maize seedling detection is crucial for precision agriculture, yet curated datasets remain scarce. We introduce MSDD, a high-quality aerial image dataset for maize seedling stand counting, with applications in early-season crop monitoring, yield prediction, and in-field management. Stand counting determines how many plants germinated, guiding timely decisions such as replanting or adjusting inputs. Traditional methods are labor-intensive and error-prone, while computer vision enables efficient, accurate detection. MSDD contains three classes-single, double, and triple plants-capturing diverse growth stages, planting setups, soil types, lighting conditions, camera angles, and densities, ensuring robustness for real-world use. Benchmarking shows detection is most reliable during V4-V6 stages and under nadir views. Among tested models, YOLO11 is fastest, while YOLOv9 yields the highest accuracy for single plants. Single plant detection achieves precision up to 0.984 and recall up to 0.873, but detecting doubles and triples remains difficult due to rarity and irregular appearance, often from planting errors. Class imbalance further reduces accuracy in multi-plant detection. Despite these challenges, YOLO11 maintains efficient inference at 35 ms per image, with an additional 120 ms for saving outputs. MSDD establishes a strong foundation for developing models that enhance stand counting, optimize resource allocation, and support real-time decision-making. This dataset marks a step toward automating agricultural monitoring and advancing precision agriculture.[136] Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation
Xiaoyu Yue,Zidong Wang,Yuqing Wang,Wenlong Zhang,Xihui Liu,Wanli Ouyang,Lei Bai,Luping Zhou
Main category: cs.CV
TL;DR: 本研究首次系统探讨了将下一个词预测范式应用于视觉领域的机制,提出了自指导训练框架(ST-AR),显著提升了自回归模型的图像理解与生成质量。
Details
Motivation: 自回归模型在图像理解方面存在局限性,难以学习高层视觉语义,需要改进训练机制以提升其表现。 Method: 通过引入自监督目标设计了一种新的训练框架ST-AR,解决了局部和条件依赖、步间语义不一致和空间不变性缺失三个关键问题。 Result: ST-AR在不依赖预训练表示模型的情况下,使LlamaGen-L和LlamaGen-XL的FID分别提升了约42%和49%,生成质量显著提高。 Conclusion: ST-AR有效增强了自回归模型的视觉语义学习能力,为图像生成与理解提供了新的训练范式。 Abstract: Recent studies have demonstrated the importance of high-quality visual representations in image generation and have highlighted the limitations of generative models in image understanding. As a generative paradigm originally designed for natural language, autoregressive models face similar challenges. In this work, we present the first systematic investigation into the mechanisms of applying the next-token prediction paradigm to the visual domain. We identify three key properties that hinder the learning of high-level visual semantics: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency. We show that these issues can be effectively addressed by introducing self-supervised objectives during training, leading to a novel training framework, Self-guided Training for AutoRegressive models (ST-AR). Without relying on pre-trained representation models, ST-AR significantly enhances the image understanding ability of autoregressive models and leads to improved generation quality. Specifically, ST-AR brings approximately 42% FID improvement for LlamaGen-L and 49% FID improvement for LlamaGen-XL, while maintaining the same sampling strategy.[137] Geometric Image Synchronization with Deep Watermarking
Pierre Fernandez,Tomáš Souček,Nikola Jovanović,Hady Elsahar,Sylvestre-Alvise Rebuffi,Valeriu Lacatusu,Tuan Tran,Alexandre Mourachko
Main category: cs.CV
TL;DR: SyncSeal是一种专有的水印方法,用于增强现有水印技术对几何变换的鲁棒性,通过嵌入器和提取器网络实现图像同步。
Details
Motivation: 现有的水印方法在面对几何变换(如裁剪、旋转)时容易失效,因此需要一种能够有效应对这些变换的同步机制。 Method: 提出SyncSeal,利用嵌入器网络对图像进行不可察觉的修改,并使用提取器网络预测图像所经历的几何变换参数,两个网络联合端到端训练,并结合判别器以保持良好的视觉质量。 Result: 实验验证了该方法在多种几何和值变换下的有效性,能够准确同步图像,并显著提升现有水印方法对几何变换的抗性。 Conclusion: SyncSeal为图像水印提供了一种有效的同步解决方案,可广泛应用于增强现有水印技术的鲁棒性。 Abstract: Synchronization is the task of estimating and inverting geometric transformations (e.g., crop, rotation) applied to an image. This work introduces SyncSeal, a bespoke watermarking method for robust image synchronization, which can be applied on top of existing watermarking methods to enhance their robustness against geometric transformations. It relies on an embedder network that imperceptibly alters images and an extractor network that predicts the geometric transformation to which the image was subjected. Both networks are end-to-end trained to minimize the error between the predicted and ground-truth parameters of the transformation, combined with a discriminator to maintain high perceptual quality. We experimentally validate our method on a wide variety of geometric and valuemetric transformations, demonstrating its effectiveness in accurately synchronizing images. We further show that our synchronization can effectively upgrade existing watermarking methods to withstand geometric transformations to which they were previously vulnerable.[138] RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
Yuming Jiang,Siteng Huang,Shengke Xue,Yaxi Zhao,Jun Cen,Sicong Leng,Kehan Li,Jiayan Guo,Kexiang Wang,Mingxiu Chen,Fan Wang,Deli Zhao,Xin Li
Main category: cs.CV
TL;DR: 本文提出了RynnVLA-001,一种基于大规模人类示范视频生成预训练的视觉-语言-动作(VLA)模型,采用两阶段预训练方法,在下游机器人任务中优于现有最先进模型。
Details
Motivation: 为了提升视觉-语言-动作模型在机器人控制中的表现,需要更有效的预训练策略来联合建模视觉、语言与动作之间的关系。 Method: 提出两阶段预训练方法:第一阶段为以自我为中心的视频生成预训练,使用1200万条操作视频训练图像到视频模型;第二阶段为人本轨迹感知建模,联合预测未来关键点轨迹,并引入ActionVAE将动作序列压缩为紧凑的潜在嵌入以简化输出空间。 Result: 在相同下游机器人数据集上微调后,RynnVLA-001性能优于当前最先进的基线模型。 Conclusion: 所提出的两阶段预训练策略和ActionVAE有效提升了VLA模型的初始化质量,显著增强了其在机器人任务中的表现。 Abstract: This paper presents RynnVLA-001, a vision-language-action(VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby effectively bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When finetuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.[139] Out-of-Sight Trajectories: Tracking, Fusion, and Prediction
Haichao Zhang,Yi Xu,Yun Fu
Main category: cs.CV
TL;DR: 本文提出了Out-of-Sight Trajectory (OST)任务,利用增强的视觉-定位去噪模块,在无监督情况下实现对不可见目标的噪声传感器数据进行去噪和轨迹预测,显著优于现有方法。
Details
Motivation: 现有轨迹预测方法依赖完整且无噪声的观测数据,难以应对实际中视野外目标和传感器噪声的问题,限制了在自动驾驶等场景中的可靠性与安全性。 Method: 提出增强的Vision-Positioning Denoising Module,结合相机标定建立视觉-定位映射,在无监督下实现噪声数据去噪,并扩展OOSTraj任务至行人和车辆轨迹预测。 Result: 在Vi-Fi和JRDB数据集上达到SOTA性能,优于传统方法(如卡尔曼滤波)和最新轨迹预测模型,建立了综合基准。 Conclusion: 本工作首次将视觉-定位投影用于不可见智能体的轨迹去噪与预测,推动了复杂环境下可靠轨迹预测的发展。 Abstract: Trajectory prediction is a critical task in computer vision and autonomous systems, playing a key role in autonomous driving, robotics, surveillance, and virtual reality. Existing methods often rely on complete and noise-free observational data, overlooking the challenges associated with out-of-sight objects and the inherent noise in sensor data caused by limited camera coverage, obstructions, and the absence of ground truth for denoised trajectories. These limitations pose safety risks and hinder reliable prediction in real-world scenarios. In this extended work, we present advancements in Out-of-Sight Trajectory (OST), a novel task that predicts the noise-free visual trajectories of out-of-sight objects using noisy sensor data. Building on our previous research, we broaden the scope of Out-of-Sight Trajectory Prediction (OOSTraj) to include pedestrians and vehicles, extending its applicability to autonomous driving, robotics, surveillance, and virtual reality. Our enhanced Vision-Positioning Denoising Module leverages camera calibration to establish a vision-positioning mapping, addressing the lack of visual references, while effectively denoising noisy sensor data in an unsupervised manner. Through extensive evaluations on the Vi-Fi and JRDB datasets, our approach achieves state-of-the-art performance in both trajectory denoising and prediction, significantly surpassing previous baselines. Additionally, we introduce comparisons with traditional denoising methods, such as Kalman filtering, and adapt recent trajectory prediction models to our task, providing a comprehensive benchmark. This work represents the first initiative to integrate vision-positioning projection for denoising noisy sensor trajectories of out-of-sight agents, paving the way for future advances. The code and preprocessed datasets are available at github.com/Hai-chao-Zhang/OST[140] Lightweight and Accurate Multi-View Stereo with Confidence-Aware Diffusion Model
Fangjinhua Wang,Qingshan Xu,Yew-Soon Ong,Marc Pollefeys
Main category: cs.CV
TL;DR: 本文提出了一种新的多视图立体(MVS)框架,将扩散模型引入深度估计任务,通过条件扩散过程进行深度图优化,并设计了高效的网络结构和基于置信度的采样策略。基于该框架提出了DiffMVS和CasDiffMVS方法,在效率和精度上均达到先进水平。
Details
Motivation: 现有的学习型MVS方法在效率与精度之间难以平衡,且缺乏有效的深度优化机制。受扩散模型在生成任务中成功应用的启发,探索其在判别式任务如深度估计中的潜力。 Method: 将深度图优化建模为条件扩散过程,设计条件编码器引导扩散;采用轻量级2D U-Net与卷积GRU结合的扩散网络提升效率;提出基于扩散模型输出置信度的自适应采样策略。在此基础上构建单阶段DiffMVS和级联CasDiffMVS两种方法。 Result: DiffMVS在运行时间和GPU内存使用上具有竞争力,而CasDiffMVS在DTU、Tanks & Temples和ETH3D数据集上实现了最先进的性能。 Conclusion: 扩散模型可有效应用于MVS任务中的深度图优化,所提出的框架在效率与精度方面均优于现有方法,验证了扩散模型在三维重建判别任务中的潜力。 Abstract: To reconstruct the 3D geometry from calibrated images, learning-based multi-view stereo (MVS) methods typically perform multi-view depth estimation and then fuse depth maps into a mesh or point cloud. To improve the computational efficiency, many methods initialize a coarse depth map and then gradually refine it in higher resolutions. Recently, diffusion models achieve great success in generation tasks. Starting from a random noise, diffusion models gradually recover the sample with an iterative denoising process. In this paper, we propose a novel MVS framework, which introduces diffusion models in MVS. Specifically, we formulate depth refinement as a conditional diffusion process. Considering the discriminative characteristic of depth estimation, we design a condition encoder to guide the diffusion process. To improve efficiency, we propose a novel diffusion network combining lightweight 2D U-Net and convolutional GRU. Moreover, we propose a novel confidence-based sampling strategy to adaptively sample depth hypotheses based on the confidence estimated by diffusion model. Based on our novel MVS framework, we propose two novel MVS methods, DiffMVS and CasDiffMVS. DiffMVS achieves competitive performance with state-of-the-art efficiency in run-time and GPU memory. CasDiffMVS achieves state-of-the-art performance on DTU, Tanks & Temples and ETH3D. Code is available at: https://github.com/cvg/diffmvs.[141] ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
Zhaoyang Liu,JingJing Xie,Zichen Ding,Zehao Li,Bowen Yang,Zhenyu Wu,Xuehui Wang,Qiushi Sun,Shi Liu,Weiyun Wang,Shenglong Ye,Qingyun Li,Zeyue Tian,Gen Luo,Xiangyu Yue,Biqing Qi,Kai Chen,Bowen Zhou,Yu Qiao,Qifeng Chen,Wenhai Wang
Main category: cs.CV
TL;DR: 本文提出了ScaleCUA,一个通过大规模开源数据集和基础模型推动计算机使用代理(CUA)发展的框架。该数据集覆盖6种操作系统和3个任务领域,结合自动化代理与人类专家构建。基于此数据训练的模型在多个基准上显著超越现有方法,展示了数据驱动扩展在通用CUA中的潜力。
Details
Motivation: 现有的视觉语言模型虽然在GUI操作方面展现出潜力,但受限于缺乏大规模、开源的计算机使用数据和基础模型,阻碍了进一步发展。 Method: 提出了一种闭环管道,结合自动化代理和人类专家,构建了涵盖多种操作系统和任务领域的大规模计算机使用数据集,并在此基础上训练跨平台兼容的CUA模型。 Result: ScaleCUA在多个基准测试中显著优于基线模型,如WebArena-Lite-v2上提升+26.6,在ScreenSpot-Pro上提升+10.7,并在MMBench-GUI L1-Hard、OSWorld-G和WebArena-Lite-v2上取得新的SOTA结果(分别为94.4%、60.6%、47.4%)。 Conclusion: 研究表明,通过大规模数据驱动的方法可以有效提升通用计算机使用代理的性能,且作者将公开数据、模型和代码以促进后续研究。 Abstract: Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research: https://github.com/OpenGVLab/ScaleCUA.[142] Depth AnyEvent: A Cross-Modal Distillation Paradigm for Event-Based Monocular Depth Estimation
Luca Bartolomei,Enrico Mannocci,Fabio Tosi,Matteo Poggi,Stefano Mattoccia
Main category: cs.CV
TL;DR: 提出一种跨模态蒸馏范式,利用视觉基础模型(VFM)生成密集代理标签,用于事件相机的单目深度估计,无需昂贵的真实深度标注,在合成和真实数据集上均达到先进性能。
Details
Motivation: 缺乏带有密集真实深度标注的大规模数据集,限制了基于学习的事件数据单目深度估计的发展。 Method: 提出跨模态蒸馏范式,利用与RGB帧空间对齐的事件流和视觉基础模型(如Depth Anything v2或其改进的循环架构)生成密集代理标签,并适应单目事件相机的深度推断。 Result: 在合成和真实世界数据集上的实验表明,该方法在无需真实深度标注的情况下性能与全监督方法相当,并且基于VFM的模型达到了最先进的性能。 Conclusion: 所提出的跨模态蒸馏方法有效解决了事件相机深度估计中缺乏标注数据的问题,通过利用视觉基础模型实现了高性能的深度预测。 Abstract: Event cameras capture sparse, high-temporal-resolution visual information, making them particularly suitable for challenging environments with high-speed motion and strongly varying lighting conditions. However, the lack of large datasets with dense ground-truth depth annotations hinders learning-based monocular depth estimation from event data. To address this limitation, we propose a cross-modal distillation paradigm to generate dense proxy labels leveraging a Vision Foundation Model (VFM). Our strategy requires an event stream spatially aligned with RGB frames, a simple setup even available off-the-shelf, and exploits the robustness of large-scale VFMs. Additionally, we propose to adapt VFMs, either a vanilla one like Depth Anything v2 (DAv2), or deriving from it a novel recurrent architecture to infer depth from monocular event cameras. We evaluate our approach with synthetic and real-world datasets, demonstrating that i) our cross-modal paradigm achieves competitive performance compared to fully supervised methods without requiring expensive depth annotations, and ii) our VFM-based models achieve state-of-the-art performance.[143] Lost in Translation? Vocabulary Alignment for Source-Free Domain Adaptation in Open-Vocabulary Semantic Segmentation
Silvio Mazzucco,Carl Persson,Mattia Segu,Pier Luigi Dovesi,Federico Tombari,Luc Van Gool,Matteo Poggi
Main category: cs.CV
TL;DR: 本文提出了VocAlign,一种专为开放词汇语义分割中的视觉语言模型(VLM)设计的无源域自适应框架。
Details
Motivation: 为了在无需源域数据的情况下提升视觉语言模型在开放词汇语义分割任务中的域自适应性能。 Method: 采用学生-教师范式,结合词汇对齐策略,并利用低秩适应(LoRA)进行高效微调,同时提出Top-K类别选择机制以减少内存消耗。 Result: 在CityScapes数据集上实现了6.11 mIoU的显著提升,并在零样本分割基准上表现出优越性能。 Conclusion: VocAlign在开放词汇设置下的无源域自适应中树立了新标准。 Abstract: We introduce VocAlign, a novel source-free domain adaptation framework specifically designed for VLMs in open-vocabulary semantic segmentation. Our method adopts a student-teacher paradigm enhanced with a vocabulary alignment strategy, which improves pseudo-label generation by incorporating additional class concepts. To ensure efficiency, we use Low-Rank Adaptation (LoRA) to fine-tune the model, preserving its original capabilities while minimizing computational overhead. In addition, we propose a Top-K class selection mechanism for the student model, which significantly reduces memory requirements while further improving adaptation performance. Our approach achieves a notable 6.11 mIoU improvement on the CityScapes dataset and demonstrates superior performance on zero-shot segmentation benchmarks, setting a new standard for source-free adaptation in the open-vocabulary setting.[144] Calibration-Aware Prompt Learning for Medical Vision-Language Models
Abhishek Basu,Fahad Shamshad,Ashshak Sharifdeen,Karthik Nandakumar,Muhammad Haris Khan
Main category: cs.CV
TL;DR: 提出CalibPrompt框架,通过提示调优实现医学视觉-语言模型的校准,在少量标注数据下有效提升预测置信度的可靠性。